Posts by Collection

publications

Generative-Discriminative Mean Field Distribution Approximation in Multi-agent Reinforcement Learning

Published in Coordination and Cooperation in Multi-Agent Reinforcement Learning Workshop - RL Conference, 2024

Non-cooperative and cooperative games remain generally intractable as the number of players grows large. Introduced by Lasry and Lions (2007) and Huang et al. (2006), Mean Field Games (MFGs) rely on a mean-field approximation to allow the number of players to grow to infinity. In mean-field reinforcement learning, when the state space is finite but very large, storing the population distribution in tabular form for every state and computing its evolution exactly is prohibitive in both memory and computation time. In continuous spaces, representing and updating the distribution is even more challenging, even when it is needed only to implement the RL environment and not as an input to the policies. In this case, one needs to rely on approximations. This research proposes a model-based reinforcement learning algorithm, GD-MFRL, that efficiently represents the distribution using function approximation in a two-part generative and discriminative setting: (i) one part learns to generate distributions by trial and error, and (ii) the other part evaluates these distributions. Defining such a framework requires answering several challenging research questions, including: how should transfer quality be evaluated in a multi-agent scenario?
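
To make the tabular cost concrete, here is a minimal sketch of one exact update of the population distribution in a small finite state space. It is not the GD-MFRL algorithm; it only illustrates the baseline the abstract calls prohibitive. The names `evolve_distribution`, `mu`, `policy`, and `transition` are hypothetical, and the sketch assumes, for simplicity, a transition kernel that does not itself depend on the distribution.

```python
import numpy as np

def evolve_distribution(mu, policy, transition):
    """One exact forward step of the population distribution.

    mu:         shape (S,),      mu[s] = fraction of agents in state s
    policy:     shape (S, A),    policy[s, a] = probability of action a in state s
    transition: shape (S, A, S), transition[s, a, t] = P(t | s, a)
    (assumption: the kernel does not depend on mu)
    """
    # Mass on each (state, action) pair, then pushed through the kernel.
    state_action_mass = mu[:, None] * policy
    return np.einsum("sa,sat->t", state_action_mass, transition)

rng = np.random.default_rng(0)
S, A = 4, 2
mu0 = np.full(S, 1.0 / S)                              # uniform initial population
policy = rng.dirichlet(np.ones(A), size=S)             # random policy, rows sum to 1
transition = rng.dirichlet(np.ones(S), size=(S, A))    # random kernel, last axis sums to 1
mu1 = evolve_distribution(mu0, policy, transition)
```

The kernel alone holds S * A * S entries and each step touches all of them, which is why the exact update stops scaling once S is large.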

Recommended citation: Shabbar, Aya. (2024). Generative-Discriminative Mean Field Distribution Approximation in Multi-agent Reinforcement Learning
Download Paper

Generative Adversarial Skill Estimation in Opponent Modeling

Published in Training Agents with Foundation Models Workshop - RL Conference, 2024

In opponent modeling, data about failed actions and models of opponent capabilities can be mined to improve estimates of the strategy gradient and of the reliability and stochasticity of certain actions for a given opponent. We want our opponent modeling systems to reason not only about what the opponent plans to do, but also about their probability of success should they choose a given action. The problem of skill estimation (Archibald and Nieves-Rivera, 2018) is closely related, and Bayesian techniques have been proposed for simulated, real-valued games including darts and billiards (Archibald and Nieves-Rivera, 2019). In this paper, I propose a novel method called generative adversarial skill estimation (GASE), which encourages estimating the probability of success in RL opponent modeling by introducing an intrinsic reward output by a foundation-model generative adversarial network. The generator provides fake samples of the opponent’s actions, which help the discriminator identify failed actions along with their probability of success. The agent can thus identify failed actions that the discriminator is less confident in judging as successful. This work is mainly motivated by the question: how can FMs be used for skill discovery?

Recommended citation: Shabbar, Aya. (2024). Generative Adversarial Skill Estimation in Opponent Modeling
Download Paper

Idea: Online Opponent Modeling with Foundation Models

Published in Aligning Reinforcement Learning Experimentalists and Theorists Workshop, Ideas Track - ICML Conference, 2024

Opponent modeling (OM) is the ability to use prior knowledge and observations to predict the behavior of an opponent. Meanwhile, there has been extensive research at the intersection of foundation models (FMs) and decision-making, which holds great promise for creating powerful new systems that can interact effectively across a diverse range of applications. This paper examines the integration of foundation models with opponent modeling and tackles open problems in FMs for decision-making by (i) leveraging and collecting decision-making datasets for deep RL (DRL), specifically datasets for opponent modeling systems, where large-scale human demonstration is hard to scale, and (ii) proposing a new framework for opponent modeling that uses FMs as a guiding tool to enhance the agent's prediction capabilities. The goal is to train a policy in a given environment without reward signals. I propose using foundation models (FMs), i.e., large language models (LLMs) and vision-language models (VLMs), to achieve this goal. The LLM generates instructions that help the agent learn features of the opponent’s behavior and ultimately enable the agent to exploit the opponent’s strategy in the current environment d(s0). In contrast, the VLM serves to guide policy learning. The internet-scale knowledge capacity of recent FMs enables automating otherwise impractical human effort in the RL framework [1]. Existing works query pre-trained LLMs for tasks to learn [2], language-level plans [3], and language labels [4], or use pre-trained VLMs to obtain visual feedback [5]. ELLM [6] uses LLMs to propose new tasks for agents to learn. A line of work [7] focuses specifically on using FMs in the Minecraft domain, but none of these works integrates pre-trained LLMs and VLMs for opponent modeling. Inspired by [8], this work is mainly motivated by two questions: How can datasets for decision-making in DRL, i.e., for FMs and OM, be leveraged and constructed? And can we teach RL agents to predict opponents’ actions and strategies accurately in opponent modeling environments without human supervision?

Recommended citation: Shabbar, Aya. (2024). Idea: Online Opponent Modeling with Foundation Models
Download Paper

Model-Agnostic Meta-Learning with Open-Ended Reinforcement Learning

Published in Intrinsically Motivated Open-ended Learning Workshop - NeurIPS Conference, 2024

This paper reports in-progress research that builds on Open-Ended Reinforcement Learning with Neural Reward Functions, proposed by Meier and Mujika [1], which uses reward functions encoded by neural networks. One key limitation of that work is the need for re-learning for each new skill the agent learns. To address this, we propose integrating meta-learning algorithms. We therefore study the use of Model-Agnostic Meta-Learning (MAML), which we believe could make policy learning more efficient. MAML learns an initialization of the model parameters that can be fine-tuned with a small number of examples from a new task, allowing rapid adaptation to new tasks.
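
The MAML idea described above can be sketched on a toy problem. The example below is not the paper's RL setup: it swaps in one-parameter linear regression (y = slope * x) so that the gradient through the inner fine-tuning step can be written out by hand. All names (`maml_meta_gradient`, `sample_task`, `inner_lr`, the slope range) are hypothetical choices made for illustration.

```python
import numpy as np

def loss(w, x, y):
    """Mean squared error of the one-parameter model y_hat = w * x."""
    return np.mean((w * x - y) ** 2)

def grad(w, x, y):
    """Derivative of the loss with respect to w."""
    return 2.0 * np.mean(x * (w * x - y))

def hessian(x):
    """Second derivative of the loss with respect to w (independent of w)."""
    return 2.0 * np.mean(x ** 2)

def meta_loss(w, tasks, inner_lr):
    """Average query loss after one inner gradient step on each task's support set."""
    return np.mean([loss(w - inner_lr * grad(w, xs, ys), xq, yq)
                    for xs, ys, xq, yq in tasks])

def maml_meta_gradient(w, tasks, inner_lr):
    """Exact gradient of meta_loss with respect to the initialization w."""
    total = 0.0
    for xs, ys, xq, yq in tasks:
        w_adapted = w - inner_lr * grad(w, xs, ys)
        # Chain rule through the inner step: d(w_adapted)/dw = 1 - inner_lr * hessian.
        total += grad(w_adapted, xq, yq) * (1.0 - inner_lr * hessian(xs))
    return total / len(tasks)

def sample_task(rng, slope, n=20):
    """Support and query sets for the task y = slope * x."""
    xs, xq = rng.uniform(-1, 1, n), rng.uniform(-1, 1, n)
    return xs, slope * xs, xq, slope * xq

rng = np.random.default_rng(0)
w = 0.0  # meta-learned initialization
for _ in range(200):
    tasks = [sample_task(rng, s) for s in rng.uniform(1.0, 3.0, size=4)]
    w -= 0.1 * maml_meta_gradient(w, tasks, inner_lr=0.1)
```

The factor `1 - inner_lr * hessian(xs)` is the second-order term that distinguishes full MAML from its first-order approximation; in a neural-network policy it would come from automatic differentiation rather than a closed form.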

Recommended citation: Shabbar, Aya. (2024). Model-Agnostic Meta-Learning with Open-Ended Reinforcement Learning
Download Paper

Synthesizing Programmatic Reinforcement Learning Policies with Memory-Based Decision Trees

Published in Programmatic Reinforcement Learning Workshop - RL Conference, 2025

Programmatic reinforcement learning (PRL) represents policies as programs to achieve interpretability and generalization, where programs may involve higher-order constructs such as control loops. Despite promising outcomes, current state-of-the-art PRL methods are hindered by sample inefficiency, and very little is known on the theoretical front about programmatic RL. A pressing question is the trade-off between the size of programmatic policies and their performance. The motivation of this work is therefore to construct programmatic policies with the shortest paths to the target region, ensuring near-optimal behavior. Alongside this, we investigate how to reason with programmatic policies using two learning systems: Reward Prediction Error (RPE) and Action Prediction Error (APE). How can we learn programmatic policies that generalize better? The goal of this paper is to give first answers to these questions, initiating a theoretical study of programmatic RL. Our main contributions are constructing a near-optimal policy using memory-based decision trees and studying its generalizability and size-performance trade-off.

Recommended citation: Shabbar, Aya. (2025). Synthesizing Programmatic Reinforcement Learning Policies with Memory-Based Decision Trees
Download Paper

Hyperbolic Discounting in Hierarchical Reinforcement Learning

Published in Finding the Frame Workshop - RL Conference, 2025

Decisions often require balancing immediate gratification against long-term benefits. In Reinforcement Learning (RL), this balancing act is influenced by temporal discounting, which quantifies the devaluation of future rewards. Prior research indicates that human decision-making aligns more closely with hyperbolic discounting than with the conventional exponential discounting used in RL. As artificial agents become more advanced and pervasive, particularly in multi-agent settings alongside humans, the need for appropriate discounting models becomes critical. Although hyperbolic discounting has been proposed for single-agent learning and for multi-agent reinforcement learning (MARL), it remains underexplored in more advanced settings such as hierarchical reinforcement learning (HRL). We introduce and formulate hyperbolic discounting in HRL, establishing theoretical and practical foundations across various frameworks, including Option-Critic and Feudal Networks methods. We evaluate hyperbolic discounting on diverse tasks, comparing it to the exponential discounting baseline. Our results show that hyperbolic discounting achieves higher returns in 50% of scenarios and performs on par with exponential discounting in 95% of tasks, with significant improvements in sparse-reward and coordination-intensive environments. This work opens new avenues for robust decision-making processes in the development of advanced RL systems.
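
The difference between the two discounting schemes can be illustrated with a short sketch. It uses the standard forms, gamma^t for exponential and 1 / (1 + k t) for hyperbolic; the paper's HRL formulation is not given here, so the function names and the values of `gamma` and `k` are assumptions chosen only for illustration.

```python
import numpy as np

def exponential_discount(t, gamma):
    """Conventional RL discount weight gamma**t."""
    return gamma ** t

def hyperbolic_discount(t, k):
    """Hyperbolic discount weight 1 / (1 + k * t)."""
    return 1.0 / (1.0 + k * t)

def discounted_return(rewards, weights):
    """Sum of rewards weighted by per-step discount factors."""
    return float(np.dot(rewards, weights))

# A sparse-reward episode: a single reward of 1 at the final step.
t = np.arange(50)
rewards = np.zeros(50)
rewards[-1] = 1.0

g_exp = discounted_return(rewards, exponential_discount(t, gamma=0.9))
g_hyp = discounted_return(rewards, hyperbolic_discount(t, k=0.1))
```

With these assumed parameters the hyperbolic weight decays much more slowly at long horizons, so the delayed reward retains more value under hyperbolic discounting than under exponential discounting.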

Recommended citation: Shabbar, Aya. (2025). Hyperbolic Discounting in Hierarchical Reinforcement Learning
Download Paper

Reward-Free Deep-Learning-Based Reinforcement Learning

Published in European Workshop on Reinforcement Learning, 2025

Exploration is widely regarded as one of the most challenging aspects of reinforcement learning (RL). We consider the reward-free RL problem, which operates in two phases: an exploration phase, where the agent gathers trajectories over episodes irrespective of any predetermined reward function, and a subsequent planning phase, where a reward function is introduced. The agent then uses the episodes from the exploration phase to compute a near-optimal policy. Existing algorithms and sample-complexity results for reward-free RL are limited to tabular, linear, or very smooth function approximations, leaving the problem largely open for more general cases. We consider deep-learning-based function approximations, i.e., DQNs, and propose an algorithm based on internal feedback and the agent’s own confidence and self-certainty in a graph MDP.

Recommended citation: Shabbar, Aya. (2024). Reward-Free Deep-Learning-Based Reinforcement Learning
Download Paper

teaching

C++: Introduction to Programming, Data Structure, and Algorithms

Undergraduate course, Tishreen University, Mechatronics Engineering Department, 2023

MATLAB: Programming Applications

Undergraduate course, Tishreen University, Mechatronics Engineering Department, 2023

Aya Shabbar