Risk Sensitivity in Markov Games and Multi-Agent Reinforcement Learning
This systematic review examines risk sensitivity in Markov games (MGs) and multi-agent reinforcement learning (MARL). It identifies and mathematically describes the risk measures used in these domains, categorizes them, and discusses the related articles in detail. The review covers explicit risk measures such as exponential reward, coherent risk, and cumulative prospect theory, as well as implicit measures such as variance, conditional value-at-risk, and chance constraints.
Introduction to Markov Games and Risk-Sensitive Multi-Agent Reinforcement Learning
Markov games (MGs), also known as stochastic games, provide the foundational theoretical framework for studying multi-agent systems (MAS). This game-theoretic model, initially introduced by Shapley and later formalized for multi-agent reinforcement learning (MARL) by Littman, traditionally assumes that agents pursue a risk-neutral objective focused on maximizing expected returns. Risk-sensitive formulations of this framework are motivated by application areas such as:
  • Financial Markets and Economics
  • Autonomous Driving and Traffic Systems
  • Robotics and Multi-Robot Systems
  • Social and Behavioral Simulations
  • Defense and Cybersecurity
  • Natural Language Processing and Dialogue Systems
Introduction to Markov Games
A Markov game is a multi-agent generalization of an MDP in which multiple agents interact in a shared environment. Each agent makes decisions to maximize its own cumulative reward, and the agents' joint actions determine both the state transitions and each agent's reward. Its main components are listed below, followed by a formal definition.
  • State: Represents the environment’s status.
  • Action: Each agent’s choice that affects future states.
  • Transition Function: Probability of reaching new states based on agents' actions.
  • Reward: Unique to each agent, fostering competition or cooperation.
  • Policy: Guides an agent’s actions to maximize cumulative reward.
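For reference, a Markov game can be written formally as follows; this is a standard textbook formulation rather than notation taken from any specific paper in the review:

```latex
% An N-player Markov game (stochastic game) as a tuple
\[
\mathcal{G} = \bigl(\mathcal{N},\; \mathcal{S},\; \{\mathcal{A}^i\}_{i\in\mathcal{N}},\; P,\; \{r^i\}_{i\in\mathcal{N}},\; \gamma\bigr),
\]
where $\mathcal{N}=\{1,\dots,N\}$ is the set of agents, $\mathcal{S}$ the state space,
$\mathcal{A}^i$ agent $i$'s action space, $P(s' \mid s, a^1,\dots,a^N)$ the transition
kernel driven by the joint action, $r^i(s, a^1,\dots,a^N)$ agent $i$'s reward, and
$\gamma \in [0,1)$ the discount factor. In the risk-neutral case each agent $i$ chooses a
policy $\pi^i(a^i \mid s)$ to maximize its expected discounted return
$\mathbb{E}\bigl[\sum_{t=0}^{\infty} \gamma^{t} r^i_t\bigr]$.
```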
Multi-Agent Reinforcement Learning
MARL allows agents to adapt and improve their strategies over time within a Markov game environment. Agents use reinforcement learning to optimize their policies based on the outcomes of past actions; a minimal tabular sketch of this learning loop follows the list below.
  • Exploration and Exploitation: Agents balance trying new actions with using known actions for rewards.
  • Learning from Rewards: Rewards guide strategy adjustments, refining policies.
  • Agent Interdependence: Each agent’s choices affect others, making the environment more complex and dynamic than single-agent settings.
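To make the exploration-exploitation balance and reward-driven updates above concrete, here is a minimal, purely illustrative sketch of two agents running independent Q-learning in a small tabular Markov game. The environment (transition probabilities and rewards) is a random placeholder, and independent Q-learning is used only as the simplest representative MARL scheme, not as a method drawn from the review:

```python
import numpy as np

# Two agents running independent Q-learning in a small tabular Markov game.
# All environment quantities below are random placeholders for illustration.
rng = np.random.default_rng(0)
n_states, n_actions, n_agents = 4, 2, 2

# Placeholder transition distribution P[s, a1, a2] over next states
P = rng.dirichlet(np.ones(n_states), size=(n_states, n_actions, n_actions))
# Placeholder per-agent rewards R[i, s, a1, a2]
R = rng.normal(size=(n_agents, n_states, n_actions, n_actions))

Q = np.zeros((n_agents, n_states, n_actions))  # one independent Q-table per agent
alpha, gamma, eps = 0.1, 0.95, 0.1

s = 0
for step in range(20_000):
    # Each agent balances exploration and exploitation (epsilon-greedy).
    acts = [
        rng.integers(n_actions) if rng.random() < eps else int(np.argmax(Q[i, s]))
        for i in range(n_agents)
    ]
    s_next = rng.choice(n_states, p=P[s, acts[0], acts[1]])
    for i in range(n_agents):
        r = R[i, s, acts[0], acts[1]]
        # Risk-neutral TD update: each agent treats the other agents as part
        # of a (non-stationary) environment, which is what makes MARL harder
        # than the single-agent setting.
        td_target = r + gamma * Q[i, s_next].max()
        Q[i, s, acts[i]] += alpha * (td_target - Q[i, s, acts[i]])
    s = s_next

print("Greedy joint action per state:",
      [tuple(np.argmax(Q[:, s_], axis=1)) for s_ in range(n_states)])
```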
Explicit Risk Measures: Exponential Reward
Exponential reward is the most frequently used risk measure in risk-sensitive MGs and MARL. It can be applied in both average-reward and discounted-reward settings. In the average-reward case, studies have examined zero-sum and nonzero-sum games with denumerable state spaces and Borel action spaces. For discounted rewards, research has covered competitive zero-sum games, non-zero-sum resource extraction games, and coordination-based MARL.
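Concretely, the exponential (entropic) risk objective is usually written as below; normalization and sign conventions vary across the surveyed papers, so this is one standard form:

```latex
% Exponential (entropic) risk-sensitive objective for agent i with
% risk parameter \beta \neq 0 and return G^i = \sum_t \gamma^t r^i_t
\[
J^i_\beta(\pi) \;=\; \frac{1}{\beta} \log \mathbb{E}_\pi\!\left[ \exp\!\bigl(\beta\, G^i\bigr) \right]
\;=\; \mathbb{E}_\pi\!\left[G^i\right] \;+\; \frac{\beta}{2}\,\mathrm{Var}_\pi\!\left[G^i\right] \;+\; O(\beta^2),
\]
so $\beta < 0$ yields risk-averse behavior (penalizing variance), $\beta > 0$ yields
risk-seeking behavior, and $\beta \to 0$ recovers the risk-neutral expected return.
```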

  1. Average-Reward Setting: Studies on zero-sum and nonzero-sum games with denumerable state spaces and Borel action spaces.
  2. Discounted-Reward Setting: Research on competitive zero-sum games, non-zero-sum resource extraction games, and coordination-based MARL.
  3. Continuous-Time Games: Analysis of games with continuous state and action spaces, including linear-quadratic stochastic dynamic games.
Explicit Risk Measures: Coherent Risk and CPT
Explicit Risk Measures: Coherent Risk and CPT
Coherent risk measures are characterized by four key mathematical properties: monotonicity, sub-additivity, positive homogeneity, and translation invariance (stated formally below). These measures have been explored in the context of discrete-time mean-field games and non-cooperative games involving multiple players. Cumulative Prospect Theory (CPT), by contrast, provides a framework for understanding human decision-making under uncertainty, accounting for risk attitudes such as loss aversion and the distortion of small probabilities. Research has applied CPT in bounded risk-sensitive Markov games, cyber system interactions, and cooperative partially observable Markov games.
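Stated for cost (loss) random variables X and Y, one common sign convention for the four axioms of a coherent risk measure ρ is:

```latex
% Coherent risk measure \rho on cost (loss) random variables
% (some papers state these for rewards, flipping the signs)
\begin{align*}
&\text{Monotonicity:}          && X \le Y \;\Rightarrow\; \rho(X) \le \rho(Y) \\
&\text{Sub-additivity:}        && \rho(X + Y) \le \rho(X) + \rho(Y) \\
&\text{Positive homogeneity:}  && \rho(\lambda X) = \lambda\,\rho(X), \quad \lambda \ge 0 \\
&\text{Translation invariance:}&& \rho(X + c) = \rho(X) + c, \quad c \in \mathbb{R}
\end{align*}
```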
  • Exponential Reward/Cost: Primarily used in control, finance, and operations research.
  • Coherent Risk: Applied in discrete-time mean-field games and non-cooperative multi-player games.
  • Cumulative Prospect Theory: Applied in games involving local communication between agents, capturing human risk attitudes during decision-making under uncertainty.
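For reference, the CPT value of a random outcome X is commonly written with a value function v (typically steeper for losses than for gains) and probability-weighting functions w± that overweight small probabilities; this is the standard Tversky-Kahneman form rather than notation from any specific paper in the review:

```latex
% CPT value of a random outcome X with value function v, v(0) = 0,
% and probability-weighting functions w^+ (gains) and w^- (losses)
\[
\mathrm{CPT}(X) \;=\; \int_{0}^{\infty} w^{+}\!\bigl(\mathbb{P}\bigl(v(X) > z\bigr)\bigr)\,dz
\;-\; \int_{0}^{\infty} w^{-}\!\bigl(\mathbb{P}\bigl(v(X) < -z\bigr)\bigr)\,dz .
\]
```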
Applications of Exponential Reward/Cost
  • Finance and Investment: Exponential reward models capture compounding growth in investments (like compound interest) and the rising cost of debt due to interest, aiding in long-term financial planning.
  • Resource Allocation and Energy: Exponential cost functions model the rising costs of resource depletion (like fuel) and the decay in battery output over time, supporting sustainable management practices.
  • Risk and Reliability: Exponential functions help predict system failures over time in engineering, and they assess the rising cost of risk in high-stakes investments.
  • Supply Chain and Operations: Exponential functions model inventory costs that rise over time (e.g., for perishable items) and discount future costs to their present value, enhancing decision-making over multiple periods.
Applications of Coherent Risk
  • Portfolio Management: Coherent risk measures, like Expected Shortfall, help portfolio managers assess and minimize risks, especially in extreme market scenarios.
  • Capital Allocation: Financial institutions use coherent risk measures to allocate capital effectively, ensuring they meet solvency and regulatory requirements.
  • Insurance and Reinsurance: Insurers apply coherent risk measures to price policies for extreme events, balancing risk across a diverse policy portfolio.
  • Risk-Adjusted Performance Metrics: Measures like the Sharpe Ratio and RAROC can be refined using coherent risk measures, giving more reliable assessments under stress testing and high-risk conditions.
Implicit Risk Measures: Variance
Variance as a risk measure has been studied in the discounted-reward setting. Two notable studies have incorporated variance of return (VOR) as a constraint in policy optimization. One study proposed a multi-timescale actor-critic algorithm called RC-MADDPG for learning risk-constrained policies in mixed cooperative/competitive scenarios. Another study constructed a risk-sensitive cooperative framework incorporating mean-variance preferences, deriving the core of the cooperative Markov game using a max-min optimization approach.
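A generic variance-of-return (VOR) formulation for agent i with return G^i is sketched below; the surveyed works differ in whether variance enters as a hard constraint or as a mean-variance trade-off, so both variants are shown:

```latex
% Variance-constrained policy optimization with risk tolerance b
\[
\max_{\pi^i}\;\; \mathbb{E}_\pi\bigl[G^i\bigr]
\quad \text{s.t.} \quad
\mathrm{Var}_\pi\bigl[G^i\bigr] \;\le\; b,
\]
% or the mean--variance trade-off with weight \lambda \ge 0
\[
\max_{\pi^i}\;\; \mathbb{E}_\pi\bigl[G^i\bigr] \;-\; \lambda\, \mathrm{Var}_\pi\bigl[G^i\bigr].
\]
```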
  1. VOR Constraint: Incorporated into policy optimization for risk-sensitive learning.
  2. RC-MADDPG Algorithm: Multi-timescale actor-critic approach for risk-constrained policies.
  3. Mean-Variance Framework: Cooperative game structure using max-min optimization.
Implicit Risk Measures: Conditional Value-at-Risk
Conditional Value-at-Risk (CVaR) is the second most frequently used risk measure in MG and MARL. It has been applied in various domains, including energy bidding, community photovoltaic systems, plug-in electric vehicle charging strategies, and multi-energy microgrids. Studies have proposed iterative algorithms to find optimal risk-sensitive bids, developed CVaR-sensitive energy sharing models, and introduced two-stage Markov games to quantify overbidding risk in energy markets.
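For a loss random variable L and confidence level α ∈ (0, 1), CVaR is commonly defined through the Rockafellar-Uryasev formulation shown below; the specific loss (e.g., bidding shortfall or energy cost) depends on the application:

```latex
\[
\mathrm{VaR}_{\alpha}(L) \;=\; \inf\{\, z \in \mathbb{R} : \mathbb{P}(L \le z) \ge \alpha \,\},
\qquad
\mathrm{CVaR}_{\alpha}(L) \;=\; \min_{z \in \mathbb{R}} \left\{ z + \tfrac{1}{1-\alpha}\, \mathbb{E}\bigl[(L - z)^{+}\bigr] \right\},
\]
so that (for continuous distributions) $\mathrm{CVaR}_{\alpha}(L)$ is the expected loss in the
worst $(1-\alpha)$ fraction of outcomes, which is the quantity the bidding and energy-sharing
models above bound or optimize.
```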
  • Energy Bidding: CVaR used in wind farm bidding strategies and community energy sharing.
  • Electric Vehicle Charging: Optimization of PEV charging strategies in smart grids using CVaR.
  • Microgrid Management: Two-stage Markov games for risk-aware energy resource trading in microgrids.
Implicit Risk Measures: Chance Constraint
Chance constraints have been applied in multi-robot planning and demand-side management in microgrids. One study presented the chance-constrained iterative linear-quadratic Markov games (CC-ILQGames) algorithm for capturing agents' interactions and safety concerns in autonomous driving scenarios. Another study proposed a generalized stochastic dynamic game for modeling demand-side management in a microgrid with shared battery resources, proving the uniqueness of the risk-sensitive generalized Nash equilibrium.
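In generic form, a chance constraint requires a safety condition to hold with high probability under the joint policy; the concrete constraint functions differ between the multi-robot and microgrid papers, so the following is only a schematic statement:

```latex
% Generic joint chance constraint with violation tolerance \delta \in (0, 1)
\[
\mathbb{P}_{\pi}\!\left( g\bigl(s_t, a^1_t, \dots, a^N_t\bigr) \le 0, \;\; \forall\, t = 0, \dots, T \right) \;\ge\; 1 - \delta .
\]
```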
  • Multi-Robot Planning: CC-ILQGames algorithm for autonomous driving scenarios.
  • Microgrid Management: Generalized stochastic dynamic game for demand-side management.
  • Nash Equilibrium: Proof of a unique risk-sensitive generalized Nash equilibrium.
Trends and Future Directions
The review reveals a shift from purely theoretical analysis of exponential risk to more application-oriented risk measures in recent years. This trend is attributed to the need for specialized risk measures in different domains and the emergence of deep reinforcement learning methods. The increasing focus on modeling risk-sensitive behavior in real-world scenarios, particularly in finance, energy trade, and autonomous driving, suggests that the use of diverse risk measures tailored to specific applications will continue to rise.

  1. Pre-2016: Limited to the exponential reward/cost risk measure, mostly theoretical analysis.
  2. 2016-2020: Introduction of diverse risk measures and the emergence of deep RL methods.
  3. 2020 onwards: Increased focus on application-specific risk measures and real-world scenarios.
Conclusion and Future Research
This systematic review provides a comprehensive analysis of risk sensitivity in Markov games and multi-agent reinforcement learning. It identifies and categorizes various risk measures, both explicit and implicit, used in the field. The review highlights the growing importance of risk-sensitive approaches in modeling real-world multi-agent systems, particularly in domains like finance, energy markets, and autonomous driving. Future research is likely to focus on developing more specialized risk measures for specific applications and integrating advanced deep learning techniques with risk-sensitive frameworks to address complex real-world scenarios.

  1. Comprehensive Analysis: Systematic review of risk measures in MG and MARL.
  2. Growing Importance: Increased focus on risk-sensitive approaches for real-world systems.
  3. Future Directions: Development of specialized risk measures and integration with deep learning techniques.