Multi-Agent Reinforcement Learning (MARL) extends single-agent RL to settings where multiple agents interact within a shared environment. Agents may cooperate, compete, or operate in mixed scenarios, requiring coordination mechanisms and specialized training techniques.
Centralized Training with Decentralized Execution (CTDE) is the dominant paradigm in cooperative MARL:
- Training: Agents have access to global information (all observations, actions, or the global state)
- Execution: Each agent acts independently using only its local observations
- Benefit: Enables better credit assignment while maintaining scalability at execution time
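In code, CTDE boils down to an asymmetry: the critic may condition on joint information during training, while each actor only ever sees its own observation at execution time. A minimal numpy sketch with random, untrained weights (all names here are illustrative, not from any specific library):

```python
import numpy as np

rng = np.random.default_rng(0)
n_agents, obs_dim, n_actions = 3, 4, 5

# Decentralized actors: one weight matrix per agent, each sees only its own obs.
actor_w = [rng.normal(size=(obs_dim, n_actions)) for _ in range(n_agents)]
# Centralized critic: conditions on the concatenated joint observation (training only).
critic_w = rng.normal(size=(n_agents * obs_dim,))

def act(agent, local_obs):
    """Execution: greedy action from the agent's own observation only."""
    return int(np.argmax(local_obs @ actor_w[agent]))

def centralized_value(joint_obs):
    """Training: the critic sees all agents' observations at once."""
    return float(np.concatenate(joint_obs) @ critic_w)

joint_obs = [rng.normal(size=obs_dim) for _ in range(n_agents)]
actions = [act(i, joint_obs[i]) for i in range(n_agents)]  # no global info needed
v = centralized_value(joint_obs)                           # global info used here
```

The key point is that `centralized_value` can be discarded after training; only the per-agent `act` functions are needed for deployment.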
For cooperative tasks, the key challenge is decomposing a joint value function into individual agent contributions while maintaining representational capacity:
- Individual-Global-Max (IGM): if each agent independently takes the action that maximizes its local Q-value, the resulting joint action should also maximize the joint Q-value, i.e., argmax_a Q_tot = (argmax_{a_1} Q_1, ..., argmax_{a_n} Q_n)
- Monotonicity Constraint: Q_tot should be monotonically non-decreasing in each agent's Q-value (dQ_tot/dQ_i >= 0), a sufficient condition for IGM used by QMIX
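The monotonicity constraint can be enforced the way QMIX does it: a hypernetwork generates the mixing weights from the global state, and an absolute value keeps those weights non-negative. A simplified numpy sketch with random (untrained) hypernetwork parameters, and ReLU in place of the paper's ELU (both preserve monotonicity):

```python
import numpy as np

rng = np.random.default_rng(0)
n_agents, state_dim, embed = 3, 6, 8

# Hypernetwork parameters: map the global state to mixing-network weights.
hyper_w1 = rng.normal(size=(state_dim, n_agents * embed))
hyper_b1 = rng.normal(size=(state_dim, embed))
hyper_w2 = rng.normal(size=(state_dim, embed))
hyper_b2 = rng.normal(size=(state_dim,))

def mix(agent_qs, state):
    """Monotonic mixing: abs() on the generated weights keeps dQ_tot/dQ_i >= 0."""
    w1 = np.abs(state @ hyper_w1).reshape(n_agents, embed)  # non-negative weights
    b1 = state @ hyper_b1                                   # biases may be any sign
    hidden = np.maximum(agent_qs @ w1 + b1, 0.0)            # monotone activation
    w2 = np.abs(state @ hyper_w2)
    b2 = state @ hyper_b2
    return float(hidden @ w2 + b2)

state = rng.normal(size=state_dim)
qs = np.array([0.1, -0.4, 0.7])
q_tot = mix(qs, state)
# Raising any single agent's Q-value can never lower Q_tot.
bumped = mix(qs + np.array([1.0, 0.0, 0.0]), state)
```

Because weights and activation are both monotone non-decreasing, the per-agent argmax of each Q_i is also the argmax of Q_tot, which is exactly what makes decentralized greedy execution consistent with centralized training.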
Beyond value factorization, MARL poses several general challenges:
- Non-stationarity: From each agent's perspective, the environment is non-stationary because other agents' policies evolve during training
- Credit Assignment: Determining each agent's contribution to team success
- Scalability: Complexity grows exponentially with the number of agents
- Partial Observability: Agents often have limited local observations
- Communication: Designing efficient communication protocols between agents
Widely used algorithms include:
- QMIX: Monotonic value function factorization using a state-conditioned mixing network
- WQMIX: Weighted QMIX with relaxed monotonicity constraints
- QPLEX: Duplex dueling architecture for complete IGM factorization
- MAPPO: Multi-Agent PPO with shared centralized critic
- MADDPG: Multi-Agent DDPG for continuous control
| Algorithm | Type | Action Space | Best For | Key Innovation |
|---|---|---|---|---|
| MAPPO | Policy-gradient | Continuous/Discrete | Cooperative tasks with continuous actions | Shared centralized critic with GAE |
| QMIX | Value-based | Discrete | Cooperative tasks with discrete actions | Monotonic value mixing |
| WQMIX | Value-based | Discrete | Non-monotonic cooperative tasks | Importance-weighted mixing |
| QPLEX | Value-based | Discrete | Complex factorization needs | Duplex dueling structure |
| MADDPG | Policy-gradient | Continuous | Mixed cooperative-competitive | Per-agent critics with global info |
When to use each algorithm:
MAPPO:
- Continuous action spaces (robot control, autonomous vehicles)
- Environments with high-dimensional observations
- When stability and ease of tuning are priorities
QMIX/WQMIX/QPLEX:
- Discrete action spaces (StarCraft, traffic control)
- Strict cooperation requirements
- When sample efficiency is critical
- WQMIX when optimal policy violates monotonicity
- QPLEX when you need the strongest representational capacity
MADDPG:
- Mixed cooperative-competitive scenarios
- Continuous control tasks
- When you can afford the computational cost of per-agent critics
Multi-agent replay buffers typically store:
- Per-agent observations and actions
- Global state (if available)
- Shared team reward (cooperative) or individual rewards (competitive)
- Episode information for proper trajectory handling
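The fields above map naturally onto a transition record plus a ring buffer. A stdlib-only sketch (class and field names are illustrative; episode-based algorithms such as QMIX would store whole episodes rather than individual transitions):

```python
import random
from collections import deque
from dataclasses import dataclass

@dataclass
class Transition:
    obs: list        # per-agent observations
    actions: list    # per-agent actions
    state: list      # global state (if available; else e.g. concatenated obs)
    reward: float    # shared team reward (use a list for per-agent rewards)
    done: bool       # episode-boundary flag for proper trajectory handling

class MultiAgentReplayBuffer:
    """FIFO buffer: old transitions are evicted once capacity is reached."""

    def __init__(self, capacity=10_000):
        self.buffer = deque(maxlen=capacity)

    def add(self, transition):
        self.buffer.append(transition)

    def sample(self, batch_size):
        return random.sample(self.buffer, batch_size)

buf = MultiAgentReplayBuffer()
for t in range(32):
    buf.add(Transition(obs=[[0.0]] * 2, actions=[0, 1],
                       state=[0.0], reward=1.0, done=(t % 8 == 7)))
batch = buf.sample(4)
```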
Typical network architecture components:
- Observation encoding: CNN for visual inputs, MLP for vector inputs
- Agent networks: Often parameter-shared across agents to improve generalization
- Centralized components: Mixing networks or critics process concatenated information
- RNNs: Often used to handle partial observability (QMIX with GRU, etc.)
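Parameter sharing is commonly combined with a one-hot agent ID appended to the observation, so one shared network can still specialize per agent. A minimal numpy sketch with random weights (a real implementation would add the recurrent layer mentioned above):

```python
import numpy as np

rng = np.random.default_rng(0)
n_agents, obs_dim, hidden, n_actions = 3, 4, 16, 5
in_dim = obs_dim + n_agents  # observation plus one-hot agent ID

# One shared set of weights serves all agents.
w1 = rng.normal(size=(in_dim, hidden)) * 0.1
w2 = rng.normal(size=(hidden, n_actions)) * 0.1

def agent_q(agent_id, obs):
    """Shared-parameter Q-network; the one-hot ID lets outputs differ per agent."""
    one_hot = np.eye(n_agents)[agent_id]
    x = np.concatenate([obs, one_hot])
    h = np.maximum(x @ w1, 0.0)
    return h @ w2

obs = rng.normal(size=obs_dim)
# Same observation, same weights, yet per-agent Q-values differ via the ID.
q_values = [agent_q(i, obs) for i in range(n_agents)]
```

Sharing cuts the parameter count from n_agents networks to one, which is also why it tends to improve generalization when agents are homogeneous.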
Practical training tips:
- Parameter sharing: Share weights across agents to reduce parameter count and improve generalization
- Gradient clipping: Essential for stability in multi-agent settings
- Target networks: Use slower-updating targets to stabilize learning
- Episode-based training: Train on complete episodes for proper credit assignment
- Exploration: Epsilon-greedy, action noise, or entropy bonuses scaled per agent
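Two of the stabilization tricks above are simple enough to show directly: gradient clipping by global norm, and Polyak-averaged target networks. A framework-free numpy sketch (the constants are illustrative defaults, not prescribed values):

```python
import numpy as np

def clip_by_global_norm(grads, max_norm=10.0):
    """Rescale all gradients jointly so their combined L2 norm stays bounded."""
    total = np.sqrt(sum(np.sum(g ** 2) for g in grads))
    scale = min(1.0, max_norm / (total + 1e-8))
    return [g * scale for g in grads]

def soft_update(target_params, online_params, tau=0.005):
    """Polyak averaging: targets trail the online network to stabilize bootstrapping."""
    return [(1 - tau) * t + tau * o for t, o in zip(target_params, online_params)]

# Exploding gradients get rescaled, preserving their direction.
grads = [np.full((4,), 100.0), np.full((4,), -100.0)]
clipped = clip_by_global_norm(grads)

# Target parameters move only a small step toward the online parameters.
new_target = soft_update([np.zeros(2)], [np.ones(2)])
```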
Key papers:
- QMIX: Rashid et al., "QMIX: Monotonic Value Function Factorisation for Deep Multi-Agent Reinforcement Learning", 2018
- WQMIX: Rashid et al., "Weighted QMIX: Expanding Monotonic Value Function Factorisation for Deep Multi-Agent Reinforcement Learning", 2020
- QPLEX: Wang et al., "QPLEX: Duplex Dueling Multi-Agent Q-Learning", 2020
- MAPPO: Yu et al., "The Surprising Effectiveness of PPO in Cooperative Multi-Agent Games", 2022
- MADDPG: Lowe et al., "Multi-Agent Actor-Critic for Mixed Cooperative-Competitive Environments", 2017
Open research directions:
- Offline RL: Multi-agent offline RL is an emerging research area
- Reward Modeling: Designing reward functions for multi-agent cooperation
- Exploration: Multi-agent exploration strategies