A comprehensive guide to reinforcement learning methods, from foundational value-based algorithms to cutting-edge alignment techniques for large language models and multi-agent systems.
- Overview
- RL Landscape
- When to Use Each Method
- Methods Catalog
- Implementation Reference
- Performance Characteristics
- Key Innovations by Year
- Common Patterns
- Benchmarks
- References
- Next Steps
Reinforcement Learning (RL) enables agents to learn optimal behavior through interaction with an environment, receiving rewards or penalties based on their actions. Unlike supervised learning, RL agents must discover which actions yield the most reward through trial and error, making it ideal for sequential decision-making problems.
An RL agent learns a policy π that maps states to actions to maximize cumulative reward:
Objective: max_π E[ Σ_{t=0}^∞ γ^t r_t ]
Where:
- π (Policy): Strategy for selecting actions given states
- r_t: Reward at timestep t
- γ: Discount factor (0 < γ ≤ 1)
- V(s): Value of state s (expected cumulative reward)
- Q(s,a): Value of taking action a in state s
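To make the objective concrete, here is a minimal, self-contained sketch (not part of the library) that computes the discounted return Σ γ^t r_t for one episode's reward sequence:

```python
def discounted_return(rewards, gamma=0.99):
    """Discounted return G = sum_t gamma^t * r_t for one episode's reward sequence."""
    g = 0.0
    for r in reversed(rewards):   # work backwards: G_t = r_t + gamma * G_{t+1}
        g = r + gamma * g
    return g

# Example: rewards [1, 0, 1] with gamma = 0.9 -> 1 + 0.9*0 + 0.81*1 = 1.81
print(discounted_return([1.0, 0.0, 1.0], gamma=0.9))
```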
RL has achieved superhuman performance in complex domains:
- Games: AlphaGo, OpenAI Five (Dota 2), AlphaStar (StarCraft II)
- Robotics: Dexterous manipulation, locomotion, industrial automation
- LLM Alignment: ChatGPT's RLHF, Claude's Constitutional AI
- Autonomous Systems: Self-driving cars, drone navigation, traffic control
- Resource Management: Data center cooling, chip design, financial trading
The field has evolved from simple Q-learning to sophisticated methods that can train billion-parameter language models and coordinate hundreds of agents.
The landscape breaks down into the following families:

- Classic RL
  - Value-Based: DQN, Rainbow
  - Policy Gradient: PPO, SAC, DDPG, TD3
- Modern Methods
  - Offline RL: IQL, CQL, ReBRAC
  - Model-Based: DreamerV3, TD-MPC2
- Specialized Methods
  - Alignment: DPO, GRPO, KTO, SimPO
  - Multi-Agent: MAPPO, QMIX
- Auxiliary Methods
  - Exploration: ICM, RND
  - Sequence Modeling: Decision Transformer
  - Reward Modeling
- Planning & Search: MCTS, AlphaZero
Learn action-value functions Q(s,a) to select optimal actions:
- DQN: Deep Q-Network with experience replay
- Double DQN: Addresses overestimation bias
- Dueling DQN: Separate value and advantage streams
- Rainbow: Combines six DQN improvements (double Q-learning, dueling heads, prioritized replay, multi-step returns, distributional values, noisy nets)
- C51: Distributional RL (categorical distribution)
- QR-DQN: Quantile regression for distributional RL
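The common core of these methods is the temporal-difference target. The sketch below (illustrative PyTorch, assuming a hypothetical replay batch layout rather than the library's actual API) shows the standard DQN target and the Double DQN variant:

```python
import torch
import torch.nn.functional as F

def dqn_loss(q_net, target_net, batch, gamma=0.99, double=False):
    """TD loss for the DQN family on a sampled replay batch.
    `batch` is a hypothetical tuple of tensors: states (B, d), actions (B,) long,
    rewards (B,), next_states (B, d), dones (B,) in {0, 1}."""
    states, actions, rewards, next_states, dones = batch
    with torch.no_grad():
        next_q = target_net(next_states)                        # (B, n_actions)
        if double:
            # Double DQN: online net picks the action, target net evaluates it.
            best = q_net(next_states).argmax(dim=1, keepdim=True)
            next_value = next_q.gather(1, best).squeeze(1)
        else:
            next_value = next_q.max(dim=1).values
        targets = rewards + gamma * (1.0 - dones) * next_value  # TD target y
    q_taken = q_net(states).gather(1, actions.unsqueeze(1)).squeeze(1)
    return F.smooth_l1_loss(q_taken, targets)                   # Huber loss, as in DQN
```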
Directly optimize the policy through gradient ascent:
- REINFORCE: Monte Carlo policy gradient
- A2C: Advantage Actor-Critic (synchronous)
- PPO: Proximal Policy Optimization (industry standard)
- DDPG: Deep Deterministic Policy Gradient
- TD3: Twin Delayed DDPG (reduces overestimation)
- SAC: Soft Actor-Critic (maximum entropy RL)
- TRPO: Trust Region Policy Optimization
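These methods differ mainly in how they constrain each policy update. As an illustration, a minimal PyTorch sketch of PPO's clipped surrogate loss (function and argument names are assumptions, not the library interface):

```python
import torch

def ppo_clip_loss(new_logprobs, old_logprobs, advantages, clip_eps=0.2):
    """PPO's clipped surrogate: limit how far the new policy moves from the one
    that collected the data by clipping the probability ratio."""
    ratio = torch.exp(new_logprobs - old_logprobs)        # pi_new(a|s) / pi_old(a|s)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages
    return -torch.min(unclipped, clipped).mean()          # negated: optimizers minimize
```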
Learn from fixed datasets without environment interaction:
- IQL: Implicit Q-Learning (expectile regression)
- CQL: Conservative Q-Learning (conservative estimation)
- Cal-QL: Calibrated Q-Learning
- IDQL: Implicit Diffusion Q-Learning (IQL-style value learning with diffusion policy extraction)
- ReBRAC: Revisited Behavior-Regularized Actor-Critic
- EDAC: Ensemble-Diversified Actor-Critic
- TD3+BC: TD3 with behavioral cloning
- AWR: Advantage-Weighted Regression
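A recurring idea in offline RL is never querying out-of-distribution actions. IQL, for example, fits its value function by expectile regression against dataset Q-values only; a hedged PyTorch sketch of that loss:

```python
import torch

def iql_value_loss(q_values, v_values, tau=0.7):
    """IQL's expectile regression: fit V(s) toward Q(s, a) evaluated only at dataset
    actions. With tau > 0.5 the loss is asymmetric, so V tracks an upper expectile of
    the dataset Q-values without evaluating out-of-distribution actions."""
    diff = q_values - v_values
    weight = torch.abs(tau - (diff < 0).float())   # tau if diff >= 0, (1 - tau) otherwise
    return (weight * diff.pow(2)).mean()
```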
Align language models with human preferences:
- DPO: Direct Preference Optimization (no reward model)
- GRPO: Group Relative Policy Optimization
- KTO: Kahneman-Tversky Optimization
- SimPO: Simple Preference Optimization
- ORPO: Odds Ratio Preference Optimization
- IPO: Identity Preference Optimization
- SPIN: Self-Play Fine-Tuning
- RLOO: REINFORCE Leave-One-Out (per-prompt baseline from the other sampled completions)
- ReMax: REINFORCE-style RLHF with a greedy-decoding baseline
- RLVR: Reinforcement Learning with Verifiable Rewards
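As an example of reward-model-free alignment, a minimal PyTorch sketch of the DPO objective (sequence-level log-probabilities are assumed to be precomputed; names are illustrative):

```python
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    """DPO objective: widen the chosen-vs-rejected log-probability margin of the policy
    relative to a frozen reference model (an implicit KL-regularized reward)."""
    policy_margin = policy_chosen_logps - policy_rejected_logps
    ref_margin = ref_chosen_logps - ref_rejected_logps
    return -F.logsigmoid(beta * (policy_margin - ref_margin)).mean()
```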
Multiple agents learning simultaneously:
- MAPPO: Multi-Agent PPO
- QMIX: Value decomposition for cooperation
- WQMIX: Weighted QMIX
- MADDPG: Multi-Agent DDPG
- QPLEX: Q-value decomposition with duplex dueling
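For intuition on value decomposition, below is a compact, illustrative PyTorch sketch of a QMIX-style mixing network (not the library implementation). The absolute value on the hypernetwork outputs enforces the monotonicity constraint that makes decentralized greedy action selection consistent with the joint value:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class QMixMixer(nn.Module):
    """Illustrative QMIX-style mixer: combines per-agent utilities Q_i into Q_tot.
    Hypernetworks map the global state to mixing weights; abs() keeps
    dQ_tot/dQ_i >= 0, so per-agent greedy actions also maximize the joint value."""

    def __init__(self, n_agents: int, state_dim: int, embed_dim: int = 32):
        super().__init__()
        self.n_agents = n_agents
        self.embed_dim = embed_dim
        self.hyper_w1 = nn.Linear(state_dim, n_agents * embed_dim)
        self.hyper_b1 = nn.Linear(state_dim, embed_dim)
        self.hyper_w2 = nn.Linear(state_dim, embed_dim)
        self.hyper_b2 = nn.Sequential(
            nn.Linear(state_dim, embed_dim), nn.ReLU(), nn.Linear(embed_dim, 1)
        )

    def forward(self, agent_qs: torch.Tensor, state: torch.Tensor) -> torch.Tensor:
        # agent_qs: (batch, n_agents); state: (batch, state_dim)
        b = agent_qs.size(0)
        w1 = torch.abs(self.hyper_w1(state)).view(b, self.n_agents, self.embed_dim)
        b1 = self.hyper_b1(state).view(b, 1, self.embed_dim)
        hidden = F.elu(torch.bmm(agent_qs.unsqueeze(1), w1) + b1)  # (b, 1, embed_dim)
        w2 = torch.abs(self.hyper_w2(state)).view(b, self.embed_dim, 1)
        b2 = self.hyper_b2(state).view(b, 1, 1)
        q_tot = torch.bmm(hidden, w2) + b2                          # (b, 1, 1)
        return q_tot.view(b, 1)
```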
Learn environment dynamics for planning:
- DreamerV3: World model learning via reconstruction
- TD-MPC2: Temporal difference model predictive control
- MBPO: Model-Based Policy Optimization
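The shared recipe is to learn a dynamics model and then optimize the policy on imagined experience. A rough sketch in the spirit of MBPO's short branched rollouts (all callables and names here are hypothetical placeholders, not the library API):

```python
def imagine_branched_rollouts(dynamics_model, policy, start_states, horizon=5):
    """Roll the learned dynamics model forward a few steps from real states and
    collect the synthetic transitions for policy training."""
    synthetic = []
    states = start_states
    for _ in range(horizon):                    # short rollouts limit compounding model error
        actions = policy(states)                # sample actions from the current policy
        next_states, rewards = dynamics_model(states, actions)  # learned one-step model
        synthetic.append((states, actions, rewards, next_states))
        states = next_states
    return synthetic                            # appended to the policy's replay buffer
```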
Systematic exploration beyond random actions:
- ICM: Intrinsic Curiosity Module
- RND: Random Network Distillation
- Go-Explore: Archive-based exploration
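Intrinsic-motivation methods add a novelty bonus to the environment reward. An illustrative PyTorch sketch of the RND bonus (network sizes are arbitrary assumptions):

```python
import torch
import torch.nn as nn

class RNDBonus(nn.Module):
    """Illustrative RND module: a trained predictor chases a frozen, randomly
    initialized target network; the prediction error is the intrinsic reward."""

    def __init__(self, obs_dim: int, feat_dim: int = 64):
        super().__init__()
        self.target = nn.Sequential(nn.Linear(obs_dim, 128), nn.ReLU(),
                                    nn.Linear(128, feat_dim))
        self.predictor = nn.Sequential(nn.Linear(obs_dim, 128), nn.ReLU(),
                                       nn.Linear(128, feat_dim))
        for p in self.target.parameters():
            p.requires_grad_(False)          # the target stays random and frozen

    def forward(self, obs: torch.Tensor) -> torch.Tensor:
        # Per-observation intrinsic reward; the same quantity (averaged) is minimized
        # to train the predictor, so the bonus shrinks for frequently visited states.
        return (self.predictor(obs) - self.target(obs)).pow(2).mean(dim=-1)
```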
Apply transformer architectures to RL:
- Decision Transformer: Conditional sequence modeling
- Elastic DT: Adaptive horizon decision transformer
- Online DT: Online fine-tuning of offline models
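These methods recast RL as conditional sequence modeling: the model is conditioned on a target return-to-go at every timestep. A small helper sketch (illustrative, not the library API) for computing returns-to-go from a trajectory's rewards:

```python
def returns_to_go(rewards, gamma=1.0):
    """Return-to-go at each timestep: the (typically undiscounted) sum of rewards
    from t to the end of the episode, used as the conditioning signal."""
    rtg = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        rtg[t] = running
    return rtg

# Example: rewards [1, 0, 2] -> returns-to-go [3, 2, 2]
```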
Learn reward functions from human feedback:
- Enhanced Reward Model: Advanced preference modeling
- PRM: Process Reward Model (step-by-step)
- ORM: Outcome Reward Model (final result)
- Generative RM: Generative reward modeling
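Outcome reward models are typically trained with a Bradley-Terry pairwise objective over preferred vs. rejected responses; process reward models apply similar scoring to each intermediate reasoning step. A minimal PyTorch sketch of the pairwise loss (names are illustrative):

```python
import torch.nn.functional as F

def pairwise_reward_loss(chosen_rewards, rejected_rewards):
    """Bradley-Terry objective: the scalar reward assigned to the preferred
    response should exceed the reward assigned to the rejected one."""
    return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()
```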
Combine search with learned value/policy:
- MCTS: Monte Carlo Tree Search
- PRM Agent: Planning with process rewards
- AlphaZero: Self-play planning
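AlphaZero-style search selects child nodes by a PUCT score that trades off a node's mean value against a prior-weighted exploration bonus. A small illustrative sketch:

```python
import math

def puct_score(mean_value, visit_count, prior, parent_visits, c_puct=1.5):
    """PUCT: exploit the child's mean value while exploring children the policy
    prior favors and that have been visited relatively rarely."""
    exploration = c_puct * prior * math.sqrt(parent_visits) / (1 + visit_count)
    return mean_value + exploration

# Selection descends into the child with the highest score at each tree node.
```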
Start: What's your problem domain?
├─ Discrete Actions (e.g., Atari games, board games)?
│ ├─ Simple environment → DQN
│ ├─ Need robustness → Rainbow DQN
│ ├─ Risk-sensitive → C51 or QR-DQN
│ └─ Multi-agent coordination → QMIX/QPLEX
├─ Continuous Control (e.g., robotics, locomotion)?
│ ├─ Sample efficiency priority → SAC or TD3
│ ├─ Stable training priority → PPO
│ ├─ Simple baseline → DDPG
│ └─ Maximum entropy exploration → SAC
├─ Learning from Offline Data?
│ ├─ No online fine-tuning → IQL or CQL
│ ├─ Hybrid online/offline → Cal-QL
│ ├─ Need conservatism → CQL or EDAC
│ └─ Simple continuous control → TD3+BC or AWR
├─ Aligning Language Models?
│ ├─ No reward model available → DPO or SimPO
│ ├─ Have reward model → PPO (RLHF)
│ ├─ Group-based preferences → GRPO
│   ├─ Binary (unpaired) feedback → KTO
│ └─ Self-improvement → SPIN
├─ Multi-Agent Coordination?
│ ├─ Cooperative tasks → MAPPO or QMIX
│ ├─ Competitive tasks → MADDPG
│ └─ Complex value decomposition → QPLEX
├─ Need Sample Efficiency?
│ ├─ Learn world model → DreamerV3
│ ├─ Model-predictive control → TD-MPC2
│ └─ Hybrid approach → MBPO
├─ Sparse Rewards Problem?
│ ├─ Need curiosity → ICM or RND
│ ├─ Hard exploration → Go-Explore
│ └─ Need reward shaping → Reward Modeling
├─ Sequential Decision Making?
│ ├─ Offline dataset → Decision Transformer
│ ├─ Adaptive horizon → Elastic DT
│ └─ Online learning → Online DT
└─ Game Playing with Perfect Info?
├─ Need search → MCTS
├─ Self-play → AlphaZero
└─ Process-based → PRM Agent
| Game Type | Best Method | Alternative | Notes |
|---|---|---|---|
| Atari | Rainbow DQN | PPO | Discrete actions, pixel input |
| Board Games | AlphaZero | MCTS + NN | Perfect information |
| Multi-Agent Games | MAPPO | QMIX | Coordination required |
| Sparse Reward Games | Go-Explore | RND/ICM | Hard exploration |
| Task | Best Method | Alternative | Sample Efficiency |
|---|---|---|---|
| Locomotion | SAC/TD3 | PPO | High |
| Manipulation | DreamerV3 | TD-MPC2 | Very High |
| From Demos | IQL | AWR | Offline only |
| Sim-to-Real | MBPO | SAC | High |
| Scenario | Method | Complexity | Preference Type |
|---|---|---|---|
| Standard RLHF | PPO + RM | High | Pairwise |
| No Reward Model | DPO | Low | Pairwise |
| Simple Preferences | SimPO | Low | Pairwise |
| Group Preferences | GRPO | Medium | Group |
| Binary Feedback | KTO | Low | Thumbs up/down |
| Self-Improvement | SPIN | Medium | Self-generated |
| Cooperation Level | Agents | Method | Communication |
|---|---|---|---|
| Fully Cooperative | 2-20 | MAPPO | Centralized training |
| Value Decomposition | 2-50 | QMIX/QPLEX | Implicit |
| Mixed Cooperative | 2-10 | MADDPG | Explicit |
| Competitive | 2-10 | Self-Play PPO | None |
| Method | Year | Complexity | Use Case | Sample Efficiency |
|---|---|---|---|---|
| DQN | 2013 | Low | Discrete actions | Medium |
| Double DQN | 2015 | Low | Reduce overestimation | Medium |
| Dueling DQN | 2016 | Medium | State-dependent value | Medium |
| Rainbow | 2017 | High | State-of-the-art DQN | Medium |
| C51 | 2017 | Medium | Distributional RL | Medium |
| QR-DQN | 2017 | Medium | Quantile regression | Medium |
| Method | Year | Complexity | Use Case | Sample Efficiency |
|---|---|---|---|---|
| REINFORCE | 1992 | Low | Simple baseline | Low |
| A2C | 2016 | Low | Baseline actor-critic | Low |
| PPO | 2017 | Medium | General purpose (gold standard) | Medium |
| DDPG | 2015 | Medium | Continuous control | Medium |
| TD3 | 2018 | Medium | Stable continuous control | High |
| SAC | 2018 | Medium | Sample-efficient continuous | High |
| TRPO | 2015 | High | Safe policy updates | Medium |
| Method | Year | Complexity | Use Case | Dataset Quality |
|---|---|---|---|---|
| IQL | 2021 | Medium | General offline RL | Any |
| CQL | 2020 | Medium | Conservative learning | Any |
| Cal-QL | 2023 | Medium | Calibrated Q-learning | Any |
| IDQL | 2023 | Medium | Diffusion policy extraction | Any |
| ReBRAC | 2023 | Medium | Regularized actor-critic | Medium-High |
| EDAC | 2021 | High | Ensemble diversification | Medium-High |
| TD3+BC | 2021 | Low | Simple offline baseline | Medium-High |
| AWR | 2019 | Low | Advantage-weighted | Medium-High |
| Method | Year | Complexity | Use Case | Requires RM |
|---|---|---|---|---|
| DPO | 2023 | Low | Direct preference opt | No |
| GRPO | 2024 | Medium | Group preferences | No |
| KTO | 2024 | Low | Binary feedback | No |
| SimPO | 2024 | Low | Simple preferences | No |
| ORPO | 2024 | Low | Odds ratio preferences | No |
| IPO | 2023 | Low | Identity preferences | No |
| SPIN | 2024 | Medium | Self-play fine-tuning | No |
| RLOO | 2024 | Medium | REINFORCE leave-one-out baseline | Yes |
| ReMax | 2024 | Medium | Reward maximization | Yes |
| RLVR | 2024 | Medium | Verifiable rewards | No (programmatic verifier) |
| Method | Year | Complexity | Use Case | Scalability |
|---|---|---|---|---|
| MAPPO | 2021 | Medium | Cooperative tasks | 10-100 agents |
| QMIX | 2018 | Medium | Value decomposition | 5-20 agents |
| WQMIX | 2020 | Medium | Weighted QMIX | 5-20 agents |
| MADDPG | 2017 | High | Competitive/mixed | 2-10 agents |
| QPLEX | 2020 | High | Complex decomposition | 5-20 agents |
| Method | Year | Complexity | Use Case | Sample Efficiency |
|---|---|---|---|---|
| DreamerV3 | 2023 | High | General RL | Very High |
| TD-MPC2 | 2024 | High | Model-predictive control | Very High |
| MBPO | 2019 | Medium | Hybrid model-based | High |
| Method | Year | Complexity | Use Case | Exploration Type |
|---|---|---|---|---|
| ICM | 2017 | Medium | Curiosity-driven | Intrinsic motivation |
| RND | 2018 | Medium | Novelty detection | Intrinsic motivation |
| Go-Explore | 2019 | High | Hard exploration | Archive-based |
| Method | Year | Complexity | Use Case | Data Type |
|---|---|---|---|---|
| Decision Transformer | 2021 | Medium | Offline RL | Trajectories |
| Elastic DT | 2023 | Medium | Adaptive horizon | Trajectories |
| Online DT | 2022 | Medium | Online fine-tuning | Mixed |
| Method | Year | Complexity | Use Case | Granularity |
|---|---|---|---|---|
| Enhanced RM | 2023 | Medium | Advanced preferences | Outcome |
| PRM | 2023 | High | Step-by-step rewards | Process |
| ORM | 2023 | Low | Final outcome rewards | Outcome |
| Generative RM | 2024 | High | Generative modeling | Process |
| Method | Year | Complexity | Use Case | Search Depth |
|---|---|---|---|---|
| MCTS | 2006 | Medium | Game playing | Adaptive |
| PRM Agent | 2023 | High | Reasoning tasks | Fixed |
| AlphaZero | 2017 | High | Perfect info games | Adaptive |
All methods are implemented in nexus/models/rl/ with production-grade quality:
from nexus.models.rl import (
# Value-Based
DQN,
DoubleDQN,
DuelingDQN,
RainbowDQN,
C51,
QRDQN,
# Policy Gradient
REINFORCE,
A2C,
PPO,
DDPG,
TD3,
SAC,
TRPO,
# Offline RL
IQL,
CQL,
CalQL,
IDQL,
ReBRAC,
EDAC,
TD3BC,
AWR,
# Alignment
DPO,
GRPO,
KTO,
SimPO,
ORPO,
IPO,
SPIN,
RLOO,
ReMax,
RLVR,
# Multi-Agent
MAPPO,
QMIX,
WQMIX,
MADDPG,
QPLEX,
# Model-Based
DreamerV3,
TDMPC2,
MBPO,
# Exploration
ICM,
RND,
GoExplore,
# Sequence-Based
DecisionTransformer,
ElasticDT,
OnlineDT,
# Reward Modeling
EnhancedRewardModel,
ProcessRewardModel,
OutcomeRewardModel,
GenerativeRM,
# Planning
MCTS,
PRMAgent,
AlphaZero,
)

agent = DQN(
state_dim=4,
action_dim=2,
hidden_dim=128,
learning_rate=1e-3,
gamma=0.99,
epsilon_decay=0.995
)
action = agent.select_action(state)
agent.update(state, action, reward, next_state, done)

agent = PPO(
state_dim=17,
action_dim=6,
hidden_dim=256,
learning_rate=3e-4,
gamma=0.99,
gae_lambda=0.95,
clip_epsilon=0.2,
n_epochs=10
)
# Collect rollouts
rollouts = agent.collect_rollouts(env, n_steps=2048)
# Update policy
agent.update(rollouts)

agent = SAC(
state_dim=24,
action_dim=4,
hidden_dim=256,
learning_rate=3e-4,
gamma=0.99,
tau=0.005,
alpha=0.2, # Entropy coefficient
automatic_entropy_tuning=True
)
action = agent.select_action(state, evaluate=False)
agent.update(replay_buffer, batch_size=256)

agent = IQL(
state_dim=11,
action_dim=3,
hidden_dim=256,
learning_rate=3e-4,
tau=0.7, # Expectile for value learning
beta=3.0 # Advantage weighting
)
# Train on offline dataset
agent.train(offline_dataset, n_steps=1000000)

from nexus.models.rl.alignment import DPO
trainer = DPO(
model=base_model,
ref_model=reference_model,
beta=0.1, # KL penalty coefficient
learning_rate=5e-7,
max_length=512
)
# Train on preference pairs
trainer.train(
preference_dataset, # (prompt, chosen, rejected) tuples
num_epochs=3
)

agent = MAPPO(
n_agents=5,
state_dim=20,
action_dim=5,
hidden_dim=128,
share_policy=True, # Parameter sharing
use_centralized_critic=True
)
actions = agent.select_actions(observations)
agent.update(multi_agent_buffer)

Sample efficiency (relative to PPO = 1.0x, higher is better):

PPO: ████████████ 1.0x (baseline)
DQN: ████████ 0.7x
Rainbow: ████████████████ 1.3x
SAC: ████████████████████ 1.6x
TD3: ██████████████████ 1.5x
DreamerV3: ████████████████████████████ 2.3x
TD-MPC2: ████████████████████████ 2.0x
IQL (offline): ████████████████████████████████ 2.7x
Training speed (wall-clock):

REINFORCE: ████████████████████████ Fast
A2C: ████████████████████ Fast
PPO: ████████████████ Medium-Fast
SAC/TD3: ████████████ Medium
DQN/Rainbow: ██████████ Medium
DreamerV3: ██████ Slow
AlphaZero: ███ Very Slow
Training stability (longer bar = less stable):

SAC: ██ Very Stable
TD3: ███ Very Stable
PPO: ████ Stable
TRPO: ████ Stable
DQN: ██████ Moderate
DDPG: ████████████ Unstable
REINFORCE: ████████████████ Very Unstable
Memory footprint:

REINFORCE: ██ Minimal
DQN: ████████ Medium (replay buffer)
PPO: ████ Small (rollout buffer)
SAC/TD3: ████████ Medium (replay buffer)
Rainbow: ████████████ Large (replay + extras)
DreamerV3: ████████████████ Very Large (world model)
AlphaZero: ████████████████████ Extreme (MCTS + NN)
- 1992: REINFORCE (Williams) - First policy gradient
- 2013: DQN (Mnih et al.) - Deep RL breakthrough
- 2015: DDPG (Lillicrap et al.) - Continuous control
- 2015: TRPO (Schulman et al.) - Trust region optimization
- 2016: A3C (Mnih et al.) - Asynchronous methods
- 2017: PPO (Schulman et al.) - Industry standard
- 2017: Rainbow (Hessel et al.) - DQN improvements combined
- 2017: AlphaZero (Silver et al.) - Self-play mastery
- 2018: SAC (Haarnoja et al.) - Maximum entropy RL
- 2018: TD3 (Fujimoto et al.) - Twin critics
- 2018: QMIX (Rashid et al.) - Multi-agent value decomposition
- 2019: MBPO (Janner et al.) - Model-based policy optimization
- 2019: Go-Explore (Ecoffet et al.) - Archive exploration
- 2020: CQL (Kumar et al.) - Conservative offline RL
- 2021: IQL (Kostrikov et al.) - Implicit Q-learning
- 2021: Decision Transformer (Chen et al.) - Sequence modeling for RL
- 2021: MAPPO (Yu et al.) - Multi-agent PPO
- 2023: DPO (Rafailov et al.) - Direct preference optimization
- 2023: DreamerV3 (Hafner et al.) - General world models
- 2023: ReBRAC (Tarasov et al.) - Regularized offline RL
- 2024: GRPO, KTO, SimPO - Next-gen alignment methods
- 2024: TD-MPC2 (Hansen et al.) - Scalable model predictive control
- 2023: DPO - Eliminates reward model
- 2024: SimPO - Simplifies DPO further
- 2024: GRPO - Group-based optimization
- 2024: KTO - Kahneman-Tversky for preferences
- 2024: ORPO - Odds ratio preferences
- 2024: SPIN - Self-play improvement
SAC or TD3 (sample-efficient)
+ Replay Buffer (1M transitions)
+ Target Networks (stabilization)
+ Automatic entropy tuning (SAC)
Used by: Robotics, autonomous systems, continuous control. Sample efficiency: High. Stability: Very high.
Phase 1: IQL or CQL on offline data
Phase 2: Fine-tune with SAC or PPO online
+ Hybrid replay buffer (offline + online)
Used by: Robotics with demonstrations, real-world RL. Sample efficiency: Very high. Risk: Low (starts from safe policy).
Phase 1: Supervised fine-tuning (SFT)
Phase 2: Reward modeling from preferences
Phase 3: PPO optimization with KL penalty
+ Reference model (KL constraint)
+ Value head (critic)
Used by: ChatGPT, Claude, Llama. Complexity: High. Quality: Very high.
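For reference, a hedged sketch of the per-token reward typically fed to PPO in Phase 3: a KL penalty toward the frozen reference model at every token, plus the reward-model score added on the final token of the response (coefficient value and names are assumptions):

```python
import torch

def rlhf_token_rewards(rm_score, policy_logprobs, ref_logprobs, kl_coef=0.05):
    """Per-token reward signal for PPO-based RLHF. Inputs are (response_len,) tensors
    of token log-probabilities plus a scalar reward-model score for the sequence."""
    with torch.no_grad():
        rewards = -kl_coef * (policy_logprobs - ref_logprobs)  # per-token KL penalty
        rewards[-1] += rm_score                                 # sequence-level RM score
    return rewards
```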
DPO, SimPO, or GRPO (no reward model)
+ Reference model (implicit reward)
+ Preference dataset (chosen vs rejected)
Used by: Zephyr, Starling, modern open models. Complexity: Low. Efficiency: High.
MAPPO or QMIX
+ Centralized training, decentralized execution
+ Value decomposition (QMIX) or shared critic (MAPPO)
+ Communication channels (optional)
Used by: Multi-robot systems, traffic control, SMAC. Scalability: 5-100 agents. Coordination: Excellent.
DreamerV3 or TD-MPC2
+ World model (learns dynamics)
+ Imagination rollouts (planning in latent space)
+ Actor-critic in world model
Used by: Sample-limited domains, sim-to-real. Sample efficiency: Extreme. Computational cost: High.
PPO or SAC base algorithm
+ Intrinsic rewards (ICM or RND)
+ Extrinsic + intrinsic reward combination
Used by: Sparse reward environments, hard exploration. Exploration: Excellent. Overhead: 20-30% slower.
| Algorithm | HalfCheetah | Walker2d | Ant | Humanoid |
|---|---|---|---|---|
| PPO | 1800 | 3000 | 3500 | 5000 |
| SAC | 9500 | 4500 | 5200 | 6800 |
| TD3 | 9200 | 4200 | 4800 | 6500 |
| DDPG | 8500 | 3800 | 3900 | 5200 |
| DreamerV3 | 10500 | 5000 | 6500 | 7200 |
| TD-MPC2 | 10200 | 4900 | 6200 | 7000 |
| Algorithm | Median Human-Normalized Score | Games > Human |
|---|---|---|
| DQN | 121% | 23/57 |
| Rainbow | 223% | 41/57 |
| PPO | 178% | 35/57 |
| IQN | 254% | 44/57 |
| MuZero | 731% | 52/57 |
| Algorithm | Halfcheetah-M | Walker2d-M | Hopper-M | Antmaze-U |
|---|---|---|---|---|
| BC | 42.1 | 75.0 | 52.5 | 54.6 |
| CQL | 44.4 | 79.2 | 58.0 | 74.0 |
| IQL | 47.4 | 82.6 | 66.3 | 87.5 |
| Cal-QL | 48.8 | 84.3 | 68.1 | 89.2 |
| ReBRAC | 49.2 | 85.1 | 69.4 | 90.8 |
| TD3+BC | 48.3 | 83.7 | 59.3 | 78.6 |
| Algorithm | 3s5z | 2c_vs_64zg | corridor | 6h_vs_8z |
|---|---|---|---|---|
| QMIX | 93% | 82% | 88% | 75% |
| WQMIX | 96% | 87% | 91% | 81% |
| QPLEX | 97% | 89% | 93% | 84% |
| MAPPO | 98% | 91% | 95% | 87% |
| Method | Llama 2 7B Base | Llama 2 13B Base | Mistral 7B Base |
|---|---|---|---|
| SFT Only | 12.4% | 16.8% | 19.2% |
| DPO | 18.7% | 24.3% | 28.9% |
| PPO (RLHF) | 21.3% | 27.8% | 32.1% |
| SimPO | 19.8% | 25.9% | 30.5% |
| GRPO | 20.9% | 26.7% | 31.4% |
- Policy Gradients
  - Williams (1992) - "Simple Statistical Gradient-Following Algorithms for Connectionist RL"
  - Sutton et al. (1999) - "Policy Gradient Methods for RL with Function Approximation"
- Value-Based Methods
  - Mnih et al. (2013) - "Playing Atari with Deep RL" (DQN)
  - van Hasselt et al. (2015) - "Deep RL with Double Q-learning"
  - Wang et al. (2016) - "Dueling Network Architectures"
  - Hessel et al. (2017) - "Rainbow: Combining Improvements in Deep RL"
  - Bellemare et al. (2017) - "A Distributional Perspective on RL" (C51)
- Policy Gradient Methods
  - Lillicrap et al. (2015) - "Continuous control with deep RL" (DDPG)
  - Schulman et al. (2015) - "Trust Region Policy Optimization" (TRPO)
  - Mnih et al. (2016) - "Asynchronous Methods for Deep RL" (A3C)
  - Schulman et al. (2017) - "Proximal Policy Optimization" (PPO)
  - Haarnoja et al. (2018) - "Soft Actor-Critic" (SAC)
  - Fujimoto et al. (2018) - "Addressing Function Approximation Error" (TD3)
- Offline Methods
  - Kumar et al. (2020) - "Conservative Q-Learning" (CQL)
  - Kostrikov et al. (2021) - "Offline RL with Implicit Q-Learning" (IQL)
  - Fujimoto & Gu (2021) - "A Minimalist Approach to Offline RL" (TD3+BC)
  - Nakamoto et al. (2023) - "Cal-QL: Calibrated Offline RL"
  - Tarasov et al. (2023) - "Revisiting the Minimalist Approach" (ReBRAC)
- Multi-Agent RL
  - Rashid et al. (2018) - "QMIX: Monotonic Value Function Factorisation"
  - Lowe et al. (2017) - "Multi-Agent Actor-Critic" (MADDPG)
  - Yu et al. (2021) - "The Surprising Effectiveness of PPO in Cooperative Multi-Agent Games" (MAPPO)
  - Wang et al. (2020) - "QPLEX: Duplex Dueling Multi-Agent Q-Learning"
- Model-Based RL
  - Janner et al. (2019) - "When to Trust Your Model" (MBPO)
  - Hafner et al. (2023) - "Mastering Diverse Domains through World Models" (DreamerV3)
  - Hansen et al. (2024) - "TD-MPC2: Scalable, Robust World Models for Continuous Control"
- Alignment Methods
  - Ouyang et al. (2022) - "Training language models to follow instructions" (InstructGPT/RLHF)
  - Rafailov et al. (2023) - "Direct Preference Optimization" (DPO)
  - Ahmadian et al. (2024) - "Back to Basics: Revisiting REINFORCE-Style Optimization for RLHF" (RLOO)
  - Hong et al. (2024) - "ORPO: Monolithic Preference Optimization without Reference Model"
  - Shao et al. (2024) - "DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models" (GRPO)
  - Ethayarajh et al. (2024) - "KTO: Model Alignment as Prospect Theoretic Optimization"
  - Meng et al. (2024) - "SimPO: Simple Preference Optimization"
- Exploration Methods
  - Pathak et al. (2017) - "Curiosity-driven Exploration" (ICM)
  - Burda et al. (2018) - "Exploration by Random Network Distillation" (RND)
  - Ecoffet et al. (2019) - "Go-Explore: a New Approach for Hard-Exploration"
- Planning & Search
  - Silver et al. (2017) - "Mastering Chess and Shogi by Self-Play with a General RL Algorithm" (AlphaZero)
  - Lightman et al. (2023) - "Let's Verify Step by Step" (Process Reward Models)
- Sequence Modeling
  - Chen et al. (2021) - "Decision Transformer: RL via Sequence Modeling"
  - Zheng et al. (2022) - "Online Decision Transformer"
- Start here: DQN for discrete actions
- Then try: PPO for continuous control
- Resources: Sutton & Barto "RL: An Introduction" (2nd ed.)
- Discrete actions: Rainbow DQN
- Perfect information: AlphaZero
- Multi-agent: MAPPO or QMIX
- World models: DreamerV3
- Transformers: Decision Transformer
- Exploration: RND or Go-Explore
See implementation files in Nexus/nexus/models/rl/
Each method includes:
- Clean PyTorch implementation
- Comprehensive docstrings
- Replay buffers and data structures
- Training loops and evaluation
- Compatible interfaces for easy swapping
- Support for various environments (Gym, MuJoCo, Atari, custom)
- All algorithms follow a common interface: select_action(), update(), save(), load()
- Support for both discrete and continuous action spaces where applicable
- Automatic device handling (CPU/GPU)
- Extensive logging and checkpointing
- Unit tests and integration tests