A comprehensive guide to reinforcement learning methods, from foundational value-based algorithms to cutting-edge alignment techniques for large language models and multi-agent systems.
- Overview
- RL Landscape
- When to Use Each Method
- Methods Catalog
- Implementation Reference
- Performance Characteristics
- Key Innovations by Year
- Common Patterns
- Benchmarks
- References
- Next Steps
Reinforcement Learning (RL) enables agents to learn optimal behavior through interaction with an environment, receiving rewards or penalties based on their actions. Unlike supervised learning, RL agents must discover which actions yield the most reward through trial and error, making it ideal for sequential decision-making problems.
An RL agent learns a policy π that maps states to actions to maximize cumulative reward:
Objective: max_π E[ Σ_{t=0}^∞ γ^t r_t ]
Where:
- π (Policy): Strategy for selecting actions given states
- r_t: Reward at timestep t
- γ: Discount factor (0 < γ ≤ 1)
- V(s): Value of state s (expected cumulative reward)
- Q(s,a): Value of taking action a in state s
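To make the objective concrete, here is a minimal, self-contained sketch (not part of the library) that computes the discounted return Σ γ^t r_t for one episode's reward sequence:

```python
def discounted_return(rewards, gamma=0.99):
    """Discounted return G = sum_t gamma^t * r_t for one episode's reward sequence."""
    g = 0.0
    for r in reversed(rewards):   # work backwards: G_t = r_t + gamma * G_{t+1}
        g = r + gamma * g
    return g

# Example: rewards [1, 0, 1] with gamma = 0.9 -> 1 + 0.9*0 + 0.81*1 = 1.81
print(discounted_return([1.0, 0.0, 1.0], gamma=0.9))
```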
RL has achieved superhuman performance in complex domains:
- Games: AlphaGo, OpenAI Five (Dota 2), AlphaStar (StarCraft II)
- Robotics: Dexterous manipulation, locomotion, industrial automation
- LLM Alignment: ChatGPT's RLHF, Claude's Constitutional AI
- Autonomous Systems: Self-driving cars, drone navigation, traffic control
- Resource Management: Data center cooling, chip design, financial trading
The field has evolved from simple Q-learning to sophisticated methods that can train billion-parameter language models and coordinate hundreds of agents.
The landscape breaks down into the following families:

- Classic RL
  - Value-Based: DQN, Rainbow
  - Policy Gradient: PPO, SAC, DDPG, TD3
- Modern Methods
  - Offline RL: IQL, CQL, ReBRAC
  - Model-Based: DreamerV3, TD-MPC2
- Specialized Methods
  - Alignment: DPO, GRPO, KTO, SimPO
  - Multi-Agent: MAPPO, QMIX
- Auxiliary Methods
  - Exploration: ICM, RND
  - Sequence Modeling: Decision Transformer
  - Reward Modeling
- Planning & Search: MCTS, AlphaZero
Learn action-value functions Q(s,a) to select optimal actions:
- DQN: Deep Q-Network with experience replay
- Double DQN: Addresses overestimation bias
- Dueling DQN: Separate value and advantage streams
- Rainbow: Combines six DQN improvements (double Q-learning, dueling heads, prioritized replay, multi-step returns, distributional values, noisy nets)
- C51: Distributional RL (categorical distribution)
- QR-DQN: Quantile regression for distributional RL
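The common core of these methods is the temporal-difference target. The sketch below (illustrative PyTorch, assuming a hypothetical replay batch layout rather than the library's actual API) shows the standard DQN target and the Double DQN variant:

```python
import torch
import torch.nn.functional as F

def dqn_loss(q_net, target_net, batch, gamma=0.99, double=False):
    """TD loss for the DQN family on a sampled replay batch.
    `batch` is a hypothetical tuple of tensors: states (B, d), actions (B,) long,
    rewards (B,), next_states (B, d), dones (B,) in {0, 1}."""
    states, actions, rewards, next_states, dones = batch
    with torch.no_grad():
        next_q = target_net(next_states)                        # (B, n_actions)
        if double:
            # Double DQN: online net picks the action, target net evaluates it.
            best = q_net(next_states).argmax(dim=1, keepdim=True)
            next_value = next_q.gather(1, best).squeeze(1)
        else:
            next_value = next_q.max(dim=1).values
        targets = rewards + gamma * (1.0 - dones) * next_value  # TD target y
    q_taken = q_net(states).gather(1, actions.unsqueeze(1)).squeeze(1)
    return F.smooth_l1_loss(q_taken, targets)                   # Huber loss, as in DQN
```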
Directly optimize the policy through gradient ascent:
- REINFORCE: Monte Carlo policy gradient
- A2C: Advantage Actor-Critic (synchronous)
- PPO: Proximal Policy Optimization (industry standard)
- DDPG: Deep Deterministic Policy Gradient
- TD3: Twin Delayed DDPG (reduces overestimation)
- SAC: Soft Actor-Critic (maximum entropy RL)
- TRPO: Trust Region Policy Optimization
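These methods differ mainly in how they constrain each policy update. As an illustration, a minimal PyTorch sketch of PPO's clipped surrogate loss (function and argument names are assumptions, not the library interface):

```python
import torch

def ppo_clip_loss(new_logprobs, old_logprobs, advantages, clip_eps=0.2):
    """PPO's clipped surrogate: limit how far the new policy moves from the one
    that collected the data by clipping the probability ratio."""
    ratio = torch.exp(new_logprobs - old_logprobs)        # pi_new(a|s) / pi_old(a|s)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages
    return -torch.min(unclipped, clipped).mean()          # negated: optimizers minimize
```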
Learn from fixed datasets without environment interaction:
- IQL: Implicit Q-Learning (expectile regression)
- CQL: Conservative Q-Learning (conservative estimation)
- Cal-QL: Calibrated Q-Learning
- IDQL: Implicit Diffusion Q-Learning (IQL-style value learning with diffusion policy extraction)
- ReBRAC: Revisited Behavior-Regularized Actor-Critic
- EDAC: Ensemble-Diversified Actor-Critic
- TD3+BC: TD3 with behavioral cloning
- AWR: Advantage-Weighted Regression
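A recurring idea in offline RL is never querying out-of-distribution actions. IQL, for example, fits its value function by expectile regression against dataset Q-values only; a hedged PyTorch sketch of that loss:

```python
import torch

def iql_value_loss(q_values, v_values, tau=0.7):
    """IQL's expectile regression: fit V(s) toward Q(s, a) evaluated only at dataset
    actions. With tau > 0.5 the loss is asymmetric, so V tracks an upper expectile of
    the dataset Q-values without evaluating out-of-distribution actions."""
    diff = q_values - v_values
    weight = torch.abs(tau - (diff < 0).float())   # tau if diff >= 0, (1 - tau) otherwise
    return (weight * diff.pow(2)).mean()
```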
Align language models with human preferences:
- DPO: Direct Preference Optimization (no reward model)
- GRPO: Group Relative Policy Optimization
- KTO: Kahneman-Tversky Optimization
- SimPO: Simple Preference Optimization
- ORPO: Odds Ratio Preference Optimization
- IPO: Identity Preference Optimization
- SPIN: Self-Play Fine-Tuning
- RLOO: REINFORCE Leave-One-Out (per-prompt baseline from the other sampled completions)
- ReMax: REINFORCE-style RLHF with a greedy-decoding baseline
- RLVR: Reinforcement Learning with Verifiable Rewards
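As an example of reward-model-free alignment, a minimal PyTorch sketch of the DPO objective (sequence-level log-probabilities are assumed to be precomputed; names are illustrative):

```python
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    """DPO objective: widen the chosen-vs-rejected log-probability margin of the policy
    relative to a frozen reference model (an implicit KL-regularized reward)."""
    policy_margin = policy_chosen_logps - policy_rejected_logps
    ref_margin = ref_chosen_logps - ref_rejected_logps
    return -F.logsigmoid(beta * (policy_margin - ref_margin)).mean()
```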
Multiple agents learning simultaneously:
- MAPPO: Multi-Agent PPO
- QMIX: Value decomposition for cooperation
- WQMIX: Weighted QMIX
- MADDPG: Multi-Agent DDPG
- QPLEX: Q-value decomposition with duplex dueling
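For intuition on value decomposition, below is a compact, illustrative PyTorch sketch of a QMIX-style mixing network (not the library implementation). The absolute value on the hypernetwork outputs enforces the monotonicity constraint that makes decentralized greedy action selection consistent with the joint value:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class QMixMixer(nn.Module):
    """Illustrative QMIX-style mixer: combines per-agent utilities Q_i into Q_tot.
    Hypernetworks map the global state to mixing weights; abs() keeps
    dQ_tot/dQ_i >= 0, so per-agent greedy actions also maximize the joint value."""

    def __init__(self, n_agents: int, state_dim: int, embed_dim: int = 32):
        super().__init__()
        self.n_agents = n_agents
        self.embed_dim = embed_dim
        self.hyper_w1 = nn.Linear(state_dim, n_agents * embed_dim)
        self.hyper_b1 = nn.Linear(state_dim, embed_dim)
        self.hyper_w2 = nn.Linear(state_dim, embed_dim)
        self.hyper_b2 = nn.Sequential(
            nn.Linear(state_dim, embed_dim), nn.ReLU(), nn.Linear(embed_dim, 1)
        )

    def forward(self, agent_qs: torch.Tensor, state: torch.Tensor) -> torch.Tensor:
        # agent_qs: (batch, n_agents); state: (batch, state_dim)
        b = agent_qs.size(0)
        w1 = torch.abs(self.hyper_w1(state)).view(b, self.n_agents, self.embed_dim)
        b1 = self.hyper_b1(state).view(b, 1, self.embed_dim)
        hidden = F.elu(torch.bmm(agent_qs.unsqueeze(1), w1) + b1)  # (b, 1, embed_dim)
        w2 = torch.abs(self.hyper_w2(state)).view(b, self.embed_dim, 1)
        b2 = self.hyper_b2(state).view(b, 1, 1)
        q_tot = torch.bmm(hidden, w2) + b2                          # (b, 1, 1)
        return q_tot.view(b, 1)
```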
Learn environment dynamics for planning:
- DreamerV3: World model learning via reconstruction
- TD-MPC2: Temporal difference model predictive control
- MBPO: Model-Based Policy Optimization
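The shared recipe is to learn a dynamics model and then optimize the policy on imagined experience. A rough sketch in the spirit of MBPO's short branched rollouts (all callables and names here are hypothetical placeholders, not the library API):

```python
def imagine_branched_rollouts(dynamics_model, policy, start_states, horizon=5):
    """Roll the learned dynamics model forward a few steps from real states and
    collect the synthetic transitions for policy training."""
    synthetic = []
    states = start_states
    for _ in range(horizon):                    # short rollouts limit compounding model error
        actions = policy(states)                # sample actions from the current policy
        next_states, rewards = dynamics_model(states, actions)  # learned one-step model
        synthetic.append((states, actions, rewards, next_states))
        states = next_states
    return synthetic                            # appended to the policy's replay buffer
```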
Systematic exploration beyond random actions:
- ICM: Intrinsic Curiosity Module
- RND: Random Network Distillation
- Go-Explore: Archive-based exploration
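Intrinsic-motivation methods add a novelty bonus to the environment reward. An illustrative PyTorch sketch of the RND bonus (network sizes are arbitrary assumptions):

```python
import torch
import torch.nn as nn

class RNDBonus(nn.Module):
    """Illustrative RND module: a trained predictor chases a frozen, randomly
    initialized target network; the prediction error is the intrinsic reward."""

    def __init__(self, obs_dim: int, feat_dim: int = 64):
        super().__init__()
        self.target = nn.Sequential(nn.Linear(obs_dim, 128), nn.ReLU(),
                                    nn.Linear(128, feat_dim))
        self.predictor = nn.Sequential(nn.Linear(obs_dim, 128), nn.ReLU(),
                                       nn.Linear(128, feat_dim))
        for p in self.target.parameters():
            p.requires_grad_(False)          # the target stays random and frozen

    def forward(self, obs: torch.Tensor) -> torch.Tensor:
        # Per-observation intrinsic reward; the same quantity (averaged) is minimized
        # to train the predictor, so the bonus shrinks for frequently visited states.
        return (self.predictor(obs) - self.target(obs)).pow(2).mean(dim=-1)
```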
Apply transformer architectures to RL:
- Decision Transformer: Conditional sequence modeling
- Elastic DT: Adaptive horizon decision transformer
- Online DT: Online fine-tuning of offline models
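These methods recast RL as conditional sequence modeling: the model is conditioned on a target return-to-go at every timestep. A small helper sketch (illustrative, not the library API) for computing returns-to-go from a trajectory's rewards:

```python
def returns_to_go(rewards, gamma=1.0):
    """Return-to-go at each timestep: the (typically undiscounted) sum of rewards
    from t to the end of the episode, used as the conditioning signal."""
    rtg = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        rtg[t] = running
    return rtg

# Example: rewards [1, 0, 2] -> returns-to-go [3, 2, 2]
```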
Learn reward functions from human feedback:
- Enhanced Reward Model: Advanced preference modeling
- PRM: Process Reward Model (step-by-step)
- ORM: Outcome Reward Model (final result)
- Generative RM: Generative reward modeling
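Outcome reward models are typically trained with a Bradley-Terry pairwise objective over preferred vs. rejected responses; process reward models apply similar scoring to each intermediate reasoning step. A minimal PyTorch sketch of the pairwise loss (names are illustrative):

```python
import torch.nn.functional as F

def pairwise_reward_loss(chosen_rewards, rejected_rewards):
    """Bradley-Terry objective: the scalar reward assigned to the preferred
    response should exceed the reward assigned to the rejected one."""
    return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()
```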
Combine search with learned value/policy:
- MCTS: Monte Carlo Tree Search
- PRM Agent: Planning with process rewards
- AlphaZero: Self-play planning
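AlphaZero-style search selects child nodes by a PUCT score that trades off a node's mean value against a prior-weighted exploration bonus. A small illustrative sketch:

```python
import math

def puct_score(mean_value, visit_count, prior, parent_visits, c_puct=1.5):
    """PUCT: exploit the child's mean value while exploring children the policy
    prior favors and that have been visited relatively rarely."""
    exploration = c_puct * prior * math.sqrt(parent_visits) / (1 + visit_count)
    return mean_value + exploration

# Selection descends into the child with the highest score at each tree node.
```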
Start: What's your problem domain?
├─ Discrete Actions (e.g., Atari games, board games)?
│ ├─ Simple environment → DQN
│ ├─ Need robustness → Rainbow DQN
│ ├─ Risk-sensitive → C51 or QR-DQN
│ └─ Multi-agent coordination → QMIX/QPLEX
├─ Continuous Control (e.g., robotics, locomotion)?
│ ├─ Sample efficiency priority → SAC or TD3
│ ├─ Stable training priority → PPO
│ ├─ Simple baseline → DDPG
│ └─ Maximum entropy exploration → SAC
├─ Learning from Offline Data?
│ ├─ No online fine-tuning → IQL or CQL
│ ├─ Hybrid online/offline → Cal-QL
│ ├─ Need conservatism → CQL or EDAC
│ └─ Simple continuous control → TD3+BC or AWR
├─ Aligning Language Models?
│ ├─ No reward model available → DPO or SimPO
│ ├─ Have reward model → PPO (RLHF)
│ ├─ Group-based preferences → GRPO
│   ├─ Binary (unpaired) feedback → KTO
│ └─ Self-improvement → SPIN
├─ Multi-Agent Coordination?
│ ├─ Cooperative tasks → MAPPO or QMIX
│ ├─ Competitive tasks → MADDPG
│ └─ Complex value decomposition → QPLEX
├─ Need Sample Efficiency?
│ ├─ Learn world model → DreamerV3
│ ├─ Model-predictive control → TD-MPC2
│ └─ Hybrid approach → MBPO
├─ Sparse Rewards Problem?
│ ├─ Need curiosity → ICM or RND
│ ├─ Hard exploration → Go-Explore
│ └─ Need reward shaping → Reward Modeling
├─ Sequential Decision Making?
│ ├─ Offline dataset → Decision Transformer
│ ├─ Adaptive horizon → Elastic DT
│ └─ Online learning → Online DT
└─ Game Playing with Perfect Info?
├─ Need search → MCTS
├─ Self-play → AlphaZero
└─ Process-based → PRM Agent
| Game Type | Best Method | Alternative | Notes |
|---|---|---|---|
| Atari | Rainbow DQN | PPO | Discrete actions, pixel input |
| Board Games | AlphaZero | MCTS + NN | Perfect information |
| Multi-Agent Games | MAPPO | QMIX | Coordination required |
| Sparse Reward Games | Go-Explore | RND/ICM | Hard exploration |
| Task | Best Method | Alternative | Sample Efficiency |
|---|---|---|---|
| Locomotion | SAC/TD3 | PPO | High |
| Manipulation | DreamerV3 | TD-MPC2 | Very High |
| From Demos | IQL | AWR | Offline only |
| Sim-to-Real | MBPO | SAC | High |
| Scenario | Method | Complexity | Preference Type |
|---|---|---|---|
| Standard RLHF | PPO + RM | High | Pairwise |
| No Reward Model | DPO | Low | Pairwise |
| Simple Preferences | SimPO | Low | Pairwise |
| Group Preferences | GRPO | Medium | Group |
| Binary Feedback | KTO | Low | Thumbs up/down |
| Self-Improvement | SPIN | Medium | Self-generated |
| Cooperation Level | Agents | Method | Communication |
|---|---|---|---|
| Fully Cooperative | 2-20 | MAPPO | Centralized training |
| Value Decomposition | 2-50 | QMIX/QPLEX | Implicit |
| Mixed Cooperative | 2-10 | MADDPG | Explicit |
| Competitive | 2-10 | Self-Play PPO | None |
| Method | Year | Complexity | Use Case | Sample Efficiency |
|---|---|---|---|---|
| DQN | 2013 | Low | Discrete actions | Medium |
| Double DQN | 2015 | Low | Reduce overestimation | Medium |
| Dueling DQN | 2016 | Medium | State-dependent value | Medium |
| Rainbow | 2017 | High | State-of-the-art DQN | Medium |
| C51 | 2017 | Medium | Distributional RL | Medium |
| QR-DQN | 2017 | Medium | Quantile regression | Medium |
| Method | Year | Complexity | Use Case | Sample Efficiency |
|---|---|---|---|---|
| REINFORCE | 1992 | Low | Simple baseline | Low |
| A2C | 2016 | Low | Baseline actor-critic | Low |
| PPO | 2017 | Medium | General purpose (gold standard) | Medium |
| DDPG | 2015 | Medium | Continuous control | Medium |
| TD3 | 2018 | Medium | Stable continuous control | High |
| SAC | 2018 | Medium | Sample-efficient continuous | High |
| TRPO | 2015 | High | Safe policy updates | Medium |
| Method | Year | Complexity | Use Case | Dataset Quality |
|---|---|---|---|---|
| IQL | 2021 | Medium | General offline RL | Any |
| CQL | 2020 | Medium | Conservative learning | Any |
| Cal-QL | 2023 | Medium | Calibrated Q-learning | Any |
| IDQL | 2023 | Medium | Diffusion policy extraction | Any |
| ReBRAC | 2023 | Medium | Regularized actor-critic | Medium-High |
| EDAC | 2021 | High | Ensemble diversification | Medium-High |
| TD3+BC | 2021 | Low | Simple offline baseline | Medium-High |
| AWR | 2019 | Low | Advantage-weighted | Medium-High |
| Method | Year | Complexity | Use Case | Requires RM |
|---|---|---|---|---|
| DPO | 2023 | Low | Direct preference opt | No |
| GRPO | 2024 | Medium | Group preferences | No |
| KTO | 2024 | Low | Binary feedback | No |
| SimPO | 2024 | Low | Simple preferences | No |
| ORPO | 2024 | Low | Odds ratio preferences | No |
| IPO | 2023 | Low | Identity preferences | No |
| SPIN | 2024 | Medium | Self-play fine-tuning | No |
| RLOO | 2024 | Medium | REINFORCE leave-one-out baseline | Yes |
| ReMax | 2024 | Medium | Reward maximization | Yes |
| RLVR | 2024 | Medium | Verifiable rewards | No (programmatic verifier) |
| Method | Year | Complexity | Use Case | Scalability |
|---|---|---|---|---|
| MAPPO | 2021 | Medium | Cooperative tasks | 10-100 agents |
| QMIX | 2018 | Medium | Value decomposition | 5-20 agents |
| WQMIX | 2020 | Medium | Weighted QMIX | 5-20 agents |
| MADDPG | 2017 | High | Competitive/mixed | 2-10 agents |
| QPLEX | 2020 | High | Complex decomposition | 5-20 agents |
| Method | Year | Complexity | Use Case | Sample Efficiency |
|---|---|---|---|---|
| DreamerV3 | 2023 | High | General RL | Very High |
| TD-MPC2 | 2024 | High | Model-predictive control | Very High |
| MBPO | 2019 | Medium | Hybrid model-based | High |
| Method | Year | Complexity | Use Case | Exploration Type |
|---|---|---|---|---|
| ICM | 2017 | Medium | Curiosity-driven | Intrinsic motivation |
| RND | 2018 | Medium | Novelty detection | Intrinsic motivation |
| Go-Explore | 2019 | High | Hard exploration | Archive-based |
| Method | Year | Complexity | Use Case | Data Type |
|---|---|---|---|---|
| Decision Transformer | 2021 | Medium | Offline RL | Trajectories |
| Elastic DT | 2023 | Medium | Adaptive horizon | Trajectories |
| Online DT | 2022 | Medium | Online fine-tuning | Mixed |
| Method | Year | Complexity | Use Case | Granularity |
|---|---|---|---|---|
| Enhanced RM | 2023 | Medium | Advanced preferences | Outcome |
| PRM | 2023 | High | Step-by-step rewards | Process |
| ORM | 2023 | Low | Final outcome rewards | Outcome |
| Generative RM | 2024 | High | Generative modeling | Process |
| Method | Year | Complexity | Use Case | Search Depth |
|---|---|---|---|---|
| MCTS | 2006 | Medium | Game playing | Adaptive |
| PRM Agent | 2023 | High | Reasoning tasks | Fixed |
| AlphaZero | 2017 | High | Perfect info games | Adaptive |
All methods are implemented in nexus/models/rl/ with production-grade quality:
from nexus.models.rl import (
# Value-Based
DQN,
DoubleDQN,
DuelingDQN,
RainbowDQN,
C51,
QRDQN,
# Policy Gradient
REINFORCE,
A2C,
PPO,
DDPG,
TD3,
SAC,
TRPO,
# Offline RL
IQL,
CQL,
CalQL,
IDQL,
ReBRAC,
EDAC,
TD3BC,
AWR,
# Alignment
DPO,
GRPO,
KTO,
SimPO,
ORPO,
IPO,
SPIN,
RLOO,
ReMax,
RLVR,
# Multi-Agent
MAPPO,
QMIX,
WQMIX,
MADDPG,
QPLEX,
# Model-Based
DreamerV3,
TDMPC2,
MBPO,
# Exploration
ICM,
RND,
GoExplore,
# Sequence-Based
DecisionTransformer,
ElasticDT,
OnlineDT,
# Reward Modeling
EnhancedRewardModel,
ProcessRewardModel,
OutcomeRewardModel,
GenerativeRM,
# Planning
MCTS,
PRMAgent,
AlphaZero,
)

agent = DQN(
state_dim=4,
action_dim=2,
hidden_dim=128,
learning_rate=1e-3,
gamma=0.99,
epsilon_decay=0.995
)
action = agent.select_action(state)
agent.update(state, action, reward, next_state, done)

agent = PPO(
state_dim=17,
action_dim=6,
hidden_dim=256,
learning_rate=3e-4,
gamma=0.99,
gae_lambda=0.95,
clip_epsilon=0.2,
n_epochs=10
)
# Collect rollouts
rollouts = agent.collect_rollouts(env, n_steps=2048)
# Update policy
agent.update(rollouts)

agent = SAC(
state_dim=24,
action_dim=4,
hidden_dim=256,
learning_rate=3e-4,
gamma=0.99,
tau=0.005,
alpha=0.2, # Entropy coefficient
automatic_entropy_tuning=True
)
action = agent.select_action(state, evaluate=False)
agent.update(replay_buffer, batch_size=256)

agent = IQL(
state_dim=11,
action_dim=3,
hidden_dim=256,
learning_rate=3e-4,
tau=0.7, # Expectile for value learning
beta=3.0 # Advantage weighting
)
# Train on offline dataset
agent.train(offline_dataset, n_steps=1000000)

from nexus.models.rl.alignment import DPO
trainer = DPO(
model=base_model,
ref_model=reference_model,
beta=0.1, # KL penalty coefficient
learning_rate=5e-7,
max_length=512
)
# Train on preference pairs
trainer.train(
preference_dataset, # (prompt, chosen, rejected) tuples
num_epochs=3
)

agent = MAPPO(
n_agents=5,
state_dim=20,
action_dim=5,
hidden_dim=128,
share_policy=True, # Parameter sharing
use_centralized_critic=True
)
actions = agent.select_actions(observations)
agent.update(multi_agent_buffer)

Sample efficiency (relative to PPO = 1.0x, higher is better):

PPO: ████████████ 1.0x (baseline)
DQN: ████████ 0.7x
Rainbow: ████████████████ 1.3x
SAC: ████████████████████ 1.6x
TD3: ██████████████████ 1.5x
DreamerV3: ████████████████████████████ 2.3x
TD-MPC2: ████████████████████████ 2.0x
IQL (offline): ████████████████████████████████ 2.7x
Training speed (wall-clock):

REINFORCE: ████████████████████████ Fast
A2C: ████████████████████ Fast
PPO: ████████████████ Medium-Fast
SAC/TD3: ████████████ Medium
DQN/Rainbow: ██████████ Medium
DreamerV3: ██████ Slow
AlphaZero: ███ Very Slow
Training stability (longer bar = less stable):

SAC: ██ Very Stable
TD3: ███ Very Stable
PPO: ████ Stable
TRPO: ████ Stable
DQN: ██████ Moderate
DDPG: ████████████ Unstable
REINFORCE: ████████████████ Very Unstable
Memory footprint:

REINFORCE: ██ Minimal
DQN: ████████ Medium (replay buffer)
PPO: ████ Small (rollout buffer)
SAC/TD3: ████████ Medium (replay buffer)
Rainbow: ████████████ Large (replay + extras)
DreamerV3: ████████████████ Very Large (world model)
AlphaZero: ████████████████████ Extreme (MCTS + NN)
- 1992: REINFORCE (Williams) - First policy gradient
- 2013: DQN (Mnih et al.) - Deep RL breakthrough
- 2015: DDPG (Lillicrap et al.) - Continuous control
- 2015: TRPO (Schulman et al.) - Trust region optimization
- 2016: A3C (Mnih et al.) - Asynchronous methods
- 2017: PPO (Schulman et al.) - Industry standard
- 2017: Rainbow (Hessel et al.) - DQN improvements combined
- 2017: AlphaZero (Silver et al.) - Self-play mastery
- 2018: SAC (Haarnoja et al.) - Maximum entropy RL
- 2018: TD3 (Fujimoto et al.) - Twin critics
- 2018: QMIX (Rashid et al.) - Multi-agent value decomposition
- 2019: MBPO (Janner et al.) - Model-based policy optimization
- 2019: Go-Explore (Ecoffet et al.) - Archive exploration
- 2020: CQL (Kumar et al.) - Conservative offline RL
- 2021: IQL (Kostrikov et al.) - Implicit Q-learning
- 2021: Decision Transformer (Chen et al.) - Sequence modeling for RL
- 2021: MAPPO (Yu et al.) - Multi-agent PPO
- 2023: DPO (Rafailov et al.) - Direct preference optimization
- 2023: DreamerV3 (Hafner et al.) - General world models
- 2023: ReBRAC (Tarasov et al.) - Regularized offline RL
- 2024: GRPO, KTO, SimPO - Next-gen alignment methods
- 2024: TD-MPC2 (Hansen et al.) - Scalable model predictive control
- 2023: DPO - Eliminates reward model
- 2024: SimPO - Simplifies DPO further
- 2024: GRPO - Group-based optimization
- 2024: KTO - Kahneman-Tversky for preferences
- 2024: ORPO - Odds ratio preferences
- 2024: SPIN - Self-play improvement
SAC or TD3 (sample-efficient)
+ Replay Buffer (1M transitions)
+ Target Networks (stabilization)
+ Automatic entropy tuning (SAC)
Used by: Robotics, autonomous systems, continuous control. Sample efficiency: High. Stability: Very high.
Phase 1: IQL or CQL on offline data
Phase 2: Fine-tune with SAC or PPO online
+ Hybrid replay buffer (offline + online)
Used by: Robotics with demonstrations, real-world RL. Sample efficiency: Very high. Risk: Low (starts from safe policy).
Phase 1: Supervised fine-tuning (SFT)
Phase 2: Reward modeling from preferences
Phase 3: PPO optimization with KL penalty
+ Reference model (KL constraint)
+ Value head (critic)
Used by: ChatGPT, Claude, Llama. Complexity: High. Quality: Very high.
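For reference, a hedged sketch of the per-token reward typically fed to PPO in Phase 3: a KL penalty toward the frozen reference model at every token, plus the reward-model score added on the final token of the response (coefficient value and names are assumptions):

```python
import torch

def rlhf_token_rewards(rm_score, policy_logprobs, ref_logprobs, kl_coef=0.05):
    """Per-token reward signal for PPO-based RLHF. Inputs are (response_len,) tensors
    of token log-probabilities plus a scalar reward-model score for the sequence."""
    with torch.no_grad():
        rewards = -kl_coef * (policy_logprobs - ref_logprobs)  # per-token KL penalty
        rewards[-1] += rm_score                                 # sequence-level RM score
    return rewards
```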
DPO, SimPO, or GRPO (no reward model)
+ Reference model (implicit reward)
+ Preference dataset (chosen vs rejected)
Used by: Zephyr, Starling, modern open models. Complexity: Low. Efficiency: High.
MAPPO or QMIX
+ Centralized training, decentralized execution
+ Value decomposition (QMIX) or shared critic (MAPPO)
+ Communication channels (optional)
Used by: Multi-robot systems, traffic control, SMAC. Scalability: 5-100 agents. Coordination: Excellent.
DreamerV3 or TD-MPC2
+ World model (learns dynamics)
+ Imagination rollouts (planning in latent space)
+ Actor-critic in world model
Used by: Sample-limited domains, sim-to-real. Sample efficiency: Extreme. Computational cost: High.
PPO or SAC base algorithm
+ Intrinsic rewards (ICM or RND)
+ Extrinsic + intrinsic reward combination
Used by: Sparse reward environments, hard exploration. Exploration: Excellent. Overhead: 20-30% slower.
| Algorithm | HalfCheetah | Walker2d | Ant | Humanoid |
|---|---|---|---|---|
| PPO | 1800 | 3000 | 3500 | 5000 |
| SAC | 9500 | 4500 | 5200 | 6800 |
| TD3 | 9200 | 4200 | 4800 | 6500 |
| DDPG | 8500 | 3800 | 3900 | 5200 |
| DreamerV3 | 10500 | 5000 | 6500 | 7200 |
| TD-MPC2 | 10200 | 4900 | 6200 | 7000 |
| Algorithm | Median Human-Normalized Score | Games > Human |
|---|---|---|
| DQN | 121% | 23/57 |
| Rainbow | 223% | 41/57 |
| PPO | 178% | 35/57 |
| IQN | 254% | 44/57 |
| MuZero | 731% | 52/57 |
| Algorithm | Halfcheetah-M | Walker2d-M | Hopper-M | Antmaze-U |
|---|---|---|---|---|
| BC | 42.1 | 75.0 | 52.5 | 54.6 |
| CQL | 44.4 | 79.2 | 58.0 | 74.0 |
| IQL | 47.4 | 82.6 | 66.3 | 87.5 |
| Cal-QL | 48.8 | 84.3 | 68.1 | 89.2 |
| ReBRAC | 49.2 | 85.1 | 69.4 | 90.8 |
| TD3+BC | 48.3 | 83.7 | 59.3 | 78.6 |
| Algorithm | 3s5z | 2c_vs_64zg | corridor | 6h_vs_8z |
|---|---|---|---|---|
| QMIX | 93% | 82% | 88% | 75% |
| WQMIX | 96% | 87% | 91% | 81% |
| QPLEX | 97% | 89% | 93% | 84% |
| MAPPO | 98% | 91% | 95% | 87% |
| Method | Llama 2 7B Base | Llama 2 13B Base | Mistral 7B Base |
|---|---|---|---|
| SFT Only | 12.4% | 16.8% | 19.2% |
| DPO | 18.7% | 24.3% | 28.9% |
| PPO (RLHF) | 21.3% | 27.8% | 32.1% |
| SimPO | 19.8% | 25.9% | 30.5% |
| GRPO | 20.9% | 26.7% | 31.4% |
- Policy Gradients
  - Williams (1992) - "Simple Statistical Gradient-Following Algorithms for Connectionist RL"
  - Sutton et al. (1999) - "Policy Gradient Methods for RL with Function Approximation"
- Value-Based Methods
  - Mnih et al. (2013) - "Playing Atari with Deep RL" (DQN)
  - van Hasselt et al. (2015) - "Deep RL with Double Q-learning"
  - Wang et al. (2016) - "Dueling Network Architectures"
  - Hessel et al. (2017) - "Rainbow: Combining Improvements in Deep RL"
  - Bellemare et al. (2017) - "A Distributional Perspective on RL" (C51)
- Policy Gradient Methods
  - Lillicrap et al. (2015) - "Continuous control with deep RL" (DDPG)
  - Schulman et al. (2015) - "Trust Region Policy Optimization" (TRPO)
  - Mnih et al. (2016) - "Asynchronous Methods for Deep RL" (A3C)
  - Schulman et al. (2017) - "Proximal Policy Optimization" (PPO)
  - Haarnoja et al. (2018) - "Soft Actor-Critic" (SAC)
  - Fujimoto et al. (2018) - "Addressing Function Approximation Error" (TD3)
- Offline Methods
  - Kumar et al. (2020) - "Conservative Q-Learning" (CQL)
  - Kostrikov et al. (2021) - "Offline RL with Implicit Q-Learning" (IQL)
  - Fujimoto & Gu (2021) - "A Minimalist Approach to Offline RL" (TD3+BC)
  - Nakamoto et al. (2023) - "Cal-QL: Calibrated Offline RL"
  - Tarasov et al. (2023) - "Revisiting the Minimalist Approach" (ReBRAC)
- Multi-Agent RL
  - Rashid et al. (2018) - "QMIX: Monotonic Value Function Factorisation"
  - Lowe et al. (2017) - "Multi-Agent Actor-Critic" (MADDPG)
  - Yu et al. (2021) - "The Surprising Effectiveness of PPO in Cooperative Multi-Agent Games" (MAPPO)
  - Wang et al. (2020) - "QPLEX: Duplex Dueling Multi-Agent Q-Learning"
- Model-Based RL
  - Janner et al. (2019) - "When to Trust Your Model" (MBPO)
  - Hafner et al. (2023) - "Mastering Diverse Domains through World Models" (DreamerV3)
  - Hansen et al. (2024) - "TD-MPC2: Scalable, Robust World Models for Continuous Control"
- Alignment Methods
  - Ouyang et al. (2022) - "Training language models to follow instructions" (InstructGPT/RLHF)
  - Rafailov et al. (2023) - "Direct Preference Optimization" (DPO)
  - Ahmadian et al. (2024) - "Back to Basics: Revisiting REINFORCE-Style Optimization for RLHF" (RLOO)
  - Hong et al. (2024) - "ORPO: Monolithic Preference Optimization without Reference Model"
  - Shao et al. (2024) - "DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models" (GRPO)
  - Ethayarajh et al. (2024) - "KTO: Model Alignment as Prospect Theoretic Optimization"
  - Meng et al. (2024) - "SimPO: Simple Preference Optimization"
- Exploration Methods
  - Pathak et al. (2017) - "Curiosity-driven Exploration" (ICM)
  - Burda et al. (2018) - "Exploration by Random Network Distillation" (RND)
  - Ecoffet et al. (2019) - "Go-Explore: a New Approach for Hard-Exploration"
- Planning & Search
  - Silver et al. (2017) - "Mastering Chess and Shogi by Self-Play with a General RL Algorithm" (AlphaZero)
  - Lightman et al. (2023) - "Let's Verify Step by Step" (Process Reward Models)
- Sequence Modeling
  - Chen et al. (2021) - "Decision Transformer: RL via Sequence Modeling"
  - Zheng et al. (2022) - "Online Decision Transformer"
- Start here: DQN for discrete actions
- Then try: PPO for continuous control
- Resources: Sutton & Barto "RL: An Introduction" (2nd ed.)
- Discrete actions: Rainbow DQN
- Perfect information: AlphaZero
- Multi-agent: MAPPO or QMIX
- World models: DreamerV3
- Transformers: Decision Transformer
- Exploration: RND or Go-Explore
See implementation files in Nexus/nexus/models/rl/
Each method includes:
- Clean PyTorch implementation
- Comprehensive docstrings
- Replay buffers and data structures
- Training loops and evaluation
- Compatible interfaces for easy swapping
- Support for various environments (Gym, MuJoCo, Atari, custom)
- All algorithms follow a common interface: select_action(), update(), save(), load()
- Support for both discrete and continuous action spaces where applicable
- Automatic device handling (CPU/GPU)
- Extensive logging and checkpointing
- Unit tests and integration tests