# Policy Gradient and Actor-Critic Methods

This directory contains comprehensive documentation for policy gradient and actor-critic methods, which are fundamental approaches in modern reinforcement learning for continuous and discrete control tasks.
Policy gradient methods directly optimize the policy by computing gradients of the expected return with respect to policy parameters. Unlike value-based methods that learn action-value functions, policy gradient methods learn a parameterized policy that can naturally handle:
- Continuous action spaces: Essential for robotics, control systems
- Stochastic policies: Naturally explore and handle partial observability
- High-dimensional action spaces: Scale better than discretization approaches
- Direct policy optimization: No need to derive policy from value function
## Algorithms Covered

- REINFORCE: The foundational Monte Carlo policy gradient algorithm
- A2C (Advantage Actor-Critic): Synchronous advantage-based actor-critic
- PPO (Proximal Policy Optimization): Clipped surrogate objective for stable updates
- DDPG (Deep Deterministic Policy Gradient): Actor-critic for continuous control
- TD3 (Twin Delayed DDPG): Improved DDPG with twin critics and delayed updates
- SAC (Soft Actor-Critic): Maximum entropy RL for sample-efficient learning
- TRPO (Trust Region Policy Optimization): Constrained optimization with guaranteed improvement
## Learning Path

1. Start with REINFORCE (`reinforce.md`)
   - Understand basic policy gradients
   - Learn Monte Carlo returns
   - Grasp variance reduction with baselines
2. Progress to A2C (`a2c.md`)
   - Understand the actor-critic architecture
   - Learn bootstrapping with TD learning
   - Master advantage estimation
3. Study PPO (`ppo.md`)
   - Learn clipped surrogate objectives
   - Understand trust region concepts (simplified)
   - See a production-ready implementation
4. Explore DDPG (`ddpg.md`)
   - Move to continuous action spaces
   - Understand deterministic policy gradients
   - Learn target networks and replay buffers
5. Advance to TD3 (`td3.md`)
   - Master twin critics for reducing overestimation bias
   - Learn delayed policy updates
   - Understand target policy smoothing
6. Study SAC (`sac.md`)
   - Learn maximum entropy RL
   - Understand stochastic policies for continuous actions
   - Master automatic temperature tuning
7. Master TRPO (`trpo.md`)
   - Understand natural policy gradients
   - Learn constrained optimization
   - Study conjugate gradient methods
   - Grasp the theoretical guarantees
## Algorithm Comparison

| Algorithm | Action Space | Key Innovation | Complexity | Sample Efficiency | Stability |
|---|---|---|---|---|---|
| REINFORCE | Both | Monte Carlo PG | Low | Low | Low |
| A2C | Discrete | Advantage estimation | Medium | Medium | Medium |
| PPO | Both | Clipped objective | Medium | Medium | High |
| DDPG | Continuous | Deterministic PG | Medium | Medium | Medium |
| TD3 | Continuous | Twin critics | Medium | High | High |
| SAC | Continuous | Maximum entropy | Medium | Very High | Very High |
| TRPO | Both | Trust region | High | Medium | Very High |
## Key Concepts

### The Policy Gradient Theorem

The foundation of all policy gradient methods:

```
∇_θ J(θ) = E_π[∇_θ log π_θ(a|s) Q^π(s,a)]
```
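To make the theorem concrete, here is a minimal pure-Python sketch (illustrative only, not part of the Nexus codebase; all names are made up): for a softmax policy the score ∇_θ log π_θ(a) has a closed form, and the gradient is estimated by averaging score × return over sampled actions.

```python
import math
import random

def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]

def grad_log_pi(theta, a):
    # For a softmax policy pi(a) = softmax(theta)[a], the score function is
    # d/d theta_i log pi(a) = 1{i == a} - pi(i)
    probs = softmax(theta)
    return [(1.0 if i == a else 0.0) - p for i, p in enumerate(probs)]

def pg_estimate(theta, sample_return, n=1000, seed=0):
    # Monte Carlo estimate of E_pi[grad log pi(a) * Q(a)]
    rng = random.Random(seed)
    probs = softmax(theta)
    grad = [0.0] * len(theta)
    for _ in range(n):
        a = rng.choices(range(len(theta)), weights=probs)[0]
        score = grad_log_pi(theta, a)
        q = sample_return(a, rng)
        for i in range(len(theta)):
            grad[i] += score[i] * q / n
    return grad
```

If action 0 yields higher return than action 1, the estimated gradient pushes `theta[0]` up and `theta[1]` down, exactly what ascent on J(θ) should do.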
### Variance Reduction

- Baseline subtraction: Reduce variance without bias
- Advantage functions: Use A(s,a) = Q(s,a) - V(s)
- GAE (Generalized Advantage Estimation): Bias-variance trade-off
- Entropy regularization: Encourage exploration
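The effect of baseline subtraction can be checked numerically. This illustrative sketch (an assumed two-armed bandit example, not from the Nexus docs) shows that subtracting a baseline near the mean reward leaves the mean of the gradient estimate unchanged while shrinking its variance:

```python
import random

def reinforce_grad_samples(baseline, n=20000, seed=1):
    # Two-armed softmax bandit at uniform probabilities; rewards 10 and 12.
    # Per-sample gradient estimate for theta_0: score * (reward - baseline),
    # where the score for theta_0 given action a is 1{a == 0} - 0.5.
    rng = random.Random(seed)
    rewards = [10.0, 12.0]
    samples = []
    for _ in range(n):
        a = rng.randrange(2)
        score0 = (1.0 if a == 0 else 0.0) - 0.5
        samples.append(score0 * (rewards[a] - baseline))
    return samples

def mean_var(xs):
    m = sum(xs) / len(xs)
    v = sum((x - m) ** 2 for x in xs) / len(xs)
    return m, v
```

With `baseline=0.0` the estimator's variance is large; with `baseline=11.0` (the mean reward) the mean stays the same while the variance collapses, which is why baselines are unbiased variance reducers.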
### Actor-Critic Architecture

- Actor: Policy network π_θ(a|s)
- Critic: Value network V_φ(s) or Q_φ(s,a)
- Advantage: A(s,a) = Q(s,a) - V(s) or TD error
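The TD-error form of the advantage is simple enough to sketch directly (the function name is illustrative, not the Nexus API):

```python
def td_advantage(reward, value_s, value_next, gamma=0.99, done=False):
    # One-step TD error used as an advantage estimate:
    #   delta = r + gamma * V(s') - V(s)
    # Bootstrapping is cut off at episode boundaries (done=True).
    target = reward + (0.0 if done else gamma * value_next)
    return target - value_s
```

The same quantity serves double duty: it is the critic's regression error and the actor's advantage signal.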
### Policies for Continuous Actions

Two main approaches:
- Deterministic policies: DDPG, TD3 (use μ_θ(s))
- Stochastic policies: SAC (use Gaussian π_θ(a|s))
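A minimal sketch of the two action-selection styles, with tanh squashing to keep actions bounded (simplified and illustrative; SAC's log-probability correction for the tanh change of variables is omitted):

```python
import math
import random

def deterministic_action(mu):
    # DDPG/TD3 style: the policy outputs the action directly;
    # exploration noise (not shown) is added externally during training.
    return math.tanh(mu)

def stochastic_action(mu, log_std, rng):
    # SAC style: sample from a Gaussian, then squash with tanh so the
    # action lands in (-1, 1).
    eps = rng.gauss(0.0, 1.0)
    pre_squash = mu + math.exp(log_std) * eps
    return math.tanh(pre_squash)
```

The deterministic form gives one action per state; the stochastic form gives a distribution, which is what makes entropy regularization possible in SAC.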
## Implementations

All implementations are located in `/nexus/models/rl/`:

- `reinforce.py`: REINFORCE
- `a2c.py`: A2C
- `ppo.py`: PPO
- `ddpg.py`: DDPG
- `td3.py`: TD3
- `sac.py`: SAC
- `trpo.py`: TRPO
All implementations follow the Nexus design:

```python
from nexus.models.rl import PPOAgent

config = {
    "state_dim": 8,
    "action_dim": 4,
    "hidden_dim": 256,
    "learning_rate": 3e-4,
    "gamma": 0.99,
}

agent = PPOAgent(config)
action, info = agent.select_action(state)
metrics = agent.update(batch)
```

## When to Use Each Algorithm

### REINFORCE

- Learning about policy gradients
- Simple environments
- Episodic tasks
- Educational purposes
### A2C

- Discrete action spaces
- Need faster learning than REINFORCE
- Want simple actor-critic
- Atari games, discrete control
### PPO

- Need robust, stable training
- Both discrete/continuous actions
- Production deployments
- Robotics, complex control
- Most recommended for general use
### DDPG

- Continuous control tasks
- Deterministic policies
- Physical simulations
- Limited sample budgets (off-policy replay improves sample efficiency)
### TD3

- Continuous control tasks
- Need more stability than DDPG
- Willing to trade complexity for performance
- Robotics, manipulation
### SAC

- Continuous control tasks
- Need best sample efficiency
- Want automatic exploration tuning
- Complex continuous control
- Recommended for continuous control
### TRPO

- Need guaranteed monotonic improvement
- Stability is critical
- Can afford computational cost
- Theoretical guarantees required
- Research on trust regions
## Practical Tips

- Reward scaling: Normalizing rewards usually stabilizes training
- Network initialization: Use small weights for the policy output layer
- Learning rates: Policy and value networks may need different rates
- Gradient clipping: Often essential for stable updates
- Hyperparameter tuning: Critical for good performance
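Two of these tips can be sketched in plain Python (helper names are illustrative, not the Nexus API): an online reward normalizer based on Welford's algorithm, and global-norm gradient clipping:

```python
import math

class RunningNorm:
    # Tracks a running mean and variance (Welford's algorithm) so rewards
    # can be standardized online without storing the full history.
    def __init__(self):
        self.n, self.mean, self.m2 = 0, 0.0, 0.0

    def update(self, x):
        self.n += 1
        d = x - self.mean
        self.mean += d / self.n
        self.m2 += d * (x - self.mean)

    def normalize(self, x):
        std = math.sqrt(self.m2 / max(self.n - 1, 1)) if self.n > 1 else 1.0
        return (x - self.mean) / (std + 1e-8)

def clip_by_global_norm(grads, max_norm):
    # Scales the whole gradient vector down if its L2 norm exceeds max_norm,
    # preserving its direction (unlike per-element clipping).
    norm = math.sqrt(sum(g * g for g in grads))
    if norm <= max_norm:
        return grads
    scale = max_norm / (norm + 1e-8)
    return [g * scale for g in grads]
```

Global-norm clipping is the variant most deep RL libraries use because it rescales rather than distorts the update direction.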
## Algorithm-Specific Pitfalls

- REINFORCE: High variance; needs many episodes
- A2C: Sensitive to hyperparameters
- PPO: Clip range needs tuning
- DDPG: Exploration noise scheduling
- TD3: Policy delay parameter important
- SAC: Temperature tuning critical
- TRPO: Computationally expensive
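The clip range mentioned for PPO enters through the clipped surrogate objective; a minimal per-sample sketch (function name is illustrative):

```python
def ppo_clip_objective(ratio, advantage, clip_eps=0.2):
    # L = min(r * A, clip(r, 1 - eps, 1 + eps) * A)
    # where r = pi_new(a|s) / pi_old(a|s). Taking the min makes the
    # objective pessimistic: large policy ratios stop being rewarded,
    # which discourages destructively big updates.
    clipped_ratio = max(1.0 - clip_eps, min(1.0 + clip_eps, ratio))
    return min(ratio * advantage, clipped_ratio * advantage)
```

Tuning `clip_eps` trades update size against stability: smaller values clip sooner and behave more conservatively.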
## Key Equations

Objective:

```
J(θ) = E_{τ~π_θ}[∑_{t=0}^T γ^t r_t]
```

Policy gradient with baseline:

```
∇_θ J(θ) = E_π[∇_θ log π_θ(a|s) (Q^π(s,a) - b(s))]
```

Actor-critic losses:

```
Critic: minimize (R_t - V_φ(s_t))^2
Actor:  maximize E[log π_θ(a|s) A(s,a)]
```

Generalized Advantage Estimation:

```
A_t = ∑_{l=0}^∞ (γλ)^l δ_{t+l},  where δ_t = r_t + γ V(s_{t+1}) - V(s_t)
```
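The GAE sum is usually computed as a backward recursion over a finite trajectory; a minimal sketch (illustrative, not the Nexus implementation):

```python
def gae(rewards, values, gamma=0.99, lam=0.95):
    # values has len(rewards) + 1 entries: V(s_0) ... V(s_T), where the
    # final entry bootstraps the tail of the trajectory.
    # Backward recursion: A_t = delta_t + gamma * lam * A_{t+1}
    advantages = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        delta = rewards[t] + gamma * values[t + 1] - values[t]
        running = delta + gamma * lam * running
        advantages[t] = running
    return advantages
```

With λ = 1 this reduces to Monte Carlo advantages (high variance, no bias); with λ = 0 it reduces to the one-step TD error (low variance, more bias).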
## Key Papers

- Policy Gradient: Williams (1992) - "Simple Statistical Gradient-Following Algorithms"
- Actor-Critic: Sutton et al. (1999) - "Policy Gradient Methods for RL"
- Natural Gradients: Kakade (2002) - "Natural Policy Gradient"
- A3C/A2C: Mnih et al. (2016) - "Asynchronous Methods for Deep RL"
- TRPO: Schulman et al. (2015) - "Trust Region Policy Optimization"
- PPO: Schulman et al. (2017) - "Proximal Policy Optimization"
- DDPG: Lillicrap et al. (2015) - "Continuous Control with Deep RL"
- TD3: Fujimoto et al. (2018) - "Addressing Function Approximation Error"
- SAC: Haarnoja et al. (2018) - "Soft Actor-Critic"
## Further Reading

- Schulman (2016) - "Optimizing Expectations: From Deep RL to Stochastic Computation Graphs"
- Arulkumaran et al. (2017) - "Deep Reinforcement Learning: A Brief Survey"
- Peters & Schaal (2008) - "Reinforcement Learning of Motor Skills"
- Sutton & Barto (2018) - "Reinforcement Learning: An Introduction" (Chapter 13)
- Szepesvári (2010) - "Algorithms for Reinforcement Learning"
## Courses

- CS 285 (Berkeley) - Deep Reinforcement Learning
- CS 234 (Stanford) - Reinforcement Learning
- DeepMind x UCL - Advanced Deep Learning & RL
## Libraries and Resources

- OpenAI Spinning Up: https://spinningup.openai.com/
- Stable-Baselines3: https://stable-baselines3.readthedocs.io/
- CleanRL: https://github.com/vwxyzjn/cleanrl
## Contributing

When adding new policy gradient algorithms:
- Follow the 10-section documentation structure
- Include mathematical derivations
- Reference Nexus implementations
- Add practical examples
- Document common pitfalls
- Update this README