This directory contains comprehensive documentation for imitation learning algorithms implemented in Nexus. Imitation learning enables agents to learn behaviors by observing expert demonstrations, bypassing the need for explicit reward engineering.
Algorithms covered:
- GAIL (Generative Adversarial Imitation Learning)
- DAgger (Dataset Aggregation)
- MEGA-DAgger
- AIRL (Adversarial Inverse Reinforcement Learning)
Imitation learning addresses a fundamental question: How can an agent learn to perform a task by watching an expert, without knowing the underlying reward function?
This is crucial because:
- Reward engineering is often difficult and time-consuming
- Expert demonstrations are frequently available (human demonstrations, recorded trajectories)
- Many tasks are easier to demonstrate than to specify formally
Behavioral Cloning (BC): Direct supervised learning on expert state-action pairs
- Simple but suffers from distributional shift
- No exploration of states not visited by expert
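As a minimal concrete example of BC as supervised learning, the sketch below fits a linear policy to synthetic expert data with least squares (everything here is hypothetical toy data; real BC would train a neural policy in the same supervised fashion):

```python
import numpy as np

# Hypothetical expert dataset: states (N, state_dim), actions (N, action_dim).
rng = np.random.default_rng(0)
states = rng.normal(size=(500, 4))
true_K = rng.normal(size=(4, 2))   # the expert's (unknown) linear controller
actions = states @ true_K          # expert actions

# BC is plain supervised regression from states to actions.
# Here: least-squares fit of a linear policy pi(s) = s @ K_hat.
K_hat, *_ = np.linalg.lstsq(states, actions, rcond=None)

# Low error on expert-visited states...
train_err = np.abs(states @ K_hat - actions).mean()
# ...but nothing constrains the policy on states the expert never visited,
# which is exactly where distributional shift bites.
```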
Inverse Reinforcement Learning (IRL): Learn the reward function that explains expert behavior
- Recovers underlying objectives
- Computationally expensive (requires solving RL in inner loop)
Adversarial Imitation: Use discriminators to distinguish expert from policy behavior
- Avoids explicit reward learning
- More sample efficient than IRL
- Combines benefits of BC and IRL
We recommend studying these algorithms in the following order:
1. DAgger
   - File: dagger.md
   - Difficulty: Beginner-Intermediate
   - Key Concepts: Interactive learning, covariate shift, expert queries
   - Use Case: Learning from imperfect demonstrations with expert feedback
2. GAIL
   - File: gail.md
   - Difficulty: Intermediate
   - Key Concepts: Adversarial training, discriminator rewards, GAN stability
   - Use Case: Learning complex behaviors from expert demonstrations
3. AIRL
   - File: airl.md
   - Difficulty: Advanced
   - Key Concepts: Reward function recovery, disentangling dynamics, transfer learning
   - Use Case: When you need interpretable rewards or transfer to new environments
4. MEGA-DAgger
   - File: mega_dagger.md
   - Difficulty: Advanced
   - Key Concepts: Model-based learning, world models, safety-aware exploration
   - Use Case: Safety-critical domains with limited expert interaction
| Algorithm | Paradigm | Expert Queries | Reward Recovery | Sample Efficiency | Complexity |
|---|---|---|---|---|---|
| DAgger | Interactive BC | Required | ❌ | High (with expert) | Low |
| GAIL | Adversarial | Not Required | ❌ | Medium | Medium |
| AIRL | Adversarial IRL | Not Required | ✅ | Medium | High |
| MEGA-DAgger | Model-Based | Minimal | ❌ | Very High | Very High |
Sample efficiency (most to least efficient):
1. MEGA-DAgger: Uses a learned world model for planning, minimizing expert queries
2. DAgger: Direct expert queries reduce compounding errors
3. GAIL/AIRL: Require many environment interactions to train policy and discriminator
Computational cost (least to most expensive):
1. DAgger: Simple supervised learning with occasional expert queries
2. GAIL: Policy optimization + discriminator training
3. AIRL: Additional reward function learning and disentanglement
4. MEGA-DAgger: World model learning + planning + expert queries
Expert access requirements (least to most demanding):
1. GAIL: Works with a fixed dataset of demonstrations
2. AIRL: Also works with a fixed dataset
3. MEGA-DAgger: Uses a world model to minimize queries
4. DAgger: Requires frequent expert access during training
Use DAgger when:
- You have access to an expert that can provide labels during training
- Covariate shift is a major concern
- You want a simple, interpretable approach
- Computational resources are limited
- Real-time expert feedback is available
Typical Applications:
- Autonomous driving with human supervisor
- Robot manipulation with human corrections
- Game playing with expert annotations
Use GAIL when:
- You have a fixed dataset of expert demonstrations
- Expert access during training is not available
- You want to learn complex, multi-modal behaviors
- Sample efficiency during training is not critical
- You don't need an interpretable reward function
Typical Applications:
- Learning from human gameplay recordings
- Robotics with demonstration datasets
- Character animation from motion capture
Use AIRL when:
- You need to recover interpretable reward functions
- Transfer to new environment dynamics is required
- You want to understand the expert's objectives
- Computational cost is acceptable
- Domain knowledge can inform reward structure
Typical Applications:
- Learning human preferences for alignment
- Transfer learning across robot morphologies
- Understanding expert decision-making
- Multi-task learning with shared rewards
Use MEGA-DAgger when:
- Expert access is expensive or dangerous
- Safety is critical (avoid bad states)
- Sample efficiency is paramount
- You can learn accurate world models
- Planning in model space is feasible
Typical Applications:
- High-stakes medical procedures
- Autonomous vehicles (minimize dangerous situations)
- Expensive robotic systems
- Space exploration
Distributional shift is the fundamental challenge in imitation learning: the learner's state distribution differs from the expert's.
Expert: s₀ → s₁ → s₂ → s₃ (expert states)
Learner: s₀ → s₁' → s₂'' → s₃''' (different states due to errors)
Solutions:
- DAgger: Query expert on learner's states
- GAIL/AIRL: Train policy to match expert distribution
- MEGA-DAgger: Use world model to avoid bad states
GAIL and AIRL train a discriminator D(s,a) to classify state-action pairs:
- D(s,a) → 1 for expert demonstrations
- D(s,a) → 0 for policy rollouts
The discriminator's output provides a reward signal:
r(s,a) = -log(1 - D(s,a)) # GAIL
r(s,a) = log(D(s,a)) - log(1 - D(s,a)) # AIRL
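To make the two reward transforms concrete, here is a minimal numpy sketch (the clipping constant is an added numerical-stability guard, not part of the formulas; function names are illustrative):

```python
import numpy as np

def gail_reward(d, eps=1e-8):
    """GAIL reward r(s,a) = -log(1 - D(s,a)); large when the
    discriminator thinks the pair looks expert-like (D -> 1)."""
    d = np.clip(d, eps, 1.0 - eps)  # avoid log(0)
    return -np.log(1.0 - d)

def airl_reward(d, eps=1e-8):
    """AIRL reward r(s,a) = log D(s,a) - log(1 - D(s,a)), i.e. the
    discriminator's logit; exactly zero when D = 0.5 (undecided)."""
    d = np.clip(d, eps, 1.0 - eps)
    return np.log(d) - np.log(1.0 - d)
```

Both rewards grow as the discriminator becomes more convinced the pair is expert-like, which is what drives the policy toward the expert distribution.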
MEGA-DAgger learns a world model M(s,a) → s' to:
- Simulate trajectories without environment interaction
- Plan using learned dynamics
- Identify states where expert input is needed
- Train policy in imagination
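The imagination idea can be sketched in a few lines: roll the policy forward through the learned model only, with no environment steps (`model` and `policy` below are hypothetical callables standing in for the learned components):

```python
import numpy as np

def imagine_rollout(model, policy, s0, horizon=10):
    """Roll a policy forward inside a learned world model M(s, a) -> s',
    with no environment interaction -- 'training in imagination'."""
    states, actions = [s0], []
    s = s0
    for _ in range(horizon):
        a = policy(s)
        s = model(s, a)        # imagined transition, not a real env step
        actions.append(a)
        states.append(s)
    return np.array(states), np.array(actions)
```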
- Expert Quality: Ensure demonstrations are truly expert-level
  - Suboptimal demonstrations hurt all methods
  - GAIL/AIRL are particularly sensitive to noisy data
- Diversity: Collect demonstrations from diverse scenarios
  - Cover edge cases and rare events
  - Multiple experts can improve robustness
- Labeling: For DAgger, the expert must label policy-visited states
  - Make the labeling interface efficient
  - Consider active learning for query selection
- GAIL/AIRL: Use techniques from GAN training
  - Gradient penalty for the discriminator
  - Spectral normalization
  - Batch normalization
  - Careful learning rate tuning
- DAgger: Balance dataset mixing
  - β-decay schedule for expert data weighting
  - Don't discard early expert demonstrations
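The β-decay schedule is commonly implemented as an exponential schedule over iterations; a minimal sketch (function names and the mixture rule are illustrative):

```python
import random

def dagger_beta(iteration, beta0=1.0, decay=0.95):
    """Probability of executing the expert's action at a given DAgger
    iteration: beta_i = beta0 * decay**i (exponential schedule)."""
    return beta0 * decay ** iteration

def mixed_action(expert_action, policy_action, beta, rng=random):
    """Mixture policy for rollout collection: with probability beta act
    like the expert, otherwise follow the learner, so early iterations
    stay close to the expert's state distribution."""
    return expert_action if rng.random() < beta else policy_action
```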
- MEGA-DAgger: World model accuracy is critical
  - Use uncertainty estimates
  - Fall back to the expert when the model is uncertain
  - Iteratively improve the model with real data
- Performance Metrics:
  - Task success rate
  - Distance to expert trajectory
  - Environment-specific rewards
- Distributional Metrics:
  - State visitation frequency
  - Action distribution similarity
  - Trajectory diversity
- Sample Efficiency:
  - Number of expert demonstrations needed
  - Number of environment interactions
  - Number of expert queries (DAgger, MEGA-DAgger)
All implementations in Nexus follow a consistent API:
```python
from nexus.models.imitation import GAILAgent, DAggerAgent, AIRLAgent, MEGADAggerAgent

# GAIL example
config = {
    "state_dim": 17,
    "action_dim": 6,
    "hidden_dims": [256, 256],
    "policy_lr": 3e-4,
    "discriminator_lr": 3e-4,
    "use_spectral_norm": True,
}
agent = GAILAgent(config)

# Training loop
for epoch in range(num_epochs):
    # Collect policy rollouts
    policy_batch = collect_rollouts(agent, env)
    # Sample expert demonstrations
    expert_batch = expert_buffer.sample(batch_size)
    # Update discriminator and policy
    metrics = agent.update(policy_batch, expert_batch)

# DAgger example
config = {
    "state_dim": 17,
    "action_dim": 6,
    "hidden_dims": [256, 256],
    "learning_rate": 3e-4,
    "beta_decay": 0.95,
}
agent = DAggerAgent(config)

# Training loop
for epoch in range(num_epochs):
    # Collect policy rollouts
    states, _ = collect_rollouts(agent, env)
    # Query expert for labels on policy-visited states
    expert_actions = expert.label(states)
    # Update policy
    metrics = agent.update(states, expert_actions)
```

Problem: Small errors accumulate over time, leading to state distributions unseen during training.
Solutions:
- DAgger: Query expert on learner-visited states
- GAIL/AIRL: Match state-action distributions via adversarial training
- MEGA-DAgger: Use world model to plan and avoid error-prone states
- All: Add noise to expert demonstrations during training
Problem: Limited demonstrations lead to overfitting and poor generalization.
Solutions:
- Data augmentation (trajectory perturbations)
- Regularization (dropout, weight decay)
- Ensemble methods
- Active learning to request more data where needed
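One simple instance of trajectory perturbation is Gaussian state jitter with the expert's actions kept fixed; a minimal numpy sketch (function name, noise scale, and copy count are illustrative choices):

```python
import numpy as np

def augment_trajectory(states, actions, noise_std=0.01, n_copies=4, seed=0):
    """Trajectory perturbation: jitter recorded states with small Gaussian
    noise while keeping the expert's actions, so the policy trains on (and
    can recover from) near-demonstration states it would otherwise never see."""
    rng = np.random.default_rng(seed)
    aug_s = [states] + [states + rng.normal(0.0, noise_std, states.shape)
                        for _ in range(n_copies)]
    aug_a = [actions] * (n_copies + 1)
    return np.concatenate(aug_s), np.concatenate(aug_a)
```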
Problem: Discriminator and policy training can be unstable, leading to mode collapse or divergence.
Solutions:
- Gradient penalty (WGAN-GP style)
- Spectral normalization on discriminator
- Lower discriminator learning rate
- Multiple discriminator updates per policy update
- Batch normalization
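As one example from this list, spectral normalization can be sketched with power iteration (deep learning frameworks ship their own spectral-norm layers; this numpy version only shows the idea):

```python
import numpy as np

def spectral_normalize(W, n_iters=50):
    """Divide a weight matrix by its largest singular value, estimated
    with power iteration, so the layer is approximately 1-Lipschitz --
    the idea behind spectral norm used to stabilize discriminators."""
    u = np.random.default_rng(0).normal(size=W.shape[0])
    for _ in range(n_iters):
        v = W.T @ u
        v /= np.linalg.norm(v) + 1e-12
        u = W @ v
        u /= np.linalg.norm(u) + 1e-12
    sigma = u @ W @ v  # estimated top singular value
    return W / sigma
```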
Problem: Many reward functions can explain the same behavior.
Solutions:
- Use AIRL if reward recovery is important
- Add reward shaping or prior knowledge
- Constrain reward function space
- Multi-task learning with shared rewards
Problem: Inaccurate world model leads to poor planning and policy learning.
Solutions:
- Ensemble world models for uncertainty
- Use model only where confident
- Mix model-based and model-free updates
- Continuously update model with real data
- Conservative planning with pessimistic models
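The ensemble-uncertainty fallback can be sketched as follows (`models` and `expert` are hypothetical callables; the disagreement threshold is an assumed tuning knob):

```python
import numpy as np

def ensemble_predict(models, s, a):
    """Predict the next state with an ensemble of learned dynamics models;
    disagreement (std across members) is a cheap uncertainty estimate."""
    preds = np.stack([m(s, a) for m in models])   # (n_models, state_dim)
    return preds.mean(axis=0), preds.std(axis=0).max()

def step_or_query(models, s, a, expert, threshold=0.1):
    """Use the world model where it is confident; otherwise fall back to
    querying the expert -- the safety valve described above."""
    mean, disagreement = ensemble_predict(models, s, a)
    if disagreement > threshold:
        return expert(s), True    # (expert action, queried=True)
    return mean, False            # (model prediction, queried=False)
```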
- Behavioral Cloning Basics: A Reduction of Imitation Learning and Structured Prediction to No-Regret Online Learning (Ross et al., AISTATS 2011)
- DAgger: A Reduction of Imitation Learning and Structured Prediction to No-Regret Online Learning (Ross et al., AISTATS 2011)
- GAIL: Generative Adversarial Imitation Learning (Ho & Ermon, NeurIPS 2016)
- AIRL: Learning Robust Rewards with Adversarial Inverse Reinforcement Learning (Fu et al., ICLR 2018)
- MEGA-DAgger: Model-based Generative Adversarial Imitation Learning (Vuong et al., CoRL 2020)
- Imitation Learning Survey: An Algorithmic Perspective on Imitation Learning (Osa et al., 2018)
- Inverse RL Survey: A Survey of Inverse Reinforcement Learning: Techniques, Applications, and Open Problems (Arora & Doshi, 2021)
- Stanford CS237A: Imitation Learning
- Berkeley CS287: Advanced Robotics - Imitation Learning Module
- Imitation Learning Tutorial by Sergey Levine
Libraries:
- imitation: Clean implementations of IL algorithms
- Stable-Baselines3: Includes GAIL
- rlkit: Research codebase with AIRL
Datasets:
- D4RL: Offline RL and IL benchmark datasets
- RoboMimic: Robot manipulation demonstrations
- Atari demonstrations
18_imitation_learning/
├── README.md # This file
├── gail.md # Generative Adversarial Imitation Learning
├── dagger.md # Dataset Aggregation
├── mega_dagger.md # Model-based DAgger
└── airl.md # Adversarial Inverse RL
- New to Imitation Learning? Start with DAgger to understand the distributional shift problem and interactive learning
- Have expert demonstrations? Jump to GAIL for adversarial imitation learning
- Need interpretable rewards? Study AIRL for reward recovery
- Limited expert access? Explore MEGA-DAgger for model-based efficiency
Each algorithm documentation includes:
- Theoretical foundations
- Mathematical formulations
- Implementation details
- Code walkthroughs
- Optimization tricks
- Common pitfalls
- Experimental results
The fundamental challenge in imitation learning is distributional shift: the learner's visited state distribution differs from the expert's.
Why This Happens:
Time t=0: Learner starts at same state as expert ✓
Time t=1: Small error → slightly different state
Time t=2: No training data for this state → larger error
Time t=3: Completely off-distribution → catastrophic failure
Quantitative Analysis:
For a policy π with per-step error ε:
- Behavioral Cloning: Expected error ~ O(T²ε)
- DAgger: Expected error ~ O(Tε)
- Perfect Oracle: Expected error ~ O(ε)
Where T is the time horizon.
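Plugging concrete numbers into these bounds shows the gap (illustrative arithmetic only; the O(·) notation hides constants):

```python
# Per-step error eps = 0.01 over a horizon of T = 100 steps.
eps, T = 0.01, 100

bc_bound = T**2 * eps     # O(T^2 * eps): compounding errors
dagger_bound = T * eps    # O(T * eps):   linear in the horizon
oracle_bound = eps        # O(eps):       per-step error only
```

Under these numbers the BC bound is 100x worse than DAgger's and 10,000x worse than the oracle's.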
Example: Autonomous Driving
Expert trajectory:
Lane center → Lane center → Lane center → Lane center
BC-trained policy:
Lane center → 10cm right → 25cm right → 50cm right → OFF ROAD
The 10cm error at t=1 compounds because:
- Training data only covers "lane center" states
- There are no examples of how to recover from 10cm right
- The policy extrapolates poorly to novel states
- Errors therefore accumulate over the horizon
Solution Approaches:
| Method | Approach | Error Bound |
|---|---|---|
| BC | Train on expert demos only | O(T²ε) |
| BC + Noise | Add state noise to demos | O(T^1.5ε) |
| DAgger | Query expert on learner states | O(Tε) |
| GAIL | Match state-action distributions | O(Tε) |
| SafeDAgger | DAgger with safety constraints | O(Tε) |
Challenge: Expert demonstrations are expensive, potentially inconsistent, and limited in quantity.
Expert Quality Spectrum:
1. Optimal Expert (Rare):
   - Perfectly solves the task
   - Consistent across demonstrations
   - Examples: Optimal game-playing AI, physics simulators
2. Near-Optimal Expert (Common):
   - Very high performance (>95% optimal)
   - Mostly consistent
   - Examples: Professional human demonstrators, strong RL policies
3. Imperfect Expert (Most Common):
   - Good but suboptimal (70-90% optimal)
   - Some inconsistencies
   - Examples: Average human demonstrators, heuristic policies
4. Mixed-Quality Experts (Realistic):
   - Multiple experts with varying quality
   - Disagreement on edge cases
   - Examples: Crowdsourced demonstrations, multiple humans
Algorithm Robustness to Expert Quality:
| Algorithm | Optimal Expert | Noisy Expert | Multiple Experts |
|---|---|---|---|
| BC | ✓✓✓ | ✗ | ✓ (averaging) |
| DAgger | ✓✓✓ | ✓ | ✓ (single query) |
| GAIL | ✓✓✓ | ✓✓ | ✓✓ |
| AIRL | ✓✓✓ | ✓✓ | ✓✓ |
| MEGA-DAgger | ✓✓ | ✓✓✓ | ✓✓✓ |
The Trade-off: Expert demonstrations are expensive; environment interactions are cheap (usually).
Data Requirements by Method:
Behavioral Cloning:
- Expert demos needed: 100-1000 trajectories
- Environment interactions: 0 (offline)
- Expert queries during training: 0
- Total cost: High (many expert demos)
DAgger:
- Expert demos needed: 10-50 trajectories (initial)
- Environment interactions: 10K-100K steps
- Expert queries during training: 10K-100K labels
- Total cost: Very high (continuous expert access)
GAIL:
- Expert demos needed: 4-50 trajectories
- Environment interactions: 1M-10M steps
- Expert queries during training: 0
- Total cost: Medium (demos only, but many env steps)
AIRL:
- Expert demos needed: 4-50 trajectories
- Environment interactions: 1M-10M steps
- Expert queries during training: 0
- Total cost: Medium-high (like GAIL but more computation)
MEGA-DAgger:
- Expert demos needed: 10-50 trajectories
- Environment interactions: 50K-500K steps
- Expert queries during training: 1K-10K labels
- Total cost: Medium (less expert access than DAgger)
Question: Should we learn the reward function or just the policy?
Policy-Only Approaches (BC, DAgger, GAIL):
- Pros: Simpler, faster training
- Cons: No interpretability, no transfer, task-specific
Reward Recovery Approaches (IRL, AIRL):
- Pros: Interpretable, transferable, reusable
- Cons: More complex, slower training, identifiability issues
When You Need Reward Recovery:
1. Transfer Learning: Apply learned behavior to new environment dynamics
   - Example: Robot policy trained on one morphology, deployed on another
   - Solution: AIRL learns a dynamics-independent reward
2. Multi-Task Learning: Share learned objectives across tasks
   - Example: "Reach goal" reward for multiple navigation tasks
   - Solution: Learn the reward once, reuse it for new goals
3. Interpretability: Understand what the expert is optimizing
   - Example: Human preference learning for AI alignment
   - Solution: The recovered reward function shows human values
4. Debugging: Diagnose policy failures
   - Example: Policy fails in a corner case
   - Solution: Check the reward function to understand intended behavior
When Policy-Only is Sufficient:
- Fixed environment (no transfer needed)
- Single task (no multi-task learning)
- Black-box acceptable (no interpretability needed)
- Speed critical (reward learning too slow)
Modern imitation learning often combines multiple paradigms:
1. DAgger + GAIL:
- Use DAgger for initial policy learning
- Fine-tune with GAIL to smooth out distributional mismatch
- Best of both: DAgger's sample efficiency + GAIL's robustness
2. BC Pretraining + RL Fine-tuning:
- Pretrain policy with behavioral cloning
- Fine-tune with RL (if reward available)
- Accelerates RL training significantly
3. GAIL + Reward Shaping:
- Use GAIL discriminator as learned reward
- Add hand-crafted reward shaping for known objectives
- Combines learned and engineered rewards
4. Multi-Modal GAIL:
- Learn multiple skills from diverse demonstrations
- Use latent variable models (VAE, InfoGAIL)
- Captures multi-modal expert behavior
Imitation Learning ↔ Reinforcement Learning:
IL can be viewed as RL with specific reward structures:
| IL Method | Equivalent RL Reward |
|---|---|
| BC | r(s,a) = −‖a − π_E(s)‖² (match the expert's action) |
| GAIL | r(s,a) = log D(s,a) |
| AIRL | r(s,a) = learned reward function |
| DAgger | r(s,a) = expert agreement |
Imitation Learning ↔ Optimal Control:
For linear-quadratic systems:
- BC is supervised learning of linear controller
- IRL recovers quadratic cost function
- GAIL matches state-action distributions
Imitation Learning ↔ Generative Modeling:
GAIL connections to GANs:
- Expert data = real data distribution
- Policy rollouts = generated data
- Discriminator = distribution matcher
- Training = adversarial game
1. Sample-Efficient Imitation:
- Few-shot imitation learning
- One-shot imitation from single demo
- Meta-learning for quick adaptation
2. Interactive Imitation:
- Learning from human feedback (RLHF)
- Active querying strategies
- Uncertainty-guided expert queries
3. Safe Imitation Learning:
- Safety constraints during learning
- Avoiding catastrophic states
- Conservative policy updates
4. Hierarchical Imitation:
- Learning skills and composition
- Temporal abstraction
- Options and primitives
5. Imitation from Observations:
- Learn from state-only demonstrations (no actions)
- Third-person imitation
- Video demonstrations
6. Multi-Modal Imitation:
- Learning from diverse demonstrations
- Handling multi-modal expert behavior
- Skill discovery and clustering
Decision Tree:
Do you have expert access during training?
├─ YES
│ └─ Is expert consistent and high-quality?
│ ├─ YES → DAgger
│ └─ NO → MEGA-DAgger (handles multiple/imperfect experts)
│
└─ NO (fixed demonstrations only)
└─ Do you need interpretable rewards or transfer?
├─ YES → AIRL
└─ NO → GAIL
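The decision tree above can be written as a small function (illustrative only; real choices also weigh compute budgets and safety constraints):

```python
def choose_algorithm(expert_available, expert_high_quality=False,
                     need_reward_recovery=False):
    """Encode the algorithm-selection decision tree as a function."""
    if expert_available:
        # Online expert: DAgger if consistent and high-quality,
        # otherwise MEGA-DAgger (handles multiple/imperfect experts).
        return "DAgger" if expert_high_quality else "MEGA-DAgger"
    # Fixed demonstrations only: AIRL if interpretable rewards or
    # transfer are needed, otherwise GAIL.
    return "AIRL" if need_reward_recovery else "GAIL"
```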
Budget Considerations:
| Budget Constraint | Recommended Algorithm |
|---|---|
| Limited expert demos (<10) | GAIL or AIRL (maximize from few demos) |
| Limited environment interactions | DAgger (sample efficient with expert) |
| Limited computation | DAgger or BC (avoid adversarial training) |
| Limited expert availability | GAIL or AIRL (offline, no queries) |
Before deploying imitation learning:
Data Collection:
- Expert demonstrations are high-quality (>90% success rate)
- Demonstrations cover diverse scenarios
- State-action pairs are properly recorded
- Episode termination is handled correctly
- Data is normalized/preprocessed consistently
Model Selection:
- Algorithm matches your expert access pattern
- Network architecture appropriate for state/action spaces
- Hyperparameters validated on similar tasks
- Baseline comparison available (BC minimum)
Training:
- Validation set for early stopping
- Metrics tracked (loss, performance, distributional distance)
- Checkpointing for best model recovery
- Stability techniques applied (grad clip, normalization)
Evaluation:
- Test on held-out scenarios
- Measure distributional similarity to expert
- Long-horizon rollout evaluation
- Edge case testing
- Robustness to perturbations
1. Mode Collapse (GAIL/AIRL):
- Symptom: Policy learns only one trajectory, ignores diversity
- Diagnosis: Check if expert demos are multi-modal
- Fix: Increase entropy regularization, use InfoGAIL
2. Catastrophic Forgetting (DAgger):
- Symptom: Performance degrades in later iterations
- Diagnosis: Check if old data is being discarded
- Fix: Ensure data aggregation, not replacement
3. Expert Mismatch (All):
- Symptom: Policy performs differently than expert despite low loss
- Diagnosis: State/action spaces don't align
- Fix: Verify observation and action preprocessing
4. Overfitting (BC):
- Symptom: Perfect training loss, poor test performance
- Diagnosis: Too much model capacity, too little data
- Fix: Regularization, data augmentation, early stopping
5. Training Instability (GAIL/AIRL):
- Symptom: Discriminator loss oscillates wildly
- Diagnosis: Adversarial training instability
- Fix: Gradient penalty, spectral norm, careful LR tuning
- CS 285: Deep Reinforcement Learning (Berkeley)
  - Instructor: Sergey Levine
  - URL: http://rail.eecs.berkeley.edu/deeprlcourse/
  - Module on imitation learning (Lectures 2-3)
- CS 330: Deep Multi-Task and Meta Learning (Stanford)
  - Instructor: Chelsea Finn
  - URL: https://cs330.stanford.edu/
  - Covers few-shot imitation learning
- DeepMind x UCL: Deep Learning Lecture Series
  - Guest lecture on imitation learning
  - URL: https://www.youtube.com/deepmind
- Berkeley Robot Learning Lab: https://rll.berkeley.edu/
- Stanford Vision and Learning Lab: https://svl.stanford.edu/
- CMU Robot Learning Lab: https://rll.ri.cmu.edu/
- Google Brain Robotics: https://research.google/teams/brain/robotics/
- DeepMind Control: https://deepmind.com/research/highlighted-research/agents
Standard Environments:
- MuJoCo: Continuous control (locomotion, manipulation)
- Atari: Discrete control (game playing)
- RoboSuite: Robot manipulation
- Meta-World: Multi-task manipulation
- D4RL: Offline RL and IL datasets
Evaluation Metrics:
- Task Success Rate: Binary success/failure
- Cumulative Reward: Sum of rewards over episode
- Expert Performance Gap: (Expert - Policy) / Expert
- Distribution Divergence: KL or JS divergence from expert
- Sample Efficiency: Performance vs. data used
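Two of these metrics are easy to state in code; a minimal numpy sketch (the small epsilon is an added guard against log(0), and the function names are illustrative):

```python
import numpy as np

def performance_gap(expert_return, policy_return):
    """Expert Performance Gap: (Expert - Policy) / Expert."""
    return (expert_return - policy_return) / expert_return

def js_divergence(p, q, eps=1e-12):
    """Jensen-Shannon divergence between two discrete action
    distributions (symmetric, bounded above by log 2)."""
    p = np.asarray(p, dtype=float) + eps
    q = np.asarray(q, dtype=float) + eps
    p, q = p / p.sum(), q / q.sum()
    m = 0.5 * (p + q)
    kl = lambda a, b: np.sum(a * np.log(a / b))
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)
```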
Implementations:
- imitation: https://github.com/HumanCompatibleAI/imitation
  - Clean, maintained implementations of BC, DAgger, GAIL, AIRL
  - Works with Gym environments
  - Good documentation
- Stable-Baselines3: https://github.com/DLR-RM/stable-baselines3
  - Includes GAIL
  - Integrated with PPO/SAC
  - Production-ready
- rlkit: https://github.com/rail-berkeley/rlkit
  - Research codebase from Berkeley
  - GAIL and AIRL implementations
  - Advanced features
- Spinning Up: https://github.com/openai/spinningup
  - Educational implementations
  - Clear code, good tutorials
  - Focus on understanding
Imitation learning provides powerful tools for learning from demonstrations. The key trade-offs are:
| Dimension | BC | DAgger | GAIL | AIRL | MEGA-DAgger |
|---|---|---|---|---|---|
| Sample Efficiency | Low | High | Medium | Medium | Very High |
| Expert Access Needed | Offline | Online | Offline | Offline | Minimal Online |
| Computational Cost | Low | Low | Medium | High | Very High |
| Distributional Shift | Poor | Good | Good | Good | Good |
| Reward Recovery | No | No | No | Yes | No |
| Transfer Learning | No | No | Poor | Good | No |
| Implementation Complexity | Low | Low | Medium | High | Very High |
Recommendations:
- Start simple: Try BC baseline first
- Expert access: Use DAgger if you have it
- Offline demos: Use GAIL for most tasks
- Transfer/interpret: Use AIRL when needed
- Multiple experts: Use MEGA-DAgger for imperfect experts
The field continues to evolve rapidly, with new methods combining the best aspects of these foundational approaches.