
🤖 RL Robot Navigation


Deep Reinforcement Learning for 2D Robot Navigation using DQN and PPO.

A mobile robot learns to navigate from a random start to a goal in a grid world with obstacles.

Overview

This project implements and compares two Deep RL algorithms for autonomous navigation:

| Feature | Details |
|---|---|
| Environment | Custom Gymnasium grid world with random obstacles and BFS-verified solvability |
| DQN | Experience replay, target network, linear ε-decay, soft updates |
| PPO | Shared actor-critic, GAE, clipped surrogate objective, entropy bonus |
| Logging | TensorBoard integration for loss, reward, and success rate |
| Visualization | Episode rendering (GIF), Q-value heatmaps, training curves |

Project Structure

rl-robot-navigation/
├── configs/
│   └── default.yaml          # Hyperparameter configuration
├── envs/
│   ├── __init__.py
│   └── grid_nav_env.py       # Gymnasium-compatible grid navigation environment
├── networks/
│   ├── __init__.py
│   └── models.py             # QNetwork, ActorCritic, PolicyNetwork, ValueNetwork
├── agents/
│   ├── __init__.py
│   ├── dqn.py                # DQN agent with replay buffer
│   └── ppo.py                # PPO agent with rollout buffer
├── utils/
│   ├── __init__.py
│   ├── replay_buffer.py      # Fixed-size experience replay (NumPy arrays)
│   ├── rollout_buffer.py     # Trajectory buffer with GAE computation
│   └── visualization.py      # Plotting and animation utilities
├── tests/
│   └── test_env.py           # Unit tests for the environment
├── train.py                  # Main training entry point
├── evaluate.py               # Model evaluation and GIF export
├── requirements.txt
├── setup.py
└── LICENSE

Algorithm Details

Deep Q-Network (DQN)

DQN approximates the optimal action-value function $Q^*(s, a)$ using a neural network.

Bellman optimality target:

$$y_i = r + \gamma \max_{a'} Q(s', a'; \theta^-)$$

Loss function (MSE on TD error):

$$\mathcal{L}(\theta) = \mathbb{E}\left[(y_i - Q(s, a; \theta))^2\right]$$

Key techniques: experience replay buffer, target network (soft update $\tau = 0.01$), linear $\varepsilon$-greedy decay.
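The target computation and soft update above can be sketched as follows. This is a minimal NumPy illustration under assumed inputs (precomputed `next_q_max` values and flat parameter arrays), not the repo's actual `agents/dqn.py` API:

```python
import numpy as np

def dqn_td_targets(rewards, next_q_max, dones, gamma=0.99):
    """Bellman targets y = r + gamma * max_a' Q(s', a'; theta^-), zeroed at terminal states."""
    return rewards + gamma * next_q_max * (1.0 - dones)

def soft_update(online_params, target_params, tau=0.01):
    """Polyak averaging: theta^- <- tau * theta + (1 - tau) * theta^-."""
    return [(1.0 - tau) * tp + tau * p for p, tp in zip(online_params, target_params)]
```

With `tau = 0.01`, the target network tracks the online network slowly, which stabilizes the bootstrapped targets.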

Proximal Policy Optimization (PPO)

PPO optimizes a clipped surrogate objective to ensure stable policy updates.

Clipped surrogate objective:

$$L^{\text{CLIP}}(\theta) = \hat{\mathbb{E}}_t \left[\min\left(r_t(\theta)\hat{A}_t,\; \text{clip}(r_t(\theta), 1-\epsilon, 1+\epsilon)\hat{A}_t\right)\right]$$

where $r_t(\theta) = \frac{\pi_\theta(a_t|s_t)}{\pi_{\theta_\text{old}}(a_t|s_t)}$.
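A minimal sketch of this objective, assuming the probability ratios and advantages are already computed (illustrative only, not the repo's `agents/ppo.py`):

```python
import numpy as np

def ppo_clip_loss(ratio, adv, eps=0.2):
    """Negative clipped surrogate: -E[min(r * A, clip(r, 1-eps, 1+eps) * A)]."""
    unclipped = ratio * adv
    clipped = np.clip(ratio, 1.0 - eps, 1.0 + eps) * adv
    # min() makes the objective pessimistic: large policy changes get no extra credit.
    return -np.mean(np.minimum(unclipped, clipped))
```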

Generalized Advantage Estimation (GAE):

$$\hat{A}_t = \sum_{l=0}^{\infty} (\gamma \lambda)^l \delta_{t+l}, \quad \delta_t = r_t + \gamma V(s_{t+1}) - V(s_t)$$
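In practice the infinite sum is computed as a backward recursion over a finite rollout. A hedged sketch (the repo's `utils/rollout_buffer.py` may differ in details such as bootstrapping):

```python
import numpy as np

def compute_gae(rewards, values, dones, gamma=0.99, lam=0.95):
    """Backward recursion A_t = delta_t + gamma * lam * A_{t+1},
    with delta_t = r_t + gamma * V(s_{t+1}) - V(s_t).
    `values` has length T + 1 (the last entry bootstraps the final state)."""
    T = len(rewards)
    adv = np.zeros(T)
    gae = 0.0
    for t in reversed(range(T)):
        nonterminal = 1.0 - dones[t]
        delta = rewards[t] + gamma * values[t + 1] * nonterminal - values[t]
        gae = delta + gamma * lam * nonterminal * gae
        adv[t] = gae
    return adv
```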

Architecture

┌────────────────────────────────────────────────────────┐
│                     Agent (DQN / PPO)                  │
│  ┌──────────────┐   action   ┌──────────────────────┐  │
│  │ Neural Net   │ ────────►  │ Replay / Rollout     │  │
│  │ (Q / AC)     │ ◄──────── │ Buffer               │  │
│  └──────┬───────┘   batch   └──────────────────────┘  │
│         │ state                                        │
└─────────┼──────────────────────────────────────────────┘
          │ action ▼          ▲ (state, reward, done)
┌─────────┴──────────────────────────────────────────────┐
│                   GridNavEnv                           │
│  ┌─────┬─────┬─────┬─────┐                            │
│  │  S  │     │  #  │     │   S = Start (agent)        │
│  ├─────┼─────┼─────┼─────┤   G = Goal                 │
│  │     │  #  │     │     │   # = Obstacle              │
│  ├─────┼─────┼─────┼─────┤                            │
│  │  #  │     │     │  #  │   Actions: ↑ ↓ ← →         │
│  ├─────┼─────┼─────┼─────┤   Obs: 3-channel flat vec  │
│  │     │     │  #  │  G  │   Rewards: +100, -5, -1    │
│  └─────┴─────┴─────┴─────┘                            │
└────────────────────────────────────────────────────────┘

Installation

git clone https://github.com/Jingchen-Chen/rl-robot-navigation.git
cd rl-robot-navigation
pip install -r requirements.txt

Quick Start

Training

# Train DQN agent (600 episodes)
python train.py --algo dqn

# Train PPO agent (1M timesteps)
python train.py --algo ppo

# Custom settings
python train.py --algo dqn --seed 123 --episodes 2000 --device cuda

Evaluation

# Evaluate trained DQN model
python evaluate.py --algo dqn --model_path checkpoints/best_dqn.pth --render

# Evaluate PPO with trajectory saving
python evaluate.py --algo ppo --model_path checkpoints/best_ppo.pth --save_trajectories

TensorBoard

tensorboard --logdir runs/

Tests

python -m pytest tests/ -v

Environment Details

| Property | Value |
|---|---|
| Grid size | 8×8 (configurable) |
| Obstacle ratio | 15% (configurable) |
| Observation | 3-channel flattened vector (192-dim): obstacle / agent / goal planes |
| Actions | Discrete(4): Up, Down, Left, Right |
| Reward: reach goal | +100 |
| Reward: hit wall/obstacle | −5 + distance shaping |
| Reward: each step | −1 + distance shaping |
| Max steps | 200 |
| Map generation | Random with BFS solvability guarantee (default `fixed_map: false`: a new random map per episode for generalization; set to `true` to reuse a single fixed map across all episodes) |
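The 192-dim observation is 3 planes × 8 × 8. A plausible encoding sketch, assuming binary planes and row-major flattening (not necessarily how `envs/grid_nav_env.py` does it):

```python
import numpy as np

def encode_obs(obstacles, agent_pos, goal_pos):
    """Stack obstacle / agent / goal planes and flatten to a 3*H*W vector."""
    h, w = obstacles.shape
    agent_plane = np.zeros((h, w), dtype=np.float32)
    goal_plane = np.zeros((h, w), dtype=np.float32)
    agent_plane[agent_pos] = 1.0   # one-hot agent position
    goal_plane[goal_pos] = 1.0     # one-hot goal position
    return np.stack([obstacles.astype(np.float32), agent_plane, goal_plane]).ravel()
```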

Configuration

All hyperparameters are in configs/default.yaml. Key settings:

| Parameter | DQN | PPO |
|---|---|---|
| Learning rate | 5e-4 | 3e-4 |
| Discount (γ) | 0.99 | 0.99 |
| Batch size | 64 | 256 |
| Buffer size | 50,000 | 2,048 (rollout) |
| ε-decay steps | 30,000 | — |
| Clip ε | — | 0.2 |
| GAE λ | — | 0.95 |
| Entropy coef | — | 0.05 → 0.005 (annealed) |
| Update epochs | — | 6 |

Results

Evaluated on 10 episodes (inline eval during training), seed 42, 8×8 grid, 15% obstacles, fixed_map: false (random maps per episode).

Note: The default configuration uses fixed_map: false, meaning a new random map is generated at the start of every episode. The agent must generalize to unseen layouts rather than memorize a single path. Results below reflect this harder generalization setting.

| Algorithm | Training Budget | Best Eval Reward | Success Rate (at best) | Success Rate (final) |
|---|---|---|---|---|
| DQN | 600 episodes | 90.79 | 100% | 100% |
| PPO | 1M timesteps | 36.0 | 100% | 80% |

PPO training highlights (1M steps, random maps):

  • Reaches 100% success at multiple eval checkpoints (~28%, ~54%, ~62%, ~64%, ~68%, ~74%, ~94% of training).
  • Best single eval: mean_R = 36.0 at ~620k steps (100% success, 10/10 episodes).
  • Final eval (1M steps): mean_R = −114.3, success = 80% — late-training degradation is expected with linear LR/entropy annealing to near-zero.
  • High variance across evals (mean_R ranging from −270 to +36) is characteristic of on-policy PPO on random maps: the agent is continuously adapting to novel layouts rather than memorizing a fixed path.
  • The value loss grows monotonically (~900 → ~2200) as the critic's value estimates scale up with cumulative returns — a known behavior under long-horizon training with no normalization.

Key observations:

  • DQN converges faster and more stably on the fixed-map setting, finding near-optimal paths consistently.
  • PPO on random maps (fixed_map: false) is the harder generalization task — 1M steps is sufficient to reach 100% success on novel maps at peak, but requires careful hyperparameter tuning to maintain late-training stability.
  • To reproduce the simpler fixed-map setting (single map memorization), set fixed_map: true in configs/default.yaml.

Reproduce PPO results:

python train.py --algo ppo --seed 42

Reproduce DQN results:

python train.py --algo dqn --seed 42
python evaluate.py --algo dqn --model_path checkpoints/best_dqn.pth

License

MIT © 2026 Jingchen Chen

Acknowledgments