# Deep Reinforcement Learning for 2D Robot Navigation using DQN and PPO
A mobile robot learns to navigate from a random start to a goal in a grid world with obstacles.
This project implements and compares two Deep RL algorithms for autonomous navigation:
| Feature | Details |
|---|---|
| Environment | Custom Gymnasium grid world with random obstacles and BFS-verified solvability |
| DQN | Experience replay, target network, linear ε-decay, soft updates |
| PPO | Shared actor-critic, GAE, clipped surrogate objective, entropy bonus |
| Logging | TensorBoard integration for loss, reward, and success rate |
| Visualization | Episode rendering (GIF), Q-value heatmaps, training curves |
```
rl-robot-navigation/
├── configs/
│   └── default.yaml          # Hyperparameter configuration
├── envs/
│   ├── __init__.py
│   └── grid_nav_env.py       # Gymnasium-compatible grid navigation environment
├── networks/
│   ├── __init__.py
│   └── models.py             # QNetwork, ActorCritic, PolicyNetwork, ValueNetwork
├── agents/
│   ├── __init__.py
│   ├── dqn.py                # DQN agent with replay buffer
│   └── ppo.py                # PPO agent with rollout buffer
├── utils/
│   ├── __init__.py
│   ├── replay_buffer.py      # Fixed-size experience replay (NumPy arrays)
│   ├── rollout_buffer.py     # Trajectory buffer with GAE computation
│   └── visualization.py      # Plotting and animation utilities
├── tests/
│   └── test_env.py           # Unit tests for the environment
├── train.py                  # Main training entry point
├── evaluate.py               # Model evaluation and GIF export
├── requirements.txt
├── setup.py
└── LICENSE
```
DQN approximates the optimal action-value function $Q^*(s, a)$ with a neural network $Q_\theta(s, a)$.

Bellman optimality target:

$$y = r + \gamma \max_{a'} Q_{\theta^-}(s', a')$$

Loss function (MSE on the TD error):

$$\mathcal{L}(\theta) = \mathbb{E}_{(s, a, r, s') \sim \mathcal{D}} \left[ \left( y - Q_\theta(s, a) \right)^2 \right]$$

Key techniques: experience replay buffer, target network $Q_{\theta^-}$ (soft update $\theta^- \leftarrow \tau\theta + (1 - \tau)\theta^-$), and linear ε-greedy exploration decay.
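As an illustrative sketch of these two pieces (not the code in `agents/dqn.py`, which presumably operates on network weights rather than plain arrays):

```python
import numpy as np

def td_targets(rewards, dones, next_q, gamma=0.99):
    """Bellman targets y = r + gamma * max_a' Q_target(s', a'),
    zeroing the bootstrap term for terminal transitions."""
    return rewards + gamma * (1.0 - dones) * next_q.max(axis=1)

def soft_update(target_params, online_params, tau=0.005):
    """Polyak averaging: theta_target <- tau * theta + (1 - tau) * theta_target."""
    return [tau * p + (1.0 - tau) * tp
            for tp, p in zip(target_params, online_params)]
```

With `tau = 1.0` the soft update degenerates to the hard target-network copy of the original DQN paper.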
PPO optimizes a clipped surrogate objective to keep policy updates stable.

Clipped surrogate objective:

$$L^{CLIP}(\theta) = \mathbb{E}_t \left[ \min\left( r_t(\theta) \hat{A}_t,\ \mathrm{clip}\left(r_t(\theta), 1 - \epsilon, 1 + \epsilon\right) \hat{A}_t \right) \right]$$

where $r_t(\theta) = \dfrac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{\text{old}}}(a_t \mid s_t)}$ is the probability ratio and $\hat{A}_t$ is the advantage estimate.

Generalized Advantage Estimation (GAE):

$$\hat{A}_t = \sum_{l=0}^{\infty} (\gamma\lambda)^l \delta_{t+l}, \qquad \delta_t = r_t + \gamma V(s_{t+1}) - V(s_t)$$
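Over a finite rollout, GAE reduces to a backward recursion. A NumPy sketch (the project's version lives in `utils/rollout_buffer.py`; this is illustrative):

```python
import numpy as np

def compute_gae(rewards, values, dones, gamma=0.99, lam=0.95):
    """Backward recursion A_t = delta_t + gamma*lam*(1 - done_t)*A_{t+1},
    with delta_t = r_t + gamma*(1 - done_t)*V(s_{t+1}) - V(s_t).
    `values` has length T + 1 (bootstrap value appended)."""
    T = len(rewards)
    advantages = np.zeros(T)
    gae = 0.0
    for t in reversed(range(T)):
        nonterminal = 1.0 - dones[t]
        delta = rewards[t] + gamma * values[t + 1] * nonterminal - values[t]
        gae = delta + gamma * lam * nonterminal * gae
        advantages[t] = gae
    returns = advantages + values[:-1]   # value-function regression targets
    return advantages, returns
```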
```
┌──────────────────────────────────────────────────────────┐
│                    Agent (DQN / PPO)                     │
│  ┌──────────────┐    action   ┌──────────────────────┐   │
│  │  Neural Net  │ ──────────▶ │   Replay / Rollout   │   │
│  │   (Q / AC)   │ ◀────────── │        Buffer        │   │
│  └──────┬───────┘    batch    └──────────────────────┘   │
│         │ state                                          │
└─────────┼────────────────────────────────────────────────┘
  action  ▼   ▲ (state, reward, done)
┌─────────┴────────────────────────────────────────────────┐
│                       GridNavEnv                         │
│  ┌─────┬─────┬─────┬─────┐                               │
│  │  S  │     │  #  │     │   S = Start (agent)           │
│  ├─────┼─────┼─────┼─────┤   G = Goal                    │
│  │     │  #  │     │     │   # = Obstacle                │
│  ├─────┼─────┼─────┼─────┤                               │
│  │  #  │     │     │  #  │   Actions: ↑ ↓ ← →            │
│  ├─────┼─────┼─────┼─────┤   Obs: 3-channel flat vec     │
│  │     │     │  #  │  G  │   Rewards: +100, -5, -1       │
│  └─────┴─────┴─────┴─────┘                               │
└──────────────────────────────────────────────────────────┘
```
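The loop in the diagram follows the standard Gymnasium interface. A self-contained sketch with a stand-in environment (`TinyCorridorEnv` below is purely illustrative; the real `GridNavEnv` lives in `envs/grid_nav_env.py`):

```python
class TinyCorridorEnv:
    """Stand-in 1-D environment with the Gymnasium reset/step signature.
    The agent starts at cell 0 and must reach cell 3 (the goal)."""
    def reset(self, seed=None):
        self.pos = 0
        return self.pos, {}                       # (observation, info)

    def step(self, action):                       # action: 0 = left, 1 = right
        self.pos = max(0, min(3, self.pos + (1 if action == 1 else -1)))
        terminated = self.pos == 3
        reward = 100.0 if terminated else -1.0    # same shape as GridNavEnv rewards
        return self.pos, reward, terminated, False, {}

def run_episode(env, policy, max_steps=200):
    """Roll out one episode, returning the total (undiscounted) reward."""
    obs, _ = env.reset()
    total = 0.0
    for _ in range(max_steps):
        obs, reward, terminated, truncated, _ = env.step(policy(obs))
        total += reward
        if terminated or truncated:
            break
    return total
```

`run_episode` works unchanged with any Gymnasium-style environment, including `GridNavEnv`.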
```bash
git clone https://github.com/Jingchen-Chen/rl-robot-navigation.git
cd rl-robot-navigation
pip install -r requirements.txt
```

```bash
# Train DQN agent (600 episodes)
python train.py --algo dqn

# Train PPO agent (1M timesteps)
python train.py --algo ppo

# Custom settings
python train.py --algo dqn --seed 123 --episodes 2000 --device cuda
```

```bash
# Evaluate trained DQN model
python evaluate.py --algo dqn --model_path checkpoints/best_dqn.pth --render

# Evaluate PPO with trajectory saving
python evaluate.py --algo ppo --model_path checkpoints/best_ppo.pth --save_trajectories
```

Monitor training with TensorBoard:

```bash
tensorboard --logdir runs/
```

Run the unit tests:

```bash
python -m pytest tests/ -v
```

| Property | Value |
|---|---|
| Grid size | 8×8 (configurable) |
| Obstacle ratio | 15% (configurable) |
| Observation | 3-channel flattened vector (192-dim): obstacle / agent / goal planes |
| Actions | Discrete(4): Up, Down, Left, Right |
| Reward: reach goal | +100 |
| Reward: hit wall/obstacle | −5 + distance shaping |
| Reward: each step | −1 + distance shaping |
| Max steps | 200 |
| Map generation | Random with BFS solvability guarantee (default `fixed_map: false`: a new random map per episode for generalization; set to `true` to reuse a single fixed map across all episodes) |
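The BFS solvability guarantee amounts to checking that a path of free cells connects start to goal before accepting a generated map. An illustrative check (not necessarily the exact code in `envs/grid_nav_env.py`):

```python
from collections import deque

def is_solvable(grid, start, goal):
    """Breadth-first search over free cells; grid[r][c] == 1 marks an obstacle."""
    rows, cols = len(grid), len(grid[0])
    seen, queue = {start}, deque([start])
    while queue:
        r, c = queue.popleft()
        if (r, c) == goal:
            return True
        for dr, dc in ((-1, 0), (1, 0), (0, -1), (0, 1)):   # Up, Down, Left, Right
            nr, nc = r + dr, c + dc
            if (0 <= nr < rows and 0 <= nc < cols
                    and grid[nr][nc] == 0 and (nr, nc) not in seen):
                seen.add((nr, nc))
                queue.append((nr, nc))
    return False                      # goal unreachable: resample the map
```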
All hyperparameters are in configs/default.yaml. Key settings:
| Parameter | DQN | PPO |
|---|---|---|
| Learning rate | 5e-4 | 3e-4 |
| Discount (γ) | 0.99 | 0.99 |
| Batch size | 64 | 256 |
| Buffer size | 50,000 | 2,048 (rollout) |
| ε-decay steps | 30,000 | – |
| Clip ε | – | 0.2 |
| GAE λ | – | 0.95 |
| Entropy coef | – | 0.05 → 0.005 (annealed) |
| Update epochs | – | 6 |
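Schedules such as the annealed entropy coefficient (0.05 → 0.005) are commonly linear in training progress. A hypothetical helper (the actual schedule shape and config keys in `configs/default.yaml` may differ):

```python
def linear_anneal(start, end, progress):
    """Interpolate from `start` to `end` as training progress goes 0 -> 1,
    clamping progress outside that range."""
    progress = min(max(progress, 0.0), 1.0)
    return start + (end - start) * progress

# Entropy coefficient halfway through training:
entropy_coef = linear_anneal(0.05, 0.005, 0.5)   # 0.0275
```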
Evaluated on 10 episodes (inline eval during training), seed 42, 8Γ8 grid, 15% obstacles, fixed_map: false (random maps per episode).
Note: the default configuration uses `fixed_map: false`, meaning a new random map is generated at the start of every episode; the agent must generalize to unseen layouts rather than memorize a single path. The results below reflect this harder generalization setting.
| Algorithm | Training Budget | Best Eval Reward | Success Rate (at best) | Success Rate (final) |
|---|---|---|---|---|
| DQN | 600 episodes | 90.79 | 100% | 100% |
| PPO | 1M timesteps | 36.0 | 100% | 80% |
PPO training highlights (1M steps, random maps):
- Reaches 100% success at multiple eval checkpoints (~28%, ~54%, ~62%, ~64%, ~68%, ~74%, ~94% of training).
- Best single eval: mean_R = 36.0 at ~620k steps (100% success, 10/10 episodes).
- Final eval (1M steps): mean_R = −114.3, success = 80%; some late-training degradation is expected as the learning rate and entropy coefficient anneal linearly toward zero.
- High variance across evals (mean_R ranging from −270 to +36) is characteristic of on-policy PPO on random maps: the agent continuously adapts to novel layouts rather than memorizing a fixed path.
- The value loss grows monotonically (~900 → ~2200) as the critic's value estimates scale with cumulative returns, a known behavior in long-horizon training without return normalization.
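A common mitigation for the growing value loss, not used in this project, is normalizing value targets with running statistics. A Welford-style sketch:

```python
import numpy as np

class RunningNorm:
    """Running mean/std via Welford's online algorithm, for scaling value targets."""
    def __init__(self):
        self.count, self.mean, self.m2 = 1e-4, 0.0, 0.0   # small count avoids div-by-zero

    def update(self, xs):
        for x in np.asarray(xs, dtype=float).ravel():
            self.count += 1.0
            delta = x - self.mean
            self.mean += delta / self.count
            self.m2 += delta * (x - self.mean)

    @property
    def std(self):
        return max((self.m2 / self.count) ** 0.5, 1e-8)

    def normalize(self, xs):
        return (np.asarray(xs) - self.mean) / self.std
```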
Key observations:
- DQN converges faster and more stably on the fixed-map setting, finding near-optimal paths consistently.
- PPO on random maps (`fixed_map: false`) faces the harder generalization task: 1M steps is enough to reach 100% success on novel maps at peak, but careful hyperparameter tuning is needed to maintain late-training stability.
- To reproduce the simpler fixed-map setting (single-map memorization), set `fixed_map: true` in `configs/default.yaml`.
Reproduce PPO results:

```bash
python train.py --algo ppo --seed 42
```

Reproduce DQN results:

```bash
python train.py --algo dqn --seed 42
python evaluate.py --algo dqn --model_path checkpoints/best_dqn.pth
```
MIT © 2026 Jingchen Chen