
🤖 RL Robot Navigation

Deep Reinforcement Learning for 2D Robot Navigation using DQN and PPO.

A mobile robot learns to navigate from a random start to a goal in a grid world with obstacles.

Overview

This project implements and compares two Deep RL algorithms for autonomous navigation:

| Feature | Details |
| --- | --- |
| Environment | Custom Gymnasium grid world with random obstacles and BFS-verified solvability |
| DQN | Experience replay, target network, linear ε-decay, soft updates |
| PPO | Shared actor-critic, GAE, clipped surrogate objective, entropy bonus |
| Logging | TensorBoard integration for loss, reward, and success rate |
| Visualization | Episode rendering (GIF), Q-value heatmaps, training curves |

Project Structure

rl-robot-navigation/
├── configs/
│   └── default.yaml          # Hyperparameter configuration
├── envs/
│   ├── __init__.py
│   └── grid_nav_env.py       # Gymnasium-compatible grid navigation environment
├── networks/
│   ├── __init__.py
│   └── models.py             # QNetwork, ActorCritic, PolicyNetwork, ValueNetwork
├── agents/
│   ├── __init__.py
│   ├── dqn.py                # DQN agent with replay buffer
│   └── ppo.py                # PPO agent with rollout buffer
├── utils/
│   ├── __init__.py
│   ├── replay_buffer.py      # Fixed-size experience replay (NumPy arrays)
│   ├── rollout_buffer.py     # Trajectory buffer with GAE computation
│   └── visualization.py      # Plotting and animation utilities
├── tests/
│   └── test_env.py           # Unit tests for the environment
├── train.py                  # Main training entry point
├── evaluate.py               # Model evaluation and GIF export
├── requirements.txt
├── setup.py
└── LICENSE

Algorithm Details

Deep Q-Network (DQN)

DQN approximates the optimal action-value function $Q^*(s, a)$ using a neural network.

Bellman optimality target:

$$y = r + \gamma \max_{a'} Q(s', a'; \theta^-)$$

Loss function (MSE on TD error):

$$\mathcal{L}(\theta) = \mathbb{E}\left[\left(y - Q(s, a; \theta)\right)^2\right]$$

Key techniques: experience replay buffer, target network (soft update $\tau = 0.01$), linear $\varepsilon$-greedy decay.
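
These pieces can be sketched in a few lines of NumPy. The function names are illustrative, not the actual API of agents/dqn.py; τ = 0.01 and the 30,000-step decay come from the settings in this README, while the ε bounds are assumptions:

```python
import numpy as np

def linear_epsilon(step, eps_start=1.0, eps_end=0.05, decay_steps=30_000):
    """Linearly anneal the exploration rate from eps_start to eps_end."""
    frac = min(step / decay_steps, 1.0)
    return eps_start + frac * (eps_end - eps_start)

def soft_update(target_w, online_w, tau=0.01):
    """Polyak update of the target network: theta_minus <- tau*theta + (1-tau)*theta_minus."""
    return [tau * w + (1.0 - tau) * wt for w, wt in zip(online_w, target_w)]

def dqn_targets(rewards, next_q, dones, gamma=0.99):
    """Bellman targets y = r + gamma * max_a' Q(s', a'; theta_minus), zeroed at terminals.

    next_q has shape (batch, num_actions) and comes from the target network.
    """
    return rewards + gamma * (1.0 - dones) * next_q.max(axis=1)
```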

Proximal Policy Optimization (PPO)

PPO optimizes a clipped surrogate objective to ensure stable policy updates.

Clipped surrogate objective:

$$L^{\text{CLIP}}(\theta) = \hat{\mathbb{E}}_t \left[\min\left(r_t(\theta)\hat{A}_t,\ \text{clip}(r_t(\theta), 1-\epsilon, 1+\epsilon)\hat{A}_t\right)\right]$$

where $r_t(\theta) = \frac{\pi_\theta(a_t|s_t)}{\pi_{\theta_\text{old}}(a_t|s_t)}$.
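
In code the objective reduces to a ratio, a clip, and an elementwise min. A minimal NumPy sketch with illustrative names (the actual agent computes this on PyTorch tensors):

```python
import numpy as np

def ppo_clip_objective(logp_new, logp_old, advantages, clip_eps=0.2):
    """Clipped surrogate L^CLIP; the agent maximizes this (minimizes its negative)."""
    ratio = np.exp(logp_new - logp_old)                              # r_t(theta)
    unclipped = ratio * advantages
    clipped = np.clip(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages
    return np.minimum(unclipped, clipped).mean()                     # pessimistic bound
```

The min keeps the update pessimistic: when the ratio drifts outside [1−ε, 1+ε], the clipped branch caps the gain from moving further in that direction.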

Generalized Advantage Estimation (GAE):

$$\hat{A}_t = \sum_{l=0}^{\infty} (\gamma \lambda)^l \delta_{t+l}, \quad \delta_t = r_t + \gamma V(s_{t+1}) - V(s_t)$$
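
The estimator is a single backward pass over the rollout. A self-contained sketch, assuming a `dones` flag of 1.0 at terminal steps (the actual implementation in utils/rollout_buffer.py may differ in detail):

```python
import numpy as np

def compute_gae(rewards, values, dones, last_value, gamma=0.99, lam=0.95):
    """Backward GAE recursion: delta_t = r_t + gamma*V(s_{t+1}) - V(s_t),
    A_t = delta_t + gamma*lam*A_{t+1}, with bootstrapping cut at terminals."""
    T = len(rewards)
    advantages = np.zeros(T)
    gae = 0.0
    next_value = last_value          # V(s_T) bootstrap for the final step
    for t in reversed(range(T)):
        nonterminal = 1.0 - dones[t]
        delta = rewards[t] + gamma * next_value * nonterminal - values[t]
        gae = delta + gamma * lam * nonterminal * gae
        advantages[t] = gae
        next_value = values[t]
    return advantages
```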

Architecture

┌────────────────────────────────────────────────────────┐
│                     Agent (DQN / PPO)                  │
│  ┌──────────────┐   action   ┌──────────────────────┐  │
│  │ Neural Net   │ ────────►  │ Replay / Rollout     │  │
│  │ (Q / AC)     │ ◄────────  │ Buffer               │  │
│  └──────┬───────┘   batch    └──────────────────────┘  │
│         │ state                                        │
└─────────┼──────────────────────────────────────────────┘
          │ action ▼          ▲ (state, reward, done)
┌─────────┴──────────────────────────────────────────────┐
│                   GridNavEnv                           │
│  ┌─────┬─────┬─────┬─────┐                             │
│  │  S  │     │  #  │     │   S = Start (agent)         │
│  ├─────┼─────┼─────┼─────┤   G = Goal                  │
│  │     │  #  │     │     │   # = Obstacle              │
│  ├─────┼─────┼─────┼─────┤                             │
│  │  #  │     │     │  #  │   Actions: ↑ ↓ ← →          │
│  ├─────┼─────┼─────┼─────┤   Obs: 3-channel flat vec   │
│  │     │     │  #  │  G  │   Rewards: +100, −5, −1     │
│  └─────┴─────┴─────┴─────┘                             │
└────────────────────────────────────────────────────────┘
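
The loop this diagram describes is the standard Gymnasium reset/step cycle with the five-tuple step return. A sketch with a stub standing in for GridNavEnv (the stub's dynamics and rewards are invented for illustration):

```python
class StubEnv:
    """Stand-in for GridNavEnv exposing the Gymnasium reset/step API."""
    def __init__(self):
        self._t = 0

    def reset(self, seed=None):
        self._t = 0
        return 0, {}                      # (observation, info)

    def step(self, action):
        self._t += 1
        terminated = self._t >= 3         # toy episode: ends after 3 steps
        # (obs, reward, terminated, truncated, info)
        return self._t, -1.0, terminated, False, {}

def run_episode(env, policy, max_steps=200):
    """Roll out one episode, returning the undiscounted return."""
    obs, _ = env.reset()
    total = 0.0
    for _ in range(max_steps):
        action = policy(obs)
        obs, reward, terminated, truncated, _ = env.step(action)
        total += reward
        if terminated or truncated:
            break
    return total
```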

Installation

git clone https://github.com/Jingchen-Chen/rl-robot-navigation.git
cd rl-robot-navigation
pip install -r requirements.txt

Quick Start

Training

# Train DQN agent (600 episodes)
python train.py --algo dqn

# Train PPO agent (1M timesteps)
python train.py --algo ppo

# Custom settings
python train.py --algo dqn --seed 123 --episodes 2000 --device cuda

Evaluation

# Evaluate trained DQN model
python evaluate.py --algo dqn --model_path checkpoints/best_dqn.pth --render

# Evaluate PPO with trajectory saving
python evaluate.py --algo ppo --model_path checkpoints/best_ppo.pth --save_trajectories

TensorBoard

tensorboard --logdir runs/

Tests

python -m pytest tests/ -v

Environment Details

| Property | Value |
| --- | --- |
| Grid size | 8×8 (configurable) |
| Obstacle ratio | 15% (configurable) |
| Observation | 3-channel flattened vector (192-dim): obstacle / agent / goal planes |
| Actions | Discrete(4): Up, Down, Left, Right |
| Reward: reach goal | +100 |
| Reward: hit wall/obstacle | −5 + distance shaping |
| Reward: each step | −1 + distance shaping |
| Max steps | 200 |
| Map generation | Random with BFS solvability guarantee (default fixed_map: false: a new random map per episode for generalization; set to true to reuse a single fixed map across all episodes) |
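
The BFS solvability guarantee amounts to a 4-connected flood fill from start to goal before a generated map is accepted. A sketch of such a check (the real one inside envs/grid_nav_env.py may differ):

```python
from collections import deque

def bfs_solvable(grid, start, goal):
    """Return True if goal is reachable from start through free cells.

    grid: 2D list of ints, 1 = obstacle, 0 = free; start/goal: (row, col).
    """
    rows, cols = len(grid), len(grid[0])
    seen = {start}
    queue = deque([start])
    while queue:
        r, c = queue.popleft()
        if (r, c) == goal:
            return True
        for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):  # 4-connected moves
            nr, nc = r + dr, c + dc
            if (0 <= nr < rows and 0 <= nc < cols
                    and grid[nr][nc] == 0 and (nr, nc) not in seen):
                seen.add((nr, nc))
                queue.append((nr, nc))
    return False
```

A generator would simply resample obstacle layouts until this check passes.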

Configuration

All hyperparameters are in configs/default.yaml. Key settings:

| Parameter | DQN | PPO |
| --- | --- | --- |
| Learning rate | 5e-4 | 3e-4 |
| Discount (γ) | 0.99 | 0.99 |
| Batch size | 64 | 256 |
| Buffer size | 50,000 | 2,048 (rollout) |
| ε-decay steps | 30,000 | — |
| Clip ε | — | 0.2 |
| GAE λ | — | 0.95 |
| Entropy coef | — | 0.05 → 0.005 (annealed) |
| Update epochs | — | 6 |
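
A plausible layout for these values in configs/default.yaml; the key names here are guesses, so check the actual file:

```yaml
dqn:
  lr: 5.0e-4
  gamma: 0.99
  batch_size: 64
  buffer_size: 50000
  eps_decay_steps: 30000
ppo:
  lr: 3.0e-4
  gamma: 0.99
  batch_size: 256
  rollout_length: 2048
  clip_eps: 0.2
  gae_lambda: 0.95
  entropy_coef_start: 0.05
  entropy_coef_end: 0.005
  update_epochs: 6
```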

Results

Evaluated on 10 episodes (inline eval during training), seed 42, 8Γ—8 grid, 15% obstacles, fixed_map: false (random maps per episode).

Note: The default configuration uses fixed_map: false, meaning a new random map is generated at the start of every episode. The agent must generalize to unseen layouts rather than memorize a single path. Results below reflect this harder generalization setting.

| Algorithm | Training Budget | Best Eval Reward | Success Rate (at best) | Success Rate (final) |
| --- | --- | --- | --- | --- |
| DQN | 600 episodes | 90.79 | 100% | 100% |
| PPO | 1M timesteps | 36.0 | 100% | 80% |

PPO training highlights (1M steps, random maps):

  • Reaches 100% success at multiple eval checkpoints (~28%, ~54%, ~62%, ~64%, ~68%, ~74%, ~94% of training).
  • Best single eval: mean_R = 36.0 at ~620k steps (100% success, 10/10 episodes).
  • Final eval (1M steps): mean_R = −114.3, success = 80%; late-training degradation is expected with linear LR/entropy annealing to near zero.
  • High variance across evals (mean_R ranging from −270 to +36) is characteristic of on-policy PPO on random maps: the agent continuously adapts to novel layouts rather than memorizing a fixed path.
  • The value loss grows monotonically (~900 → ~2200) as the critic's value estimates scale up with cumulative returns, a known behavior under long-horizon training without return normalization.

Key observations:

  • DQN converges faster and more stably in this setting, finding near-optimal paths consistently.
  • PPO on random maps (fixed_map: false) faces the harder generalization task: 1M steps is enough to reach 100% success on novel maps at peak, but careful hyperparameter tuning is needed to maintain late-training stability.
  • To reproduce the simpler fixed-map setting (single-map memorization), set fixed_map: true in configs/default.yaml.

Reproduce PPO results:

python train.py --algo ppo --seed 42

Reproduce DQN results:

python train.py --algo dqn --seed 42
python evaluate.py --algo dqn --model_path checkpoints/best_dqn.pth

License

MIT © 2026 Jingchen Chen

Acknowledgments

  • Gymnasium: RL environment toolkit
  • PyTorch: deep learning framework
