# PPO OpenAI Gym Minitaur

This project implements a Proximal Policy Optimization (PPO) reinforcement learning agent that trains the Minitaur robot to walk in the `MinitaurBulletEnv-v0` environment using PyBullet. The agent uses multilayer perceptrons (MLPs) for the policy and value networks and learns to control the robot in a continuous action space.

## DEMO

![Minitaur walking demo](https://github.com/user-attachments/assets/8c6f88ac-d396-4f09-8316-79b777b29441)

## HOW TO RUN THE CODE

### Requirements

- Python 3.10+
- PyTorch
- gym
- pybullet
- matplotlib
- numpy

### Installation

```bash
git clone https://github.com/EricChen0104/PPO_PyBullet_Minitaur.git
cd PPO_PyBullet_Minitaur
pip install torch gym pybullet matplotlib numpy
```

## Algorithm Details

The PPO agent uses:

- A shared MLP backbone with two hidden layers
- A Gaussian action distribution with learned mean and log_std
- Tanh squashing to bound actions in [-1, 1]
- Generalized Advantage Estimation (GAE) for advantage estimation
- A clipped surrogate objective for the policy update
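The Gaussian policy head with tanh squashing can be sketched in PyTorch as follows. This is a minimal illustration, not the repository's code: the hidden-layer sizes follow the Policy Network section, while the 8-dim action space and the separate (non-shared) actor and critic towers are assumptions made for the sketch.

```python
import torch
import torch.nn as nn

class ActorCritic(nn.Module):
    """Sketch of the actor-critic MLPs (sizes from the Policy Network section)."""

    def __init__(self, obs_dim=28, act_dim=8):  # act_dim=8 is an assumption
        super().__init__()
        # Actor: two hidden layers [128, 89] with ReLU, outputs the Gaussian mean
        self.actor = nn.Sequential(
            nn.Linear(obs_dim, 128), nn.ReLU(),
            nn.Linear(128, 89), nn.ReLU(),
            nn.Linear(89, act_dim),
        )
        # Critic: two hidden layers [89, 55] with ReLU, outputs the state value
        self.critic = nn.Sequential(
            nn.Linear(obs_dim, 89), nn.ReLU(),
            nn.Linear(89, 55), nn.ReLU(),
            nn.Linear(55, 1),
        )
        # State-independent learned log standard deviation
        self.log_std = nn.Parameter(torch.zeros(act_dim))

    def forward(self, obs):
        mean = self.actor(obs)
        dist = torch.distributions.Normal(mean, self.log_std.exp())
        raw_action = dist.rsample()
        action = torch.tanh(raw_action)  # bound actions in [-1, 1]
        value = self.critic(obs).squeeze(-1)
        return action, dist.log_prob(raw_action).sum(-1), value
```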

### Policy Network

- Input: 28-dim observation (Minitaur state)
- Actor: two hidden layers of sizes [128, 89], ReLU activations
- Critic: two hidden layers of sizes [89, 55], ReLU activations
- GAE calculation:

![GAE equations](https://github.com/user-attachments/assets/bf26b6eb-4614-4a08-9471-eae84892a9e4)
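The GAE recursion pictured above can be written out directly. This is a minimal pure-Python sketch; the function name and rollout layout are illustrative, not the repository's API. `values` carries one extra bootstrap entry for the state after the final step.

```python
def compute_gae(rewards, values, dones, gamma=0.99, lam=0.95):
    """Generalized Advantage Estimation over one rollout.

    rewards, dones: length T; values: length T + 1 (bootstrap value last).
    """
    advantages = [0.0] * len(rewards)
    gae = 0.0
    for t in reversed(range(len(rewards))):
        nonterminal = 1.0 - dones[t]
        # TD residual: delta_t = r_t + gamma * V(s_{t+1}) - V(s_t)
        delta = rewards[t] + gamma * values[t + 1] * nonterminal - values[t]
        # Recursion: A_t = delta_t + gamma * lambda * A_{t+1}
        gae = delta + gamma * lam * nonterminal * gae
        advantages[t] = gae
    returns = [a + v for a, v in zip(advantages, values[:-1])]
    return advantages, returns
```

The returns (`advantage + value`) serve as regression targets for the critic during the PPO update.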
### Reward

![PPO training curve (episode reward over training)](https://github.com/EricChen0104/PPO_PyBullet_Minitaur/blob/master/plot/ppo_training_curve.png?raw=true)

### Hyperparameters

| Parameter           | Value |
|---------------------|-------|
| Total steps         | 1,000 |
| Steps per rollout   | 4096  |
| PPO epochs          | 10    |
| Minibatch size      | 128   |
| Learning rate       | 3e-5  |
| γ (discount factor) | 0.99  |
| λ (GAE lambda)      | 0.95  |
| Clip range (ε)      | 0.2   |
| Value loss coeff    | 0.5   |
| Entropy coeff       | 0.04  |
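The clip range ε = 0.2 from the table enters the policy update through the clipped surrogate objective. A minimal per-sample sketch (function name is illustrative; in practice this is computed over a minibatch of probability ratios and advantages):

```python
def clipped_surrogate(ratio, advantage, clip_eps=0.2):
    """PPO clipped surrogate for one sample.

    ratio = pi_new(a|s) / pi_old(a|s). The objective keeps the pessimistic
    min(ratio * A, clip(ratio, 1 - eps, 1 + eps) * A); the loss is its negation.
    """
    clipped_ratio = max(1.0 - clip_eps, min(1.0 + clip_eps, ratio))
    return min(ratio * advantage, clipped_ratio * advantage)
```

Clipping removes the incentive to move the ratio outside [1 − ε, 1 + ε], which is what keeps each PPO update close to the rollout policy.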

## Future Work

- [ ] Add observation normalization (e.g. running mean/std)
- [ ] Implement reward normalization
- [ ] Test with LSTM-based recurrent policies
- [ ] Add curriculum learning (e.g. with terrain or perturbations)

## References

- Tan, J., Zhang, T., Coumans, E., Iscen, A., Bai, Y., Hafner, D., ... & Vanhoucke, V. (2018). Sim-to-real: Learning agile locomotion for quadruped robots. arXiv preprint arXiv:1804.10332.
- Jadoon, N. A. K., & Ekpanyapong, M. (2025). Quadruped robot simulation using deep reinforcement learning: A step towards locomotion policy. arXiv preprint arXiv:2502.16401.
- Schulman, J., Wolski, F., Dhariwal, P., Radford, A., & Klimov, O. (2017). Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347.
