# PPO OpenAI Gym Minitaur
This project implements a Proximal Policy Optimization (PPO) reinforcement learning agent that trains the Minitaur robot to walk in the `MinitaurBulletEnv-v0` environment using PyBullet. The agent uses a multilayer perceptron (MLP) for both the policy and value networks and learns to control the robot in a continuous action space.

## DEMO

## HOW TO RUN THE CODE
### Requirements
- Python 3.10+
- PyTorch
- gym
- pybullet
- matplotlib
- numpy

### Installation
```bash
git clone https://github.com/EricChen0104/PPO_PyBullet_Minitaur.git
cd PPO_PyBullet_Minitaur

# install the dependencies listed above (assuming no requirements.txt is provided)
pip install torch gym pybullet matplotlib numpy
```

## Algorithm Details
The PPO agent uses:
- A shared MLP backbone with 2 hidden layers
- A Gaussian action distribution with learned mean and log_std
- Tanh squashing to bound actions in [-1, 1]
- Generalized Advantage Estimation (GAE) for advantage estimates
- A clipped surrogate objective for the policy update
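
As a concrete illustration of the last bullet, the clipped surrogate objective from Schulman et al. (2017) can be sketched in NumPy. This is an illustrative sketch only; the function and variable names are not taken from this repository's code:

```python
import numpy as np

def ppo_clip_loss(log_probs_new, log_probs_old, advantages, clip_eps=0.2):
    """Clipped surrogate objective.

    Returns the loss to *minimize* (the negative of the clipped objective).
    """
    ratio = np.exp(log_probs_new - log_probs_old)  # pi_new(a|s) / pi_old(a|s)
    unclipped = ratio * advantages
    clipped = np.clip(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages
    # taking the elementwise minimum makes the objective pessimistic,
    # which limits how far each update can move the policy
    return -np.mean(np.minimum(unclipped, clipped))
```

The same computation is typically done with `torch` tensors in training code so gradients flow through `log_probs_new`.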

### Policy Network
- Input: 28-dim observation (Minitaur state)
- Actor: two hidden layers [128, 89], ReLU activations
- Critic: two hidden layers [89, 55], ReLU activations
- GAE calculation: <br/>

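
The GAE recursion above can be sketched as follows (a minimal NumPy version; function and variable names are illustrative, not the repository's actual code):

```python
import numpy as np

def compute_gae(rewards, values, dones, last_value, gamma=0.99, lam=0.95):
    """Generalized Advantage Estimation.

    A_t = delta_t + gamma * lam * (1 - done_t) * A_{t+1},
    where delta_t = r_t + gamma * (1 - done_t) * V(s_{t+1}) - V(s_t).
    """
    T = len(rewards)
    advantages = np.zeros(T)
    next_value, next_adv = last_value, 0.0
    for t in reversed(range(T)):  # sweep backwards through the rollout
        nonterminal = 1.0 - dones[t]
        delta = rewards[t] + gamma * nonterminal * next_value - values[t]
        advantages[t] = delta + gamma * lam * nonterminal * next_adv
        next_value, next_adv = values[t], advantages[t]
    returns = advantages + values  # targets for the value-function loss
    return advantages, returns
```

In practice the advantages are usually normalized (zero mean, unit std) per rollout before the policy update.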

### Reward

### Hyperparameters
| Parameter              | Value     |
|------------------------|-----------|
| Total steps            | 1,000     |
| Steps per rollout      | 4,096     |
| PPO epochs             | 10        |
| Minibatch size         | 128       |
| Learning rate          | 3e-5      |
| γ (discount factor)    | 0.99      |
| λ (GAE lambda)         | 0.95      |
| Clip range (ε)         | 0.2       |
| Value loss coeff       | 0.5       |
| Entropy coeff          | 0.04      |
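
For reference, the table above can be collected into a single config object. This is only a sketch of how the values map to names in a training loop; the repository may organize them differently:

```python
from dataclasses import dataclass

@dataclass
class PPOConfig:
    total_steps: int = 1_000       # "Total steps" from the table
    rollout_steps: int = 4_096     # environment steps per rollout
    ppo_epochs: int = 10           # optimization epochs per rollout
    minibatch_size: int = 128
    learning_rate: float = 3e-5
    gamma: float = 0.99            # discount factor
    gae_lambda: float = 0.95
    clip_eps: float = 0.2          # PPO clip range
    value_coef: float = 0.5        # value loss coefficient
    entropy_coef: float = 0.04     # entropy bonus coefficient
```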

## Future Work
- [ ] Add observation normalization (e.g. running mean/std)
- [ ] Implement reward normalization
- [ ] Test with LSTM-based recurrent policies
- [ ] Add curriculum learning (e.g. with terrain or perturbation)
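
The first item could be implemented with a running mean/std tracker along these lines (a sketch using batched moment merging, not code from this repository):

```python
import numpy as np

class RunningMeanStd:
    """Tracks a running mean and variance of observations for normalization."""

    def __init__(self, shape):
        self.mean = np.zeros(shape)
        self.var = np.ones(shape)
        self.count = 1e-4  # tiny prior count avoids division by zero

    def update(self, batch):
        batch = np.asarray(batch, dtype=np.float64)
        b_mean, b_var, b_count = batch.mean(0), batch.var(0), batch.shape[0]
        delta = b_mean - self.mean
        tot = self.count + b_count
        # merge batch moments into the running moments
        self.mean = self.mean + delta * b_count / tot
        m_a = self.var * self.count
        m_b = b_var * b_count
        self.var = (m_a + m_b + delta**2 * self.count * b_count / tot) / tot
        self.count = tot

    def normalize(self, obs, clip=10.0):
        return np.clip((obs - self.mean) / np.sqrt(self.var + 1e-8), -clip, clip)
```

The tracker would be updated with each rollout's observations, and the normalized observations fed to the policy.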

## References
- Tan, J., Zhang, T., Coumans, E., Iscen, A., Bai, Y., Hafner, D., ... & Vanhoucke, V. (2018). Sim-to-real: Learning agile locomotion for quadruped robots. arXiv preprint arXiv:1804.10332.
- Jadoon, N. A. K., & Ekpanyapong, M. (2025). Quadruped robot simulation using deep reinforcement learning: A step towards locomotion policy. arXiv preprint arXiv:2502.16401.
- Schulman, J., Wolski, F., Dhariwal, P., Radford, A., & Klimov, O. (2017). Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347.