An implementation of Proximal Policy Optimization (PPO) for OpenAI Gymnasium. This repository works out of the box on `BipedalWalker-v3` and `Humanoid-v5`, and its modular structure makes it easy to add new Gym environments.
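For context, the objective PPO optimizes is the clipped surrogate loss. The snippet below is a minimal, repo-independent sketch of that loss; the function and tensor names are illustrative and are not taken from this codebase.

```python
# Minimal sketch of PPO's clipped surrogate loss (illustrative, not this repo's code).
import torch

def ppo_clip_loss(new_log_prob, old_log_prob, advantage, clip_eps=0.2):
    ratio = torch.exp(new_log_prob - old_log_prob)                # pi_new(a|s) / pi_old(a|s)
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps)  # keep the ratio near 1
    # Take the pessimistic (min) objective and negate it, since optimizers minimize.
    return -torch.min(ratio * advantage, clipped * advantage).mean()
```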
BipedalWalker-v3
A bipedal agent learning to walk on uneven terrain.

Humanoid-v5
A humanoid robot learning basic locomotion.

Step 1: Create and activate a new conda environment named `gym`:

```bash
conda create -n gym python=3.10
conda activate gym
```
Step 2: Clone this repository:

```bash
git clone https://github.com/jianglanwei/PPO-OpenAI-Gym
cd PPO-OpenAI-Gym
```
Step 3: Install dependencies:

```bash
pip install -r requirements.txt
```
To train a `BipedalWalker-v3` policy from scratch, run:

```bash
python3 train.py --env BipedalWalker-v3
```
- Checkpoints are saved to `policy_ckpt/BipedalWalker-v3/<train_start_time>`. Only the top 5 checkpoints are retained per run.
- Training hyperparameters can be customized in `config/BipedalWalker-v3.yaml`.
- Real-time training metrics are logged to Weights & Biases (`wandb`).
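As an illustration of the top-5 retention described above, here is a minimal sketch of how checkpoint pruning by reward could look. It assumes checkpoints are tracked as `(reward, path)` pairs, which may not match the repo's actual bookkeeping in `train.py`.

```python
# Hedged sketch of "keep only the top 5 checkpoints" retention; names are illustrative.
import os

def prune_checkpoints(ckpts, keep=5):
    """ckpts: list of (mean_reward, file_path). Deletes all but the `keep` best files."""
    ckpts.sort(key=lambda x: x[0], reverse=True)   # best reward first
    for _, path in ckpts[keep:]:
        os.remove(path)                            # drop the lower-reward checkpoints
    return ckpts[:keep]
```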
Use the `--resume_run` flag to load a checkpoint from a previous session and continue training:

```bash
python3 train.py --env BipedalWalker-v3 --resume_run <train_start_time>
```
This repository includes pretrained `BipedalWalker-v3` checkpoints from session `06-05-25_03:09:03` (located in `policy_ckpt/BipedalWalker-v3/06-05-25_03:09:03`).
To resume training from the highest-reward checkpoint of that session, use:

```bash
python3 train.py --env BipedalWalker-v3 --resume_run 06-05-25_03:09:03
```

To resume from a specific epoch, add the `--load_epoch` flag:

```bash
python3 train.py --env BipedalWalker-v3 --resume_run 06-05-25_03:09:03 --load_epoch 990
```
Use `play.py` to render a trained agent. The script can render in real time (`human` mode, the default) or generate GIF files (`rgb_array` mode, suitable for headless execution). The general command is:

```bash
python3 play.py --env BipedalWalker-v3 --run <train_start_time> --epoch <epoch_number> --render_mode <human|rgb_array>
```
For example, to visualize `BipedalWalker-v3` session `06-05-25_03:09:03` in a Gymnasium window, run:

```bash
python3 play.py --env BipedalWalker-v3 --run 06-05-25_03:09:03 --render_mode human
```
By default, this loads the checkpoint with the highest reward. Use `--epoch` to target a specific checkpoint:

```bash
python3 play.py --env BipedalWalker-v3 --run 06-05-25_03:09:03 --epoch 990 --render_mode human
```
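For reference, this is roughly what `rgb_array` rendering amounts to: frames collected from a Gymnasium environment and written to a GIF. The sketch below uses a random policy and the `imageio` library; it is an assumption about the mechanism, not `play.py`'s actual implementation.

```python
# Hedged sketch of rgb_array rendering: roll out an episode and save the frames as a GIF.
import gymnasium as gym
import imageio.v2 as imageio

env = gym.make("BipedalWalker-v3", render_mode="rgb_array")
obs, info = env.reset(seed=0)
frames = []
for _ in range(200):
    action = env.action_space.sample()           # replace with the trained policy's action
    obs, reward, terminated, truncated, info = env.step(action)
    frames.append(env.render())                  # rgb_array mode returns an HxWx3 uint8 array
    if terminated or truncated:
        break
env.close()
imageio.mimsave("rollout.gif", frames)           # write the collected frames to a GIF
```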
The `Humanoid-v5` environment:

This repository also includes tuned hyperparameters and pretrained checkpoints for `Humanoid-v5`.

Train:

```bash
python3 train.py --env Humanoid-v5
```

Visualize the pretrained policy:

```bash
python3 play.py --env Humanoid-v5 --run 06-06-25_19:08:43
```
This repository is designed to be easily extensible. To train a PPO agent on a new OpenAI Gym environment:
Create a new YAML file in the `config/` directory named exactly after your target environment ID (e.g., `LunarLander-v2.yaml`). Copy an existing config file (`BipedalWalker-v3.yaml` or `Humanoid-v5.yaml`) as a template. This file defines all hyperparameters, such as the learning rate, batch size, and the actor-critic network.
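Before writing the config, it can help to confirm that the environment ID resolves in your Gymnasium install and to inspect its observation/action spaces, so the network sizes in the YAML match. This is an optional, hypothetical check rather than part of the repository's workflow; the `LunarLander` ID is only the example used above, and some Gymnasium releases ship it as `LunarLander-v3` instead of `-v2`.

```python
# Optional sanity check: inspect the new environment's spaces before filling in the config.
import gymnasium as gym

env = gym.make("LunarLander-v2")
print(env.observation_space)   # e.g. Box(8,)     -> observation size fed to the actor-critic
print(env.action_space)        # e.g. Discrete(4) -> action head size / distribution type
env.close()
```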
If your environment requires a specialized neural network (e.g., a CNN for pixel-based inputs), see the sketch after this list:

- Add a new Actor-Critic class in `module.py`. Refer to the existing classes in that file to ensure the input/output interface matches.
- Update the `actor_critic` field in the environment's config file (from Section 2.1) to the name of your new class.
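Here is a hedged sketch of what such a pixel-input Actor-Critic might look like, assuming the repository's networks are PyTorch `nn.Module`s. The class name, constructor arguments, and return signature are illustrative only; match the interface of the existing classes in `module.py` instead.

```python
# Illustrative CNN Actor-Critic sketch (assumes PyTorch); not the interface defined in module.py.
import torch
import torch.nn as nn

class CNNActorCritic(nn.Module):
    def __init__(self, in_channels, act_dim):
        super().__init__()
        self.trunk = nn.Sequential(                        # shared convolutional feature extractor
            nn.Conv2d(in_channels, 32, 8, stride=4), nn.ReLU(),
            nn.Conv2d(32, 64, 4, stride=2), nn.ReLU(),
            nn.Conv2d(64, 64, 3, stride=1), nn.ReLU(),
            nn.Flatten(),
            nn.LazyLinear(512), nn.ReLU(),
        )
        self.actor_mean = nn.Linear(512, act_dim)          # mean of a Gaussian policy
        self.log_std = nn.Parameter(torch.zeros(act_dim))  # state-independent log std
        self.critic = nn.Linear(512, 1)                    # state-value head

    def forward(self, obs):
        h = self.trunk(obs)
        return self.actor_mean(h), self.log_std.exp(), self.critic(h)
```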
Start a new training session:

```bash
python3 train.py --env <env_name>
```
Resume from a previous session:

```bash
python3 train.py --env <env_name> --resume_run <train_start_time> --load_epoch <epoch_number>
```
Render the agent's performance, either live or saved as a GIF:

```bash
python3 play.py --env <env_name> --run <train_start_time> --epoch <epoch_number> --render_mode <human|rgb_array>
```