This repository implements PPO for continuous action spaces in C++, closely matching the CleanRL Python implementation. It also contains a minimal port of gymnasium to C++ with the functionality needed for PPO. Additionally, the repository provides the MuJoCo environments HalfCheetah-v5 and Humanoid-v4, as well as an environment for autonomous driving with the CARLA leaderboard 2.0.
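For illustration, here is a minimal sketch of what a gymnasium-style environment interface could look like in C++. The names (`Env`, `StepResult`, `reset`, `step`) are hypothetical; the actual classes in this repository may be named and structured differently.

```cpp
// Hypothetical sketch of a gymnasium-style environment interface in C++; the
// actual interface in this repository may differ.
#include <cstdint>
#include <torch/torch.h>

struct StepResult {
  torch::Tensor observation;   // next observation
  float reward = 0.0f;         // scalar reward for this step
  bool terminated = false;     // episode ended (e.g. the agent fell over)
  bool truncated = false;      // episode cut off by the time limit
};

class Env {
 public:
  virtual ~Env() = default;
  // Reset the environment and return the initial observation.
  virtual torch::Tensor reset(uint64_t seed) = 0;
  // Apply a continuous action and advance the simulation by one step.
  virtual StepResult step(const torch::Tensor& action) = 0;
};
```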
The repository also implements Asynchronous Collection Proximal Policy Optimization (AC-PPO), which parallelizes data collection via multithreading and CUDA streams, leading to faster training than PPO in non-homogeneous environments. The idea is described in Appendix B.1 of this paper.
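The collection side of this idea can be sketched as follows. This is a simplified, single-file illustration under the assumption that each worker thread owns its own environments; it is not the repository's actual implementation, which additionally runs policy inference for each worker on a separate CUDA stream.

```cpp
// Simplified sketch of asynchronous data collection: worker threads fill their own
// rollout buffers, so a slow environment never blocks the others.
#include <thread>
#include <vector>
#include <torch/torch.h>

int main() {
  const int num_workers = 4;
  const int num_steps = 128;
  const int obs_dim = 17;  // e.g. HalfCheetah observation size

  // One observation buffer per worker so threads never write to shared memory.
  std::vector<torch::Tensor> obs_bufs(num_workers);
  std::vector<std::thread> workers;
  for (int w = 0; w < num_workers; ++w) {
    workers.emplace_back([&, w]() {
      obs_bufs[w] = torch::zeros({num_steps, obs_dim});
      for (int t = 0; t < num_steps; ++t) {
        // Placeholder for policy inference + env.step(); the real code steps a
        // MuJoCo or CARLA environment here and also records actions, rewards, etc.
        obs_bufs[w][t] = torch::randn({obs_dim});
      }
    });
  }
  for (auto& t : workers) t.join();  // synchronize before the PPO update

  // Stack the per-worker slices into one batch for the optimization phase.
  torch::Tensor obs = torch::stack(obs_bufs);  // [num_workers, num_steps, obs_dim]
  return 0;
}
```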
To run training and evaluation with the CARLA leaderboard 2.0, you also need to download and set up the CaRL repo.
The most convenient way to compile and run the program is to build the Singularity container and run the code inside it. Building the container can take a while, depending on your CPU, because it builds several libraries, and it needs 12 GB of space. I have tested the code with singularity-ce version 3.11, but other versions should work as well.
cd tools
sudo singularity build ppo_cpp.sif make_singularity_image.def
Alternatively, you can set up your own machine by installing all necessary libraries; have a look at make_singularity_image.def to see how. This often takes some time and you will face various issues, so it is only recommended for experienced C++ users.
The code can be compiled via cmake:
cd /path/to/ppo.cpp
singularity exec tools/ppo_cpp.sif cmake -B build -DCMAKE_BUILD_TYPE=Release -G "Ninja"
singularity exec tools/ppo_cpp.sif cmake --build build -j$(nproc)
To train CARLA models, have a look at the training scripts in CaRL.
Generally, you need to build the container, compile the program, and then set the paths correctly:
--ppo_cpp_install_path /path/to/folder_with_binaries
--cpp_singularity_file_path /path/to/ppo_cpp.sif
To run the MuJoCo models, cd into the repository's directory and run either of these two commands. The environment can be set via the --env_id argument; Humanoid-v4 and HalfCheetah-v5 are currently supported. Other hyperparameters can similarly be set via the program arguments.
cd /path/to/ppo.cpp
singularity exec --nv tools/ppo_cpp.sif build/ppo_continuous_action --env_id Humanoid-v4
singularity exec --nv tools/ppo_cpp.sif build/ac_ppo_continuous_action --env_id HalfCheetah-v5
LibTorch does not natively support multi-GPU training. We implemented the multi-GPU communication ourselves using the backend code of torch-fort. To use multiple GPUs for training, the code needs to be started with mpirun (-n = number of GPUs), similar to how PyTorch DDP is started with torchrun:
singularity exec --nv tools/ppo_cpp.sif mpirun -n 1 --bind-to none build/ac_ppo_continuous_action --env_id HalfCheetah-v5
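Conceptually, this follows the standard data-parallel pattern: every rank computes gradients on its own batch, the gradients are averaged across ranks, and each rank applies the same optimizer step, mirroring what PyTorch DDP does. The sketch below illustrates the idea with a plain MPI_Allreduce over CPU copies of the gradients; it is a simplified illustration (the function name is hypothetical and float32 parameters are assumed), not the repository's torch-fort based implementation.

```cpp
// Illustrative data-parallel gradient averaging with MPI. Call after
// loss.backward() and before optimizer.step(). Assumes float32 parameters and a
// non CUDA-aware MPI, hence the CPU copies.
#include <mpi.h>
#include <torch/torch.h>

void allreduce_gradients(torch::nn::Module& model) {
  int world_size = 1;
  MPI_Comm_size(MPI_COMM_WORLD, &world_size);
  for (auto& param : model.parameters()) {
    if (!param.grad().defined()) continue;
    // Contiguous CPU copy of the gradient so plain MPI can reduce it.
    torch::Tensor buf = param.grad().detach().cpu().contiguous().clone();
    MPI_Allreduce(MPI_IN_PLACE, buf.data_ptr<float>(),
                  static_cast<int>(buf.numel()), MPI_FLOAT, MPI_SUM,
                  MPI_COMM_WORLD);
    buf.div_(world_size);             // average instead of sum
    param.mutable_grad().copy_(buf);  // write back (handles GPU parameters)
  }
}
```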
We implemented the MuJoCo environments mainly to check that the implementation is correct. Below we compare runs of ppo_continuous_action.cpp with CleanRL's ppo_continuous_action.py. There are some numerical differences, but the runs are very close by RL standards.
Interestingly, the C++ implementation is up to 72% faster in SPS (steps per second) than the Python implementation on the same hardware, without any specific optimizations. All runs are in CPU mode, which is faster than GPU mode for these tiny MuJoCo networks in both languages; the GPU default in ppo_continuous_action.py is suboptimal. For the larger CNN used in the CaRL model, the GPU speedup is enough to outweigh the GPU-CPU communication overhead.
The code implements the DD-PPO preemption trick, which ends rollout collection early on straggler workers once most workers have finished. I had some runs with poor performance when using the preemption trick, which is why I disabled it. The trick itself could have caused the performance degradation, but there might also be a bug, so I do not recommend using it right now (use_dd_ppo_preempt=0).
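For context, a simplified single-process illustration of the preemption idea is shown below; the actual feature coordinates MPI ranks rather than threads.

```cpp
// Simplified illustration of DD-PPO-style preemption: once a fraction of workers
// has finished its rollout, the remaining workers stop early so the synchronized
// optimization phase is not delayed by stragglers.
#include <atomic>
#include <chrono>
#include <cstdio>
#include <thread>
#include <vector>

int main() {
  const int num_workers = 8;
  const int num_steps = 128;
  const double preempt_threshold = 0.6;  // stragglers stop once 60% of workers are done
  std::atomic<int> finished{0};

  std::vector<std::thread> workers;
  for (int w = 0; w < num_workers; ++w) {
    workers.emplace_back([&, w]() {
      int t = 0;
      for (; t < num_steps; ++t) {
        // Placeholder for env.step(); later workers are artificially slower here.
        std::this_thread::sleep_for(std::chrono::microseconds(50 * (w + 1)));
        if (finished.load() >= static_cast<int>(preempt_threshold * num_workers)) {
          break;  // preempted: keep the shorter rollout and proceed to the update
        }
      }
      finished.fetch_add(1);
      std::printf("worker %d collected %d steps\n", w, t);
    });
  }
  for (auto& th : workers) th.join();  // all workers join before the synchronized PPO update
  return 0;
}
```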
The CARLA training code seems to have higher peak GPU memory usage than our PyTorch version. My training runs did not max out my GPU memory, so I have not investigated this issue further. There might be some PyTorch DDP memory optimization that is missing from our custom implementation.
The original code in this repository is provided under the Civil-M license, which is a variant of the MIT license that bans dual use. The license contains a partial copyleft that requires derivative works to include the civil clause in their license. For further information, see the accompanying documentation on Civil Software Licenses.
If you find the repo useful, please consider giving it a star 🌟. To cite the paper, please use the following BibTeX:
@article{Jaeger2025ArXiv,
  author  = {Bernhard Jaeger and Daniel Dauner and Jens Beißwenger and Simon Gerstenecker and Kashyap Chitta and Andreas Geiger},
  title   = {CaRL: Learning Scalable Planning Policies with Simple Rewards},
  year    = {2025},
  journal = {arXiv.org},
  volume  = {2504.17838},
}
The original code in this repository was written by Bernhard Jaeger.
Code like this is built on the shoulders of many other open-source repositories. In particular, we would like to thank the following repositories for their contributions:
We also thank the creators of the numerous libraries we use. Complex projects like this would not be feasible without your contributions.