Beyond Fixed Morphologies: Learning Graph Policies with Trust Region Compensation in Variable Action Spaces
Official code release for the paper:
Beyond Fixed Morphologies: Learning Graph Policies with Trust Region Compensation in Variable Action Spaces
Anonymous authors (double-blind review)
Trust region methods such as TRPO and PPO are a cornerstone of modern reinforcement learning, offering stability and strong performance in continuous control. At the same time, there is growing interest in morphological generalization — enabling policies to operate across agents with different kinematic structures. Graph-based policy architectures naturally encode these structural differences, but the impact of variable action space dimensionality on trust region optimization remains unclear.
In this work, we present a theoretical and empirical study of how action space variation influences the optimization landscape under KL-divergence constraints (TRPO) and clipping penalties (PPO). We introduce a dimension compensation mechanism that provides fair policy updates across varying action space dimensions.
Experiments in the Gymnasium Swimmer environment, where morphology can be systematically altered without changing the underlying task, show that trust region compensation (TRC) improves stability and generalization in graph-based policies.
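For intuition, the sketch below is illustrative only and is not the exact compensation mechanism from the paper (see `src/agents/` for the actual implementation). It shows how the summed KL divergence of a diagonal Gaussian policy scales with the action dimension, and how one simple per-dimension normalization would yield comparable trust region sizes across morphologies:

```python
# Illustrative only: NOT the paper's exact TRC formulation, just a toy example
# of why action dimensionality matters for KL-based trust regions.
import torch

def diag_gaussian_kl(mu_old, std_old, mu_new, std_new):
    """KL(old || new) for diagonal Gaussians, summed over action dimensions."""
    var_old, var_new = std_old.pow(2), std_new.pow(2)
    kl_per_dim = (
        torch.log(std_new / std_old)
        + (var_old + (mu_old - mu_new).pow(2)) / (2.0 * var_new)
        - 0.5
    )
    return kl_per_dim.sum(dim=-1)  # grows roughly linearly with the action dim

torch.manual_seed(0)
for action_dim in (2, 4, 8):  # e.g. swimmers with different numbers of joints
    mu_old = torch.zeros(256, action_dim)
    mu_new = mu_old + 0.05 * torch.randn_like(mu_old)  # same per-dim perturbation
    std = torch.full_like(mu_old, 0.5)
    kl = diag_gaussian_kl(mu_old, std, mu_new, std)
    # One possible compensation (a hedged example, not necessarily the paper's
    # choice): constrain the mean per-dimension KL instead of the summed KL.
    print(action_dim, kl.mean().item(), (kl / action_dim).mean().item())
```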
- Implementation of Graph Neural Network policies for variable action spaces (see the conceptual sketch below).
- Trust Region Dimension Compensation for stability in policy optimization.
- Support for morphology-varying RL tasks.
- Training and evaluation scripts for all experiments in the paper.
- Modular design for easy extension to custom environments.
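For readers new to graph-based policies, here is a minimal conceptual sketch (plain PyTorch, hypothetical class name, not the networks in `src/models/`) of how sharing parameters across nodes lets a single policy emit one action per joint, whatever the morphology:

```python
# Conceptual sketch only -- the actual networks live in src/models/ and differ.
import torch
import torch.nn as nn

class TinyGraphPolicy(nn.Module):
    """One round of message passing over the kinematic graph, then a shared
    per-node head that emits one action per actuated joint. Since all weights
    are shared across nodes, the same policy works for swimmers with different
    numbers of links, i.e. different action dimensions."""

    def __init__(self, node_feat_dim: int, hidden_dim: int = 64):
        super().__init__()
        self.encode = nn.Linear(node_feat_dim, hidden_dim)
        self.message = nn.Linear(hidden_dim, hidden_dim)
        self.head = nn.Linear(hidden_dim, 1)  # one action mean per node

    def forward(self, node_feats: torch.Tensor, adj: torch.Tensor) -> torch.Tensor:
        # node_feats: (num_nodes, node_feat_dim), adj: (num_nodes, num_nodes)
        h = torch.relu(self.encode(node_feats))
        h = torch.relu(h + adj @ self.message(h))  # aggregate neighbor messages
        return self.head(h).squeeze(-1)            # (num_nodes,) action means

policy = TinyGraphPolicy(node_feat_dim=6)
for num_joints in (3, 5):  # different morphologies, same weights
    x = torch.randn(num_joints, 6)
    adj = torch.diag(torch.ones(num_joints - 1), 1) + torch.diag(torch.ones(num_joints - 1), -1)
    print(policy(x, adj).shape)  # torch.Size([3]) / torch.Size([5])
```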
Note:
• All experiments in the paper were conducted on Ubuntu 20.04.6 LTS (CUDA 12.2) with Python 3.11.
• All code builds on Stable-Baselines3, and Hydra is used for configuration management.
git clone https://github.com/jr-robotics/MorphologicalPPO.git
cd MorphologicalPPO
conda create --name venv python=3.11
conda activate venv
pip install -r requirements.txt
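A quick sanity check of the installation (this assumes requirements.txt pulls in PyTorch, Gymnasium, and the MuJoCo bindings needed for Swimmer-v5):

```python
# Verify that the core dependencies are importable and Swimmer-v5 can be built.
import torch
import gymnasium as gym

print("CUDA available:", torch.cuda.is_available())

env = gym.make("Swimmer-v5")
obs, info = env.reset(seed=0)
obs, reward, terminated, truncated, info = env.step(env.action_space.sample())
print("obs shape:", obs.shape, "action space:", env.action_space.shape)
env.close()
```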
MorphologicalPPO/
├── config/ # Hydra configuration files
│ ├── agent/ # Agent-specific settings
│ ├── callbacks/ # Callbacks during training/evaluation
│ ├── env/ # Environment definitions and parameters
│ ├── hparams_search/ # Hyperparameter search configs
│ ├── learner/ # Learning wrapper configs
│ ├── policy/ # Policy architecture and parameters
│ └── train.yaml # Main training configuration
├── src/ # Core source code
│ ├── agents/ # Action-space-agnostic PPO implementation
│ ├── common/ # Supporting code for buffers, callbacks, etc.
│ ├── envs/ # Environment adaptations based on Swimmer-v5
│ ├── models/ # GNN-based actor-critic policy networks
│ ├── utils/ # Helpers for instantiation and postprocessing
│ └── wrappers/ # Code wrappers
├── eval_best.py # Evaluates best snapshots and renders envs
├── inference.py # Inference script evaluating policy snapshots
├── train.py # Main training entry point
├── requirements.txt # Python dependencies
├── README.md # This file
└── LICENSE # License information
All commands assume you are running from the repository root with the virtual environment activated. Hydra automatically creates output directories under logs/runs or logs/multiruns.
Run training with the default train.yaml configuration:
python train.py
Training with alternative configurations is done via:
python train.py cfg=your_config
Hyperparameter search is done via an experiment config. For example,
python train.py -m experiment=hparams_var_flex_ppo
runs the hyperparameter search described in the paper for the PPO variant that compensates for the action space dimensions. Other agents can easily be loaded by either creating a dedicated experiment file or overriding the agent key on the command line:
python train.py -m experiment=hparams_var_flex_ppo agent=sb3_flex_ppo
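If you prefer composing configurations from Python rather than the command line, Hydra's compose API can be used. The following is a sketch, assuming the primary config is config/train.yaml as in the project layout above and that the override keys mirror the CLI examples:

```python
# Sketch: compose the training config programmatically with Hydra's compose API.
from hydra import initialize, compose
from omegaconf import OmegaConf

# config_path is resolved relative to this file; adjust if run from elsewhere.
with initialize(config_path="config", version_base=None):
    cfg = compose(config_name="train", overrides=["agent=sb3_flex_ppo"])
    print(OmegaConf.to_yaml(cfg))
```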
Run
python inference.py --run logs/multiruns/../<run_dir>
to evaluate all checkpoints found in <run_dir>/checkpoints/periodic and store the mean episodic reward in <run_dir>/inference/evaluation_results.csv.
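The resulting CSV can be inspected directly, for example with pandas (the exact column layout is determined by inference.py):

```python
# Peek at the aggregated inference results for a single run directory.
import pandas as pd

df = pd.read_csv("<run_dir>/inference/evaluation_results.csv")
print(df.head())
```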
Running the command below evaluates the best policy (<run_dir>/checkpoints/best_model/best_model.zip) and stores rendered images in <run_dir>/images along with an mp4 video.
python render_policy.py --run logs/multiruns/../<run_dir>
Note that this requires <run_dir>/inference/config.yaml to exist.
To reproduce the results from the paper, run:
python train.py -m experiment=multiseed_grid agent=sb3_flex_ppo,sb3_flex_varppo # vary seeds
python inference.py --run logs/multiruns/inference/<timestamp>/0/ # inference agent 1
python inference.py --run logs/multiruns/inference/<timestamp>/1/ # inference agent 2
python visualize_inference.py --run logs/multiruns/multiseed_grid/<timestamp> # evaluates and plots