Official implementation of our IEEE RAL 2025 paper
Authors: Manav Vora, Ilan Shomorony, Melkior Ornik
We study capacity- and budget-constrained multi-agent MDPs (CB-MA-MDPs), a class that captures many maintenance and scheduling tasks in which each agent can irreversibly fail and a planner must decide (i) when to apply a restorative action and (ii) which subset of agents to treat in parallel. The global budget limits the total number of restorations, while the capacity constraint bounds the number of simultaneous actions, turning naïve dynamic programming into a combinatorial search that scales exponentially with the number of agents.
We propose a two-stage solution that remains tractable for large systems:
- LSAP-based Grouping: Partitions agents into disjoint sets maximizing diversity in expected time-to-failure
- Meta-trained PPO: Solves each sub-MDP with transfer learning for rapid convergence
We validate our approach on industrial robot repair scheduling with limited technicians and budget. Results demonstrate that our method outperforms baselines in maximizing system uptime, particularly for large team sizes, with scalability confirmed for 1000+ agents.
Figures: distribution of survival times, performance across budget levels, and computational time heatmap.
- Python 3.8+
- PyTorch (for PPO)
- Stable-Baselines3
- See `requirements.txt` for full dependencies
# Clone repository
git clone https://github.com/yourusername/RAL-2025---Capacity-and-Budget-Constrained-Multi-Agent-RL.git
cd RAL-2025---Capacity-and-Budget-Constrained-Multi-Agent-RL
# Create virtual environment
python -m venv venv
source venv/bin/activate # On Windows: venv\Scripts\activate
# Install dependencies
pip install -r requirements.txt

├── env/ # Environment Implementation
│ ├── env_multi_repair.py # CB-MA-MDP environment (Gymnasium)
│ ├── robot.py # Agent with Weibull degradation
│ ├── complete_components_data.csv # Real component failure data
│ └── all_robots_ttfs_*.npy # Pre-computed ETTF statistics
│
└── PPO/ # Algorithms and Experiments
│
├── 🌟 PROPOSED METHOD (LSAP + Meta-PPO)
│ ├── lsap_partitioning.py # LSAP grouping algorithm
│ ├── lsap_partitioning_policy.py # Main LSAP+PPO (stores results)
│ ├── ppo_new.py # PPO training script
│ ├── meta_ppo_policy.py # Meta-learning policy
│ └── ppo_policy.py # Vanilla PPO evaluation
│
├── 📊 BASELINE METHODS
│ ├── baseline_partitioning_policy.py # Random Partition + Meta-PPO (RP-PPO)
│ ├── genetic_algorithm_2.py # Genetic Algorithm (GA)
│ ├── auction_method.py # Auction Heuristic
│ ├── greedy_baseline.py # Greedy Heuristic
│ ├── ilp.py # Integer Linear Programming (ILP)
│ └── milp_partitioning_policy.py # MILP Partition + Meta-PPO (MP-PPO)
│
└── 📈 COMPARISON & ANALYSIS
├── compare_repair_policies.py # Main comparison (generates figures)
└── compare_grouping.py # Grouping strategy comparison
Note: Only source code and essential data files are tracked. Generated results (.npy), models (.pth), and plots (.png) are excluded via .gitignore.
First, train the PPO policy that will be used by the partitioning methods:
cd PPO
python ppo_new.py

This trains the meta-PPO policy on diverse agent configurations.
cd PPO
python lsap_partitioning_policy.py

This will:
- Perform LSAP-based partitioning
- Apply the trained meta-PPO policy to each group
- Store results as `.npy` files (e.g., `lsap_partitioning_policy_*_survival_times.npy`)
Default configuration: 10 robots, 3 repairmen, budget=30
cd PPO
python lsap_partitioning_policy.py --num-robots 100 --num-repairmen 30 --repair-budget 1000

cd PPO
python ppo_policy.py --num-robots 100 --num-repairmen 30 --repair-budget 1000

cd PPO
python baseline_partitioning_policy.py --num-robots 100 --num-repairmen 30 --repair-budget 1000

cd PPO
python genetic_algorithm_2.py --num-robots 100 --num-repairmen 30 --repair-budget 1000

cd PPO
python auction_method.py --num-robots 100 --num-repairmen 30 --repair-budget 1000

cd PPO
python ilp.py --num-robots 10 --num-repairmen 3 --repair-budget 30

Note: The ILP is only tractable for N ≤ 20 robots due to computational complexity.
cd PPO
python milp_partitioning_policy.py --num-robots 100 --num-repairmen 30 --repair-budget 1000

After running experiments, generate comparison figures:
cd PPO
python compare_repair_policies.py

This reproduces the paper figures comparing all methods across different scales.
Stage 1: LSAP-based Partitioning
- Compute expected time-to-failure (ETTF) for each agent using Weibull parameters
- Formulate a Linear Sum Assignment Problem (LSAP) to partition agents into `r` diverse groups
- Allocate the budget proportionally based on group ETTF
- Complexity: O(N³) using the Hungarian algorithm (a minimal sketch follows below)
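A minimal, self-contained sketch of the grouping idea is below, assuming Weibull-distributed time-to-failure and using SciPy's Hungarian-algorithm solver; the cost construction and function names are illustrative and differ from the exact formulation in `lsap_partitioning.py`.

```python
# Hypothetical sketch of ETTF computation and LSAP-based grouping (the cost
# construction here is illustrative, not the exact formulation in
# lsap_partitioning.py). Agents are assigned to group "slots" with the
# Hungarian algorithm so that every group spans the full ETTF spectrum.
import numpy as np
from scipy.special import gamma
from scipy.optimize import linear_sum_assignment

def expected_ttf(shape, scale):
    # Mean of a Weibull(shape, scale) distribution.
    return scale * gamma(1.0 + 1.0 / shape)

def lsap_group(ettfs, num_groups):
    ettfs = np.asarray(ettfs, dtype=float)
    n = len(ettfs)                                   # assumes n divisible by num_groups
    rank = np.empty(n, dtype=int)
    rank[np.argsort(ettfs)] = np.arange(n)           # ETTF rank of each agent
    # Column j is a slot; slots in the same "band" share a target rank, and the
    # bands cover ranks 0, num_groups, 2*num_groups, ... across the population.
    target_rank = np.repeat(np.arange(n // num_groups) * num_groups, num_groups)
    cost = np.abs(rank[:, None] - target_rank[None, :])
    rows, cols = linear_sum_assignment(cost)         # O(n^3) Hungarian solve
    groups = [[] for _ in range(num_groups)]
    for agent, col in zip(rows, cols):
        groups[col % num_groups].append(int(agent))
    return groups

ettfs = expected_ttf(np.full(10, 1.5), np.linspace(50, 150, 10))
print(lsap_group(ettfs, num_groups=2))
```

Because each group receives one agent from every ETTF rank band, within-group ETTF diversity is high, which is the property the LSAP grouping targets.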
Stage 2: Meta-trained PPO
- Pre-train PPO policy on diverse synthetic agent configurations
- Fine-tune policy for each group via transfer learning
- Deploy policies independently per group in parallel
- Complexity: O(r × PPO_training); a fine-tuning sketch follows below
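A minimal sketch of the per-group fine-tuning step with Stable-Baselines3 is below; the checkpoint name and environment factory are assumptions, not the repository's actual interface (`ppo_new.py` and `meta_ppo_policy.py` handle this in full).

```python
# Minimal sketch of per-group fine-tuning with Stable-Baselines3, assuming a
# meta-trained checkpoint "meta_ppo.zip" and a per-group environment factory
# make_group_env(); both names are assumptions, not the repository's API.
from stable_baselines3 import PPO

def finetune_for_group(group_env, checkpoint="meta_ppo.zip", steps=50_000):
    model = PPO.load(checkpoint, env=group_env)   # warm-start from the meta policy
    model.learn(total_timesteps=steps)            # short fine-tuning run for this group
    return model

# Fine-tuned policies can then be deployed independently, one per group:
# group_models = [finetune_for_group(make_group_env(g)) for g in groups]
```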
Overall: Scales to 1000+ agents, outperforms baselines while remaining computationally tractable.
Our paper compares LSAP + Meta-PPO against six baselines:
- Integer Linear Programming (ILP) – Exact integer linear program solved with Gurobi (optimal for small N)
- Vanilla PPO – Single PPO network trained on full CB-MA-MDP
- Genetic Algorithm (GA) – Rank selection, two-point crossover, bit-flip mutation
- Auction Heuristic – Agents bid based on failure risk; top-r bids receive repairs (see the sketch after this list)
- Random Partition + Meta-PPO (RP-PPO) – Random grouping followed by meta-PPO
- MILP Partition + Meta-PPO (MP-PPO) – Diversity-maximizing MILP with LP relaxation + rounding
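For illustration, a toy sketch of the auction idea follows: each agent "bids" a health-based failure-risk proxy and the top-r bidders receive a repair, subject to the remaining budget. The names and the bid function are assumptions, not `auction_method.py`'s implementation.

```python
# Toy sketch of the auction heuristic idea (bids = health-based failure-risk
# proxy, top-r bids repaired); illustrative only, not auction_method.py's code.
import numpy as np

def auction_repairs(healths, num_repairmen, budget_left):
    bids = 1.0 - np.asarray(healths, dtype=float) / 100.0   # lower health -> higher bid
    k = min(num_repairmen, budget_left)
    winners = np.argsort(bids)[::-1][:k]                    # top-r bids win a repair
    action = np.zeros(len(healths), dtype=int)
    action[winners] = 1
    return action

print(auction_repairs([80, 15, 60, 5], num_repairmen=2, budget_left=10))  # repairs agents 1 and 3
```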
State Space: Per-agent health h_i ∈ [0, 100] + available budget
Action Space:
- Single repairman: Discrete(N+1) – which agent to repair, or none
- Multiple repairmen: MultiBinary(N) with the constraint sum(action) ≤ r (see the space sketch below)
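A minimal Gymnasium declaration of these spaces might look like the following; the names are illustrative rather than the exact attributes of `env_multi_repair.py`.

```python
# Minimal Gymnasium declaration of the observation and action spaces above;
# names are illustrative, not the exact attributes of env_multi_repair.py.
import numpy as np
from gymnasium import spaces

N, r = 10, 3  # number of agents, number of repairmen

observation_space = spaces.Box(
    low=np.zeros(N + 1, dtype=np.float32),
    high=np.array([100.0] * N + [np.inf], dtype=np.float32),  # N health values + remaining budget
)
single_repair_action_space = spaces.Discrete(N + 1)   # index N means "repair no one"
multi_repair_action_space = spaces.MultiBinary(N)     # a_i = 1 -> repair agent i
# The capacity constraint sum(action) <= r is enforced by the environment,
# e.g. by masking or truncating actions that request too many repairs.
```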
Dynamics:
- Weibull degradation: `P(h' | h, a=0) ~ Weibull(shape, scale)`
- Repair: probabilistic restoration to higher health values, `P(h' | h, a=1)`
Objective: Maximize system survival time (timesteps until first agent failure)
Constraints:
- Capacity: at most `r` simultaneous repairs per timestep
- Budget: at most `B` total repairs over the horizon `H` (a simplified step sketch follows below)
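The sketch below shows one simplified environment step under these dynamics and constraints, with hypothetical Weibull and repair parameters; it is illustrative only and not the logic of `env_multi_repair.py`.

```python
# Simplified sketch of one environment step under the dynamics and constraints
# above, with hypothetical Weibull/repair parameters; illustrative only, not
# the logic of env_multi_repair.py.
import numpy as np

rng = np.random.default_rng(0)

def step(healths, action, budget_left, r, shape=1.5, scale=8.0):
    healths = np.asarray(healths, dtype=float).copy()
    action = np.asarray(action, dtype=int).copy()
    # Capacity and budget: allow at most min(r, budget_left) repairs this step,
    # keeping the most degraded requested agents if the action asks for more.
    allowed = min(r, budget_left)
    if action.sum() > allowed:
        requested = np.flatnonzero(action)
        keep = requested[np.argsort(healths[requested])[:allowed]]
        action[:] = 0
        action[keep] = 1
    repaired = action == 1
    # Repair: probabilistic restoration toward full health.
    healths[repaired] = np.minimum(100.0, healths[repaired] + rng.uniform(50, 100, repaired.sum()))
    # Weibull degradation for agents that are not repaired this step.
    healths[~repaired] -= rng.weibull(shape, (~repaired).sum()) * scale
    budget_left -= int(action.sum())
    failed = bool((healths <= 0).any())   # survival time ends at the first failure
    return healths, budget_left, failed

h, b, done = step(np.full(4, 60.0), np.array([1, 0, 0, 1]), budget_left=5, r=2)
```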
Key settings in env/env_multi_repair.py:
max_steps = 100 # Episode horizon
initial_health = 100 # Starting health for all agents
failure_threshold = 0 # Agent fails when health ≤ 0

Settings in PPO/ppo_new.py:
learning_rate = 3e-4
n_steps = 2048
batch_size = 64
n_epochs = 10
gamma = 0.99
gae_lambda = 0.95

If you use this code or build upon this work, please cite:
@article{vora2025capacity,
title={Capacity-Aware Planning and Scheduling in Budget-Constrained Multi-Agent MDPs: A Meta-RL Approach},
author={Vora, Manav and Shomorony, Ilan and Ornik, Melkior},
journal={IEEE Robotics and Automation Letters},
year={2025},
publisher={IEEE}
}

This project is licensed under the MIT License - see the LICENSE file for details.
For questions or issues:
- Email: mkvora2@illinois.edu
- Paper: IEEE Xplore
- Issues: Open an issue on GitHub
- Component failure data from industrial maintenance datasets
- Built with Gymnasium, Stable-Baselines3, and PyTorch
- Supported by [funding agencies/institutions]


