A medical machine learning benchmark platform for evaluating automated machine learning agents on realistic healthcare tasks.
ReX-MLE provides a framework for:
- Running multiple ML agents (RD-Agent, ML-Master, etc.) on standardized medical ML challenges
- Preparing and managing challenge datasets
- Evaluating agent submissions against benchmark metrics
- Analyzing agent strategies and performance
Prerequisites:
- Miniconda or Anaconda installed
- Bash shell
- Python 3.11+
Run the setup script to create the required conda environments:
./setup.sh
This installs:
- rexmle: The evaluator environment for challenge management and grading
- rexagent: The base agent environment for running agents
After running ./setup.sh, the conda environments are ready to use. Make sure both rexmle and rexagent environments are properly installed before proceeding.
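One way to confirm that both environments exist is to inspect the output of `conda env list --json`. The following is a minimal sketch, not part of ReX-MLE: the helper names `missing_envs` and `check_conda_envs` are illustrative, and the second function assumes `conda` is available on your PATH.

```python
import json
import subprocess
from pathlib import Path

# Environment names created by ./setup.sh, as described above.
REQUIRED_ENVS = ("rexmle", "rexagent")


def missing_envs(conda_json: str, required=REQUIRED_ENVS) -> list[str]:
    """Return required environment names absent from `conda env list --json` output."""
    env_paths = json.loads(conda_json).get("envs", [])
    # Each entry is a filesystem path; the environment name is its last component.
    installed = {Path(p).name for p in env_paths}
    return [name for name in required if name not in installed]


def check_conda_envs() -> list[str]:
    """Query conda (assumed to be on PATH) and report any missing environments."""
    result = subprocess.run(
        ["conda", "env", "list", "--json"],
        capture_output=True,
        text=True,
        check=True,
    )
    return missing_envs(result.stdout)
```

Calling `check_conda_envs()` returns an empty list when both environments are present; any names it returns are the ones that still need to be installed.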
Before running agents on a challenge, you need to prepare the challenge data.
- Activate the rexmle environment:
conda activate rexmle
- Change to the rex-mle directory:
cd ./rex-mle
- List available challenges:
python -m rexmle.cli list
- View challenge information:
python -m rexmle.cli info CHALLENGE_NAME
- Prepare the challenge:
python -m rexmle.cli prepare CHALLENGE_NAME
For the ML-Master agent, install additional dependencies:
bash setup/setup_mlmaster.sh
After preparing challenges, set up the RD-Agent data directory with symlinks to the challenge data:
cd rex-mle
python setup_rdagent_data.py
To run agents, use the scripts in rex-mle/ (e.g., run_aide.sh, run_mlmaster.sh, run_rdagent.sh). Each script supports configurable model variants and time limits. All scripts assume you are already in a GPU-enabled compute environment.
Environment variables (including API credentials) should be set in a .env file in the project root before running agents.
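By convention, a .env file is a list of KEY=VALUE lines. The sketch below parses such a file with the standard library only, to illustrate the format; it is an assumption about the layout rather than a description of how the run scripts read the file, and `EXAMPLE_API_KEY` in the comment is a placeholder, not a variable name required by ReX-MLE.

```python
import os
from pathlib import Path


def parse_env_text(text: str) -> dict[str, str]:
    """Parse KEY=VALUE lines, skipping blanks, comments, and lines without '='."""
    values = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Trim whitespace and any quote characters around the value.
        values[key.strip()] = value.strip().strip("'\"")
    return values


def load_env_file(path: str = ".env") -> dict[str, str]:
    """Read a .env file and export its entries without overriding existing variables."""
    values = parse_env_text(Path(path).read_text(encoding="utf-8"))
    for key, value in values.items():
        os.environ.setdefault(key, value)
    return values


# Example file content (placeholder name and value):
#   EXAMPLE_API_KEY='abc123'
```

Using `os.environ.setdefault` means a variable already exported in the shell takes precedence over the file.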
To run your own agent, create a similar folder in rex-mle/agents/ and implement a startup script (e.g., run_agent_*.py). Follow the pattern of existing agents (AIDE, ML-Master, RD-Agent) for consistency with the evaluation framework.
Once an agent completes and generates a submission, you can grade the results.
- Create a JSONL file listing submission paths (see example_submission.jsonl for format):
{"submission_dir": "/path/to/submission/directory"}
- Grade the submissions:
cd rex-mle
python -m rexmle.cli grade-batch --submission ./your_submission.jsonl --output-dir ./metrics --suffix your_suffix
The grading output, including evaluation metrics, is saved to the specified output directory.
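The JSONL file from the first grading step can be generated with a short script when there are many submissions. This is a minimal sketch that assumes the only required field is `submission_dir`, as in the example line above; the function names and the example paths are illustrative.

```python
import json
from pathlib import Path


def submission_lines(submission_dirs) -> str:
    """Build JSONL text with one {"submission_dir": ...} object per line."""
    return "".join(
        json.dumps({"submission_dir": str(directory)}) + "\n"
        for directory in submission_dirs
    )


def write_submission_list(submission_dirs, output_path) -> Path:
    """Write the JSONL text to output_path and return the path."""
    output = Path(output_path)
    output.write_text(submission_lines(submission_dirs), encoding="utf-8")
    return output


# Hypothetical usage:
#   write_submission_list(["/path/to/run1", "/path/to/run2"], "your_submission.jsonl")
```

The resulting file can then be passed to the `--submission` option of the grade-batch command.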
ReX-MLE/
├── setup.sh # Main setup script
├── setup/ # Setup scripts for specific components
│ └── setup_mlmaster.sh # ML-Master specific setup
├── rex-mle/ # Core evaluation and challenge management
│ ├── rexmle/ # ReX-MLE package
│ ├── agents/ # Agent implementations
│ │ ├── rdagent/ # RD-Agent configuration
│ │ ├── ml-master/ # ML-Master configuration
│ │ └── ... # Other agents
│ ├── challenges/ # Challenge definitions and data
│ └── example_submission.jsonl
├── strategies/ # Strategy analysis and documentation
└── README.md
After grading submissions, you can score agent logs against the 13 challenge strategies using the scripts in the strategies/ folder. Each agent (AIDE, ML-Master, RD-Agent) has its own preprocessing pipeline:
cd strategies/
python analyze_strategies.py --batch-dir <preprocessed-logs>
python aggregate_strategy_scores.py --scores-dir <scores-dir> --output <output>.json
See strategies/README.md for detailed instructions for each agent type.
Each agent directory contains its own documentation for specific configuration and usage.
If you use ReX-MLE in your work, please cite:
@article{kenia2025rexmleautonomousagentbenchmark,
title={ReX-MLE: The Autonomous Agent Benchmark for Medical Imaging Challenges},
author={Kenia, Roshan and Zhang, Xiaoman and Rajpurkar, Pranav},
journal={arXiv preprint arXiv:2512.17838},
year={2025},
eprint={2512.17838},
archivePrefix={arXiv},
primaryClass={cs.CV},
url={https://arxiv.org/abs/2512.17838}
}
- Ensure you have sufficient disk space for challenge data and agent outputs
- Some challenges may require significant computational resources
- Check individual agent directories for specific requirements and troubleshooting
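Free disk space can be checked before preparing challenges or running agents. The sketch below uses `shutil.disk_usage` from the standard library; the 100 GiB threshold in the usage comment is an arbitrary placeholder, since actual requirements depend on the challenge.

```python
import shutil


def free_gib(path: str = ".") -> float:
    """Return the free space, in GiB, on the filesystem containing path."""
    return shutil.disk_usage(path).free / 1024**3


def has_free_space(path: str, required_gib: float) -> bool:
    """Check whether at least required_gib GiB are free at path."""
    return free_gib(path) >= required_gib


# Hypothetical usage (threshold is a placeholder, not a ReX-MLE requirement):
#   if not has_free_space("./rex-mle", 100):
#       print(f"Only {free_gib('./rex-mle'):.1f} GiB free")
```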