The official code for "M^3Builder: A Multi-Agent System for Automated Machine Learning in Medical Imaging"
In this paper, we present M^3Builder, an agentic system for automating machine learning in medical imaging tasks. Our approach combines an efficient medical imaging ML workspace with free-text descriptions of datasets, code templates, and interaction tools. Additionally, we propose a multi-agent collaborative system designed specifically for AI model building, with four role-playing LLMs: Task Manager, Data Engineer, Module Architect, and Model Trainer. In benchmarking against 5 SOTA agentic systems across 14 radiology task-specific datasets, M^3Builder achieves a 94.29% model building success rate, with Claude3.7-Sonnet standing out among 7 SOTA LLMs.
Overview of our system M^3Builder, from the user’s free-text request to model delivery. The system integrates user requirements and a workspace with candidate data, tools, and code templates. A network of 4 specialized collaborative agents performs task analysis, data engineering, module assembly, and training execution. A sample log tracks the Model Trainer agent’s activities during diagnosis model development.
We provide requirements.txt as a reference; the listed package versions are not mandatory.
- Create a folder named ExternalDataset locally.
- Put your custom dataset folder into ExternalDataset.
- We suggest removing files that are irrelevant to training or testing from your dataset folder to avoid interference!
Your ExternalDataset folder should look like:
ExternalDataset
| Dataset_1
| | images
| | masks
| | labels.csv
| Dataset_2
| | Class1
| | Class2
| | labels.json
| ......
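As a quick sanity check of the layout above, the following is a minimal sketch that lists each dataset folder under ExternalDataset together with its top-level entries. The function name `list_datasets` is hypothetical and not part of this repository; it only reads the directory and makes no assumption about how M^3Builder itself parses the data.

```python
from pathlib import Path


def list_datasets(root="ExternalDataset"):
    """Return {dataset_name: sorted top-level entry names} for each sub-folder of root.

    Hypothetical helper for inspecting the folder layout; not part of the repository.
    """
    root_path = Path(root)
    if not root_path.is_dir():
        # Nothing to list if the folder has not been created yet.
        return {}
    layout = {}
    for dataset_dir in sorted(p for p in root_path.iterdir() if p.is_dir()):
        # Record files and folders directly inside each dataset folder.
        layout[dataset_dir.name] = sorted(child.name for child in dataset_dir.iterdir())
    return layout
```

Calling `list_datasets()` from the directory that contains ExternalDataset and printing the result makes it easy to spot stray files that are irrelevant to training or testing before starting a run.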
After setting up the environment and preparing the data, first check all the files and replace every 'path/to/sth' placeholder with your own paths.
Then, edit the human_requirements parameter in run.sh to your own requirements, and run:
./run.sh
Training logs and checkpoints will be placed under `TrainPipeline/Logout`.
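Before launching run.sh, it can help to confirm that no 'path/to/...' placeholders remain. The sketch below scans text files for the literal prefix `path/to/`. The helper name `find_placeholders` and the list of file suffixes are assumptions made for illustration; adjust them to the files actually present in your checkout.

```python
from pathlib import Path

# Literal prefix of the placeholder paths that need to be replaced.
PLACEHOLDER = "path/to/"


def find_placeholders(repo_root=".", suffixes=(".py", ".sh", ".json", ".yaml", ".yml", ".txt")):
    """Return (file, line_number, line) tuples for lines still containing a placeholder.

    Hypothetical helper; the suffix list is an assumption, not taken from the repository.
    """
    hits = []
    for file_path in sorted(Path(repo_root).rglob("*")):
        if not file_path.is_file() or file_path.suffix not in suffixes:
            continue
        try:
            text = file_path.read_text(encoding="utf-8")
        except (UnicodeDecodeError, OSError):
            # Skip files that cannot be read as UTF-8 text.
            continue
        for line_no, line in enumerate(text.splitlines(), start=1):
            if PLACEHOLDER in line:
                hits.append((str(file_path), line_no, line.strip()))
    return hits
```

An empty list from `find_placeholders(".")` suggests that every placeholder in the scanned file types has been replaced.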
Task Completion Performance Across LLMs. Each experiment is run multiple times, with results shown as successful completions over total runs (a/b format). Green cells indicate that all runs passed, yellow that only some runs passed, and red that all runs failed.
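To make the a/b notation concrete, here is a small sketch that parses such cells, assigns the color described in the caption, and pools several cells into one success rate. The function names are hypothetical, and pooling passed runs over total runs is an assumption; the paper may aggregate its reported figures differently.

```python
def parse_cell(cell):
    """Split an 'a/b' cell into (passed, total) integers."""
    passed, total = (int(part) for part in cell.split("/"))
    if total <= 0 or not 0 <= passed <= total:
        raise ValueError(f"invalid cell: {cell!r}")
    return passed, total


def cell_color(cell):
    """Map a cell to the color scheme described in the caption."""
    passed, total = parse_cell(cell)
    if passed == total:
        return "green"   # all runs passed
    if passed == 0:
        return "red"     # all runs failed
    return "yellow"      # only some runs passed


def success_rate(cells):
    """Pooled success rate: total passed runs divided by total runs (an assumed aggregation)."""
    pairs = [parse_cell(cell) for cell in cells]
    return sum(p for p, _ in pairs) / sum(t for _, t in pairs)
```

For example, with made-up cells, `cell_color("2/5")` gives `"yellow"` and `success_rate(["5/5", "0/5"])` gives `0.5`.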
Framework Comparison with SOTAs and Ablations on System Design using Sonnet. Results are averaged over two runs per task at the dataset level. “w/o Colab” denotes single-agent execution, and “Iters” is the number of self-correction rounds.
We sincerely thank all the contributors who developed the relevant code in our repository.
@article{feng2025m3,
title={M^3Builder: A Multi-Agent System for Automated Machine Learning in Medical Imaging},
author={Feng, Jinghao and Zheng, Qiaoyu and Wu, Chaoyi and Zhao, Ziheng and Zhang, Ya and Wang, Yanfeng and Xie, Weidi},
journal={arXiv preprint arXiv:2502.20301},
year={2025}
}



