- Overview
- Environment Setup
- Dataset Preparation
- Supported Models
- Inference
- Results Snapshot
- Acknowledgement
- Contact Info
- Citation
## Overview

We introduce an Executor-Analyst architecture that couples TxAgent (Executor) with Gemini-2.5/3 series models (Analyst) to mitigate the “context utilization failure” described in the tech report. TxAgent focuses on precise tool calls inside ToolUniverse v1.0 (211 curated biomedical APIs), while Gemini performs long-context synthesis, optional Google search, and deterministic post-processing. Our stratified late-fusion ensemble delivers competitive performance without additional training, ultimately taking 2nd place in Track 2.
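At a high level the pipeline decouples the two roles; the deliberately stubbed sketch below illustrates the data flow only (all function names are illustrative placeholders for the real TxAgent and Gemini components):

```python
def run_executor(question, n_samples=10):
    """Executor role (TxAgent): sample n diverse tool-use transcripts.

    Stubbed here; the real Executor issues ToolUniverse API calls.
    """
    return [f"transcript-{i} for {question}" for i in range(n_samples)]

def run_analyst(question, transcripts):
    """Analyst role (Gemini): synthesize transcripts into one answer.

    Stubbed here; the real Analyst does long-context reasoning and
    optional Google search before committing to a choice.
    """
    return "A"  # placeholder multiple-choice answer

def answer(question):
    """Executor feeds the Analyst; fusion across runs happens downstream."""
    transcripts = run_executor(question)
    return run_analyst(question, transcripts)
```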
## Environment Setup

While most experiments run on AMD MI300X, only light adjustments are needed for NVIDIA H100/V100.
```bash
# Prep: clone the repo
git clone git@github.com:June01/CureAgent.git

# Host (Phase 1)
docker pull rocm/vllm:rocm6.3.1_vllm_0.8.5_20250513
docker run -it --group-add=video --ipc=host --cap-add=SYS_PTRACE \
  --security-opt seccomp=unconfined --device /dev/kfd --device /dev/dri \
  -p 8890:8890 -v /root/code:/app/code aa1e9eebfc30 bash
```
```bash
# Container (Phase 2)
git clone git@github.com:mims-harvard/TxAgent.git
# On AMD, move the patched sources first: mv ./txagent/* TxAgent/src/txagent/
cd TxAgent
pip install -e .
cd ../CureAgent
pip install -e .
# IMPORTANT: use exactly tooluniverse==0.2.0 (211 tools) or below;
# performance drops once the tool count grows to 600+
pip install tooluniverse==0.2.0
```

Run the OpenAI GPT-OSS checkpoints through vLLM serving; the environment setup follows this link.
```bash
# Bring up the vLLM server (example port 8001)
python run_gpt_oss_vllm.py --config configs/metadata_config_gpt_oss_120b.json

# Sanity-check the endpoint
curl http://localhost:8001/v1/completions \
  -H "Content-Type: application/json" \
  -d '{"model":"openai/gpt-oss-120b","prompt":"The future of AI is","max_tokens":100,"temperature":0}'
```

## Dataset Preparation

Download the validation/test JSONL files from Kaggle: https://www.kaggle.com/competitions/cure-bench
Example config snippet:

```json
{
  "dataset": {
    "dataset_name": "cure_bench_phase_1",
    "dataset_path": "/abs/path/curebench_valset_phase1.jsonl",
    "description": "CURE-Bench 2025 val questions"
  }
}
```

Replicate this for `curebench_testset_phase2.jsonl` when generating final submissions. All configs live in `configs/metadata_config_*.json`. Use absolute paths or paths relative to the repo root.
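Before launching a long run, a config of this shape can be sanity-checked in a few lines. The helpers below are an illustrative sketch (`load_dataset_config` and `iter_questions` are not part of the repo):

```python
import json

def load_dataset_config(config_path):
    """Read a metadata config file and return its "dataset" section."""
    with open(config_path) as f:
        cfg = json.load(f)
    ds = cfg["dataset"]
    if not ds["dataset_path"].endswith(".jsonl"):
        raise ValueError("dataset_path must point to a JSONL file")
    return ds

def iter_questions(jsonl_path):
    """Yield one question dict per non-empty line of the JSONL file."""
    with open(jsonl_path) as f:
        for line in f:
            line = line.strip()
            if line:
                yield json.loads(line)
```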
## Supported Models

| Model | Type | Tools | Config |
|---|---|---|---|
| TxAgent (Llama-3.1-8B) | Agentic | ✅ | metadata_config_testset_phase2_txagent.json |
| Llama-3.1-8B / 70B | Internal | ❌ | metadata_config_testset_phase2_llama3_8b.json |
| Llama3-Med42-8B | Internal | ❌ | metadata_config_testset_phase2_llama3_med42_8b.json |
| Qwen3-8B | Internal | ❌ | metadata_config_testset_phase2_qwen3_8b.json |
| Qwen3-32B-Medical | Internal | ❌ | metadata_config_testset_phase2_qwen3_32b_medical.json |
| Baichuan-M2-32B | Internal | ❌ | metadata_config_testset_phase2_baichuan_m2_32b.json |
| MedGemma-4B / 27B | Internal | ❌ | metadata_config_testset_phase2_medgemma_4b_it.json |
| GPT-OSS 20B / 120B | Internal | ❌ | metadata_config_testset_phase2_gpt_oss_20b.json |

| Model | Type | Search |
|---|---|---|
| gemini-2.5-flash | API | ✅ |
| gemini-2.5-pro | API | ✅ |
| gemini-3-pro-preview* | API | ✅ |
*Post-competition benchmark used to validate scalability (see Tech report Table 2).
## Inference

### TxAgent + other open-source configs
```bash
export CONFIG_DIR=configs
python run.py --config "${CONFIG_DIR}/metadata_config_testset_phase2_txagent.json"
python run.py --config "${CONFIG_DIR}/metadata_config_testset_phase2_llama3_8b.json"
# ...repeat for any config listed in configs/
```

All runs write `submission.csv` plus a metadata JSON under `results/competition_testset_<phasex>_<model_name>/`.
### GPT-OSS via vLLM
```bash
# Curl sanity check (replace port/model as needed)
curl http://localhost:8001/v1/completions \
  -H "Content-Type: application/json" \
  -d '{"model":"openai/gpt-oss-20b","prompt":"The future of AI is","max_tokens":100,"temperature":0}'
```
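The same sanity check can be scripted against the OpenAI-compatible schema that vLLM serves; a minimal stdlib-only sketch (helper names here are illustrative, not part of the repo):

```python
import json
import urllib.request

def build_completion_request(model, prompt, max_tokens=100, temperature=0.0):
    """Build the JSON body for an OpenAI-compatible /v1/completions call."""
    return {"model": model, "prompt": prompt,
            "max_tokens": max_tokens, "temperature": temperature}

def extract_text(response_body):
    """Pull the first generated string out of a /v1/completions response."""
    return response_body["choices"][0]["text"]

def complete(url, model, prompt):
    """POST the request; only call this against a live vLLM server."""
    data = json.dumps(build_completion_request(model, prompt)).encode()
    req = urllib.request.Request(
        url, data=data, headers={"Content-Type": "application/json"})
    with urllib.request.urlopen(req) as resp:
        return extract_text(json.loads(resp.read()))
```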
```bash
# Batch evaluation (vLLM serving)
python run_gpt_oss_vllm.py --config configs/metadata_config_gpt_oss_20b.json --verbose
```

### Gemini-only API
- Export `GEMINI_API_KEY` (and adjust `model_name`, `google_search_enabled`, `num_workers`, etc. in `gemini_with_search_config.json` if needed).
- Example config:
```json
{
  "dataset_path": "dataset/curebench_testset_phase2.jsonl",
  "num_workers": 8,
  "full_evaluation": true,
  "google_search": true,
  "model_name": "gemini-2.5-flash",
  "output_dir": "gemini/testset_gemini_2.5_flash_with_search_phase2_results"
}
```

- Launch inference:

```bash
cd gemini
python run_testset_with_search.py
```

The script spawns per-worker Gemini clients, streams progress, and writes `testset_submission_<timestamp>.csv` plus zipped metadata inside the `output_dir` configured in `gemini_with_search_config.json`.
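The worker fan-out can be approximated as below; this is an illustrative sketch, with a stubbed `answer_question` standing in for the real per-worker Gemini call:

```python
import csv
from concurrent.futures import ThreadPoolExecutor

def answer_question(question):
    """Stand-in for one Gemini call; the real script queries the API."""
    return {"id": question["id"], "choice": "A", "reasoning": "stub"}

def run_pool(questions, num_workers=8):
    """Fan questions across a thread pool, preserving input order."""
    with ThreadPoolExecutor(max_workers=num_workers) as pool:
        return list(pool.map(answer_question, questions))

def write_submission(rows, path):
    """Write the per-question answers as a submission CSV."""
    with open(path, "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=["id", "choice", "reasoning"])
        writer.writeheader()
        writer.writerows(rows)
```

Threads (rather than processes) are a reasonable fit here because each worker is blocked on API I/O, not CPU.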
Our winning submission uses a stratified late-fusion ensemble:

- Executor self-consistency – Run TxAgent with temperature `T=0.8` and a sample budget of `n=10` (or `n=10×3` for the ensemble) to harvest diverse tool-use transcripts. See Tech report Tables 1–2 for scaling gains.
- Evidence aggregation – Each TxAgent subgroup writes its own CSV in `results/`. Retain the top-k tool calls per subgroup rather than pooling globally, to preserve rare but critical evidence.
- Analyst reasoning & post-processing (Gemini) – Use `gemini/run_final_step_with_gemini.py` to rerank each subgroup's `submission.csv`:

  ```bash
  cd gemini
  python run_final_step_with_gemini.py \
    /root/code/CureAgent/results/<txagent_run>/submission.csv \
    --model_name gemini-2.5-flash \
    --enable_search \
    --api_key ${GEMINI_API_KEY} \
    --output stratified_group1.csv
  ```

  The script removes invalid `tool` messages, truncates long traces, retries with Gemini, extracts `[FinalAnswer]` tags (or falls back to A/B/C/D heuristics), and supports multiprocessing via `--num_workers`.
- Late fusion – Majority vote across Analyst outputs (choices + rationales). Combine `stratified_group*.csv` and package the final Kaggle ZIP.
This topology prevents the early information bottleneck noted in the Tech Report (§Methods, Fig. 3) and underpins our Track 2 results.
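The late-fusion step amounts to a per-question majority vote over the group CSVs; a minimal sketch (the `id`/`choice` column names and the first-listed tie-break rule are assumptions, not the repo's exact logic):

```python
import csv
from collections import Counter

def majority_vote(choices):
    """Most common choice; ties break toward the earliest-listed group."""
    counts = Counter(choices)
    best = max(counts.values())
    for choice in choices:  # first group listed wins ties
        if counts[choice] == best:
            return choice

def fuse_submissions(csv_paths):
    """Combine per-group submission CSVs by per-question majority vote."""
    votes = {}
    for path in csv_paths:
        with open(path, newline="") as f:
            for row in csv.DictReader(f):
                votes.setdefault(row["id"], []).append(row["choice"])
    return {qid: majority_vote(v) for qid, v in votes.items()}
```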
## Results Snapshot

- Open-source: TxAgent self-consistency (n=30) reaches 73.5% on Phase 2; other OSS models lag by 15–40 points without fine-tuned tool use (see Tech report, Table 1).
- Closed-source: Gemini-2.5-Pro + search hits 74.8% standalone; Gemini-3-Pro with search climbs to 81.3%.
- Executor-Analyst (ours): TxAgent (10×3) + Gemini-2.5 Flash (search) ×3 + stratified late fusion delivers 83.8% Phase 2 accuracy, securing 2nd place in Track 2.
## Acknowledgement

- AMD for providing GPU resources, and https://www.synapnote.ai/
- CUREBench — parts of the code are derived from their baseline
## Citation

Please cite the upcoming tech report once the arXiv link is live:

```bibtex
@article{xie2025cureagent,
  title={CureAgent: A Training-Free Executor-Analyst Framework for Clinical Reasoning},
  author={Xie, Ting-Ting and Zhang, Yixin},
  journal={arXiv preprint arXiv:2512.05576},
  year={2025}
}
```
