MorphoBench: A Benchmark with Difficulty Adaptive to Model Reasoning
Xukai Wang*, Xuanbo Liu*, Mingrui Chen*, Haitian Zhong*, Xuanlin Yang*, Bohan Zeng, Jinbo Hu, Hao Liang, Junbo Niu, Xuchen Li, Ruitao Wu, Ruichuan An, Yang Shi, Liu Liu, Xu-Yao Zhang, Qiang Liu, Zhouchen Lin, Wentao Zhang, Bin Dong
MorphoBench is an adaptive reasoning benchmark for large-scale models. It curates over 1,300 multidisciplinary questions and dynamically adjusts task difficulty based on model reasoning traces, providing a scalable and reliable framework for evaluating the reasoning performance of advanced models like o3 and GPT-5.
MorphoBench includes 5 datasets with varying difficulty levels:
| Dataset | Description | Questions | Hints |
|---|---|---|---|
| Morpho_R_v0 | Base reasoning questions | 1,307 | None |
| Morpho_R_Lite | Easy mode with helpful hints | 2,614 | Helpful |
| Morpho_R_Complex | Hard mode with misleading hints | 2,614 | Misleading |
| Morpho_P_v0 | Base perception questions | 476 | None |
| Morpho_P_Perturbed | Perturbed perception questions | 476 | None |
The MorphoBench dataset is available on Hugging Face: OpenDCAI/MorphoBench
from datasets import load_dataset
dataset = load_dataset("OpenDCAI/MorphoBench")

After downloading, create a data/ folder inside your local project directory and place the datasets there:
MorphoBench/
├── adaption/
├── asset/
├── data/
│   ├── Morpho_P_Perturbed/
│   ├── Morpho_P_v0/
│   ├── Morpho_R_Complex/
│   ├── Morpho_R_Lite/
│   └── Morpho_R_v0/
├── eval_agent/
├── scripts/
├── output/
└── ...
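The data/ layout above can be prepared ahead of time with a short helper. This is only a convenience sketch, not part of the repository: it creates the five empty subset directories for you to fill with the downloaded data.

```python
import os

# Subset folder names, taken from the dataset table above.
SUBSETS = [
    "Morpho_R_v0", "Morpho_R_Lite", "Morpho_R_Complex",
    "Morpho_P_v0", "Morpho_P_Perturbed",
]

def make_data_layout(root="data"):
    """Create the data/ folder with one subdirectory per subset."""
    paths = []
    for name in SUBSETS:
        path = os.path.join(root, name)
        os.makedirs(path, exist_ok=True)
        paths.append(path)
    return paths
```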
cd MorphoBench
pip install -r requirements.txt

Create a .env file in the project root:
# API Configuration
API_KEY=your_openai_api_key
API_BASE=https://api.openai.com/v1
# Model Configuration (optional)
JUDGE_MODEL=o3-mini-2025-01-31
BREAKDOWN_MODEL=o3-mini-2025-01-31
CHECK_MODEL=o3-mini-2025-01-31
SUMMARY_MODEL=o3-mini-2025-01-31
HINT_MODEL=o3-mini-2025-01-31
# Concurrency (optional)
EVAL_NUM_WORKERS=50
EVAL_MAX_TOKENS=4096

Generate model predictions for all datasets:
bash scripts/run_batch.sh

Predictions will be saved under:
output/infer_result/
bash scripts/evaluate_batch.sh

The eval_agent module provides comprehensive evaluation, including:
- Correctness Evaluation: Judges answer correctness using LLM
- Reasoning Quality Evaluation: Analyzes reasoning completeness and logical coherence
- Hint Follow Evaluation: Assesses how models follow/deviate from hints (R_Lite & R_Complex only)
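The hint-follow numbers could be aggregated from per-question judgments along these lines. The record schema here (`deviated`, `justified` flags per question) is an assumption for illustration, not the project's actual data format.

```python
def hint_follow_metrics(records):
    """Aggregate per-question hint-follow judgments.

    Each record is a dict with a `deviated` bool and, when deviated,
    a `justified` bool (hypothetical schema). Returns
    (alignment_rate, justified_deviation_rate) as percentages.
    """
    deviations = [r for r in records if r.get("deviated")]
    aligned = len(records) - len(deviations)
    alignment_rate = 100.0 * aligned / len(records) if records else 0.0
    justified = sum(1 for r in deviations if r.get("justified"))
    jdr = 100.0 * justified / len(deviations) if deviations else 0.0
    return alignment_rate, jdr
```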
python -m eval_agent.runner \
--dataset ./data/Morpho_R_v0 \
--predictions ./output/infer_result/Morpho_R_v0_o3.json \
--difficulty v0 \
    --model_name Morpho_R_v0_o3

To run batch evaluation instead:

bash eval_agent/run_eval.sh

Evaluation outputs are organized as follows:

output/
├── eval_agent_result/            # Evaluation results (JSON + TXT)
├── eval_agent_traces/            # Detailed reasoning traces
│   ├── reasoning_quality/        # Step-by-step reasoning analysis
│   │   └── {dataset}/{model}/
│   └── hint_follow/              # Hint alignment analysis
│       └── {dataset}/{model}/
└── metrics_summary/              # Aggregated metrics (CSV)
    ├── 1_accuracy.csv
    ├── 2_completeness.csv
    ├── 3_logical_coherence.csv
    ├── 4_hint_alignment_score.csv
    └── 5_hint_justified_deviation_rate.csv
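The summary CSVs above can be loaded with the standard library's csv module. The `model` and `value` column names in this sketch are assumptions; check the actual file headers before relying on them.

```python
import csv
import io

def read_metric_csv(text):
    """Parse a metrics CSV (e.g. the contents of 1_accuracy.csv)
    into a {model: value} dict.

    Assumes hypothetical `model` and `value` columns; adjust to the
    real headers in metrics_summary/.
    """
    reader = csv.DictReader(io.StringIO(text))
    return {row["model"]: float(row["value"]) for row in reader}
```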
- Accuracy: Percentage of correct answers
- Calibration Error: Confidence calibration measurement
- Response Length: Average response token count
- Completeness (0-100): Whether reasoning covers all necessary steps
- Logical Coherence (0-100): Whether reasoning steps follow logically
- Alignment Score (0-100): How well reasoning aligns with provided hints
- Justified Deviation Rate: Percentage of deviations with valid justification
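Accuracy and a calibration error can be computed from per-question (confidence, correct) pairs. The sketch below uses expected calibration error with equal-width bins, a common formulation that may differ from the exact measurement used here.

```python
def accuracy_and_ece(preds, n_bins=10):
    """preds: list of (confidence in [0, 1], correct: bool).

    Returns (accuracy %, expected calibration error). ECE averages
    |mean confidence - mean accuracy| over equal-width confidence
    bins, weighted by bin size; this is one standard formulation,
    not necessarily MorphoBench's.
    """
    n = len(preds)
    accuracy = 100.0 * sum(c for _, c in preds) / n
    ece = 0.0
    for b in range(n_bins):
        lo, hi = b / n_bins, (b + 1) / n_bins
        # Put p == 1.0 in the last bin so nothing is dropped.
        bucket = [(p, c) for p, c in preds
                  if lo <= p < hi or (b == n_bins - 1 and p == 1.0)]
        if not bucket:
            continue
        avg_conf = sum(p for p, _ in bucket) / len(bucket)
        avg_acc = sum(c for _, c in bucket) / len(bucket)
        ece += len(bucket) / n * abs(avg_conf - avg_acc)
    return accuracy, ece
```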
The following figure summarizes the evaluation results on MorphoBench.
MorphoBench/
├── adaption/                     # Adaptive reasoning scripts
│   ├── Agent_reasoning.py
│   └── Agent_recognition.py
├── asset/                        # Images and assets
├── data/                         # Datasets (download from HuggingFace)
├── eval_agent/                   # Evaluation agent module
│   ├── __init__.py
│   ├── config.py                 # Configuration
│   ├── runner.py                 # Main entry point
│   ├── run_eval.sh               # Batch evaluation script
│   ├── evaluators/               # Evaluation implementations
│   │   ├── correctness.py
│   │   ├── reasoning_quality.py
│   │   └── hint_follow.py
│   └── tools/                    # LLM-based evaluation tools
│       ├── base_tool.py
│       ├── reasoning_breakdown.py
│       ├── step_check.py
│       └── hint_check.py
├── scripts/                      # Inference and evaluation scripts
│   ├── run_batch.sh
│   ├── run_model_predictions.py
│   ├── evaluate_batch.sh
│   └── evaluate_judge.py
├── output/                       # Generated outputs
├── requirements.txt
├── LICENSE
└── README.md
This repository adapts the evaluation script from Humanity's Last Exam. We sincerely thank the authors for their valuable contributions to the research community.
If you find MorphoBench useful for your research, please cite our paper:
@misc{wang2025morphobenchbenchmarkdifficultyadaptive,
title={MorphoBench: A Benchmark with Difficulty Adaptive to Model Reasoning},
author={Xukai Wang and Xuanbo Liu and Mingrui Chen and Haitian Zhong and Xuanlin Yang and Bohan Zeng and Jinbo Hu and Hao Liang and Junbo Niu and Xuchen Li and Ruitao Wu and Ruichuan An and Yang Shi and Liu Liu and Xu-Yao Zhang and Qiang Liu and Zhouchen Lin and Wentao Zhang and Bin Dong},
year={2025},
eprint={2510.14265},
archivePrefix={arXiv},
primaryClass={cs.AI},
url={https://arxiv.org/abs/2510.14265},
}
