MorphoBench: A Benchmark with Difficulty Adaptive to Model Reasoning

πŸ€— Dataset (Hugging Face) πŸ“‘ Paper (arXiv:2510.14265)


Xukai Wang*, Xuanbo Liu*, Mingrui Chen*, Haitian Zhong*, Xuanlin Yang*, Bohan Zeng, Jinbo Hu, Hao Liang, Junbo Niu, Xuchen Li, Ruitao Wu, Ruichuan An, Yang Shi, Liu Liu, Xu-Yao Zhang, Qiang Liu, Zhouchen Lin, Wentao Zhang, Bin Dong

πŸ“£ Overview

MorphoBench Overview

MorphoBench is an adaptive reasoning benchmark for large-scale models. It curates over 1,300 multidisciplinary questions and dynamically adjusts task difficulty based on model reasoning traces, providing a scalable and reliable framework for evaluating the reasoning performance of advanced models like o3 and GPT-5.

πŸ“Š Datasets

MorphoBench includes 5 datasets with varying difficulty levels:

| Dataset | Description | Questions | Hints |
|---|---|---|---|
| Morpho_R_v0 | Base reasoning questions | 1,307 | None |
| Morpho_R_Lite | Easy mode with helpful hints | 2,614 | βœ… Helpful |
| Morpho_R_Complex | Hard mode with misleading hints | 2,614 | ⚠️ Misleading |
| Morpho_P_v0 | Base perception questions | 476 | None |
| Morpho_P_Perturbed | Perturbed perception questions | 476 | None |

πŸŽ“ Dataset

The MorphoBench dataset is available on Hugging Face: OpenDCAI/MorphoBench

from datasets import load_dataset
dataset = load_dataset("OpenDCAI/MorphoBench")

After downloading, create a data/ folder inside your local project directory and place the datasets there:

MorphoBench/
β”œβ”€β”€ adaption/
β”œβ”€β”€ asset/
β”œβ”€β”€ data/
β”‚   β”œβ”€β”€ Morpho_P_Perturbed/
β”‚   β”œβ”€β”€ Morpho_P_v0/
β”‚   β”œβ”€β”€ Morpho_R_Complex/
β”‚   β”œβ”€β”€ Morpho_R_Lite/
β”‚   └── Morpho_R_v0/
β”œβ”€β”€ eval_agent/
β”œβ”€β”€ scripts/
β”œβ”€β”€ output/
└── ...
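Before running inference, it can help to sanity-check that all five dataset folders landed in the right place. The sketch below does this with the standard library only; the folder names come from the tree above, while the helper name `missing_datasets` is hypothetical, not part of the repository.

```python
from pathlib import Path

# Expected dataset folders inside the project's data/ directory
# (names taken from the directory tree above).
EXPECTED = [
    "Morpho_P_Perturbed",
    "Morpho_P_v0",
    "Morpho_R_Complex",
    "Morpho_R_Lite",
    "Morpho_R_v0",
]

def missing_datasets(data_dir: str = "data") -> list[str]:
    """Return the expected dataset folders that are absent under data_dir."""
    root = Path(data_dir)
    return [name for name in EXPECTED if not (root / name).is_dir()]
```

Running `missing_datasets()` from the project root returns an empty list once all five datasets are in place.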

βš™οΈ Usage

Environment Setup

cd MorphoBench
pip install -r requirements.txt

Configuration

Create a .env file in the project root:

# API Configuration
API_KEY=your_openai_api_key
API_BASE=https://api.openai.com/v1

# Model Configuration (optional)
JUDGE_MODEL=o3-mini-2025-01-31
BREAKDOWN_MODEL=o3-mini-2025-01-31
CHECK_MODEL=o3-mini-2025-01-31
SUMMARY_MODEL=o3-mini-2025-01-31
HINT_MODEL=o3-mini-2025-01-31

# Concurrency (optional)
EVAL_NUM_WORKERS=50
EVAL_MAX_TOKENS=4096
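A project like this typically reads the `.env` file via `python-dotenv`; the stdlib-only sketch below mirrors that behavior for the sample file above, with optional model keys falling back to a default. The helper names `load_env` and `get_model` are illustrative, not the repository's API.

```python
import os

def load_env(path: str = ".env") -> dict[str, str]:
    """Minimal .env parser: one KEY=VALUE per line, '#' starts a comment."""
    values: dict[str, str] = {}
    with open(path) as f:
        for raw in f:
            line = raw.strip()
        # Skip blanks, comments, and malformed lines
            if not line or line.startswith("#") or "=" not in line:
                continue
            key, _, value = line.partition("=")
            values[key.strip()] = value.strip()
    return values

def get_model(values: dict[str, str], key: str) -> str:
    """Optional model keys fall back to the default shown in the sample .env."""
    return values.get(key, "o3-mini-2025-01-31")
```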

Run Inference

Generate model predictions for all datasets:

bash scripts/run_batch.sh

Predictions will be saved under:

output/infer_result/

Evaluate Model Results

Basic Evaluation

bash scripts/evaluate_batch.sh

Advanced Evaluation with Eval Agent

The eval_agent module provides comprehensive evaluation including:

  1. Correctness Evaluation: Judges answer correctness using LLM
  2. Reasoning Quality Evaluation: Analyzes reasoning completeness and logical coherence
  3. Hint Follow Evaluation: Assesses how models follow/deviate from hints (R_Lite & R_Complex only)

Run Single Evaluation

python -m eval_agent.runner \
    --dataset ./data/Morpho_R_v0 \
    --predictions ./output/infer_result/Morpho_R_v0_o3.json \
    --difficulty v0 \
    --model_name Morpho_R_v0_o3

Run Batch Evaluation

bash eval_agent/run_eval.sh

Evaluation Outputs

output/
β”œβ”€β”€ eval_agent_result/          # Evaluation results (JSON + TXT)
β”œβ”€β”€ eval_agent_traces/          # Detailed reasoning traces
β”‚   β”œβ”€β”€ reasoning_quality/      # Step-by-step reasoning analysis
β”‚   β”‚   └── {dataset}/{model}/
β”‚   └── hint_follow/            # Hint alignment analysis
β”‚       └── {dataset}/{model}/
└── metrics_summary/            # Aggregated metrics (CSV)
    β”œβ”€β”€ 1_accuracy.csv
    β”œβ”€β”€ 2_completeness.csv
    β”œβ”€β”€ 3_logical_coherence.csv
    β”œβ”€β”€ 4_hint_alignment_score.csv
    └── 5_hint_justified_deviation_rate.csv
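The aggregated CSVs can be sliced with the standard `csv` module, for example to find the best model per dataset from `1_accuracy.csv`. Note the column names used here (`dataset`, `model`, `accuracy`) are assumptions about the file's header, not documented by the repository.

```python
import csv

def best_by_dataset(path: str) -> dict[str, tuple[str, float]]:
    """Map each dataset to its highest-accuracy (model, score) pair.

    Column names ("dataset", "model", "accuracy") are assumed; adjust to
    the actual header of 1_accuracy.csv.
    """
    best: dict[str, tuple[str, float]] = {}
    with open(path, newline="") as f:
        for row in csv.DictReader(f):
            ds, model = row["dataset"], row["model"]
            acc = float(row["accuracy"])
            if ds not in best or acc > best[ds][1]:
                best[ds] = (model, acc)
    return best
```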

πŸ“ˆ Metrics

Correctness Metrics

  • Accuracy: Percentage of correct answers
  • Calibration Error: Confidence calibration measurement
  • Response Length: Average response token count
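Calibration error of the kind listed above is commonly computed as binned expected calibration error (ECE); the benchmark's exact formulation may differ, so treat this as the standard 10-bin sketch rather than the repository's implementation.

```python
def expected_calibration_error(confidences, corrects, n_bins=10):
    """Binned ECE: |accuracy - mean confidence| per bin, weighted by bin size.

    confidences: per-question confidence in [0, 1]
    corrects: per-question 0/1 correctness
    """
    total = len(confidences)
    ece = 0.0
    for b in range(n_bins):
        lo, hi = b / n_bins, (b + 1) / n_bins
        # Half-open bins (lo, hi], with 0.0 assigned to the first bin
        in_bin = [(c, y) for c, y in zip(confidences, corrects)
                  if lo < c <= hi or (b == 0 and c == 0.0)]
        if not in_bin:
            continue
        avg_conf = sum(c for c, _ in in_bin) / len(in_bin)
        acc = sum(y for _, y in in_bin) / len(in_bin)
        ece += len(in_bin) / total * abs(acc - avg_conf)
    return ece
```

A perfectly calibrated model (confidence matching empirical accuracy in every bin) scores 0; a model that answers with confidence 1.0 but is right half the time scores 0.5.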

Reasoning Quality Metrics

  • Completeness (0-100): Whether reasoning covers all necessary steps
  • Logical Coherence (0-100): Whether reasoning steps follow logically

Hint Follow Metrics (R_Lite & R_Complex only)

  • Alignment Score (0-100): How well reasoning aligns with provided hints
  • Justified Deviation Rate: Percentage of deviations with valid justification
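The two hint metrics above can be aggregated from per-question judgments roughly as follows. The record schema used here (`alignment_score`, `deviated`, `justified`) is an assumption for illustration; the real `eval_agent` output format may differ.

```python
def hint_follow_summary(records):
    """Aggregate per-question hint judgments into the two hint metrics.

    Each record is assumed to look like
    {"alignment_score": 0-100, "deviated": bool, "justified": bool}.
    """
    n = len(records)
    alignment = sum(r["alignment_score"] for r in records) / n
    # Justified deviation rate is computed over deviating questions only
    deviations = [r for r in records if r["deviated"]]
    jdr = (100 * sum(r["justified"] for r in deviations) / len(deviations)
           if deviations else 0.0)
    return {"alignment_score": alignment, "justified_deviation_rate": jdr}
```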

πŸ“Š Evaluation Results

The following figure summarizes the evaluation results on MorphoBench.

MorphoBench Evaluation Results

πŸ“ Project Structure

MorphoBench/
β”œβ”€β”€ adaption/                   # Adaptive reasoning scripts
β”‚   β”œβ”€β”€ Agent_reasoning.py
β”‚   └── Agent_recognition.py
β”œβ”€β”€ asset/                      # Images and assets
β”œβ”€β”€ data/                       # Datasets (download from HuggingFace)
β”œβ”€β”€ eval_agent/                 # Evaluation agent module
β”‚   β”œβ”€β”€ __init__.py
β”‚   β”œβ”€β”€ config.py              # Configuration
β”‚   β”œβ”€β”€ runner.py              # Main entry point
β”‚   β”œβ”€β”€ run_eval.sh            # Batch evaluation script
β”‚   β”œβ”€β”€ evaluators/            # Evaluation implementations
β”‚   β”‚   β”œβ”€β”€ correctness.py
β”‚   β”‚   β”œβ”€β”€ reasoning_quality.py
β”‚   β”‚   └── hint_follow.py
β”‚   └── tools/                 # LLM-based evaluation tools
β”‚       β”œβ”€β”€ base_tool.py
β”‚       β”œβ”€β”€ reasoning_breakdown.py
β”‚       β”œβ”€β”€ step_check.py
β”‚       └── hint_check.py
β”œβ”€β”€ scripts/                   # Inference and evaluation scripts
β”‚   β”œβ”€β”€ run_batch.sh
β”‚   β”œβ”€β”€ run_model_predictions.py
β”‚   β”œβ”€β”€ evaluate_batch.sh
β”‚   └── evaluate_judge.py
β”œβ”€β”€ output/                    # Generated outputs
β”œβ”€β”€ requirements.txt
β”œβ”€β”€ LICENSE
└── README.md

πŸ™ Acknowledgements

This repository adapts evaluation scripts from Humanity's Last Exam. We sincerely thank the authors for their valuable contributions to the research community.

πŸ“– Citation

If you find MorphoBench useful for your research, please cite our paper:

@misc{wang2025morphobenchbenchmarkdifficultyadaptive,
      title={MorphoBench: A Benchmark with Difficulty Adaptive to Model Reasoning}, 
      author={Xukai Wang and Xuanbo Liu and Mingrui Chen and Haitian Zhong and Xuanlin Yang and Bohan Zeng and Jinbo Hu and Hao Liang and Junbo Niu and Xuchen Li and Ruitao Wu and Ruichuan An and Yang Shi and Liu Liu and Xu-Yao Zhang and Qiang Liu and Zhouchen Lin and Wentao Zhang and Bin Dong},
      year={2025},
      eprint={2510.14265},
      archivePrefix={arXiv},
      primaryClass={cs.AI},
      url={https://arxiv.org/abs/2510.14265}, 
}
