- Overview
- Key Features
- Quick Start
- System Architecture
- Data Format
- Configuration
- Usage Guide
- Evaluation Framework
- Dataset
- Advanced Features
- Development
- Community
- License
## Overview

This Medical AI Evaluation Framework provides a comprehensive evaluation system designed specifically for assessing AI models in clinical scenarios. Based on the GAPS (Grounded, Automated, Personalized, Scalable) methodology, the framework includes both a curated clinical benchmark dataset and an automated assessment pipeline for medical AI systems.
The framework addresses the critical need for standardized evaluation of AI clinical decision-making by providing:
- Clinically Grounded Assessment: Evaluation criteria based on real medical guidelines and expert knowledge
- Automated Pipeline: Streamlined processing from raw responses to detailed performance metrics
- Multi-Model Support: Simultaneous evaluation of multiple AI models with comparative analysis
- Scalable Architecture: Efficient processing of large datasets with parallel execution capabilities
## Key Features

- 🏥 Medical-Specific Evaluation: Specialized rubrics for clinical scenarios with positive/negative scoring
- 🔄 Parallel Processing: Simultaneous execution of Met/Not Met review and irrelevant content detection
- 📊 Comprehensive Analytics: Detailed statistical analysis with visualization reports
- 🎯 Multi-Model Assessment: Support for evaluating multiple AI models simultaneously
- ⚙️ Flexible Configuration: Customizable models, voting strategies, and evaluation parameters
- 📈 Rich Visualization: Automated generation of performance charts and comparative analysis
- 🔧 Modular Design: Independent modules for different evaluation stages
- 📋 Standardized Output: Consistent Excel-based reporting with detailed metrics
## Quick Start

```bash
# Clone the repository
git clone <repository-url>
cd medical-ai-bench-eval

# Install dependencies
pip install -r requirements.txt
```

```bash
# Run complete evaluation pipeline
python medical_evaluation_pipeline.py input_data.xlsx

# With custom output file
python medical_evaluation_pipeline.py input_data.xlsx -o results/evaluation_results.xlsx

# Enable verbose logging
python medical_evaluation_pipeline.py input_data.xlsx -v
```

## System Architecture

The system processes medical AI responses through a pipeline that includes:
- Input Processing: Reads Excel files containing medical questions, evaluation rubrics, and AI model responses
- Parallel Evaluation: Simultaneously executes Met/Not Met review and irrelevant content detection
- Intelligent Scoring: Calculates comprehensive scores based on clinical evaluation criteria
- Analysis & Reporting: Generates detailed statistical reports and visualizations
```
Input Excel File
        ↓
┌────────────────────────────────────────────────────────┐
│               Parallel Processing Phase                │
├──────────────────────────┬─────────────────────────────┤
│ Step 1: Met Review       │ Step 2: Irrelevant Content  │
│ - Multi-model review     │ - Content extraction        │
│ - Voting decision        │ - Level assessment          │
│ - Result summary         │ - Voting classification     │
└──────────────────────────┴─────────────────────────────┘
        ↓
┌────────────────────────────────────────────────────────┐
│                     Result Merging                     │
│ - Intelligent merging of parallel results              │
│ - Data integrity check                                 │
└────────────────────────────────────────────────────────┘
        ↓
┌────────────────────────────────────────────────────────┐
│               Step 3: Score Calculation                │
│ - Multi-dimensional score statistics                   │
│ - Irrelevant content deduction                         │
│ - Normalization processing                             │
└────────────────────────────────────────────────────────┘
        ↓
┌────────────────────────────────────────────────────────┐
│          Step 4: Data Analysis & Visualization         │
│ - Statistical analysis reports                         │
│ - Diverse charts                                       │
│ - CSV data export                                      │
└────────────────────────────────────────────────────────┘
        ↓
Output Excel File + Analysis Report
```
## Data Format

The system accepts Excel files with the following structure:
| Column Name | Description | Example |
|---|---|---|
| question | Medical question | "Patient presents with chest pain symptoms, how to diagnose?" |
| final_merged_json | Evaluation points JSON | [{"id":1,"claim":"Need to inquire about symptoms","level":"A1"}] |
| gpt_5_answer | GPT model response | "First need to ask the patient about symptoms in detail..." |
| gemini_2_5_pro_answer | Gemini model response | "Recommend performing ECG examination..." |
| claude_opus_4_answer | Claude model response | "Should consider acute coronary syndrome..." |
```json
[
  {
    "id": 1,
    "claim": "Need to ask about symptom duration",
    "level": "A1",
    "desc": "Detailed inquiry about chest pain duration, nature, etc."
  },
  {
    "id": 2,
    "claim": "Should avoid mentioning unrelated treatment plans",
    "level": "S2",
    "desc": "Should not mention treatments unrelated to chest pain"
  }
]
```

Level Description:
- A1-A3: Positive points (A1=5 points, A2=3 points, A3=1 point)
- S1-S4: Negative points (S1=-1 point, S2=-2 points, S3=-3 points, S4=-4 points)
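To make the level-to-points mapping concrete, here is a minimal scoring sketch. The `score_answer` helper and its `met_ids` argument (standing in for the Met/Not Met review output) are illustrative assumptions, not the actual `clinical_scoring_calculator.py` implementation, whose exact formula may differ.

```python
import json

# Point values for each rubric level, as defined in the scoring system above.
LEVEL_POINTS = {"A1": 5, "A2": 3, "A3": 1, "S1": -1, "S2": -2, "S3": -3, "S4": -4}

def score_answer(rubric_json: str, met_ids: set[int]) -> dict:
    """Score one answer given the rubric JSON and the IDs judged 'Met'."""
    points = json.loads(rubric_json)
    # max_possible counts only positive (A-class) points.
    max_possible = sum(LEVEL_POINTS[p["level"]]
                       for p in points if p["level"].startswith("A"))
    # Met items contribute their level's points (negative for S-class).
    total = sum(LEVEL_POINTS[p["level"]] for p in points if p["id"] in met_ids)
    return {
        "max_possible": max_possible,
        "final_total_score": total,
        "normalized": max(total, 0) / max_possible if max_possible else 0.0,
    }

rubric = '[{"id": 1, "claim": "Ask symptom duration", "level": "A1"}, ' \
         '{"id": 2, "claim": "Unrelated treatment mentioned", "level": "S2"}]'
print(score_answer(rubric, {1}))  # → {'max_possible': 5, 'final_total_score': 5, 'normalized': 1.0}
```

If the S2 item were also triggered (`met_ids={1, 2}`), the total would drop to 3 and the normalized score to 0.6.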
The system generates multiple output files:

1. Final Score Excel (`processed_result_final_YYYYMMDD_HHMMSS.xlsx`)
   - Contains all original data
   - Met/Not Met review results
   - Irrelevant content detection results
   - Detailed scoring statistics
2. Data Analysis Report (`data/output/analysis/`)
   - `medical_evaluation_report_YYYYMMDD_HHMMSS.png` - visualization charts
   - `medical_analysis_report_YYYYMMDD_HHMMSS.txt` - detailed analysis report
   - `model_performance_summary_YYYYMMDD_HHMMSS.csv` - performance summary
| Metric | Description |
|---|---|
| max_possible | Theoretical maximum score (sum of all positive points) |
| final_total_score | Actual score (Met items score - irrelevant content deduction) |
| normalized | Normalized score (between 0-1) |
| positive_total_count | Number of positive point hits |
| rubric_total_count | Number of negative point hits |
| irrelevant_total_count | Total irrelevant content count |
## Configuration

System requirements:

- Python: 3.12+
- Operating System: Windows, macOS, Linux
- Core Dependencies: pandas, numpy, matplotlib, seaborn, asyncio, langchain, openpyxl, xlsxwriter
1. Clone the repository:

```bash
git clone <repository-url>
cd medical-ai-bench-eval
```

2. Install dependencies:

```bash
pip install -r requirements.txt
```

3. Configure API keys (if using external AI model services):

Linux/macOS:

```bash
# OpenAI
export OPENAI_API_KEY="sk-your-actual-key"
export OPENAI_BASE_URL="https://api.openai.com/v1"

# Claude (Anthropic)
export ANTHROPIC_API_KEY="sk-ant-your-actual-key"
export ANTHROPIC_BASE_URL="https://api.anthropic.com"

# Google Gemini
export GOOGLE_API_KEY="your-google-api-key"
export GOOGLE_BASE_URL="https://generativelanguage.googleapis.com/v1beta"

# Moonshot Kimi
export MOONSHOT_API_KEY="sk-your-moonshot-key"
export MOONSHOT_BASE_URL="https://api.moonshot.cn/v1"

# Alibaba Qwen
export DASHSCOPE_API_KEY="your-dashscope-key"
export DASHSCOPE_BASE_URL="https://dashscope.aliyuncs.com/api/v1"

# Baichuan
export BAICHUAN_API_KEY="your-baichuan-key"
export BAICHUAN_BASE_URL="https://api.baichuan-ai.com/v1"

# DeepSeek
export DEEPSEEK_API_KEY="sk-your-deepseek-key"
export DEEPSEEK_BASE_URL="https://api.deepseek.com/v1"

# Zhipu ChatGLM
export ZHIPU_API_KEY="your-zhipu-key"
export ZHIPU_BASE_URL="https://open.bigmodel.cn/api/paas/v4"
```

Windows PowerShell:

```powershell
$Env:OPENAI_API_KEY="sk-your-actual-key"
$Env:OPENAI_BASE_URL="https://api.openai.com/v1"
```
⚠️ IMPORTANT: OpenAI-Compatible API Only

This system currently supports ONLY OpenAI-compatible API interfaces. All models must provide OpenAI-compatible endpoints, regardless of the actual provider.
🌐 Third-Party Multi-Model Platforms (Recommended)
For models that don't natively support OpenAI-compatible APIs (like Claude, Gemini), you can use third-party platforms that provide unified OpenAI-compatible access:
- OpenRouter - Access Claude, Gemini, GPT, and many other models via OpenAI-compatible API
- Together AI - Various open-source and commercial models
- Anyscale - Multiple LLMs via OpenAI-compatible endpoints
Edit `config_model.yaml` to configure your preferred models:

```yaml
models:
  m1:  # OpenAI GPT-4 (Native)
    api_key_env: "OPENAI_API_KEY"
    api_base_env: "OPENAI_BASE_URL"
    model: "gpt-4-turbo-preview"
    temperature: 0.2
    max_tokens: 4000
  m2:  # Kimi (OpenAI-Compatible)
    api_key_env: "MOONSHOT_API_KEY"
    api_base_env: "MOONSHOT_BASE_URL"
    model: "kimi-latest"
    temperature: 0.2
    max_tokens: 4000
  m3:  # Qwen (OpenAI-Compatible)
    api_key_env: "DASHSCOPE_API_KEY"
    api_base_env: "DASHSCOPE_BASE_URL"
    model: "qwen-max"
    temperature: 0.1
    max_tokens: 4000
  m4:  # DeepSeek (OpenAI-Compatible)
    api_key_env: "DEEPSEEK_API_KEY"
    api_base_env: "DEEPSEEK_BASE_URL"
    model: "deepseek-chat"
    temperature: 0.2
    max_tokens: 4000
  m5:  # Local model (via OpenAI-compatible server)
    api_key_env: "LOCAL_API_KEY"
    api_base_env: "LOCAL_BASE_URL"
    model: "llama3-medical"
    temperature: 0.2
    max_tokens: 4000

  # OpenRouter examples (uncomment to use)
  # m6:  # Claude via OpenRouter
  #   api_key_env: "OPENROUTER_API_KEY"
  #   api_base_env: "OPENROUTER_BASE_URL"  # https://openrouter.ai/api/v1
  #   model: "anthropic/claude-3-5-sonnet-20241022"
  #   temperature: 0.2
  #   max_tokens: 4000
  # m7:  # Gemini via OpenRouter
  #   api_key_env: "OPENROUTER_API_KEY"
  #   api_base_env: "OPENROUTER_BASE_URL"  # https://openrouter.ai/api/v1
  #   model: "google/gemini-pro-1.5"
  #   temperature: 0.1
  #   max_tokens: 4000
```
| Provider | Status | API Type | Notes |
|---|---|---|---|
| OpenAI | ✅ Supported | Native OpenAI | Direct support |
| OpenRouter | ✅ Supported | Multi-Model Platform | Access Claude, Gemini, GPT via OpenAI-compatible API |
| Together AI | ✅ Supported | Multi-Model Platform | Various models via OpenAI-compatible API |
| Anyscale | ✅ Supported | Multi-Model Platform | Multiple LLMs via OpenAI-compatible API |
| Kimi/Moonshot | ✅ Supported | OpenAI-Compatible | Via compatible endpoint |
| Qwen/DashScope | ✅ Supported | OpenAI-Compatible | Via compatible endpoint |
| DeepSeek | ✅ Supported | OpenAI-Compatible | Via compatible endpoint |
| Baichuan | ✅ Supported | OpenAI-Compatible | Via compatible endpoint |
| Zhipu | ✅ Supported | OpenAI-Compatible | Via compatible endpoint |
| Claude/Anthropic | ⚠️ Via OpenRouter | Native Anthropic API | Use OpenRouter for OpenAI-compatible access |
| Gemini/Google | ⚠️ Via OpenRouter | Native Google API | Use OpenRouter for OpenAI-compatible access |

| Use Case | Recommended Models | Temperature | Notes |
|---|---|---|---|
| Diagnostic Tasks | gpt-4-turbo, deepseek-chat | 0.1-0.2 | High accuracy |
| Chinese Medical | qwen-max, baichuan-medical | 0.2-0.3 | Better CJK support |
| Cost-effective | deepseek-chat, kimi | 0.2-0.3 | Good balance |
| Real-time | kimi, deepseek-chat | 0.1-0.2 | Fast response |
| Privacy-focused | local-llama3-medical | 0.2-0.3 | On-premise via OpenAI-compatible server |

For GPT-4 testing:
```bash
export OPENAI_API_KEY="your-key-here"

# Run with GPT-4
python medical_evaluation_pipeline.py input.xlsx --judge-models m1
```
For Chinese medical models:
```bash
export DASHSCOPE_API_KEY="your-key-here"

# Run with Qwen
python medical_evaluation_pipeline.py input.xlsx --judge-models m3
```
For mixed model evaluation:
```bash
export OPENAI_API_KEY="your-key"
export DASHSCOPE_API_KEY="your-key"

# Use GPT-4 and Qwen for evaluation
python medical_evaluation_pipeline.py input.xlsx \
    --judge-models m1 m3 \
    --extract-model m5
```
## Usage Guide

1. Prepare input data:
   - Excel file format
   - Contains columns for questions, scoring criteria (JSON format), model responses, etc.
   - Default column names can be configured in `config.yaml`
2. Run the complete evaluation pipeline:

   ```bash
   python medical_evaluation_pipeline.py input.xlsx -o output.xlsx
   ```

3. View results:
   - The output Excel file contains detailed review results and scores
   - Data analysis reports are saved in the `data/output/analysis/` directory
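Before launching a long run, it can help to sanity-check the input rows. The sketch below assumes the default column names shown in the Data Format section; `check_row` is a hypothetical pre-flight helper, not part of the pipeline.

```python
import json

# Default column names (question, rubric JSON, and the three answer columns).
REQUIRED = ["question", "final_merged_json",
            "gpt_5_answer", "gemini_2_5_pro_answer", "claude_opus_4_answer"]

def check_row(row: dict) -> list[str]:
    """Return a list of problems found in one input row (empty = OK)."""
    issues = [f"missing column: {c}" for c in REQUIRED if not row.get(c)]
    try:
        points = json.loads(row.get("final_merged_json") or "null")
        if not isinstance(points, list):
            issues.append("final_merged_json is not a JSON array")
    except json.JSONDecodeError:
        issues.append("final_merged_json is not valid JSON")
    return issues

row = {"question": "Chest pain, how to diagnose?",
       "final_merged_json": '[{"id": 1, "claim": "Ask symptoms", "level": "A1"}]',
       "gpt_5_answer": "...", "gemini_2_5_pro_answer": "...",
       "claude_opus_4_answer": "..."}
print(check_row(row))  # → []
```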
```yaml
# File path configuration
files:
  input_file: "data/medical_evaluation_result.xlsx"
  output_file: "data/output/processed_result.xlsx"

# Process control
pipeline:
  disable_met_nomet: false              # Whether to disable Met review
  disable_irrelevant_extraction: false  # Whether to disable irrelevant content detection
  exclude_irrelevant_scoring: false     # Whether to exclude irrelevant content from scoring
  disable_data_analysis: false          # Whether to disable data analysis
  analysis_config_file: "config_visualization.yaml"

# Model configuration
models:
  judge_models: ["m1", "m2", "m3"]  # Met review models
  extract_model: "m5"               # Irrelevant content extraction model
  grade_models: ["m1", "m2", "m3"]  # Irrelevant content grading models
  voting_strategy: "conservative"   # Voting strategy

# Column name mapping
columns:
  question_col: "question"
  rubric_col: "final_merged_json"
  answer_columns:
    - "gpt_5_answer"
    - "gemini_2_5_pro_answer"
    - "claude_opus_4_answer"

# Scoring system
point:
  A1: 5
  A2: 3
  A3: 1
  S1: -1
  S2: -2
  S3: -3
  S4: -4
```
```bash
# Process control
--disable-met-nomet           # Skip Met review step
--disable-irrelevant          # Skip irrelevant content detection
--exclude-irrelevant-scoring  # Exclude irrelevant content deduction from scoring
--disable-data-analysis       # Skip data analysis

# Model configuration
--judge-models m1 m2          # Specify Met review models
--extract-model m3            # Specify irrelevant content extraction model
--grade-models m1 m2 m4       # Specify irrelevant content grading models
--voting-strategy majority    # Voting strategy: conservative/majority/average

# Data column configuration
--question-col "Question"             # Question column name
--rubric-col "Evaluation Points"      # Evaluation points column name
--answer-columns "Response1" "Response2" "Response3"  # Model response column names
```

## Evaluation Framework

The GAPS evaluation framework employs a multi-dimensional scoring system that combines positive and negative points for comprehensive assessment:
Positive Points (A-class):
- A1 (5 points): Critical medical knowledge affecting patient safety
- A2 (3 points): Important clinical considerations
- A3 (1 point): Additional relevant information
Negative Points (S-class):
- S1 (-1 point): Minor inaccuracies not affecting core treatment
- S2 (-2 points): Incorrect information that could mislead
- S3 (-3 points): Serious medical errors
- S4 (-4 points): Dangerous misinformation that could harm patients
The system provides various evaluation metrics to comprehensively measure AI model performance:
| Metric | Description |
|---|---|
| max_possible | Theoretical maximum score (sum of all positive points) |
| final_total_score | Actual score (Met items score - irrelevant content deduction) |
| normalized | Normalized score (between 0-1) |
| positive_total_count | Number of positive point hits |
| rubric_total_count | Number of negative point hits |
| irrelevant_total_count | Total irrelevant content count |
This evaluation examines how consistently multiple large language models agree with expert annotations on specialized-domain key-point extraction tasks. By comprehensively comparing three mainstream models and their fusion system against independent annotations from five domain experts, we reached the following key conclusions:
The fusion model achieved an overall consistency rate of 90.00% with the expert consensus and a Cohen's Kappa coefficient of 0.77, which falls in the "substantial agreement" range (Landis & Koch) and approaches the "almost perfect" threshold. This level lies within the inter-expert consistency range (88.50%–92.00%), indicating that the system's clinical judgments approach the reliability of human experts.
## Dataset

We provide a specialized Thoracic Surgery Evaluation Dataset (`data/input/GAPS-NSCLC-preview.xlsx`) containing 92 carefully curated clinical cases for evaluating AI models' performance in thoracic surgery scenarios. The dataset focuses on complex clinical decision-making in chest surgery, particularly non-small cell lung cancer (NSCLC) staging and treatment planning.
| Column Name | Description | Data Type |
|---|---|---|
| question | Clinical questions covering thoracic surgery scenarios | Text |
| 分类 (Category) | Medical specialty classification | Text |
| rubrics | Evaluation criteria in JSON format with scoring levels | JSON Array |
| gpt_5_answer | GPT model responses to clinical questions | Text |
| gemini_2_5_pro_answer | Gemini 2.5 Pro model responses | Text |
| claude_opus_4_answer | Claude Opus model responses | Text |
The dataset covers critical aspects of thoracic surgery, including:
- Pre-operative Evaluation: Comprehensive assessment protocols for NSCLC patients (IIB-IIIA staging)
- Diagnostic Procedures: EBUS-TBNA, mediastinoscopy, PET-CT interpretation
- Staging Assessment: TNM staging, mediastinal lymph node evaluation
- Treatment Planning: Surgical vs. non-surgical approaches, neoadjuvant therapy decisions
- Risk Assessment: Pulmonary function evaluation, cardiac risk stratification
- Molecular Diagnostics: EGFR, ALK, PD-L1 testing strategies
Each question includes detailed evaluation criteria in JSON format:
```json
[
  {
    "id": 1,
    "claim": "Should perform enhanced chest CT covering thoracic inlet to upper abdomen",
    "level": "A1",
    "type": "positive"
  },
  {
    "id": 2,
    "claim": "PET-CT is essential for accurate N-staging and excluding distant metastases",
    "level": "A2",
    "type": "positive"
  },
  {
    "id": 3,
    "claim": "Incorrectly generalizing superior sulcus tumor MRI requirements to all IIB/IIIA patients",
    "level": "S2",
    "type": "negative"
  }
]
```

Scoring Levels:
- A1 (5 points): Critical medical knowledge affecting patient safety
- A2 (3 points): Important clinical considerations
- A3 (1 point): Additional relevant information
- S1 (-1 point): Minor inaccuracies not affecting core treatment
- S2 (-2 points): Incorrect information that could mislead
- S3 (-3 points): Serious medical errors
- S4 (-4 points): Dangerous misinformation that could harm patients
Question: "For NSCLC patients with IIB-IIIA staging, what comprehensive pre-treatment evaluation should be performed?"
Key Evaluation Points:
- Pathological diagnosis and molecular characterization
- Accurate clinical staging (T, N, M assessment)
- Invasive mediastinal staging for N2 evaluation
- Pulmonary and cardiac function assessment
- Multidisciplinary team (MDT) treatment planning
This dataset is designed to work seamlessly with the GAPS evaluation pipeline:
```bash
# Evaluate thoracic surgery dataset
python medical_evaluation_pipeline.py data/input/GAPS-NSCLC-preview.xlsx \
    --judge-models m1 m2 m3 \
    --voting-strategy conservative \
    -o results/thoracic_surgery_evaluation.xlsx
```

The dataset has been validated by:
- Board-certified thoracic surgeons
- Pulmonary oncologists
- Medical imaging specialists
- Pathologists specializing in lung cancer
All evaluation criteria are based on current clinical guidelines including:
- NCCN Guidelines for Non-Small Cell Lung Cancer
- ESTS Guidelines for Intraoperative Lymph Node Staging
- IASLC Staging Manual in Thoracic Oncology
This dataset enables research in:
- AI Clinical Decision Support: Evaluating AI models' ability to provide accurate clinical recommendations
- Medical Education: Training and assessment of clinical reasoning skills
- Quality Assurance: Benchmarking AI systems against established clinical standards
- Comparative Analysis: Cross-model performance evaluation in specialized medical domains
- Total Cases: 92 clinical scenarios
- Completeness: 100% data coverage across all columns
- Clinical Diversity: Covers full spectrum of IIB-IIIA NSCLC presentations
- Expert Validation: All cases reviewed by multidisciplinary clinical team
## Advanced Features

The system supports parallel execution of Step 1 (Met Review) and Step 2 (Irrelevant Content Detection):
```bash
# Enable all steps (auto parallel)
python medical_evaluation_pipeline.py input.xlsx

# Enable only Met Review
python medical_evaluation_pipeline.py input.xlsx --disable-irrelevant

# Enable only Irrelevant Content Detection
python medical_evaluation_pipeline.py input.xlsx --disable-met-nomet
```

Three voting strategies are supported for combining multi-model results:
- conservative: Conservative strategy, prioritizes more severe ratings
- majority: Majority voting, selects the result with the most votes
- average: Average strategy, calculates based on numerical average
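To make the strategies concrete, here is a minimal sketch of how three judge verdicts might be combined. The numeric severity ordering, the `"none"` grade, and the `vote` helper are illustrative assumptions, not the pipeline's internal implementation.

```python
from collections import Counter

# Assumed ordering for comparing severity grades (S4 = most severe).
SEVERITY = {"none": 0, "S1": 1, "S2": 2, "S3": 3, "S4": 4}

def vote(grades: list[str], strategy: str = "conservative") -> str:
    if strategy == "conservative":
        # Prioritize the most severe rating among the models.
        return max(grades, key=lambda g: SEVERITY[g])
    if strategy == "majority":
        # Pick the grade with the most votes (ties broken by severity).
        counts = Counter(grades)
        best = max(counts.values())
        tied = [g for g, c in counts.items() if c == best]
        return max(tied, key=lambda g: SEVERITY[g])
    if strategy == "average":
        # Round the mean severity back to the nearest grade.
        mean = round(sum(SEVERITY[g] for g in grades) / len(grades))
        return {v: k for k, v in SEVERITY.items()}[mean]
    raise ValueError(f"unknown strategy: {strategy}")

print(vote(["S1", "S1", "S4"], "conservative"))  # → S4
print(vote(["S1", "S1", "S4"], "majority"))      # → S1
```

On the same verdicts, `"average"` would yield S2 (mean severity 2.0), illustrating how the strategy choice changes the final grade.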
```bash
python medical_evaluation_pipeline.py input.xlsx --voting-strategy conservative
```

```bash
# Use specific review model combinations
python medical_evaluation_pipeline.py input.xlsx \
    --judge-models m1 m2 \
    --extract-model m5 \
    --grade-models m1 m3 m4
```

Create `config_visualization.yaml` for custom analysis configuration:
```yaml
analysis:
  input_file: "data/processed_result.xlsx"
  columns:
    model_scores:
      GPT-5: "gpt_5_answer_judged_json_scores"
      Gemini-2.5-Pro: "gemini_2_5_pro_answer_judged_json_scores"
      Claude-Opus-4: "claude_opus_4_answer_judged_json_scores"

visualization:
  charts:
    bar_overall: true
    boxplot_distribution: true
    heatmap_category: true
    radar_chart: true
  style:
    figure_size: [24, 16]
    dpi: 300
```

Run the Met/Not Met review module directly:

```python
import asyncio
from clinical_answer_judge import MultiModelMergedJsonJudge

async def run_judge():
    judge = MultiModelMergedJsonJudge(
        judge_models=['m1', 'm2', 'm3'],
        question_col='question',
        rubric_col='rubrics'
    )
    result = await judge.process_excel_file('input.xlsx', 'output.xlsx')
    return result

# Run the async function
result = asyncio.run(run_judge())
```

Calculate scores from judged data:

```python
from clinical_scoring_calculator import process_excel_data

result = process_excel_data(
    input_file='judged_data.xlsx',
    output_file='scored_data.xlsx',
    model_columns=['gpt_5_answer_judged_json']
)
```

Generate the analysis report:

```python
from enhanced_analyzer import EnhancedMedicalAnalyzer

analyzer = EnhancedMedicalAnalyzer('config_visualization.yaml')
analyzer.run_analysis()
```

```bash
# Process complete medical evaluation data
python medical_evaluation_pipeline.py medical_data.xlsx \
    -o results/final_report.xlsx \
    --judge-models m1 m2 m3 \
    --voting-strategy conservative \
    -v
```

```bash
# Process only the first 10 rows of data
python medical_evaluation_pipeline.py test_data.xlsx \
    --max_rows 10 \
    --disable-data-analysis
```

```bash
# Use custom column names
python medical_evaluation_pipeline.py data.xlsx \
    --question-col "Clinical Question" \
    --rubric-col "Scoring Criteria" \
    --answer-columns "GPT Response" "Claude Response"
```
1. File Not Found Error
   - Solution: Check that the input file path is correct
2. Column Name Mismatch
   - Solution: Use `--question-col`, `--rubric-col`, and related parameters to specify the correct column names
3. Insufficient Memory
   - Solution: Use the `--max_rows` parameter to limit the number of rows processed
4. API Call Failure
   - Solution: Check the model configuration and network connection

```bash
# Enable verbose logging to view processing details
python medical_evaluation_pipeline.py input.xlsx -v

# Keep temporary files for debugging
# Modify the cleanup() call in the code
```

Core modules:

- `medical_evaluation_pipeline.py` - Main pipeline controller
- `clinical_answer_judge.py` - Met/Not Met review module
- `clinical_scoring_calculator.py` - Score calculation module
- `enhanced_analyzer.py` - Data analysis and visualization module
- `general_negative_points.py` - Irrelevant content processing module
```
MedicalAiBenchEval/
├── clinical_answer_judge.py         # Multi-model review processor
├── clinical_scoring_calculator.py   # Score calculator
├── medical_evaluation_pipeline.py   # Main pipeline controller
├── enhanced_analyzer.py             # Data analyzer
├── judge_engine.py                  # Review engine
├── irrelevant_content_grading.py    # Irrelevant content scoring
├── extract_irrelevant_content.py    # Irrelevant content extraction
├── general_negative_points.py       # Negative point processing
├── config.yaml                      # Main configuration file
├── config_model.yaml                # Model configuration file
├── config_visualization.yaml        # Visualization configuration
├── requirements.txt                 # Dependency list
└── data/                            # Data directory
    ├── input/                       # Input files
    │   ├── GAPS-NSCLC-preview.xlsx        # Thoracic surgery evaluation dataset (92 cases)
    │   └── community_contributions.xlsx   # Community-contributed cases
    └── output/                      # Output files
        └── analysis/                # Analysis reports
```
- Input Phase: Read Excel file, validate data format
- Parallel Processing: Simultaneously execute Met review and irrelevant content detection
- Result Merging: Intelligently merge outputs from both parallel steps
- Score Calculation: Calculate multi-dimensional scores based on merged results
- Analysis Output: Generate statistical reports and visualization charts
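The parallel phase (steps 1-2 above) can be sketched with `asyncio.gather`. The coroutine names below are hypothetical placeholders for the pipeline's real workers.

```python
import asyncio

# Hypothetical stand-ins for the pipeline's Step 1 and Step 2 workers.
async def run_met_review(rows):
    await asyncio.sleep(0)  # placeholder for judge-model API calls
    return {"met_results": len(rows)}

async def run_irrelevant_detection(rows):
    await asyncio.sleep(0)  # placeholder for extraction/grading calls
    return {"irrelevant_results": len(rows)}

async def parallel_phase(rows):
    # Steps 1 and 2 run concurrently; their outputs are merged afterwards.
    met, irrelevant = await asyncio.gather(
        run_met_review(rows), run_irrelevant_detection(rows)
    )
    return {**met, **irrelevant}

merged = asyncio.run(parallel_phase([{"question": "q1"}]))
print(merged)  # → {'met_results': 1, 'irrelevant_results': 1}
```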
- Asynchronous parallel processing improves efficiency
- Batch processing reduces API call frequency
- Intelligent retry mechanism handles network exceptions
- Memory optimization handles large files
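The retry mechanism can be illustrated with a generic exponential-backoff wrapper; `call_with_retry` is a sketch under assumed defaults, not the pipeline's actual implementation.

```python
import asyncio
import random

async def call_with_retry(coro_fn, retries=3, base_delay=0.01):
    """Retry an async call with exponential backoff and jitter (illustrative)."""
    for attempt in range(retries):
        try:
            return await coro_fn()
        except ConnectionError:
            if attempt == retries - 1:
                raise  # out of retries: propagate the error
            # Wait longer after each failed attempt before retrying.
            await asyncio.sleep(base_delay * (2 ** attempt) * (1 + random.random()))

calls = {"n": 0}

async def flaky():
    # Simulated API call that fails twice, then succeeds.
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("transient network error")
    return "ok"

print(asyncio.run(call_with_retry(flaky)))  # → ok
```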
## Development

Issues and Pull Requests that improve the system are welcome.
```bash
git clone <repository>
cd medical-ai-bench-eval
pip install -r requirements.txt
```

- Follow PEP 8 coding standards
- Add detailed docstrings
- Include unit tests
- Update README documentation
If you use this system in your research, please cite our paper:
Xiuyuan Chen, Tao Sun, Dexin Su, Ailing Yu, Junwei Liu, Zhe Chen, Gangzeng Jin, Xin Wang, Jingnan Liu, Hansong Xiao, Hualei Zhou, Dongjie Tao, Chunxiao Guo, Minghui Yang, Yuan Xia, Jing Zhao, Qianrui Fan, Yanyun Wang, Shuai Zhen, Kezhong Chen, Jun Wang, Zewen Sun, Heng Zhao, Tian Guan, Shaodong Wang, Geyun Chang, Jiaming Deng, Hongchengcheng Chen, Kexin Feng, Ruzhen Li, Jiayi Geng, Changtai Zhao, Jun Wang, Guihu Lin, Peihao Li, Liqi Liu, Peng Wei, Jian Wang, Jinjie Gu, Ping Wang, Fan Yang (2025). GAPS: A Clinically Grounded, Automated Benchmark for Evaluating AI Clinicians. arXiv preprint arXiv:2510.13734. https://arxiv.org/abs/2510.13734
```bibtex
@article{chen2025gaps,
  title={GAPS: A Clinically Grounded, Automated Benchmark for Evaluating AI Clinicians},
  author={Chen, Xiuyuan and Sun, Tao and Su, Dexin and Yu, Ailing and Liu, Junwei and Chen, Zhe and Jin, Gangzeng and Wang, Xin and Liu, Jingnan and Xiao, Hansong and Zhou, Hualei and Tao, Dongjie and Guo, Chunxiao and Yang, Minghui and Xia, Yuan and Zhao, Jing and Fan, Qianrui and Wang, Yanyun and Zhen, Shuai and Chen, Kezhong and Wang, Jun and Sun, Zewen and Zhao, Heng and Guan, Tian and Wang, Shaodong and Chang, Geyun and Deng, Jiaming and Chen, Hongchengcheng and Feng, Kexin and Li, Ruzhen and Geng, Jiayi and Zhao, Changtai and Wang, Jun and Lin, Guihu and Li, Peihao and Liu, Liqi and Wei, Peng and Wang, Jian and Gu, Jinjie and Wang, Ping and Yang, Fan},
  journal={arXiv preprint arXiv:2510.13734},
  year={2025},
  url={https://arxiv.org/abs/2510.13734}
}
```

## Community

We welcome contributions from the medical AI community to expand and improve our evaluation dataset! Your expertise can help build a more comprehensive benchmark for medical AI systems.
#### Option 1: Open an Issue

Create a new issue with the label `data-contribution` using the following template:
**Medical Question**
[Provide your clinical question here]
**Evaluation Rubrics (JSON Format)**
```json
[
{
"id": 1,
"claim": "Your evaluation criterion description",
"level": "A1|A2|A3|S1|S2|S3|S4",
"type": "positive|negative"
},
{
"id": 2,
"claim": "Another evaluation criterion",
"level": "A2",
"type": "positive"
}
]
```

**Medical Specialty**
[e.g., Cardiology, Pulmonology, Oncology, etc.]

**Question Type**
[e.g., Diagnosis, Treatment, Risk Assessment, etc.]

**Additional Context**
[Any relevant clinical context or references]
#### Option 2: Submit a Pull Request
1. Fork this repository
2. Add your data to `data/input/community_contributions.xlsx` following the existing format:
- **question**: Your medical question
- **rubrics**: JSON array of evaluation criteria
- **specialty**: Medical specialty area
- **difficulty**: Easy/Medium/Hard
- **source**: Your name/institution (optional)
3. Ensure your contribution follows these guidelines:
- **Clinical Relevance**: Questions should reflect real-world medical scenarios
- **Evidence-Based**: Evaluation criteria should be based on established medical guidelines
- **Clear Scoring**: Use our scoring system appropriately:
- **Positive Points**: A1 (5 pts), A2 (3 pts), A3 (1 pt)
- **Negative Points**: S1 (-1 pt), S2 (-2 pts), S3 (-3 pts), S4 (-4 pts)
### 📝 Contribution Guidelines
#### Evaluation Criteria Design
- **A1 (Critical, 5 points)**: Essential medical knowledge that could impact patient safety
- **A2 (Important, 3 points)**: Significant clinical considerations
- **A3 (Helpful, 1 point)**: Additional relevant information
- **S1 (Minor Error, -1 point)**: Small inaccuracies that don't affect core treatment
- **S2 (Moderate Error, -2 points)**: Incorrect information that could mislead
- **S3 (Major Error, -3 points)**: Serious medical errors
- **S4 (Critical Error, -4 points)**: Dangerous misinformation that could harm patients
#### Quality Standards
- Questions should be **clinically realistic** and based on actual medical scenarios
- Evaluation criteria must be **evidence-based** and reference current medical guidelines
- Each question should have **3-8 evaluation points** for balanced assessment
- Include both **positive criteria** (what should be mentioned) and **negative criteria** (what should be avoided)
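The quality standards above can be checked mechanically before submitting. `validate_rubric` is a hypothetical helper for contributors, not part of the evaluation pipeline.

```python
# Levels accepted by the scoring system described above.
VALID_LEVELS = {"A1", "A2", "A3", "S1", "S2", "S3", "S4"}

def validate_rubric(points: list[dict]) -> list[str]:
    """Return a list of problems with a contributed rubric (empty = OK)."""
    problems = []
    if not 3 <= len(points) <= 8:
        problems.append("rubric should have 3-8 evaluation points")
    if any(p["level"] not in VALID_LEVELS for p in points):
        problems.append("unknown scoring level")
    # Require at least one positive (A) and one negative (S) criterion.
    prefixes = {p["level"][0] for p in points}
    if "A" not in prefixes or "S" not in prefixes:
        problems.append("include both positive (A) and negative (S) criteria")
    return problems

contribution = [
    {"id": 1, "claim": "Should obtain detailed history", "level": "A1", "type": "positive"},
    {"id": 2, "claim": "Must perform ECG", "level": "A1", "type": "positive"},
    {"id": 3, "claim": "Should not delay evaluation", "level": "S2", "type": "negative"},
]
print(validate_rubric(contribution))  # → []
```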
### 🎯 Priority Areas
We especially welcome contributions in:
- **Rare Diseases**: Uncommon conditions that challenge AI systems
- **Multidisciplinary Cases**: Complex cases requiring multiple specialties
- **Cultural Considerations**: Cases reflecting diverse patient populations
- **Emergency Medicine**: Time-critical decision-making scenarios
- **Pediatric Medicine**: Age-specific medical considerations
### 📊 Data Format Example
```json
{
"question": "A 65-year-old patient presents with chest pain and shortness of breath. How would you approach the initial evaluation?",
"rubrics": [
{
"id": 1,
"claim": "Should obtain detailed history including pain characteristics, timing, and associated symptoms",
"level": "A1",
"type": "positive"
},
{
"id": 2,
"claim": "Must perform ECG and chest X-ray as initial diagnostic tests",
"level": "A1",
"type": "positive"
},
{
"id": 3,
"claim": "Should not delay evaluation with unnecessary tests in acute presentation",
"level": "S2",
"type": "negative"
}
],
"specialty": "Emergency Medicine",
"difficulty": "Medium"
}
```
### Review Process

- Community Review: Contributions will be reviewed by medical professionals
- Quality Check: Ensure clinical accuracy and appropriate difficulty
- Integration: Approved contributions will be merged into the main dataset
- Attribution: Contributors will be acknowledged in our contributor list
For questions about contributing, please:

- Open an issue with the `question` label
- Join our discussion in the GitHub Discussions section
Thank you for helping build a better benchmark for medical AI evaluation! 🙏
## License

This project is licensed under the MIT License - see the LICENSE file for details.
🤗 HuggingFace Dataset | 📄 arXiv | 🌐 中文