Evaluation pipeline for MicroWorldBench (MWBench) — a benchmark for assessing AI-generated microscopic world videos across three dimensions: Scientific Accuracy, Visual Quality, and Instruction Following.
```
.
├── video_evaluator.py        # Step 1 — extract frames & call the judge model
├── calculate_scores.py       # Step 2 — aggregate scores & generate report
├── MWBenchRubrics.json       # Per-task rubrics used by video_evaluator.py
├── MWBenchRubrics_norm.json  # Weighted rubrics used by calculate_scores.py
├── finaltasks.json           # Task metadata (category labels, prompts)
├── eval_result/              # Output directory
│   ├── <model>_result.json   # Raw evaluation output (auto-generated)
│   └── evaluation_report.md  # Summary report (auto-generated)
└── .env.example              # Environment variable template
```
```bash
pip install requests opencv-python numpy tqdm
```

Copy the environment variable template and fill in your API credentials:

```bash
cp .env.example .env
```

Edit `.env`:

```bash
# API key for the evaluation model (GPT-4o or any OpenAI-compatible endpoint)
EVAL_API_KEY=your_key_here
# Optional: override the API base URL (defaults to https://api.openai.com)
EVAL_BASE_URL=https://api.openai.com
```
Export the variables before running:

```bash
export $(cat .env | xargs)
```

Place your generated videos under a subdirectory named after the model:

```
./
└── <ModelName>/
    ├── 1.mp4
    ├── 2.mp4
    └── ...
```
Video filenames must be the task index (e.g. `42.mp4`).
Edit the `model_names` list in `video_evaluator.py` (or pass via `EvalConfig`), then run:

```bash
python video_evaluator.py
```

Results are saved incrementally to `eval_result/<ModelName>_result.json`.
Re-running skips already-completed tasks, so the script is safe to interrupt and resume.
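The skip-and-resume behaviour boils down to a simple pattern: load the existing result file and filter out any task index that already has an entry. A minimal sketch of that pattern, assuming the output schema described below (the actual logic lives in `video_evaluator.py`; the helper names here are illustrative):

```python
import json
import os


def load_completed(result_path):
    """Return the set of task indices already present in a result file (illustrative)."""
    if not os.path.exists(result_path):
        return set()
    with open(result_path) as f:
        return {entry["index"] for entry in json.load(f)}


def pending_tasks(all_indices, result_path):
    """Keep only tasks that do not yet have a saved result."""
    done = load_completed(result_path)
    return [i for i in all_indices if i not in done]
```

Because completion is checked per task index, an interrupted run loses at most the in-flight requests, not the work already written to disk.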
| Field | Default | Description |
|---|---|---|
| `model` | `"gpt-4o"` | Judge model name |
| `max_workers` | `100` | Parallel threads |
| `num_frames` | `8` | Frames extracted per video |
| `retry_times` | `3` | API retry attempts per request |
| `test_mode` | `False` | Set `True` to process only `test_limit` videos |
| `test_limit` | `4` | Videos to process in test mode |
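If `EvalConfig` is an ordinary dataclass, overriding these defaults might look like the sketch below. The field names and defaults mirror the table above, but the dataclass shape itself is an assumption — check `video_evaluator.py` for the real definition:

```python
from dataclasses import dataclass


@dataclass
class EvalConfig:
    # Defaults mirror the configuration table; the real class lives in video_evaluator.py.
    model: str = "gpt-4o"
    max_workers: int = 100
    num_frames: int = 8
    retry_times: int = 3
    test_mode: bool = False
    test_limit: int = 4


# A quick smoke-test run: only 2 videos, fewer parallel threads.
config = EvalConfig(test_mode=True, test_limit=2, max_workers=8)
```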
Once evaluation results exist in `eval_result/`, run:

```bash
python calculate_scores.py
```

This reads all `*_result.json` files, computes normalised scores per dimension and category, and writes `eval_result/evaluation_report.md`.
For each task and dimension:

```
S      = Σ (score_i × weight_i × sign_i)
S_norm = max(0, S / Σ w_i⁺) × 100
```

where `score_i ∈ {0, 1}`, `sign_i ∈ {+1, −1}` (positive/penalty criterion), and `Σ w_i⁺` sums the weights of the positive criteria only.

The overall task score is the normalised combined score across all three dimensions.
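The normalisation can be written directly in Python. This sketch assumes per-criterion scores, weights, and signs arrive as parallel lists (`calculate_scores.py` may organise the rubric data differently):

```python
def dimension_score(scores, weights, signs):
    """Normalised dimension score in [0, 100].

    scores  -- 0/1 outcome per criterion
    weights -- positive weight per criterion
    signs   -- +1 for a positive criterion, -1 for a penalty
    """
    # S = sum of score_i * weight_i * sign_i
    s = sum(sc * w * sg for sc, w, sg in zip(scores, weights, signs))
    # Normalise by the total weight of positive criteria only.
    pos_total = sum(w for w, sg in zip(weights, signs) if sg > 0)
    if pos_total == 0:
        return 0.0
    return max(0.0, s / pos_total) * 100
```

Penalty criteria can only pull the score down (floored at 0); a video that passes every positive criterion and triggers no penalty scores exactly 100.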
| Dimension | Description |
|---|---|
| Scientific | Accuracy of the depicted biological/physical processes |
| Visual | Realism, clarity, and quality of the rendered visuals |
| Instruction | Adherence to the task prompt and detailed requirements |
| Category | Description |
|---|---|
| Organ-level | Organ-scale structures and physiological processes |
| Cellular-level | Cell-scale structures and interactions |
| Subcellular-level | Subcellular organelles and molecular-level processes |
`eval_result/<ModelName>_result.json` — array of objects, one per video:

```json
{
  "index": 42,
  "video_path": "./ModelName/42.mp4",
  "scientific_eval": {
    "raw_response": "...",
    "parsed_result": { "scores": [1, 0, 1], "reasoning": "..." }
  },
  "visual_eval": { ... },
  "instruction_eval": { ... }
}
```
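The raw file can also be post-processed without `calculate_scores.py`. As a usage sketch following the schema above, this computes the unweighted fraction of passed criteria for one dimension (the function name is illustrative, and this ignores the rubric weights and signs that the official scoring applies):

```python
import json


def mean_pass_rate(result_path, dimension="scientific_eval"):
    """Average fraction of criteria scored 1 for one dimension (illustrative, unweighted)."""
    with open(result_path) as f:
        results = json.load(f)
    rates = []
    for entry in results:
        scores = entry[dimension]["parsed_result"]["scores"]
        if scores:  # skip entries whose response failed to parse
            rates.append(sum(scores) / len(scores))
    return sum(rates) / len(rates) if rates else 0.0
```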