Exams are a fundamental test of expert-level intelligence and require integrated understanding, reasoning, and generation. Existing exam-style benchmarks mainly focus on understanding and reasoning tasks, and current generation benchmarks emphasize the illustration of world knowledge and visual concepts, neglecting the evaluation of rigorous drawing exams.
We introduce GenExam, the first benchmark for multidisciplinary text-to-image exams, featuring 1,000 samples across 10 subjects with exam-style prompts organized under a four-level taxonomy. Each problem is equipped with ground-truth images and fine-grained scoring points to enable a precise evaluation of semantic correctness and visual plausibility.
Experiments show that even state-of-the-art models such as GPT-Image-1 and Gemini-2.5-Flash-Image achieve strict scores below 15%, and most models score almost 0%, which underscores how challenging the benchmark is. By framing image generation as an exam, GenExam offers a rigorous assessment of models' ability to integrate knowledge, reasoning, and generation, providing insights on the path to AGI.

Strict score (%) of each model by subject:

Model | Math | Phy | Chem | Bio | Geo | Comp | Eng | Econ | Music | Hist | Overall |
---|---|---|---|---|---|---|---|---|---|---|---|
Closed-source Models | |||||||||||
GPT-Image-1 | 8.0 | 13.2 | 13.5 | 22.8 | 15.9 | 10.3 | 13.1 | 13.0 | 9.3 | 2.4 | 12.1 |
Seedream 4.0 | 2.6 | 3.5 | 5.9 | 18.6 | 10.6 | 6.9 | 11.7 | 5.2 | 0.0 | 7.3 | 7.2 |
Imagen-4-Ultra | 2.6 | 9.7 | 9.3 | 14.7 | 7.6 | 2.9 | 12.6 | 9.1 | 0.0 | 0.0 | 6.9 |
Gemini-2.5-Flash-Image | 0.7 | 7.1 | 4.2 | 5.1 | 4.5 | 4.9 | 10.0 | 1.3 | 1.5 | 0.0 | 3.9 |
Seedream 3.0 | 0.7 | 0.0 | 0.8 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.2 |
FLUX.1 Kontext max | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
Open-source T2I Models | |||||||||||
Qwen-Image | 0.0 | 0.0 | 0.0 | 0.0 | 3.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.3 |
HiDream-I1-Full | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
FLUX.1 dev | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
FLUX.1 Krea | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
Stable Diffusion 3.5 Large | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
Open-source Unified MLLMs | |||||||||||
BAGEL (thinking) | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
BAGEL | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
Show-o2-7B | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
Show-o2-1.5B-HQ | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
BLIP3o-NEXT-GRPO-Text-3 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
BLIP3o-8B | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
Janus-Pro | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
Emu3 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |

Relaxed score (%) of each model by subject:

Model | Math | Phy | Chem | Bio | Geo | Comp | Eng | Econ | Music | Hist | Overall |
---|---|---|---|---|---|---|---|---|---|---|---|
Closed-source Models | |||||||||||
GPT-Image-1 | 52.0 | 66.4 | 53.4 | 74.6 | 73.9 | 55.6 | 65.5 | 65.8 | 52.6 | 67.4 | 62.6 |
Seedream 4.0 | 39.8 | 49.0 | 46.1 | 71.0 | 65.1 | 52.2 | 60.0 | 56.0 | 34.5 | 56.7 | 53.0 |
Imagen-4-Ultra | 35.9 | 57.4 | 44.5 | 68.1 | 66.9 | 40.1 | 65.6 | 59.7 | 38.4 | 57.8 | 53.4 |
Gemini-2.5-Flash-Image | 43.1 | 60.9 | 45.3 | 72.6 | 70.2 | 47.4 | 65.8 | 59.8 | 37.0 | 57.1 | 55.9 |
Seedream 3.0 | 18.6 | 21.5 | 18.3 | 32.2 | 38.2 | 15.3 | 26.5 | 12.5 | 21.6 | 29.2 | 23.4 |
FLUX.1 Kontext max | 23.5 | 25.6 | 19.2 | 38.3 | 47.5 | 20.9 | 28.9 | 22.3 | 25.4 | 33.5 | 28.5 |
Open-source T2I Models | |||||||||||
Qwen-Image | 18.9 | 26.3 | 15.3 | 32.1 | 49.6 | 18.9 | 32.0 | 20.3 | 23.4 | 38.6 | 27.5 |
HiDream-I1-Full | 16.7 | 17.7 | 13.5 | 27.3 | 36.2 | 15.4 | 24.4 | 18.8 | 21.3 | 31.8 | 22.3 |
FLUX.1 dev | 12.2 | 14.4 | 12.5 | 22.8 | 36.4 | 11.0 | 14.0 | 9.2 | 21.3 | 21.7 | 17.6 |
FLUX.1 Krea | 7.0 | 14.0 | 8.5 | 26.5 | 38.4 | 8.4 | 15.4 | 11.1 | 16.8 | 17.4 | 16.4 |
Stable Diffusion 3.5 Large | 12.2 | 13.2 | 10.7 | 21.8 | 38.8 | 6.6 | 16.3 | 8.0 | 24.1 | 18.0 | 17.0 |
Open-source Unified MLLMs | |||||||||||
BAGEL (thinking) | 11.7 | 13.8 | 11.9 | 15.2 | 28.5 | 6.2 | 10.7 | 6.3 | 14.7 | 16.0 | 13.5 |
BAGEL | 14.7 | 10.6 | 7.9 | 10.8 | 24.5 | 6.8 | 10.2 | 5.3 | 13.7 | 14.4 | 11.9 |
Show-o2-7B | 10.8 | 11.9 | 4.8 | 12.8 | 33.3 | 4.7 | 11.8 | 7.0 | 8.8 | 14.5 | 12.0 |
Show-o2-1.5B-HQ | 7.3 | 7.5 | 6.2 | 15.0 | 25.3 | 4.3 | 9.3 | 7.3 | 7.6 | 19.8 | 11.0 |
BLIP3o-NEXT-GRPO-Text-3 | 15.5 | 10.5 | 9.2 | 15.5 | 23.7 | 8.2 | 10.1 | 8.1 | 15.2 | 10.2 | 12.6 |
BLIP3o-8B | 6.4 | 5.5 | 4.7 | 7.0 | 16.7 | 3.6 | 8.4 | 2.5 | 6.0 | 11.2 | 7.2 |
Janus-Pro | 13.7 | 8.8 | 8.2 | 7.2 | 18.8 | 3.9 | 10.5 | 4.2 | 14.5 | 6.6 | 9.6 |
Emu3 | 11.3 | 0.6 | 0.6 | 5.6 | 34.6 | 5.1 | 16.5 | 1.9 | 5.8 | 6.2 | 8.8 |

Our data is stored in `data/`. You can also download it from Hugging Face. Additionally, images organized by taxonomy can be found here.
- Install requirements:
  pip install requests tqdm pillow
- Set `openai_api_key` and `openai_base_url` (optional, if you want to use a proxy) in `run_eval.py` for the gpt-5-20250807 evaluator and for inference with gpt-image-1.
- Generate the images offline with your model based on the `prompt` values in `data/annotations/All_Subjects.jsonl`. Save paths should look like `gen_imgs/{id}.png`.
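The offline generation step can be sketched as below. This is only an illustration under stated assumptions: it assumes each line of the annotation file is a JSON object with `id` and `prompt` keys, and `generate_png` is a hypothetical callable that you supply, taking a prompt string and returning PNG-encoded bytes from your own model.

```python
import json
from pathlib import Path
from typing import Callable


def generate_all(annotations_path: str, out_dir: str,
                 generate_png: Callable[[str], bytes]) -> int:
    """Generate one image per annotation line and return how many were written."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    written = 0
    with open(annotations_path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line:
                continue  # tolerate blank lines
            sample = json.loads(line)  # assumed keys: "id" and "prompt"
            target = out / f"{sample['id']}.png"
            if target.exists():
                continue  # already generated in an earlier run
            # generate_png is a placeholder for your model's inference call
            target.write_bytes(generate_png(sample["prompt"]))
            written += 1
    return written
```

A call such as `generate_all("data/annotations/All_Subjects.jsonl", "gen_imgs", my_model_fn)` would then fill `gen_imgs/` with files named `{id}.png`, where `my_model_fn` stands in for your own inference code.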
Run evaluation offline if the images are already generated in `gen_imgs/`:
python run_eval.py --data_dir ./data/ --img_save_dir ./gen_imgs --eval_save_dir ./eval_results
The evaluation results are saved as a separate JSON file for each sample under `./eval_results`.
The `run_eval.py` script supports resuming from breakpoints. If your evaluation encounters an error midway, simply re-run the script.
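To see what remains after an interrupted run, a check along the following lines may help. It is a sketch of the general resume pattern, not the logic inside `run_eval.py`: the `{id}.json` naming of result files is an assumption made for illustration, so adjust it to whatever the script actually writes.

```python
from pathlib import Path
from typing import Iterable, List


def pending_ids(sample_ids: Iterable[str], eval_save_dir: str) -> List[str]:
    """Return the ids that have no result file yet, keeping the input order."""
    root = Path(eval_save_dir)
    # Assumption: one result file per sample, named "<id>.json"
    done = {p.stem for p in root.glob("*.json")} if root.is_dir() else set()
    return [sid for sid in sample_ids if sid not in done]
```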
Alternatively, you can add `--run_inference` to run inference and evaluation together (generating images online):
python run_eval.py --run_inference --data_dir ./data/ --img_save_dir ./gen_imgs --eval_save_dir ./eval_results
This script runs gpt-image-1 by default, which costs $185 on the full set ($160 for inference and $25 for evaluation). You can replace the `inference_function` in the script with a custom function for your model's inference.
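For budgeting a partial run, the figures above can be scaled down. The sketch below assumes cost grows linearly with the number of samples, which is a simplification; actual API pricing may differ per prompt and over time.

```python
# Figures quoted above for the full set of 1,000 samples.
FULL_SET_SIZE = 1000
INFERENCE_COST_USD = 160.0
EVALUATION_COST_USD = 25.0


def estimate_cost(num_samples: int, run_inference: bool = True) -> float:
    """Rough cost in USD, assuming cost scales linearly with sample count."""
    full_cost = EVALUATION_COST_USD
    if run_inference:
        full_cost += INFERENCE_COST_USD
    return round(full_cost * num_samples / FULL_SET_SIZE, 2)


print(estimate_cost(1000))                       # 185.0 (inference + evaluation)
print(estimate_cost(1000, run_inference=False))  # 25.0 (evaluation only)
print(estimate_cost(100))                        # 18.5
```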
Run `cal_score.py` to generate a detailed report of the evaluation results:
python cal_score.py --eval_results_dir ./eval_results
This should give a report like:
Report Example
================================================================================
Each score dimension:
- semantic_correctness: 0.47
- spelling: 1.48
- readability: 1.55
- logical_consistency: 0.7
================================================================================
Each score dimension (average) for each subject:
- Computer_Science:
semantic_correctness: 0.53
spelling: 1.68
readability: 1.43
logical_consistency: 0.66
- Physics:
semantic_correctness: 0.4
spelling: 1.7
readability: 1.41
logical_consistency: 0.5
- Biology:
semantic_correctness: 0.72
spelling: 1.28
readability: 1.59
logical_consistency: 1.02
- History:
semantic_correctness: 0.53
spelling: 1.32
readability: 1.68
logical_consistency: 0.85
- Math:
semantic_correctness: 0.24
spelling: 1.5
readability: 1.65
logical_consistency: 0.29
- Geography:
semantic_correctness: 0.62
spelling: 1.27
readability: 1.69
logical_consistency: 0.98
- Economics:
semantic_correctness: 0.56
spelling: 1.77
readability: 1.58
logical_consistency: 0.75
- Chemistry:
semantic_correctness: 0.33
spelling: 1.33
readability: 1.52
logical_consistency: 0.6
- Music:
semantic_correctness: 0.26
spelling: 1.42
readability: 1.5
logical_consistency: 0.46
- Engineering:
semantic_correctness: 0.56
spelling: 1.49
readability: 1.43
logical_consistency: 0.94
--------------------------------------------------------------------------------
Total number of eval results: 487
--------------------------------------------------------------------------------
Strict score:
- Computer_Science(47 samples): 10.2% - Physics(46 samples): 3.5% - Biology(46 samples): 12.2% - History(41 samples): 5.9% - Math(52 samples): 0.0% - Geography(52 samples): 7.7% - Economics(52 samples): 3.1% - Chemistry(52 samples): 4.6% - Music(52 samples): 0.0% - Engineering(47 samples): 6.8%
Average strict score: 5.4%
--------------------------------------------------------------------------------
Relaxed score:
- Computer_Science(47 samples): 44.8% - Physics(46 samples): 36.9% - Biology(46 samples): 56.1% - History(41 samples): 45.4% - Math(52 samples): 27.2% - Geography(52 samples): 50.7% - Economics(52 samples): 47.6% - Chemistry(52 samples): 32.4% - Music(52 samples): 27.8% - Engineering(47 samples): 47.0%
Average relaxed score: 41.6%
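In the example report above, the two averages match the unweighted mean of the ten per-subject scores rather than a mean weighted by sample count. The snippet below reproduces that arithmetic from the printed numbers; it is a consistency check on this example only, not a description of how `cal_score.py` is implemented.

```python
# Per-subject scores copied from the example report above (in %).
strict = {
    "Computer_Science": 10.2, "Physics": 3.5, "Biology": 12.2, "History": 5.9,
    "Math": 0.0, "Geography": 7.7, "Economics": 3.1, "Chemistry": 4.6,
    "Music": 0.0, "Engineering": 6.8,
}
relaxed = {
    "Computer_Science": 44.8, "Physics": 36.9, "Biology": 56.1, "History": 45.4,
    "Math": 27.2, "Geography": 50.7, "Economics": 47.6, "Chemistry": 32.4,
    "Music": 27.8, "Engineering": 47.0,
}


def macro_average(scores: dict) -> float:
    """Unweighted mean over subjects, rounded to one decimal place."""
    return round(sum(scores.values()) / len(scores), 1)


print(macro_average(strict))   # 5.4
print(macro_average(relaxed))  # 41.6
```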
To run evaluation on the mini subset, add the `--mini` argument when running `run_eval.py`:
python run_eval.py --mini --data_dir ./data/ --img_save_dir ./gen_imgs --eval_save_dir ./eval_results
If you have already run evaluation on the full set, you can alternatively add `--mini` when running `cal_score.py`:
python cal_score.py --mini --eval_results_dir ./eval_results
To speed up evaluation, split it into two parts by running the following two commands simultaneously with `--start_index` and `--end_index`:
# in window 1
python run_eval.py --start_index 0 --end_index 500 --data_dir ./data/ --img_save_dir ./gen_imgs --eval_save_dir ./eval_results
# in window 2
python run_eval.py --start_index 500 --end_index 1000 --data_dir ./data/ --img_save_dir ./gen_imgs --eval_save_dir ./eval_results
You can split the evaluation into more parts for further speed-up.
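A split into more parts could be prepared with a small helper like the one below. It is a sketch that only uses the flags shown above, and it assumes `--end_index` is exclusive, as the 0 to 500 and 500 to 1000 example suggests; `split_ranges` and `build_commands` are illustrative names, not part of the repository.

```python
from typing import List, Tuple


def split_ranges(total: int, parts: int) -> List[Tuple[int, int]]:
    """Split [0, total) into contiguous half-open ranges of near-equal size."""
    if parts < 1:
        raise ValueError("parts must be at least 1")
    base, extra = divmod(total, parts)
    ranges = []
    start = 0
    for i in range(parts):
        end = start + base + (1 if i < extra else 0)  # spread the remainder
        ranges.append((start, end))
        start = end
    return ranges


def build_commands(total: int, parts: int) -> List[List[str]]:
    """Build one run_eval.py command line per index range."""
    return [
        ["python", "run_eval.py",
         "--start_index", str(s), "--end_index", str(e),
         "--data_dir", "./data/",
         "--img_save_dir", "./gen_imgs",
         "--eval_save_dir", "./eval_results"]
        for s, e in split_ranges(total, parts)
    ]


# To launch the parts in parallel (not executed here):
# import subprocess
# procs = [subprocess.Popen(cmd) for cmd in build_commands(1000, 4)]
# for p in procs:
#     p.wait()
```

With `total=1000` and `parts=2`, this reproduces the two index ranges used in the commands above.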
For more examples, please refer to the appendix in our paper.
This project is released under the MIT license.
If you find our work helpful, please consider giving us a ⭐ and citing our paper:
@article{GenExam,
title={GenExam: A Multidisciplinary Text-to-Image Exam},
author = {Wang, Zhaokai and Yin, Penghao and Zhao, Xiangyu and Tian, Changyao and Qiao, Yu and Wang, Wenhai and Dai, Jifeng and Luo, Gen},
journal={arXiv preprint arXiv:2509.14232},
year={2025}
}