- [2026-01] MM-BRIGHT Launch: We release the MM-BRIGHT benchmark, dataset, and evaluation code!
- [2026-01] Code: Full evaluation code for all 4 tasks is released.
Existing retrieval benchmarks primarily consist of text-based queries for which keyword or semantic matching is usually sufficient. Yet many real-world queries contain multimodal elements, particularly images such as diagrams, charts, and screenshots, that require intensive reasoning to identify relevant documents.
MM-BRIGHT bridges this gap as the first multimodal benchmark for reasoning-intensive retrieval.
| Feature | MM-BRIGHT |
|---|---|
| Total Queries | 2,803 |
| Domains | 29 diverse technical domains |
| Total Documents | 2.5M+ |
| Retrieval Tasks | 4 (increasing multimodal complexity) |
| Image Types | Photos, Diagrams, Charts, Screenshots, Scientific Figures |
| Source | Real-world Stack Exchange Q&A |
MM-BRIGHT evaluates retrieval across four tasks of increasing multimodal complexity:
| Task | Query | Target | Description |
|---|---|---|---|
| Task 1 | Text | Text | Text-to-text retrieval (baseline) |
| Task 2 | Text + Image | Text | Multimodal query → text documents |
| Task 3 | Text + Image | Image | Multimodal query → relevant images |
| Task 4 | Text + Image | Text + Image | Multimodal query → multimodal documents |
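The four task configurations above differ only in which modalities the query and the retrieval target carry. A minimal sketch of that structure (the names here are illustrative, not the repo's actual code):

```python
from dataclasses import dataclass

# Hypothetical task descriptors -- field and variable names are ours,
# chosen to mirror the task table, not taken from the MM-BRIGHT codebase.
@dataclass(frozen=True)
class TaskSpec:
    task_id: int
    query_modalities: tuple
    target_modalities: tuple

TASKS = {
    1: TaskSpec(1, ("text",), ("text",)),
    2: TaskSpec(2, ("text", "image"), ("text",)),
    3: TaskSpec(3, ("text", "image"), ("image",)),
    4: TaskSpec(4, ("text", "image"), ("text", "image")),
}
```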
Text-only retrieval models (nDCG@10):

| Model | BM25 | Contriever | DiVeR | E5 | GritLM | OpenAI | Qwen2 | Rader | ReasonIR | SFR |
|---|---|---|---|---|---|---|---|---|---|---|
| Avg. | 8.5 | 20.1 | 32.2 | 25.3 | 25.3 | 28.8 | 28.1 | 24.9 | 28.6 | 26.9 |
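The BM25 baseline in the table is a classic lexical scorer. For reference, it can be sketched from scratch as below; this is an illustrative implementation, not the one the repo's `run_task1.py` actually uses:

```python
import math
from collections import Counter

def bm25_scores(query_terms, docs, k1=1.5, b=0.75):
    """Score each tokenized document against a query with Okapi BM25.

    Illustrative sketch only; parameter defaults k1=1.5, b=0.75 are
    common conventions, not values confirmed by the MM-BRIGHT code.
    """
    n = len(docs)
    avgdl = sum(len(d) for d in docs) / n  # average document length
    df = Counter()                         # document frequency per term
    for d in docs:
        df.update(set(d))
    scores = []
    for d in docs:
        tf = Counter(d)
        score = 0.0
        for t in query_terms:
            if t not in tf:
                continue
            idf = math.log((n - df[t] + 0.5) / (df[t] + 0.5) + 1)
            score += idf * tf[t] * (k1 + 1) / (
                tf[t] + k1 * (1 - b + b * len(d) / avgdl)
            )
        scores.append(score)
    return scores
```

Because BM25 only matches query terms lexically, it cannot exploit image content at all, which is consistent with its low average score above.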
Multimodal retrieval models (nDCG@10):

| Model | BGE-VL | CLIP | GME-2B | GME-7B | Jina-CLIP | Nomic | SigLIP |
|---|---|---|---|---|---|---|---|
| Avg. | 10.0 | 10.4 | 19.5 | 22.0 | 23.0 | 27.6 | 10.8 |
Finding: Even state-of-the-art models struggle on MM-BRIGHT. BM25 achieves only 8.5 nDCG@10, while the best multimodal model (Nomic-Vision: 27.6) actually underperforms the best text-only model (DiVeR: 32.2).
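All scores above are nDCG@10, which rewards placing relevant documents near the top of the ranked list. A self-contained reference implementation of the metric (standard definition, independent of the repo's evaluation code):

```python
import math

def dcg_at_k(relevances, k=10):
    """Discounted cumulative gain over the top-k ranked results."""
    return sum(rel / math.log2(rank + 2)
               for rank, rel in enumerate(relevances[:k]))

def ndcg_at_k(ranked_relevances, k=10):
    """nDCG@k: DCG of the actual ranking divided by DCG of the
    ideal (relevance-sorted) ranking; 1.0 means a perfect ranking."""
    ideal_dcg = dcg_at_k(sorted(ranked_relevances, reverse=True), k)
    if ideal_dcg == 0:
        return 0.0
    return dcg_at_k(ranked_relevances, k) / ideal_dcg
```

A perfect ranking scores 1.0; burying the relevant documents lower in the list discounts them logarithmically.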
STEM & Life Sciences (9 domains)
| Domain | Queries | Documents | Avg. Images/Query |
|---|---|---|---|
| Academia | 26 | 60,050 | 1.77 |
| Bioacoustics | 41 | 29,812 | 2.17 |
| Bioinformatics | 90 | 45,545 | 1.62 |
| Biology | 99 | 89,435 | 2.96 |
| Chemistry | 65 | 36,043 | 2.54 |
| Earth Science | 85 | 73,451 | 2.15 |
| Math | 45 | 151,867 | 2.64 |
| Medical Sciences | 55 | 240,844 | 1.85 |
| Physics | 100 | 338,291 | 2.45 |
Software & Technical Systems (8 domains)
| Domain | Queries | Documents | Avg. Images/Query |
|---|---|---|---|
| Apple | 14 | 29,285 | 2.14 |
| Ask Ubuntu | 35 | 90,198 | 2.09 |
| Bitcoin | 64 | 29,595 | 1.48 |
| Crypto | 74 | 24,054 | 1.50 |
| GIS | 44 | 20,705 | 2.98 |
| Quantum Computing | 88 | 127,009 | 1.84 |
| Robotics | 30 | 11,185 | 2.33 |
| Salesforce | 10 | 8,890 | 2.50 |
Social Sciences & Humanities (6 domains)
| Domain | Queries | Documents | Avg. Images/Query |
|---|---|---|---|
| Christianity | 30 | 37,875 | 1.47 |
| Economics | 31 | 18,431 | 1.84 |
| Islam | 27 | 14,079 | 1.33 |
| Law | 30 | 26,142 | 1.23 |
| Philosophy | 50 | 137,860 | 1.58 |
| Psychology | 87 | 328,520 | 1.67 |
Applied Domains (6 domains)
| Domain | Queries | Documents | Avg. Images/Query |
|---|---|---|---|
| Aviation | 125 | 203,938 | 2.41 |
| Gaming | 26 | 68,321 | 1.85 |
| PM | 50 | 93,376 | 1.56 |
| Quant | 34 | 64,044 | 1.38 |
| Sustainability | 62 | 32,365 | 1.61 |
| Travel | 68 | 68,063 | 1.84 |
```shell
git clone https://github.com/mm-bright/MM-BRIGHT.git
cd MM-BRIGHT
pip install -r requirements.txt
```

The dataset is automatically loaded from Hugging Face:
```python
from datasets import load_dataset

# Load documents
docs = load_dataset("mm-bright/MM-BRIGHT", "documents", split="academia")

# Load queries (Task 1/2)
queries = load_dataset("mm-bright/MM-BRIGHT", "examples", split="academia")

# Load multimodal queries (Task 3/4)
mm_queries = load_dataset("mm-bright/MM-BRIGHT", "examples_multimodal", split="academia")
```

Run the individual tasks:

```shell
# Task 1: Text → Text
python run_task1.py --dataset_dir . --model bm25 --domains academia biology chemistry

# Task 2: Text+Image → Text
python run_task2.py --dataset_dir . --model nomic-vision --domains academia biology

# Task 3: Text+Image → Image
python run_task3.py --dataset_dir . --model clip --domains academia biology

# Task 4: Text+Image → Text+Image
python run_task4.py --dataset_dir . --model clip --domains academia biology
```

Use the experiment runner to evaluate all models across all domains:
```shell
# Dry run - see all commands
python run_experiments.py --dry_run

# Execute all experiments
python run_experiments.py --dataset_dir .

# Run specific tasks only
python run_experiments.py --dataset_dir . --tasks 1 2
```

```
MM-BRIGHT/
├── run_task1.py           # Task 1: Text → Text
├── run_task2.py           # Task 2: Text+Image → Text
├── run_task3.py           # Task 3: Text+Image → Image
├── run_task4.py           # Task 4: Text+Image → Text+Image
├── run_experiments.py     # Batch experiment runner
├── src/
│   ├── data.py            # HuggingFace data loading
│   ├── caching.py         # Embedding cache management
│   ├── eval_runner.py     # Unified evaluation framework
│   ├── utils.py           # Shared utilities
│   ├── models/            # Custom model definitions
│   │   ├── gritlm7b.py
│   │   └── nvmmembed.py
│   └── retrievers/        # Task-specific retrievers
│       ├── task1_text.py
│       ├── task2_multimodal.py
│       ├── task3_image.py
│       └── task4_pair.py
└── outputs/               # Evaluation results
```
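The layout above includes an embedding cache (`src/caching.py`), which avoids re-embedding the same document for repeated runs. A minimal sketch of such a disk cache, as a guess at the general pattern; the repo's actual interface may differ:

```python
import hashlib
import os
import pickle

class EmbeddingCache:
    """Minimal disk cache keyed by (model name, text).

    Hypothetical sketch of an embedding cache -- class and method
    names are ours, not taken from src/caching.py.
    """
    def __init__(self, cache_dir):
        self.cache_dir = cache_dir
        os.makedirs(cache_dir, exist_ok=True)

    def _path(self, model_name, text):
        # Hash the key so arbitrary text maps to a safe filename.
        key = hashlib.sha256(f"{model_name}\x00{text}".encode()).hexdigest()
        return os.path.join(self.cache_dir, key + ".pkl")

    def get(self, model_name, text):
        path = self._path(model_name, text)
        if os.path.exists(path):
            with open(path, "rb") as f:
                return pickle.load(f)
        return None

    def put(self, model_name, text, embedding):
        with open(self._path(model_name, text), "wb") as f:
            pickle.dump(embedding, f)
```

With 2.5M+ documents and a dozen models, caching embeddings to disk like this keeps re-runs from paying the full encoding cost each time.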
| Benchmark | #Queries | #Domains | Modality | Reasoning | Multi-Task |
|---|---|---|---|---|---|
| BRIGHT | 1,384 | 12 | Text | ✓ | ✗ |
| RAR-b | 45,745 | 17 | Text | ✓ | ✗ |
| WebQA | 7,540 | Open | IT → IT | ✗ | ✗ |
| UNIIR | 190K | 10 | Mixed | ✗ | ✓ |
| ViDoRe | 3,810 | 10 | T → IT | ✗ | ✗ |
| MMEB | 36K | 36 | Mixed | ✗ | ✓ |
| MM-BRIGHT (Ours) | 2,803 | 29 | Mixed | ✓ | ✓ |
If you use MM-BRIGHT in your work, please cite our paper:
Citation coming soon.

This project is licensed under CC-BY-4.0.
MM-BRIGHT is built on top of the excellent BRIGHT benchmark and extends it to the multimodal domain. We thank the Stack Exchange community for providing the raw data that makes this benchmark possible.
