# MM-BRIGHT: A Multi-Task Multimodal Benchmark for Reasoning-Intensive Retrieval


*Figure: Overview of MM-BRIGHT tasks.*

## 🚨 News

- **[2026-01] 🚀 MM-BRIGHT Launch:** We release the MM-BRIGHT benchmark, dataset, and evaluation code!
- **[2026-01] 🛠️ Code:** Full evaluation code for all 4 tasks is released.

## 📖 Overview

Existing retrieval benchmarks consist primarily of text-based queries for which keyword or semantic matching is usually sufficient. However, many real-world queries contain multimodal elements, particularly images such as diagrams, charts, and screenshots, that require intensive reasoning to identify relevant documents.

MM-BRIGHT bridges this gap as the first multimodal benchmark for reasoning-intensive retrieval.

### Key Features

| Feature | MM-BRIGHT |
| --- | --- |
| Total Queries | 2,803 |
| Domains | 29 diverse technical domains |
| Total Documents | 2.5M+ |
| Retrieval Tasks | 4 (increasing multimodal complexity) |
| Image Types | Photos, Diagrams, Charts, Screenshots, Scientific Figures |
| Source | Real-world Stack Exchange Q&A |

### Four Retrieval Tasks

MM-BRIGHT evaluates retrieval across four tasks of increasing multimodal complexity:

| Task | Query | Target | Description |
| --- | --- | --- | --- |
| Task 1 | Text | Text | Text-to-text retrieval (baseline) |
| Task 2 | Text + Image | Text | Multimodal query → text documents |
| Task 3 | Text + Image | Image | Multimodal query → relevant images |
| Task 4 | Text + Image | Text + Image | Multimodal query → multimodal documents |

πŸ† Leaderboard

### Task 1: Text-to-Text Retrieval (nDCG@10)

| Model | BM25 | Contriever | DiVeR | E5 | GritLM | OpenAI | Qwen2 | Rader | ReasonIR | SFR |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Avg. | 8.5 | 20.1 | 32.2 | 25.3 | 25.3 | 28.8 | 28.1 | 24.9 | 28.6 | 26.9 |

### Task 2: Multimodal-to-Text Retrieval (nDCG@10)

| Model | BGE-VL | CLIP | GME-2B | GME-7B | Jina-CLIP | Nomic | SigLIP |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Avg. | 10.0 | 10.4 | 19.5 | 22.0 | 23.0 | 27.6 | 10.8 |

**Finding:** Even state-of-the-art models struggle on MM-BRIGHT. BM25 achieves only 8.5 nDCG@10, while the best multimodal model (Nomic-Vision: 27.6) actually underperforms the best text-only model (DiVeR: 32.2).
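All leaderboard numbers are nDCG@10. As a reference sketch only (this uses the common linear-gain DCG variant; the repository's own scorer may differ in details such as gain function or tie handling), the metric can be computed from a ranked list of relevance labels:

```python
import math

def dcg_at_k(relevances, k=10):
    """Discounted cumulative gain over the top-k ranked results."""
    # Ranks are 0-indexed, so the discount is log2(rank + 2).
    return sum(rel / math.log2(rank + 2)
               for rank, rel in enumerate(relevances[:k]))

def ndcg_at_k(ranked_relevances, k=10):
    """nDCG@k: DCG of the ranking divided by DCG of the ideal ranking."""
    ideal_dcg = dcg_at_k(sorted(ranked_relevances, reverse=True), k)
    return dcg_at_k(ranked_relevances, k) / ideal_dcg if ideal_dcg > 0 else 0.0

# A perfect ordering scores 1.0; pushing relevant docs down lowers the score.
print(ndcg_at_k([1, 1, 0, 0]))  # -> 1.0
print(ndcg_at_k([0, 0, 1, 1]))  # < 1.0
```

Scores are averaged over all queries in a domain, then over domains, to produce the "Avg." rows above.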


## 📊 Dataset Statistics

### Domains by Category

#### STEM & Life Sciences (9 domains)

| Domain | Queries | Documents | Avg. Images/Query |
| --- | --- | --- | --- |
| Academia | 26 | 60,050 | 1.77 |
| Bioacoustics | 41 | 29,812 | 2.17 |
| Bioinformatics | 90 | 45,545 | 1.62 |
| Biology | 99 | 89,435 | 2.96 |
| Chemistry | 65 | 36,043 | 2.54 |
| Earth Science | 85 | 73,451 | 2.15 |
| Math | 45 | 151,867 | 2.64 |
| Medical Sciences | 55 | 240,844 | 1.85 |
| Physics | 100 | 338,291 | 2.45 |

#### Software & Technical Systems (8 domains)

| Domain | Queries | Documents | Avg. Images/Query |
| --- | --- | --- | --- |
| Apple | 14 | 29,285 | 2.14 |
| Ask Ubuntu | 35 | 90,198 | 2.09 |
| Bitcoin | 64 | 29,595 | 1.48 |
| Crypto | 74 | 24,054 | 1.50 |
| GIS | 44 | 20,705 | 2.98 |
| Quantum Computing | 88 | 127,009 | 1.84 |
| Robotics | 30 | 11,185 | 2.33 |
| Salesforce | 10 | 8,890 | 2.50 |

#### Social Sciences & Humanities (6 domains)

| Domain | Queries | Documents | Avg. Images/Query |
| --- | --- | --- | --- |
| Christianity | 30 | 37,875 | 1.47 |
| Economics | 31 | 18,431 | 1.84 |
| Islam | 27 | 14,079 | 1.33 |
| Law | 30 | 26,142 | 1.23 |
| Philosophy | 50 | 137,860 | 1.58 |
| Psychology | 87 | 328,520 | 1.67 |

#### Applied Domains (6 domains)

| Domain | Queries | Documents | Avg. Images/Query |
| --- | --- | --- | --- |
| Aviation | 125 | 203,938 | 2.41 |
| Gaming | 26 | 68,321 | 1.85 |
| PM | 50 | 93,376 | 1.56 |
| Quant | 34 | 64,044 | 1.38 |
| Sustainability | 62 | 32,365 | 1.61 |
| Travel | 68 | 68,063 | 1.84 |

βš™οΈ Setup & Installation

### 1. Clone and Install

```shell
git clone https://github.com/mm-bright/MM-BRIGHT.git
cd MM-BRIGHT
pip install -r requirements.txt
```

### 2. Dataset Access

The dataset is automatically loaded from Hugging Face:

```python
from datasets import load_dataset

# Load documents
docs = load_dataset("mm-bright/MM-BRIGHT", "documents", split="academia")

# Load queries (Task 1/2)
queries = load_dataset("mm-bright/MM-BRIGHT", "examples", split="academia")

# Load multimodal queries (Task 3/4)
mm_queries = load_dataset("mm-bright/MM-BRIGHT", "examples_multimodal", split="academia")
```
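Apart from BM25, the leaderboard baselines are dense retrievers: queries and documents are embedded, and retrieval is a nearest-neighbour search over those embeddings. A minimal pure-Python sketch of that final scoring step (the toy 3-dimensional vectors stand in for real encoder outputs and are purely illustrative, not part of the dataset):

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

def top_k(query_vec, doc_vecs, k=10):
    """Return (doc_index, score) pairs for the k most similar documents."""
    scored = [(i, cosine(query_vec, v)) for i, v in enumerate(doc_vecs)]
    return sorted(scored, key=lambda p: p[1], reverse=True)[:k]

# Toy example: doc 0 matches the query exactly, doc 2 is close, doc 1 is not.
docs = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.9, 0.1, 0.0]]
print(top_k([1.0, 0.0, 0.0], docs, k=2))  # doc 0 ranked first, then doc 2
```

In a real run the vectors would come from one of the leaderboard encoders, and the ranked indices would be scored against the relevance judgments with nDCG@10.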

## 🚀 Running Evaluations

### Task 1: Text-to-Text Retrieval

```shell
python run_task1.py --dataset_dir . --model bm25 --domains academia biology chemistry
```

### Task 2: Multimodal Query → Text Documents

```shell
python run_task2.py --dataset_dir . --model nomic-vision --domains academia biology
```

### Task 3: Multimodal Query → Images

```shell
python run_task3.py --dataset_dir . --model clip --domains academia biology
```

### Task 4: Multimodal Query → Multimodal Documents

```shell
python run_task4.py --dataset_dir . --model clip --domains academia biology
```

### Run All Experiments

Use the experiment runner to evaluate all models across all domains:

```shell
# Dry run - see all commands
python run_experiments.py --dry_run

# Execute all experiments
python run_experiments.py --dataset_dir .

# Run specific tasks only
python run_experiments.py --dataset_dir . --tasks 1 2
```

πŸ“ Project Structure

```text
MM-BRIGHT/
├── run_task1.py          # Task 1: Text → Text
├── run_task2.py          # Task 2: Text+Image → Text
├── run_task3.py          # Task 3: Text+Image → Image
├── run_task4.py          # Task 4: Text+Image → Text+Image
├── run_experiments.py    # Batch experiment runner
├── src/
│   ├── data.py           # HuggingFace data loading
│   ├── caching.py        # Embedding cache management
│   ├── eval_runner.py    # Unified evaluation framework
│   ├── utils.py          # Shared utilities
│   ├── models/           # Custom model definitions
│   │   ├── gritlm7b.py
│   │   └── nvmmembed.py
│   └── retrievers/       # Task-specific retrievers
│       ├── task1_text.py
│       ├── task2_multimodal.py
│       ├── task3_image.py
│       └── task4_pair.py
└── outputs/              # Evaluation results
```

## 📊 Benchmark Comparison

| Benchmark | #Queries | #Domains | Modality | Reasoning | Multi-Task |
| --- | --- | --- | --- | --- | --- |
| BRIGHT | 1,384 | 12 | Text | ✅ | ✅ |
| RAR-b | 45,745 | 17 | Text | ✅ | ❌ |
| WebQA | 7,540 | Open | IT → IT | ❌ | ❌ |
| UNIIR | 190K | 10 | Mixed | ❌ | ✅ |
| ViDoRe | 3,810 | 10 | T → IT | ❌ | ❌ |
| MMEB | 36K | 36 | Mixed | ❌ | ✅ |
| **MM-BRIGHT (Ours)** | 2,803 | 29 | Mixed | ✅ | ✅ |

πŸ“ Citation

If you use MM-BRIGHT in your work, please cite our paper:

Citation coming soon.

## 📄 License

This project is licensed under CC-BY-4.0.


πŸ™ Acknowledgments

MM-BRIGHT is built on top of the excellent BRIGHT benchmark and extends it to the multimodal domain. We thank the Stack Exchange community for providing the raw data that makes this benchmark possible.
