You cannot improve what you cannot measure. Production-grade RAG requires rigorous evaluation against ground-truth datasets. This list covers general-purpose benchmarks and domain-specific corpora.
Before choosing a model, check these live leaderboards:
- MTEB (Massive Text Embedding Benchmark)
- The gold standard for choosing an embedding model (Retrieval, Clustering, Reranking quality).
- OpenCompass
- Comprehensive LLM evaluation suite that includes retrieval capabilities.
- Hugging Face Open LLM Leaderboard
- General LLM performance.
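The retrieval scores these leaderboards report ultimately come down to ranking documents by the similarity of their embeddings to a query embedding, most often cosine similarity. A minimal sketch in pure Python (the toy 3-dimensional vectors stand in for a real embedding model's output):

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

def rank_documents(query_vec, doc_vecs):
    """Return document indices sorted by similarity to the query, best first."""
    scores = [(i, cosine(query_vec, v)) for i, v in enumerate(doc_vecs)]
    return [i for i, _ in sorted(scores, key=lambda p: p[1], reverse=True)]

# Hypothetical embeddings: doc 1 points almost the same way as the query.
query = [1.0, 0.0, 0.0]
docs = [[0.0, 1.0, 0.0], [0.9, 0.1, 0.0], [0.5, 0.5, 0.0]]
print(rank_documents(query, docs))  # [1, 2, 0]
```

Swapping in a higher-ranked MTEB model changes only how the vectors are produced; the ranking step stays the same.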
- MS MARCO - 1M+ queries.
- Large-scale passage-ranking dataset built from real Bing search queries.
- (Best For: Retrieval)
- HotpotQA - 113k pairs.
- Question answering requiring multi-hop reasoning.
- (Best For: Reasoning)
- Natural Questions (NQ) - 300k+.
- Real user queries issued to Google Search.
- (Best For: Realism)
- TriviaQA - 95k.
- Reading comprehension dataset containing triples of (question, answer, evidence).
- (Best For: Factuality)
- SQuAD 2.0 - Stanford Question Answering Dataset.
- Includes unanswerable questions.
- (Best For: Hallucination Detection)
- Qasper - Question answering over NLP papers.
- (Best For: Technical/Scientific)
- NarrativeQA - QA over collected stories (books and movie scripts).
- (Best For: Long Context)
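Whichever dataset you pick, retrieval quality on it is usually summarized with recall@k and mean reciprocal rank (MS MARCO, for instance, is conventionally reported as MRR@10). A minimal sketch of both metrics for a single query, with hypothetical document IDs:

```python
def recall_at_k(ranked_ids, relevant_ids, k):
    """Fraction of the relevant documents that appear in the top-k results."""
    hits = len(set(ranked_ids[:k]) & set(relevant_ids))
    return hits / len(relevant_ids)

def reciprocal_rank(ranked_ids, relevant_ids):
    """1 / rank of the first relevant document (0.0 if none is retrieved)."""
    for rank, doc_id in enumerate(ranked_ids, start=1):
        if doc_id in relevant_ids:
            return 1.0 / rank
    return 0.0

ranked = ["d3", "d1", "d7", "d2"]   # retriever output, best first
relevant = {"d1", "d2"}             # ground-truth labels from the dataset
print(recall_at_k(ranked, relevant, 2))   # 0.5 (one of two relevant docs in top 2)
print(reciprocal_rank(ranked, relevant))  # 0.5 (first hit at rank 2)
```

Averaging `reciprocal_rank` over every query in the benchmark gives the MRR figure the leaderboards report.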
Don't have a dataset? Generate one from your own internal documents.
- Ragas Synthetic Data Generator
- Create "Golden Datasets" (question-answer-context triples) automatically.
- LlamaIndex Data Generator
- Built-in utils to generate questions from your indexed nodes.
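Both tools emit the same basic unit: a question, the source context it was generated from, and a reference answer. A minimal sketch of that structure and a deliberately crude containment-based scorer — the class and function names here are illustrative, not the Ragas or LlamaIndex API:

```python
from dataclasses import dataclass

@dataclass
class GoldenExample:
    """One question-answer-context triple from a synthetic golden dataset."""
    question: str
    context: str       # passage the question was generated from
    ground_truth: str  # reference answer

def contains_answer(predicted: str, example: GoldenExample) -> bool:
    """Crude correctness check: does the prediction contain the reference answer?"""
    return example.ground_truth.lower() in predicted.lower()

dataset = [
    GoldenExample(
        question="What year was the service launched?",
        context="The service launched in 2019 and expanded in 2021.",
        ground_truth="2019",
    ),
]

# Score a (stubbed) RAG pipeline's answers against the golden dataset.
predictions = ["It was launched in 2019."]
accuracy = sum(
    contains_answer(p, ex) for p, ex in zip(predictions, dataset)
) / len(dataset)
print(accuracy)  # 1.0
```

In practice you would replace the containment check with Ragas metrics such as faithfulness and answer relevancy, which grade with an LLM rather than string matching.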