You cannot improve what you cannot measure. Production-grade RAG requires rigorous evaluation against ground-truth datasets. This list covers general-purpose benchmarks and domain-specific corpora.
Before choosing a model, check these live leaderboards:
- MTEB (Massive Text Embedding Benchmark)
- The gold standard for choosing an embedding model (Retrieval, Clustering, Reranking quality).
- OpenCompass
- Comprehensive LLM evaluation suite that includes retrieval capabilities.
- Hugging Face Open LLM Leaderboard
- General LLM performance.
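The retrieval scores these leaderboards report ultimately come down to ranking documents by the similarity of their embeddings to a query embedding, most often cosine similarity. A minimal sketch in pure Python (the toy 3-dimensional vectors stand in for a real embedding model's output):

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

def rank_documents(query_vec, doc_vecs):
    """Return document indices sorted by similarity to the query, best first."""
    scores = [(i, cosine(query_vec, v)) for i, v in enumerate(doc_vecs)]
    return [i for i, _ in sorted(scores, key=lambda p: p[1], reverse=True)]

# Hypothetical embeddings: doc 1 points almost the same way as the query.
query = [1.0, 0.0, 0.0]
docs = [[0.0, 1.0, 0.0], [0.9, 0.1, 0.0], [0.5, 0.5, 0.0]]
print(rank_documents(query, docs))  # [1, 2, 0]
```

Swapping in a higher-ranked MTEB model changes only how the vectors are produced; the ranking step stays the same.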
- MS MARCO - 1M+ queries.
- Large-scale passage-ranking dataset built from real Bing search queries.
- (Best For: Retrieval)
- HotpotQA - 113k pairs.
- Question answering requiring multi-hop reasoning.
- (Best For: Reasoning)
- Natural Questions (NQ) - 300k+.
- Real user queries issued to Google Search.
- (Best For: Realism)
- TriviaQA - 95k.
- Reading comprehension dataset containing triples of (question, answer, evidence).
- (Best For: Factuality)
- SQuAD 2.0 - Stanford Question Answering Dataset.
- Includes unanswerable questions.
- (Best For: Hallucination Detection)
- Qasper - Question answering over NLP papers.
- (Best For: Technical/Scientific)
- NarrativeQA - QA over collected stories (books and movie scripts).
- (Best For: Long Context)
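Whichever dataset you pick, retrieval quality on it is usually summarized with recall@k and mean reciprocal rank (MS MARCO, for instance, is conventionally reported as MRR@10). A minimal sketch of both metrics for a single query, with hypothetical document IDs:

```python
def recall_at_k(ranked_ids, relevant_ids, k):
    """Fraction of the relevant documents that appear in the top-k results."""
    hits = len(set(ranked_ids[:k]) & set(relevant_ids))
    return hits / len(relevant_ids)

def reciprocal_rank(ranked_ids, relevant_ids):
    """1 / rank of the first relevant document (0.0 if none is retrieved)."""
    for rank, doc_id in enumerate(ranked_ids, start=1):
        if doc_id in relevant_ids:
            return 1.0 / rank
    return 0.0

ranked = ["d3", "d1", "d7", "d2"]   # retriever output, best first
relevant = {"d1", "d2"}             # ground-truth labels from the dataset
print(recall_at_k(ranked, relevant, 2))   # 0.5 (one of two relevant docs in top 2)
print(reciprocal_rank(ranked, relevant))  # 0.5 (first hit at rank 2)
```

Averaging `reciprocal_rank` over every query in the benchmark gives the MRR figure the leaderboards report.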
Don't have a dataset? Generate one from your own internal documents.
- Ragas Synthetic Data Generator
- Create "Golden Datasets" (question-answer-context triples) automatically.
- LlamaIndex Data Generator
- Built-in utils to generate questions from your indexed nodes.
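Both tools emit the same basic unit: a question, the source context it was generated from, and a reference answer. A minimal sketch of that structure and a deliberately crude containment-based scorer — the class and function names here are illustrative, not the Ragas or LlamaIndex API:

```python
from dataclasses import dataclass

@dataclass
class GoldenExample:
    """One question-answer-context triple from a synthetic golden dataset."""
    question: str
    context: str       # passage the question was generated from
    ground_truth: str  # reference answer

def contains_answer(predicted: str, example: GoldenExample) -> bool:
    """Crude correctness check: does the prediction contain the reference answer?"""
    return example.ground_truth.lower() in predicted.lower()

dataset = [
    GoldenExample(
        question="What year was the service launched?",
        context="The service launched in 2019 and expanded in 2021.",
        ground_truth="2019",
    ),
]

# Score a (stubbed) RAG pipeline's answers against the golden dataset.
predictions = ["It was launched in 2019."]
accuracy = sum(
    contains_answer(p, ex) for p, ex in zip(predictions, dataset)
) / len(dataset)
print(accuracy)  # 1.0
```

In practice you would replace the containment check with Ragas metrics such as faithfulness and answer relevancy, which grade with an LLM rather than string matching.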