
💾 Datasets for RAG Benchmarking

You cannot improve what you cannot measure. Production-grade RAG requires rigorous evaluation against ground-truth datasets. This list covers general-purpose benchmarks and domain-specific corpora.
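Retrieval evaluation against these ground-truth datasets typically boils down to a few standard metrics. As a minimal sketch (function names are illustrative, not from any particular library), recall@k and mean reciprocal rank can be computed like this:

```python
from typing import List, Set

def recall_at_k(retrieved: List[str], relevant: Set[str], k: int) -> float:
    """Fraction of the relevant documents that appear in the top-k results."""
    hits = sum(1 for doc_id in retrieved[:k] if doc_id in relevant)
    return hits / len(relevant) if relevant else 0.0

def mrr(retrieved: List[str], relevant: Set[str]) -> float:
    """Reciprocal rank of the first relevant document (0 if none is found)."""
    for rank, doc_id in enumerate(retrieved, start=1):
        if doc_id in relevant:
            return 1.0 / rank
    return 0.0
```

Average these over every query in the benchmark to get a single retrieval score you can track across index or embedding changes.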


🏅 The Leaderboards

Before choosing a model, check these live leaderboards:


📚 General Knowledge (Open Domain QA)

  • MS MARCO - 1M+ queries.
    • Real, anonymized Bing search queries paired with human-generated answers.
    • (Best For: Retrieval)
  • HotpotQA - 113k pairs.
    • Question answering requiring multi-hop reasoning.
    • (Best For: Reasoning)
  • Natural Questions (NQ) - 300k+ examples.
    • Real user queries issued to Google Search.
    • (Best For: Realism)
  • TriviaQA - 95k question-answer pairs.
    • Reading comprehension dataset containing triples of (question, answer, evidence).
    • (Best For: Factuality)
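Answer quality on these QA benchmarks is conventionally scored with exact match and token-level F1 over normalized strings (the SQuAD-style scheme: lowercase, drop punctuation and articles). A minimal sketch:

```python
import re
import string
from collections import Counter

def normalize(text: str) -> str:
    """Lowercase, strip punctuation and articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())

def exact_match(prediction: str, gold: str) -> bool:
    return normalize(prediction) == normalize(gold)

def token_f1(prediction: str, gold: str) -> float:
    """Harmonic mean of token precision and recall between prediction and gold."""
    pred_tokens = normalize(prediction).split()
    gold_tokens = normalize(gold).split()
    common = Counter(pred_tokens) & Counter(gold_tokens)
    overlap = sum(common.values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)
```

Datasets like TriviaQA ship multiple gold aliases per question; score against each and take the maximum.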

📑 Long-Context & Document Understanding

  • SQuAD 2.0 - Stanford Question Answering Dataset.
    • Includes unanswerable questions.
    • (Best For: Hallucination Detection)
  • Qasper - Question answering over NLP papers.
    • (Best For: Technical/Scientific)
  • NarrativeQA - QA over collected stories (books and movie scripts).
    • (Best For: Long Context)
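The unanswerable questions in SQuAD 2.0 make hallucination measurable: a well-calibrated RAG system should abstain when the context holds no answer. One way to score this (a sketch; the scoring convention here is illustrative, with an empty gold list marking an unanswerable question and an empty string marking abstention):

```python
from typing import List

def hallucination_rate(predictions: List[str], golds: List[List[str]]) -> float:
    """Fraction of unanswerable questions (empty gold list) where the system
    produced a non-empty answer anyway."""
    unanswerable = [p for p, g in zip(predictions, golds) if not g]
    if not unanswerable:
        return 0.0
    return sum(1 for p in unanswerable if p.strip() != "") / len(unanswerable)
```

Track this alongside answer accuracy; a system can score well on answerable questions while still hallucinating freely on unanswerable ones.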

🧪 Synthetic Data Generation

Don't have a labeled dataset? Generate one from your own internal documents: chunk the corpus, then have an LLM write a question-answer pair for each chunk, so the source chunk doubles as the ground-truth context for retrieval evaluation.
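A minimal scaffold for that pipeline might look like the following. The `generate_qa` parameter is a placeholder for your LLM call (everything here is an illustrative sketch, not a specific library's API):

```python
from typing import Callable, Dict, List

def chunk_document(text: str, chunk_size: int = 200) -> List[str]:
    """Split a document into word-based chunks (real pipelines often chunk by tokens)."""
    words = text.split()
    return [" ".join(words[i:i + chunk_size])
            for i in range(0, len(words), chunk_size)]

def build_synthetic_dataset(
    documents: List[str],
    generate_qa: Callable[[str], Dict[str, str]],  # e.g. a wrapper around an LLM call
) -> List[Dict[str, str]]:
    """Produce (question, answer, context) records, one per chunk.

    The chunk each QA pair was generated from serves as the gold context,
    which is exactly what retrieval metrics like recall@k need.
    """
    records = []
    for doc in documents:
        for chunk in chunk_document(doc):
            qa = generate_qa(chunk)
            records.append({
                "question": qa["question"],
                "answer": qa["answer"],
                "context": chunk,
            })
    return records
```

In practice you would also filter the generated pairs (e.g. discard questions the generator cannot answer from the chunk alone) before trusting them as ground truth.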


(back to main resource)