A standardized and fair evaluation framework and leaderboard for Retrieval-Augmented Generation (RAG) systems.
RAGQA-Leaderboard aims to provide researchers and developers with a unified and reproducible benchmark for evaluating the performance of RAG models. We have integrated a suite of popular and high-frequency question-answering datasets and offer a streamlined, one-click evaluation pipeline that generates detailed reports, making model comparison and analysis easier than ever.
- [2025-11-19]: We have updated the evaluation results for several models. Check out the latest results on Hugging Face!
- Standardized Evaluation Framework: Provides a unified and fair evaluation pipeline, ensuring that different models are compared under the same conditions for reproducible results.
- Comprehensive Dataset Integration:
  - Integrates a wide range of popular QA datasets used in the RAG domain.
  - Covers diverse question types, including Single-Hop, Multi-Hop, and Domain-Specific scenarios.
  - Includes benchmarks like HotpotQA, PopQA, MusiqueQA, TriviaQA, and more.
- Multi-Dimensional Metrics: Supports core evaluation metrics such as Accuracy, F1 Score, and Exact Match to provide a holistic view of model performance.
- One-Click Reporting: Generates detailed evaluation reports automatically after each run, making model comparison and analysis straightforward.
- Modular RAG Evaluation: Go beyond end-to-end testing. This framework allows for the isolated evaluation of individual RAG components, such as the Retriever and the Generator, enabling targeted analysis and debugging.
- Flexible Model Inference:
  - API-based: Evaluate models served via API endpoints (e.g., OpenAI, Anthropic, or custom-hosted models).
  - Local Inference: Supports high-performance, offline evaluation of local models using libraries like vLLM for maximum speed and efficiency.
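The Exact Match and F1 metrics listed above are not defined in this README; as an illustration, a conventional SQuAD-style token-level implementation (an assumption about what the framework actually computes) looks like this:

```python
import re
import string
from collections import Counter

def normalize(text):
    """Lowercase, drop articles and punctuation, collapse whitespace."""
    text = text.lower()
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    text = "".join(ch for ch in text if ch not in set(string.punctuation))
    return " ".join(text.split())

def exact_match(pred, gold):
    """1 if the normalized prediction equals the normalized gold answer."""
    return int(normalize(pred) == normalize(gold))

def f1_score(pred, gold):
    """Token-overlap F1 between normalized prediction and gold answer."""
    pred_tokens = normalize(pred).split()
    gold_tokens = normalize(gold).split()
    common = Counter(pred_tokens) & Counter(gold_tokens)
    overlap = sum(common.values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)
```

Answer normalization (articles, punctuation, casing) is what makes these metrics robust to surface-form differences; exact details can vary between benchmarks.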
Our comprehensive evaluation dataset, RAG-QA, is publicly available on the Hugging Face Hub.
Clone the repository:

```bash
git clone https://github.com/AQ-MedAI/RagQALeaderboard
cd RagQALeaderboard/
```
Install dependencies:

```bash
pip install -r requirements.txt
```
Download the data:

```bash
# Make sure the hf CLI is installed: pip install -U "huggingface_hub[cli]"
hf download AQ-MedAI/RAG-OmniQA --repo-type=dataset
```

(Optional) Install local inference dependencies. If you want to use local models (e.g., transformers or vLLM):

```bash
pip install transformers vllm
```

(Optional) Install API inference dependencies. If you want to use the OpenAI API for inference:

```bash
pip install openai
```

To evaluate a model, use the following make command:
```bash
export EVAL_MODEL=<model_path>
make eval-all
```

This will evaluate the model on all datasets specified in the Makefile.
If you want to evaluate a specific dataset, use:
```bash
export EVAL_MODEL=<model_path>
make eval-single DATASETS="hotpotqa popqa"
```

Alternatively, you can use the Python script directly:
```bash
python eval.py --model-name "Qwen3" --model-path "/path/to/model" --eval-dataset hotpotqa popqa
```

If you want to run with an API:

```bash
python eval.py --model-name <model_name> --model-path <api_url> --api-key <api_key>
```
You can modify the configuration files in the config/ directory (e.g., api_prompt_config_en.json) to customize evaluation parameters.
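The exact schema of these configuration files is not documented here; purely as an illustrative sketch (every key below is hypothetical and may not match the real files), such a config might contain entries like:

```json
{
  "model_name": "Qwen3",
  "temperature": 0.0,
  "max_tokens": 512,
  "system_prompt": "Answer the question using only the provided documents."
}
```

Check the shipped files in config/ for the actual keys before editing.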
After evaluation, HTML reports and JSON results will be saved in the reports/ directory.
You can also run the following command to generate a report from existing results:

```bash
python get_report.py --result-dir <your_result_dir>
```

```
RAGQA-Leaderboard/
├── src/              # Core source code
│   ├── data.py       # Data processing module
│   ├── report/       # Report generation module
│   ├── models/       # Model interface module
│   ├── eval_main.py  # Evaluation entry point
│   └── logger.py     # Logging module
├── data/             # Datasets
├── config/           # Configuration files
├── tests/            # Unit tests
├── reports/          # Output directory for evaluation reports
├── eval.py           # Main evaluation script
├── README.md         # Project documentation
└── Makefile          # Automation scripts
```
We have curated a comprehensive dataset by collecting and processing popular question-answering (QA) datasets. Our processing pipeline ensures that each question is paired with its corresponding golden document(s) and a set of noise documents (approximately 50). Additionally, we provide a large-scale retrieval pool consisting of approximately 1.09 million documents from Wikipedia.
The detailed construction process is as follows:
We collected a total of 30,135 queries from three main categories of QA datasets:
- Single-Hop: We adopted the data split from MIRAGE and selected queries from NQ, TriviaQA, and PopQA.
- Multi-Hop: We included all queries from HotpotQA, MuSiQue-Ans, and 2WikiMultiHopQA.
- Domain-Specific: We selected 500 queries from the PubMedQA test set.
Golden documents were assigned as follows:
- Single-Hop: For these datasets, we directly used the golden documents provided in the MIRAGE project.
- Multi-Hop: The HotpotQA (distractor setting), MuSiQue-Ans, and 2WikiMultiHopQA datasets inherently provide multiple golden documents for each multi-hop question, which we used directly.
- Domain-Specific: For PubMedQA, questions are generated from article abstracts. We treat these source abstracts as the golden documents.
We employed Contriever-MS as our retriever. For each question, we retrieved the top 50 documents from our Wikipedia corpus. These retrieved documents, after removing any exact matches with the corresponding golden document(s), constitute the set of noise documents.
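The noise-selection step described above can be sketched as follows. This is an illustrative helper, not the project's actual code, and "exact match" is assumed here to mean string equality on document text:

```python
def select_noise_documents(retrieved_docs, golden_docs):
    """Keep top-retrieved documents that are not exact matches
    of any golden document; the remainder serve as noise."""
    golden = set(golden_docs)
    return [doc for doc in retrieved_docs if doc not in golden]

# Example: one golden document appears among the top retrieved results
# and is filtered out, leaving only the noise documents.
noise = select_noise_documents(
    retrieved_docs=["doc_gold", "doc_a", "doc_b"],
    golden_docs=["doc_gold"],
)
```

In practice the pipeline may compare document IDs rather than raw text; the filtering logic is the same either way.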
The final retrieval pool was constructed by merging all golden and noise documents from the entire collection and then performing a final deduplication to ensure a unique set of documents.
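The pool-construction step can likewise be sketched (illustrative only; deduplication here is by exact document text, which is an assumption about how the pipeline identifies duplicates):

```python
def build_retrieval_pool(document_lists):
    """Merge golden and noise documents from every question into one
    pool, keeping only the first occurrence of each document."""
    seen = set()
    pool = []
    for docs in document_lists:
        for doc in docs:
            if doc not in seen:
                seen.add(doc)
                pool.append(doc)
    return pool
```

First-occurrence deduplication preserves a stable ordering, which keeps the pool reproducible across runs.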
- HotpotQA: Multi-hop question answering dataset.
- PopQA: Single-hop question answering dataset.
- MusiqueQA: Multi-hop question answering dataset.
- TriviaQA: General knowledge question answering dataset.
- 2Wiki: Multi-hop question answering dataset.
- PubmedQA: Biomedical question answering dataset.
- config/api_prompt_config_en.json: Default configuration for English evaluation.
- config/api_prompt_config_ch.json: Default configuration for Chinese evaluation.
- config/default_prompt_config.json: General configuration file.
@misc{RagQALeaderboard2026,
author = {AQ-MedAI},
title = {RagQALeaderboard: A Comprehensive Leaderboard for RAG-based Medical Question Answering},
year = {2026},
publisher = {GitHub},
journal = {GitHub repository},
howpublished = {\url{https://github.com/AQ-MedAI/RagQALeaderboard}}
}
Run the following command to execute the test cases:
```bash
pytest tests/
```

We gratefully acknowledge the creators and maintainers of the publicly available datasets integrated into RAGQA-Leaderboard. Specifically:
- HotpotQA (Yang et al., EMNLP 2018): A multi-hop question answering dataset. Paper Link
- PopQA (Mallen et al., ACL 2023): A factoid question answering dataset. Paper Link
- MusiqueQA (Trivedi et al., TACL 2022): A multi-hop compositional QA dataset. Paper Link
- TriviaQA (Joshi et al., ACL 2017): A large-scale QA dataset. Paper Link
- 2Wiki (Ho et al., NAACL 2021): A multi-hop complex QA dataset. Paper Link
- PubmedQA (Jin et al., BioRxiv 2019): Biomedical QA dataset. Paper Link
These datasets are the copyright of their respective authors; we use them solely for research and non-commercial evaluation purposes. Please cite their work appropriately if you use the leaderboard or these datasets in your own publications.
This project is licensed under the Apache License 2.0. See the LICENSE file for details.
We welcome contributions! Feel free to submit issues and pull requests. For major changes, please open an issue first to discuss what you would like to change.
- Author: AQ-Med Team
- Email: tanzhehao.tzh@antgroup.com, jiaoyihan.yh@antgroup.com
