A standardized and fair evaluation framework and leaderboard for Retrieval-Augmented Generation (RAG) systems.
RAGQA-Leaderboard aims to provide researchers and developers with a unified and reproducible benchmark for evaluating the performance of RAG models. We have integrated a suite of popular and high-frequency question-answering datasets and offer a streamlined, one-click evaluation pipeline that generates detailed reports, making model comparison and analysis easier than ever.
- [2025-11-19]: We have updated the evaluation results for several models. Check out the latest results on Hugging Face!
- Standardized Evaluation Framework: Provides a unified and fair evaluation pipeline, ensuring that different models are compared under the same conditions for reproducible results.
- Comprehensive Dataset Integration:
  - Integrates a wide range of popular QA datasets used in the RAG domain.
  - Covers diverse question types, including Single-Hop, Multi-Hop, and Domain-Specific scenarios.
  - Includes benchmarks like HotpotQA, PopQA, MusiqueQA, TriviaQA, and more.
- Multi-Dimensional Metrics: Supports core evaluation metrics such as Accuracy, F1 Score, and Exact Match to provide a holistic view of model performance.
- One-Click Reporting: Generates detailed evaluation reports automatically after each run, making model comparison and analysis straightforward.
- Modular RAG Evaluation: Go beyond end-to-end testing. This framework allows for the isolated evaluation of individual RAG components, such as the Retriever and the Generator, enabling targeted analysis and debugging.
- Flexible Model Inference:
  - API-based: Evaluate models served via API endpoints (e.g., OpenAI, Anthropic, or custom-hosted models).
  - Local Inference: Supports high-performance, offline evaluation of local models using libraries like vLLM for maximum speed and efficiency.
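The Exact Match and F1 metrics listed above are not defined in this README; as an illustration, a conventional SQuAD-style token-level implementation (an assumption about what the framework actually computes) looks like this:

```python
import re
import string
from collections import Counter

def normalize(text):
    """Lowercase, drop articles and punctuation, collapse whitespace."""
    text = text.lower()
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    text = "".join(ch for ch in text if ch not in set(string.punctuation))
    return " ".join(text.split())

def exact_match(pred, gold):
    """1 if the normalized prediction equals the normalized gold answer."""
    return int(normalize(pred) == normalize(gold))

def f1_score(pred, gold):
    """Token-overlap F1 between normalized prediction and gold answer."""
    pred_tokens = normalize(pred).split()
    gold_tokens = normalize(gold).split()
    common = Counter(pred_tokens) & Counter(gold_tokens)
    overlap = sum(common.values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)
```

Answer normalization (articles, punctuation, casing) is what makes these metrics robust to surface-form differences; exact details can vary between benchmarks.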
Our comprehensive evaluation dataset, RAG-QA, is publicly available on the Hugging Face Hub.
Clone the repository:

```bash
git clone https://github.com/AQ-MedAI/RagQALeaderboard
cd RagQALeaderboard/
```
Install dependencies:

```bash
pip install -r requirements.txt
```
Download the data:

```bash
# Make sure the hf CLI is installed: pip install -U "huggingface_hub[cli]"
hf download AQ-MedAI/RAG-OmniQA --repo-type=dataset
```

(Optional) Install local inference dependencies. If you want to use local models (e.g., transformers or vLLM):

```bash
pip install transformers vllm
```

(Optional) Install API inference dependencies. If you want to use the OpenAI API for inference:

```bash
pip install openai
```

To evaluate a model, use the following make command:
```bash
export EVAL_MODEL=<model_path>
make eval-all
```

This will evaluate the model on all datasets specified in the Makefile.
If you want to evaluate a specific dataset, use:
```bash
export EVAL_MODEL=<model_path>
make eval-single DATASETS="hotpotqa popqa"
```

Alternatively, you can use the Python script directly:
```bash
python eval.py --model-name "Qwen3" --model-path "/path/to/model" --eval-dataset hotpotqa popqa
```

If you want to run with an API:

```bash
python eval.py --model-name <model_name> --model-path <api_url> --api-key <api_key>
```
You can modify the configuration files in the config/ directory (e.g., api_prompt_config_en.json) to customize evaluation parameters.
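The exact schema of these configuration files is not documented here; purely as an illustrative sketch (every key below is hypothetical and may not match the real files), such a config might contain entries like:

```json
{
  "model_name": "Qwen3",
  "temperature": 0.0,
  "max_tokens": 512,
  "system_prompt": "Answer the question using only the provided documents."
}
```

Check the shipped files in config/ for the actual keys before editing.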
After evaluation, HTML reports and JSON results will be saved in the reports/ directory.
You can also run the following command to generate a report from existing results:

```bash
python get_report.py --result-dir <your_result_dir>
```

```
RAGQA-Leaderboard/
├── src/              # Core source code
│   ├── data.py       # Data processing module
│   ├── report/       # Report generation module
│   ├── models/       # Model interface module
│   ├── eval_main.py  # Evaluation entry point
│   └── logger.py     # Logging module
├── data/             # Datasets
├── config/           # Configuration files
├── tests/            # Unit tests
├── reports/          # Output directory for evaluation reports
├── eval.py           # Main evaluation script
├── README.md         # Project documentation
└── Makefile          # Automation scripts
```
We have curated a comprehensive dataset by collecting and processing popular question-answering (QA) datasets. Our processing pipeline ensures that each question is paired with its corresponding golden document(s) and a set of noise documents (approximately 50). Additionally, we provide a large-scale retrieval pool consisting of approximately 1.09 million documents from Wikipedia.
The detailed construction process is as follows:
We collected a total of 30,135 queries from three main categories of QA datasets:
- Single-Hop: We adopted the data split from MIRAGE and selected queries from NQ, TriviaQA, and PopQA.
- Multi-Hop: We included all queries from HotpotQA, MuSiQue-Ans, and 2WikiMultiHopQA.
- Domain-Specific: We selected 500 queries from the PubMedQA test set.
Golden documents were assigned as follows:
- Single-Hop: For these datasets, we directly used the golden documents provided in the MIRAGE project.
- Multi-Hop: The HotpotQA (distractor setting), MuSiQue-Ans, and 2WikiMultiHopQA datasets inherently provide multiple golden documents for each multi-hop question, which we used directly.
- Domain-Specific: For PubMedQA, questions are generated from article abstracts. We treat these source abstracts as the golden documents.
We employed Contriever-MS as our retriever. For each question, we retrieved the top 50 documents from our Wikipedia corpus. These retrieved documents, after removing any exact matches with the corresponding golden document(s), constitute the set of noise documents.
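The noise-selection step described above can be sketched as follows. This is an illustrative helper, not the project's actual code, and "exact match" is assumed here to mean string equality on document text:

```python
def select_noise_documents(retrieved_docs, golden_docs):
    """Keep top-retrieved documents that are not exact matches
    of any golden document; the remainder serve as noise."""
    golden = set(golden_docs)
    return [doc for doc in retrieved_docs if doc not in golden]

# Example: one golden document appears among the top retrieved results
# and is filtered out, leaving only the noise documents.
noise = select_noise_documents(
    retrieved_docs=["doc_gold", "doc_a", "doc_b"],
    golden_docs=["doc_gold"],
)
```

In practice the pipeline may compare document IDs rather than raw text; the filtering logic is the same either way.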
The final retrieval pool was constructed by merging all golden and noise documents from the entire collection and then performing a final deduplication to ensure a unique set of documents.
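The pool-construction step can likewise be sketched (illustrative only; deduplication here is by exact document text, which is an assumption about how the pipeline identifies duplicates):

```python
def build_retrieval_pool(document_lists):
    """Merge golden and noise documents from every question into one
    pool, keeping only the first occurrence of each document."""
    seen = set()
    pool = []
    for docs in document_lists:
        for doc in docs:
            if doc not in seen:
                seen.add(doc)
                pool.append(doc)
    return pool
```

First-occurrence deduplication preserves a stable ordering, which keeps the pool reproducible across runs.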
- HotpotQA: Multi-hop question answering dataset.
- PopQA: Single-hop question answering dataset.
- MusiqueQA: Multi-hop question answering dataset.
- TriviaQA: General knowledge question answering dataset.
- 2Wiki: Multi-hop question answering dataset.
- PubmedQA: Biomedical question answering dataset.
- config/api_prompt_config_en.json: Default configuration for English evaluation.
- config/api_prompt_config_ch.json: Default configuration for Chinese evaluation.
- config/default_prompt_config.json: General configuration file.
@misc{RagQALeaderboard2026,
author = {AQ-MedAI},
title = {RagQALeaderboard: A Comprehensive Leaderboard for RAG-based Medical Question Answering},
year = {2026},
publisher = {GitHub},
journal = {GitHub repository},
howpublished = {\url{https://github.com/AQ-MedAI/RagQALeaderboard}}
}
Run the following command to execute the test cases:
```bash
pytest tests/
```

We gratefully acknowledge the creators and maintainers of the publicly available datasets integrated into RAGQA-Leaderboard. Specifically:
- HotpotQA (Yang et al., EMNLP 2018): A multi-hop question answering dataset. Paper Link
- PopQA (Mallen et al., ACL 2023): A factoid question answering dataset. Paper Link
- MusiqueQA (Trivedi et al., TACL 2022): A multi-hop compositional QA dataset. Paper Link
- TriviaQA (Joshi et al., ACL 2017): A large-scale QA dataset. Paper Link
- 2Wiki (Ho et al., NAACL 2021): A multi-hop complex QA dataset. Paper Link
- PubmedQA (Jin et al., BioRxiv 2019): Biomedical QA dataset. Paper Link
These datasets are the copyright of their respective authors; we use them solely for research and non-commercial evaluation purposes. Please cite their work appropriately if you use the leaderboard or these datasets in your own publications.
This project is licensed under the Apache License 2.0. See the LICENSE file for details.
We welcome contributions! Feel free to submit issues and pull requests. For major changes, please open an issue first to discuss what you would like to change.
- Author: AQ-Med Team
- Email: tanzhehao.tzh@antgroup.com, jiaoyihan.yh@antgroup.com
