RAGQA-Leaderboard

A standardized and fair evaluation framework and leaderboard for Retrieval-Augmented Generation (RAG) systems.

RAGQA-Leaderboard aims to provide researchers and developers with a unified, reproducible benchmark for evaluating the performance of RAG models. We have integrated a suite of popular, widely used question-answering datasets and offer a streamlined, one-click evaluation pipeline that generates detailed reports, making model comparison and analysis easier than ever.

📢 Updates


✨ Key Features

  • 📊 Standardized Evaluation Framework: Provides a unified and fair evaluation pipeline, ensuring that different models are compared under the same conditions for reproducible results.

  • 📚 Comprehensive Dataset Integration:

    • Integrates a wide range of popular QA datasets used in the RAG domain.
    • Covers diverse question types, including Single-Hop, Multi-Hop, and Domain-Specific scenarios.
    • Includes benchmarks such as HotpotQA, PopQA, MusiqueQA, TriviaQA, and more.

  • 📈 Multi-Dimensional Metrics:

    • Supports core evaluation metrics such as Accuracy, F1 Score, and Exact Match to provide a holistic view of model performance.

  • 📄 One-Click Reporting:

    • Generate comprehensive evaluation reports with a single command.
    • Outputs reports in HTML for easy visualization and analysis, and in JSON for programmatic use, making it effortless to analyze and compare performance across different models.

  • 🧩 Modular RAG Evaluation: Go beyond end-to-end testing. The framework allows isolated evaluation of individual RAG components, such as the Retriever and the Generator, enabling targeted analysis and debugging.

  • 🚀 Flexible Model Inference:

    • API-based: Evaluate models served via API endpoints (e.g., OpenAI, Anthropic, or custom-hosted models).
    • Local Inference: Supports high-performance, offline evaluation of local models using libraries like vLLM for maximum speed and efficiency.
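As a reference for the metric definitions above: Exact Match and token-level F1 are conventionally computed SQuAD-style, as in the sketch below. This illustrates the standard definitions and is not necessarily this framework's exact implementation:

```python
import re
import string
from collections import Counter

def normalize(text: str) -> str:
    """Lowercase, strip punctuation and articles, collapse whitespace (SQuAD-style)."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())

def exact_match(prediction: str, gold: str) -> float:
    """1.0 if the normalized prediction equals the normalized gold answer."""
    return float(normalize(prediction) == normalize(gold))

def f1_score(prediction: str, gold: str) -> float:
    """Token-level F1 between normalized prediction and gold answer."""
    pred_tokens = normalize(prediction).split()
    gold_tokens = normalize(gold).split()
    common = Counter(pred_tokens) & Counter(gold_tokens)
    overlap = sum(common.values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)
```

For example, `exact_match("The Eiffel Tower", "eiffel tower")` is 1.0 after normalization, while `f1_score("Paris France", "Paris")` rewards the partial token overlap.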

🚀 Download the Dataset

Our comprehensive evaluation dataset, RAG-QA, is publicly available on the Hugging Face Hub.


πŸ› οΈ Installation

  1. Clone the repository:

    git clone https://github.com/AQ-MedAI/RagQALeaderboard
    cd RagQALeaderboard/
  2. Install dependencies:

    pip install -r requirements.txt
  3. Download Data

# Make sure hf CLI is installed: pip install -U "huggingface_hub[cli]"
hf download AQ-MedAI/RAG-OmniQA --repo-type=dataset
  1. (Optional) Install local inference dependencies If you want to use local models (e.g., transformers or vllm):
pip install transformers vllm
  1. (Optional) Install API inference dependencies If you want to use the OpenAI API for inference:
pip install openai

📋 Usage

1. Run Evaluation

To evaluate a model, use the following make command:

export EVAL_MODEL=<model_path> 
make eval-all

This will evaluate the model for all datasets specified in the Makefile.

If you want to evaluate a specific dataset, use:

export EVAL_MODEL=<model_path>
make eval-single DATASETS="hotpotqa popqa"

Alternatively, you can use the Python script directly:

python eval.py --model-name "Qwen3" --model-path "/path/to/model" --eval-dataset hotpotqa popqa

If you want to run against an API endpoint:

python eval.py --model-name <model_name> --model-path <api_url> --api-key <api_key>

2. Customize Configuration

You can modify the configuration files in the config/ directory (e.g., api_prompt_config_en.json) to customize evaluation parameters.
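If you prefer to adjust configuration programmatically rather than editing JSON by hand, a small helper like the following works. The helper name and the example key are illustrative assumptions; inspect the shipped config files for the actual schema:

```python
import json
from pathlib import Path

def update_config(path: str, **overrides) -> dict:
    """Load a JSON config file, apply key overrides, and write it back.
    The override keys must match the file's actual schema."""
    config_path = Path(path)
    config = json.loads(config_path.read_text(encoding="utf-8"))
    config.update(overrides)
    config_path.write_text(
        json.dumps(config, indent=2, ensure_ascii=False), encoding="utf-8"
    )
    return config

# Hypothetical usage -- "temperature" is an assumed key, not a confirmed one:
# update_config("config/api_prompt_config_en.json", temperature=0.0)
```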

3. Generate Reports

After evaluation, HTML reports and JSON results will be saved in the reports/ directory.

You can also run the following command to regenerate a report from an existing results directory:

python get_report.py --result-dir <your_result_dir>

📂 Project Structure

RAGQA-Leaderboard/
├── src/                    # Core source code
│   ├── data.py             # Data processing module
│   ├── report/             # Report generation module
│   ├── models/             # Model interface module
│   ├── eval_main.py        # Evaluation entry point
│   └── logger.py           # Logging module
├── data/                   # Datasets
├── config/                 # Configuration files
├── tests/                  # Unit tests
├── reports/                # Output directory for evaluation reports
├── eval.py                 # Main evaluation script
├── README.md               # Project documentation
└── Makefile                # Automation scripts

πŸ” Datasets Collection

Dataset Construction

We have curated a comprehensive dataset by collecting and processing popular question-answering (QA) datasets. Our processing pipeline ensures that each question is paired with its corresponding golden document(s) and a set of noise documents (approximately 50). Additionally, we provide a large-scale retrieval pool consisting of approximately 1.09 million documents from Wikipedia.

The detailed construction process is as follows:

1. Question Collection

We collected a total of 30,135 queries from three main categories of QA datasets:

  • Single-Hop: We adopted the data split from MIRAGE and selected queries from NQ, TriviaQA, and PopQA.
  • Multi-Hop: We included all queries from HotpotQA, MuSiQue-Ans, and 2WikiMultiHopQA.
  • Domain-Specific: We selected 500 queries from the PubMedQA test set.

2. Golden Document Collection

  • Single-Hop: For these datasets, we directly used the golden documents provided in the MIRAGE project.
  • Multi-Hop: The HotpotQA (distractor setting), MuSiQue-Ans, and 2WikiMultiHopQA datasets inherently provide multiple golden documents for each multi-hop question, which we used directly.
  • Domain-Specific: For PubMedQA, questions are generated from article abstracts. We treat these source abstracts as the golden documents.

3. Noise Document Collection

We employed Contriever-MS as our retriever. For each question, we retrieved the top 50 documents from our Wikipedia corpus. These retrieved documents, after removing any exact matches with the corresponding golden document(s), constitute the set of noise documents.
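The filtering step amounts to something like the following simplified sketch. `build_noise_set` is a hypothetical helper name, and real matching would operate on the document texts returned by the retriever:

```python
def build_noise_set(
    retrieved_docs: list[str], golden_docs: list[str], k: int = 50
) -> list[str]:
    """Keep the top-k retrieved documents, dropping any that exactly
    match a golden document; the remainder are the noise documents."""
    golden = set(golden_docs)
    return [doc for doc in retrieved_docs[:k] if doc not in golden]
```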

4. Retrieval Pool Construction

The final retrieval pool was constructed by merging all golden and noise documents from the entire collection and then performing a final deduplication to ensure a unique set of documents.
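The merge-and-deduplicate step can be sketched as follows; this is again a hypothetical helper, assuming deduplication by exact document text:

```python
def build_retrieval_pool(document_lists: list[list[str]]) -> list[str]:
    """Merge golden and noise documents from every question and
    deduplicate by exact text, preserving first-seen order."""
    seen: set[str] = set()
    pool: list[str] = []
    for docs in document_lists:
        for doc in docs:
            if doc not in seen:
                seen.add(doc)
                pool.append(doc)
    return pool
```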

📊 Supported Datasets

  • HotpotQA: Multi-hop question answering dataset.
  • PopQA: Single-hop question answering dataset.
  • MusiqueQA: Multi-hop question answering dataset.
  • TriviaQA: General knowledge question answering dataset.
  • 2Wiki: Multi-hop question answering dataset.
  • PubmedQA: Biomedical question answering dataset.

βš™οΈ Configuration

  • config/api_prompt_config_en.json: Default configuration for English evaluation.
  • config/api_prompt_config_ch.json: Default configuration for Chinese evaluation.
  • config/default_prompt_config.json: General configuration file.

Cite Us

@misc{RagQALeaderboard2026,
  author = {AQ-MedAI},
  title = {RagQALeaderboard: A Comprehensive Leaderboard for RAG-based Medical Question Answering},
  year = {2026},
  publisher = {GitHub},
  journal = {GitHub repository},
  howpublished = {\url{https://github.com/AQ-MedAI/RagQALeaderboard}}
}

🧪 Testing

Run the following command to execute the test cases:

pytest tests/

πŸ™ Acknowledgements

We gratefully acknowledge the creators and maintainers of the publicly available datasets integrated into RAGQA-Leaderboard. Specifically:

These datasets are copyright of their respective authors and we use them solely for research and non-commercial evaluation purposes. Please cite their works appropriately if you use the leaderboard or these datasets in your own publication.

📜 License

This project is licensed under the Apache License 2.0. See the LICENSE file for details.


🤝 Contributing

We welcome contributions! Feel free to submit issues and pull requests. For major changes, please open an issue first to discuss what you would like to change.


📞 Contact
