Qinyue Zheng1†, Salman Abdullah2†, Sam Rawal MD3, Cyril Zakka MD4, Sophie Ostmeier MD2,5, Maximilian Purk MD6, Eduardo Reis MD7,
Eric Topol MD8, Jure Leskovec PhD2, Michael Moor MD, PhD1
1 Department of Biosystems Science and Engineering, ETH Zurich, Basel, Switzerland
2 Department of Computer Science, Stanford University, Stanford, CA, USA
3 Department of Internal Medicine, Mayo Clinic, Phoenix, AZ, USA
4 Hugging Face, Manhattan, New York City, NY, USA
5 Department of Radiology, Stanford University, Stanford, CA, USA
6 Hasso-Plattner-Institute for Digital Engineering, University of Potsdam, Potsdam, Germany
7 Center for Artificial Intelligence in Medicine and Imaging, Stanford University, Stanford, CA, USA
8 Scripps Translational Science Institute, San Diego, CA, USA
TL;DR: Million-scale medical query-response pairs, grounded in peer-reviewed biomedical literature, enable diverse downstream tasks and enhance the knowledge reliability of LLMs.
```python
from datasets import load_dataset

# for the 5.8M version
dataset = load_dataset("miriad/miriad-5.8M", split="train")

# or, for the 4.4M version
dataset = load_dataset("miriad/miriad-4.4M", split="train")
```

MIRIAD is a large-scale, curated corpus of 5,821,948 medical instruction-response pairs, each grounded in peer-reviewed literature. Generated via a semi-automated pipeline combining LLM rewriting, grounding, filtering, and expert annotation, MIRIAD operationalizes medical knowledge in a format that LLMs can reliably use. MIRIAD boosts accuracy in medical question answering, enables the detection of medical hallucinations, and can support clinical users via MIRIAD-Atlas, a visual interface for semantically organized browsing and knowledge retrieval. MIRIAD lays the groundwork for safer, more grounded medical AI across clinical care and biomedical research.
- Data Generation: Code scripts used for MIRIAD data generation.
- Quality Control: Code scripts used for MIRIAD quality control, including the human expert annotation Streamlit app and the quality filtering code.
- RAG Pipeline: Code scripts used for RAG experiments and medical hallucination detection experiments.
- Demo: Demo notebook for a quick start, including a simple Qdrant retrieval pipeline with MIRIAD as the external corpus, and RAG on MedMCQA with MIRIAD.
- Discipline Categorization: The final curated 56 disciplines within MIRIAD.
- MIRIAD Atlas Vis: Atlas demo with 300k MIRIAD subset.
To run the full pipeline of MIRIAD, including embedding and LLM inference, systems with GPUs are recommended. For example, running the pipeline with Llama 3.1–8B-Instruct requires a minimum of one NVIDIA A100 GPU (40GB) or equivalent. CPU-only systems can be used for basic tasks (e.g., data loading, QA pair inspection), but will be significantly slower and are not suitable for LLM inference or dense embedding generation at scale.
Linux-based systems (Ubuntu 22.04) are recommended for best compatibility and performance. Dependencies are specified in the provided requirements.txt with Python 3.10.12. The full environment can be installed via pip in a Python virtual environment. On a typical desktop with a stable internet connection, installation takes approximately 5-10 minutes.
```shell
git clone https://github.com/eth-medical-ai-lab/MIRIAD.git
cd MIRIAD
python3 -m venv env
source env/bin/activate
pip install -r requirements.txt
```

To facilitate reproducibility and intuitive exploration, we provide a demo notebook at demo/playbook.ipynb. The notebook includes the following subsections:
- MIRIAD Dataset Loading
- RAG Pipeline
- Demo for the retrieval results
- Demo for solving 500 questions from MedMCQA with the aid of MIRIAD
- Demo for MIRIAD Atlas
In the notebook we showcase how MIRIAD can be used as an external knowledge source in a RAG pipeline. The notebook runs a lightweight benchmark using 500 random questions from the MedMCQA dataset. For each question, the top-k relevant QA pairs are retrieved from MIRIAD and prepended to the language model's prompt. The final output includes the predicted answers and an aggregated accuracy score. This demo can be executed on a single GPU. To run the notebook, please make sure to set up the Qdrant vector database first.
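The retrieve-then-prompt step described above can be sketched as follows. This is a simplified illustration rather than the repository's exact code; the prompt template, function name, and record fields are our own assumptions:

```python
def build_rag_prompt(question: str, choices: list[str],
                     retrieved: list[dict], top_k: int = 3) -> str:
    """Prepend the top-k retrieved MIRIAD QA pairs to a multiple-choice question."""
    # Format each retrieved QA pair as a context passage.
    context = "\n\n".join(
        f"Q: {r['question']}\nA: {r['answer']}" for r in retrieved[:top_k]
    )
    # Label answer options A, B, C, ...
    options = "\n".join(f"{chr(65 + i)}. {c}" for i, c in enumerate(choices))
    return (
        "Use the following medical QA pairs as context.\n\n"
        f"{context}\n\n"
        f"Question: {question}\n{options}\nAnswer:"
    )

retrieved = [{"question": "What causes scurvy?",
              "answer": "Scurvy is caused by vitamin C deficiency."}]
prompt = build_rag_prompt("Which vitamin deficiency causes scurvy?",
                          ["Vitamin A", "Vitamin C"], retrieved)
print(prompt)
```

The resulting string is what gets sent to the backbone LLM; the accuracy score then simply compares the model's chosen option against the gold label.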
To support visual inspection and exploration, we also provide a static HTML demo, MIRIAD Atlas, that allows users to browse the structured medical QA pairs. The HTML file can be downloaded from Google Drive and opened locally in any modern web browser. It supports keyword search, topic filtering, and exploration of discipline clusters.
See Data Generation and Quality Control for MIRIAD Dataset Creation.
To directly use the off-the-shelf MIRIAD Dataset, quick start:
- Feel free to skip this step if you already have Docker running on your machine.
- For more detailed information, check out the official Docker docs.
```shell
# Remove any conflicting packages
for pkg in docker.io docker-doc docker-compose docker-compose-v2 podman-docker containerd runc; do sudo apt-get remove $pkg; done

# Add Docker's official GPG key:
sudo apt-get update
sudo apt-get install ca-certificates curl
sudo install -m 0755 -d /etc/apt/keyrings
sudo curl -fsSL https://download.docker.com/linux/ubuntu/gpg -o /etc/apt/keyrings/docker.asc
sudo chmod a+r /etc/apt/keyrings/docker.asc

# Add the repository to Apt sources:
echo \
  "deb [arch=$(dpkg --print-architecture) signed-by=/etc/apt/keyrings/docker.asc] https://download.docker.com/linux/ubuntu \
  $(. /etc/os-release && echo "${UBUNTU_CODENAME:-$VERSION_CODENAME}") stable" | \
  sudo tee /etc/apt/sources.list.d/docker.list > /dev/null
sudo apt-get update

# Install the Docker packages and verify the installation
sudo apt-get install docker-ce docker-ce-cli containerd.io docker-buildx-plugin docker-compose-plugin
sudo docker run hello-world
```

- Feel free to skip this step if you already have the qdrant image running.
- For more detailed information, we encourage you to check out the Qdrant documentation.

```shell
sudo docker pull qdrant/qdrant
sudo docker run -p 6333:6333 -p 6334:6334 \
    -v "$(pwd)/qdrant_storage:/qdrant/storage:z" \
    qdrant/qdrant
```

Now you have the qdrant image set up.
Clone + set up environment
```shell
git clone https://github.com/eth-medical-ai-lab/MIRIAD.git
cd MIRIAD
python3 -m venv env
source env/bin/activate
pip install -r requirements.txt
```

- Embedding parameters are recorded in `config.yaml`. Tune it to specify the MIRIAD dataset version, embedding model, batch size, etc.
- Install and run the Qdrant database with `./qdrant.sh`. Make sure to change `/local1/qdrant/miriad/qdrant_storage` to your local directory before starting the new container. Feel free to skip this step if you already have Qdrant running.
- Run `python distributed_embedding_db.py` to embed MIRIAD. Given the scale of MIRIAD, we recommend at least 1 GPU for faster embedding. Change `world_size` in `config.yaml` according to the number of GPUs you use.
- Run `./upsert_embeddings.sh` to upload the embeddings to the local Qdrant database.
- See `demo/playbook.ipynb` for examples of performing RAG with Qdrant.
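Conceptually, the embed-and-retrieve steps above reduce to nearest-neighbor search over dense vectors. The library-free sketch below illustrates this with cosine similarity; the toy 3-d vectors stand in for real embedding-model outputs, and Qdrant performs the same kind of search at MIRIAD scale:

```python
import math

def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Toy corpus: (embedding, payload) pairs, standing in for embedded MIRIAD QA pairs.
corpus = [
    ([0.9, 0.1, 0.0], {"question": "What causes scurvy?"}),
    ([0.0, 0.8, 0.6], {"question": "How is sepsis managed?"}),
]

def retrieve(query_vec: list[float], k: int = 1) -> list[dict]:
    """Return the top-k payloads ranked by cosine similarity to the query."""
    ranked = sorted(corpus, key=lambda item: cosine(query_vec, item[0]), reverse=True)
    return [payload for _, payload in ranked[:k]]

print(retrieve([1.0, 0.0, 0.0]))  # → [{'question': 'What causes scurvy?'}]
```

In the actual pipeline, the query vector comes from the same embedding model used for `distributed_embedding_db.py`, and Qdrant replaces the brute-force sort with an approximate index.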
- RAG Experiments on Medical Question-Answering benchmarks
- RAG Experiments on Hallucination Detection benchmark
End-to-end pipeline:

- Embed the MIRIAD dataset and set up a vector database for efficient storage and retrieval.
  - Can build "qa", "passage_text", "question"-only, or "answer"-only embeddings.
  - Supports any Hugging Face embedding model, including sentence-transformers, BAAI/bge, etc.
- Evaluate on various QA benchmark datasets via the `run_evaluation` script. Pass in the arguments:
  - dataset + split (e.g. [medmcqa, dev])
  - `in_context_mode`: True or False, indicating whether to use retrieval-augmented generation
  - `top_k` parameter (if `in_context_mode` is True)
- Further configurations can be tuned in `eval_config.yaml` to specify `world_size`, embedding model, backbone LLM, etc. Supports commercial API calls or Hugging Face for local models.
Run evaluation as follows:

```shell
python run_evaluation.py
```

To ensure better reproducibility, we provide a notebook at rag_pipeline/log_results_checking.ipynb so you can quickly check your own experiment results.
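The aggregated accuracy score the evaluation reports boils down to the fraction of benchmark questions answered correctly. A minimal sketch with hypothetical prediction records (the `pred`/`gold` field names are our own, not the script's internal format):

```python
def accuracy(records: list[dict]) -> float:
    """Fraction of questions where the predicted choice matches the gold label."""
    correct = sum(1 for r in records if r["pred"] == r["gold"])
    return correct / len(records)

# Hypothetical per-question results from a multiple-choice benchmark run.
results = [
    {"pred": "B", "gold": "B"},
    {"pred": "A", "gold": "C"},
    {"pred": "D", "gold": "D"},
    {"pred": "C", "gold": "C"},
]
print(f"accuracy = {accuracy(results):.2f}")  # → accuracy = 0.75
```

Comparing this number with and without `in_context_mode` shows the accuracy gain contributed by MIRIAD retrieval.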
Hope you find this project helpful! Kindly cite our paper:
```bibtex
@misc{zheng2025miriadaugmentingllmsmillions,
      title={MIRIAD: Augmenting LLMs with millions of medical query-response pairs},
      author={Qinyue Zheng and Salman Abdullah and Sam Rawal and Cyril Zakka and Sophie Ostmeier and Maximilian Purk and Eduardo Reis and Eric J. Topol and Jure Leskovec and Michael Moor},
      year={2025},
      eprint={2506.06091},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2506.06091},
}
```

For detailed licensing of the dataset, please refer to this page.
All code in this repository is released under the MIT License (see LICENSE).
