Qinyue Zheng1†, Salman Abdullah2†, Sam Rawal MD3, Cyril Zakka MD4, Sophie Ostmeier MD2,5, Maximilian Purk MD6, Eduardo Reis MD7,
Eric Topol MD8, Jure Leskovec PhD2, Michael Moor MD, PhD1
1 Department of Biosystems Science and Engineering, ETH Zurich, Basel, Switzerland
2 Department of Computer Science, Stanford University, Stanford, CA, USA
3 Department of Internal Medicine, Mayo Clinic, Phoenix, AZ, USA
4 Hugging Face, Manhattan, New York City, NY, USA
5 Department of Radiology, Stanford University, Stanford, CA, USA
6 Hasso-Plattner-Institute for Digital Engineering, University of Potsdam, Potsdam, Germany
7 Center for Artificial Intelligence in Medicine and Imaging, Stanford University, Stanford, CA, USA
8 Scripps Translational Science Institute, San Diego, CA, USA
TL;DR: Million-scale medical query-response pairs, grounded in peer-reviewed biomedical literature, enable diverse downstream tasks and enhance the knowledge reliability of LLMs.
```python
from datasets import load_dataset

# for the 5.8M version
dataset = load_dataset("miriad/miriad-5.8M", split="train")

# or, for the 4.4M version
dataset = load_dataset("miriad/miriad-4.4M", split="train")
```

MIRIAD is a large-scale, curated corpus of 5,821,948 medical instruction-response pairs, each grounded in peer-reviewed literature. Generated via a semi-automated pipeline combining LLM rewriting, grounding, filtering, and expert annotation, MIRIAD operationalizes medical knowledge in a format that LLMs can reliably use. MIRIAD boosts accuracy in medical question answering, enables the detection of medical hallucinations, and can support clinical users via MIRIAD-Atlas, a visual interface for semantically organized browsing and knowledge retrieval. MIRIAD lays the groundwork for safer, more grounded medical AI across clinical care and biomedical research.
- Data Generation: Code scripts used for MIRIAD data generation.
- Quality Control: Code scripts used for MIRIAD quality control, including the human expert annotation Streamlit app and the quality filtering code.
- RAG Pipeline: Code scripts used for RAG experiments and medical hallucination detection experiments.
- Demo: Demo notebook for a quick start, including a simple Qdrant retrieval pipeline with MIRIAD as the external corpus, and RAG on MedMCQA with MIRIAD.
- Discipline Categorization: The final curated 56 disciplines within MIRIAD.
- MIRIAD Atlas Vis: Atlas demo with 300k MIRIAD subset.
To run the full pipeline of MIRIAD, including embedding and LLM inference, systems with GPUs are recommended. For example, running the pipeline with Llama 3.1–8B-Instruct requires a minimum of one NVIDIA A100 GPU (40GB) or equivalent. CPU-only systems can be used for basic tasks (e.g., data loading, QA pair inspection), but will be significantly slower and are not suitable for LLM inference or dense embedding generation at scale.
Linux-based systems (Ubuntu 22.04) are recommended for best compatibility and performance. Dependencies are specified in the provided requirements.txt with Python 3.10.12. The full environment can be installed via pip in a Python virtual environment. On a typical desktop with a stable internet connection, installation takes approximately 5-10 minutes.
```shell
git clone https://github.com/eth-medical-ai-lab/MIRIAD.git
cd MIRIAD
python3 -m venv env
source env/bin/activate
pip install -r requirements.txt
```

To facilitate reproducibility and intuitive exploration, we provide a demo notebook at demo/playbook.ipynb. The notebook includes the following subsections:
- MIRIAD Dataset Loading
- RAG Pipeline
- Demo for the retrieval results
- Demo for solving 500 questions from MedMCQA with the aid of MIRIAD
- Demo for MIRIAD Atlas
In the notebook we showcase how MIRIAD can be used as an external knowledge source in a RAG pipeline. The notebook runs a lightweight benchmark using 500 random questions from the MedMCQA dataset. For each question, the top-k relevant QA pairs are retrieved from MIRIAD and prepended to the language model's prompt. The final output includes the predicted answers and an aggregated accuracy score. This demo can be executed on a single GPU. To run the notebook, please make sure to set up the Qdrant vector database first.
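The retrieve-then-prompt step described above can be sketched as follows. This is a simplified illustration rather than the repository's exact code; the prompt template, function name, and record fields are our own assumptions:

```python
def build_rag_prompt(question: str, choices: list[str],
                     retrieved: list[dict], top_k: int = 3) -> str:
    """Prepend the top-k retrieved MIRIAD QA pairs to a multiple-choice question."""
    # Format each retrieved QA pair as a context passage.
    context = "\n\n".join(
        f"Q: {r['question']}\nA: {r['answer']}" for r in retrieved[:top_k]
    )
    # Label answer options A, B, C, ...
    options = "\n".join(f"{chr(65 + i)}. {c}" for i, c in enumerate(choices))
    return (
        "Use the following medical QA pairs as context.\n\n"
        f"{context}\n\n"
        f"Question: {question}\n{options}\nAnswer:"
    )

retrieved = [{"question": "What causes scurvy?",
              "answer": "Scurvy is caused by vitamin C deficiency."}]
prompt = build_rag_prompt("Which vitamin deficiency causes scurvy?",
                          ["Vitamin A", "Vitamin C"], retrieved)
print(prompt)
```

The resulting string is what gets sent to the backbone LLM; the accuracy score then simply compares the model's chosen option against the gold label.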
To support visual inspection and exploration, we also provide a static HTML demo, MIRIAD Atlas, that allows users to browse the structured medical QA pairs. The HTML file can be downloaded from Google Drive and opened locally in any modern web browser. It supports keyword search, topic filtering, and exploration of discipline clusters.
See Data Generation and Quality Control for MIRIAD Dataset Creation.
To directly use the off-the-shelf MIRIAD Dataset, quick start:
- Feel free to skip this step if you already have Docker running on your machine.
- For more detailed information, check out the official Docker docs.
```shell
# Remove any conflicting packages
for pkg in docker.io docker-doc docker-compose docker-compose-v2 podman-docker containerd runc; do sudo apt-get remove $pkg; done

# Add Docker's official GPG key:
sudo apt-get update
sudo apt-get install ca-certificates curl
sudo install -m 0755 -d /etc/apt/keyrings
sudo curl -fsSL https://download.docker.com/linux/ubuntu/gpg -o /etc/apt/keyrings/docker.asc
sudo chmod a+r /etc/apt/keyrings/docker.asc

# Add the repository to Apt sources:
echo \
  "deb [arch=$(dpkg --print-architecture) signed-by=/etc/apt/keyrings/docker.asc] https://download.docker.com/linux/ubuntu \
  $(. /etc/os-release && echo "${UBUNTU_CODENAME:-$VERSION_CODENAME}") stable" | \
  sudo tee /etc/apt/sources.list.d/docker.list > /dev/null
sudo apt-get update

# Install the Docker packages and verify the installation
sudo apt-get install docker-ce docker-ce-cli containerd.io docker-buildx-plugin docker-compose-plugin
sudo docker run hello-world
```

- Feel free to skip this step if you already have the qdrant image running.
- For more detailed information, we encourage you to check out the Qdrant documentation.

```shell
sudo docker pull qdrant/qdrant
sudo docker run -p 6333:6333 -p 6334:6334 \
    -v "$(pwd)/qdrant_storage:/qdrant/storage:z" \
    qdrant/qdrant
```

Now you have the qdrant image set up.
Clone + set up environment
```shell
git clone https://github.com/eth-medical-ai-lab/MIRIAD.git
cd MIRIAD
python3 -m venv env
source env/bin/activate
pip install -r requirements.txt
```

- Embedding parameters are recorded in `config.yaml`. Tune it to specify the MIRIAD dataset version, embedding model, batch size, etc.
- Install and run the Qdrant database with `./qdrant.sh`. Make sure to change `/local1/qdrant/miriad/qdrant_storage` to your local directory before starting the new container. Feel free to skip this step if you already have Qdrant running.
- Run `python distributed_embedding_db.py` to embed MIRIAD. Given the scale of MIRIAD, we recommend at least 1 GPU for faster embedding. Change `world_size` in `config.yaml` according to the number of GPUs you use.
- Run `./upsert_embeddings.sh` to upload the embeddings to the local Qdrant database.
- See `demo/playbook.ipynb` for examples of performing RAG with Qdrant.
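Conceptually, the embed-and-retrieve steps above reduce to nearest-neighbor search over dense vectors. The library-free sketch below illustrates this with cosine similarity; the toy 3-d vectors stand in for real embedding-model outputs, and Qdrant performs the same kind of search at MIRIAD scale:

```python
import math

def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Toy corpus: (embedding, payload) pairs, standing in for embedded MIRIAD QA pairs.
corpus = [
    ([0.9, 0.1, 0.0], {"question": "What causes scurvy?"}),
    ([0.0, 0.8, 0.6], {"question": "How is sepsis managed?"}),
]

def retrieve(query_vec: list[float], k: int = 1) -> list[dict]:
    """Return the top-k payloads ranked by cosine similarity to the query."""
    ranked = sorted(corpus, key=lambda item: cosine(query_vec, item[0]), reverse=True)
    return [payload for _, payload in ranked[:k]]

print(retrieve([1.0, 0.0, 0.0]))  # → [{'question': 'What causes scurvy?'}]
```

In the actual pipeline, the query vector comes from the same embedding model used for `distributed_embedding_db.py`, and Qdrant replaces the brute-force sort with an approximate index.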
- RAG Experiments on Medical Question-Answering benchmarks
- RAG Experiments on Hallucination Detection benchmark
End-to-end pipeline:

- Embed the MIRIAD dataset and set up a vector database for efficient storage and retrieval.
  - Can build "qa", "passage_text", "question"-only, or "answer"-only embeddings.
  - Supports any Hugging Face embedding model, including sentence-transformers, BAAI/bge, etc.
- Evaluate on various QA benchmark datasets via the `run_evaluation` script. Pass in the arguments:
  - dataset + split (e.g. [medmcqa, dev])
  - `in_context_mode`: True or False, indicating whether to use retrieval-augmented generation
  - `top_k` parameter (if `in_context_mode` is True)
- Further configurations can be tuned in `eval_config.yaml` to specify `world_size`, embedding model, backbone LLM, etc. Supports commercial API calls or Hugging Face for local models.
Run evaluation as follows:

```shell
python run_evaluation.py
```

To ensure better reproducibility, we provide a notebook at rag_pipeline/log_results_checking.ipynb so you can quickly check your own experiment results.
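The aggregated accuracy score the evaluation reports boils down to the fraction of benchmark questions answered correctly. A minimal sketch with hypothetical prediction records (the `pred`/`gold` field names are our own, not the script's internal format):

```python
def accuracy(records: list[dict]) -> float:
    """Fraction of questions where the predicted choice matches the gold label."""
    correct = sum(1 for r in records if r["pred"] == r["gold"])
    return correct / len(records)

# Hypothetical per-question results from a multiple-choice benchmark run.
results = [
    {"pred": "B", "gold": "B"},
    {"pred": "A", "gold": "C"},
    {"pred": "D", "gold": "D"},
    {"pred": "C", "gold": "C"},
]
print(f"accuracy = {accuracy(results):.2f}")  # → accuracy = 0.75
```

Comparing this number with and without `in_context_mode` shows the accuracy gain contributed by MIRIAD retrieval.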
Hope you find this project helpful! Kindly cite our paper:
```bibtex
@misc{zheng2025miriadaugmentingllmsmillions,
      title={MIRIAD: Augmenting LLMs with millions of medical query-response pairs},
      author={Qinyue Zheng and Salman Abdullah and Sam Rawal and Cyril Zakka and Sophie Ostmeier and Maximilian Purk and Eduardo Reis and Eric J. Topol and Jure Leskovec and Michael Moor},
      year={2025},
      eprint={2506.06091},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2506.06091},
}
```

For detailed licensing of the dataset, please refer to this page.
All code in this repository is released under the MIT License (see LICENSE).
