A pipeline for automatically discovering, extracting, and structuring radiology datasets from the literature.
data/: contains the final dataset table (e.g.,radiology_db.csv)notebooks/: tutorials and exploratory data analysis notebooksscripts/: scripts for running the database building pipelineradiology_dataset_db/: source code for querying PubMed, extracting dataset metadata, and building the databasetests/: pytest-based testing suite
conda create -n vllm python=3.10 -y
conda activate vllm
pip install vllmpython -m vllm.entrypoints.openai.api_server \
--model Qwen/Qwen2.5-7B-Instruct \
--port 8001 \
--enforce-eager \
--enable-auto-tool-choice \
--tool-call-parser hermesgit clone git@github.com:pachterlab/radiology_dataset_db.git
cd radiology_dataset_db
conda create -n radiology_dataset_db python=3.10 -y
conda activate radiology_dataset_db
pip install -e .3. Rename .env_sample to .env and fill in your Entrez email and API key (optional but recommended for higher rate limits)
Modify .env and radiology_dataset_db/config.py as needed to customize PubMed query/LLM settings. Runtime defaults like modality/output paths are now configured via CLI args in scripts/build_db.py. Then run:
python scripts/build_db.py --database-modality MODALITY- Radiology:
--database-modality radiology - Single-cell RNA-seq:
--database-modality scrnaseq - Bulk genomics (e.g., bulk RNA-seq, WGS/WXS):
--database-modality bulk_genomics - Spatial transcriptomics:
--database-modality spatial_transcriptomics
To enable parallel extraction, increase --num-threads (for example --num-threads 8).
- Define new pubmed query and extraction instructions in
radiology_dataset_db/config.py - Implement new dataset schema class and extraction function in
radiology_dataset_db/extract_MODALITY_dataset_information_llm.py - Import and call the new extraction function in
scripts/build_db.pyand add a conditional to check the modality type - Optionally, update .github/workflows/update_dbs.yml to run the pipeline for the new modality on a schedule
- Optionally, add some ground truth papers to tests/conftest.py, and add to get_modality_info in tests/test_llm_output.py to check that the new extraction function is working as expected
All instructions are notaded in the code with comments like
#* add additional extraction instructions and functions for other modalities here, e.g. genomics, pathology, etc
Example codex prompt used to add a scRNA-seq dataset schema and extraction function:
Pleas write a module very similar to extract_radiology_dataset_information_llm.py called extract_scrnaseq_dataset_information_llm.py that looks for scRNA-seq/snRNA-seq data. It should look for name, num_patients, sequencing_technology (eg 10X, SMARTSEQ, Parse, etc), disease, species, tissue, cell/nuclei. Also have fields for paper_title, paper_link, paper_year etc that get populated afterwards. Add instructions in config.py, and add an extra condition to build_db.py (areas to edit are marked by "#*"). Add integration test structure in test_llm_output.py and add a placeholderground truth paper to conftest.py to test the new extraction function.
pytest
pytest -m integration
(not all tests need to pass because LLM has some randomness, but most should pass consistently)
pytest -m ""