Demystifying Large Language Models for Medicine: A Primer

Introduction

This tutorial is an extension of concepts and best practices outlined in the paper "Demystifying Large Language Models for Medicine: A Primer". Large language models (LLMs) represent a transformative class of artificial intelligence (AI) tools that can be used for a variety of tasks. Here, we provide example scripts related to a relevant healthcare task - clinical trial matching - and demonstrate important concepts including tokenization, temperature, chain-of-thought prompting, few-shot learning, retrieval-augmented generation (RAG), and data preparation for fine-tuning.

Figure 1. Overview of the proposed systematic approach to utilizing large language models in medicine.

Task formulation

When defining a medical need that may be addressed by an LLM, a user must first understand the core capabilities of LLMs. We classify LLM capabilities into five broad categories: structurization, summarization, translation, knowledge & reasoning, and multi-modal data processing.

Figure 2. An overview of five common task formulations enabled by LLMs in medicine.

Choosing a large language model

Users should choose an appropriate LLM based on the task characteristics. We categorize these characteristics into four main categories: model interface, data modality, context length, and medical capability.

Figure 3. Considerations for choosing an LLM.

A diversity of LLMs have been evaluated for their medical capability. Below, we summarize several popular LLMs.

Table 1. Characteristics of different LLMs, sorted by the best reported MedQA-USMLE (4 options) score. T: text; I: image; V: video; A: audio.

LLM	Weights	Size	Interface	Modality	Context	MedQA
o1-preview	Closed	NA	Web, API	T	128k	94.9%
o3-mini	Closed	NA	Web, API	T	200k	92.7%
DeepSeek-R1	Open	671B	Web, API, Local	T	128k	92.0%
Med-Gemini	Closed	NA	Web, API	T, I, V, A	1M, 2M	91.1%
GPT-4	Closed	NA	Web, API	T, I	8k, 32k, 128k	90.2%
Med-PaLM 2	Closed	NA	API	T	8k	86.5%
Llama 3	Open	8B, 70B, 405B	API, Local	T	8k	80.9%
GPT-3.5	Closed	NA	Web, API	T	4k, 16k	68.7%
Med-PaLM	Closed	540B	API	T	8k	67.6%
Gemini 1.0	Closed	NA	Web, API	T, I, V	32k	67.0%
Mixtral	Open	8x7B	API, Local	T	32k	64.1%
Mistral	Open	7B	API, Local	T	8k, 32k	59.6%
Llama 2	Open	7B, 70B	API, Local	T	4k	47.8%
Claude 3	Closed	NA	Web, API	T, I	200k	N/A

Prompt engineering

Once a user has formulated a task and selected an appropriate LLM, they must carefully consider the prompt (input content) given to the model. Additionally, users may consider implementing fine-tuning techniques to improve the performance of their model.

Figure 4. An overview of prompt engineering and fine-tuning techniques.

Many techniques have been used within prompt engineering. Common methods includes few-shot learning, tool learning, chain-of-thought prompting, retrieval-augmented generation, and fine tuning.

Table 2. Characteristics of different methods to use LLMs.

Method	Requirements	Pros	Cons	Examples
Few-shot learning	Several exemplars	- Dealing with edge cases - Specifying expected styles	Exemplars might introduce biases	MedPrompt
Tool learning	Application programming interfaces	Providing domain functionalities	Relies on the curation of tools	GeneGPT, EHRAgent, ChemCrow
Chain-of-thought prompting	Additional prompt text ("Let's think step-by-step.")	- Providing explanations - Improving performance	Hard to parse (mitigated by structured output)	MedPrompt
Retrieval-augmented generation	A knowledge base or document collection	- Providing up-to-date knowledge - Reducing hallucinations	Depends on the quality of the retrieved content	Almanac, MedRAG
Fine-tuning	Data annotations and compute	- Improving performance - Shorten the prompt	Costly and resource-intensive	MEDITRON, PMC-LLaMA, DRG-LLaMA

Case studies

Before exploring the example scripts, we also suggest reviewing the following studies which leverage large language models for various healthcare tasks.

Table 3. Representative case studies of utilizing large language models in medicine.

Study	Task	LLM(s)	Technique	Evaluation
Van Veen et al.	Summarization	FLAN-T5, FLAN-UL2, Alpaca, Med-Alpaca, Vicuna, Llama-2	Few-shot learning, fine-tuning	Automatic and manual evaluation of clinical summarization
Singhal et al.	Knowledge and Reasoning	PaLM, Flan-PaLM	Few-shot learning, chain-of-thought prompting, fine-tuning	MCQ evaluation and manual evaluation of question answering
Wang et al.	Structurization	Llama, ClinicalBERT	Fine-tuning	Automatic classification evaluation and manual error analysis
Mirza et al.	Translation	GPT-4	Baseline prompting	Manual evaluation of clinical translation by clinicians and legal experts
Zhang et al.	Multi-modality	BiomedGPT	Fine-tuning	MCQ evaluation and manual evaluation of visual tasks

Hands-on Tutorials

Colab Link	Title	Content
	LLM Basics	Tokenization, single-turn Prompting, multi-turn prompting, temperature
	Chain-of-Thought	Loading and processing the trial matching dataset, direct prompting baseline, chain-of-thought prompting
	Few-shot Learning	Loading and processing the trial matching dataset with few-shot demonstration selection, direct prompting baseline, few-shot prompting
	RAG	Loading and processing the trial matching dataset, direct prompting baseline, retrieval-augmented generation with PubMed API
	Fine-tuning	Loading and processing the trial matching dataset, preparing the data for fine-tuning GPT models

Acknowledgements

This work was supported by the Intramural Research Programs of the National Institutes of Health, National Library of Medicine.

Disclaimer

This tutorial shows the results of research conducted in the Division of Intramural Research, NCBI/NLM. The information produced on this website is not intended for direct diagnostic use or medical decision-making without review and oversight by a clinical professional. Individuals should not change their health behavior solely on the basis of information produced on this website. NIH does not independently verify the validity or utility of the information produced by this tutorial. If you have questions about the information produced on this website, please see a health care professional. More information about NCBI's disclaimer policy is available.

Citation

If you find our work useful, please cite it by:

@article{jin2024demystifying,
  title={Demystifying large language models for medicine: A primer},
  author={Jin, Qiao and Wan, Nicholas and Leaman, Robert and Tian, Shubo and Wang, Zhizheng and Yang, Yifan and Wang, Zifeng and Xiong, Guangzhi and Lai, Po-Ting and Zhu, Qingqing and others},
  journal={ArXiv},
  pages={arXiv--2410},
  year={2024}
}

Name		Name	Last commit message	Last commit date
Latest commit History 39 Commits
images		images
LICENSE		LICENSE
README.md		README.md

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Repository files navigation

Demystifying Large Language Models for Medicine: A Primer

Introduction

Task formulation

Choosing a large language model

Prompt engineering

Case studies

Hands-on Tutorials

Acknowledgements

Disclaimer

Citation

About

Uh oh!

Releases

Packages

Contributors 2

License

ncbi-nlp/LLM-Medicine-Primer

Folders and files

Latest commit

History

Repository files navigation

Demystifying Large Language Models for Medicine: A Primer

Introduction

Task formulation

Choosing a large language model

Prompt engineering

Case studies

Hands-on Tutorials

Acknowledgements

Disclaimer

Citation

About

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Contributors 2

Packages