ncbi-nlp
diff --git a/README.md b/README.md
Lines changed: 54 additions & 0 deletions
diff --git a/ed_evaluation/README.md b/ed_evaluation/README.md
Lines changed: 41 additions & 0 deletions
diff --git a/ed_evaluation/dataset/__init__.py b/ed_evaluation/dataset/__init__.py
diff --git a/ed_evaluation/results/__init__.py b/ed_evaluation/results/__init__.py
diff --git a/ed_evaluation/step1_selecting_calcs.py b/ed_evaluation/step1_selecting_calcs.py
Lines changed: 64 additions & 0 deletions
diff --git a/ed_evaluation/step2_using_calcs.py b/ed_evaluation/step2_using_calcs.py
Lines changed: 164 additions & 0 deletions
diff --git a/ed_evaluation/step3_ranking_results.py b/ed_evaluation/step3_ranking_results.py
Lines changed: 67 additions & 0 deletions
@@ -0,0 +1,54 @@
+# AgentMD: Empowering Language Agents for Risk Prediction with Large-Scale Clinical Tool Learning
+
+## Introduction
+
+**Background**: Clinical calculators play a vital role in healthcare by offering accurate evidence-based predictions for various purposes, such as diagnosis and prognosis. Nevertheless, their widespread utilization is often hindered by usability challenges and poor dissemination. Augmenting large language models (LLMs) with extensive collections of clinical calculators presents an opportunity to overcome these obstacles and improve workflow efficiency, but the scalability of manual curation and machine adoption poses a significant challenge. 
+
+**Methods**: We introduce AgentMD, a novel LLM-based autonomous agent capable of curating and applying calculators across various clinical contexts. Using the medical literature, AgentMD first curates a diverse set of clinical calculators with executable functions as an automatic tool maker. With the curated tools, AgentMD autonomously selects and applies the relevant clinical calculators to any given patient as a tool user. 
+
+**Findings**: As a tool maker, AgentMD has curated RiskCalcs, a large collection of 2,164 diverse clinical calculators with over 85% accuracy for quality checks and over 90% pass rate for unit tests. With regard to its tool-using capability, AgentMD significantly outperforms chain-of-thought prompting with GPT-4 (87.7% vs. 40.9% accuracy) on RiskQA, an end-to-end benchmark manually annotated in this work. Further evaluation results on 698 emergency department admission notes confirm that AgentMD accurately computes medical risks with real-world patient data at an individual level. Moreover, AgentMD can provide population-level insights for institutional risk management, as demonstrated using 9,822 patient notes from MIMIC-III.
+
+**Interpretation**: Our study illustrates the exceptional capabilities of language agents to learn clinical calculators and to further utilize curated calculators for both individual patient care and at-scale healthcare analytics.
+
+## Configuration
+
+To run AgentMD, one first needs to set up the OpenAI API, either directly through OpenAI or through Microsoft Azure. Here we use Microsoft Azure because it is compliant with the Health Insurance Portability and Accountability Act (HIPAA); the scripts (e.g., those in `./ed_evaluation`) call it through the `AzureOpenAI` client of the `openai` Python package. Please set the environment variables accordingly:
+
+```bash
+export OPENAI_ENDPOINT=YOUR_AZURE_OPENAI_ENDPOINT_URL
+export OPENAI_API_KEY=YOUR_AZURE_OPENAI_API_KEY
+```
+
+The code has been tested with Python 3.9.13 using CentOS Linux release 7.9.2009 (Core). Please install the required Python packages by:
+
+```bash
+pip install -r requirements.txt
+```
+
+## Instructions
+
+This repository contains the evaluation scripts for three use cases of AgentMD:
+
+- **Evaluation on the RiskQA dataset**. The RiskQA dataset, the RiskCalcs toolkit, and the evaluation code are available under `./riskqa_evaluation`. Please follow the instructions in `./riskqa_evaluation/README.md` to use AgentMD.
+- **Evaluation with ED patients**. We put the AgentMD code for our experiments with Yale ED provider notes under `./ed_evaluation`. However, due to privacy concerns, we are not able to release the ED provider notes from Yale Medicine. As such, users would need to use their own clinical notes to run it.
+- **Evaluation with MIMIC patients**. The preprocessing code for MIMIC-III notes as well as the AgentMD code are available under `./mimic_evaluation`. To run the code, one first needs to download and preprocess the MIMIC-III dataset, following the instructions in `./mimic_evaluation`.
+
+## Acknowledgments
+
+This research was supported by the NIH Intramural Research Program, National Library of Medicine, and 1K99LM014024.
+
+## Disclaimer
+
+This tool shows the results of research conducted in the Computational Biology Branch, NCBI/NLM. The information produced on this website is not intended for direct diagnostic use or medical decision-making without review and oversight by a clinical professional. Individuals should not change their health behavior solely on the basis of information produced on this website. NIH does not independently verify the validity or utility of the information produced by this tool. If you have questions about the information produced on this website, please see a health care professional. More information about NCBI's disclaimer policy is available.
+
+## Citation
+
+If you find this repo helpful, please cite AgentMD as:
+```bibtex
+@article{jin2024agentmd,
+  title={AgentMD: Empowering Language Agents for Risk Prediction with Large-Scale Clinical Tool Learning},
+  author={Jin, Qiao and Wang, Zhizheng and Yang, Yifan and Zhu, Qingqing and Wright, Donald and Huang, Thomas and Wilbur, W John and He, Zhe and Taylor, Andrew and Chen, Qingyu and others},
+  journal={arXiv preprint arXiv:2402.13225},
+  year={2024}
+}
+```
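The scripts read the two variables from the Configuration section with `os.getenv`, so a missing variable only surfaces later as a client error. A pre-flight check like the one below can catch this earlier. It is a hypothetical helper, not part of the repository; only the variable names `OPENAI_ENDPOINT` and `OPENAI_API_KEY` come from the README.

```python
import os

# Variable names taken from the Configuration section; the helper itself is hypothetical.
REQUIRED_VARS = ("OPENAI_ENDPOINT", "OPENAI_API_KEY")


def missing_openai_settings(env=None):
    """Return the required variable names that are unset or empty in `env` (defaults to os.environ)."""
    env = os.environ if env is None else env
    return [name for name in REQUIRED_VARS if not env.get(name)]


missing = missing_openai_settings()
if missing:
    print("Please set:", ", ".join(missing))
else:
    print("Azure OpenAI settings found.")
```

Passing a plain dict as `env` makes the check easy to test without touching the real environment.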
@@ -0,0 +1,41 @@
+## Data
+
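+Due to privacy concerns, we cannot release the emergency department notes from Yale Medicine. This directory provides code that runs on user-provided clinical notes from emergency care. The scripts read the notes from `dataset/notes_deidentified_verified.csv`, using the `PAT_ENC_CSN_ID` column as the note ID and the `deid_text_combined` column as the note text, and read the calculator descriptions from `tools/calc_desc.txt` and `tools/id2calculator.json`. Each script takes the model (the Azure OpenAI deployment name, e.g., `gpt-4-32k`) as its only command-line argument and writes its outputs to `results/`. Since all paths are relative, run the scripts from this directory.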
+
+## Step 1: AgentMD tool selection
+
+```bash
+python step1_selecting_calcs.py gpt-4-32k
+```
+
+## Step 2: AgentMD tool computation
+
+```bash
+python step2_using_calcs.py gpt-4-32k
+```
+
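+## Step 3: AgentMD tool summarization and scoring
+
+Step 3 reads `results/calc2results_{model}.json`, which maps each calculator ID to the patient IDs and computed summaries for that calculator. Step 2 does not write this file (it stores results per patient in `results/patient_results_{model}.json`), so build that mapping from the step 2 output first. Then run:
+
+```bash
+python step3_ranking_results.py gpt-4-32k
+```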
+
+## Acknowledgments
+
+This research was supported by the NIH Intramural Research Program, National Library of Medicine, and 1K99LM014024.
+
+## Disclaimer
+
+This tool shows the results of research conducted in the Computational Biology Branch, NCBI/NLM. The information produced on this website is not intended for direct diagnostic use or medical decision-making without review and oversight by a clinical professional. Individuals should not change their health behavior solely on the basis of information produced on this website. NIH does not independently verify the validity or utility of the information produced by this tool. If you have questions about the information produced on this website, please see a health care professional. More information about NCBI's disclaimer policy is available.
+
+## Citation
+
+If you find this repo helpful, please cite AgentMD as:
+```bibtex
+@article{jin2024agentmd,
+  title={AgentMD: Empowering Language Agents for Risk Prediction with Large-Scale Clinical Tool Learning},
+  author={Jin, Qiao and Wang, Zhizheng and Yang, Yifan and Zhu, Qingqing and Wright, Donald and Huang, Thomas and Wilbur, W John and He, Zhe and Taylor, Andrew and Chen, Qingyu and others},
+  journal={arXiv preprint arXiv:2402.13225},
+  year={2024}
+}
+```
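Because the Yale notes are not released, users need to supply their own notes in the layout the scripts read: a CSV at `dataset/notes_deidentified_verified.csv` with a `PAT_ENC_CSN_ID` column (note ID) and a `deid_text_combined` column (note text). The sketch below writes such a file with the standard-library `csv` module. `write_notes_csv` is a hypothetical helper, and it assumes the notes are already de-identified.

```python
import csv
import os
import tempfile

# Column names read by step1_selecting_calcs.py and step2_using_calcs.py.
ID_COLUMN = "PAT_ENC_CSN_ID"
TEXT_COLUMN = "deid_text_combined"


def write_notes_csv(notes, path="dataset/notes_deidentified_verified.csv"):
    """Write (note_id, note_text) pairs to `path` in the expected two-column layout.

    Hypothetical helper: the texts are assumed to be de-identified already.
    """
    os.makedirs(os.path.dirname(path) or ".", exist_ok=True)
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=[ID_COLUMN, TEXT_COLUMN])
        writer.writeheader()
        for note_id, text in notes:
            # csv quotes fields containing commas or newlines, so multi-line notes survive
            writer.writerow({ID_COLUMN: note_id, TEXT_COLUMN: text})
    return path


# Demo run into a temporary directory; use the default path for real data.
demo_path = os.path.join(tempfile.mkdtemp(), "notes.csv")
write_notes_csv([(1001, "Placeholder note text,\nspanning two lines.")], demo_path)
```

Quoted multi-line fields are read back as single cells by `pandas.read_csv`, which is what the scripts use to load the file.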
@@ -0,0 +1,64 @@
+__author__ = "qiao"
+
+"""
+select the calculators to use for each patient
+"""
+
+import json
+import os
+from openai import AzureOpenAI
+import pandas as pd
+
+import sys
+
+client = AzureOpenAI(
+	api_version="2023-09-01-preview",
+	azure_endpoint=os.getenv("OPENAI_ENDPOINT"),
+	api_key=os.getenv("OPENAI_API_KEY"),
+)
+
+if __name__ == "__main__":
+
+	model = sys.argv[1]
+
+	# loading cached results
+	output_path = f"results/{model}_calc_selections.json"
+
+	if os.path.exists(output_path):
+		outputs = json.load(open(output_path))
+	else:
+		outputs = {}
+	
+	system = "You are a helpful assistant and your task is to select the calculators that a given patient is eligible for. Here are the candidate calculators:\n"
+	system += open("tools/calc_desc.txt", "r").read()
+	system += 'Please first explain what calculators a given patient is eligible for, and then output the list of calculator IDs. Please output a JSON dict formatted as Dict{"explanation": Str(explanation), "calculators": List[Int(ID)]}. Please be strict.'
+
+	notes = pd.read_csv("dataset/notes_deidentified_verified.csv")
+
+	for _, row in notes.iterrows():
+		note_id = str(row["PAT_ENC_CSN_ID"])  # JSON keys are strings; keeps cache lookups and step 2's str() keys consistent
+
+		if note_id in outputs:
+			continue
+
+		note = row["deid_text_combined"]
+		prompt = "Here is the patient note:\n"
+		prompt += note + "\n"
+		prompt += "Output in JSON: "
+
+		messages = [
+			{"role": "system", "content": system},
+			{"role": "user", "content": prompt}
+		]
+
+		response = client.chat.completions.create(
+			model=model,
+			messages=messages,
+			temperature=0,
+		)
+
+		output = response.choices[0].message.content
+		outputs[note_id] = output 
+
+		with open(output_path, "w") as f:
+			json.dump(outputs, f, indent=4)
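Step 2 parses each stored selection as strict JSON and skips the patient whenever parsing fails, for example if the model wraps its JSON dict in a Markdown `json` code fence. A more tolerant parser could look like the sketch below. `parse_calc_selection` is hypothetical, and unwrapping a code fence is an assumption about how the model might format its reply, not something the repository does.

```python
import json
import re


def parse_calc_selection(raw):
    """Return calculator IDs (as strings) from a step-1 reply, or [] if it cannot be parsed."""
    text = (raw or "").strip()
    # Assumption: the JSON dict may be wrapped in a ```json ... ``` fence.
    fenced = re.search(r"```(?:json)?\s*(.*?)```", text, re.DOTALL)
    if fenced:
        text = fenced.group(1).strip()
    try:
        selection = json.loads(text)
    except json.JSONDecodeError:
        return []
    if not isinstance(selection, dict):
        return []
    calculators = selection.get("calculators", [])
    if not isinstance(calculators, list):
        return []
    # Cast to str, as step 2 does before looking IDs up in tools/id2calculator.json.
    return [str(calc_id) for calc_id in calculators]
```

Returning an empty list instead of raising keeps the caller's loop simple: a patient with an unparseable selection just gets no calculators applied.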
@@ -0,0 +1,164 @@
+__author__ = "qiao"
+
+"""
+Apply the selected tools to each patient with AgentMD.
+"""
+
+import json
+import contextlib
+import os
+import io
+import re
+import traceback
+import sys
+import pandas as pd
+
+from openai import AzureOpenAI
+
+client = AzureOpenAI(
+	api_version="2023-09-01-preview",
+	azure_endpoint=os.getenv("OPENAI_ENDPOINT"),
+	api_key=os.getenv("OPENAI_API_KEY"),
+)
+
+def capture_exec_output_and_errors(code):
+	"""
+	Executes the given code and captures its printed output and any error messages.
+
+	Parameters:
+	code (str): The Python code to execute.
+
+	Returns:
+	str: The captured output and error messages of the executed code.
+	"""
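+	# fresh namespace for each run (named to avoid shadowing the built-in globals())
+	exec_globals = {}
+
+	with io.StringIO() as buffer, contextlib.redirect_stdout(buffer), contextlib.redirect_stderr(buffer):
+		try:
+			exec(code, exec_globals)
+		except (Exception, SystemExit):
+			# Print the traceback to the buffer so the model can see and fix the error;
+			# SystemExit is caught too in case the generated code calls sys.exit() or exit()
+			traceback.print_exc()
+
+		return buffer.getvalue()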
+
+
+def extract_python_code(text):
+	pattern = r"```python\n(.*?)```"
+	matches = re.findall(pattern, text, re.DOTALL)
+	return "\n".join(matches)
+
+
+def apply_calc(patient, calc, model):
+	"""apply the calculator to a specific patient"""
+	system = "You are a helpful assistant. Your task is to apply a medical calculator to a patient and interpret the result. You can write Python scripts and the user will execute them for you. The Python function in the calculator is already in the environment, and you can re-use it or revise it if there is a bug. Your responses will be used for research purposes only. Please start with \"Summary: \" to summarize the messages in one paragraph if you have finished the task. Please make sure to include the raw results of the calculator in the summary."
+
+	prompt = "Here is the calculator:\n"
+	prompt += calc + "\n"
+	prompt += "Here is the patient information:\n"	
+	prompt += patient + "\n"
+	prompt += "Please apply this calculator to the patient. If there are missing values, please make a range estimation based on best and worst case scenarios inferred from the calculator computing logics. Please write Python scripts and print the results to help the computation. I will provide the stdout to you."
+
+	prompt_code = extract_python_code(prompt)
+
+	messages = [
+		{"role": "system", "content": system},
+		{"role": "user", "content": prompt},
+	]
+	
+	n = 0
+
+	while True:
+
+		response = client.chat.completions.create(
+			model=model,
+			messages=messages,
+			temperature=0,
+		)
+
+		output = response.choices[0].message
+
+		n += 1
+		print(f"Round {n}")
+		print(output.content)
+
+		message = output 
+		messages.append({"role": "assistant", "content": message.content})
+
+		if "Summary: " in message.content:
+			return messages, message.content.split("Summary: ")[-1]
+		
+		else:
+			message_code = extract_python_code(message.content)
+
+			if message_code:
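+				# prepend the calculator's own code so its function is defined; the newline keeps the two blocks separate
+				code = prompt_code + "\n" + message_code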
+				output = "I have executed your code, and the output is:\n" + capture_exec_output_and_errors(code)
+			
+			else:
+				output = "If you have successfully applied the calculator, please start a new message with \"Summary: \"."
+
+			messages.append(
+				{"role": "user", "content": output}
+			)
+
+		if n >= 20:
+			return messages, "Failed"
+
+
+if __name__ == "__main__":
+	# gpt-4-32k or gpt-35-turbo-16k
+	model = sys.argv[1]
+
+	# loading the cached results
+	output_path = f"results/patient_results_{model}.json"
+
+	if os.path.exists(output_path):
+		output = json.load(open(output_path))
+	else:
+		output = {}
+	
+	# tool selection results
+	patient_tools = json.load(open(f"results/{model}_calc_selections.json"))
+
+	id2calc_txt = json.load(open("tools/id2calculator.json"))
+
+	# loading the patient dataset
+	notes = pd.read_csv("dataset/notes_deidentified_verified.csv")
+
+	for _, row in notes.iterrows():
+		patient_id = str(row["PAT_ENC_CSN_ID"])
+		note = row["deid_text_combined"]
+
+		if patient_id not in output:
+			output[patient_id] = {}
+
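+		# skip patients that have no tool-selection result from step 1
+		if patient_id not in patient_tools:
+			continue
+
+		try:
+			tools = json.loads(patient_tools[patient_id])["calculators"]
+		except (json.JSONDecodeError, KeyError, TypeError):
+			# the selection was not a JSON dict with a "calculators" list
+			continue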
+
+		for tool in tools:
+			tool = str(tool)
+
+			# already cached
+			if tool in output[patient_id]:
+				continue
+			
+			# errors in the selection step
+			if tool not in id2calc_txt:
+				continue
+
+			calc_text = id2calc_txt[tool]
+
+			try:
+				messages, summary = apply_calc(note, calc_text, model)
+			except Exception:
+				# e.g., API errors: skip this calculator; it is not cached, so the next run retries it
+				continue
+
+			output[patient_id][tool] = [summary, messages]
+
+			with open(output_path, "w") as f:
+				json.dump(output, f, indent=4)
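Step 2 stores results as `{patient_id: {calc_id: [summary, messages]}}` in `results/patient_results_{model}.json`, while step 3 iterates over `results/calc2results_{model}.json` as `{calc_id: {patient_id: summary}}`. The conversion script is not part of this diff; the sketch below is one way to bridge the two files, assuming step 3 needs only the summary string. `to_calc2results` and `convert` are hypothetical names, and skipping runs that returned `"Failed"` is a design choice of this sketch.

```python
import json


def to_calc2results(patient_results):
    """Invert {patient_id: {calc_id: [summary, messages]}} into {calc_id: {patient_id: summary}}.

    Assumption: step 3 only needs the summary string, so the message history is dropped.
    Design choice: runs that returned "Failed" (no summary within 20 rounds) are skipped.
    """
    calc2results = {}
    for patient_id, calc_outputs in patient_results.items():
        for calc_id, (summary, _messages) in calc_outputs.items():
            if summary == "Failed":
                continue
            calc2results.setdefault(calc_id, {})[patient_id] = summary
    return calc2results


def convert(model):
    """Read step 2's output for `model` and write the file step 3 reads (paths taken from the scripts)."""
    with open(f"results/patient_results_{model}.json") as f:
        patient_results = json.load(f)
    with open(f"results/calc2results_{model}.json", "w") as f:
        json.dump(to_calc2results(patient_results), f, indent=4)
```

Calling `convert("gpt-4-32k")` from the `ed_evaluation` directory would then produce the input for step 3.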
@@ -0,0 +1,67 @@
+__author__ = "qiao"
+
+"""
+Score each calculator summary (0-100) so that patients can be ranked by urgency.
+"""
+
+import json
+import os
+from openai import AzureOpenAI
+import sys
+
+client = AzureOpenAI(
+	api_version="2023-09-01-preview",
+	azure_endpoint=os.getenv("OPENAI_ENDPOINT"),
+	api_key=os.getenv("OPENAI_API_KEY"),
+)
+
+if __name__ == "__main__":
+	# model choice
+	model = sys.argv[1]
+
+	# loading cached results
+	output_path = f"results/calc2results_score_{model}.json"
+
+	if os.path.exists(output_path):
+		outputs = json.load(open(output_path))
+	else:
+		outputs = {}
+	
+
+	# loading the calculator2results dict
+	calc2results = json.load(open(f"results/calc2results_{model}.json"))
+
+	for calc_id, pid2results in calc2results.items():
+
+		# initialize the dict if this calculator has no cached scores yet
+		if calc_id not in outputs:
+			outputs[calc_id] = {}
+
+		# pid and the computed summary
+		for pid, summary in pid2results.items():
+			
+			# already cached, then continue
+			if pid in outputs[calc_id]:
+				continue
+			
+			system = "You are a helpful medical assistant, and your task is to give an overall score (0-100) given a summary of medical calculation. Higher scores denote more urgent and severe conditions that require immediate attention. If the calculation result contains a wide range, give a low score due to its uncertainty."
+
+			prompt = f"Here is the summary: {summary}\n"
+			prompt += "Output only the overall score (0-100): "
+
+			messages = [
+				{"role": "system", "content": system},
+				{"role": "user", "content": prompt}
+			]
+
+			response = client.chat.completions.create(
+				model=model,
+				messages=messages,
+				temperature=0,
+			)
+
+			output = response.choices[0].message.content
+			outputs[calc_id][pid] = output
+
+			with open(output_path, "w") as f:
+				json.dump(outputs, f, indent=4)
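Step 3 stores each score as the model's raw reply string, so ranking patients still requires turning those strings into numbers. The sketch below shows one way to do that for a dict shaped like `results/calc2results_score_{model}.json`. `parse_score` and `rank_patients` are hypothetical helpers; taking the first integer in the reply and discarding values outside 0-100 are assumptions about the reply format.

```python
import re


def parse_score(raw):
    """Return the first integer in the reply if it lies in 0-100, else None."""
    match = re.search(r"\d+", raw or "")
    if match is None:
        return None
    score = int(match.group())
    return score if 0 <= score <= 100 else None


def rank_patients(calc2scores):
    """Map {calc_id: {patient_id: raw_reply}} to {calc_id: [(patient_id, score), ...]}, highest score first."""
    ranking = {}
    for calc_id, pid2raw in calc2scores.items():
        scored = [(pid, parse_score(raw)) for pid, raw in pid2raw.items()]
        # drop replies without a usable score
        scored = [(pid, score) for pid, score in scored if score is not None]
        ranking[calc_id] = sorted(scored, key=lambda item: item[1], reverse=True)
    return ranking
```

Because the system prompt asks for low scores when a result spans a wide range, patients with uncertain calculations tend to sit lower in each calculator's ranking.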