Prompts 2 Table

Welcome to the Prompts 2 Table code repo! It contains some code to get started using this workflow for information extraction from medical texts. Check out the pre-print here.

And the interactive docs page is here

Right now I only have the Quick Start guide up. If you are interested in using this code, feel free to open an issue or send me an email at david.hein@utsouthwestern.edu. As mentioned in the pre-print, this specific workflow is not intended to be the "primary product"; instead, we point readers to our tables (in both the main text and supplement), which introduce higher-level considerations for using LLMs for clinical information extraction.

Getting started

NOTE: In .vscode/settings.json I have the word wrap for .json files turned on. This makes editing the schemas easier. There are also some recommended plugins in .vscode/extensions.json

Setting up the env: This project uses uv to set up the environment; see the docs for uv here. The venv can be created with uv sync followed by source .venv/bin/activate.
Adding connections: First, check out the example.env and create your own .env so that LLM connections are available for Prompt flow to use.
Adding data: Data can be either a CSV file with the columns report_id and report_text or a JSONL file with those same keys. See the example JSONL data and example CSV data.
Running a batch: The example_workbook walks through running a batch of data through the pipeline.
Modifying a schema: The schema can be modified to use different sets of labels and instructions. When adding a new entity, first determine what entity type it is (see below), then add it under the key for that entity type, along with the required fields. See schema.py for info on the required keys for each entity type. Three example schemas are also included.
Modifying prompts: If the prompt templates need changes, they can be found in the Jinja templates for the respective entity type (see below).
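As a minimal sketch of the two input formats described above, the snippet below writes the same two made-up reports to a CSV file and a JSONL file using only the Python standard library. The file names, helper names, and report contents are hypothetical; only the report_id and report_text keys come from this README.

```python
import csv
import json
import tempfile
from pathlib import Path

# Made-up example reports; only the two key names come from the README.
reports = [
    {"report_id": "R001", "report_text": "Specimen A: kidney, left, biopsy."},
    {"report_id": "R002", "report_text": "Specimen A: renal mass, resection."},
]


def write_csv(rows, path):
    """Write reports to a CSV file with report_id and report_text columns."""
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=["report_id", "report_text"])
        writer.writeheader()
        writer.writerows(rows)


def read_csv(path):
    """Read the CSV back into a list of plain dicts."""
    with open(path, newline="", encoding="utf-8") as f:
        return [dict(row) for row in csv.DictReader(f)]


def write_jsonl(rows, path):
    """Write reports to a JSONL file, one JSON object per line."""
    with open(path, "w", encoding="utf-8") as f:
        for row in rows:
            f.write(json.dumps(row) + "\n")


def read_jsonl(path):
    """Read a JSONL file back into a list of dicts, skipping blank lines."""
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]


# Write both formats to a temporary directory (hypothetical file names).
out_dir = Path(tempfile.mkdtemp())
write_csv(reports, out_dir / "reports.csv")
write_jsonl(reports, out_dir / "reports.jsonl")
```

Either file should then contain the same records; the repo's example data files remain the authoritative reference for the expected layout.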

Tips for editing the schema & prompts

The prompt templates should contain abstract instructions relevant to all entities of that class. Ideally, the examples given in them should not contain labels actually found in the schema, to avoid biasing generation. The templates all follow a markdown document format. They have not had major changes since about August 2024; given evolving LLM capabilities, they could probably be shortened, since models are now much more capable of producing properly structured JSON outputs. There are also new methods for providing a template for generation that could be very useful, but these would need to be integrated into the DAG in a way that works with both vLLM and Azure OpenAI.
The schema is where entity-specific instructions can go, in the fields for segmentation and standardization instructions.
Use of the structured vocabulary for IHC results is currently explained mostly in the entity-specific instructions in the kidney template. Moving forward, we may move these to the Jinja prompt template, to stay consistent with keeping general instructions in the templates and specific instructions in the schema.

Tips for inference

I've found that managing connections manually with a .env file is easier to work with than adding them through the Prompt flow VS Code plugin.
To reduce token usage and improve performance, preprocessing the raw report text can be helpful, e.g. removing disclaimers and MD signatures.
The flows are set up to use vLLM for inference with open-weight models. For these, a context window of about 6000-8000 tokens is needed, especially if reports are long. Strong performance was found with FP8 quantized models, so their use is encouraged to increase total throughput. Since large portions of the prompts are reused, enabling automatic prefix caching can also be helpful.
I typically use a temperature of 0, so this is hardcoded, but it can be modified in the flow YAMLs.
Spell-checking plugins for VS Code are useful for ensuring typos are not present in the schema.
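The preprocessing tip above could look something like the following sketch. It is not part of this repo: the function name and both regular expressions are hypothetical, and real disclaimer and signature lines vary by institution, so the patterns would need to be adapted to your own reports.

```python
import re

# Hypothetical patterns for illustration; adapt them to your own report format.
BOILERPLATE_PATTERNS = [
    re.compile(r"^[ \t]*DISCLAIMER:.*$", re.IGNORECASE | re.MULTILINE),
    re.compile(r"^[ \t]*Electronically signed by.*$", re.IGNORECASE | re.MULTILINE),
]


def strip_boilerplate(report_text: str) -> str:
    """Blank out lines matching the boilerplate patterns, then tidy whitespace."""
    cleaned = report_text
    for pattern in BOILERPLATE_PATTERNS:
        cleaned = pattern.sub("", cleaned)
    # Collapse runs of three or more newlines left behind by removed lines.
    cleaned = re.sub(r"\n{3,}", "\n\n", cleaned)
    return cleaned.strip()


# Made-up report text with a disclaimer line and a signature line.
raw = (
    "FINAL DIAGNOSIS: Clear cell renal cell carcinoma.\n"
    "Disclaimer: This test was developed by the laboratory.\n"
    "Electronically signed by Jane Doe, MD\n"
)
print(strip_boilerplate(raw))
```

Running the cleaning step once before a batch keeps the shortened text identical across the segmentation and standardization calls.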

Prompt flows

The app directory contains three subdirectories, each representing a specific type of data extraction flow:

feature_report_flow Entities with one label per report
feature_specimen_flow Entities with one label per specimen
panel_specimen_flow Entities for which a panel of tests exists (like IHC/FISH), where we want the specimen, block, test name, and test result for all instances in the report

Each of these subdirectories contains similar files and follows a consistent structure for defining and executing a Prompt flow.

Prompt flow Structure

A typical Prompt flow consists of the following files:

load_[type]_[coverage].py: This file contains code to load and prepare the input data for the Prompt flow.
segment_[type]_[coverage].jinja2: This file contains a Jinja template for the prompt used in the segmentation step of the Prompt flow. The template includes placeholders for inserting specific information such as labels and custom instructions. The purpose of this LLM call is to segment the relevant text out of the report.
standardize_[type]_[coverage].jinja2: This file contains a Jinja template for the prompt used in the standardization step of the Prompt flow. Like the segmentation template, it includes placeholders for specific information. The purpose of this LLM call is to produce standardized data and labels from the segmented text.
build_output_[type]_[coverage].py: This file contains code to combine the outputs from the previous steps, add metadata, and return the final output of the Prompt flow.
flow.dag.yaml: This file defines the structure and configuration of the Prompt flow. It specifies the inputs, outputs, and nodes of the flow, along with their respective sources, inputs, and connections. If you have the Prompt flow VS Code tool installed, you can open these files in a visual editor for a clear view of the inputs and outputs of each node, the LLM connections being used and their settings, and the DAG itself.
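The load, segment, standardize, and build-output steps listed above can be pictured as a small pipeline. The sketch below is an illustration under stated assumptions, not the repo's code: every function name, the prompt wording, the output keys, and the fake_llm stub are hypothetical. In the actual flows the nodes are wired together in flow.dag.yaml and the prompts come from the Jinja templates.

```python
from typing import Callable, Dict, List


def load(report_id: str, report_text: str) -> Dict[str, str]:
    """Load step: prepare the input record."""
    return {"report_id": report_id, "report_text": report_text.strip()}


def segment(record: Dict[str, str], entity: str, llm: Callable[[str], str]) -> str:
    """Segmentation step: ask the LLM for the text relevant to one entity."""
    prompt = f"Extract the text relevant to '{entity}':\n{record['report_text']}"
    return llm(prompt)


def standardize(segmented: str, labels: List[str], llm: Callable[[str], str]) -> str:
    """Standardization step: map the segmented text onto one allowed label."""
    prompt = f"Choose one label from {labels} for this text:\n{segmented}"
    return llm(prompt)


def build_output(record: Dict[str, str], entity: str, segmented: str, label: str) -> Dict[str, str]:
    """Output step: combine the intermediate results with metadata."""
    return {
        "report_id": record["report_id"],
        "entity": entity,
        "segmented_text": segmented,
        "label": label,
    }


def fake_llm(prompt: str) -> str:
    """Stand-in for a real LLM call; returns canned answers so the sketch runs offline."""
    if prompt.startswith("Extract"):
        return "Fuhrman grade 2"
    return "Grade 2"


# Run the four steps in order on one made-up report.
record = load("R001", "  Clear cell RCC, Fuhrman grade 2.  ")
segmented = segment(record, "grade", fake_llm)
label = standardize(segmented, ["Grade 1", "Grade 2", "Grade 3"], fake_llm)
result = build_output(record, "grade", segmented, label)
print(result)
```

Splitting segmentation from standardization means the second call only sees the short segmented text rather than the whole report.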

Helper functions

Several helper functions are included for running flows, along with utilities for data validation.

schema.py Contains pydantic data models for I/O and for defining the structure of the extraction schema
get_json_outputs.py Returns a pandas Series of flow outputs, so you can look at the reasoning responses
prep_data.py Contains helpers for getting data prepared for a flow
fix_corrupted_json.py Contains helpers for fixing outputs from LLMs that may not be JSON serializable
flat_results.py Contains functions to extract and organize relevant portions of the JSON outputs into a nice table
run_pf_wrapper.py This is the main wrapper function that sets up everything for a batch flow run
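To give a sense of the kind of problem fix_corrupted_json.py addresses, here is a minimal sketch of recovering a JSON object from an LLM response that has extra text around it or a trailing comma. This is not the repo's implementation; the function name and the repair rules are assumptions chosen for illustration.

```python
import json
import re
from typing import Any, Optional


def parse_llm_json(raw: str) -> Optional[Any]:
    """Try to recover a JSON object from an LLM response; return None on failure."""
    # 1. Keep only the outermost {...} span, dropping any chatter around it.
    start, end = raw.find("{"), raw.rfind("}")
    if start == -1 or end <= start:
        return None
    candidate = raw[start : end + 1]
    # 2. Remove trailing commas before a closing brace or bracket.
    #    Limitation: this would also touch such a sequence inside a string value.
    candidate = re.sub(r",\s*([}\]])", r"\1", candidate)
    try:
        return json.loads(candidate)
    except json.JSONDecodeError:
        return None


# Made-up response with chatter before and after, plus a trailing comma.
response = (
    "Sure! Here is the result:\n"
    '{"label": "Grade 2", "reasoning": "stated in text",}\n'
    "Let me know if you need anything else."
)
parsed = parse_llm_json(response)
print(parsed)
```

Returning None instead of raising lets a batch run continue and leaves the unparseable outputs to be collected for review afterwards.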

Adding Connections

You need to define variables for each connection name you intend to use. The variables follow the pattern {CONNECTION_NAME_UPPER}_VARIABLE_NAME.

See the example.env file for documentation
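The naming pattern above can be illustrated with a short sketch that gathers every variable sharing a connection's upper-cased prefix. The helper name, the connection name my_vllm, and the suffixes API_BASE and API_KEY are all hypothetical; example.env documents the variable names the flows actually expect.

```python
import os
from typing import Dict, Mapping


def connection_settings(
    connection_name: str, environ: Mapping[str, str] = os.environ
) -> Dict[str, str]:
    """Collect variables named {CONNECTION_NAME_UPPER}_<SUFFIX>, keyed by suffix."""
    prefix = connection_name.upper() + "_"
    return {
        key[len(prefix):]: value
        for key, value in environ.items()
        if key.startswith(prefix)
    }


# Hypothetical connection name and variable suffixes, for illustration only.
example_env = {
    "MY_VLLM_API_BASE": "http://localhost:8000/v1",
    "MY_VLLM_API_KEY": "placeholder",
    "OTHER_SETTING": "ignored",
}
settings = connection_settings("my_vllm", example_env)
print(settings)
```

Passing a plain dict instead of os.environ, as done here, makes it easy to check a .env layout without touching the real environment.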

Roadmap

Adding the ability to use JSON schema for guided generation
Adding some way of detecting problematic reports and flagging for review
Checking out reasoning models
Clarifying the use of "Other- fill in the blank" label categories for helping with consistency
Improving logging
