Clinical Pathways Extraction Pipeline

Overview

This project automates the extraction and structuring of clinical pathway information from public VA Cancer Clinical Pathways PDFs. The pipeline converts PDFs to standardized images, uses Claude AI to extract and structure the clinical decision flows, and creates optimized summaries for patient matching algorithms.

Purpose

Clinical pathways standardize evidence-based practices to ensure high-quality, cost-effective care for patients. This tool enables rapid extraction of this information for further analysis, comparison, and integration with clinical systems, with the ultimate goal of matching patients to the most appropriate clinical pathways.

Prerequisites

Anaconda or Miniconda installed
Internet connection for API calls to Claude
API key for Anthropic's Claude

Installation

Clone this repository or download the scripts to your local machine
Run the environment setup script:
```
chmod +x condaenv.sh
./condaenv.sh
```
Activate the conda environment:
```
conda activate clinical_pathways
```

Configuration

On first run, you'll be prompted to enter your Claude API key, which will be stored in a config.ini file for future use.

Workflow

1. Download Clinical Pathway PDFs

PDFs are sourced from the VA's public clinical pathways website: https://www.cancer.va.gov/clinical-pathways.html

Place all downloaded PDFs in a folder named pdfs in the project directory.

2. Convert PDFs to Images

python convert_pdfs.py

This script:

Processes each PDF in the pdfs folder
Resizes each page to 1280px width
Crops to 648px height to focus on the important content
Saves images in a structured format: ripimg/[pdf_name]/pg[page_number].png

3. Extract Clinical Pathway Information

python extract_pathways.py

This script:

Processes each PDF's images
Skips title slides (page 1)
Uses Claude AI to analyze each page and extract structured information
Generates an initial summary of the clinical pathway
Saves all extracted data as JSON files in the extracted_pathways folder

Output Format

The extraction produces JSON files with the following structure:

{
  "pathway_name": "cancer_type",
  "processed_at": "timestamp",
  "responses": [
    {
      "page": 2,
      "image_file": "pg2.png",
      "response": "structured clinical pathway text",
      "thinking": "Claude's analysis process"
    },
    ...,
    {
      "page": "summary",
      "response": "comprehensive pathway summary",
      "thinking": "Claude's synthesis process"
    }
  ]
}

4. Generate Complete Summaries

python complete_summary.py

This script:

Takes the previously extracted page analyses from each pathway
Processes the full content (without truncation) when possible
Creates comprehensive summaries that include all critical pathway elements
Logs any pathways that exceed token limits for later processing
Saves detailed summaries to the complete_summaries folder

5. Create Matching-Optimized Summaries

python matching_summary.py

This script:

Condenses each complete pathway summary into approximately 400 words
Focuses specifically on information needed for patient matching:
- Key diagnostic tests
- Specific medical conditions and criteria
- Relevant biomarkers and classifications
- Essential treatments and medications
Saves these optimized summaries in both JSON and plain text formats
Creates a consolidated file with all pathway summaries
Output is stored in the matching_summaries folder

6. HPC Integration (External)

The matching summaries are designed to be imported into an HPC environment where:

Patient medical records are summarized
LLaMA or other models compare patient summaries against pathway summaries
The most appropriate clinical pathway is identified

Output Files

PDF Conversion

ripimg/[pdf_name]/pg[number].png: Individual page images

Initial Extraction

extracted_pathways/[pdf_name]_extracted.json: Structured page-by-page analyses with initial summary

Complete Summaries

complete_summaries/[pdf_name]_complete_summary.json: Comprehensive pathway summaries

Matching Summaries

matching_summaries/[pdf_name]_matching.json: Condensed 400-word summaries in JSON format
matching_summaries/[pdf_name]_matching.txt: Plain text summaries
matching_summaries/all_pathway_summaries.txt: Consolidated file with all pathway summaries

File Descriptions

condaenv.sh: Creates and configures the conda environment
convert_pdfs.py: Converts PDFs to properly sized images
extract_pathways.py: Analyzes images and extracts structured pathway data
complete_summary.py: Generates comprehensive summaries from extracted data
matching_summary.py: Creates optimized summaries for patient matching
config.ini: Stores API key and configuration parameters

Troubleshooting

PDF Conversion Issues

Ensure you have installed poppler via conda as specified in the setup script
Verify PDF files aren't password-protected or corrupted

API Errors

Verify your API key is correct
Check your network connection
Ensure your account has sufficient API credits

Token Limits

If complete_summary.py reports pathways that need truncation, you may need to:
- Process those pathways in chunks
- Adjust the API configuration for larger token limits
- Simplify the system prompt to save tokens

Limitations and Considerations

The extraction quality depends on the clarity and structure of the source PDFs
Large or complex PDFs may require adjustments to the image sizing parameters
API rate limits may affect processing of large batches
The 400-word matching summaries are optimized for LLaMA's context window limitations

Next Steps

The extracted and condensed pathway data can be used to:

Match patients to appropriate clinical pathways based on their medical records
Create interactive visualizations of clinical decision trees
Compare treatment approaches across different conditions
Generate patient education materials
Support clinical decision-making in healthcare environments

License

This project is intended for research and educational purposes. Clinical pathways should always be verified by qualified medical professionals before clinical application.

Name		Name	Last commit message	Last commit date
Latest commit History 3 Commits
complete_summaries		complete_summaries
extracted_pathways		extracted_pathways
matching_summaries		matching_summaries
.gitignore		.gitignore
LICENSE		LICENSE
README.md		README.md
complete_summary.py		complete_summary.py
condaenv.sh		condaenv.sh
convert_pdfs.py		convert_pdfs.py
extract_pathway.py		extract_pathway.py
matching_summaries.py		matching_summaries.py
pdfs_processed.png		pdfs_processed.png

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Repository files navigation

Clinical Pathways Extraction Pipeline

Overview

Purpose

Prerequisites

Installation

Configuration

Workflow

1. Download Clinical Pathway PDFs

2. Convert PDFs to Images

3. Extract Clinical Pathway Information

Output Format

4. Generate Complete Summaries

5. Create Matching-Optimized Summaries

6. HPC Integration (External)

Output Files

PDF Conversion

Initial Extraction

Complete Summaries

Matching Summaries

File Descriptions

Troubleshooting

PDF Conversion Issues

API Errors

Token Limits

Limitations and Considerations

Next Steps

License

About

Uh oh!

Releases

Packages

Languages

License

mikeS141618/Clinical-Pathways-Extraction-Pipeline

Folders and files

Latest commit

History

Repository files navigation

Clinical Pathways Extraction Pipeline

Overview

Purpose

Prerequisites

Installation

Configuration

Workflow

1. Download Clinical Pathway PDFs

2. Convert PDFs to Images

3. Extract Clinical Pathway Information

Output Format

4. Generate Complete Summaries

5. Create Matching-Optimized Summaries

6. HPC Integration (External)

Output Files

PDF Conversion

Initial Extraction

Complete Summaries

Matching Summaries

File Descriptions

Troubleshooting

PDF Conversion Issues

API Errors

Token Limits

Limitations and Considerations

Next Steps

License

About

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages