This project automates the extraction and structuring of clinical pathway information from public VA Cancer Clinical Pathways PDFs. The pipeline converts PDFs to standardized images, uses Claude AI to extract and structure the clinical decision flows, and creates optimized summaries for patient matching algorithms.
Clinical pathways standardize evidence-based practices to ensure high-quality, cost-effective care for patients. This tool enables rapid extraction of this information for further analysis, comparison, and integration with clinical systems, with the ultimate goal of matching patients to the most appropriate clinical pathways.
- Anaconda or Miniconda installed
- Internet connection for API calls to Claude
- API key for Anthropic's Claude
- Clone this repository or download the scripts to your local machine
- Run the environment setup script:
  chmod +x condaenv.sh
  ./condaenv.sh
- Activate the conda environment:
  conda activate clinical_pathways
On first run, you'll be prompted to enter your Claude API key, which will be stored in a config.ini file for future use.
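The prompt-once-then-reuse behavior described above could be sketched as follows. This is a minimal illustration, not the project's actual code: the `load_api_key` helper and the `anthropic` / `api_key` section and option names are assumptions made for the example.

```python
import configparser


def load_api_key(config_path="config.ini", prompt=input):
    """Return the stored API key, asking for it once if it is missing."""
    # interpolation=None keeps '%' characters in a key from being interpreted.
    config = configparser.ConfigParser(interpolation=None)
    config.read(config_path)  # a missing file is silently ignored

    # "anthropic" and "api_key" are hypothetical names chosen for this sketch.
    if config.has_option("anthropic", "api_key"):
        return config.get("anthropic", "api_key")

    api_key = prompt("Enter your Claude API key: ").strip()
    if not config.has_section("anthropic"):
        config.add_section("anthropic")
    config.set("anthropic", "api_key", api_key)
    with open(config_path, "w", encoding="utf-8") as handle:
        config.write(handle)
    return api_key
```

Because the key is written in plain text, config.ini is best kept out of version control.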
PDFs are sourced from the VA's public clinical pathways website: https://www.cancer.va.gov/clinical-pathways.html
Place all downloaded PDFs in a folder named pdfs in the project directory.
python convert_pdfs.py
This script:
- Processes each PDF in the pdfs folder
- Resizes each page to 1280px width
- Crops to 648px height to focus on the important content
- Saves images in a structured format: ripimg/[pdf_name]/pg[page_number].png
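The sizing and naming rules in this step can be illustrated with a few pure helper functions. This is a sketch under assumptions, not the contents of convert_pdfs.py: it assumes the resize preserves the aspect ratio and that the crop keeps the top of the page, and all function names are hypothetical. The crop box uses the (left, upper, right, lower) order accepted by Pillow's `Image.crop`.

```python
from pathlib import Path

TARGET_WIDTH = 1280  # px, from the step description
CROP_HEIGHT = 648    # px, from the step description


def scaled_size(width, height, target_width=TARGET_WIDTH):
    """Size after resizing to target_width, assuming the aspect ratio is kept."""
    scale = target_width / width
    return target_width, round(height * scale)


def crop_box(scaled_height, crop_height=CROP_HEIGHT, width=TARGET_WIDTH):
    """(left, upper, right, lower) box keeping the top crop_height pixels.

    Assumption: the important content sits at the top of the page.
    """
    return 0, 0, width, min(crop_height, scaled_height)


def output_path(pdf_path, page_number, root="ripimg"):
    """Build ripimg/[pdf_name]/pg[page_number].png for one page."""
    return Path(root) / Path(pdf_path).stem / f"pg{page_number}.png"


if __name__ == "__main__":
    size = scaled_size(2560, 1440)
    print(size, crop_box(size[1]), output_path("pdfs/lung.pdf", 2))
```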
python extract_pathways.py
This script:
- Processes each PDF's images
- Skips title slides (page 1)
- Uses Claude AI to analyze each page and extract structured information
- Generates an initial summary of the clinical pathway
- Saves all extracted data as JSON files in the extracted_pathways folder
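Selecting which page images to send for analysis might look like the sketch below. It assumes the `pg[page_number].png` naming from the conversion step; the `pages_to_analyze` helper is hypothetical and the Claude call itself is left out.

```python
import re
from pathlib import Path

PAGE_PATTERN = re.compile(r"^pg(\d+)\.png$")


def pages_to_analyze(image_dir, skip_pages=(1,)):
    """Return (page_number, path) pairs in page order, without skipped pages."""
    pages = []
    for path in Path(image_dir).glob("pg*.png"):
        match = PAGE_PATTERN.match(path.name)
        if match is None:
            continue  # ignore files that do not follow the naming scheme
        number = int(match.group(1))
        if number not in skip_pages:  # page 1 is the title slide
            pages.append((number, path))
    # Sort numerically so that pg10.png comes after pg2.png.
    return sorted(pages, key=lambda item: item[0])
```

Sorting on the parsed number rather than the file name avoids the lexicographic ordering in which pg10.png would precede pg2.png.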
The extraction produces JSON files with the following structure:
{
  "pathway_name": "cancer_type",
  "processed_at": "timestamp",
  "responses": [
    {
      "page": 2,
      "image_file": "pg2.png",
      "response": "structured clinical pathway text",
      "thinking": "Claude's analysis process"
    },
    ...,
    {
      "page": "summary",
      "response": "comprehensive pathway summary",
      "thinking": "Claude's synthesis process"
    }
  ]
}
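A downstream script could separate the per-page entries from the summary entry roughly as follows. The field names come from the structure shown here; the `split_responses` helper and the sample document are illustrative only.

```python
import json


def split_responses(extracted):
    """Separate per-page entries from the final summary entry."""
    responses = extracted.get("responses", [])
    pages = [r for r in responses if r.get("page") != "summary"]
    summaries = [r for r in responses if r.get("page") == "summary"]
    # Use the last summary entry if more than one is present.
    summary = summaries[-1] if summaries else None
    return pages, summary


if __name__ == "__main__":
    # Tiny made-up document with the same shape as the extraction output.
    sample = json.loads("""
    {
      "pathway_name": "cancer_type",
      "processed_at": "timestamp",
      "responses": [
        {"page": 2, "image_file": "pg2.png", "response": "page text", "thinking": ""},
        {"page": "summary", "response": "summary text", "thinking": ""}
      ]
    }
    """)
    pages, summary = split_responses(sample)
    print(len(pages), summary["response"])
```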
python complete_summary.py
This script:
- Takes the previously extracted page analyses from each pathway
- Processes the full content (without truncation) when possible
- Creates comprehensive summaries that include all critical pathway elements
- Logs any pathways that exceed token limits for later processing
- Saves detailed summaries to the complete_summaries folder
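The "full content when possible, otherwise log and defer" behavior could be approximated as in the sketch below. The four-characters-per-token estimate is a rough heuristic rather than a real tokenizer, and `TOKEN_BUDGET`, `estimate_tokens`, and `build_full_content` are hypothetical names and values, not taken from complete_summary.py.

```python
import logging

logger = logging.getLogger("complete_summary")

CHARS_PER_TOKEN = 4     # rough heuristic, not a real tokenizer
TOKEN_BUDGET = 150_000  # hypothetical limit; set it for the model in use


def estimate_tokens(text):
    """Cheap upper-ish estimate of the token count of text."""
    return len(text) // CHARS_PER_TOKEN + 1


def build_full_content(pathway_name, page_texts, budget=TOKEN_BUDGET):
    """Join all page analyses, or log and return None if over budget."""
    content = "\n\n".join(page_texts)
    needed = estimate_tokens(content)
    if needed > budget:
        logger.warning(
            "%s needs about %d tokens (budget %d); deferring",
            pathway_name, needed, budget,
        )
        return None
    return content
```

Returning `None` rather than a truncated string keeps the "without truncation" guarantee explicit: a pathway is either summarized from its full content or set aside for later processing.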
python matching_summary.py
This script:
- Condenses each complete pathway summary into approximately 400 words
- Focuses specifically on information needed for patient matching:
  - Key diagnostic tests
  - Specific medical conditions and criteria
  - Relevant biomarkers and classifications
  - Essential treatments and medications
- Saves these optimized summaries in both JSON and plain text formats
- Creates a consolidated file with all pathway summaries
- Output is stored in the matching_summaries folder
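Writing the per-pathway and consolidated outputs might look like this sketch. The file names follow the `[pdf_name]_matching.json`, `[pdf_name]_matching.txt`, and `all_pathway_summaries.txt` pattern used by this project; the JSON fields, the section separator, and the `save_matching_summaries` helper are assumptions made for illustration.

```python
import json
from pathlib import Path


def save_matching_summaries(summaries, out_dir="matching_summaries"):
    """Write per-pathway JSON and text files plus one consolidated text file.

    summaries maps a pathway (PDF) name to its condensed summary text.
    """
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    sections = []
    for name, text in sorted(summaries.items()):
        # Hypothetical record layout; the real JSON fields may differ.
        record = {
            "pathway_name": name,
            "summary": text,
            "word_count": len(text.split()),
        }
        (out / f"{name}_matching.json").write_text(
            json.dumps(record, indent=2), encoding="utf-8"
        )
        (out / f"{name}_matching.txt").write_text(text + "\n", encoding="utf-8")
        sections.append(f"=== {name} ===\n{text}\n")
    (out / "all_pathway_summaries.txt").write_text(
        "\n".join(sections), encoding="utf-8"
    )
    return out
```

Recording the word count alongside each summary makes it easy to spot summaries that drift far from the 400-word target.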
The matching summaries are designed to be imported into an HPC environment where:
- Patient medical records are summarized
- LLaMA or other models compare patient summaries against pathway summaries
- The most appropriate clinical pathway is identified
- ripimg/[pdf_name]/pg[number].png: Individual page images
- extracted_pathways/[pdf_name]_extracted.json: Structured page-by-page analyses with initial summary
- complete_summaries/[pdf_name]_complete_summary.json: Comprehensive pathway summaries
- matching_summaries/[pdf_name]_matching.json: Condensed 400-word summaries in JSON format
- matching_summaries/[pdf_name]_matching.txt: Plain text summaries
- matching_summaries/all_pathway_summaries.txt: Consolidated file with all pathway summaries
- condaenv.sh: Creates and configures the conda environment
- convert_pdfs.py: Converts PDFs to properly sized images
- extract_pathways.py: Analyzes images and extracts structured pathway data
- complete_summary.py: Generates comprehensive summaries from extracted data
- matching_summary.py: Creates optimized summaries for patient matching
- config.ini: Stores API key and configuration parameters
- Ensure you have installed poppler via conda as specified in the setup script
- Verify PDF files aren't password-protected or corrupted
- Verify your API key is correct
- Check your network connection
- Ensure your account has sufficient API credits
- If complete_summary.py reports pathways that need truncation, you may need to:
  - Process those pathways in chunks
  - Adjust the API configuration for larger token limits
  - Simplify the system prompt to save tokens
- The extraction quality depends on the clarity and structure of the source PDFs
- Large or complex PDFs may require adjustments to the image sizing parameters
- API rate limits may affect processing of large batches
- The 400-word matching summaries are optimized for LLaMA's context window limitations
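For pathways that exceed token limits, processing in chunks could be sketched as below. This is one possible approach, not the project's implementation: `chunk_pages` is a hypothetical helper that greedily groups consecutive page analyses under a character budget, so each chunk can be summarized separately and the partial summaries combined afterwards.

```python
def chunk_pages(page_texts, max_chars):
    """Group consecutive page texts so each chunk stays within max_chars.

    A single page longer than max_chars is kept as its own chunk rather
    than being split or truncated.
    """
    chunks, current, size = [], [], 0
    for text in page_texts:
        # Start a new chunk when adding this page would exceed the budget.
        if current and size + len(text) > max_chars:
            chunks.append(current)
            current, size = [], 0
        current.append(text)
        size += len(text)
    if current:
        chunks.append(current)
    return chunks
```

Keeping pages whole and in order preserves the decision flow within each chunk, at the cost of somewhat uneven chunk sizes.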
The extracted and condensed pathway data can be used to:
- Match patients to appropriate clinical pathways based on their medical records
- Create interactive visualizations of clinical decision trees
- Compare treatment approaches across different conditions
- Generate patient education materials
- Support clinical decision-making in healthcare environments
This project is intended for research and educational purposes. Clinical pathways should always be verified by qualified medical professionals before clinical application.