atomikkus/mole_path_str

🏥 Molecular Pathology Report Processing API

A FastAPI-based pipeline that converts molecular pathology PDF reports (IHC, FISH, NGS, PCR) into structured JSON data using Mistral AI's OCR and language models.

🚀 Features

  • Complete PDF → JSON Pipeline: Upload medical PDFs and get structured JSON data
  • Multiple Report Types: Supports IHC, FISH, NGS, and PCR reports
  • Mistral AI Integration: Uses Mistral OCR for PDF text extraction and AI models for data structuring
  • Pydantic Models: Structured data validation using medical report-specific models
  • RESTful API: Clean endpoints for programmatic access
  • Async & Error Handling: Asynchronous request processing, with failures reported as HTTP 400 or 500 and logged in detail

📁 Project Structure

mole_path_str/
├── app.py                  # FastAPI application (main pipeline)
├── pdf_to_markdown.py      # PDF to markdown conversion using Mistral OCR
├── md_to_json.py           # Markdown to structured JSON using AI
├── ihc_models.py           # Pydantic models for IHC reports
├── fish_models.py          # Pydantic models for FISH reports
├── ngs_models.py           # Pydantic models for NGS reports
├── pcr_models.py           # Pydantic models for PCR reports
├── requirements.txt        # Python dependencies
├── .env.example            # Environment variables template
└── README.md               # This file

🛠️ Setup

Option 1: Docker Deployment (Recommended)

Prerequisites

  • Docker installed
  • Mistral AI API key

Quick Start with .env file

# 1. Clone and navigate
git clone <repository>
cd mole_path_str

# 2. Set up environment file
cp .env.example .env
# Edit .env and add your Mistral API key

# 3. Build Docker image
docker build -t molepath-str-api .

# 4. Run container with .env file
docker run -d \
  --name mole-path-api \
  -p 8001:8001 \
  --env-file .env \
  molepath-str-api

Alternative: Manual Environment Variables

# If you prefer to set variables manually
docker run -d \
  --name mole-path-api \
  -p 8000:8000 \
  -e MISTRAL_API_KEY=your_actual_api_key_here \
  -e LOG_LEVEL=INFO \
  molepath-str-api

Docker Management Commands

# View logs
docker logs -f mole-path-api

# Stop container
docker stop mole-path-api

# Remove container
docker rm mole-path-api

# Restart container
docker restart mole-path-api

Option 2: Local Development

Prerequisites

  • Python 3.8+
  • Mistral AI API key

Setup Steps

# 1. Clone and navigate
git clone <repository>
cd mole_path_str

# 2. Create virtual environment
python -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate

# 3. Install dependencies
pip install -r requirements.txt

# 4. Environment configuration
cp .env.example .env
# Edit .env and add your Mistral API key

# 5. Run the application
python app.py

Get your Mistral API key from: https://console.mistral.ai/

The API will be available at: http://localhost:8000

🌐 API Endpoints

1. Health Check

GET /health

curl http://localhost:8000/health
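
A freshly started container may need a moment before it answers requests. The following is a minimal sketch of polling the health endpoint from Python using only the standard library. It assumes that a healthy service answers `GET /health` with HTTP 200; the helper names `http_status` and `wait_until_healthy` are hypothetical and not part of this repository.

```python
import time
import urllib.error
import urllib.request


def http_status(url: str, timeout: float = 5.0) -> int:
    """Return the HTTP status code of a GET request, or 0 if unreachable."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status
    except urllib.error.HTTPError as exc:
        return exc.code  # the server answered, but with an error status
    except (urllib.error.URLError, OSError):
        return 0  # connection refused, DNS failure, or timeout


def wait_until_healthy(url, attempts=10, delay=1.0,
                       fetch=http_status, sleep=time.sleep):
    """Poll `url` until it returns 200 or the attempts run out.

    Assumption: HTTP 200 means "healthy"; the response body is not inspected.
    """
    for attempt in range(attempts):
        if fetch(url) == 200:
            return True
        if attempt < attempts - 1:
            sleep(delay)  # wait before the next try
    return False
```

For example, `wait_until_healthy("http://localhost:8000/health")` returns `True` once the API responds, or `False` after ten failed attempts. The `fetch` and `sleep` parameters exist only so the loop can be exercised without a running server.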

2. Process IHC Reports

POST /ihc_structured/
Content-Type: multipart/form-data

curl -X POST "http://localhost:8000/ihc_structured/" \
  -H "Content-Type: multipart/form-data" \
  -F "file=@ihc_report.pdf"

3. Process FISH Reports

POST /fish_structured/
Content-Type: multipart/form-data

curl -X POST "http://localhost:8000/fish_structured/" \
  -H "Content-Type: multipart/form-data" \
  -F "file=@fish_report.pdf"

4. Process NGS Reports

POST /ngs_structured/
Content-Type: multipart/form-data

curl -X POST "http://localhost:8000/ngs_structured/" \
  -H "Content-Type: multipart/form-data" \
  -F "file=@ngs_report.pdf"

5. Process PCR Reports

POST /pcr_structured/
Content-Type: multipart/form-data

curl -X POST "http://localhost:8000/pcr_structured/" \
  -H "Content-Type: multipart/form-data" \
  -F "file=@pcr_report.pdf"
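
The four endpoints follow the same pattern, so a single client function can cover them. Below is a rough standard-library sketch of the same upload the cURL commands perform. It assumes the form field is named `file` (as in the cURL examples), that the service listens on `http://localhost:8000`, and that the response body is JSON. The function names are hypothetical and not part of this repository.

```python
import json
import os
import urllib.request
import uuid
from typing import Optional, Tuple

BASE_URL = "http://localhost:8000"  # assumption: local default from this README
REPORT_TYPES = ("ihc", "fish", "ngs", "pcr")


def endpoint_for(report_type: str, base_url: str = BASE_URL) -> str:
    """Map a report type to its `/<type>_structured/` endpoint URL."""
    if report_type not in REPORT_TYPES:
        raise ValueError(f"unsupported report type: {report_type!r}")
    return f"{base_url}/{report_type}_structured/"


def encode_pdf_upload(filename: str, pdf_bytes: bytes,
                      boundary: Optional[str] = None) -> Tuple[bytes, str]:
    """Build a multipart/form-data body with a single `file` field."""
    boundary = boundary or uuid.uuid4().hex
    head = (
        f"--{boundary}\r\n"
        f'Content-Disposition: form-data; name="file"; filename="{filename}"\r\n'
        "Content-Type: application/pdf\r\n\r\n"
    ).encode("utf-8")
    tail = f"\r\n--{boundary}--\r\n".encode("utf-8")
    return head + pdf_bytes + tail, f"multipart/form-data; boundary={boundary}"


def upload_report(report_type: str, pdf_path: str,
                  base_url: str = BASE_URL) -> dict:
    """POST a PDF to the matching endpoint and return the parsed JSON."""
    with open(pdf_path, "rb") as fh:
        body, content_type = encode_pdf_upload(os.path.basename(pdf_path), fh.read())
    request = urllib.request.Request(
        endpoint_for(report_type, base_url),
        data=body,
        headers={"Content-Type": content_type},
        method="POST",
    )
    # OCR plus model calls can be slow, so the timeout is generous (assumption).
    with urllib.request.urlopen(request, timeout=120) as resp:
        return json.loads(resp.read().decode("utf-8"))
```

With the API running, `upload_report("ihc", "ihc_report.pdf")` would send the same request as the first cURL example.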

📊 Data Models

IHC Reports:

  • Marker results with percentage, intensity, and scores.

FISH Reports:

  • Marker results with ratio.

NGS Reports:

  • Genetic variants, TMB value, and clinical significance.

PCR Reports:

  • Marker results with exon and codon information.
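
To give a feel for what such a model might look like, here is a dependency-free sketch of an IHC result. The project itself uses Pydantic; this example uses standard-library dataclasses as a stand-in so it runs anywhere. All class and field names are assumptions for illustration, not the contents of `ihc_models.py`, and the sample values in the usage note are made up.

```python
from dataclasses import asdict, dataclass, field
from typing import List, Optional


@dataclass
class IHCMarkerResult:
    """One marker line from an IHC report (hypothetical field names)."""
    marker: str                         # marker name as printed in the report
    percentage: Optional[float] = None  # percent of positive cells, 0 to 100
    intensity: Optional[str] = None     # staining intensity as free text
    score: Optional[str] = None         # score as printed in the report

    def __post_init__(self):
        # Basic range check, similar in spirit to a Pydantic validator.
        if self.percentage is not None and not 0 <= self.percentage <= 100:
            raise ValueError(f"percentage out of range: {self.percentage}")


@dataclass
class IHCReport:
    """Container for all marker results extracted from one report."""
    markers: List[IHCMarkerResult] = field(default_factory=list)

    def to_dict(self) -> dict:
        # asdict() converts nested dataclasses recursively.
        return asdict(self)
```

For instance, `IHCReport(markers=[IHCMarkerResult("ER", 90.0, "strong", "3+")]).to_dict()` yields a plain dictionary that can be serialized with `json.dumps`.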

🔧 Command Line Usage

Process IHC Reports:

python md_to_json.py ihc_report.md --report_type ihc --output_file result.json

Process FISH Reports:

python md_to_json.py fish_report.md --report_type fish --output_file result.json

Process NGS Reports:

python md_to_json.py ngs_report.md --report_type ngs --output_file result.json

Process PCR Reports:

python md_to_json.py pcr_report.md --report_type pcr --output_file result.json

Convert PDF to Markdown:

python pdf_to_markdown.py --input_pdf report.pdf
# Creates: report_no_images.md, report_with_images.md, report.txt
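
The two scripts can be chained to go from PDF to JSON without the API. The sketch below builds and runs both commands in order. It assumes that the image-free markdown (`<name>_no_images.md`) is the file to pass to `md_to_json.py` and that it is written next to the input PDF; neither is confirmed by this README. The function names are hypothetical.

```python
import subprocess
import sys
from pathlib import Path


def pipeline_commands(pdf_path: str, report_type: str,
                      output_json: str = "result.json"):
    """Return the two command lines for the PDF -> markdown -> JSON steps."""
    pdf = Path(pdf_path)
    # Assumption: pdf_to_markdown.py writes <stem>_no_images.md beside the PDF,
    # and that file is the right input for md_to_json.py.
    markdown = pdf.with_name(f"{pdf.stem}_no_images.md")
    return [
        [sys.executable, "pdf_to_markdown.py", "--input_pdf", str(pdf)],
        [sys.executable, "md_to_json.py", str(markdown),
         "--report_type", report_type, "--output_file", output_json],
    ]


def run_pipeline(pdf_path: str, report_type: str,
                 output_json: str = "result.json") -> None:
    """Run both steps in order, stopping at the first failure."""
    for command in pipeline_commands(pdf_path, report_type, output_json):
        subprocess.run(command, check=True)  # raises CalledProcessError on failure
```

Run from the repository root, `run_pipeline("report.pdf", "ngs")` would be equivalent to invoking the two scripts by hand.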

Docker Deployment

Simple Docker Setup

The project includes a Dockerfile for easy containerization:

# Build the image
docker build -t molepath-str-api .

# Run the container
docker run -d \
  --name mole-path-api \
  -p 8001:8001 \
  --env-file .env \
  molepath-str-api

Environment Variables

The containerized application uses the following environment variables:

MISTRAL_API_KEY=your_api_key_here  # Required
LOG_LEVEL=INFO                     # Optional (DEBUG, INFO, WARNING, ERROR)
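
As an illustration of how these two variables could be read and validated at startup, here is a small sketch. It is not the code in `app.py`; the `Settings` type and `load_settings` function are hypothetical, and the `INFO` default for `LOG_LEVEL` is an assumption.

```python
import os
from typing import Mapping, NamedTuple

VALID_LOG_LEVELS = ("DEBUG", "INFO", "WARNING", "ERROR")


class Settings(NamedTuple):
    mistral_api_key: str
    log_level: str


def load_settings(env: Mapping[str, str] = os.environ) -> Settings:
    """Read MISTRAL_API_KEY (required) and LOG_LEVEL (optional) from `env`."""
    api_key = env.get("MISTRAL_API_KEY", "").strip()
    if not api_key:
        raise RuntimeError("MISTRAL_API_KEY is required but not set")

    # Assumption: fall back to INFO when LOG_LEVEL is absent.
    log_level = env.get("LOG_LEVEL", "INFO").upper()
    if log_level not in VALID_LOG_LEVELS:
        raise RuntimeError(
            f"LOG_LEVEL must be one of {VALID_LOG_LEVELS}, got {log_level!r}"
        )
    return Settings(api_key, log_level)
```

Failing fast on a missing key gives a clearer error at startup than a rejected request to the Mistral API later on.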

Container Features

  • 🔒 Security: Runs as non-root user
  • 🏥 Health Checks: Built-in health monitoring
  • 🔄 File Cleanup: Automatic temporary file cleanup
  • 📦 Optimized: Multi-layer caching for faster builds

🧪 Testing

Manual Testing:

  1. Health Check: Visit http://localhost:8000/health
  2. API Documentation: Visit http://localhost:8000/docs (Swagger UI)
  3. Test with cURL: Use the example commands above

Docker Testing:

# Test API health (use the host port from your -p mapping, e.g. 8001 if you ran with -p 8001:8001)
curl http://localhost:8000/health

# Test IHC endpoint
curl -X POST "http://localhost:8000/ihc_structured/" \
  -H "Content-Type: multipart/form-data" \
  -F "file=@sample_ihc.pdf"

# Test FISH endpoint
curl -X POST "http://localhost:8000/fish_structured/" \
  -H "Content-Type: multipart/form-data" \
  -F "file=@sample_fish.pdf"

# Test NGS endpoint
curl -X POST "http://localhost:8000/ngs_structured/" \
  -H "Content-Type: multipart/form-data" \
  -F "file=@sample_ngs.pdf"

# Test PCR endpoint
curl -X POST "http://localhost:8000/pcr_structured/" \
  -H "Content-Type: multipart/form-data" \
  -F "file=@sample_pcr.pdf"

Supported Report Types

  • IHC: Immunohistochemistry
  • FISH: Fluorescence In Situ Hybridization
  • NGS: Next-Generation Sequencing
  • PCR: Polymerase Chain Reaction

⚠️ Requirements

  • Python 3.8+
  • Mistral AI API Key (required for OCR and AI processing)
  • PDF files containing medical reports in standard format

🛡️ Error Handling

The API provides comprehensive error handling:

  • 400: Invalid file format or empty content
  • 500: Processing errors or API issues
  • Detailed logging for debugging
  • Async processing with background task cleanup
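
On the client side, the two documented status codes call for different reactions: a 400 means the upload itself needs fixing, while a 500 may succeed on a later attempt. The sketch below shows one possible way to encode that. The `Outcome` enum and the retry policy are assumptions for illustration, not behavior defined by this API.

```python
from enum import Enum


class Outcome(Enum):
    SUCCESS = "success"
    BAD_INPUT = "bad_input"        # 400: invalid file format or empty content
    SERVER_ERROR = "server_error"  # 500: processing errors or API issues
    UNEXPECTED = "unexpected"      # any status this README does not document


def classify_response(status: int) -> Outcome:
    """Map an HTTP status code to a coarse outcome."""
    if 200 <= status < 300:
        return Outcome.SUCCESS
    if status == 400:
        return Outcome.BAD_INPUT
    if status == 500:
        return Outcome.SERVER_ERROR
    return Outcome.UNEXPECTED


def should_retry(outcome: Outcome) -> bool:
    """Assumed policy: only server-side failures are worth retrying."""
    return outcome is Outcome.SERVER_ERROR
```

A caller could retry a few times while `should_retry(classify_response(status))` is true, and otherwise surface the error to the user.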

🔒 Security Notes

  • API keys are loaded from environment variables
  • Temporary files are automatically cleaned up
  • CORS is configured (adjust for production)

📞 Support

For issues or questions:

  1. Check the API health endpoint: /health
  2. Review logs for detailed error information
  3. Ensure your Mistral API key is valid and has sufficient credits

Built with: FastAPI, Mistral AI, Pydantic, and modern Python tools
