A FastAPI-based pipeline that converts molecular pathology PDF reports (IHC, FISH, NGS, PCR) into structured JSON data using Mistral AI's OCR and language models.
- Complete PDF → JSON Pipeline: Upload medical PDFs and get structured JSON data
- Multiple Report Types: Supports IHC, FISH, NGS, and PCR reports.
- Mistral AI Integration: Uses Mistral OCR for PDF text extraction and AI models for data structuring
- Pydantic Models: Structured data validation using medical report-specific models
- RESTful API: Clean endpoints for programmatic access
- Async & Error Handling: Comprehensive error handling and async processing
mole_path_str/
├── app.py # FastAPI application (main pipeline)
├── pdf_to_markdown.py # PDF to markdown conversion using Mistral OCR
├── md_to_json.py # Markdown to structured JSON using AI
├── ihc_models.py # Pydantic models for IHC reports
├── fish_models.py # Pydantic models for FISH reports
├── ngs_models.py # Pydantic models for NGS reports
├── pcr_models.py # Pydantic models for PCR reports
├── requirements.txt # Python dependencies
├── .env.example # Environment variables template
└── README.md # This file
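The two-stage flow implied by this layout (PDF to markdown, then markdown to structured JSON) can be sketched roughly as follows. This is only an illustration: `run_pipeline` and the stage signatures are hypothetical and are not taken from `app.py`, `pdf_to_markdown.py`, or `md_to_json.py`, and the stages are passed in as plain callables so the sketch runs without a Mistral API key.

```python
from typing import Callable

# Hypothetical stage signatures; the real functions in pdf_to_markdown.py
# and md_to_json.py may look different.
OcrStage = Callable[[bytes], str]          # PDF bytes -> markdown text
StructStage = Callable[[str, str], dict]   # (markdown, report_type) -> dict

SUPPORTED_REPORT_TYPES = ("ihc", "fish", "ngs", "pcr")


def run_pipeline(pdf_bytes: bytes, report_type: str,
                 ocr: OcrStage, structure: StructStage) -> dict:
    """Run OCR first, then structure the resulting markdown."""
    if report_type not in SUPPORTED_REPORT_TYPES:
        raise ValueError(f"unsupported report type: {report_type!r}")
    if not pdf_bytes:
        raise ValueError("empty PDF content")
    markdown = ocr(pdf_bytes)
    return structure(markdown, report_type)


# Stand-in stages for demonstration only (no network calls).
def fake_ocr(data: bytes) -> str:
    return "# Report\nHER2: positive"


def fake_structure(markdown: str, report_type: str) -> dict:
    return {"report_type": report_type, "line_count": len(markdown.splitlines())}


print(run_pipeline(b"%PDF-1.4", "ihc", fake_ocr, fake_structure))
```

Passing the stages as arguments keeps the orchestration testable on its own; the actual application may wire the stages together differently.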
- Docker installed
- Mistral AI API key
# 1. Clone and navigate
git clone <repository>
cd mole_path_str
# 2. Set up environment file
cp .env.example .env
# Edit .env and add your Mistral API key
# 3. Build Docker image
docker build -t molepath-str-api .
# 4. Run container with .env file
docker run -d \
--name mole-path-api \
-p 8001:8001 \
--env-file .env \
molepath-str-api
# If you prefer to set variables manually
docker run -d \
--name mole-path-api \
-p 8000:8000 \
-e MISTRAL_API_KEY=your_actual_api_key_here \
-e LOG_LEVEL=INFO \
molepath-str-api
# View logs
docker logs -f mole-path-api
# Stop container
docker stop mole-path-api
# Remove container
docker rm mole-path-api
# Restart container
docker restart mole-path-api
- Python 3.8+
- Mistral AI API key
# 1. Clone and navigate
git clone <repository>
cd mole_path_str
# 2. Create virtual environment
python -m venv venv
source venv/bin/activate # On Windows: venv\Scripts\activate
# 3. Install dependencies
pip install -r requirements.txt
# 4. Environment configuration
cp .env.example .env
# Edit .env and add your Mistral API key
# 5. Run the application
python app.py
Get your Mistral API key from: https://console.mistral.ai/
The API will be available at: http://localhost:8000
GET /health
curl http://localhost:8000/health
POST /ihc_structured/
Content-Type: multipart/form-data
curl -X POST "http://localhost:8000/ihc_structured/" \
-H "Content-Type: multipart/form-data" \
-F "file=@ihc_report.pdf"
POST /fish_structured/
Content-Type: multipart/form-data
curl -X POST "http://localhost:8000/fish_structured/" \
-H "Content-Type: multipart/form-data" \
-F "file=@fish_report.pdf"
POST /ngs_structured/
Content-Type: multipart/form-data
curl -X POST "http://localhost:8000/ngs_structured/" \
-H "Content-Type: multipart/form-data" \
-F "file=@ngs_report.pdf"
POST /pcr_structured/
Content-Type: multipart/form-data
curl -X POST "http://localhost:8000/pcr_structured/" \
-H "Content-Type: multipart/form-data" \
-F "file=@pcr_report.pdf"
- IHC: Marker results with percentage, intensity, and scores.
- FISH: Marker results with ratio.
- NGS: Genetic variants, TMB value, and clinical significance.
- PCR: Marker results with exon and codon information.
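To make the shape of such a response concrete, here is a minimal sketch of an IHC result carrying percentage, intensity, and score. The project itself uses Pydantic models (`ihc_models.py`); this sketch swaps in standard-library dataclasses to stay dependency-free. The class names, field names, and sample values are assumptions for illustration and are not copied from the real models.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import List, Optional


@dataclass
class IHCMarker:
    # Field names are illustrative, not taken from ihc_models.py.
    name: str
    percentage: Optional[float] = None  # percent of stained cells, 0-100
    intensity: Optional[str] = None     # e.g. "weak", "moderate", "strong"
    score: Optional[str] = None

    def __post_init__(self) -> None:
        # Basic validation, similar in spirit to what a Pydantic model could enforce.
        if self.percentage is not None and not 0 <= self.percentage <= 100:
            raise ValueError("percentage must be between 0 and 100")


@dataclass
class IHCReport:
    markers: List[IHCMarker] = field(default_factory=list)

    def to_json(self) -> str:
        return json.dumps(asdict(self), indent=2)


# Made-up sample values, for demonstration only.
report = IHCReport(markers=[
    IHCMarker("HER2", percentage=80.0, intensity="strong", score="3+"),
])
print(report.to_json())
```

The FISH, NGS, and PCR shapes would follow the same pattern with their own fields (ratio, variants and TMB, exon and codon).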
python md_to_json.py ihc_report.md --report_type ihc --output_file result.json
python md_to_json.py fish_report.md --report_type fish --output_file result.json
python md_to_json.py ngs_report.md --report_type ngs --output_file result.json
python md_to_json.py pcr_report.md --report_type pcr --output_file result.json
python pdf_to_markdown.py --input_pdf report.pdf
# Creates: report_no_images.md, report_with_images.md, report.txt
The project includes a Dockerfile for easy containerization:
# Build the image
docker build -t molepath-str-api .
# Run the container
docker run -d \
--name mole-path-api \
-p 8001:8001 \
--env-file .env \
molepath-str-api
The containerized application uses the following environment variables:
MISTRAL_API_KEY=your_api_key_here # Required
LOG_LEVEL=INFO # Optional (DEBUG, INFO, WARNING, ERROR)
- Security: Runs as non-root user
- Health Checks: Built-in health monitoring
- File Cleanup: Automatic temporary file cleanup
- Optimized: Multi-layer caching for faster builds
- Health Check: Visit http://localhost:8000/health
- API Documentation: Visit http://localhost:8000/docs (Swagger UI)
- Test with cURL: Use the example commands above
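The same upload can also be scripted from Python. The sketch below uses only the standard library to build the `multipart/form-data` request that `curl -F "file=@..."` would send. The helper names (`build_multipart`, `build_upload_request`, `upload`) are hypothetical; the base URL, endpoint paths, and the `file` form field come from the cURL examples in this README.

```python
import urllib.request
import uuid

BASE_URL = "http://localhost:8000"  # adjust if your container maps another port


def build_multipart(field: str, filename: str, content: bytes):
    """Encode one file as a multipart/form-data body; returns (body, content_type)."""
    boundary = uuid.uuid4().hex
    head = (
        f"--{boundary}\r\n"
        f'Content-Disposition: form-data; name="{field}"; filename="{filename}"\r\n'
        "Content-Type: application/pdf\r\n\r\n"
    ).encode("utf-8")
    tail = f"\r\n--{boundary}--\r\n".encode("utf-8")
    return head + content + tail, f"multipart/form-data; boundary={boundary}"


def build_upload_request(report_type: str, filename: str,
                         content: bytes) -> urllib.request.Request:
    """Prepare (but do not send) a POST to /<report_type>_structured/."""
    body, content_type = build_multipart("file", filename, content)
    return urllib.request.Request(
        f"{BASE_URL}/{report_type}_structured/",
        data=body,
        headers={"Content-Type": content_type},
        method="POST",
    )


def upload(path: str, report_type: str) -> str:
    """Send a PDF to the running API and return the raw response text."""
    with open(path, "rb") as fh:
        request = build_upload_request(report_type, path, fh.read())
    with urllib.request.urlopen(request) as response:
        return response.read().decode("utf-8")
```

With the server running, `upload("sample_ihc.pdf", "ihc")` would return the response body as a string; a library such as `requests` would make this shorter if adding a dependency is acceptable.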
# Test API health
curl http://localhost:8000/health
# Test IHC endpoint
curl -X POST "http://localhost:8000/ihc_structured/" \
-H "Content-Type: multipart/form-data" \
-F "file=@sample_ihc.pdf"
# Test FISH endpoint
curl -X POST "http://localhost:8000/fish_structured/" \
-H "Content-Type: multipart/form-data" \
-F "file=@sample_fish.pdf"
# Test NGS endpoint
curl -X POST "http://localhost:8000/ngs_structured/" \
-H "Content-Type: multipart/form-data" \
-F "file=@sample_ngs.pdf"
# Test PCR endpoint
curl -X POST "http://localhost:8000/pcr_structured/" \
-H "Content-Type: multipart/form-data" \
-F "file=@sample_pcr.pdf"
- IHC: Immunohistochemistry
- FISH: Fluorescence In Situ Hybridization
- NGS: Next Generation Sequencing
- PCR: Polymerase Chain Reaction
- Python 3.8+
- Mistral AI API Key (required for OCR and AI processing)
- PDF files containing medical reports in standard format
The API provides comprehensive error handling:
- 400: Invalid file format or empty content
- 500: Processing errors or API issues
- Detailed logging for debugging
- Async processing with background task cleanup
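On the client side, the status codes above might be handled along these lines. This is a sketch under assumptions: `classify_status` and its action labels are hypothetical, and treating 5xx responses as retryable is a suggested policy, not documented behavior of this API.

```python
def classify_status(status: int) -> str:
    """Map an HTTP status from the API to a suggested client action."""
    if 200 <= status < 300:
        return "ok"
    if status == 400:
        # Invalid file format or empty content: resending the same file will not help.
        return "fix_input"
    if status >= 500:
        # Processing error or upstream API issue: a later retry may succeed.
        return "retry_later"
    return "unexpected"


for code in (200, 400, 500):
    print(code, classify_status(code))
```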
- API keys are loaded from environment variables
- Temporary files are automatically cleaned up
- CORS is configured (adjust for production)
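Loading configuration from the environment could look roughly like the following. `MISTRAL_API_KEY` and `LOG_LEVEL` are the variable names documented in this README; the `load_settings` helper, the fail-fast behavior, and the `INFO` default are assumptions and may not match what `app.py` does.

```python
import logging
import os
from typing import Mapping

ALLOWED_LEVELS = ("DEBUG", "INFO", "WARNING", "ERROR")


def load_settings(env: Mapping[str, str] = os.environ) -> dict:
    """Read the API key and log level, failing early if the key is missing."""
    api_key = env.get("MISTRAL_API_KEY", "").strip()
    if not api_key:
        raise RuntimeError("MISTRAL_API_KEY is not set")
    # Assumption: default to INFO when LOG_LEVEL is absent.
    level_name = env.get("LOG_LEVEL", "INFO").upper()
    if level_name not in ALLOWED_LEVELS:
        raise ValueError(f"LOG_LEVEL must be one of {ALLOWED_LEVELS}")
    return {"api_key": api_key, "log_level": getattr(logging, level_name)}


# Example with an explicit mapping instead of the real environment.
settings = load_settings({"MISTRAL_API_KEY": "example-key", "LOG_LEVEL": "debug"})
print(logging.getLevelName(settings["log_level"]))
```

Accepting a mapping argument keeps the function easy to test and avoids logging or echoing the key itself.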
For issues or questions:
- Check the API health endpoint: /health
- Review logs for detailed error information
- Ensure your Mistral API key is valid and has sufficient credits
Built with: FastAPI, Mistral AI, Pydantic, and modern Python tools