A comprehensive AI-powered PDF analysis system that extracts, analyzes, and enhances PDF documents with intelligent image recognition and web search capabilities.
PDF Image Analyzer is a cutting-edge solution that transforms static PDF documents into intelligent, searchable content. By combining advanced document processing with state-of-the-art AI vision models, it extracts images, analyzes their content, and enhances them with contextual information from web searches.
- Advanced ChartGemma Integration: State-of-the-art chart analysis with intelligent questioning
- AI-Generated Prompts: Dynamic prompt generation using AI models for optimal chart analysis
- Multi-Chart Detection: Automatic detection and specialized handling of complex multi-panel figures
- Enhanced Chart Understanding: Specialized analysis for line charts, bar charts, pie charts, scatter plots, etc.
- Full Platform Support: Integrated across the Streamlit app, Docker deployment, and Jupyter notebooks
- Intelligent Multi-Chart Workflow: Smart handling of complex visualizations
  - Automatic detection of multi-panel figures and subplot arrangements
  - Dynamic workflow adjustment: DePlot is skipped for multi-chart images
  - AI-generated prompts tailored to image complexity and content
- Three-Level Chart Analysis Pipeline: Comprehensive chart understanding system
  - Level 1: Gemma AI visual analysis for general understanding and multi-chart detection
  - Level 2: DePlot data extraction for structured data tables (single charts only)
  - Level 3: ChartGemma specialized analysis with AI-generated questioning
- Enhanced Web Search: Migrated to the `ddgs` package with improved error handling and fallback mechanisms
- Dark Theme Compatibility: Removed white backgrounds for better IDE integration
- Comprehensive Analysis Output: DATA_VISUALIZATION images show 3 analysis levels, CONCEPTUAL images show 2 levels
- Production-Ready Notebooks: Updated `PDF_Parsing.ipynb` (formerly `Reproduce_Two_Stages.ipynb`) with complete multi-chart support and intelligent prompt generation
- New modular `Functions/` package with clean, testable functions:
  - PDF→JSON (`pdf_to_json.py`)
  - Image analysis, web search, and enrichment (`image_analysis.py`)
  - DePlot chart extraction with robust parser (`chart_extraction.py`)
  - Two-step processing pipeline (`pipeline_steps.py`)
  - Final JSON verification (no file output, returns structured results) (`verification.py`)
  - Simple logging setup (`utils_logging.py`)
- Reproducible two-stage notebook `PDF_Parsing.ipynb` to separately run: 1) PDF→JSON; 2) JSON→Enhanced→NLP-ready + inline verification
- Headless DePlot debug tester `Functions/debug_deplot_test.py` (prints structure only)
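The three-level chart pipeline and the multi-chart rule listed above can be pictured as a small planning step. The following is a minimal sketch, not the project's implementation: `plan_chart_levels` and the level labels are hypothetical names chosen for illustration. The only behavior taken from the list is that Level 2 (DePlot) runs for single charts only.

```python
from typing import List


def plan_chart_levels(is_multi_chart: bool) -> List[str]:
    """Return the analysis levels to run for one data-visualization image.

    Hypothetical helper: the labels are illustrative, not project identifiers.
    """
    levels = ["level1_gemma_visual_analysis"]
    if not is_multi_chart:
        # Level 2 applies to single charts only, so it is skipped
        # when Level 1 reports a multi-panel figure.
        levels.append("level2_deplot_data_extraction")
    levels.append("level3_chartgemma_analysis")
    return levels


print(plan_chart_levels(is_multi_chart=False))
print(plan_chart_levels(is_multi_chart=True))
```

Keeping the decision in one function makes the "DePlot skipped for multi-chart images" rule easy to unit-test separately from the models themselves.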
- Smart PDF Processing - Extract text and images using Docling with VLM pipeline
- Multi-AI Provider Support - Compatible with OpenAI, Google Gemini, Anthropic Claude, and LM Studio
- Intelligent ChartGemma Integration - Advanced chart analysis with AI-generated prompts
  - Dynamic Prompt Generation: AI models create optimal questions for each chart type and complexity
  - Multi-Chart Intelligence: Specialized handling of complex multi-panel visualizations
  - Adaptive Analysis: Context-aware questioning for line charts, bar charts, pie charts, scatter plots, etc.
- Enhanced DePlot Chart Extraction - Robust chart data extraction with AI verification and categorical X-axis support
- Advanced Web Search Integration - Automatic contextual enhancement with native APIs and DuckDuckGo fallback
- Interactive Reports - Generate comprehensive HTML evaluation reports
- Production Ready - Docker containerization for scalable deployment
- User-Friendly GUI - Streamlit web interface for easy interaction
- Research Tools - Specialized notebooks for chart data extraction and analysis
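The "native APIs and DuckDuckGo fallback" behavior can be read as trying search backends in order until one returns results. Below is a minimal sketch under that assumption. `search_with_fallback` and the two stand-in backends are hypothetical and make no network calls; in a real setup the callables would wrap a provider's native search and the `ddgs` package.

```python
from typing import Callable, Dict, List, Sequence

SearchBackend = Callable[[str], List[Dict[str, str]]]


def search_with_fallback(query: str, backends: Sequence[SearchBackend]) -> List[Dict[str, str]]:
    """Try each backend in order and return the first non-empty result list."""
    for backend in backends:
        try:
            results = backend(query)
        except Exception as exc:  # a failing backend should not abort processing
            print(f"{getattr(backend, '__name__', 'backend')} failed: {exc}")
            continue
        if results:
            return results
    return []  # every backend failed or returned nothing


# Stand-in backends for illustration only (no network access).
def native_search(query: str) -> List[Dict[str, str]]:
    raise RuntimeError("native search unavailable")


def duckduckgo_search(query: str) -> List[Dict[str, str]]:
    return [{"title": f"Result for {query}", "url": "https://example.com"}]


print(search_with_fallback("ocean heat content", [native_search, duckduckgo_search]))
```

Returning an empty list instead of raising lets the caller continue and simply omit web context for that image.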
For users with Python 3.11+ and a PyTorch environment:
# Clone the repository
git clone https://github.com/hongyu-liao/Capstone.git
cd Capstone/docker_deployment
# Install dependencies (from our torch260 environment)
pip install -r requirements.txt
# Run directly
python main.py your_document.pdf
For users without a Python/PyTorch environment (Docker provides everything):
# Clone the repository
git clone https://github.com/hongyu-liao/Capstone.git
cd Capstone/docker_deployment
# Build the complete environment (includes Python, CUDA, all models)
docker build -t pdf-analyzer .
# Run with your PDF
docker run --gpus all --rm \
  -v "$(pwd)/input:/app/input" \
  -v "$(pwd)/output:/app/output" \
  pdf-analyzer
# Clone the repository
git clone https://github.com/hongyu-liao/Capstone.git
cd Capstone/PDF_Analyzer_App
# Install dependencies
pip install -r requirements.txt
# Run the application
streamlit run app.py
This project has been tested and confirmed compatible with:
- PyTorch: 2.6.0 (recommended)
- CUDA: 12.6 (officially supported), 12.9 (user-tested compatible)
- Python: 3.11+
- Docling: 2.40.0+
Important Notes:
- PyTorch 2.6.0 officially supports CUDA 12.6, but higher CUDA versions (like 12.9) may also be compatible
- For different CUDA/PyTorch versions, use the Jupyter notebooks with manual environment configuration
- Use the pre-configured Docker deployment or Streamlit APP for optimal compatibility
- Automatically handles dependencies and version conflicts
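Before choosing between the Docker, Streamlit, and notebook routes, it can help to see which of the versions listed above are present locally. This is a minimal sketch using only the Python standard library; `installed_version` and `environment_report` are hypothetical helper names, and the check reports versions without enforcing them.

```python
import sys
from importlib import metadata
from typing import Dict, Optional


def installed_version(package: str) -> Optional[str]:
    """Return the installed version of a distribution, or None if it is missing."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


def environment_report() -> Dict[str, Optional[str]]:
    """Collect the versions that the compatibility list refers to."""
    return {
        "python": ".".join(str(part) for part in sys.version_info[:3]),
        "torch": installed_version("torch"),
        "docling": installed_version("docling"),
    }


report = environment_report()
print(report)
if sys.version_info < (3, 11):
    print("Python 3.11+ is recommended for this project")
```

When PyTorch is installed, `torch.version.cuda` and `torch.cuda.is_available()` additionally report the CUDA build and whether a GPU is visible.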
If your CUDA or PyTorch versions differ from the recommended versions:
All Platforms (Recommended):
- Use `PDF_Parsing.ipynb` with the `Functions/` folder
- Cross-platform compatibility using Ollama for AI model management
- HuggingFace integration for specialized chart analysis (ChartGemma, DePlot)
- Works on Windows, macOS, and Linux
LM Studio Users (Windows-optimized):
- Use `PDF_Parsing_LMStudio.ipynb` for LM Studio-specific workflow
- Requires LM Studio installation and model service mounting
- Optimized for Windows environments with local model management
- Docling Official Docs: https://docling-project.github.io/docling/
- PyTorch Installation Guide: https://pytorch.org/get-started/previous-versions/
- CUDA Compatibility Matrix: https://docs.nvidia.com/deeplearning/frameworks/pytorch-release-notes/
Perfect for server environments, CI/CD pipelines, and production use cases.
Prerequisites:
- Docker Desktop installed
- NVIDIA GPU with 8GB+ VRAM (recommended)
- NVIDIA drivers (any recent version)
- NVIDIA Container Toolkit
- 16GB+ RAM
- 50GB+ free disk space
Quick Setup:
cd docker_deployment
./deploy.bat # Windows
./deploy.sh # Linux/macOS
Features:
- ✅ Headless operation (no GUI required)
- ✅ GPU acceleration support
- ✅ SmolDocling VLM pipeline
- ✅ Automated dependency management
- ✅ Volume mounting for input/output
- ✅ Production-optimized logging
Ideal for researchers, analysts, and interactive document processing.
Prerequisites:
- Python 3.11+
- AI provider API keys (OpenAI, Gemini, Claude) OR LM Studio
Setup:
# Navigate to GUI application
cd PDF_Analyzer_App
# Install dependencies
pip install -r requirements.txt
# Configure AI provider (choose one):
# - Set API keys in sidebar for cloud providers
# - Install and run LM Studio for local processing
# Launch application
streamlit run app.py
Features:
- ✅ Interactive web interface
- ✅ Real-time processing monitoring
- ✅ Multiple AI provider support
- ✅ Batch processing capabilities
- ✅ Interactive evaluation reports
- ✅ Advanced filtering and sorting
| Provider | Features | Setup |
|---|---|---|
| Google Gemini | Native web search, latest models | Set GOOGLE_API_KEY |
| OpenAI GPT | GPT-4o, web browsing* | Set OPENAI_API_KEY |
| Anthropic Claude | Advanced reasoning | Set ANTHROPIC_API_KEY |
| LM Studio | Privacy, offline | Install LM Studio + vision model |

*Web browsing availability varies by model and account type
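To see which cloud providers from the table are usable in the current shell, one option is to check the environment variables it names. This is a minimal sketch, assuming the keys are supplied as environment variables (the Streamlit app also accepts them in its sidebar); `configured_providers` is a hypothetical helper, not part of the project.

```python
import os
from typing import Dict, List, Mapping

# Environment variable names as listed in the provider table.
PROVIDER_ENV_VARS: Dict[str, str] = {
    "Google Gemini": "GOOGLE_API_KEY",
    "OpenAI GPT": "OPENAI_API_KEY",
    "Anthropic Claude": "ANTHROPIC_API_KEY",
}


def configured_providers(env: Mapping[str, str] = os.environ) -> List[str]:
    """Return the cloud providers whose API key is set to a non-empty value."""
    return [
        provider
        for provider, variable in PROVIDER_ENV_VARS.items()
        if env.get(variable, "").strip()
    ]


# Pass a plain dict to try it without touching the real environment.
print(configured_providers({"OPENAI_API_KEY": "sk-example"}))
```

An empty result means no cloud key is set, in which case LM Studio is the remaining option from the table.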
graph TD
A[PDF Input] --> B[Docling Extraction]
B --> C[Image Detection]
C --> D[AI Analysis]
D --> E{Image Type?}
E -->|Data Visualization| F[Extract Data Points]
E -->|Conceptual| G[Web Search]
F --> H[Enhanced JSON]
G --> H
H --> I[NLP-Ready Output]
H --> J[HTML Reports]
- Document Processing - Extract text and images using Docling VLM pipeline
- Image Analysis - Classify images as informative vs. non-informative
- AI Enhancement - Generate detailed descriptions using vision models
- Web Context - Add relevant background information via web search
- Report Generation - Create interactive HTML evaluation reports
- Output Creation - Generate multiple output formats for different use cases
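The branch in the diagram, where data visualizations go to data extraction and conceptual images go to web search, amounts to a dispatch on the classification label. The sketch below illustrates that idea only: the handler functions and the dictionary layout are hypothetical placeholders, and just the two labels `DATA_VISUALIZATION` and `CONCEPTUAL` come from this README.

```python
from typing import Callable, Dict

ImageRecord = Dict[str, str]


def extract_data_points(image: ImageRecord) -> ImageRecord:
    # Placeholder for the chart branch (data extraction).
    return {**image, "enrichment": "chart_data"}


def add_web_context(image: ImageRecord) -> ImageRecord:
    # Placeholder for the conceptual branch (web search).
    return {**image, "enrichment": "web_context"}


HANDLERS: Dict[str, Callable[[ImageRecord], ImageRecord]] = {
    "DATA_VISUALIZATION": extract_data_points,
    "CONCEPTUAL": add_web_context,
}


def enhance_image(image: ImageRecord) -> ImageRecord:
    """Route one classified image to its branch; unknown types pass through unchanged."""
    handler = HANDLERS.get(image.get("type", ""))
    return handler(image) if handler else image


print(enhance_image({"id": "fig1", "type": "DATA_VISUALIZATION"}))
print(enhance_image({"id": "fig2", "type": "CONCEPTUAL"}))
```

A lookup table keeps the routing in one place, so adding another image type means registering one more handler.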
PDF-Image-Analyzer/
├── PDF_Analyzer_App/                  # Streamlit GUI Application
│   ├── app.py                         # Main application entry point
│   ├── pdf_processor.py               # PDF to JSON conversion
│   ├── image_analyzer.py              # AI image analysis
│   ├── api_manager.py                 # Multi-provider AI interface
│   ├── html_report_generator.py       # Report generation
│   └── output/                        # Processing results
│
├── Functions/                         # Modular function package (new)
│   ├── pdf_to_json.py                 # LM Studio + Docling PDF→JSON
│   ├── image_analysis.py              # Image analysis + web search + enrichment
│   ├── chart_extraction.py            # DePlot extraction + robust parser
│   ├── pipeline_steps.py              # Step1 (enhance) + Step2 (NLP-ready)
│   ├── verification.py                # Final JSON validator (returns dict)
│   ├── utils_logging.py               # Logging setup
│   └── debug_deplot_test.py           # Headless DePlot tester
│
├── docker_deployment/                 # Production Docker Setup
│   ├── main.py                        # Containerized processing engine
│   ├── Dockerfile                     # Container configuration
│   ├── deploy.sh/.bat                 # Automated deployment
│   ├── quick-start-*.sh/.bat          # One-click installers
│   └── docs/                          # Deployment documentation
│
├── Research Notebooks/                # Development & Analysis Tools
│   ├── PDF_extract_and_Picture_Describe.ipynb  # Original research
│   ├── data_extract.ipynb             # Chart data extraction
│   └── json2html.ipynb                # Format conversion utilities
│
├── Sample Data/
│   ├── Sample Papers/                 # Test PDF documents
│   └── Sample Line Chart/             # Chart analysis examples
│
├── PDF_Parsing.ipynb                  # Universal 2-stage pipeline notebook (Ollama + HuggingFace)
│                                      # Stage 1: PDF→JSON; Stage 2: JSON→Enhanced→NLP-ready + verification
│                                      # Cross-platform: Windows, macOS, Linux
├── PDF_Parsing_LMStudio.ipynb         # LM Studio optimized notebook (Windows-focused)
│                                      # Same functionality but uses LM Studio for local model management
│
└── Documentation/
    ├── PROJECT_FUNCTIONS_DOCUMENTATION.md  # Complete function reference
    └── README.md                      # This file
GUI Application:
# Upload multiple PDFs through the web interface
# Processing happens automatically with progress tracking
# Results are organized in tabs for easy navigation
Docker Command Line:
# Place PDFs in input directory
docker run --gpus all -v ./input:/app/input -v ./output:/app/output pdf-analyzer:latest
# Results appear in output directory
# Enhanced JSON, NLP-ready versions, and HTML reports generated
# Add a new AI provider in api_manager.py (as a method of the API manager class)
def _call_custom_provider(self, image_uri: str, prompt: str, max_tokens: int):
    # Implement the custom API integration here
    pass
# Chart data extraction
from data_extract import get_color_mask, find_axes_automatically
# JSON to HTML conversion
from json2html import convert_json_to_html
# Original LM Studio workflow
from PDF_extract_and_Picture_Describe import convert_pdf_with_lmstudio
- `filename.json` - Original Docling extraction with embedded images
- `filename_enhanced.json` - Added AI analysis and web context
- `filename_nlp_ready.json` - Text-only version optimized for NLP processing
- `filename_report.html` - Interactive image analysis report
- `filename_complete_report.html` - Comprehensive evaluation dashboard
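The naming pattern above makes it straightforward to locate every artifact for a given PDF. Here is a minimal sketch: `expected_outputs` and its dictionary keys are hypothetical, and the default `output` directory is an assumption based on the Docker volume mount shown earlier.

```python
from pathlib import Path
from typing import Dict


def expected_outputs(pdf_path: str, output_dir: str = "output") -> Dict[str, Path]:
    """Map each output kind to the file name pattern listed above."""
    stem = Path(pdf_path).stem  # "paper.pdf" -> "paper"
    out = Path(output_dir)
    return {
        "docling_json": out / f"{stem}.json",
        "enhanced_json": out / f"{stem}_enhanced.json",
        "nlp_ready_json": out / f"{stem}_nlp_ready.json",
        "image_report": out / f"{stem}_report.html",
        "complete_report": out / f"{stem}_complete_report.html",
    }


for kind, path in expected_outputs("input/paper.pdf").items():
    print(f"{kind}: {path}")
```

A downstream script could use this mapping to check which files exist after a run before loading the NLP-ready JSON.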
- Interactive image viewing with click-to-enlarge
- Web search results with source attribution
- Processing statistics and quality metrics
- Evaluation checklists for systematic assessment
- Exportable results and downloadable assets
- Extract and analyze figures from research papers
- Generate searchable databases of academic content
- Create enhanced digital libraries
- Process financial reports and presentations
- Extract insights from charts and visualizations
- Generate summaries for executive briefings
- Digitize and enhance document archives
- Create searchable content databases
- Automate document classification
- Preprocess documents for ML pipelines
- Extract structured data from unstructured sources
- Generate training datasets for vision models
We welcome contributions! Please see our contributing guidelines:
- Fork the repository
- Create a feature branch (`git checkout -b feature/amazing-feature`)
- Commit your changes (`git commit -m 'Add amazing feature'`)
- Push to the branch (`git push origin feature/amazing-feature`)
- Open a Pull Request
# Clone repository
git clone https://github.com/hongyu-liao/Capstone.git
cd Capstone
# Install development dependencies
pip install -r PDF_Analyzer_App/requirements.txt
pip install -r docker_deployment/requirements.txt
# Install research dependencies
pip install jupyter opencv-python matplotlib
# Run tests
python -m pytest
Minimum:
- 8GB RAM
- 20GB free disk space
- Python 3.11+
- Internet connection (for AI APIs and web search)
Recommended:
- 16GB+ RAM
- NVIDIA GPU with 8GB+ VRAM
- 50GB+ free disk space
- High-speed internet connection
Core Dependencies:
- docling>=2.40.0 (PDF processing)
- streamlit>=1.47.1 (GUI interface)
- requests, pandas, numpy (data processing)
- Pillow (image handling)
AI Provider SDKs:
- openai (OpenAI GPT)
- google-genai (Google Gemini)
- anthropic (Claude)
Optional Dependencies:
- opencv-python (chart analysis)
- matplotlib (visualization)
- jupyter (research notebooks)
This project is licensed under the MIT License - see the LICENSE file for details.
- Docling Team for the excellent PDF processing framework
- Streamlit for the intuitive web interface framework
- AI Providers (OpenAI, Google, Anthropic) for powerful vision models
- LM Studio for local AI processing capabilities
- DuckDuckGo for privacy-respecting web search
- Issues: GitHub Issues
- Documentation: Function Reference
- Docker Guide: Docker Deployment
- Troubleshooting: Docker Troubleshooting
- ✅ Complete Docker containerization
- ✅ One-click deployment scripts
- ✅ Multi-platform support (Windows/Linux/macOS)
- ✅ GitHub Release automation
- ✅ Comprehensive English documentation
- ✅ Production-ready error handling
- Introduced `Functions/` package (clean modularization)
- Added `PDF_Parsing.ipynb` (formerly `Reproduce_Two_Stages.ipynb`) to independently run two-stage processing
- Added headless DePlot tester and robust parser improvements
- ✅ Integrated inline final JSON verification (no file output)
- AI-Generated Prompt System: Dynamic ChartGemma prompt generation using AI models
- Enhanced App Integration: Full intelligent prompt generation support in the Streamlit app
- Docker Intelligence Upgrade: Advanced prompt generation with model-based and enhanced fallback methods
- Performance Optimization: Smart content-based prompt selection for different chart types
- Complete Notebook Parity: The app and Docker now match the notebook's intelligent analysis capabilities
- Interactive Model Selection: Docker now prompts for Hugging Face model addresses with intelligent defaults
- Flexible PDF Conversion: Choose between SmolDocling (recommended) or custom Hugging Face models
- Progress Tracking: Real-time progress display showing current image processing status
- Streamlined Output: Always generates both Enhanced and NLP-ready JSON files with correct naming
- Enhanced Docker UX: Better user interaction and configuration options for production deployments
- Universal Notebook: `PDF_Parsing.ipynb` - True cross-platform compatibility (Windows, macOS, Linux)
- Hybrid AI Integration: Combines Ollama (cross-platform) + HuggingFace (specialized) for optimal performance
- Smart Model Distribution: Ollama for general analysis, ChartGemma/DePlot for specialized chart processing
- LM Studio Specialization: `PDF_Parsing_LMStudio.ipynb` for Windows users preferring LM Studio workflow
- Optimized Performance: Each AI model used for its strengths - Ollama for compatibility, HuggingFace for accuracy
- Enhanced Documentation: Clear platform support matrix and model usage guidelines
⭐ If this project helps you, please give it a star on GitHub! ⭐