Merged
131 commits
f9ed2aa
Initial project setup with dependencies and license
Mar 28, 2025
8001947
Add project README with documentation and usage instructions
Mar 28, 2025
d97da0c
Add core utility functions for integrations
Mar 28, 2025
74a12b0
Add data schemas for the OCR pipeline
Mar 28, 2025
db67c1e
Add OCR comparison pipeline framework
Mar 28, 2025
377263b
Add model evaluation step for OCR comparison
Mar 28, 2025
f09655d
Add main execution script for OCR pipeline
Mar 28, 2025
0de7871
Add image encoding, metrics calculation and prompt utilities
Mar 28, 2025
25347d3
Add OCR implementation steps for Gemma3 and Mistral models
Mar 28, 2025
2387351
Add Streamlit web interface for interactive OCR comparison
Mar 28, 2025
71b8b94
Add sample images for OCR testing
Mar 28, 2025
e885c25
update README
Mar 28, 2025
c6b5ce6
Add configuration settings for OCR pipeline
Mar 28, 2025
2b45149
Add standalone script for quick OCR comparison without ZenML
Mar 28, 2025
3ea4f93
Add poetry.lock file to lock dependencies
Mar 28, 2025
66d0de1
Add pip requirements file for non-Poetry installations
Mar 28, 2025
a747636
Add ImageDescription pydantic model for the OCR pipeline
Mar 28, 2025
1c4ee21
update readme
Mar 28, 2025
c70c6e9
update assets dir structure
Mar 28, 2025
b60e1a6
update prompt.py: add confidence for extracted text
Mar 29, 2025
08a8fc6
add confidence field to ImageDescription
Mar 29, 2025
b14b16a
organize assets dir
Mar 29, 2025
4882760
remove integrations utils: remove mlflow, and docker for now
Mar 29, 2025
991298b
add pipelines __init__ file
Mar 29, 2025
235993a
add other init files
Mar 29, 2025
e34e924
add run_ocr.py file: reusable ocr step for various models
Mar 29, 2025
9210dd2
delete run_gemma3_ocr and run_mistral_ocr in favor of unified run_ocr…
Mar 29, 2025
4892b3f
update ocr_pipeline to save ocr results and visualizations
Mar 29, 2025
168e3c5
update save ocr results step
Mar 29, 2025
43da0dc
add configs dir with ocr_config yaml file
Mar 29, 2025
ffa8601
add config util file
Mar 29, 2025
b7d0b93
add io utils file for loading images/files and for saving results
Mar 29, 2025
21711ca
add ocr model utils file for processing chat completions for various …
Mar 29, 2025
6a3fd96
add step for loading images/files
Mar 29, 2025
19193a4
add run.py file
Mar 29, 2025
5b9afed
pass each result dict to save_ocr_results step to resolve StepInterfa…
Mar 29, 2025
82f2dd7
update ocr_pipeline to pass model_names in addition to results when s…
Mar 29, 2025
d191735
refactor error_analysis and metrics under 1 file and remove confidenc…
Mar 29, 2025
4928919
add detailed html string containing ocr results, error analysis and o…
Mar 29, 2025
b4c5987
update config file with default image folder
Mar 29, 2025
34cb61a
update app.py
Mar 29, 2025
625156a
Add detailed metrics to streamlit UI app
Mar 29, 2025
efcdd1d
Update README.md: fix broken links, and add project organization
Mar 29, 2025
023ca45
remove pyproject.toml and poetry.lock in favor of single requirements…
Mar 29, 2025
b3f1137
update requirements.txt
Mar 29, 2025
5f01ba3
remove mlflow from readme
Mar 29, 2025
7efe6d7
delete unused assets
Mar 30, 2025
682ebd8
update prompt to clarify entity output
Mar 30, 2025
3f739b9
refactor app.py: simplify and remove entities/description
Mar 31, 2025
4c00594
update assets
Mar 31, 2025
70bbacf
add model_info util for getting model metadata throughout the app
Mar 31, 2025
3e894e8
update save results to dynamically render model info
Mar 31, 2025
c5ff0ff
move ImageDescription schema into prompt
Mar 31, 2025
cb22d7d
update steps to integrate changes
Mar 31, 2025
70c6f45
update ocr pipeline
Mar 31, 2025
a33b38c
update run and run_compare_ocr entrypoint files
Mar 31, 2025
8f44467
refactor main.py -- remove ground-truth-texts flag
Mar 31, 2025
ee4bc8d
update README.md
Mar 31, 2025
1063522
update ocr_config.yaml: remove image_patterns, update keys
Mar 31, 2025
d8c6ded
update assets
Mar 31, 2025
3be4143
delete config.yaml
Mar 31, 2025
5e0897a
update ocr model utils to process images with ollama api rather than …
Mar 31, 2025
9e06ff7
remove running_from_ui param
Mar 31, 2025
27e610c
add run_ollama_ocr_from_ui function for streamlit app
Mar 31, 2025
bbfd4e6
add docker settings for ocr pipeline
Mar 31, 2025
d1eb77e
update run_compare_ocr
Mar 31, 2025
17e1a4c
refactor run_ocr
Mar 31, 2025
0de8ddc
add Dockerfile and .dockerignore
Mar 31, 2025
16f3b77
update requirements.txt
Mar 31, 2025
b65894b
refactor: revert to use litellm+instructor for ollama models
Mar 31, 2025
9b4e584
edit api_key access for OpenAI client
Mar 31, 2025
87c05df
add logos for different ocr models
Apr 1, 2025
73c3aba
update utils for multi-model ocr
Apr 1, 2025
58ac407
update steps for multi-model ocr changes
Apr 1, 2025
6a0295f
add model_configs util for centralized model configuration and client…
Apr 1, 2025
4d96ae3
rename ocr_model_utils to ocr_processing
Apr 1, 2025
991e89f
update pipeline for multi-model ocr and update docker settings
Apr 1, 2025
0c05d71
update entrypoint run files
Apr 1, 2025
9ad7c8a
refactor streamlit app for readability and allow any number of models…
Apr 1, 2025
e316cfc
update README
Apr 1, 2025
ff58276
update ocr_config.yaml to integrate new changes
Apr 1, 2025
0e3638c
update assets
Apr 1, 2025
9095d01
update assets
Apr 1, 2025
7ec8df3
improve UI for comparing and displaying multiple models
Apr 1, 2025
28d698a
add extract json util
Apr 1, 2025
70a9c5e
add .env.example
Apr 1, 2025
e5f5924
update pipeline
Apr 1, 2025
7e7e108
update ocr processing, metrics, and model configs
Apr 1, 2025
233f95c
update utils and revamp evaluate_models
Apr 1, 2025
903923b
pass MODEL_CONFIGS in streamlit ocr processing
Apr 1, 2025
a3be487
set defaults for ground_truth_model for project
Apr 1, 2025
515aa0c
add images for README
Apr 1, 2025
e877918
update requirements.txt
Apr 1, 2025
f328c41
remove demo_models from utils
Apr 1, 2025
46bfb7f
update README: add images, links, docs
Apr 1, 2025
f1ec1b5
simplify assets dir
Apr 3, 2025
575a0d6
delete unused assets
Apr 3, 2025
6ecc9b5
update steps
Apr 3, 2025
d6cc858
add loaders.py
Apr 3, 2025
92e3fa0
update utils
Apr 3, 2025
8e70923
separate batch ocr from evaluation into 2 pipelines
Apr 3, 2025
7c2e498
add ground_truth_texts samples matching images provided
Apr 3, 2025
5b82b7e
remove run_compare_ocr and main.py in favor of run.py
Apr 3, 2025
987477d
update config.yaml to align with zenml config file definition
Apr 3, 2025
82a4bcc
update README
Apr 3, 2025
f501c4a
update .env.example
Apr 3, 2025
8447714
re-integrate docker settings and cleanup config.yaml
Apr 4, 2025
7062613
update image links in README
Apr 4, 2025
c31edfa
update visualization img
Apr 4, 2025
868e3b7
update assets
Apr 4, 2025
7496685
refactor utils
Apr 7, 2025
a792081
move visualization logic into util file
Apr 7, 2025
32a1bbc
refactor steps: remove local saving/loading and integrate artifacts
Apr 7, 2025
637655a
refactor pipelines: remove save/load from local dirs + split config f…
Apr 7, 2025
a7e4fa8
split config into dedicated config files for each pipeline
Apr 7, 2025
fc45c9d
add schemas dir
Apr 7, 2025
394e8ce
update requirements.txt
Apr 7, 2025
aa0190f
update run.py: simplify args, and integrate new config structure
Apr 7, 2025
5dcf313
delete main.py
Apr 7, 2025
94d8e62
update configs
Apr 7, 2025
09e847f
add docker settings in pipeline definitions
Apr 7, 2025
51c8eae
update README
Apr 7, 2025
df81509
add analyse and Labour to .typos.toml
Apr 7, 2025
bbbd82c
cleanup pipelines and add small html visualization for batch pipeline
Apr 8, 2025
bcaf3df
cleanup utils
Apr 8, 2025
2d4b4bf
update steps/evaluate models to use combined dataframe from updated r…
Apr 8, 2025
09c5704
update loader to work with Dataframe directly, and not a dict
Apr 8, 2025
cb91618
coerce potential lists being returned in model responses to strings
Apr 8, 2025
b443344
update README
Apr 8, 2025
894366f
cleanup configs and run_ocr
Apr 8, 2025
d8a654d
cleanup requirements.txt
Apr 8, 2025
1 change: 1 addition & 0 deletions omni-reader/.env.example
@@ -1,2 +1,3 @@
OPENAI_API_KEY=your_openai_api_key
MISTRAL_API_KEY=your_mistral_api_key
OLLAMA_HOST=base_url_for_ollama_host # defaults to "http://localhost:11434/api/generate" if not set
271 changes: 119 additions & 152 deletions omni-reader/README.md
@@ -1,6 +1,48 @@
# OmniReader - Multi-model text extraction comparison
# OmniReader

OmniReader is a document processing workflow that ingests unstructured documents (PDFs, images, scans) and extracts text using multiple OCR models. It provides side-by-side comparison of extraction results, highlighting differences in accuracy, formatting, and content recognition. The multi-model approach allows users to evaluate OCR performance across different document types, languages, and formatting complexity. OmniReader delivers reproducible, automated, and cloud-agnostic analysis, with comprehensive metrics on extraction quality, processing time, and confidence scores for each model. It also supports parallel processing for faster batch operations and can compare an arbitrary number of models simultaneously.
A scalable multi-model text extraction solution for unstructured documents.

<div align="center">
<img src="assets/docs/pipeline_dags.png" alt="Pipeline DAG" width="600" />
</div>

✨ **Extract Structured Text from Any Document**
OmniReader is built for teams who routinely work with unstructured documents (e.g., PDFs, images, scanned forms) and want a scalable workflow for structured text extraction. It provides an end-to-end batch OCR pipeline with optional multi-model comparison to help ML engineers evaluate different OCR solutions before deployment.

<div align="center">
<img src="assets/docs/visualization.png" alt="HTML Visualization of OCR Results" width="800"/>
<p><em>HTML visualization showing metrics and comparison results from the OCR pipeline</em></p>
</div>

## 🌟 Key Features

- **End-to-end workflow management** from evaluation to production deployment
- **Multi-model comparison** to identify the best model for your specific document types
- **Scalable batch processing** that can handle enterprise document volumes
- **Quantitative evaluation metrics** to inform business and technical decisions
- **ZenML integration** providing reproducibility, cloud-agnostic deployment, and monitoring

## 🎭 How It Works

OmniReader provides two primary pipeline workflows that can be run separately:

1. **Batch OCR Pipeline**: Run large batches of documents through a single model to extract structured text and metadata.
2. **Evaluation Pipeline**: Compare multiple OCR models side-by-side against ground truth text files and generate evaluation reports with CER/WER metrics and HTML visualizations (see the metric sketch below).

Behind the scenes, OmniReader leverages state-of-the-art vision-language models and ZenML's MLOps framework to create a reproducible, scalable document processing system.
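
The CER/WER numbers that the evaluation pipeline reports are standard edit-distance metrics. The sketch below is a minimal, self-contained illustration of how they are commonly computed; it is not necessarily the repository's exact implementation.

```python
# Minimal sketch of character/word error rate metrics (illustrative only --
# the project's own evaluation utilities may compute these differently).
def _levenshtein(a, b) -> int:
    """Edit distance (insertions, deletions, substitutions) between two sequences."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        curr = [i]
        for j, y in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,              # deletion
                curr[j - 1] + 1,          # insertion
                prev[j - 1] + (x != y),   # substitution (free on a match)
            ))
        prev = curr
    return prev[-1]

def cer(prediction: str, ground_truth: str) -> float:
    """Character Error Rate: edit distance divided by ground-truth length."""
    return _levenshtein(prediction, ground_truth) / max(len(ground_truth), 1)

def wer(prediction: str, ground_truth: str) -> float:
    """Word Error Rate: edit distance over ground-truth word count."""
    ref = ground_truth.split()
    return _levenshtein(prediction.split(), ref) / max(len(ref), 1)
```

For example, `cer("helo", "hello")` returns 0.2: one missing character against a five-character reference.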

## 📚 Supported Models

OmniReader supports a wide range of OCR models, including:

- **Mistral/pixtral-12b-2409**: Mistral AI's vision-language model specializing in document understanding with strong OCR capabilities for complex layouts.
- **GPT-4o-mini**: OpenAI's efficient vision model offering a good balance of accuracy and speed for general document processing tasks.
- **Gemma3:27b**: Google's open-source multimodal model supporting 140+ languages with a 128K context window, optimized for text extraction from diverse document types.
- **Llava:34b**: Large multilingual vision-language model with strong performance on document understanding tasks requiring contextual interpretation.
- **Llava-phi3**: Microsoft's efficient multimodal model combining phi-3 language capabilities with vision understanding, ideal for mixed text-image documents.
- **Granite3.2-vision**: Specialized for visual document understanding, offering excellent performance on tables, charts, and technical diagrams.

> ⚠️ Note: For production deployments, we recommend using the non-GGUF hosted model versions via their respective APIs for better performance and accuracy. The Ollama models mentioned here are primarily for convenience.
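
For reference, a centralized registry such as the `utils/model_configs.py` module listed in the project structure might map model names to provider details. The entry shape and field names below are purely illustrative assumptions, not the repository's actual schema.

```python
# Hypothetical sketch of a centralized model registry; field names are assumptions.
MODEL_CONFIGS = {
    "gpt-4o-mini": {
        "provider": "openai",        # hosted API, needs OPENAI_API_KEY
        "litellm_name": "gpt-4o-mini",
    },
    "mistral/pixtral-12b-2409": {
        "provider": "mistral",       # hosted API, needs MISTRAL_API_KEY
        "litellm_name": "mistral/pixtral-12b-2409",
    },
    "gemma3:27b": {
        "provider": "ollama",        # local model served by Ollama
        "litellm_name": "ollama/gemma3:27b",
    },
}
```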

## 🚀 Getting Started

Expand All @@ -10,215 +52,140 @@ OmniReader is a document processing workflow that ingests unstructured documents
- Mistral API key (set as environment variable `MISTRAL_API_KEY`)
- OpenAI API key (set as environment variable `OPENAI_API_KEY`)
- ZenML >= 0.80.0
- Ollama (required for running local models)

### Installation
### Quick Start

```bash
# Clone the repository
git clone https://github.com/yourusername/omni-reader.git

# Navigate to OmniReader
cd omni-reader

# Install dependencies
pip install -r requirements.txt

# Start Ollama (if using local models)
ollama serve
```

### Configuration
### Prepare Your Models

1. Ensure any Ollama models you want to use are pulled, e.g.:
If using local models, ensure any Ollama models you want to use are pulled:

```bash
ollama pull llama3.2-vision:11b
ollama pull gemma3:12b
ollama pull gemma3:27b
ollama pull llava-phi3
ollama pull granite3.2-vision
```

2. Set the following environment variables:
### Set Up Your Environment

Configure your API keys:

```bash
OPENAI_API_KEY=your_openai_api_key
MISTRAL_API_KEY=your_mistral_api_key
export OPENAI_API_KEY=your_openai_api_key
export MISTRAL_API_KEY=your_mistral_api_key
export OLLAMA_HOST=base_url_for_ollama_host # defaults to "http://localhost:11434/api/generate" if not set
```
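
For illustration, the snippet below shows how an application could resolve these variables at runtime. The fallback URL is the default documented in `.env.example`; the variable-reading code itself is an assumption, not copied from the repository.

```python
import os

# Illustrative only: how the app might read these settings at startup.
OPENAI_API_KEY = os.getenv("OPENAI_API_KEY")
MISTRAL_API_KEY = os.getenv("MISTRAL_API_KEY")
OLLAMA_HOST = os.getenv("OLLAMA_HOST", "http://localhost:11434/api/generate")
```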

## 📌 Usage

### Using YAML Configuration (Recommended)
### Run OmniReader

```bash
# Use the default config (ocr_config.yaml)
# Use the default config (config.yaml)
python run.py

# Run with a custom config file
python run.py --config my_config.yaml
```

### Configuration Structure

YAML configuration files organize parameters into logical sections:

```yaml
# Image input configuration
input:
  image_paths: [] # List of specific image paths
  image_folder: "./assets" # Folder containing images

# OCR model configuration
models:
  custom_prompt: null # Optional custom prompt for all models
  # Either specify individual models (for backward compatibility)
  model1: "llama3.2-vision:11b" # First model for comparison
  model2: "mistral/pixtral-12b-2409" # Second model for comparison
  # Or specify multiple models as a list (new approach)
  models: ["llama3.2-vision:11b", "mistral/pixtral-12b-2409"]
  ground_truth_model: "gpt-4o-mini" # Model to use for ground truth when source is "openai"

# Ground truth configuration
ground_truth:
  source: "openai" # Source: "openai", "manual", "file", or "none"
  texts: [] # Ground truth texts (for manual source)
  file: null # Path to ground truth JSON file (for file source)

# Output configuration
output:
  ground_truth:
    save: false # Whether to save ground truth data
    directory: "ocr_results" # Directory for ground truth data

  ocr_results:
    save: false # Whether to save OCR results
    directory: "ocr_results" # Directory for OCR results

  visualization:
    save: false # Whether to save HTML visualization
    directory: "visualizations" # Directory for visualization
```
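
As an illustration of consuming this file, the sketch below loads the YAML and resolves the model list, preferring the list form and falling back to the legacy `model1`/`model2` keys. The `load_ocr_config()` helper is hypothetical; the repository's own `utils/config.py` may expose a different interface.

```python
import yaml  # assumes PyYAML is available

# Hypothetical helper -- the repository's utils/config.py may differ.
def load_ocr_config(path: str = "configs/ocr_config.yaml") -> dict:
    with open(path) as f:
        return yaml.safe_load(f)

config = load_ocr_config()
models_cfg = config["models"]

# Prefer the list form; fall back to the legacy model1/model2 keys.
models = models_cfg.get("models") or [
    m for m in (models_cfg.get("model1"), models_cfg.get("model2")) if m
]
print(models)  # e.g. ['llama3.2-vision:11b', 'mistral/pixtral-12b-2409']
```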

### Running with Command Line Arguments

You can still use command line arguments for quick runs:

```bash
# Run with default settings (processes all images in assets directory)
python run.py --image-folder assets

# Run with custom prompt
python run.py --image-folder assets --custom-prompt "Extract all text from this image."

# Run with specific images
python run.py --image-paths assets/image1.jpg assets/image2.png
### Interactive UI

# Run with OpenAI ground truth for evaluation
python run.py --image-folder assets --ground-truth openai --save-ground-truth
The project also includes a Streamlit app that allows you to:

# List available ground truth files
python run.py --list-ground-truth-files

# For quicker processing of a single image without metadata or artifact tracking
python run_compare_ocr.py --image assets/your_image.jpg --model both

# Run comparison with multiple specific models in parallel
python run_compare_ocr.py --image assets/your_image.jpg --model "gemma3:12b,llama3.2-vision:11b,moondream"

# Run comparison with all available models in parallel
python run_compare_ocr.py --image assets/your_image.jpg --model all
```

### Using the Streamlit App

For interactive use, the project includes a Streamlit app:
- Upload documents for instant OCR processing
- Compare results from multiple models side-by-side
- Experiment with custom prompts to improve extraction quality

```bash
# Launch the Streamlit interface
streamlit run app.py
```

<div align="center">
<img src="assets/streamlit.png" alt="Streamlit UI Interface" width="800"/>
<p><em>Interactive Streamlit interface for easy document processing and model comparison</em></p>
<img src="assets/docs/streamlit.png" alt="Model Comparison Results" width="800"/>
<p><em>Side-by-side comparison of OCR results across different models</em></p>
</div>
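
A heavily simplified sketch of this kind of layout is shown below; the real `app.py` is more elaborate, and `run_ocr_for_model()` here is a hypothetical placeholder rather than a function from the repository.

```python
import streamlit as st

def run_ocr_for_model(model: str, image_bytes: bytes, prompt: str) -> str:
    """Hypothetical placeholder -- the real app dispatches to the shared OCR utilities."""
    return f"(OCR output from {model} would appear here)"

st.title("OmniReader - OCR Model Comparison")

uploaded = st.file_uploader("Upload an image", type=["png", "jpg", "jpeg"])
prompt = st.text_area("Custom prompt (optional)", "Extract all text from this image.")
models = st.multiselect(
    "Models to compare",
    ["gpt-4o-mini", "mistral/pixtral-12b-2409", "gemma3:27b"],
)

if uploaded and models and st.button("Run OCR"):
    columns = st.columns(len(models))  # one column per selected model
    for column, model in zip(columns, models):
        with column:
            st.subheader(model)
            st.write(run_ocr_for_model(model, uploaded.getvalue(), prompt))
```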

### Remote Artifact Storage and Execution
## ☁️ Cloud Deployment

This project supports storing artifacts remotely and executing pipelines on cloud infrastructure. Follow these steps to configure your environment for remote operation:
OmniReader supports storing artifacts remotely and executing pipelines on cloud infrastructure:

#### Setting Up Cloud Provider Integrations

Install the appropriate ZenML integrations for your preferred cloud provider:
### Set Up Cloud Provider Integrations

```bash
# For AWS
zenml integration install aws s3 -y
zenml integration install aws s3

# For Azure
zenml integration install azure azure-blob -y
zenml integration install azure

# For Google Cloud
zenml integration install gcp gcs -y
zenml integration install gcp gcs
```

For detailed configuration options and other components, refer to the ZenML documentation:

- [Artifact Stores](https://docs.zenml.io/stacks/artifact-stores)
- [Orchestrators](https://docs.zenml.io/stacks/orchestrators)

## 📋 Pipeline Architecture

The OCR comparison pipeline consists of the following components:

### Steps

1. **Multi-Model OCR Step**: Processes images with multiple models in parallel
- Supports any number of models defined in configuration
- Models run in parallel using ThreadPoolExecutor
- Each model processes its assigned images with parallelized execution
- Progress tracking during batch processing
2. **Ground Truth Step**: Optional step that uses a reference model for evaluation (default: GPT-4o Mini)
3. **Evaluation Step**: Compares results and calculates metrics

The pipeline is fully configurable: the YAML configuration file controls which models are used for OCR comparison and for ground truth generation, and any number of models can run in parallel for broader comparisons. A minimal sketch of this parallel fan-out follows.
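
The sketch below illustrates that fan-out under stated assumptions: `run_single_model_ocr()` is a simplified stand-in for the repository's actual OCR call, not its real signature.

```python
from concurrent.futures import ThreadPoolExecutor

def run_single_model_ocr(model: str, image_paths: list[str]) -> dict:
    """Simplified stand-in: call the model on each image and collect the text."""
    return {path: f"(text extracted by {model})" for path in image_paths}

def run_multi_model_ocr(models: list[str], image_paths: list[str]) -> dict:
    """Fan the same batch of images out to every model in parallel."""
    results = {}
    with ThreadPoolExecutor(max_workers=len(models)) as pool:
        futures = {pool.submit(run_single_model_ocr, m, image_paths): m for m in models}
        for future, model in futures.items():
            results[model] = future.result()  # blocks until that model's batch is done
    return results

results = run_multi_model_ocr(
    ["gpt-4o-mini", "mistral/pixtral-12b-2409"],
    ["assets/image1.jpg", "assets/image2.png"],
)
```

Using one worker per model keeps slower providers from blocking faster ones; as the step list above notes, each model also parallelizes over its assigned images, which this sketch omits for brevity.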

### Configuration Management
Run your pipeline in the cloud:

The new configuration system provides:

- Structured YAML files for experiment parameters
- Parameter validation and intelligent defaults
- Easy sharing and version control of experiment settings
- Configuration generator for quickly creating new experiment setups
- Support for multi-model configuration via arrays
- Flexible model selection and comparison

### Metadata Tracking

ZenML's metadata tracking is used throughout the pipeline:
```bash
# Configure your cloud stack
zenml stack register my-cloud-stack -a cloud-artifact-store -o cloud-orchestrator
```

- Processing times and performance metrics
- Extracted text length and entity counts
- Comparison metrics between models (CER, WER)
- Progress tracking for batch operations
- Parallel processing statistics
For detailed configuration options and other components, refer to the ZenML documentation:

### Results Visualization
- [AWS Integration Guide](https://docs.zenml.io/how-to/popular-integrations/aws-guide)
- [GCP Integration Guide](https://docs.zenml.io/how-to/popular-integrations/gcp-guide)
- [Azure Integration Guide](https://docs.zenml.io/how-to/popular-integrations/azure-guide)

- Pipeline results are available in the ZenML Dashboard
- HTML visualizations can be automatically saved to configurable directories

<div align="center">
<img src="assets/html_visualization.png" alt="HTML Visualization of OCR Results" width="800"/>
<p><em>HTML visualization showing metrics and comparison results from the OCR pipeline</em></p>
</div>

## 📁 Project Organization
## 🛠️ Project Structure

```
omni-reader/
├── app.py # Streamlit UI for interactive document processing
├── assets/ # Sample images for ocr
├── configs/ # YAML configuration files
├── ground_truth_texts/ # Text files containing ground truth for evaluation
├── pipelines/ # ZenML pipeline definitions
│ ├── batch_pipeline.py # Batch OCR pipeline (single or multiple models)
│ └── evaluation_pipeline.py # Evaluation pipeline (multiple models)
├── steps/ # Pipeline step implementations
├── utils/ # Utility functions
│ ├── evaluate_models.py # Model comparison and metrics
│ ├── loaders.py # Loading images and ground truth texts
│ ├── run_ocr.py # Running OCR with selected models
│ └── save_results.py # Saving results and visualizations
├── utils/ # Utility functions and helpers
│ ├── ocr_processing.py # OCR processing core logic
│ ├── config.py # Configuration utilities
│ └── ...
├── run.py # Main script for running the pipeline
├── config_generator.py # Tool for generating experiment configurations
│ └── model_configs.py # Model configuration and registry
├── run.py # Main entrypoint for running the pipeline
└── README.md # Project documentation
```

## 🔗 Links
## 🔮 Use Cases

- **Document Processing Automation**: Extract structured data from invoices, receipts, and forms
- **Content Digitization**: Convert scanned documents and books into searchable digital content
- **Regulatory Compliance**: Extract and validate information from compliance documents
- **Data Migration**: Convert legacy paper documents into structured digital formats
- **Research & Analysis**: Extract data from academic papers, reports, and publications

## 📚 Documentation

For more information about ZenML and building MLOps pipelines, refer to the [ZenML documentation](https://docs.zenml.io/).

For model-specific documentation:

- [ZenML Documentation](https://docs.zenml.io/)
- [Mistral AI Vision Documentation](https://docs.mistral.ai/capabilities/vision/)
- [Gemma 3 Documentation](https://ai.google.dev/gemma/docs/integrations/ollama)
- [Ollama Models Library](https://ollama.com/library)