
Commit 82a4bcc (parent: 987477d)
Author: marwan37
Commit message: update README

File tree: 1 file changed (+119, -152 lines)

omni-reader/README.md

Lines changed: 119 additions & 152 deletions
@@ -1,6 +1,48 @@
-# OmniReader - Multi-model text extraction comparison
+# OmniReader
 
-OmniReader is a document processing workflow that ingests unstructured documents (PDFs, images, scans) and extracts text using multiple OCR models. It provides side-by-side comparison of extraction results, highlighting differences in accuracy, formatting, and content recognition. The multi-model approach allows users to evaluate OCR performance across different document types, languages, and formatting complexity. OmniReader delivers reproducible, automated, and cloud-agnostic analysis, with comprehensive metrics on extraction quality, processing time, and confidence scores for each model. It also supports parallel processing for faster batch operations and can compare an arbitrary number of models simultaneously.
+A scalable multi-model text extraction solution for unstructured documents.
+
+<div align="center">
+<img src="assets/docs/pipeline_dags.png" alt="Pipeline DAG" width="600" />
+</div>
+
+**Extract Structured Text from Any Document**
+OmniReader is built for teams who routinely work with unstructured documents (e.g., PDFs, images, scanned forms) and want a scalable workflow for structured text extraction. It provides an end-to-end batch OCR pipeline with optional multi-model comparison to help ML engineers evaluate different OCR solutions before deployment.
+
+<div align="center">
+<img src="assets/demo/html_visualization.png" alt="HTML Visualization of OCR Results" width="800"/>
+<p><em>HTML visualization showing metrics and comparison results from the OCR pipeline</em></p>
+</div>
+
+## 🌟 Key Features
+
+- **End-to-end workflow management** from evaluation to production deployment
+- **Multi-model comparison** to identify the best model for your specific document types
+- **Scalable batch processing** that can handle enterprise document volumes
+- **Quantitative evaluation metrics** to inform business and technical decisions
+- **ZenML integration** providing reproducibility, cloud-agnostic deployment, and monitoring
+
+## 🎭 How It Works
+
+OmniReader provides two primary pipeline workflows that can be run separately:
+
+1. **Batch OCR Pipeline**: Run large batches of documents through a single model to extract structured text and metadata.
+2. **Evaluation Pipeline**: Compare multiple OCR models side by side and generate evaluation reports (CER/WER metrics plus HTML visualizations) against ground truth text files.
+
+Behind the scenes, OmniReader leverages state-of-the-art vision-language models and ZenML's MLOps framework to create a reproducible, scalable document processing system.
+
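To make the two workflows above concrete, here is a minimal, hypothetical sketch of how a batch OCR pipeline could be composed with ZenML steps. The step names, signatures, and defaults are illustrative and are not taken from the project's `pipelines/` or `steps/` code.

```python
# Hypothetical sketch of a ZenML batch OCR pipeline (not the project's actual code).
from typing import Dict, List

from zenml import pipeline, step


@step
def load_images(image_folder: str) -> List[str]:
    """Collect the image paths that should be processed."""
    import glob
    import os

    return sorted(glob.glob(os.path.join(image_folder, "*")))


@step
def run_ocr(image_paths: List[str], model: str) -> Dict[str, str]:
    """Run one OCR model over every image and return extracted text keyed by path."""
    results = {}
    for path in image_paths:
        # Call the chosen OCR backend here (Ollama, Mistral, OpenAI, ...).
        results[path] = f"<extracted text for {path} using {model}>"
    return results


@step
def save_results(results: Dict[str, str]) -> None:
    """Persist the extracted text so it is tracked as a pipeline artifact."""
    for path, text in results.items():
        print(f"{path}: {len(text)} characters extracted")


@pipeline
def batch_ocr_pipeline(image_folder: str = "./assets", model: str = "gemma3:27b"):
    image_paths = load_images(image_folder)
    results = run_ocr(image_paths, model)
    save_results(results)


if __name__ == "__main__":
    batch_ocr_pipeline()
```

The evaluation pipeline would follow the same pattern, with additional steps that load ground truth texts and compute CER/WER for each model.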
+## 📚 Supported Models
+
+OmniReader supports a wide range of OCR models, including:
+
+- **Mistral/pixtral-12b-2409**: Mistral AI's vision-language model specializing in document understanding with strong OCR capabilities for complex layouts.
+- **GPT-4o-mini**: OpenAI's efficient vision model offering a good balance of accuracy and speed for general document processing tasks.
+- **Gemma3:27b**: Google's open-source multimodal model supporting 140+ languages with a 128K context window, optimized for text extraction from diverse document types.
+- **Llava:34b**: Large multilingual vision-language model with strong performance on document understanding tasks requiring contextual interpretation.
+- **Llava-phi3**: Microsoft's efficient multimodal model combining phi-3 language capabilities with vision understanding, ideal for mixed text-image documents.
+- **Granite3.2-vision**: Specialized for visual document understanding, offering excellent performance on tables, charts, and technical diagrams.
+
+> ⚠️ Note: For production deployments, we recommend using the non-GGUF hosted model versions via their respective APIs for better performance and accuracy. The Ollama models mentioned here are primarily for convenience.
 
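The list above mixes hosted APIs (Mistral, OpenAI) with locally served Ollama models. As a purely illustrative sketch of how such a mix might be kept in one registry (the project's actual registry lives in `utils/model_configs.py` and may look quite different):

```python
# Illustrative only: a simple registry mapping model names to the backend that serves them.
OCR_MODELS = {
    "mistral/pixtral-12b-2409": {"provider": "mistral", "hosted": True},
    "gpt-4o-mini": {"provider": "openai", "hosted": True},
    "gemma3:27b": {"provider": "ollama", "hosted": False},
    "llava:34b": {"provider": "ollama", "hosted": False},
    "llava-phi3": {"provider": "ollama", "hosted": False},
    "granite3.2-vision": {"provider": "ollama", "hosted": False},
}


def requires_local_ollama(model_name: str) -> bool:
    """Return True when the model is served by a local Ollama instance."""
    return not OCR_MODELS[model_name]["hosted"]
```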
 ## 🚀 Getting Started
 
@@ -10,215 +52,140 @@ OmniReader is a document processing workflow that ingests unstructured documents
 - Mistral API key (set as environment variable `MISTRAL_API_KEY`)
 - OpenAI API key (set as environment variable `OPENAI_API_KEY`)
 - ZenML >= 0.80.0
+- Ollama (required for running local models)
 
-### Installation
+### Quick Start
 
 ```bash
+# Clone the repository
+git clone https://github.com/yourusername/omni-reader.git
+
+# Navigate to OmniReader
+cd omni-reader
+
 # Install dependencies
 pip install -r requirements.txt
+
+# Start Ollama (if using local models)
+ollama serve
 ```
 
-### Configuration
+### Prepare Your Models
 
-1. Ensure any Ollama models you want to use are pulled, e.g.:
+If using local models, ensure any Ollama models you want to use are pulled:
 
 ```bash
-ollama pull llama3.2-vision:11b
-ollama pull gemma3:12b
+ollama pull gemma3:27b
+ollama pull llava-phi3
+ollama pull granite3.2-vision
 ```
 
-2. Set the following environment variables:
+### Set Up Your Environment
+
+Configure your API keys:
 
 ```bash
-OPENAI_API_KEY=your_openai_api_key
-MISTRAL_API_KEY=your_mistral_api_key
+export OPENAI_API_KEY=your_openai_api_key
+export MISTRAL_API_KEY=your_mistral_api_key
+export OLLAMA_HOST=base_url_for_ollama_host # defaults to "http://localhost:11434/api/generate" if not set
 ```
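As a rough illustration of how these variables might be consumed at runtime (the exact lookup in the project's code may differ):

```python
# Hypothetical sketch: resolving API keys and the Ollama endpoint from the environment.
import os

openai_api_key = os.environ.get("OPENAI_API_KEY")
mistral_api_key = os.environ.get("MISTRAL_API_KEY")

# Falls back to the default documented above when OLLAMA_HOST is not set.
ollama_host = os.environ.get("OLLAMA_HOST", "http://localhost:11434/api/generate")

if openai_api_key is None or mistral_api_key is None:
    raise RuntimeError("Set OPENAI_API_KEY and MISTRAL_API_KEY before running the pipelines.")
```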
 
-## 📌 Usage
-
-### Using YAML Configuration (Recommended)
+### Run OmniReader
 
 ```bash
-# Use the default config (ocr_config.yaml)
+# Use the default config (config.yaml)
 python run.py
 
 # Run with a custom config file
 python run.py --config my_config.yaml
 ```
 
-### Configuration Structure
-
-YAML configuration files organize parameters into logical sections:
-
-```yaml
-# Image input configuration
-input:
-  image_paths: [] # List of specific image paths
-  image_folder: "./assets" # Folder containing images
-
-# OCR model configuration
-models:
-  custom_prompt: null # Optional custom prompt for all models
-  # Either specify individual models (for backward compatibility)
-  model1: "llama3.2-vision:11b" # First model for comparison
-  model2: "mistral/pixtral-12b-2409" # Second model for comparison
-  # Or specify multiple models as a list (new approach)
-  models: ["llama3.2-vision:11b", "mistral/pixtral-12b-2409"]
-  ground_truth_model: "gpt-4o-mini" # Model to use for ground truth when source is "openai"
-
-# Ground truth configuration
-ground_truth:
-  source: "openai" # Source: "openai", "manual", "file", or "none"
-  texts: [] # Ground truth texts (for manual source)
-  file: null # Path to ground truth JSON file (for file source)
-
-# Output configuration
-output:
-  ground_truth:
-    save: false # Whether to save ground truth data
-    directory: "ocr_results" # Directory for ground truth data
-
-  ocr_results:
-    save: false # Whether to save OCR results
-    directory: "ocr_results" # Directory for OCR results
-
-  visualization:
-    save: false # Whether to save HTML visualization
-    directory: "visualizations" # Directory for visualization
-```
-
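For reference, a file in the (now removed) configuration format shown above could be loaded with defaults in a few lines of Python. This is only a sketch, assumes PyYAML is installed, and is not the project's `utils/config.py`:

```python
# Minimal sketch of loading the YAML configuration shown above.
import yaml


def load_config(path: str = "ocr_config.yaml") -> dict:
    """Read the YAML file and fill in defaults for optional sections."""
    with open(path, "r", encoding="utf-8") as f:
        config = yaml.safe_load(f) or {}

    config.setdefault("input", {}).setdefault("image_folder", "./assets")
    config.setdefault("ground_truth", {}).setdefault("source", "none")
    config.setdefault("output", {})
    return config


if __name__ == "__main__":
    cfg = load_config()
    print(cfg["input"]["image_folder"])
```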
-### Running with Command Line Arguments
-
-You can still use command line arguments for quick runs:
-
-```bash
-# Run with default settings (processes all images in assets directory)
-python run.py --image-folder assets
-
-# Run with custom prompt
-python run.py --image-folder assets --custom-prompt "Extract all text from this image."
-
-# Run with specific images
-python run.py --image-paths assets/image1.jpg assets/image2.png
+### Interactive UI
 
-# Run with OpenAI ground truth for evaluation
-python run.py --image-folder assets --ground-truth openai --save-ground-truth
+The project also includes a Streamlit app that allows you to:
 
-# List available ground truth files
-python run.py --list-ground-truth-files
-
-# For quicker processing of a single image without metadata or artifact tracking
-python run_compare_ocr.py --image assets/your_image.jpg --model both
-
-# Run comparison with multiple specific models in parallel
-python run_compare_ocr.py --image assets/your_image.jpg --model "gemma3:12b,llama3.2-vision:11b,moondream"
-
-# Run comparison with all available models in parallel
-python run_compare_ocr.py --image assets/your_image.jpg --model all
-```
-
-### Using the Streamlit App
-
-For interactive use, the project includes a Streamlit app:
+- Upload documents for instant OCR processing
+- Compare results from multiple models side-by-side
+- Customize prompts for improved extraction
 
 ```bash
+# Launch the Streamlit interface
 streamlit run app.py
 ```
 
 <div align="center">
-<img src="assets/streamlit.png" alt="Streamlit UI Interface" width="800"/>
-<p><em>Interactive Streamlit interface for easy document processing and model comparison</em></p>
+<img src="assets/demo/streamlit.png" alt="Model Comparison Results" width="800"/>
+<p><em>Side-by-side comparison of OCR results across different models</em></p>
 </div>
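The snippet below is a stripped-down, hypothetical illustration of the kind of interaction the Streamlit UI provides (upload an image, pick models, view results side by side); it is not the project's actual `app.py`:

```python
# Hypothetical, minimal Streamlit flow (not the actual app.py).
import streamlit as st

st.title("OmniReader - OCR Comparison")

uploaded = st.file_uploader("Upload a document image", type=["png", "jpg", "jpeg"])
models = st.multiselect("Models to compare", ["gemma3:27b", "llava-phi3", "mistral/pixtral-12b-2409"])
prompt = st.text_area("Custom prompt (optional)", "Extract all text from this image.")

if uploaded is not None and models:
    columns = st.columns(len(models))
    for column, model in zip(columns, models):
        with column:
            st.subheader(model)
            # Replace this placeholder with a call to the OCR backend for `model`.
            st.write(f"(extracted text from {model} would appear here)")
```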

-### ### Remote Artifact Storage and Execution
+## ☁️ Cloud Deployment
 
-This project supports storing artifacts remotely and executing pipelines on cloud infrastructure. Follow these steps to configure your environment for remote operation:
+OmniReader supports storing artifacts remotely and executing pipelines on cloud infrastructure:
 
-#### Setting Up Cloud Provider Integrations
-
-Install the appropriate ZenML integrations for your preferred cloud provider:
+### Set Up Cloud Provider Integrations
 
 ```bash
 # For AWS
-zenml integration install aws s3 -y
+zenml integration install aws s3
 
 # For Azure
-zenml integration install azure azure-blob -y
+zenml integration install azure
 
 # For Google Cloud
-zenml integration install gcp gcs -y
+zenml integration install gcp gcs
 ```
 
-For detailed configuration options and other components, refer to the ZenML documentation:
-
-- [Artifact Stores](https://docs.zenml.io/stacks/artifact-stores)
-- [Orchestrators](https://docs.zenml.io/stacks/orchestrators)
-
-## 📋 Pipeline Architecture
-
-The OCR comparison pipeline consists of the following components:
-
-### Steps
-
-1. **Multi-Model OCR Step**: Processes images with multiple models in parallel
-   - Supports any number of models defined in configuration
-   - Models run in parallel using ThreadPoolExecutor
-   - Each model processes its assigned images with parallelized execution
-   - Progress tracking during batch processing
-2. **Ground Truth Step**: Optional step that uses a reference model for evaluation (default: GPT-4o Mini)
-3. **Evaluation Step**: Compares results and calculates metrics
-
-The pipeline supports configurable models, allowing you to easily swap out the models used for OCR comparison and ground truth generation via the YAML configuration file. It also supports processing any number of models in parallel for more comprehensive comparisons.
-
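Since the removed section above calls out ThreadPoolExecutor-based parallelism, here is a minimal sketch of that pattern. Function and variable names are illustrative rather than taken from the codebase:

```python
# Sketch of fanning one image out to several OCR models in parallel with ThreadPoolExecutor.
from concurrent.futures import ThreadPoolExecutor, as_completed


def run_single_model(model: str, image_path: str) -> tuple[str, str]:
    """Call one OCR backend and return (model, extracted_text)."""
    # The real implementation would dispatch to Ollama, Mistral, or OpenAI here.
    return model, f"<text extracted from {image_path} by {model}>"


def run_models_in_parallel(models: list[str], image_path: str) -> dict[str, str]:
    results: dict[str, str] = {}
    with ThreadPoolExecutor(max_workers=len(models)) as executor:
        futures = [executor.submit(run_single_model, m, image_path) for m in models]
        for future in as_completed(futures):
            model, text = future.result()
            results[model] = text
    return results


if __name__ == "__main__":
    print(run_models_in_parallel(["gemma3:27b", "llava-phi3"], "assets/sample.jpg"))
```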
-### Configuration Management
+Run your pipeline in the cloud:
 
-The new configuration system provides:
-
-- Structured YAML files for experiment parameters
-- Parameter validation and intelligent defaults
-- Easy sharing and version control of experiment settings
-- Configuration generator for quickly creating new experiment setups
-- Support for multi-model configuration via arrays
-- Flexible model selection and comparison
-
-### Metadata Tracking
-
-ZenML's metadata tracking is used throughout the pipeline:
+```bash
+# Configure your cloud stack
+zenml stack register my-cloud-stack -a cloud-artifact-store -o cloud-orchestrator
+```
 
-- Processing times and performance metrics
-- Extracted text length and entity counts
-- Comparison metrics between models (CER, WER)
-- Progress tracking for batch operations
-- Parallel processing statistics
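For readers unfamiliar with the metrics referenced above, CER and WER are edit distances normalized by reference length (characters and words, respectively). A small self-contained sketch, not the project's evaluation code:

```python
# Minimal CER/WER computation via Levenshtein distance (illustrative, not the project's code).
def levenshtein(ref: list, hyp: list) -> int:
    """Edit distance between two token sequences."""
    dp = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        prev, dp[0] = dp[0], i
        for j, h in enumerate(hyp, start=1):
            prev, dp[j] = dp[j], min(dp[j] + 1, dp[j - 1] + 1, prev + (r != h))
    return dp[len(hyp)]


def cer(reference: str, hypothesis: str) -> float:
    """Character error rate: edits per reference character."""
    return levenshtein(list(reference), list(hypothesis)) / max(len(reference), 1)


def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: edits per reference word."""
    ref_words, hyp_words = reference.split(), hypothesis.split()
    return levenshtein(ref_words, hyp_words) / max(len(ref_words), 1)


if __name__ == "__main__":
    print(cer("invoice total 120.50", "invoice total 120.5O"))
    print(wer("invoice total 120.50", "invoice total 120.5O"))
```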
+For detailed configuration options and other components, refer to the ZenML documentation:
 
-### Results Visualization
+- [AWS Integration Guide](https://docs.zenml.io/how-to/popular-integrations/aws-guide)
+- [GCP Integration Guide](https://docs.zenml.io/how-to/popular-integrations/gcp-guide)
+- [Azure Integration Guide](https://docs.zenml.io/how-to/popular-integrations/azure-guide)
 
-- Pipeline results are available in the ZenML Dashboard
-- HTML visualizations can be automatically saved to configurable directories
-
-<div align="center">
-<img src="assets/html_visualization.png" alt="HTML Visualization of OCR Results" width="800"/>
-<p><em>HTML visualization showing metrics and comparison results from the OCR pipeline</em></p>
-</div>
-
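As a rough illustration of what an automatically saved HTML comparison could involve (the project's own visualization code is more elaborate), a report can be assembled from plain strings and written to the configured directory:

```python
# Hypothetical sketch: writing a side-by-side OCR comparison as a small HTML file.
import html
import pathlib


def write_comparison_html(results: dict[str, str], output_dir: str = "visualizations") -> str:
    """results maps model name -> extracted text; returns the path of the written report."""
    rows = "".join(
        f"<tr><td>{html.escape(model)}</td><td><pre>{html.escape(text)}</pre></td></tr>"
        for model, text in results.items()
    )
    page = f"<html><body><table border='1'><tr><th>Model</th><th>Extracted text</th></tr>{rows}</table></body></html>"

    out_dir = pathlib.Path(output_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    out_path = out_dir / "comparison.html"
    out_path.write_text(page, encoding="utf-8")
    return str(out_path)


if __name__ == "__main__":
    print(write_comparison_html({"gemma3:27b": "Invoice #123", "llava-phi3": "Invoice #l23"}))
```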
-## 📁 Project Organization
+## 🛠️ Project Structure
 
 ```
 omni-reader/
 
+├── app.py                   # Streamlit UI for interactive document processing
+├── assets/                  # Sample images for OCR
 ├── configs/                 # YAML configuration files
+├── ground_truth_texts/      # Text files containing ground truth for evaluation
 ├── pipelines/               # ZenML pipeline definitions
+│   ├── batch_pipeline.py        # Batch OCR pipeline (single or multiple models)
+│   └── evaluation_pipeline.py   # Evaluation pipeline (multiple models)
 ├── steps/                   # Pipeline step implementations
-├── utils/                   # Utility functions
+│   ├── evaluate_models.py       # Model comparison and metrics
+│   ├── loaders.py               # Loading images and ground truth texts
+│   ├── run_ocr.py               # Running OCR with selected models
+│   └── save_results.py          # Saving results and visualizations
+├── utils/                   # Utility functions and helpers
+│   ├── ocr_processing.py        # OCR processing core logic
 │   ├── config.py                # Configuration utilities
-│   └── ...
-├── run.py                   # Main script for running the pipeline
-├── config_generator.py      # Tool for generating experiment configurations
+│   └── model_configs.py         # Model configuration and registry
+├── run.py                   # Main entrypoint for running the pipeline
 └── README.md                # Project documentation
 ```
 
-## 🔗 Links
+## 🔮 Use Cases
+
+- **Document Processing Automation**: Extract structured data from invoices, receipts, and forms
+- **Content Digitization**: Convert scanned documents and books into searchable digital content
+- **Regulatory Compliance**: Extract and validate information from compliance documents
+- **Data Migration**: Convert legacy paper documents into structured digital formats
+- **Research & Analysis**: Extract data from academic papers, reports, and publications
+
+## 📚 Documentation
+
+For more information about ZenML and building MLOps pipelines, refer to the [ZenML documentation](https://docs.zenml.io/).
+
+For model-specific documentation:
 
-- [ZenML Documentation](https://docs.zenml.io/)
 - [Mistral AI Vision Documentation](https://docs.mistral.ai/capabilities/vision/)
-- [Gemma 3 Documentation](https://ai.google.dev/gemma/docs/integrations/ollama)
+- [Ollama Models Library](https://ollama.com/library)
