---
type: knowledge
created: 2026-02-03
updated: 2026-02-03
tags:
  - coocr-htr
  - models
  - ocr
  - htr
  - vlm
  - external-validation
status: active
---

# OCR/HTR Model Landscape

Current state of OCR/HTR models relevant for coOCR/HTR (as of February 2026).

## External Validation

Findings from the Digital Humanities community (2026-02):

- Gemini 3 Pro leads closed models "by a long way"
- LightOnOCR is currently state-of-the-art among open-source models
- DeepSeek OCR 2 "works okay for simple layouts"
- Layout analysis as a separate step significantly improves accuracy
- Agentic Vision Mode is relevant for complex layouts
- HTR (handwriting) remains more challenging than OCR (print) across all models

## Model Comparison Matrix

### Closed/Commercial Models

| Model | Strengths | Weaknesses | HTR Quality | Notes |
|---|---|---|---|---|
| Gemini 3 Pro | Best overall accuracy, HTR breakthrough, complex layouts | Cost, API dependency | Excellent | "Solves" English HTR (18th-19th c.) |
| Gemini 3 Flash | Fast, cost-effective, Agentic Vision | Less accurate than Pro | Good | 3x faster, 4x cheaper than Pro |
| GPT-4o | Good semantic understanding, handwriting | Less layout-aware | Good | Better for semantic context |
| Claude | Good reasoning | Not OCR-specialized | Moderate | Better for validation than transcription |

### Open Source Models

| Model | Parameters | Strengths | Weaknesses | License |
|---|---|---|---|---|
| LightOnOCR-2 | 1B | SotA on OlmOCR-Bench (83.2), fastest | Less tested on HTR | Apache 2.0 |
| dots.ocr | 1.7B | 100+ languages, excellent layout | Slower than LightOnOCR | MIT |
| DeepSeek OCR 2 | 3B | Structure preservation, runs locally | Handwriting limitations | Open |
| PaddleOCR-VL | 0.9B | Multilingual | Slower | Apache 2.0 |

### Performance Benchmarks (OlmOCR-Bench)

| Model | Score | Speed (pages/s) | Notes |
|---|---|---|---|
| LightOnOCR-2 | 83.2 | 5.71 | Best accuracy + speed |
| dots.ocr | ~82 | 1.14 | Best multilingual |
| DeepSeek OCR 2 | ~78 | 3.3 | Good for simple layouts |

## Gemini 3 Details

### Pro vs Flash Decision Matrix

| Use Case | Recommended | Rationale |
|---|---|---|
| Historical handwriting (HTR) | Pro | Significantly better accuracy |
| Complex multi-column layouts | Pro | Better layout understanding |
| Simple printed documents | Flash | Cost-effective, sufficient quality |
| Batch processing | Flash | 3x faster, 4x cheaper |
| Prototyping/testing | Flash | Lower-cost iteration |

### Agentic Vision Mode (Flash)

A new capability combining visual reasoning with code execution.

Think-Act-Observe loop:

1. Think: Analyzes the request, formulates a multi-step plan
2. Act: Generates Python code to zoom, crop, or annotate
3. Observe: Re-examines the transformed image

Benefits:

- 5-10% quality boost on vision benchmarks
- Reduced hallucinations
- Better for high-resolution documents
- Can zoom into regions of interest

Activation: Requires an explicit prompt, or code execution enabled in the API.

Relevance for coOCR/HTR: Could improve recognition of dense documents, small text, or damaged areas by allowing the model to "investigate" problematic regions.

### HTR Breakthrough (Pro)

Research finding (Generative History, 2025):

> "Sixty years after IBM's first HTR system, Gemini 3 Pro has solved HTR on English language texts. By 'solved' we don't mean absolute perfection, but that Gemini 3 consistently produces text with error rates comparable to the very best humans."

Test corpus: 50 English documents, 18th-19th century, including letters, legal documents, meeting transcriptions, memorandums, journal entries.

Implication: For English historical documents, Gemini 3 Pro may reduce the need for specialized HTR models like Transkribus.


## LightOnOCR-2 Details

The current state-of-the-art open-source OCR model.

### Architecture

- Vision Transformer encoder (Pixtral-based)
- Lightweight text decoder (Qwen3-based)
- Distilled from high-quality open VLMs

### Key Features

- End-to-end (no external OCR pipeline)
- Layout-aware text extraction
- Handles tables, receipts, forms, multi-column text, math notation
- Compact variants for European languages (32k/16k vocab)

### Performance vs Competitors

- 6.49x faster than dots.ocr
- 2.67x faster than PaddleOCR-VL
- 1.73x faster than DeepSeek OCR
- 9x smaller than comparable approaches

### Integration

- Supported in vLLM v0.11.1+
- Available on Hugging Face: lightonai/LightOnOCR-2-1B
- Can run locally via Ollama (requires conversion)
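Since vLLM exposes an OpenAI-compatible endpoint, the integration can be sketched as a payload builder. This is an illustrative sketch, not coOCR/HTR's actual code: the function name, prompt text, and port are assumptions; the model identifier is the Hugging Face ID above.

```python
import base64
from pathlib import Path


def build_ocr_request(image_path: str,
                      prompt: str = "Transcribe this page.") -> dict:
    """Build an OpenAI-compatible chat payload for a vLLM-served OCR model.

    Assumes the server was started with `vllm serve lightonai/LightOnOCR-2-1B`
    and the payload is POSTed to http://localhost:8000/v1/chat/completions.
    """
    b64 = base64.b64encode(Path(image_path).read_bytes()).decode("ascii")
    return {
        "model": "lightonai/LightOnOCR-2-1B",
        "messages": [{
            "role": "user",
            "content": [
                {"type": "image_url",
                 "image_url": {"url": f"data:image/png;base64,{b64}"}},
                {"type": "text", "text": prompt},
            ],
        }],
        "temperature": 0.0,  # deterministic output for transcription
    }
```

Keeping the builder pure (no network call) makes it easy to unit-test and to swap the endpoint later.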

### Limitations

- Less tested on historical handwriting (HTR)
- Optimized for document parsing, not manuscript transcription

## dots.ocr Details

Multilingual document parser with excellent layout detection.

### Key Differentiator

A single VLM handles everything:

- Layout detection
- Text parsing
- Reading order
- Formula recognition

No multi-model pipeline required.

### Performance

- 100+ languages supported
- SOTA on multilingual documents
- Error rates nearly halved vs competitors on low-resource languages
- Formula recognition comparable to 72B models

### Availability

- GitHub: rednote-hilab/dots.ocr
- Hugging Face: rednote-hilab/dots.ocr
- MIT License

## DeepSeek OCR 2 Limitations

### Confirmed Weaknesses

- Handwriting: Not a core focus; "performance remains limited compared to specialized cursive OCR tools"
- Complex layouts: "Works okay for simple layouts" (community feedback)
- Historical scripts: Unfamiliar letter forms are problematic

### When to Use

- Printed documents with clear structure
- Local/offline processing requirement
- Privacy-sensitive documents (runs locally)
- Batch processing of simple documents

### When to Avoid

- Handwritten documents (use Gemini 3 Pro or specialized HTR)
- Complex multi-column layouts (use dots.ocr or Gemini)
- Low-resource languages (use dots.ocr)

### Optimization Tips

- Capture at 300 DPI or higher
- Avoid glass reflections
- Denoise lightly and deskew
- Use a higher token/vision preset for faint text
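In coOCR/HTR, the capture tips above could translate into a small preprocessing step before the image reaches the model. A minimal sketch using Pillow; the function name and defaults are illustrative, and skew *detection* is deliberately left to the caller:

```python
from PIL import Image, ImageFilter, ImageOps


def preprocess_scan(img: Image.Image, deskew_angle: float = 0.0) -> Image.Image:
    """Light cleanup before OCR: grayscale, mild denoise, contrast, deskew."""
    out = ImageOps.grayscale(img)                       # drop color noise
    out = out.filter(ImageFilter.MedianFilter(size=3))  # denoise lightly only
    out = ImageOps.autocontrast(out)                    # lifts faint text
    if deskew_angle:
        # rotate by a caller-supplied angle; fill new corners with white
        out = out.rotate(deskew_angle, expand=True, fillcolor=255)
    return out
```

Heavier filtering tends to hurt OCR more than it helps, hence the single 3x3 median pass.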

## Recommendations for coOCR/HTR

### Implemented (2026-02-03)

| Action | Status |
|---|---|
| Gemini 3 Pro as model option | [x] Done |
| Model Selection Guide in UI | [x] Done |
| DeepSeek OCR 2 + LightOnOCR-2 in Ollama options | [x] Done |

### Future Considerations

| Action | Priority | Effort | Notes |
|---|---|---|---|
| Mistral OCR 3 as provider (Azure) | Medium | Medium | Quality comparison pending (zbz-ocr-tei), requires Azure endpoint config |
| Agentic Vision integration | Medium | Medium | |
| LightOnOCR-2 Ollama conversion guide | Medium | Low | |
| Layout analysis pre-step | Low | High | |
| Multi-model ensemble | Low | High | |

### Model Selection Guide (implemented in UI)

```text
Is it handwritten?
├─ Yes → Use Gemini 3 Pro
└─ No (printed)
    ├─ Complex layout? → Gemini 3 Pro or dots.ocr
    ├─ Simple layout, need speed? → Gemini 3 Flash or DeepSeek OCR
    └─ Privacy/offline required? → DeepSeek OCR (Ollama) or LightOnOCR
```
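The decision tree maps directly onto a pure selection function. A sketch of that logic; the function name and model identifiers are illustrative, not the UI's actual code:

```python
def recommend_models(handwritten: bool,
                     complex_layout: bool = False,
                     need_speed: bool = False,
                     offline: bool = False) -> list:
    """Mirror the UI selection guide; returns candidates in preference order."""
    if handwritten:
        return ["Gemini 3 Pro"]
    if complex_layout:
        return ["Gemini 3 Pro", "dots.ocr"]
    if need_speed:
        return ["Gemini 3 Flash", "DeepSeek OCR"]
    if offline:
        return ["DeepSeek OCR (Ollama)", "LightOnOCR"]
    return ["Gemini 3 Flash"]  # default: cheap and sufficient for simple print
```

Keeping this as a pure function means the same logic can drive both the UI hint and any automated model routing.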

## Implementation Concepts

### Agentic Vision (Future)

Concept: Gemini 3 Flash can actively investigate images instead of just passively analyzing them.

Think-Act-Observe loop:

1. THINK: Model analyzes the request, plans a multi-step investigation
2. ACT: Model generates Python code to zoom, crop, annotate
3. OBSERVE: Transformed image is re-analyzed
4. REPEAT: Until sufficient clarity is reached
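The loop could also be prototyped around any VLM backend, not just Gemini. A minimal sketch, where `model.read`, its return shape, and the 0.9 confidence threshold are stand-ins rather than a real API:

```python
def agentic_transcribe(image, model, max_rounds: int = 3):
    """Think-Act-Observe sketch: zoom into a suspect region and re-read.

    `model.read(img)` is assumed to return (text, confidence, suspect_box),
    where suspect_box is a crop box or None when nothing looks problematic.
    """
    text, conf, box = model.read(image)           # first full-page pass
    for _ in range(max_rounds):
        if box is None or conf >= 0.9:            # OBSERVE: good enough, stop
            break
        image = image.crop(box)                   # ACT: zoom into the region
        text, conf, box = model.read(image)       # THINK again on the crop
    return text
```

With code execution enabled, Gemini performs the crop-and-re-read steps internally; this sketch only illustrates the control flow.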

Potential use cases in coOCR/HTR:

- Automatic zooming on illegible passages
- Segmentation of complex layouts
- Quality improvement for damaged documents
- Verification of uncertain readings through re-analysis

Technical implementation:

```javascript
// Concept: Agentic Vision API call
const requestBody = {
  contents: [{ parts }],
  generationConfig: {
    temperature: 1.0,
    maxOutputTokens: 8192
  },
  // Enable Code Execution for Agentic Vision
  tools: [{
    codeExecution: {}
  }]
};
```

Prerequisites:

- Gemini API with Code Execution support
- Prompt engineering for multi-step analysis
- UI for visualization of intermediate steps (optional)

Status: Concept documented, not implemented.


### Ollama Integration (Extended)

Current state:

- DeepSeek-OCR as default model
- DeepSeek-OCR 2 added as an option
- LightOnOCR-2 added as an option (requires conversion)

LightOnOCR-2 via Ollama:

The model is available on Hugging Face but must be converted for Ollama:

```shell
# 1. Download the model from Hugging Face
git lfs install
git clone https://huggingface.co/lightonai/LightOnOCR-2-1B

# 2. Convert to GGUF (requires llama.cpp; script name may vary by version)
python convert_hf_to_gguf.py ./LightOnOCR-2-1B --outtype f16

# 3. Create a Modelfile
cat > Modelfile << EOF
FROM ./lightonocr-2-1b.gguf
PARAMETER temperature 0.1
PARAMETER num_ctx 4096
EOF

# 4. Import into Ollama
ollama create lightonocr -f Modelfile
```

Note: The conversion is not trivial and requires:

- llama.cpp with vision support
- Sufficient RAM (16 GB+)
- A GPU (recommended for usable speed)

DeepSeek-OCR 2:

More easily available; it can be used directly with Ollama:

```shell
# If available in the Ollama registry
ollama pull deepseek-ocr2

# Or manually with a Modelfile
ollama create deepseek-ocr2 -f Modelfile
```
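Once a model is imported, coOCR/HTR can reach it through Ollama's local REST API, which accepts base64-encoded images. A request builder sketch; the function name, default model tag, and prompt text are illustrative:

```python
import base64
from pathlib import Path


def ollama_ocr_payload(image_path: str,
                       model: str = "deepseek-ocr2",
                       prompt: str = "Transcribe all text on this page.") -> dict:
    """Request body for POST http://localhost:11434/api/generate.

    Ollama expects images as base64 strings in the `images` field;
    stream=False returns the full transcription in one response.
    """
    img_b64 = base64.b64encode(Path(image_path).read_bytes()).decode("ascii")
    return {"model": model, "prompt": prompt,
            "images": [img_b64], "stream": False}
```

As with the vLLM sketch, keeping the builder free of network I/O makes it trivially testable.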

Status: Options added to UI, conversion guide documented.


## Sources

### Primary Sources

### Community Validation

- Digital Humanities community discussion (2026-02)

Related: METHODOLOGY | ARCHITECTURE | VALIDATION


Updated: 2026-02-14