---
type: knowledge
created: 2026-02-03
updated: 2026-02-03
tags:
  - coocr-htr
  - models
  - ocr
  - htr
  - vlm
  - external-validation
status: active
---

# OCR/HTR Model Landscape

Current state of OCR/HTR models relevant for coOCR/HTR (as of February 2026).

## External Validation

Findings from the Digital Humanities community (2026-02):

- Gemini 3 Pro leads closed models "by a long way"
- LightOnOCR is currently state-of-the-art among open-source models
- DeepSeek OCR 2 "works okay for simple layouts"
- Layout analysis as a separate step significantly improves accuracy
- Agentic Vision Mode is relevant for complex layouts
- HTR (handwriting) remains more challenging than OCR (print) across all models

## Model Comparison Matrix

### Closed/Commercial Models

| Model | Strengths | Weaknesses | HTR Quality | Notes |
|---|---|---|---|---|
| Gemini 3 Pro | Best overall accuracy, HTR breakthrough, complex layouts | Cost, API dependency | Excellent | "Solves" English HTR (18th-19th c.) |
| Gemini 3 Flash | Fast, cost-effective, Agentic Vision | Less accurate than Pro | Good | 3x faster, 4x cheaper than Pro |
| GPT-4o | Good semantic understanding, handwriting | Less layout-aware | Good | Better for semantic context |
| Claude | Good reasoning | Not OCR-specialized | Moderate | Better for validation than transcription |

### Open Source Models

| Model | Parameters | Strengths | Weaknesses | License |
|---|---|---|---|---|
| LightOnOCR-2 | 1B | SotA on OlmOCR-Bench (83.2), fastest | Less tested on HTR | Apache 2.0 |
| dots.ocr | 1.7B | 100+ languages, excellent layout | Slower than LightOnOCR | MIT |
| DeepSeek OCR 2 | 3B | Structure preservation, runs locally | Handwriting limitations | Open |
| PaddleOCR-VL | 0.9B | Multilingual | Slower | Apache 2.0 |

### Performance Benchmarks (OlmOCR-Bench)

| Model | Score | Speed (pages/s) | Notes |
|---|---|---|---|
| LightOnOCR-2 | 83.2 | 5.71 | Best accuracy + speed |
| dots.ocr | ~82 | 1.14 | Best multilingual |
| DeepSeek OCR 2 | ~78 | 3.3 | Good for simple layouts |

## Gemini 3 Details

### Pro vs Flash Decision Matrix

| Use Case | Recommended | Rationale |
|---|---|---|
| Historical handwriting (HTR) | Pro | Significantly better accuracy |
| Complex multi-column layouts | Pro | Better layout understanding |
| Simple printed documents | Flash | Cost-effective, sufficient quality |
| Batch processing | Flash | 3x faster, 4x cheaper |
| Prototyping/testing | Flash | Lower-cost iteration |

### Agentic Vision Mode (Flash)

A new capability combining visual reasoning with code execution.

Think-Act-Observe loop:

1. Think: Analyzes the request, formulates a multi-step plan
2. Act: Generates Python code to zoom, crop, or annotate
3. Observe: Re-examines the transformed image

Benefits:

- 5-10% quality boost on vision benchmarks
- Reduced hallucinations
- Better for high-resolution documents
- Can zoom into regions of interest

Activation: Requires an explicit prompt, or code execution enabled in the API.

Relevance for coOCR/HTR: Could improve recognition of dense documents, small text, or damaged areas by allowing the model to "investigate" problematic regions.

### HTR Breakthrough (Pro)

Research finding (Generative History, 2025):

> "Sixty years after IBM's first HTR system, Gemini 3 Pro has solved HTR on English language texts. By 'solved' we don't mean absolute perfection, but that Gemini 3 consistently produces text with error rates comparable to the very best humans."

Test corpus: 50 English documents, 18th-19th century, including letters, legal documents, meeting transcriptions, memorandums, journal entries.

Implication: For English historical documents, Gemini 3 Pro may reduce the need for specialized HTR models like Transkribus.


## LightOnOCR-2 Details

The current state-of-the-art open-source OCR model.

### Architecture

- Vision Transformer encoder (Pixtral-based)
- Lightweight text decoder (Qwen3-based)
- Distilled from high-quality open VLMs

### Key Features

- End-to-end (no external OCR pipeline)
- Layout-aware text extraction
- Handles tables, receipts, forms, multi-column text, math notation
- Compact variants for European languages (32k/16k vocab)

### Performance vs Competitors

- 6.49x faster than dots.ocr
- 2.67x faster than PaddleOCR-VL
- 1.73x faster than DeepSeek OCR
- 9x smaller than comparable approaches

### Integration

- Supported in vLLM v0.11.1+
- Available on Hugging Face: lightonai/LightOnOCR-2-1B
- Can run locally via Ollama (requires conversion)
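Since vLLM exposes an OpenAI-compatible endpoint, the integration can be sketched as a payload builder. This is an illustrative sketch, not coOCR/HTR's actual code: the function name, prompt text, and port are assumptions; the model identifier is the Hugging Face ID above.

```python
import base64
from pathlib import Path


def build_ocr_request(image_path: str,
                      prompt: str = "Transcribe this page.") -> dict:
    """Build an OpenAI-compatible chat payload for a vLLM-served OCR model.

    Assumes the server was started with `vllm serve lightonai/LightOnOCR-2-1B`
    and the payload is POSTed to http://localhost:8000/v1/chat/completions.
    """
    b64 = base64.b64encode(Path(image_path).read_bytes()).decode("ascii")
    return {
        "model": "lightonai/LightOnOCR-2-1B",
        "messages": [{
            "role": "user",
            "content": [
                {"type": "image_url",
                 "image_url": {"url": f"data:image/png;base64,{b64}"}},
                {"type": "text", "text": prompt},
            ],
        }],
        "temperature": 0.0,  # deterministic output for transcription
    }
```

Keeping the builder pure (no network call) makes it easy to unit-test and to swap the endpoint later.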

### Limitations

- Less tested on historical handwriting (HTR)
- Optimized for document parsing, not manuscript transcription

## dots.ocr Details

Multilingual document parser with excellent layout detection.

### Key Differentiator

A single VLM handles everything:

- Layout detection
- Text parsing
- Reading order
- Formula recognition

No multi-model pipeline required.

### Performance

- 100+ languages supported
- SOTA on multilingual documents
- Error rates nearly halved vs competitors on low-resource languages
- Formula recognition comparable to 72B models

### Availability

- GitHub: rednote-hilab/dots.ocr
- Hugging Face: rednote-hilab/dots.ocr
- MIT License

## DeepSeek OCR 2 Limitations

### Confirmed Weaknesses

- Handwriting: Not a core focus; "performance remains limited compared to specialized cursive OCR tools"
- Complex layouts: "Works okay for simple layouts" (community feedback)
- Historical scripts: Unfamiliar letter forms are problematic

### When to Use

- Printed documents with clear structure
- Local/offline processing requirement
- Privacy-sensitive documents (runs locally)
- Batch processing of simple documents

### When to Avoid

- Handwritten documents (use Gemini 3 Pro or specialized HTR)
- Complex multi-column layouts (use dots.ocr or Gemini)
- Low-resource languages (use dots.ocr)

### Optimization Tips

- Capture at 300 DPI or higher
- Avoid glass reflections
- Denoise lightly and deskew
- Use a higher token/vision preset for faint text
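In coOCR/HTR, the capture tips above could translate into a small preprocessing step before the image reaches the model. A minimal sketch using Pillow; the function name and defaults are illustrative, and skew *detection* is deliberately left to the caller:

```python
from PIL import Image, ImageFilter, ImageOps


def preprocess_scan(img: Image.Image, deskew_angle: float = 0.0) -> Image.Image:
    """Light cleanup before OCR: grayscale, mild denoise, contrast, deskew."""
    out = ImageOps.grayscale(img)                       # drop color noise
    out = out.filter(ImageFilter.MedianFilter(size=3))  # denoise lightly only
    out = ImageOps.autocontrast(out)                    # lifts faint text
    if deskew_angle:
        # rotate by a caller-supplied angle; fill new corners with white
        out = out.rotate(deskew_angle, expand=True, fillcolor=255)
    return out
```

Heavier filtering tends to hurt OCR more than it helps, hence the single 3x3 median pass.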

## Recommendations for coOCR/HTR

### Implemented (2026-02-03)

| Action | Status |
|---|---|
| Gemini 3 Pro as model option | [x] Done |
| Model Selection Guide in UI | [x] Done |
| DeepSeek OCR 2 + LightOnOCR-2 in Ollama options | [x] Done |

### Future Considerations

| Action | Priority | Effort | Notes |
|---|---|---|---|
| Mistral OCR 3 as provider (Azure) | Medium | Medium | Quality comparison pending (zbz-ocr-tei), requires Azure endpoint config |
| Agentic Vision integration | Medium | Medium | |
| LightOnOCR-2 Ollama conversion guide | Medium | Low | |
| Layout analysis pre-step | Low | High | |
| Multi-model ensemble | Low | High | |

### Model Selection Guide (implemented in UI)

```text
Is it handwritten?
├─ Yes → Use Gemini 3 Pro
└─ No (printed)
    ├─ Complex layout? → Gemini 3 Pro or dots.ocr
    ├─ Simple layout, need speed? → Gemini 3 Flash or DeepSeek OCR
    └─ Privacy/offline required? → DeepSeek OCR (Ollama) or LightOnOCR
```
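The decision tree maps directly onto a pure selection function. A sketch of that logic; the function name and model identifiers are illustrative, not the UI's actual code:

```python
def recommend_models(handwritten: bool,
                     complex_layout: bool = False,
                     need_speed: bool = False,
                     offline: bool = False) -> list:
    """Mirror the UI selection guide; returns candidates in preference order."""
    if handwritten:
        return ["Gemini 3 Pro"]
    if complex_layout:
        return ["Gemini 3 Pro", "dots.ocr"]
    if need_speed:
        return ["Gemini 3 Flash", "DeepSeek OCR"]
    if offline:
        return ["DeepSeek OCR (Ollama)", "LightOnOCR"]
    return ["Gemini 3 Flash"]  # default: cheap and sufficient for simple print
```

Keeping this as a pure function means the same logic can drive both the UI hint and any automated model routing.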

## Implementation Concepts

### Agentic Vision (Future)

Concept: Gemini 3 Flash can actively investigate images instead of just passively analyzing them.

Think-Act-Observe loop:

1. THINK: Model analyzes the request, plans a multi-step investigation
2. ACT: Model generates Python code to zoom, crop, annotate
3. OBSERVE: Transformed image is re-analyzed
4. REPEAT: Until sufficient clarity is reached
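The loop could also be prototyped around any VLM backend, not just Gemini. A minimal sketch, where `model.read`, its return shape, and the 0.9 confidence threshold are stand-ins rather than a real API:

```python
def agentic_transcribe(image, model, max_rounds: int = 3):
    """Think-Act-Observe sketch: zoom into a suspect region and re-read.

    `model.read(img)` is assumed to return (text, confidence, suspect_box),
    where suspect_box is a crop box or None when nothing looks problematic.
    """
    text, conf, box = model.read(image)           # first full-page pass
    for _ in range(max_rounds):
        if box is None or conf >= 0.9:            # OBSERVE: good enough, stop
            break
        image = image.crop(box)                   # ACT: zoom into the region
        text, conf, box = model.read(image)       # THINK again on the crop
    return text
```

With code execution enabled, Gemini performs the crop-and-re-read steps internally; this sketch only illustrates the control flow.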

Potential use cases in coOCR/HTR:

- Automatic zooming on illegible passages
- Segmentation of complex layouts
- Quality improvement for damaged documents
- Verification of uncertain readings through re-analysis

Technical implementation:

```javascript
// Concept: Agentic Vision API call
const requestBody = {
  contents: [{ parts }],
  generationConfig: {
    temperature: 1.0,
    maxOutputTokens: 8192
  },
  // Enable Code Execution for Agentic Vision
  tools: [{
    codeExecution: {}
  }]
};
```

Prerequisites:

- Gemini API with Code Execution support
- Prompt engineering for multi-step analysis
- UI for visualization of intermediate steps (optional)

Status: Concept documented, not implemented.


### Ollama Integration (Extended)

Current state:

- DeepSeek-OCR as default model
- DeepSeek-OCR 2 added as an option
- LightOnOCR-2 added as an option (requires conversion)

LightOnOCR-2 via Ollama:

The model is available on Hugging Face but must be converted for Ollama:

```shell
# 1. Download the model from Hugging Face
git lfs install
git clone https://huggingface.co/lightonai/LightOnOCR-2-1B

# 2. Convert to GGUF (requires llama.cpp; script name may vary by version)
python convert_hf_to_gguf.py ./LightOnOCR-2-1B --outtype f16

# 3. Create a Modelfile
cat > Modelfile << EOF
FROM ./lightonocr-2-1b.gguf
PARAMETER temperature 0.1
PARAMETER num_ctx 4096
EOF

# 4. Import into Ollama
ollama create lightonocr -f Modelfile
```

Note: The conversion is not trivial and requires:

- llama.cpp with vision support
- Sufficient RAM (16 GB+)
- A GPU (recommended for usable speed)

DeepSeek-OCR 2:

More easily available; it can be used directly with Ollama:

```shell
# If available in the Ollama registry
ollama pull deepseek-ocr2

# Or manually with a Modelfile
ollama create deepseek-ocr2 -f Modelfile
```
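Once a model is imported, coOCR/HTR can reach it through Ollama's local REST API, which accepts base64-encoded images. A request builder sketch; the function name, default model tag, and prompt text are illustrative:

```python
import base64
from pathlib import Path


def ollama_ocr_payload(image_path: str,
                       model: str = "deepseek-ocr2",
                       prompt: str = "Transcribe all text on this page.") -> dict:
    """Request body for POST http://localhost:11434/api/generate.

    Ollama expects images as base64 strings in the `images` field;
    stream=False returns the full transcription in one response.
    """
    img_b64 = base64.b64encode(Path(image_path).read_bytes()).decode("ascii")
    return {"model": model, "prompt": prompt,
            "images": [img_b64], "stream": False}
```

As with the vLLM sketch, keeping the builder free of network I/O makes it trivially testable.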

Status: Options added to UI, conversion guide documented.


## Sources

### Primary Sources

### Community Validation

- Digital Humanities community discussion (2026-02)

Related: METHODOLOGY | ARCHITECTURE | VALIDATION


Updated: 2026-02-14