| Field | Value |
|---|---|
| type | knowledge |
| created | 2026-02-03 |
| updated | 2026-02-03 |
| tags | |
| status | active |
Current state of OCR/HTR models relevant for coOCR/HTR (as of February 2026).
Findings from the Digital Humanities community (2026-02):
- Gemini 3 Pro leads closed models "by a long way"
- LightOnOCR is currently state-of-the-art for open source
- DeepSeek OCR 2 "works okay for simple layouts"
- Layout analysis as a separate step significantly improves accuracy
- Agentic Vision Mode relevant for complex layouts
- HTR (handwriting) remains more challenging than OCR (print) across all models
Closed-source models:
| Model | Strengths | Weaknesses | HTR Quality | Notes |
|---|---|---|---|---|
| Gemini 3 Pro | Best overall accuracy, HTR breakthrough, complex layouts | Cost, API dependency | Excellent | "Solves" English HTR (18th-19th c.) |
| Gemini 3 Flash | Fast, cost-effective, Agentic Vision | Less accurate than Pro | Good | 3x faster, 4x cheaper than Pro |
| GPT-4o | Good semantic understanding, handwriting | Less layout-aware | Good | Better for semantic context |
| Claude | Good reasoning | Not OCR-specialized | Moderate | Better for validation than transcription |
Open-source models:
| Model | Parameters | Strengths | Weaknesses | License |
|---|---|---|---|---|
| LightOnOCR-2 | 1B | SotA on OlmOCR-Bench (83.2), fastest | Less tested on HTR | Apache 2.0 |
| dots.ocr | 1.7B | 100+ languages, excellent layout | Slower than LightOnOCR | MIT |
| DeepSeek OCR 2 | 3B | Structure preservation, local | Handwriting limitations | Open |
| PaddleOCR-VL | 0.9B | Multilingual | Slower | Apache 2.0 |
OlmOCR-Bench comparison (open-source models):
| Model | OlmOCR-Bench Score | Speed (pages/s) | Notes |
|---|---|---|---|
| LightOnOCR-2 | 83.2 | 5.71 | Best accuracy + speed |
| dots.ocr | ~82 | 1.14 | Best multilingual |
| DeepSeek OCR 2 | ~78 | 3.3 | Good for simple layouts |
Gemini 3 model selection:
| Use Case | Recommended | Rationale |
|---|---|---|
| Historical handwriting (HTR) | Pro | Significantly better accuracy |
| Complex multi-column layouts | Pro | Better layout understanding |
| Simple printed documents | Flash | Cost-effective, sufficient quality |
| Batch processing | Flash | 3x faster, 4x cheaper |
| Prototyping/testing | Flash | Lower cost iteration |
Agentic Vision (Gemini 3 Flash): a new capability combining visual reasoning with code execution.
Think - Act - Observe Loop:
- Think: Analyzes request, formulates multi-step plan
- Act: Generates Python code to zoom, crop, annotate
- Observe: Re-examines transformed image
Benefits:
- 5-10% quality boost on vision benchmarks
- Reduced hallucinations
- Better for high-resolution documents
- Can zoom into regions of interest
Activation: Requires explicit prompt or code execution enabled in API.
Relevance for coOCR/HTR: Could improve recognition of dense documents, small text, or damaged areas by allowing the model to "investigate" problematic regions.
Research finding (Generative History, 2025):
"Sixty years after IBM's first HTR system, Gemini 3 Pro has solved HTR on English language texts. By 'solved' we don't mean absolute perfection, but that Gemini 3 consistently produces text with error rates comparable to the very best humans."
Test corpus: 50 English documents, 18th-19th century, including letters, legal documents, meeting transcriptions, memorandums, journal entries.
Implication: For English historical documents, Gemini 3 Pro may reduce the need for specialized HTR models like Transkribus.
LightOnOCR-2: the current state-of-the-art open-source OCR model.
- Vision Transformer encoder (Pixtral-based)
- Lightweight text decoder (Qwen3-based)
- Distilled from high-quality open VLMs
- End-to-end (no external OCR pipeline)
- Layout-aware text extraction
- Handles tables, receipts, forms, multi-column, math notation
- Compact variants for European languages (32k/16k vocab)
- 6.49x faster than dots.ocr
- 2.67x faster than PaddleOCR-VL
- 1.73x faster than DeepSeek OCR
- 9x smaller than comparable approaches
- Supported in vLLM v0.11.1+
- Available on Hugging Face: `lightonai/LightOnOCR-2-1B`
- Can run locally via Ollama (requires conversion)
- Less tested on historical handwriting (HTR)
- Optimized for document parsing, not manuscript transcription
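Since LightOnOCR-2 is supported in vLLM, coOCR/HTR could reach it through vLLM's OpenAI-compatible chat endpoint. A minimal sketch of building such a request; the endpoint, prompt wording, and parameter choices are assumptions, not settings verified against the model:

```javascript
// Sketch: build an OpenAI-compatible chat request for a local vLLM
// server running LightOnOCR-2 (prompt and parameters are assumptions).
function buildVllmOcrRequest(imageBase64, prompt = "Transcribe this document.") {
  return {
    model: "lightonai/LightOnOCR-2-1B",
    messages: [{
      role: "user",
      content: [
        { type: "text", text: prompt },
        { type: "image_url", image_url: { url: `data:image/png;base64,${imageBase64}` } }
      ]
    }],
    temperature: 0.1,   // low temperature for faithful transcription
    max_tokens: 4096
  };
}

// Usage: POST as JSON to http://localhost:8000/v1/chat/completions
const body = buildVllmOcrRequest("iVBORw0KGgo...");
```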
dots.ocr: a multilingual document parser with excellent layout detection.
Single VLM handles everything:
- Layout detection
- Text parsing
- Reading order
- Formula recognition
No multi-model pipeline required.
- 100+ languages supported
- SOTA on multilingual documents
- Error rates nearly halved vs competitors on low-resource languages
- Formula recognition comparable to 72B models
- GitHub: `rednote-hilab/dots.ocr`
- Hugging Face: `rednote-hilab/dots.ocr`
- License: MIT
DeepSeek OCR 2: local-first document OCR.
Limitations:
- Handwriting: not a core focus; "performance remains limited compared to specialized cursive OCR tools"
- Complex layouts: "works okay for simple layouts" (community feedback)
- Historical scripts: unfamiliar letter forms are problematic
Good fit for:
- Printed documents with clear structure
- Local/offline processing requirements
- Privacy-sensitive documents (runs locally)
- Batch processing of simple documents
Not recommended for:
- Handwritten documents (use Gemini 3 Pro or specialized HTR)
- Complex multi-column layouts (use dots.ocr or Gemini)
- Low-resource languages (use dots.ocr)
Capture recommendations:
- Capture at 300 DPI or higher
- Avoid glass reflections
- Denoise lightly and deskew
- Use higher token/vision preset for faint text
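The 300 DPI guideline translates directly into minimum pixel dimensions for a given page size. A small illustrative helper (function name is hypothetical, not part of the coOCR/HTR codebase):

```javascript
// Minimum pixel dimensions for a physical page size at a target DPI.
// Page sizes are in inches; results are rounded to whole pixels.
function minPixelSize(widthInches, heightInches, dpi = 300) {
  return {
    width: Math.round(widthInches * dpi),
    height: Math.round(heightInches * dpi)
  };
}

// A4 (8.27 × 11.69 in) needs about 2481 × 3507 px at 300 DPI
const a4 = minPixelSize(8.27, 11.69);
```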
Completed:
| Action | Status |
|---|---|
| Gemini 3 Pro as model option | [x] Done |
| Model Selection Guide in UI | [x] Done |
| DeepSeek OCR 2 + LightOnOCR-2 in Ollama options | [x] Done |
Planned:
| Action | Priority | Effort | Notes |
|---|---|---|---|
| Mistral OCR 3 as provider (Azure) | Medium | Medium | Quality comparison pending (zbz-ocr-tei), requires Azure endpoint config |
| Agentic Vision integration | Medium | Medium | |
| LightOnOCR-2 Ollama conversion guide | Medium | Low | |
| Layout analysis pre-step | Low | High | |
| Multi-model ensemble | Low | High | |
```
Is it handwritten?
├─ Yes → Use Gemini 3 Pro
└─ No (printed)
   ├─ Complex layout? → Gemini 3 Pro or dots.ocr
   ├─ Simple layout, need speed? → Gemini 3 Flash or DeepSeek OCR
   └─ Privacy/offline required? → DeepSeek OCR (Ollama) or LightOnOCR
```
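The decision tree above can be sketched as a small selection helper. Names and the exact branch priority (handwriting, then privacy, then layout, then speed) are judgment calls for illustration, not part of the coOCR/HTR codebase:

```javascript
// Model selection following the decision tree above.
// Branch priority is an assumption: handwriting first, then privacy.
function chooseModel({ handwritten = false, complexLayout = false,
                       needSpeed = false, offlineRequired = false } = {}) {
  if (handwritten) return "Gemini 3 Pro";
  if (offlineRequired) return "DeepSeek OCR (Ollama) or LightOnOCR";
  if (complexLayout) return "Gemini 3 Pro or dots.ocr";
  if (needSpeed) return "Gemini 3 Flash or DeepSeek OCR";
  return "Gemini 3 Flash"; // simple printed default: cost-effective
}
```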
Concept: Gemini 3 Flash can actively investigate images instead of just passively analyzing them.
Think-Act-Observe Loop:
1. THINK: Model analyzes request, plans multi-step investigation
2. ACT: Model generates Python code to zoom, crop, annotate
3. OBSERVE: Transformed image is re-analyzed
4. REPEAT: Until sufficient clarity is reached
Potential use cases in coOCR/HTR:
- Automatic zooming on illegible passages
- Segmentation of complex layouts
- Quality improvement for damaged documents
- Verification of uncertain readings through re-analysis
Technical implementation:
```javascript
// Concept: Agentic Vision API call
const requestBody = {
  contents: [{ parts }],
  generationConfig: {
    temperature: 1.0,
    maxOutputTokens: 8192
  },
  // Enable Code Execution for Agentic Vision
  tools: [{
    codeExecution: {}
  }]
};
```
Prerequisites:
- Gemini API with Code Execution support
- Prompt engineering for multi-step analysis
- UI for visualization of intermediate steps (optional)
Status: Concept documented, not implemented.
Current state:
- DeepSeek-OCR as default model
- DeepSeek-OCR 2 added as option
- LightOnOCR-2 added as option (requires conversion)
LightOnOCR-2 via Ollama:
The model is available on Hugging Face but must be converted for Ollama:
```shell
# 1. Load model from Hugging Face
git lfs install
git clone https://huggingface.co/lightonai/LightOnOCR-2-1B

# 2. Convert to GGUF (requires llama.cpp)
python convert.py lightonai/LightOnOCR-2-1B --outtype f16

# 3. Create Modelfile
cat > Modelfile << EOF
FROM ./lightonocr-2-1b.gguf
PARAMETER temperature 0.1
PARAMETER num_ctx 4096
EOF

# 4. Import into Ollama
ollama create lightonocr -f Modelfile
```
Note: The conversion is not trivial and requires:
- llama.cpp with Vision support
- Sufficient RAM (16GB+)
- GPU recommended for usable speed
DeepSeek-OCR 2:
More easily available, can be used directly with Ollama:
```shell
# If available in Ollama Registry
ollama pull deepseek-ocr2

# Or manually with Modelfile
ollama create deepseek-ocr2 -f Modelfile
```
Status: Options added to UI, conversion guide documented.
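Once imported, either model can be queried through Ollama's REST API (`POST /api/generate`, with base64-encoded images in the `images` field). A sketch of building such a request; the model name assumes the `ollama create` name used above, and the prompt is illustrative:

```javascript
// Build a request body for Ollama's /api/generate endpoint.
// The model name must match the name given to `ollama create`.
function buildOllamaOcrRequest(modelName, imageBase64) {
  return {
    model: modelName,
    prompt: "Transcribe all text in this document.",
    images: [imageBase64],  // base64-encoded image, no data: URI prefix
    stream: false           // return the full response in one message
  };
}

// Usage: POST as JSON to http://localhost:11434/api/generate
const req = buildOllamaOcrRequest("lightonocr", "iVBORw0KGgo...");
```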
- LightOnOCR Blog
- LightOnOCR-2 Hugging Face
- Gemini 3 Pro Vision Blog
- Gemini 3 HTR Analysis
- Agentic Vision Blog
- dots.ocr GitHub
- DeepSeek OCR Handwriting Test
- Digital Humanities community discussion (2026-02)
Related: METHODOLOGY | ARCHITECTURE | VALIDATION
Updated: 2026-02-14