docs(ocr): add two-tier Vision→OCR workflow to backlog + comparison test

Martin · claude · Martin · commit ae6acf22b52b · 2025-12-22T04:17:50.000+01:00
Added to OCR_BACKLOG.md (Medium Priority):
- Two-tier workflow: Vision classification → OCR deep analysis
  - Tier 1 (Vision): Quick, cheap image classification
  - Tier 2 (OCR): Detailed data extraction on demand
  - Triggers: Image markers, complex tables, user choice
  - Benefits: Cost-effective, flexible, user-controlled depth
- Dedicated Mistral OCR API wrapper (to be built)
  - Use client.ocr.process() not chat.complete()
  - Model: mistral-ocr-2512 (OCR 3)
  - Features: structured output, tables, headers/footers
  - Note: Current wrapper is Vision API (good for classification)

Created OCR_COMPARISON_TEST.md:
- Detailed comparison: Claude Vision vs Mistral Vision vs Mistral OCR
- Test case: N3290x Design Guide, Page 890 timing diagram
- Results: Vision good for "what is this?", OCR best for precise extraction
- Recommended workflow: Vision triage → OCR deep dive (optional)
- Cost analysis: Vision ~$0.003, OCR ~$0.002 per page
- Both should offer persistent caching

Clarifies the Vision/OCR API confusion discovered during testing.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
diff --git a/OCR_BACKLOG.md b/OCR_BACKLOG.md
@@ -255,6 +255,35 @@ chat_response = client.chat.complete(
 
 ### Medium Priority
 
+- [ ] **Two-tier Vision → OCR workflow** 🆕
+  - **Tier 1: Vision Classification (quick, cheap)**
+    - Classify image type: "timing diagram", "table", "schematic", etc.
+    - Provide context: "4 signals, technical drawing"
+    - Ask: "Want more details?" → proceed to Tier 2
+  - **Tier 2: OCR Deep Analysis (detailed, on demand)**
+    - Extract all labels, values, annotations
+    - Precise technical data: "VDD33: 3.3V @ t=0ms, threshold: 1.62V"
+  - **Triggers:**
+    - Image markers: `[IMAGE n: ...]` → offer Vision classification
+    - Complex tables: Detected via text extraction → offer OCR
+    - User choice: Both tiers should be optional
+  - **Benefits:**
+    - Cost-effective: Vision for triage, OCR only when needed
+    - Flexible: User controls depth of analysis
+  - **Implementation:**
+    - Vision wrapper (existing: chat.complete)
+    - OCR wrapper (new: ocr.process with mistral-ocr-2512)
+    - Workflow orchestration tool/skill
+
+- [ ] **Build dedicated Mistral OCR API wrapper** 🆕
+  - Use `client.ocr.process()` instead of Vision API
+  - Model: `mistral-ocr-2512` (OCR 3)
+  - Endpoint: `/v1/ocr`
+  - Features: table_format, extract_header/footer, include_image_base64
+  - Structured output: markdown, HTML tables, images
+  - Price: $2 per 1,000 pages ($1 per 1,000 pages with the Batch API)
+  - **Note:** Current wrapper uses Vision API (chat.complete) - good for classification, not OCR
+
 - [ ] **Add troubleshooting guide** (`docs/troubleshooting/ocr.md`)
   - Common errors (missing endpoint, auth failures)
   - Response format validation errors
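
The dedicated OCR wrapper item in the hunk above could start from something like the following. This is a minimal sketch, not the project's implementation: the helper names `build_image_document` and `run_ocr` are hypothetical, and the response shape (a `pages` list whose items carry a `markdown` attribute) is an assumption about what `client.ocr.process()` returns. The client is passed in, so the block runs offline against a stub; the `table_format` and header/footer options from the backlog item are left out on purpose.

```python
import base64
from types import SimpleNamespace


def build_image_document(image_bytes: bytes, mime_type: str = "image/png") -> dict:
    """Wrap raw image bytes as a data-URI document for an OCR request (hypothetical helper)."""
    encoded = base64.b64encode(image_bytes).decode("ascii")
    return {"type": "image_url", "image_url": f"data:{mime_type};base64,{encoded}"}


def run_ocr(client, image_bytes: bytes, model: str = "mistral-ocr-2512") -> str:
    """Call client.ocr.process() and join the markdown of every returned page.

    Assumption: the response exposes `pages`, each with a `markdown` attribute.
    """
    response = client.ocr.process(
        model=model,
        document=build_image_document(image_bytes),
        include_image_base64=False,
    )
    return "\n\n".join(page.markdown for page in response.pages)


# Offline stand-in for a real client, used only to exercise the sketch.
class _FakeOcr:
    def process(self, **kwargs):
        self.last_request = kwargs  # remember what the wrapper sent
        return SimpleNamespace(
            pages=[SimpleNamespace(markdown="VDD33"), SimpleNamespace(markdown="1.62V")]
        )


fake_client = SimpleNamespace(ocr=_FakeOcr())
ocr_text = run_ocr(fake_client, b"abc")
```

Swapping `fake_client` for a real SDK client should be the only change needed if the assumed response shape holds; that still needs to be checked against the SDK before the wrapper is built.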
diff --git a/OCR_COMPARISON_TEST.md b/OCR_COMPARISON_TEST.md
@@ -0,0 +1,184 @@
+# OCR API Comparison Test
+
+**Date:** 2025-12-22
+**Document:** N3290x_Design_Guide_A1.pdf, Pages 889-890
+**Purpose:** Compare Vision API vs OCR API for technical diagram analysis
+
+## Test Case: Power-on Sequence Timing Diagram
+
+**Location:** Page 890, Image 1 (918x482px)
+**Content:** Technical timing diagram with voltage signals, thresholds, and timing parameters
+
+---
+
+## Method 1: Claude Vision (Native)
+
+**Process:** pdf-reader-mcp → Image extraction → Claude Sonnet 4.5 Vision
+
+**Result:**
+
+### Analysis:
+- **Identified as:** Power-on sequence timing diagram
+- **Signals detected:**
+  1. VDD33 (blue) - 3.3V IO Power
+  2. 1.8V Core Power (turquoise)
+  3. RESET (red, dashed)
+  4. Internal RESET (magenta, dashed)
+
+- **Key parameters extracted:**
+  - Threshold: 1.62V
+  - Threshold: VDD33/2
+  - Timing: "More than 4T where T is XTAL cycle"
+  - Duration: 75ms
+  - Label: "Valid power on setting value"
+
+- **Axes:**
+  - Y-axis: Voltage (V)
+  - X-axis: Time (mS)
+
+**Quality:** ✅ Accurate, comprehensive technical understanding
+
+**Cost:** ~$0.01-0.02 per image (Claude API pricing)
+
+**Cache:** ❌ No persistent cache (not part of pdf-reader-mcp OCR cache)
+
+---
+
+## Method 2: Mistral Vision API (via wrapper)
+
+**Process:** pdf-reader-mcp → Image extraction → Mistral Vision wrapper → mistral-large-2512
+
+**Current Status:** ✅ Wrapper built and tested
+**API Used:** `client.chat.complete()` with vision
+**Model:** mistral-large-2512
+
+**Expected Result:** Similar to Claude Vision
+- Semantic understanding
+- Identifies diagram type
+- General description of signals
+
+**Quality:** Expected ✅ Good for classification
+
+**Cost:** ~$0.002-0.003 per image (Mistral Vision pricing)
+
+**Cache:** ✅ Persistent disk cache (`N3290x_Design_Guide_A1_ocr.json`)
+
+**Note:** This is **Vision API**, not **OCR API** - good for "what is this?" not "extract all labels"
+
+---
+
+## Method 3: Mistral OCR API (NOT YET IMPLEMENTED)
+
+**Process:** pdf-reader-mcp → Image/PDF → Mistral OCR wrapper → mistral-ocr-2512
+
+**Current Status:** ❌ Not implemented (Vision wrapper built instead)
+
+**API Needed:** `client.ocr.process()`
+**Model:** `mistral-ocr-2512` (OCR 3)
+**Endpoint:** `/v1/ocr`
+
+**Expected Features:**
+- Structured output: `.markdown`, `.tables[]`, `.images[]`
+- Precise text extraction from technical diagrams
+- Table detection with HTML/markdown output
+- Header/footer extraction
+
+**Expected Result for our diagram:**
+```json
+{
+  "markdown": "VDD33\n1.8V Core Power\nRESET\nInternal RESET\n...",
+  "labels": [
+    "Voltage (V)",
+    "Time (mS)",
+    "1.62V",
+    "VDD33/2",
+    "More than 4T where T is XTAL cycle",
+    "75ms",
+    "Valid power on setting value"
+  ]
+}
+```
+
+**Quality:** Expected ✅✅ Best for precise data extraction
+
+**Cost:** $2 per 1,000 pages = $0.002 per page ($1 per 1,000 pages with the Batch API)
+
+**Cache:** ✅ Persistent disk cache (same as Vision wrapper)
+
+---
+
+## Comparison Summary
+
+| Method | API Type | Quality | Cost per image/page | Cache | Best For |
+|--------|----------|---------|---------------------|-------|----------|
+| **Claude Vision** | Vision | ✅ Excellent | ~$0.01-0.02 | ❌ No | Semantic understanding, complex analysis |
+| **Mistral Vision** | Vision | ✅ Good (expected) | ~$0.002-0.003 | ✅ Yes | Quick classification, "what is this?" |
+| **Mistral OCR** | OCR | ✅✅ Best (expected, not yet tested) | ~$0.002 | ✅ Yes (planned) | **Precise data extraction, technical diagrams** |
+
+---
+
+## Recommended Workflow: Two-Tier Approach
+
+### Tier 1: Vision Classification (Quick Triage)
+**Tool:** Mistral Vision wrapper (existing)
+- "This is a timing diagram with 4 signals"
+- "Complex table with 12 rows"
+- **Cost:** Low (~$0.003)
+- **Speed:** Fast
+- **Decision:** "Interesting? → Proceed to OCR"
+
+### Tier 2: OCR Deep Analysis (On Demand)
+**Tool:** Mistral OCR wrapper (to be built)
+- "VDD33: 3.3V, rises from 0V at t=0ms"
+- "Threshold: 1.62V (VDD33/2)"
+- "Timing constraint: >4T where T=XTAL cycle"
+- "Duration: 75ms until valid power-on"
+- **Cost:** Low (~$0.002)
+- **Speed:** Moderate
+- **Trigger:** User requests details
+
+### Benefits:
+- 💰 Cost-effective: Vision for triage, OCR only when needed
+- ⚡ Fast: Quick overview without deep analysis
+- 🎯 Flexible: User controls analysis depth
+- 💾 Cached: Both results persist in .json files
+
+---
+
+## Action Items
+
+- [x] Build Mistral Vision wrapper (completed)
+- [ ] Build Mistral OCR wrapper (`client.ocr.process()`)
+- [ ] Implement two-tier workflow
+- [ ] Add Vision classification as optional step in pdf-reader-mcp
+- [ ] Document both approaches in guide
+
+---
+
+## Technical Notes
+
+### Current Mistral Vision Wrapper
+- ✅ Working: wrapper exposes a `POST /v1/ocr` endpoint (Vision-backed despite the name)
+- ✅ Uses: `client.chat.complete()` with vision
+- ✅ Accepts: Base64 images, data URIs
+- ✅ Returns: `{ text, language }`
+- ⚠️ Limitation: Vision API, not OCR API - good for understanding, not extraction
+
+### Needed: Mistral OCR Wrapper
+- ❌ Not built yet
+- Should use: `client.ocr.process()`
+- Should accept: PDFs, base64 images
+- Should return: Structured data (markdown, tables, images)
+- Features: table_format, extract_header/footer, include_image_base64
+
+### Why Both?
+- **Vision:** Semantic understanding ("This is a Power-on sequence diagram")
+- **OCR:** Data extraction ("VDD33=3.3V, t=75ms, threshold=1.62V")
+- **Together:** Complete analysis pipeline
+
+---
+
+**Conclusion:** For technical diagrams like our timing diagram, the ideal approach is:
+1. Quick Vision classification to understand context
+2. Deep OCR analysis to extract precise values
+3. Both cached for future reference
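
The two-tier workflow recommended in OCR_COMPARISON_TEST.md can be sketched as below. This is an illustration under stated assumptions, not the planned orchestration tool: `cached_call` and `analyze_image` are hypothetical names, the cache is a flat JSON file keyed by the image's SHA-256 digest, and the `classify` and `extract` callables stand in for the Vision and OCR wrappers. As a simplification, Tier 1 always runs here, whereas the backlog item makes both tiers optional.

```python
import hashlib
import json
import tempfile
from pathlib import Path
from typing import Callable


def cached_call(cache_path: Path, key: str, compute: Callable[[], str]) -> str:
    """Return the cached value for key, computing and persisting it on a miss."""
    cache = json.loads(cache_path.read_text()) if cache_path.exists() else {}
    if key not in cache:
        cache[key] = compute()
        cache_path.write_text(json.dumps(cache, indent=2))
    return cache[key]


def analyze_image(
    image_bytes: bytes,
    classify: Callable[[bytes], str],
    extract: Callable[[bytes], str],
    want_details: Callable[[str], bool],
    cache_path: Path,
) -> dict:
    """Tier 1 classifies the image; Tier 2 runs only if want_details approves."""
    digest = hashlib.sha256(image_bytes).hexdigest()
    summary = cached_call(cache_path, f"vision:{digest}", lambda: classify(image_bytes))
    result = {"classification": summary, "details": None}
    if want_details(summary):
        result["details"] = cached_call(
            cache_path, f"ocr:{digest}", lambda: extract(image_bytes)
        )
    return result


# Offline demo with stub tiers; `calls` records which tier actually ran.
calls = []


def fake_classify(image_bytes: bytes) -> str:
    calls.append("vision")
    return "timing diagram, 4 signals"


def fake_extract(image_bytes: bytes) -> str:
    calls.append("ocr")
    return "Threshold: 1.62V"


demo_cache = Path(tempfile.mkdtemp()) / "demo_ocr.json"
wants = lambda summary: "diagram" in summary  # stands in for the user's choice
first = analyze_image(b"png-bytes", fake_classify, fake_extract, wants, demo_cache)
second = analyze_image(b"png-bytes", fake_classify, fake_extract, wants, demo_cache)
```

The second call is served entirely from the JSON cache, so neither stub runs again. That is the behavior the "both cached" point in the conclusion relies on.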