docs(ocr): add comprehensive OCR provider documentation

Martin · claude · Martin · commit 6d4a7b5af8f0 · 2025-12-22T02:54:43.000+01:00
Created complete OCR documentation suite with provider examples and guides:

New files:
- docs/guide/ocr-providers.md (208 lines)
  - Copy-paste-ready wrapper examples (Mistral, OpenAI, Google Vision)
  - Architecture overview and setup instructions
  - Environment variables and API key management
  - Mock provider for testing
  - Troubleshooting guide
  - Cache behavior documentation

Updated files:
- README.md
  - Added Provider Examples quickstart section
  - Documented practical OCR flow (detect → OCR → cache)
  - Link to detailed provider guide
  - Updated roadmap: marked "OCR for scanned PDFs" as completed
- docs/guide/getting-started.md
  - Added mock OCR provider documentation
  - Usage examples for development and CI/CD
  - Integration testing tips
- OCR_BACKLOG.md
  - Marked documentation tasks as completed
  - Updated "Documentation Gaps" to "Documentation Status"
  - All 6 documentation gaps now fixed

This closes the high-priority OCR documentation backlog items.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
diff --git a/OCR_BACKLOG.md b/OCR_BACKLOG.md
@@ -214,14 +214,14 @@ chat_response = client.chat.complete(
 - **Pros:** Native support, no wrapper needed
 - **Cons:** Increases codebase complexity, maintenance burden
 
-### ❌ Documentation Gaps
+### ⚠️ Documentation Status
 
-1. **No provider examples** (Mistral, OpenAI Vision, Google Vision, etc.)
-2. **Mock provider undocumented** (exists in code, not in docs)
-3. **Caching details missing** (cache key construction, invalidation)
-4. **`extras` field unexplained** (no concrete use cases)
-5. **No troubleshooting guide** (error handling, debugging)
-6. **Provider contract unclear** (required response format: `{ text }` or `{ ocr }`)
+1. ~~**No provider examples**~~ ✅ Fixed (2025-12-22) - Added to README and docs/guide/ocr-providers.md
+2. ~~**Mock provider undocumented**~~ ✅ Fixed (2025-12-22) - Added to docs/guide/getting-started.md
+3. ~~**Caching details missing**~~ ✅ Fixed (2025-12-22) - Documented in ocr-providers.md
+4. ~~**`extras` field unexplained**~~ ✅ Fixed (2025-12-22) - Examples in provider wrapper code
+5. ~~**No troubleshooting guide**~~ ✅ Fixed (2025-12-22) - Troubleshooting section in ocr-providers.md
+6. ~~**Provider contract unclear**~~ ✅ Fixed (2025-12-22) - Documented in README and guide
 
 ## Backlog
 
@@ -231,24 +231,27 @@ chat_response = client.chat.complete(
   - Simple Express.js/Node.js HTTP server
   - Translates pdf-reader-mcp format → Mistral Vision API
   - Deploy as separate service or Docker container
-  - Template code already drafted in this backlog (see Implementation Example)
-
-- [ ] **Add provider examples** to README/docs
-  - Mistral Vision wrapper (with setup instructions)
-  - OpenAI Vision API wrapper (similar pattern)
-  - Google Cloud Vision wrapper
-  - Document that Mistral OCR API is incompatible (document-level vs page-level)
-
-- [ ] **Create `docs/guide/ocr-providers.md`**
-  - Architecture overview: pdf-reader-mcp → wrapper → vision APIs
-  - Step-by-step wrapper setup (Mistral, OpenAI, Google)
-  - Environment variables and API keys
-  - Testing and troubleshooting
-
-- [ ] **Document mock provider**
-  - When to use (testing, development)
-  - Default behavior (returns placeholder text)
-  - How to test without real API calls
+  - ✅ Template code now in docs/guide/ocr-providers.md (copy-paste ready)
+
+- [x] **Add provider examples** to README/docs ✅ (2025-12-22)
+  - ✅ Mistral Vision wrapper (with setup instructions)
+  - ✅ OpenAI Vision API wrapper (similar pattern)
+  - ✅ Google Cloud Vision wrapper
+  - ✅ Documented that Mistral OCR API is incompatible (document-level vs page-level)
+  - ✅ Added practical OCR flow to README
+
+- [x] **Create `docs/guide/ocr-providers.md`** ✅ (2025-12-22)
+  - ✅ Architecture overview: pdf-reader-mcp → wrapper → vision APIs
+  - ✅ Step-by-step wrapper setup (Mistral, OpenAI, Google)
+  - ✅ Environment variables and API keys
+  - ✅ Testing and troubleshooting
+  - ✅ Cache behavior documentation
+
+- [x] **Document mock provider** ✅ (2025-12-22)
+  - ✅ When to use (testing, development, CI/CD)
+  - ✅ Default behavior (returns placeholder text)
+  - ✅ How to test without real API calls
+  - ✅ Added to docs/guide/getting-started.md with integration testing tips
 
 ### Medium Priority
 
diff --git a/README.md b/README.md
@@ -525,6 +525,23 @@ Runs OCR against a rendered page with provider overrides and caching.
 
 The service receives `{ "image": "<base64 PNG>", "model": "vision-large", "language": "en", "extras": { "detect_tables": true } }` and must respond with `{ "text": "..." }` (or `{ "ocr": "..." }`).
 
+**Provider Examples (quick start):**
+
+- **Mistral Vision (HTTP wrapper):**
+  ```json
+  { "type": "http", "endpoint": "http://localhost:3000/v1/ocr", "model": "mistral-large-2512" }
+  ```
+- **OpenAI Vision (HTTP wrapper):**
+  ```json
+  { "type": "http", "endpoint": "http://localhost:3001/v1/ocr", "model": "gpt-4o-mini" }
+  ```
+See `docs/guide/ocr-providers.md` for full setup and wrapper code.
+
+**Practical OCR flow:**
+1) Detect pages that need vision: `pdf_read_pages` with `"insert_markers": true` to find `[IMAGE n: ...]` markers.  
+2) OCR only the candidates: call `pdf_ocr_page` with your provider config (e.g., Mistral or OpenAI wrapper above).  
+3) Rerun cheaply: keep `cache: true` so repeated `pdf_ocr_page` calls reuse the cached text instead of re-hitting the API.
+
 ### `pdf_ocr_image` — OCR a single image
 
 Targets one embedded image for OCR without rasterizing the full page again.
@@ -880,9 +897,9 @@ See [CONTRIBUTING.md](./CONTRIBUTING.md)
 - [x] Y-coordinate ordering (v1.2.0)
 - [x] Absolute paths (v1.3.0)
 - [x] 94%+ test coverage (v1.3.0)
+- [x] OCR for scanned PDFs
 
 **🚀 Next**
-- [ ] OCR for scanned PDFs
 - [ ] Annotation extraction
 - [ ] Form field extraction
 - [ ] Table detection
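
Note on the provider contract in the README hunk above: the service must answer with `{ "text": "..." }` or `{ "ocr": "..." }`. A minimal sketch of how a client could normalize those two shapes follows. `extractOcrText` is a hypothetical helper name, not a function from pdf-reader-mcp, and preferring `text` over `ocr` when both are present is an assumption made for illustration.

```typescript
// Hypothetical helper, not part of pdf-reader-mcp: normalizes the two
// response shapes the README accepts from an HTTP OCR wrapper.
type OcrResponse = { text?: unknown; ocr?: unknown };

function extractOcrText(body: unknown): string {
  if (typeof body !== 'object' || body === null) {
    throw new Error('OCR response must be a JSON object');
  }
  const { text, ocr } = body as OcrResponse;
  // Assumption: `text` takes precedence when both fields are present.
  if (typeof text === 'string') return text;
  if (typeof ocr === 'string') return ocr;
  throw new Error('OCR response must contain a string "text" or "ocr" field');
}
```

A wrapper author could run the same check against their own endpoint's output before pointing `provider.endpoint` at it.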
diff --git a/docs/guide/getting-started.md b/docs/guide/getting-started.md
@@ -189,6 +189,42 @@ Responds with page dimensions, scale, fingerprint, and a PNG part for the render
 
 Outputs OCR `text`, provider info, whether it came `from_cache`, and page identifiers. Use `pdf_ocr_image` similarly when you already know the image index.
 
+**Mock OCR provider — fast, no-network placeholder**
+
+Use the built-in mock provider when you want predictable OCR responses without hitting an external API (ideal for local development, CI, and integration tests).
+
+```json
+{
+  "source": { "path": "./docs/report.pdf" },
+  "page": 1,
+  "provider": { "type": "mock", "name": "test-ocr" },
+  "cache": false
+}
+```
+
+What it does:
+- Returns immediately with placeholder text; never performs network calls.
+- Uses the provided `name` (or `mock` by default) in the response so you can assert which provider ran.
+
+Example response:
+
+```json
+{
+  "source": "./docs/report.pdf",
+  "success": true,
+  "data": {
+    "text": "OCR provider not configured. Supply provider options to enable OCR.",
+    "provider": "test-ocr",
+    "fingerprint": "<document-fingerprint>",
+    "from_cache": false,
+    "page": 1
+  }
+}
+```
+
+Integration testing tip:
+- Exercise the OCR tools end-to-end without external dependencies by calling `pdf_ocr_page` (or `pdf_ocr_image`) with `provider: { "type": "mock" }` and `cache: false`. Assert on the static `text` string and `provider` name to confirm the handler, caching guards, and payload shape are wired correctly.
+
 ### Cache management
 
 Inspect cache state or clear scopes between runs:
diff --git a/docs/guide/ocr-providers.md b/docs/guide/ocr-providers.md
@@ -0,0 +1,208 @@
+# OCR Providers
+
+Use OCR when a page renders as images or when embedded text is unreliable. The MCP server leans on HTTP-friendly provider wrappers so you can swap vision backends without changing client code.
+
+## Capabilities
+
+- `pdf_ocr_page` — renders a page to PNG (respecting `scale`) and POSTs it to your HTTP wrapper; returns OCR `text`, provider metadata, and `from_cache`.
+- `pdf_ocr_image` — reuses an embedded image (by index) without re-rendering the page; same request shape as `pdf_ocr_page`.
+- Both tools accept `provider` configs (`type: "http"`, `endpoint`, `model`, `language`, `extras`) and optionally `api_key`. Set `cache: true` to reuse responses across identical inputs.
+
+## Architecture
+
+`pdf-reader-mcp` → lightweight HTTP wrapper → upstream vision API. The server never talks directly to cloud vision APIs; you own the wrapper so you can inject prompts, redact data, log, or mock responses. Wrappers accept a JSON body `{ image, model, language, extras }` where `image` is a base64 PNG or data URI.
+
+## Provider recipes (copy/paste-ready)
+
+The patterns below mirror the Option B wrapper in `OCR_BACKLOG.md`: single POST endpoint, direct vision call, and minimal plumbing. Replace the API keys and models with your own. These are docs-only examples; run them from a separate Node.js/TypeScript project.
+
+### Mistral Vision (simple, fast)
+
+```typescript
+// mistral-ocr-wrapper.ts
+import 'dotenv/config'; // load MISTRAL_API_KEY from .env before the client is created
+import express from 'express';
+import { Mistral } from '@mistralai/mistralai';
+
+const app = express();
+app.use(express.json({ limit: '50mb' }));
+const client = new Mistral({ apiKey: process.env.MISTRAL_API_KEY });
+
+app.post('/v1/ocr', async (req, res) => {
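+  const { image, model, language, extras } = req.body ?? {};
+  if (typeof image !== 'string' || !image) {
+    // Reject malformed requests up front instead of throwing inside the async handler.
+    res.status(400).json({ error: 'Missing "image" (base64 PNG or data URI)' });
+    return;
+  }
+  const imageUrl = image.startsWith('data:') ? image : `data:image/png;base64,${image}`;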
+  const prompt = extras?.prompt || 'Extract and transcribe all text from this image. Preserve layout and return markdown.';
+
+  try {
+    const response = await client.chat.complete({
+      model: model || 'mistral-large-2512',
+      messages: [{ role: 'user', content: [{ type: 'text', text: prompt }, { type: 'image_url', imageUrl }] }], // the TypeScript SDK uses camelCase fields (imageUrl, maxTokens)
+      temperature: extras?.temperature ?? 0,
+      maxTokens: extras?.max_tokens ?? 4000
+    });
+
+    res.json({ text: response.choices[0].message.content, language });
+  } catch (error) {
+    res.status(500).json({ error: error.message || 'OCR processing failed' });
+  }
+});
+
+app.listen(3000, () => console.log('Mistral OCR wrapper on http://localhost:3000'));
+```
+
+**Setup**
+
+```bash
+npm init -y
+npm install express @mistralai/mistralai dotenv
+echo "MISTRAL_API_KEY=sk-..." > .env
+npx tsx mistral-ocr-wrapper.ts
+```
+
+**Provider config**
+
+```json
+{
+  "type": "http",
+  "endpoint": "http://localhost:3000/v1/ocr",
+  "model": "mistral-large-2512",
+  "language": "en",
+  "extras": { "prompt": "Preserve tables; return markdown", "temperature": 0 }
+}
+```
+
+### OpenAI Vision (similar pattern)
+
+```typescript
+// openai-ocr-wrapper.ts
+import 'dotenv/config'; // load OPENAI_API_KEY from .env before the client is created
+import express from 'express';
+import OpenAI from 'openai';
+
+const app = express();
+app.use(express.json({ limit: '50mb' }));
+const client = new OpenAI({ apiKey: process.env.OPENAI_API_KEY });
+
+app.post('/v1/ocr', async (req, res) => {
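+  const { image, model, language, extras } = req.body ?? {};
+  if (typeof image !== 'string' || !image) {
+    // Reject malformed requests up front instead of throwing inside the async handler.
+    res.status(400).json({ error: 'Missing "image" (base64 PNG or data URI)' });
+    return;
+  }
+  const imageUrl = image.startsWith('data:') ? image : `data:image/png;base64,${image}`;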
+  const prompt = extras?.prompt || 'Extract all text; keep headings, lists, and tables; return markdown.';
+
+  try {
+    const completion = await client.chat.completions.create({
+      model: model || 'gpt-4o-mini',
+      messages: [{ role: 'user', content: [{ type: 'text', text: prompt }, { type: 'image_url', image_url: { url: imageUrl } }] }],
+      temperature: extras?.temperature ?? 0,
+      max_tokens: extras?.max_tokens ?? 4000
+    });
+
+    res.json({ text: completion.choices[0].message.content, language });
+  } catch (error) {
+    res.status(500).json({ error: error.message || 'OCR processing failed' });
+  }
+});
+
+app.listen(3001, () => console.log('OpenAI OCR wrapper on http://localhost:3001'));
+```
+
+**Setup**
+
+```bash
+npm init -y
+npm install express openai dotenv
+echo "OPENAI_API_KEY=sk-..." > .env
+npx tsx openai-ocr-wrapper.ts
+```
+
+**Provider config**
+
+```json
+{
+  "type": "http",
+  "endpoint": "http://localhost:3001/v1/ocr",
+  "model": "gpt-4o-mini",
+  "language": "en"
+}
+```
+
+### Google Cloud Vision (brief JSON wrapper)
+
+```typescript
+// gcv-ocr-wrapper.ts
+import 'dotenv/config'; // picks up GOOGLE_APPLICATION_CREDENTIALS if it is set in .env
+import express from 'express';
+import vision from '@google-cloud/vision';
+
+const app = express();
+app.use(express.json({ limit: '50mb' }));
+const client = new vision.ImageAnnotatorClient();
+
+app.post('/v1/ocr', async (req, res) => {
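+  const { image, language } = req.body ?? {};
+  if (typeof image !== 'string' || !image) {
+    // Reject malformed requests up front instead of throwing inside the async handler.
+    res.status(400).json({ error: 'Missing "image" (base64 PNG or data URI)' });
+    return;
+  }
+  // Cloud Vision expects raw base64, so strip any data URI prefix.
+  const imageContent = image.startsWith('data:') ? image.split(',')[1] : image;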
+
+  try {
+    const [result] = await client.documentTextDetection({ image: { content: imageContent } });
+    const text = result.fullTextAnnotation?.text || '';
+    res.json({ text, language });
+  } catch (error) {
+    res.status(500).json({ error: error.message || 'OCR processing failed' });
+  }
+});
+
+app.listen(3002, () => console.log('GCV OCR wrapper on http://localhost:3002'));
+```
+
+**Setup**
+
+```bash
+npm init -y
+npm install express @google-cloud/vision dotenv
+# Set GOOGLE_APPLICATION_CREDENTIALS=/path/to/service-account.json
+npx tsx gcv-ocr-wrapper.ts
+```
+
+**Provider config**
+
+```json
+{
+  "type": "http",
+  "endpoint": "http://localhost:3002/v1/ocr",
+  "language": "en"
+}
+```
+
+## Environment variables
+
+- `MISTRAL_API_KEY`, `OPENAI_API_KEY` — required for their wrappers.
+- `GOOGLE_APPLICATION_CREDENTIALS` — path to a service account JSON with Vision scope.
+- Optional: `HTTP_PROXY`, `HTTPS_PROXY` if wrappers run behind egress controls.
+
+Load keys via `.env` in wrapper projects; the MCP server does not read them directly when calling `type: "http"` providers.
+
+## Testing with a mock provider
+
+Use a no-network stub to validate end-to-end OCR flows:
+
+```typescript
+// mock-ocr-wrapper.ts
+import express from 'express';
+const app = express();
+app.use(express.json({ limit: '5mb' }));
+app.post('/v1/ocr', (req, res) => res.json({ text: `MOCK TEXT for page/image`, language: req.body.language || 'en' }));
+app.listen(3999, () => console.log('Mock OCR wrapper on http://localhost:3999'));
+```
+
+Point `provider.endpoint` to `http://localhost:3999/v1/ocr` and run `pdf_ocr_page` to confirm request shape, cache keys, and error handling without consuming API quota.
+
+## Troubleshooting
+
+- HTTP 401/403: confirm API keys and that the wrapper forwards `Authorization` if your upstream expects it.
+- Empty or partial text: increase render `scale` (e.g., 1.5–2.0) or raise `max_tokens` in `extras`.
+- Mixed languages: set `language` or include a hint in `extras.prompt`.
+- Timeouts: set generous upstream timeouts in the wrapper; large pages can exceed 10s on some providers. If large renders are rejected with HTTP 413, raise the `express.json` `limit`.
+- Wrong endpoint: verify the MCP server can reach `http://localhost:PORT`; Docker/WSL may need `0.0.0.0` binding.
+
+## Cache behavior
+
+- OCR caches are keyed by source fingerprint, page/index, scale (for `pdf_ocr_page`), provider endpoint, model, language, and `extras`.
+- `cache: true` reuses prior responses and skips provider calls; `cache: false` forces a fresh request and updates the cache.
+- Manage caches with `pdf_cache_stats` (inspect keys/counts) and `pdf_cache_clear` (`scope: "ocr"` or `"all"`).
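+- The cache key only sees the provider config, so changes made inside a wrapper (for example, a new default prompt) are invisible to it. After such a change, alter `extras.prompt` or `model` in the provider config, or clear the OCR cache, to avoid stale responses.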
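
Note on the cache key described in the "Cache behavior" section above: the diff lists the fields that feed the key but does not show how the server combines them. The sketch below is one plausible construction, not the server's actual implementation. `OcrCacheInput`, `stableStringify`, and `ocrCacheKey` are hypothetical names, and hashing a key-sorted JSON serialization with SHA-256 is an assumption chosen so that reordering keys in `extras` does not change the key.

```typescript
import { createHash } from 'node:crypto';

// Hypothetical shape: the fields the guide lists as cache-key inputs.
interface OcrCacheInput {
  fingerprint: string;              // source document fingerprint
  page: number;
  index?: number;                   // image index (pdf_ocr_image only)
  scale?: number;                   // render scale (pdf_ocr_page only)
  endpoint: string;
  model?: string;
  language?: string;
  extras?: Record<string, unknown>;
}

// Serialize with sorted object keys so logically equal inputs hash the same.
function stableStringify(value: unknown): string {
  if (Array.isArray(value)) {
    return `[${value.map(stableStringify).join(',')}]`;
  }
  if (value !== null && typeof value === 'object') {
    const entries = Object.entries(value as Record<string, unknown>)
      .filter(([, v]) => v !== undefined)
      .sort(([a], [b]) => (a < b ? -1 : a > b ? 1 : 0))
      .map(([k, v]) => `${JSON.stringify(k)}:${stableStringify(v)}`);
    return `{${entries.join(',')}}`;
  }
  return JSON.stringify(value) ?? 'null';
}

// Assumed construction: SHA-256 over the stable serialization.
function ocrCacheKey(input: OcrCacheInput): string {
  return createHash('sha256').update(stableStringify(input)).digest('hex');
}
```

Under this construction, changing `extras.prompt` or `model` in the provider config yields a new key, while a prompt change hidden inside the wrapper does not.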