# OCR Providers

Use OCR when a page renders as images or when embedded text is unreliable. The MCP server leans on HTTP-friendly provider wrappers so you can swap vision backends without changing client code.

## Capabilities

- `pdf_ocr_page` — renders a page to PNG (respecting `scale`) and POSTs it to your HTTP wrapper; returns OCR `text`, provider metadata, and `from_cache`.
- `pdf_ocr_image` — reuses an embedded image (by index) without re-rendering the page; same request shape as `pdf_ocr_page`.
- Both tools accept `provider` configs (`type: "http"`, `endpoint`, `model`, `language`, `extras`) and optionally `api_key`. Set `cache: true` to reuse responses across identical inputs.
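For example, a `pdf_ocr_page` call combines page selection with a provider block. The provider fields are the ones listed above, while the surrounding argument names (`path`, `page`) are illustrative; check the tool schema for the exact names:

```json
{
  "path": "docs/scan.pdf",
  "page": 3,
  "scale": 1.5,
  "provider": {
    "type": "http",
    "endpoint": "http://localhost:3000/v1/ocr",
    "model": "mistral-large-2512",
    "language": "en"
  },
  "cache": true
}
```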

## Architecture

`pdf-reader-mcp` → lightweight HTTP wrapper → upstream vision API. The server never talks directly to cloud vision APIs; you own the wrapper, so you can inject prompts, redact data, log, or mock responses. Wrappers accept a JSON body `{ image, model, language, extras }`, where `image` is a base64 PNG or a data URI.
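All of the wrapper recipes below normalize `image` the same way before calling upstream. A minimal sketch of that shared step (the function name is illustrative, not part of the server):

```typescript
// Accept either a raw base64 PNG or a full data URI; always return a data URI.
// Mirrors the inline normalization each wrapper recipe performs.
function toDataUri(image: string): string {
  return image.startsWith('data:') ? image : `data:image/png;base64,${image}`;
}
```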

## Provider recipes (copy/paste-ready)

The patterns below mirror the Option B wrapper in `OCR_BACKLOG.md`: a single POST endpoint, a direct vision call, and minimal plumbing. Replace the API keys and models with your own. These are docs-only examples — run them from a separate Node/TypeScript project.

### Mistral Vision (simple, fast)

```typescript
// mistral-ocr-wrapper.ts
import 'dotenv/config'; // load MISTRAL_API_KEY from .env
import express from 'express';
import { Mistral } from '@mistralai/mistralai';

const app = express();
app.use(express.json({ limit: '50mb' }));
const client = new Mistral({ apiKey: process.env.MISTRAL_API_KEY });

app.post('/v1/ocr', async (req, res) => {
  const { image, model, language, extras } = req.body;
  const imageUrl = image.startsWith('data:') ? image : `data:image/png;base64,${image}`;
  const prompt = extras?.prompt || 'Extract and transcribe all text from this image. Preserve layout and return markdown.';

  try {
    const response = await client.chat.complete({
      model: model || 'mistral-large-2512',
      messages: [{ role: 'user', content: [{ type: 'text', text: prompt }, { type: 'image_url', image_url: imageUrl }] }],
      temperature: extras?.temperature ?? 0,
      maxTokens: extras?.max_tokens ?? 4000
    });

    res.json({ text: response.choices[0].message.content, language });
  } catch (error) {
    res.status(500).json({ error: error.message || 'OCR processing failed' });
  }
});

app.listen(3000, () => console.log('Mistral OCR wrapper on http://localhost:3000'));
```

**Setup**

```bash
npm init -y
npm install express @mistralai/mistralai dotenv
echo "MISTRAL_API_KEY=sk-..." > .env
npx tsx mistral-ocr-wrapper.ts
```

**Provider config**

```json
{
  "type": "http",
  "endpoint": "http://localhost:3000/v1/ocr",
  "model": "mistral-large-2512",
  "language": "en",
  "extras": { "prompt": "Preserve tables; return markdown", "temperature": 0 }
}
```

### OpenAI Vision (similar pattern)

```typescript
// openai-ocr-wrapper.ts
import 'dotenv/config'; // load OPENAI_API_KEY from .env
import express from 'express';
import OpenAI from 'openai';

const app = express();
app.use(express.json({ limit: '50mb' }));
const client = new OpenAI({ apiKey: process.env.OPENAI_API_KEY });

app.post('/v1/ocr', async (req, res) => {
  const { image, model, language, extras } = req.body;
  const imageUrl = image.startsWith('data:') ? image : `data:image/png;base64,${image}`;
  const prompt = extras?.prompt || 'Extract all text; keep headings, lists, and tables; return markdown.';

  try {
    const completion = await client.chat.completions.create({
      model: model || 'gpt-4o-mini',
      messages: [{ role: 'user', content: [{ type: 'text', text: prompt }, { type: 'image_url', image_url: { url: imageUrl } }] }],
      temperature: extras?.temperature ?? 0,
      max_tokens: extras?.max_tokens ?? 4000
    });

    res.json({ text: completion.choices[0].message.content, language });
  } catch (error) {
    res.status(500).json({ error: error.message || 'OCR processing failed' });
  }
});

app.listen(3001, () => console.log('OpenAI OCR wrapper on http://localhost:3001'));
```

**Setup**

```bash
npm init -y
npm install express openai dotenv
echo "OPENAI_API_KEY=sk-..." > .env
npx tsx openai-ocr-wrapper.ts
```

**Provider config**

```json
{
  "type": "http",
  "endpoint": "http://localhost:3001/v1/ocr",
  "model": "gpt-4o-mini",
  "language": "en"
}
```

### Google Cloud Vision (brief JSON wrapper)

```typescript
// gcv-ocr-wrapper.ts
import express from 'express';
import vision from '@google-cloud/vision';

const app = express();
app.use(express.json({ limit: '50mb' }));
const client = new vision.ImageAnnotatorClient();

app.post('/v1/ocr', async (req, res) => {
  const { image, language } = req.body;
  const imageContent = image.startsWith('data:') ? image.split(',')[1] : image;

  try {
    const [result] = await client.documentTextDetection({ image: { content: imageContent } });
    const text = result.fullTextAnnotation?.text || '';
    res.json({ text, language });
  } catch (error) {
    res.status(500).json({ error: error.message || 'OCR processing failed' });
  }
});

app.listen(3002, () => console.log('GCV OCR wrapper on http://localhost:3002'));
```

**Setup**

```bash
npm init -y
npm install express @google-cloud/vision dotenv
# Set GOOGLE_APPLICATION_CREDENTIALS=/path/to/service-account.json
npx tsx gcv-ocr-wrapper.ts
```

**Provider config**

```json
{
  "type": "http",
  "endpoint": "http://localhost:3002/v1/ocr",
  "language": "en"
}
```

## Environment variables

- `MISTRAL_API_KEY`, `OPENAI_API_KEY` — required by their respective wrappers.
- `GOOGLE_APPLICATION_CREDENTIALS` — path to a service-account JSON with Vision scope.
- Optional: `HTTP_PROXY`, `HTTPS_PROXY` if wrappers run behind egress controls.

Load keys via `.env` in wrapper projects; the MCP server does not read them directly when calling `type: "http"` providers.
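Wrapper projects must load `.env` themselves (for example via `dotenv`); `npx tsx` does not do it automatically. A fail-fast check, sketched here as a hypothetical helper rather than anything the server provides, keeps a wrapper from starting without its key:

```typescript
// Hypothetical startup guard: throw immediately if a required key is absent,
// instead of failing later with an opaque 401 from the upstream provider.
function requireEnv(
  name: string,
  env: Record<string, string | undefined> = process.env
): string {
  const value = env[name];
  if (!value) throw new Error(`Missing required environment variable: ${name}`);
  return value;
}
```

Call it (e.g. `requireEnv('MISTRAL_API_KEY')`) before constructing the upstream client.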

## Testing with a mock provider

Use a no-network stub to validate end-to-end OCR flows:

```typescript
// mock-ocr-wrapper.ts
import express from 'express';
const app = express();
app.use(express.json({ limit: '5mb' }));
app.post('/v1/ocr', (req, res) => res.json({ text: 'MOCK TEXT for page/image', language: req.body.language || 'en' }));
app.listen(3999, () => console.log('Mock OCR wrapper on http://localhost:3999'));
```

Point `provider.endpoint` to `http://localhost:3999/v1/ocr` and run `pdf_ocr_page` to confirm request shape, cache keys, and error handling without consuming API quota.

## Troubleshooting

- HTTP 401/403: confirm API keys and that the wrapper forwards `Authorization` if your upstream expects it.
- Empty or partial text: increase render `scale` (e.g., 1.5–2.0) or raise `max_tokens` in `extras`.
- Mixed languages: set `language` or include a hint in `extras.prompt`.
- Timeouts: wrappers should set generous `express.json` limits and upstream timeouts; large pages can exceed 10s on some providers.
- Wrong endpoint: verify the MCP server can reach `http://localhost:PORT`; Docker/WSL may need `0.0.0.0` binding.
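For the timeout bullet above, one common pattern is to race the upstream call against a timer inside the wrapper. A sketch (the helper is illustrative, not part of any recipe above):

```typescript
// Reject if the upstream vision call takes longer than `ms` milliseconds.
async function withTimeout<T>(promise: Promise<T>, ms: number): Promise<T> {
  let timer: ReturnType<typeof setTimeout> | undefined;
  const timeout = new Promise<never>((_, reject) => {
    timer = setTimeout(() => reject(new Error(`Upstream OCR call timed out after ${ms}ms`)), ms);
  });
  try {
    return await Promise.race([promise, timeout]);
  } finally {
    clearTimeout(timer);
  }
}
```

In a handler you might wrap the provider call, e.g. `await withTimeout(client.chat.complete({ /* ... */ }), 30_000)`, and map the rejection to a 504 response.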

## Cache behavior

- OCR caches are keyed by source fingerprint, page/index, scale (for `pdf_ocr_page`), provider endpoint, model, language, and `extras`.
- `cache: true` reuses prior responses and skips provider calls; `cache: false` forces a fresh request and updates the cache.
- Manage caches with `pdf_cache_stats` (inspect keys/counts) and `pdf_cache_clear` (`scope: "ocr"` or `"all"`).
- When a wrapper's prompt or model changes, update `extras.prompt` or `model` in the provider config so the cache key changes and stale responses are not reused.
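Conceptually, the key is a digest over the fields in the first bullet. A sketch of how such a key could be derived (an illustration of the idea, not the server's actual implementation):

```typescript
import { createHash } from 'node:crypto';

// Illustrative cache key: hash every input that should invalidate the cache
// when it changes (fingerprint, page, scale, endpoint, model, language, extras).
function ocrCacheKey(parts: {
  fingerprint: string;
  page: number;
  scale?: number;
  endpoint: string;
  model?: string;
  language?: string;
  extras?: unknown;
}): string {
  return createHash('sha256').update(JSON.stringify(parts)).digest('hex');
}
```

Changing `extras.prompt` or `model` yields a new key under this scheme, which is why updating those fields avoids stale responses.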