Commit 320ecf1

Martin and claude committed
feat(ocr): add persistent disk cache for OCR results
Implements 3-layer caching architecture to reduce expensive API calls:
- Layer 1: In-memory cache (fast, volatile)
- Layer 2: Disk cache (persistent, JSON files)
- Layer 3: API calls (slow, expensive)

OCR results are now stored as {pdf_basename}_ocr.json alongside the source PDF, surviving MCP server restarts and enabling cost-effective reuse of Mistral OCR results.

Features:
- Automatic fingerprint validation (invalidates on PDF changes)
- Provider-specific caching (different OCR settings = separate cache)
- Supports both page OCR (pdf_ocr_page) and image OCR (pdf_ocr_image)
- Structured storage for Mistral OCR metadata (markdown, tables, etc.)

Limitations:
- Only works for file-based PDFs (not URLs)
- Requires write permissions in PDF directory

Files:
- src/types/cache.ts: Cache structure type definitions
- src/utils/diskCache.ts: Load/save/get/set utilities
- src/handlers/ocrPage.ts: Integrated 3-layer cache
- src/handlers/ocrImage.ts: Integrated 3-layer cache

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
1 parent ee633b4 commit 320ecf1
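
The lookup order described in the commit message (memory, then disk, then API) can be sketched roughly as follows. This is an illustration only, not the code in `src/handlers/ocrPage.ts`: the names `OcrResult`, `DiskStore`, and `LayeredOcrCache` are hypothetical, and the disk layer is abstracted behind an interface so the sketch stays self-contained.

```typescript
// Hypothetical sketch of a three-layer lookup: memory -> disk -> API.
// None of these names are taken from the commit's source files.

type OcrResult = { text: string };
type CacheSource = "memory" | "disk" | "api";

// Stand-in for the persistent layer (the commit uses JSON files on disk).
interface DiskStore {
  get(key: string): OcrResult | undefined;
  set(key: string, value: OcrResult): void;
}

class LayeredOcrCache {
  private memory = new Map<string, OcrResult>();
  private disk: DiskStore;
  private callApi: (key: string) => Promise<OcrResult>;

  constructor(disk: DiskStore, callApi: (key: string) => Promise<OcrResult>) {
    this.disk = disk;
    this.callApi = callApi;
  }

  async get(key: string): Promise<{ result: OcrResult; source: CacheSource }> {
    // Layer 1: in-memory (fast, lost on restart).
    const inMemory = this.memory.get(key);
    if (inMemory) return { result: inMemory, source: "memory" };

    // Layer 2: disk (persistent). Promote hits into memory.
    const onDisk = this.disk.get(key);
    if (onDisk) {
      this.memory.set(key, onDisk);
      return { result: onDisk, source: "disk" };
    }

    // Layer 3: the expensive API call. Write the result back to both caches.
    const fresh = await this.callApi(key);
    this.disk.set(key, fresh);
    this.memory.set(key, fresh);
    return { result: fresh, source: "api" };
  }
}
```

In this sketch a new `LayeredOcrCache` built over the same `DiskStore` models a server restart: the first lookup is served from disk, and the API callback is not invoked again.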

7 files changed, +725 -18 lines changed

CHANGELOG.md

Lines changed: 14 additions & 0 deletions
@@ -12,6 +12,20 @@
 - Enables selective OCR processing (e.g., OCR only 50 of 800 pages with markers)
 - Non-breaking change: defaults to `false` to preserve existing behavior
 
+- **OCR:** add persistent disk cache for OCR results
+- 3-layer cache architecture: in-memory → disk → API
+- Stores OCR results as `{pdf_basename}_ocr.json` alongside PDFs
+- Survives MCP server restarts and reduces expensive API calls
+- Fingerprint validation automatically invalidates cache on PDF changes
+- Supports both page OCR (`pdf_ocr_page`) and image OCR (`pdf_ocr_image`)
+- Only works for file-based PDFs (not URLs)
+
+### 🐛 Bug Fixes
+
+- **pdf_read_pages:** fix image extraction when `insert_markers=true` but `include_image_indexes=false`
+- Images were not being extracted for marker insertion
+- Now extracts images when EITHER parameter is enabled
+
 ## 2.1.0 (2025-12-17)
 
 ### ✨ Features
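
The changelog entry names the cache file `{pdf_basename}_ocr.json` and restricts it to file-based PDFs. A minimal sketch of how such a path could be derived is shown below; `ocrCachePathFor` is a hypothetical helper, not a function confirmed to exist in `src/utils/diskCache.ts`, and treating any `http://` or `https://` source as uncacheable is an assumption.

```typescript
import * as path from "node:path";

// Hypothetical helper: maps a PDF source to the path of its OCR cache file.
// Returns null for URL sources, which the changelog says are not cached on disk.
function ocrCachePathFor(source: string): string | null {
  // Assumption: URL sources are recognised by their scheme prefix.
  if (/^https?:\/\//i.test(source)) return null;

  // path.parse splits "docs/report.pdf" into dir "docs" and name "report".
  const { dir, name } = path.parse(source);

  // The cache file sits next to the PDF: "docs/report_ocr.json".
  return path.join(dir, `${name}_ocr.json`);
}
```

Because the file is written into the PDF's own directory, a caller would still need to handle a failed write (for example, a read-only directory) and fall back to the in-memory and API layers.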

OCR_BACKLOG.md

Lines changed: 65 additions & 2 deletions
@@ -1,9 +1,16 @@
 # OCR Implementation - Status & Backlog
 
 **Status:** ✅ Implementation complete, ⚠️ Documentation incomplete
-**Last Updated:** 2025-12-21
+**Last Updated:** 2025-12-22
 **API Version Checked:** Mistral API 2025-12 (mistral-large-2512, mistral-ocr-2512)
 
+## 🆕 Update Summary (2025-12-22)
+- **Implemented persistent disk cache** for OCR results
+- ✅ 3-layer cache architecture: in-memory → disk → API
+- ✅ JSON cache files stored alongside PDFs (`{basename}_ocr.json`)
+- ✅ Fingerprint validation to detect PDF changes
+- ✅ Supports both page and image OCR caching
+
 ## 🆕 Update Summary (2025-12-21)
 - ✅ Verified against current Mistral API documentation
 - ✅ Updated model names (mistral-large-2512, mistral-medium-2508, etc.)
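
The fingerprint validation mentioned in the update summary can be illustrated with a short sketch. The cache structure documented later in this file labels the fingerprint as `sha256-hash-of-first-64kb`; the code below takes that label literally, which is an assumption, since the real algorithm is not shown in this diff. All function names here are hypothetical.

```typescript
import { createHash } from "node:crypto";
import { closeSync, openSync, readSync } from "node:fs";

// Assumption: the fingerprint covers only the first 64 KiB of the file.
const FINGERPRINT_BYTES = 64 * 1024;

// SHA-256 (hex) over at most the first 64 KiB of the given bytes.
function fingerprintBytes(data: Uint8Array): string {
  return createHash("sha256")
    .update(data.subarray(0, FINGERPRINT_BYTES))
    .digest("hex");
}

// Reads only the leading 64 KiB from disk, so large PDFs are not loaded whole.
function fingerprintFile(pdfPath: string): string {
  const fd = openSync(pdfPath, "r");
  try {
    const buf = Buffer.alloc(FINGERPRINT_BYTES);
    const bytesRead = readSync(fd, buf, 0, FINGERPRINT_BYTES, 0);
    return fingerprintBytes(buf.subarray(0, bytesRead));
  } finally {
    closeSync(fd);
  }
}

// A cache file is considered valid only if its stored fingerprint still matches.
function isCacheFresh(cached: { fingerprint: string }, current: string): boolean {
  return cached.fingerprint === current;
}
```

Under this assumption the check is cheap, but a change located entirely after the first 64 KiB would leave the fingerprint unchanged, so such an edit would not invalidate the cache in this sketch.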
@@ -34,11 +41,67 @@
 - Generic HTTP OCR provider pattern
 - Page OCR (`pdf_ocr_page`) with configurable scale
 - Image OCR (`pdf_ocr_image`) for embedded images
-- Fingerprint-based caching (text + provider key)
+- **3-layer caching architecture** (NEW in v1.4.0):
+- **Layer 1: In-memory cache** - Fast, volatile (survives within session)
+- **Layer 2: Disk cache** - Persistent, survives restarts (JSON files)
+- **Layer 3: API calls** - Slow, expensive (only when cache misses)
+- Fingerprint-based cache validation (detects PDF modifications)
 - Mock provider for testing
 - Vex schema validation for provider config
 - Cache management tools (`pdf_cache_stats`, `pdf_cache_clear`)
 
+### 💾 Disk Cache Implementation
+
+**File Format:** `{pdf_basename}_ocr.json` (stored in same directory as PDF)
+
+**Structure:**
+```json
+{
+  "fingerprint": "sha256-hash-of-first-64kb",
+  "pdf_path": "/path/to/document.pdf",
+  "created_at": "2025-12-22T...",
+  "updated_at": "2025-12-22T...",
+  "ocr_provider": "mistral-ocr-2512",
+  "pages": {
+    "2": {
+      "text": "OCR result...",
+      "markdown": "...",
+      "tables": [...],
+      "hyperlinks": [...],
+      "dimensions": {...},
+      "provider_hash": "sha256...",
+      "cached_at": "2025-12-22T...",
+      "scale": 1.5
+    }
+  },
+  "images": {
+    "2/0": {
+      "text": "OCR result for image 0 on page 2",
+      "markdown": "...",
+      "provider_hash": "sha256...",
+      "cached_at": "2025-12-22T..."
+    }
+  }
+}
+```
+
+**Benefits:**
+- ✅ Survives MCP server restarts
+- ✅ Reduces API costs (expensive Mistral OCR calls)
+- ✅ Can be version-controlled with PDFs
+- ✅ Shareable between users/machines
+- ✅ Automatic invalidation on PDF changes (fingerprint mismatch)
+
+**Limitations:**
+- Only works for file-based PDFs (not URLs)
+- Cache file stored in PDF directory (requires write permissions)
+- No automatic cleanup of stale cache files
+
+**Code Locations:**
+- **Types:** `src/types/cache.ts` - Cache structure definitions
+- **Utilities:** `src/utils/diskCache.ts` - Load/save functions
+- **Integration:** `src/handlers/ocrPage.ts`, `src/handlers/ocrImage.ts` - Handler integration
+
 ### ⚠️ Mistral Integration Status
 **No Mistral-specific code exists** - uses generic HTTP provider.
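
The JSON structure documented in this hunk could be typed and queried along the following lines. The field names mirror the documented JSON, but the interfaces and the `getCachedPage` / `getCachedImage` helpers are hypothetical and are not taken from `src/types/cache.ts` or `src/utils/diskCache.ts`. The sketch assumes an entry is reusable only when both the file fingerprint and the entry's `provider_hash` match.

```typescript
// Hypothetical types mirroring the documented cache file layout.
interface OcrEntry {
  text: string;
  markdown?: string;
  provider_hash: string;
  cached_at: string;
}

interface PageOcrEntry extends OcrEntry {
  tables?: unknown[];
  hyperlinks?: unknown[];
  dimensions?: Record<string, unknown>;
  scale?: number;
}

interface OcrDiskCache {
  fingerprint: string;
  pdf_path: string;
  created_at: string;
  updated_at: string;
  ocr_provider: string;
  pages: Record<string, PageOcrEntry>; // keyed by page number, e.g. "2"
  images: Record<string, OcrEntry>; // keyed by "page/imageIndex", e.g. "2/0"
}

// Shared rule (an assumption): the PDF must be unchanged and the entry must
// have been produced with the same provider settings.
function usable<T extends OcrEntry>(
  cache: OcrDiskCache,
  entry: T | undefined,
  fingerprint: string,
  providerHash: string,
): T | undefined {
  if (cache.fingerprint !== fingerprint) return undefined;
  if (!entry || entry.provider_hash !== providerHash) return undefined;
  return entry;
}

function getCachedPage(cache: OcrDiskCache, page: number, fingerprint: string, providerHash: string) {
  return usable(cache, cache.pages[String(page)], fingerprint, providerHash);
}

function getCachedImage(cache: OcrDiskCache, page: number, imageIndex: number, fingerprint: string, providerHash: string) {
  return usable(cache, cache.images[`${page}/${imageIndex}`], fingerprint, providerHash);
}
```

Keeping `provider_hash` on each entry, rather than once per file, would let results from different OCR settings coexist in one cache file, which matches the commit's "different OCR settings = separate cache" description only loosely; the actual granularity is not visible in this diff.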
