|
| 1 | +# CLAUDE.md |
| 2 | + |
| 3 | +This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository. |
| 4 | + |
| 5 | +## Build and Test Commands |
| 6 | + |
| 7 | +```bash |
| 8 | +# Build the project |
| 9 | +bun run build |
| 10 | + |
| 11 | +# Run all tests |
| 12 | +bun run test |
| 13 | +# or via Vitest directly |
| 14 | +bunx vitest run |
| 15 | + |
| 16 | +# Run tests in watch mode |
| 17 | +bun run test:watch |
| 18 | + |
| 19 | +# Generate coverage report |
| 20 | +bun run test:cov |
| 21 | + |
| 22 | +# Run benchmarks |
| 23 | +bun run benchmark |
| 24 | + |
| 25 | +# Type checking |
| 26 | +bun run typecheck |
| 27 | +``` |
| 28 | + |
| 29 | +## Code Quality Commands |
| 30 | + |
| 31 | +```bash |
| 32 | +# Run linting and formatting checks |
| 33 | +bun run check |
| 34 | + |
| 35 | +# Auto-fix linting and formatting issues |
| 36 | +bun run check:fix |
| 37 | + |
| 38 | +# Lint only |
| 39 | +bun run lint |
| 40 | + |
| 41 | +# Format only |
| 42 | +bun run format |
| 43 | + |
| 44 | +# Complete validation (lint + format + tests) |
| 45 | +bun run validate |
| 46 | +``` |
| 47 | + |
| 48 | +## Running the MCP Server |
| 49 | + |
| 50 | +```bash |
| 51 | +# Run the built server |
| 52 | +bun run start |
| 53 | +# or |
| 54 | +node dist/index.js |
| 55 | + |
| 56 | +# Run with MCP Inspector for debugging |
| 57 | +bun run inspector |
| 58 | +``` |
| 59 | + |
| 60 | +## Architecture Overview |
| 61 | + |
| 62 | +### Core Design Philosophy |
| 63 | + |
| 64 | +This is an **MCP (Model Context Protocol) server** that provides PDF processing capabilities to AI agents. The architecture emphasizes: |
| 65 | + |
| 66 | +- **Specialized tool handlers** - Each MCP tool (pdf_get_metadata, pdf_read_pages, etc.) has a dedicated handler in `src/handlers/` |
| 67 | +- **Parallel processing** - Uses Promise.all for 5-10x speedup when processing multiple pages or PDFs |
| 68 | +- **Y-coordinate ordering** - Content is ordered by Y-position to preserve natural reading flow |
| 69 | +- **Per-page error isolation** - Individual page failures don't crash entire documents |
| 70 | +- **Fingerprint-based caching** - Text and OCR results are cached using document fingerprints |
| 71 | +- **Guardrails** - Large document operations require explicit opt-in (allow_full_document flag) to prevent accidental resource exhaustion |
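
To make the Y-coordinate ordering idea concrete, here is a minimal sketch. It does not reproduce the code in `src/pdf/text.ts`. The `TextItem` shape, the `LINE_TOLERANCE` value, and the `groupIntoLines` name are all hypothetical, and it assumes PDF-style coordinates in which a larger Y means higher on the page.

```typescript
// Hypothetical shape: a text fragment with its position on the page.
interface TextItem {
  text: string;
  x: number;
  y: number;
}

// Assumed tolerance (in PDF units) for treating two items as the same line.
const LINE_TOLERANCE = 2;

// Sort items top-to-bottom, then left-to-right, and merge them into lines.
// Assumes larger Y is higher on the page (PDF-style coordinates).
function groupIntoLines(items: TextItem[], tolerance = LINE_TOLERANCE): string[] {
  const sorted = [...items].sort((a, b) => b.y - a.y || a.x - b.x);
  const lines: { y: number; items: TextItem[] }[] = [];

  for (const item of sorted) {
    const current = lines[lines.length - 1];
    if (current && Math.abs(current.y - item.y) <= tolerance) {
      // Close enough vertically: treat as part of the current line.
      current.items.push(item);
    } else {
      lines.push({ y: item.y, items: [item] });
    }
  }

  // Within each line, restore left-to-right order before joining.
  return lines.map((line) =>
    line.items
      .sort((a, b) => a.x - b.x)
      .map((item) => item.text)
      .join(' ')
  );
}
```

The tolerance absorbs small baseline differences between fragments on the same visual line. A real implementation would likely derive it from font size rather than use a constant.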
| 72 | + |
| 73 | +### Layer Structure |
| 74 | + |
| 75 | +``` |
| 76 | +src/ |
| 77 | +├── index.ts # MCP server setup and tool registration |
| 78 | +├── handlers/ # MCP tool implementations (one per tool) |
| 79 | +│ ├── readPdf.ts # Legacy all-in-one tool (backward compatible) |
| 80 | +│ ├── getMetadata.ts # Metadata extraction |
| 81 | +│ ├── readPages.ts # Structured page reading |
| 82 | +│ ├── searchPdf.ts # Text search |
| 83 | +│ ├── renderPage.ts # Page rasterization |
| 84 | +│ ├── ocrPage.ts # OCR for rendered pages |
| 85 | +│ ├── cache.ts # Cache management |
| 86 | +│ └── ... |
| 87 | +├── pdf/ # PDF processing core |
| 88 | +│ ├── loader.ts # Document loading (files/URLs) |
| 89 | +│ ├── parser.ts # Page selection and parsing logic |
| 90 | +│ ├── extractor.ts # Content extraction (text + images) |
| 91 | +│ ├── text.ts # Text normalization and ordering |
| 92 | +│ └── render.ts # Page rendering to PNG |
| 93 | +├── schemas/ # @sylphx/vex validation schemas |
| 94 | +│ ├── pdfSource.ts # Shared source/pages schemas |
| 95 | +│ ├── readPages.ts # Per-tool input schemas |
| 96 | +│ └── ... |
| 97 | +├── utils/ # Shared utilities |
| 98 | +│ ├── cache.ts # Fingerprint-based caching |
| 99 | +│ ├── fingerprint.ts # Document identity hashing |
| 100 | +│ ├── pathUtils.ts # Path resolution (absolute/relative) |
| 101 | +│ ├── errors.ts # Custom error types |
| 102 | +│ └── logger.ts # Structured logging |
| 103 | +└── types/ # TypeScript type definitions |
| 104 | + └── pdf.ts # Domain types |
| 105 | +``` |
| 106 | + |
| 107 | +### Handler Pattern |
| 108 | + |
| 109 | +Each handler follows this structure: |
| 110 | +1. **Define schema** using @sylphx/vex in `schemas/` |
| 111 | +2. **Export tool** using `tool()` from @sylphx/mcp-server-sdk |
| 112 | +3. **Process sources** in parallel with Promise.all |
| 113 | +4. **Return results** as array of {source, success, data?, error?} |
| 114 | + |
| 115 | +Example: |
| 116 | +```typescript |
| 117 | +export const pdfReadPages = tool({ |
| 118 | + description: 'Extract structured text from PDF pages', |
| 119 | + inputSchema: readPagesArgsSchema, |
| 120 | + handler: async (args) => { |
| 121 | + // Parallel processing of all sources |
| 122 | + const results = await Promise.all( |
| 123 | + args.sources.map(async (source) => { |
| 124 | + // Per-source error handling |
| 125 | + // Return { source, success, data/error } |
| 126 | + }) |
| 127 | + ); |
| 128 | + return text(JSON.stringify({ results })); |
| 129 | + } |
| 130 | +}); |
| 131 | +``` |
| 132 | + |
| 133 | +### PDF Processing Flow |
| 134 | + |
| 135 | +1. **Load document** (loader.ts) - Handles both files and URLs, validates size (<100MB) |
| 136 | +2. **Parse page spec** (parser.ts) - Converts "1-5,10" or [1,2,3] to page numbers |
| 137 | +3. **Apply guardrails** (parser.ts) - Enforces sampling limits unless allow_full_document=true |
| 138 | +4. **Extract content** (extractor.ts) - Pulls text/images with Y-coordinates |
| 139 | +5. **Order content** (text.ts) - Sorts by Y-position and groups into lines |
| 140 | +6. **Cache results** (cache.ts) - Stores using fingerprint + page + options as key |
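
Step 2 of the flow above can be illustrated with a small parser. This is a sketch under assumptions, not the code in `src/pdf/parser.ts`. The `parsePageSpec` name, the decision to throw on out-of-range pages, and the de-duplication are all illustrative choices.

```typescript
// Accepts a spec such as "1-5,10" or [1, 2, 3] and returns sorted,
// de-duplicated 1-based page numbers. Throws on malformed input.
function parsePageSpec(spec: string | number[], pageCount: number): number[] {
  const pages = new Set<number>();

  const add = (page: number): void => {
    if (!Number.isInteger(page) || page < 1 || page > pageCount) {
      throw new RangeError(`Page ${page} is outside 1-${pageCount}`);
    }
    pages.add(page);
  };

  if (Array.isArray(spec)) {
    spec.forEach((page) => add(page));
  } else {
    for (const part of spec.split(',')) {
      // Matches "7" or "3-9", with optional surrounding whitespace.
      const match = /^\s*(\d+)\s*(?:-\s*(\d+)\s*)?$/.exec(part);
      if (!match) throw new SyntaxError(`Invalid page spec segment: "${part}"`);
      const start = Number(match[1]);
      const end = match[2] === undefined ? start : Number(match[2]);
      if (end < start) throw new RangeError(`Descending range: "${part.trim()}"`);
      for (let page = start; page <= end; page++) add(page);
    }
  }

  return Array.from(pages).sort((a, b) => a - b);
}
```

Under these assumptions, `parsePageSpec('1-3,10', 20)` yields pages 1, 2, 3, and 10, and an array input goes through the same range check.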
| 141 | + |
| 142 | +### Validation with @sylphx/vex |
| 143 | + |
| 144 | +The project uses **@sylphx/vex** (not Zod/Joi) for schema validation: |
| 145 | + |
| 146 | +```typescript |
| 147 | +import { bool, min, object, optional, str } from '@sylphx/vex'; |
| 148 | + |
| 149 | +const schema = object({ |
| 150 | + path: optional(str(min(1))), |
| 151 | + include_metadata: optional(bool), |
| 152 | +}); |
| 153 | +``` |
| 154 | + |
| 155 | +Vex schemas are used for both: |
| 156 | +- MCP tool input validation (via inputSchema) |
| 157 | +- Internal validation with safeParse() |
| 158 | + |
| 159 | +### Caching Strategy |
| 160 | + |
| 161 | +Two separate caches with fingerprint-based keys: |
| 162 | +- **Text cache**: `fingerprint#page#options` → PdfPageText |
| 163 | +- **OCR cache**: `fingerprint#page/image#provider` → OcrResult |
| 164 | + |
| 165 | +Fingerprints are SHA-256 hashes of the first 64KB of the PDF, which gives a fast uniqueness check without reading the whole file. |
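
As an illustration of the fingerprint and key layout, the sketch below uses Node's built-in `createHash`. It is not the code in `src/utils/fingerprint.ts` or `src/utils/cache.ts`. The `fingerprint` and `textCacheKey` names and the way options are serialized are assumptions.

```typescript
import { createHash } from 'node:crypto';

// Only the leading 64KB of the document is hashed, as described above.
const FINGERPRINT_BYTES = 64 * 1024;

function fingerprint(pdfBytes: Uint8Array): string {
  return createHash('sha256')
    .update(pdfBytes.subarray(0, FINGERPRINT_BYTES))
    .digest('hex');
}

// Hypothetical key builder following the `fingerprint#page#options` layout.
// Option keys are sorted so that equivalent options yield the same key.
function textCacheKey(
  fp: string,
  page: number,
  options: Record<string, boolean | number | string>
): string {
  const serialized = Object.keys(options)
    .sort()
    .map((key) => `${key}=${String(options[key])}`)
    .join('&');
  return `${fp}#${page}#${serialized}`;
}
```

One consequence of hashing only a prefix is that two files with identical first 64KB produce the same fingerprint, so the speed comes at the cost of weaker uniqueness for documents that differ only later in the file.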
| 166 | + |
| 167 | +### Testing with Vitest |
| 168 | + |
| 169 | +Tests use Vitest, not the Bun test runner, even though Bun is used for builds: |
| 170 | + |
| 171 | +```typescript |
| 172 | +import { describe, it, expect, vi, beforeAll } from 'vitest'; |
| 173 | + |
| 174 | +// Mock pdfjs-dist |
| 175 | +vi.mock('pdfjs-dist/legacy/build/pdf.mjs', () => ({ /* mocked pdfjs-dist exports */ })); |
| 176 | + |
| 177 | +describe('handler name', () => { |
| 178 | + it('should handle X', async () => { |
| 179 | + // Test using mocked PDF.js |
| 180 | + }); |
| 181 | +}); |
| 182 | +``` |
| 183 | + |
| 184 | +Run single test file: |
| 185 | +```bash |
| 186 | +bunx vitest run test/handlers/readPdf.test.ts |
| 187 | +``` |
| 188 | + |
| 189 | +### Guardrail System |
| 190 | + |
| 191 | +Large documents (more than `DEFAULT_SAMPLE_PAGE_LIMIT` pages) trigger sampling warnings unless: |
| 192 | +- `pages` parameter is explicitly provided, OR |
| 193 | +- `allow_full_document=true` is set |
| 194 | + |
| 195 | +This prevents accidental full reads of 1000+ page PDFs. The warning is added to the `warnings` array in results. |
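
The decision logic can be sketched roughly as follows. This is an assumption-laden illustration, not the code in `src/pdf/parser.ts`. The limit value of 5 is a placeholder, `applyGuardrail` and its types are hypothetical, and sampling the first pages of the document is assumed since the sampling strategy is not described here.

```typescript
// Placeholder value: the real DEFAULT_SAMPLE_PAGE_LIMIT lives in the codebase.
const DEFAULT_SAMPLE_PAGE_LIMIT = 5;

interface GuardrailInput {
  totalPages: number;
  pages?: number[];
  allow_full_document?: boolean;
}

interface GuardrailResult {
  pagesToRead: number[];
  warnings: string[];
}

function applyGuardrail(
  input: GuardrailInput,
  limit = DEFAULT_SAMPLE_PAGE_LIMIT
): GuardrailResult {
  // An explicit page selection always wins.
  if (input.pages !== undefined) {
    return { pagesToRead: input.pages, warnings: [] };
  }

  const allPages = Array.from({ length: input.totalPages }, (_, i) => i + 1);

  // Small documents and explicit opt-in are read in full.
  if (input.allow_full_document || input.totalPages <= limit) {
    return { pagesToRead: allPages, warnings: [] };
  }

  // Otherwise sample (assumed: the first `limit` pages) and warn.
  return {
    pagesToRead: allPages.slice(0, limit),
    warnings: [
      `Document has ${input.totalPages} pages; only the first ${limit} were read. ` +
        'Pass `pages` or set allow_full_document=true to read more.',
    ],
  };
}
```

Returning the warning alongside the sampled pages, rather than throwing, lets the caller see partial content and decide whether to opt in to a full read.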
| 196 | + |
| 197 | +## Important Notes |
| 198 | + |
| 199 | +### PDF.js Integration |
| 200 | +- Uses pdfjs-dist v5.x with legacy build for Node.js |
| 201 | +- CMap files resolved relative to pdfjs-dist package location |
| 202 | +- Canvas package required for page rendering |
| 203 | + |
| 204 | +### Path Handling |
| 205 | +- Supports both absolute and relative paths (v1.3.0+) |
| 206 | +- Windows paths work with both `\` and `/` |
| 207 | +- Relative paths resolved against process.cwd() |
| 208 | +- resolvePath() in pathUtils.ts handles all normalization |
| 209 | + |
| 210 | +### Biome Configuration |
| 211 | +- Extends @sylphx/biome-config |
| 212 | +- Cognitive complexity limited to 10 (relaxed for pdf/handlers) |
| 213 | +- No explicit `any` types allowed (error level) |
| 214 | +- Line width: 100 characters |
| 215 | +- Single quotes, semicolons, 2-space indent |
| 216 | + |
| 217 | +### Conventional Commits |
| 218 | +Required format for commits: |
| 219 | +``` |
| 220 | +feat(images): add WebP support |
| 221 | +fix(paths): handle UNC paths correctly |
| 222 | +docs(readme): update API examples |
| 223 | +test(loader): add URL loading tests |
| 224 | +``` |
| 225 | + |
| 226 | +Enforced via commitlint (if lefthook is installed). |
| 227 | + |
| 228 | +### Package Manager |
| 229 | +The project uses **Bun 1.3.x** as its package manager (see `packageManager` in package.json). |
| 230 | +Install dependencies with `bun install`. |
| 231 | + |
| 232 | +### Build System |
| 233 | +- `bunup` for TypeScript compilation (not tsc directly) |
| 234 | +- Outputs to `dist/` directory |
| 235 | +- ESM modules only (type: "module") |
| 236 | +- Node.js >=22.0.0 required |