Closed
Description
Context
packages/extract/src/index.ts is a 386-line monolith containing:
- Zod schemas
- MIME type detection
- PDF to image conversion
- OCR processing
- Gemini provider
- Ollama provider
- Streaming logic
This makes unit testing difficult: OCR, PDF conversion, and the providers cannot be tested in isolation.
Architecture Decisions
- Package: packages/extract
- Pattern: Module decomposition with barrel export
- Goal: Each module testable independently
Proposed Structure
packages/extract/src/
├── index.ts # Barrel export only
├── schemas.ts # Zod schemas
├── mime.ts # getMimeType()
├── pdf.ts # pdfToImages()
├── ocr.ts # ocrImages() + types
├── extract.ts # extractDocument() orchestrator
├── providers/
│ ├── gemini.ts # extractWithGemini()
│ └── ollama.ts # extractWithOllama()
└── types.ts # StreamChunk, StreamCallback, ExtractOptions
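To illustrate what one extracted module might look like, here is a minimal sketch of mime.ts. The issue only names getMimeType(); the signature (a file path in, a MIME string out), the extension table, and the fallback value below are assumptions made for illustration, not the existing implementation.

```typescript
// mime.ts (hypothetical sketch; the real getMimeType() signature may differ)

// Assumed mapping from lowercase file extension to MIME type.
const MIME_BY_EXTENSION: Record<string, string> = {
  pdf: "application/pdf",
  png: "image/png",
  jpg: "image/jpeg",
  jpeg: "image/jpeg",
  webp: "image/webp",
};

// Returns the MIME type for a file path, or a generic fallback when the
// extension is missing or unknown.
export function getMimeType(filePath: string): string {
  const dot = filePath.lastIndexOf(".");
  const extension = dot === -1 ? "" : filePath.slice(dot + 1).toLowerCase();
  return MIME_BY_EXTENSION[extension] ?? "application/octet-stream";
}
```

With modules shaped like this, index.ts can shrink to re-export lines such as `export * from "./mime";`, so imports from the package root should keep resolving as before.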
Requirements
- Extract Zod schemas to schemas.ts
- Extract MIME detection to mime.ts
- Extract PDF conversion to pdf.ts
- Extract OCR logic to ocr.ts
- Extract Gemini provider to providers/gemini.ts
- Extract Ollama provider to providers/ollama.ts
- Create shared types in types.ts
- Create barrel export in index.ts
- Update tests to use new module structure
- Maintain 100% backward compatibility
Success Criteria
- All existing tests pass
- Each module can be imported/tested independently
- No breaking changes to public API
- Coverage maintained or improved
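One possible way to meet the "tested independently" criterion for the orchestrator is to let extractDocument() accept its providers as an optional argument, so a test can pass stubs instead of calling Gemini or Ollama. This is only a sketch: the issue names StreamChunk, StreamCallback, and ExtractOptions but does not define them, so every type shape, the Provider signature, and the ProviderRegistry helper below are hypothetical.

```typescript
// Hypothetical shapes: the issue names these types but does not define them.
type StreamChunk = { text: string };
type StreamCallback = (chunk: StreamChunk) => void;
type ExtractOptions = { provider: "gemini" | "ollama"; onChunk?: StreamCallback };

// Assumed signature shared by extractWithGemini() and extractWithOllama().
type Provider = (input: Uint8Array, onChunk?: StreamCallback) => Promise<string>;
type ProviderRegistry = Record<ExtractOptions["provider"], Provider>;

// Placeholders for the real exports of providers/gemini.ts and providers/ollama.ts.
const defaultProviders: ProviderRegistry = {
  gemini: async () => {
    throw new Error("extractWithGemini is not part of this sketch");
  },
  ollama: async () => {
    throw new Error("extractWithOllama is not part of this sketch");
  },
};

// Orchestrator sketch: the third parameter is optional, so existing
// two-argument call sites would keep working unchanged.
export async function extractDocument(
  input: Uint8Array,
  options: ExtractOptions,
  providers: ProviderRegistry = defaultProviders,
): Promise<string> {
  const provider = providers[options.provider];
  return provider(input, options.onChunk);
}

// Stub providers stand in for the network-backed implementations in a test.
const stubProviders: ProviderRegistry = {
  gemini: async (_input, onChunk) => {
    onChunk?.({ text: "gemini-chunk" });
    return "gemini-result";
  },
  ollama: async () => "ollama-result",
};
```

A test can then call extractDocument() with stubProviders and assert on both the returned string and the chunks passed to onChunk, without any network access.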
References
- Related: Epic: Core Ingestion Pipeline + Local Storage #1 (completed)