refactor(extract): Modularize extract package for testability #12

@prosdev

Context

packages/extract/src/index.ts is a 386-line monolith containing:

  • Zod schemas
  • MIME type detection
  • PDF to image conversion
  • OCR processing
  • Gemini provider
  • Ollama provider
  • Streaming logic

This makes unit testing difficult: we cannot test OCR, PDF conversion, or the providers in isolation.

Architecture Decisions

  • Package: packages/extract
  • Pattern: Module decomposition with barrel export
  • Goal: Each module testable independently

Proposed Structure

packages/extract/src/
├── index.ts              # Barrel export only
├── schemas.ts            # Zod schemas
├── mime.ts               # getMimeType()
├── pdf.ts                # pdfToImages()
├── ocr.ts                # ocrImages() + types
├── extract.ts            # extractDocument() orchestrator
├── providers/
│   ├── gemini.ts         # extractWithGemini()
│   └── ollama.ts         # extractWithOllama()
└── types.ts              # StreamChunk, StreamCallback, ExtractOptions
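
One way the modules above could compose inside extract.ts is sketched below. The issue names the functions and types but not their signatures, so every parameter, return type, and field here is hypothetical. The `deps` injection parameter, the `provider` option, and the MIME → PDF → OCR → provider ordering are also assumptions. Stubs stand in for the real modules so the snippet runs on its own.

```typescript
// Hypothetical shapes: the issue names these types but does not define them.
type StreamChunk = { text: string };
type StreamCallback = (chunk: StreamChunk) => void;
type ExtractOptions = { provider: "gemini" | "ollama"; onChunk?: StreamCallback };
type Provider = (text: string, onChunk?: StreamCallback) => Promise<string>;

// One function type per proposed module, so each can be stubbed in a test.
interface ExtractDeps {
  getMimeType: (filename: string) => string;
  pdfToImages: (pdf: Uint8Array) => Promise<Uint8Array[]>;
  ocrImages: (images: Uint8Array[]) => Promise<string>;
  providers: Record<ExtractOptions["provider"], Provider>;
}

// Orchestrator sketch (assumed flow): convert PDFs to images, OCR them,
// then hand the recognized text to the selected provider.
async function extractDocument(
  file: { name: string; data: Uint8Array },
  options: ExtractOptions,
  deps: ExtractDeps,
): Promise<string> {
  const mime = deps.getMimeType(file.name);
  const images =
    mime === "application/pdf" ? await deps.pdfToImages(file.data) : [file.data];
  const text = await deps.ocrImages(images);
  return deps.providers[options.provider](text, options.onChunk);
}

// Stub dependencies: no PDF library, OCR engine, or network access required.
const calls: string[] = [];
const stubDeps: ExtractDeps = {
  getMimeType: (name) => (name.endsWith(".pdf") ? "application/pdf" : "image/png"),
  pdfToImages: async (pdf) => {
    calls.push("pdf");
    return [pdf];
  },
  ocrImages: async (images) => {
    calls.push("ocr");
    return `pages:${images.length}`;
  },
  providers: {
    gemini: async (text, onChunk) => {
      onChunk?.({ text });
      return `gemini(${text})`;
    },
    ollama: async (text, onChunk) => {
      onChunk?.({ text });
      return `ollama(${text})`;
    },
  },
};
```

Passing the dependencies in explicitly is only one option. Module-level imports combined with a test runner's mocking facility would fit the same file layout.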

Requirements

  • Extract Zod schemas to schemas.ts
  • Extract MIME detection to mime.ts
  • Extract PDF conversion to pdf.ts
  • Extract OCR logic to ocr.ts
  • Extract Gemini provider to providers/gemini.ts
  • Extract Ollama provider to providers/ollama.ts
  • Create shared types in types.ts
  • Create barrel export in index.ts
  • Update tests to use new module structure
  • Maintain 100% backward compatibility

Success Criteria

  • All existing tests pass
  • Each module can be imported/tested independently
  • No breaking changes to public API
  • Coverage maintained or improved
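
To illustrate the "imported/tested independently" criterion: once MIME detection lives in its own module, it can be checked with plain inputs and no PDF, OCR, or provider setup. This is a minimal sketch. The extension-based lookup, the `getMimeType` signature, the supported types, and the fallback value are all assumptions, since the issue only names the function.

```typescript
// mime.ts (hypothetical implementation; in the real module this would be exported).
// A Map avoids accidental hits on Object.prototype keys such as "constructor".
const MIME_BY_EXTENSION = new Map<string, string>([
  ["pdf", "application/pdf"],
  ["png", "image/png"],
  ["jpg", "image/jpeg"],
  ["jpeg", "image/jpeg"],
]);

function getMimeType(filename: string): string {
  const dot = filename.lastIndexOf(".");
  const ext = dot === -1 ? "" : filename.slice(dot + 1).toLowerCase();
  // Assumed fallback for unknown or missing extensions.
  return MIME_BY_EXTENSION.get(ext) ?? "application/octet-stream";
}

// Isolated checks: pure inputs and outputs, nothing else from the package is loaded.
const mimeCases: Array<[string, string]> = [
  ["invoice.pdf", "application/pdf"],
  ["scan.JPG", "image/jpeg"],
  ["archive.tar.gz", "application/octet-stream"],
  ["README", "application/octet-stream"],
];

for (const [input, expected] of mimeCases) {
  const actual = getMimeType(input);
  if (actual !== expected) {
    throw new Error(`${input}: expected ${expected}, got ${actual}`);
  }
}
```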
