@evansun06
Contributor

📝 Description

Implemented the PDF text extraction script for the pre-chunking stage, along with the associated tests.

Primary features:

  • newline normalization
  • hyphen repair
  • page-page repair
  • quotation and character normalization
  • SHA-256 unique file identifier
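
As a rough illustration of the features listed above, here is a minimal sketch of newline normalization, hyphen repair, character normalization, and a SHA-256 file digest. The function names, the character map, and the regular expressions are assumptions made for this example and are not taken from the PR's code. Page-page repair is left out because the description does not say what it covers.

```python
import hashlib
import re

# Hypothetical mapping of typographic characters to plain ASCII equivalents.
_CHAR_MAP = str.maketrans({
    "\u2018": "'", "\u2019": "'",   # curly single quotes
    "\u201c": '"', "\u201d": '"',   # curly double quotes
    "\u00a0": " ",                  # non-breaking space
})


def normalize_newlines(text: str) -> str:
    """Convert CRLF/CR to LF and collapse runs of 3+ newlines into one blank line."""
    text = text.replace("\r\n", "\n").replace("\r", "\n")
    return re.sub(r"\n{3,}", "\n\n", text)


def repair_hyphens(text: str) -> str:
    """Rejoin a word that was split by a hyphen at the end of a line."""
    # "extrac-" + newline + "tion" becomes "extraction".
    return re.sub(r"(\w)-\n(\w)", r"\1\2", text)


def normalize_chars(text: str) -> str:
    """Replace curly quotes and non-breaking spaces with ASCII characters."""
    return text.translate(_CHAR_MAP)


def file_sha256(data: bytes) -> str:
    """Return the hex SHA-256 digest of the raw file bytes."""
    return hashlib.sha256(data).hexdigest()


def clean(text: str) -> str:
    """Apply the three text steps in order (the order is an assumption)."""
    return normalize_chars(repair_hyphens(normalize_newlines(text)))
```

Note that a purely regex-based hyphen repair like this one also joins genuinely hyphenated compounds that happen to break at a line end, so a real implementation may need a dictionary check or similar heuristic.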

Added dependencies:

  • pymupdf

Note: OCR in PyMuPDF only works when the Tesseract binary is installed locally. A future implementation may require a Dockerfile to containerize this dependency. For now, if OCR is triggered, the script skips the page.
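
The skip behavior described in the note could look roughly like the sketch below. It does not call PyMuPDF; it takes already-extracted page text as input. It assumes that a page with an empty text layer is what triggers OCR, that page numbers are 1-based, and that a `tesseract` executable on `PATH` is a reasonable availability check. All function names are hypothetical.

```python
import shutil


def tesseract_available() -> bool:
    """Return True if a `tesseract` executable is found on PATH (assumed check)."""
    return shutil.which("tesseract") is not None


def collect_pages(page_texts):
    """Split pages into extracted records and skipped page numbers.

    page_texts: list of native text per page; an empty or whitespace-only
    string is treated as "would need OCR" (an assumption for this sketch).
    """
    pages, skipped_pages = [], []
    for page_num, text in enumerate(page_texts, start=1):
        if text.strip():
            pages.append({
                "page_num": page_num,
                "word_count": len(text.split()),
                "used_ocr": False,
                "text": text,
            })
        else:
            # OCR would be needed here; for now the page is skipped.
            skipped_pages.append(page_num)
    return pages, skipped_pages
```

A later version could consult `tesseract_available()` in the `else` branch and run OCR instead of skipping when it returns `True`.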

Try Local Extraction

### local test extraction
python -m app.pdfx.pdfx extract app/path/to/pdf --out app/path/to/output_directory --page-range 3-4
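
For readers unfamiliar with the command shape, the following is a minimal `argparse` sketch that accepts the same arguments as the command above. It is not the PR's actual parser: the function names are hypothetical, and treating `--page-range 3-4` as an inclusive, 1-based range is an assumption.

```python
import argparse


def parse_page_range(spec: str) -> tuple[int, int]:
    """Parse "3-4" into (3, 4); a single number "5" becomes (5, 5)."""
    start_s, sep, end_s = spec.partition("-")
    start = int(start_s)
    end = int(end_s) if sep else start
    if start < 1 or end < start:
        raise ValueError(f"invalid page range: {spec!r}")
    return start, end


def build_parser() -> argparse.ArgumentParser:
    """Build a parser with a single `extract` subcommand."""
    parser = argparse.ArgumentParser(prog="pdfx")
    sub = parser.add_subparsers(dest="command", required=True)
    extract = sub.add_parser("extract", help="extract text from a PDF")
    extract.add_argument("pdf", help="path to the input PDF")
    extract.add_argument("--out", required=True, help="output directory")
    extract.add_argument("--page-range", type=parse_page_range, default=None,
                         help="inclusive page range, e.g. 3-4")
    return parser
```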

Example Payload Structure

{
  "doc_uuid": "b6ccb18c-8581-54f2-ba30-a7d509a6483b",
  "page_count": ...,
  "processed_page_range": [...],
  "processed_pages": [...],
  "total_word_count": ...,
  "created_at": "2025-11-26T06:45:05.341631+00:00",
  "tool_version": "0.1.0",
  "skipped": true/false,
  "skipped_pages": [...],
  "text": "...",
  "pages": [
    {
      "page_num": ...,
      "word_count": ...,
      "used_ocr": ...,
      "text": ""
    }
  ]
}
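
To make the relationship between the per-page records and the document-level fields concrete, here is a hedged sketch of how such a payload might be assembled. Several points are assumptions rather than facts from the PR: that `total_word_count` is the sum of the page word counts, that the top-level `text` is the page texts joined by blank lines, and that `doc_uuid` is a name-based (version 5) UUID derived from the file's SHA-256 digest. The example `doc_uuid` above does carry version digit 5, but the namespace and derivation used here are guesses. The `skipped` and `processed_page_range` fields are omitted because their exact semantics are not stated.

```python
import hashlib
import uuid
from datetime import datetime, timezone


def build_payload(pdf_bytes, page_count, pages, skipped_pages, tool_version="0.1.0"):
    """Assemble a document-level payload from per-page records (hypothetical helper)."""
    digest = hashlib.sha256(pdf_bytes).hexdigest()
    return {
        # Assumption: a deterministic UUID derived from the file digest.
        "doc_uuid": str(uuid.uuid5(uuid.NAMESPACE_URL, digest)),
        "page_count": page_count,
        "processed_pages": [p["page_num"] for p in pages],
        # Assumption: the total is the sum of per-page counts.
        "total_word_count": sum(p["word_count"] for p in pages),
        # Timezone-aware UTC timestamp in ISO 8601 form.
        "created_at": datetime.now(timezone.utc).isoformat(),
        "tool_version": tool_version,
        "skipped_pages": skipped_pages,
        "text": "\n\n".join(p["text"] for p in pages),
        "pages": pages,
    }
```

Deriving the identifier from the file contents means the same PDF always maps to the same `doc_uuid`, which would make re-processing idempotent.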

🎯 Type of Change

  • 🐛 Bug fix (non-breaking change which fixes an issue)
  • ✨ New feature (non-breaking change which adds functionality)
  • 💥 Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • 📚 Documentation update
  • 🔨 Refactoring (no functional changes)
  • 🧪 Tests (adding or updating tests)
  • 🔧 Chore (dependency updates, config changes, etc.)

🧪 Testing

  • I have tested this change locally
  • I have added/updated tests for this change
  • All existing tests pass

📋 Checklist

  • My code follows the code style of this project
  • I have performed a self-review of my own code
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • My changes generate no new warnings

🔗 Related Issues

Closes #(issue number)
