Feature/pdf extraction #58

evansun06 · 2025-11-26T07:42:40Z

📝 Description

Implemented the PDF text extraction script for the pre-chunking stage, along with its associated tests.

Primary features:

newline normalization
hyphen repair
page-page repair
quotation and character normalization
sha256 unique file identifier
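
A minimal sketch of how the newline, hyphen, and quotation normalization steps listed above might look. This is not the code from this PR: the function name `normalize_text`, the character map, and the regular expressions are illustrative assumptions.

```python
import re

# Illustrative subset of typographic characters mapped to ASCII equivalents.
_CHAR_MAP = str.maketrans({
    "\u2018": "'", "\u2019": "'",   # curly single quotes
    "\u201c": '"', "\u201d": '"',   # curly double quotes
    "\u00a0": " ",                  # non-breaking space
})


def normalize_text(raw: str) -> str:
    """Hypothetical helper: apply the normalization steps in sequence."""
    # Newline normalization: unify CRLF and bare CR line endings to "\n".
    text = raw.replace("\r\n", "\n").replace("\r", "\n")
    # Hyphen repair: rejoin a word split across a line break ("extrac-\ntion").
    text = re.sub(r"(\w)-\n(\w)", r"\1\2", text)
    # Collapse runs of three or more newlines into one paragraph break.
    text = re.sub(r"\n{3,}", "\n\n", text)
    # Quotation and character normalization.
    return text.translate(_CHAR_MAP)
```

One trade-off of this simple hyphen rule is that it also joins genuine compounds that happen to break at the hyphen (for example "well-" / "known"), so a real implementation may need a dictionary check or similar heuristic.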

Added dependencies:

pymupdf

Note: OCR in PyMuPDF only works when the Tesseract binary is installed locally. A future implementation may require a Dockerfile to containerize this dependency. For now, if OCR is triggered, the script simply skips the page.
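
The skip behaviour described in the note can be sketched roughly as follows. The trigger condition (a page with an empty text layer) and the name `split_pages` are assumptions made for illustration; the actual script may decide when OCR is needed differently.

```python
def split_pages(page_texts: dict[int, str]) -> tuple[dict[int, str], list[int]]:
    """Hypothetical helper: separate usable pages from pages that would need OCR.

    `page_texts` maps a page number to the text extracted from its text layer.
    """
    processed: dict[int, str] = {}
    skipped: list[int] = []
    for page_num, text in sorted(page_texts.items()):
        # Assumption: a page with no extractable text is what triggers OCR.
        if not text.strip():
            # OCR is not run for now, so the page is recorded as skipped.
            skipped.append(page_num)
            continue
        processed[page_num] = text
    return processed, skipped
```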

Try Local Extraction

### local test extraction
python -m app.pdfx.pdfx extract app/path/to/pdf --out app/path/to/output_directory --page-range 3-4
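
As a rough illustration of how a `--page-range` value such as `3-4` could be interpreted, here is a small parser. It assumes 1-based, inclusive page numbers and a hypothetical function name `parse_page_range`; the PR does not state the convention the script actually uses.

```python
def parse_page_range(spec: str, page_count: int) -> list[int]:
    """Hypothetical helper: parse "A-B" or "N" into a list of page numbers.

    Assumes pages are 1-based and the range is inclusive at both ends.
    """
    start_s, sep, end_s = spec.partition("-")
    start = int(start_s)
    # A bare number such as "5" selects a single page.
    end = int(end_s) if sep else start
    if not (1 <= start <= end <= page_count):
        raise ValueError(f"invalid page range {spec!r} for {page_count} pages")
    return list(range(start, end + 1))
```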

Example Payload Structure

{
  "doc_uuid": "b6ccb18c-8581-54f2-ba30-a7d509a6483b",
  "page_count": ...,
  "processed_page_range": [...],
  "processed_pages": [...],
  "total_word_count": ...,
  "created_at": "2025-11-26T06:45:05.341631+00:00",
  "tool_version": "0.1.0",
  "skipped": true/false,
  "skipped_pages": [...],
  "text": "...",
  "pages": [
    {
      "page_num": ...,
      "word_count": ...,
      "used_ocr": ...,
      "text": ""
    }
  ]
}
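
For illustration, a payload of this shape could be assembled as sketched below. The field semantics are assumptions inferred from the field names (for example, `skipped` is taken to mean "at least one page was skipped", and `processed_page_range` to mean first and last processed page). The example `doc_uuid` has version nibble 5, which is consistent with a name-based UUID, but the namespace and the derivation from the SHA-256 digest shown here are guesses, not the PR's actual method.

```python
import hashlib
import uuid
from datetime import datetime, timezone


def doc_uuid_for(data: bytes) -> str:
    """Hypothetical: derive a stable identifier from the file's SHA-256 digest."""
    digest = hashlib.sha256(data).hexdigest()
    # Assumption: a version-5 UUID in the URL namespace, named by the digest.
    return str(uuid.uuid5(uuid.NAMESPACE_URL, digest))


def build_payload(doc_uuid, page_count, pages, skipped_pages, tool_version="0.1.0"):
    """Hypothetical: assemble the top-level payload from per-page records."""
    processed = [p["page_num"] for p in pages]
    return {
        "doc_uuid": doc_uuid,
        "page_count": page_count,
        "processed_page_range": [min(processed), max(processed)] if processed else [],
        "processed_pages": processed,
        "total_word_count": sum(p["word_count"] for p in pages),
        "created_at": datetime.now(timezone.utc).isoformat(),
        "tool_version": tool_version,
        "skipped": bool(skipped_pages),
        "skipped_pages": skipped_pages,
        # Assumption: the document text is the page texts joined by blank lines.
        "text": "\n\n".join(p["text"] for p in pages),
        "pages": pages,
    }
```

Deriving the identifier from file contents rather than the file name means the same PDF yields the same `doc_uuid` wherever it is stored.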

🎯 Type of Change

🐛 Bug fix (non-breaking change which fixes an issue)
✨ New feature (non-breaking change which adds functionality)
💥 Breaking change (fix or feature that would cause existing functionality to not work as expected)
📚 Documentation update
🔨 Refactoring (no functional changes)
🧪 Tests (adding or updating tests)
🔧 Chore (dependency updates, config changes, etc.)

🧪 Testing

I have tested this change locally
I have added/updated tests for this change
All existing tests pass

📋 Checklist

My code follows the code style of this project
I have performed a self-review of my own code
I have commented my code, particularly in hard-to-understand areas
I have made corresponding changes to the documentation
My changes generate no new warnings

📸 Screenshots (if applicable)

Add screenshots or GIFs to help explain your changes.

🔗 Related Issues

Closes #(issue number)

evansun06 added 14 commits November 3, 2025 21:41

feat(pdfx): setup and recorded dependencies and python module

7e6c1c0

feat(pdfx): created basic cli setup including health_check and extraction signature

ade88f4

feat(pdfx): added unique file hashing

ba1f768

feat(pdfx): added test pdfs (1 simple, 1 complex)

5e742af

feat(pdfx): Added write logic, and cli output path handling

1726621

added page variability to cli options, additionally most basic text parser

b7c072b

feat(pdfx): able to extract text, need to implement structure heuristics accurately

d4dd103

feat(pdfx): added block metrics and added hyphenation fixing + list identification

3e45845

Merge branch 'main' of https://github.com/ubclaunchpad/Piazza-AI-Plugin into feature/pdf-extraction

6225504

refactor(pdfx): refactored pdf extraction script to use PyMuPDF and OCR

ed63238

feat(pdfx): included response codes + CLI integrtion to script

6b23a6c

feat(pdfx): implemented correct ranged extraction

2c8caa5

test(pdfx): added tests for json structure, return path checking, and ranged extraction

e61a875

chore(pdfx): fixed linting + styling

4069ada

evansun06 requested review from TonyLiu0226 and hamin2006 as code owners November 26, 2025 07:42

chore(pdfx): applied ruff formatting

4ec9719

TonyLiu0226 approved these changes Nov 29, 2025
