
Add optional PDF image preservation and PyMuPDF pipeline #28

Open
mspinolaeie wants to merge 8 commits into imadreamerboy:main from mspinolaeie:feat/pdf-assets

Conversation

@mspinolaeie
Contributor

Closes #27

Summary

  • add optional PDF image preservation with extracted asset files, preview materialization, and save-time link rewriting
  • add a PDF pipeline toggle so the GUI keeps markitdown as default while exposing a pymupdf path for PDF-specific parsing and best-effort inline image placement
  • expand coverage with conversion tests, runtime UI tests, real generated PDF tests, and opt-in packaging smoke coverage

Details

  • preserve PDF images as linked files in *_assets/ or shared assets/ layouts
  • support preview-before-save for extracted PDF assets
  • keep the legacy markitdown PDF behavior when the new options are disabled
  • add best-effort inline image placement for the pymupdf pipeline using nearest preceding text blocks
  • split CI into fast tests, runtime/real-PDF tests, and opt-in packaging smoke; tighten release artifact retention
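
The "nearest preceding text block" placement described above can be illustrated with a minimal sketch. It is not the PR's implementation: the `Block` type, the `place_images` function, and the use of a single top coordinate per block are assumptions made for illustration, and no PyMuPDF calls are involved.

```python
from dataclasses import dataclass


@dataclass
class Block:
    """Hypothetical page element; not a PyMuPDF type."""
    y0: float      # top coordinate on the page (smaller means higher up)
    content: str   # paragraph text, or a Markdown image link


def place_images(text_blocks, image_blocks):
    """Return page content in reading order, with each image emitted right
    after the last text block that starts at or above the image's top."""
    texts = sorted(text_blocks, key=lambda b: b.y0)
    leading = []                     # images that sit above every text block
    attached = [[] for _ in texts]   # attached[i]: images anchored to texts[i]

    for image in sorted(image_blocks, key=lambda b: b.y0):
        anchor = -1
        for i, text in enumerate(texts):
            if text.y0 <= image.y0:
                anchor = i           # still at or above the image, keep going
            else:
                break                # first text block below the image
        target = leading if anchor < 0 else attached[anchor]
        target.append(image.content)

    ordered = list(leading)
    for text, images in zip(texts, attached):
        ordered.append(text.content)
        ordered.extend(images)
    return ordered
```

The placement is best-effort in the same sense as the bullet above: multi-column pages or images that float beside text may anchor to a block the reader would not consider the natural predecessor.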

Validation

  • python -m pytest -> 96 passed, 1 skipped
  • python -m pytest -q -m "not runtime_ui and not pdf_real and not packaging" -> 84 passed, 13 deselected
  • python -m pytest -q -m "runtime_ui or pdf_real" -> 12 passed, 85 deselected
  • the opt-in PyInstaller smoke test starts correctly but timed out locally after 180s during the Analysis phase, so the harness stays opt-in and time-bounded
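
A bounded harness of the kind mentioned in the last bullet could look roughly like the sketch below. The helper name `run_bounded` and its return shape are assumptions, not code from this PR; it relies only on the standard `subprocess.run` timeout behavior, which kills the child process and raises `subprocess.TimeoutExpired`.

```python
import subprocess
import sys


def run_bounded(cmd, timeout_s):
    """Run cmd with a hard time limit.

    Returns (finished, returncode); finished is False and returncode is
    None when the command exceeded timeout_s and was killed.
    """
    try:
        result = subprocess.run(
            cmd,
            capture_output=True,   # keep build logs out of the test output
            text=True,
            timeout=timeout_s,
        )
    except subprocess.TimeoutExpired:
        return False, None
    return True, result.returncode


if __name__ == "__main__":
    # Stand-in command; a real smoke test would invoke the packaging step.
    print(run_bounded([sys.executable, "-c", "print('ok')"], timeout_s=30))
```

A test built on such a helper can skip rather than fail when the limit is hit, which keeps a slow build from blocking the fast suites.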

Notes

  • local untracked .claude/ is not part of this PR


@chatgpt-codex-connector bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: c534bbf9c5


Comment on lines +238 to +240
collision_candidate = base.with_name(f"{stem}_{hashed}{suffix}").as_posix()
claimed_paths.add(collision_candidate)
return collision_candidate

P1: Generate a unique name after secondary asset collisions

The fallback branch for asset path collisions builds a single deterministic collision_candidate from source_file and returns it without checking whether that fallback is already claimed. In single-layout exports (combined=True or shared assets/ mode), converting 3+ files with the same basename (e.g., multiple doc.pdf from different directories) can produce the same fallback name, so later copies overwrite earlier assets and earlier markdown links end up pointing to the wrong image.
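
One way to address this is to keep probing until an unclaimed name is found. The sketch below assumes the surrounding function's shape: `claim_unique_asset_path`, the SHA-1 digest of `source_file`, and the numeric `_2`, `_3` suffixes are illustrative choices, not the PR's actual code.

```python
import hashlib
from pathlib import PurePosixPath


def claim_unique_asset_path(base, source_file, claimed_paths):
    """Return an asset path not yet in claimed_paths and record it there."""
    preferred = base.as_posix()
    if preferred not in claimed_paths:
        claimed_paths.add(preferred)
        return preferred

    stem, suffix = base.stem, base.suffix
    # Assumption: a short digest of the source path disambiguates most cases.
    hashed = hashlib.sha1(source_file.encode("utf-8")).hexdigest()[:8]
    candidate = base.with_name(f"{stem}_{hashed}{suffix}").as_posix()

    # The hashed fallback may itself be taken, so add a counter until free.
    counter = 2
    while candidate in claimed_paths:
        candidate = base.with_name(f"{stem}_{hashed}_{counter}{suffix}").as_posix()
        counter += 1

    claimed_paths.add(candidate)
    return candidate
```

Because every returned path is checked against `claimed_paths` before it is recorded, a third or later file with the same basename gets a fresh name instead of overwriting an earlier asset.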

