Add optional PDF image preservation and PyMuPDF pipeline #28
mspinolaeie wants to merge 8 commits into imadreamerboy:main
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: c534bbf9c5
    collision_candidate = base.with_name(f"{stem}_{hashed}{suffix}").as_posix()
    claimed_paths.add(collision_candidate)
    return collision_candidate
Generate a unique name after secondary asset collisions
The fallback branch for asset path collisions builds a single deterministic `collision_candidate` from `source_file` and returns it without checking whether that fallback is already claimed. In single-layout exports (`combined=True` or shared `assets/` mode), converting 3+ files with the same basename (e.g., multiple `doc.pdf` files from different directories) can produce the same fallback name, so later copies overwrite earlier assets and earlier Markdown links end up pointing to the wrong image.
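One possible shape for a fix is to keep retrying until an unclaimed name is found. The sketch below is not the PR's code: the function name `claim_asset_path`, its signature, the SHA-1 digest, and the attempt-counter salt are all assumptions made for illustration. Only `claimed_paths`, `source_file`, `collision_candidate`, and the `with_name(...)` pattern come from the reviewed snippet.

```python
from __future__ import annotations

import hashlib
from pathlib import PurePosixPath


def claim_asset_path(candidate: str, source_file: str, claimed_paths: set[str]) -> str:
    """Return a path that is not in claimed_paths yet, and record it as claimed.

    Hypothetical helper: the real function in the PR may look different.
    """
    if candidate not in claimed_paths:
        claimed_paths.add(candidate)
        return candidate

    base = PurePosixPath(candidate)
    stem, suffix = base.stem, base.suffix
    attempt = 0
    while True:
        # Salt the hash with an attempt counter (an assumption of this sketch)
        # so every retry produces a different fallback name.
        seed = f"{source_file}:{attempt}".encode("utf-8")
        hashed = hashlib.sha1(seed).hexdigest()[:8]
        collision_candidate = base.with_name(f"{stem}_{hashed}{suffix}").as_posix()
        # Check the fallback as well before claiming it.
        if collision_candidate not in claimed_paths:
            claimed_paths.add(collision_candidate)
            return collision_candidate
        attempt += 1
```

With this approach, three calls with the same `candidate` and the same `source_file` return three distinct paths, and the names stay deterministic for a given conversion order.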
Closes #27
Summary
- `markitdown` as default while exposing a `pymupdf` path for PDF-specific parsing and best-effort inline image placement

Details
- `*_assets/` or shared `assets/` layouts
- `markitdown` PDF behavior when the new options are disabled
- `pymupdf` pipeline using nearest preceding text blocks

Validation
- `python -m pytest` -> 96 passed, 1 skipped
- `python -m pytest -q -m "not runtime_ui and not pdf_real and not packaging"` -> 84 passed, 13 deselected
- `python -m pytest -q -m "runtime_ui or pdf_real"` -> 12 passed, 85 deselected
- `PyInstaller` smoke starts correctly but timed out locally at 180s during Analysis, so the harness is kept opt-in and bounded

Notes
- `.claude/` is not part of this PR
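
For readers unfamiliar with the "nearest preceding text blocks" placement mentioned under Details, here is a minimal sketch of one way such a heuristic could work. It is not taken from the PR. The `Block` and `Image` types, both function names, and the assumption that `y` coordinates grow downward from the top of the page are hypothetical, and no PyMuPDF calls are used, so the example runs on plain tuples.

```python
from __future__ import annotations

from typing import NamedTuple, Optional


class Block(NamedTuple):
    y0: float  # top edge of the text block
    y1: float  # bottom edge of the text block
    text: str


class Image(NamedTuple):
    y0: float  # top edge of the image
    path: str  # relative path of the exported asset


def nearest_preceding_block(image: Image, blocks: list[Block]) -> Optional[int]:
    """Index of the block that ends closest above the image, or None."""
    best_index = None
    best_bottom = float("-inf")
    for index, block in enumerate(blocks):
        # A block "precedes" the image if it ends at or above the image's top.
        if block.y1 <= image.y0 and block.y1 > best_bottom:
            best_index, best_bottom = index, block.y1
    return best_index


def render_page(blocks: list[Block], images: list[Image]) -> str:
    """Emit Markdown with each image link right after its anchor block."""
    before: list[str] = []  # images with no preceding text go first
    after: dict[int, list[str]] = {i: [] for i in range(len(blocks))}
    for image in sorted(images, key=lambda im: im.y0):
        link = f"![]({image.path})"
        index = nearest_preceding_block(image, blocks)
        if index is None:
            before.append(link)
        else:
            after[index].append(link)
    parts = list(before)
    for i, block in enumerate(blocks):
        parts.append(block.text)
        parts.extend(after[i])
    return "\n\n".join(parts)
```

The placement is best-effort by design: it ignores columns and horizontal position, so multi-column pages would need a richer comparison than the vertical one sketched here.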