Skip to content

Conversation

luis-cd
Copy link

@luis-cd luis-cd commented Aug 13, 2025

This fixes an issue where removing pages after writing left orphan objects in the PDF writer.

The fix forces a rebuild of the internal structure to clean orphaned objects, improving stability and avoiding potential PDF corruption.

Closes #3418

Copy link

codecov bot commented Aug 13, 2025

Codecov Report

❌ Patch coverage is 90.90909% with 2 lines in your changes missing coverage. Please review.
✅ Project coverage is 96.96%. Comparing base (71a1bd1) to head (25a9684).
⚠️ Report is 1 commits behind head on main.

Files with missing lines Patch % Lines
pypdf/_writer.py 90.90% 2 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #3432      +/-   ##
==========================================
- Coverage   96.97%   96.96%   -0.02%     
==========================================
  Files          54       54              
  Lines        9337     9358      +21     
  Branches     1711     1712       +1     
==========================================
+ Hits         9055     9074      +19     
- Misses        168      170       +2     
  Partials      114      114              

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

@luis-cd luis-cd changed the title Fix #3418: Properly clean up orphan objects when removing pages in PdfWriter fix(writer): properly clean up orphan objects when removing pages (fixes #3418) Aug 13, 2025
@luis-cd luis-cd changed the title fix(writer): properly clean up orphan objects when removing pages (fixes #3418) BUG: properly clean up orphan objects when removing pages Aug 13, 2025
This change ensures that failures during the internal rebuild process after removing a page
with clean=True do not crash the operation but instead log a warning.

Closes py-pdf#3418
@stefan6419846
Copy link
Collaborator

Thanks for the PR. Besides the exception handling being untested, I fear that this will have a quite large hit on performance and break with edge cases like incremental mode or similar. The docs are rather clear that clean=True will just clear references.

I am not sure if we are able to solve this issue without too much overhead or should prefer to document that to remove orphans, the corresponding method should be called at the end. If we still miss some unused objects with this, I prefer to focus on fixing the behavior there instead.

@stefan6419846 stefan6419846 added the needs-discussion The PR/issue needs more discussion before we can continue label Aug 13, 2025
@luis-cd
Copy link
Author

luis-cd commented Aug 13, 2025

Hi — thanks again for the review and for pointing out the performance / semantic concerns.

Short summary / proposal

Rather than forcing a heavy serialize→read→clone rebuild inside remove_page(clean=True) (which changes semantics and can be expensive), I propose:

  • Add a lightweight PdfWriter.remove_orphan_objects() that performs an in-memory mark-and-sweep (graph reachability) to remove unreachable indirect objects from the writer’s internal pool.
  • Keep a heavy PdfWriter.rebuild() as an explicit fallback that does serialize→read→clone (bulletproof but expensive).
  • Optionally expose writer.write(..., rebuild=False) as a convenience flag (default False), but do not change remove_page(clean=True) semantics.

Why this approach

  • Preserves current documented behavior of clean=True.
  • Mark-and-sweep is O(number_of_objects + references) and avoids reserializing large streams — much faster than full rebuild for large PDFs.
  • Users opt in to the expensive operation; no surprising perf hit by default.

Edge cases / safety

  • Incremental mode: do not run sweep or rebuild automatically in incremental=True. Suggested behavior: remove_orphan_objects() no-ops and logs a warning (or we can raise if maintainers prefer).
  • Encryption / object streams / names trees: include these roots in traversal and add tests for encrypted files.
  • Fallback: rebuild() remains when sweep is insufficient.

Request for guidance

  1. Are you comfortable with adding remove_orphan_objects() + rebuild() as explicit APIs, instead of changing remove_page(clean=True)?
  2. Preferred incremental-mode behavior: no-op + logger_warning (recommended) or raise an error?
  3. Any additional roots or special catalog entries I should include in traversal?

Thanks again for the feedback — happy to adjust based on your guidance.

@stefan6419846
Copy link
Collaborator

  1. We should not need remove_orphan_objects, we already have compress_identical_objects(remove_orphans=True), possibly with adding remove_identicals=False. I do not see the need for adding new parameters or the rebuild() method.
  2. The user should know when he uses incremental mode. I do not see an explicit need for adding a new warning here.
  3. According to PdfWriter remove_page/compress_identical_objects does not clean up properly #3418 (comment), fonts and images are not deleted correctly at the moment. They should be included. I do not see any need for handling encrypted documents in a special manner.

@luis-cd
Copy link
Author

luis-cd commented Aug 14, 2025

Given the feedback from the maintainers, I understand that the approach in this PR (modifying remove_page and adding _rebuild_via_io) is not the preferred solution due to potential performance impact and maintenance concerns.

I will close this PR and instead prepare a new one focusing on improving compress_identical_objects(remove_orphans=True) so that it properly removes fonts, images, and other unreferenced objects, as suggested in the discussion.

Thanks for the detailed review and guidance!

@luis-cd luis-cd closed this Aug 14, 2025
@luis-cd luis-cd deleted the fix-remove-page-cleanup branch August 14, 2025 14:45
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
needs-discussion The PR/issue needs more discussion before we can continue
Projects
None yet
Development

Successfully merging this pull request may close these issues.

PdfWriter remove_page/compress_identical_objects does not clean up properly
2 participants