BUG: properly clean up orphan objects when removing pages #3432

luis-cd · 2025-08-13T00:16:43Z

This fixes an issue where removing pages after writing left orphan objects in the PDF writer.

The fix forces a rebuild of the internal structure to clean orphaned objects, improving stability and avoiding potential PDF corruption.

Closes #3418

…ng writer after writes

codecov · 2025-08-13T00:26:09Z

Codecov Report

❌ Patch coverage is 90.90909% with 2 lines in your changes missing coverage. Please review.
✅ Project coverage is 96.96%. Comparing base (71a1bd1) to head (25a9684).
⚠️ Report is 1 commits behind head on main.

Files with missing lines	Patch %	Lines
pypdf/_writer.py	90.90%	2 Missing ⚠️

Additional details and impacted files

@@            Coverage Diff             @@
##             main    #3432      +/-   ##
==========================================
- Coverage   96.97%   96.96%   -0.02%     
==========================================
  Files          54       54              
  Lines        9337     9358      +21     
  Branches     1711     1712       +1     
==========================================
+ Hits         9055     9074      +19     
- Misses        168      170       +2     
  Partials      114      114

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:

❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

This change ensures that failures during the internal rebuild process after removing a page with clean=True do not crash the operation but instead log a warning. Closes py-pdf#3418

stefan6419846 · 2025-08-13T06:52:04Z

Thanks for the PR. Besides the exception handling being untested, I fear that this will have a quite large hit on performance and break with edge cases like incremental mode or similar. The docs are rather clear that clean=True will just clear references.

I am not sure if we are able to solve this issue without too much overhead or should prefer to document that to remove orphans, the corresponding method should be called at the end. If we still miss some unused objects with this, I prefer to focus on fixing the behavior there instead.

luis-cd · 2025-08-13T11:38:59Z

Hi — thanks again for the review and for pointing out the performance / semantic concerns.

Short summary / proposal

Rather than forcing a heavy serialize→read→clone rebuild inside remove_page(clean=True) (which changes semantics and can be expensive), I propose:

Add a lightweight PdfWriter.remove_orphan_objects() that performs an in-memory mark-and-sweep (graph reachability) to remove unreachable indirect objects from the writer’s internal pool.
Keep a heavy PdfWriter.rebuild() as an explicit fallback that does serialize→read→clone (bulletproof but expensive).
Optionally expose writer.write(..., rebuild=False) as a convenience flag (default False), but do not change remove_page(clean=True) semantics.

Why this approach

Preserves current documented behavior of clean=True.
Mark-and-sweep is O(number_of_objects + references) and avoids reserializing large streams — much faster than full rebuild for large PDFs.
Users opt in to the expensive operation; no surprising perf hit by default.

Edge cases / safety

Incremental mode: do not run sweep or rebuild automatically in incremental=True. Suggested behavior: remove_orphan_objects() no-ops and logs a warning (or we can raise if maintainers prefer).
Encryption / object streams / names trees: include these roots in traversal and add tests for encrypted files.
Fallback: rebuild() remains when sweep is insufficient.

Request for guidance

Are you comfortable with adding remove_orphan_objects() + rebuild() as explicit APIs, instead of changing remove_page(clean=True)?
Preferred incremental-mode behavior: no-op + logger_warning (recommended) or raise an error?
Any additional roots or special catalog entries I should include in traversal?

Thanks again for the feedback — happy to adjust based on your guidance.

stefan6419846 · 2025-08-13T11:49:52Z

We should not need remove_orphan_objects, we already have compress_identical_objects(remove_orphans=True), possibly with adding remove_identicals=False. I do not see the need for adding new parameters or the rebuild() method.
The user should know when he uses incremental mode. I do not see an explicit need for adding a new warning here.
According to PdfWriter remove_page/compress_identical_objects does not clean up properly #3418 (comment), fonts and images are not deleted correctly at the moment. They should be included. I do not see any need for handling encrypted documents in a special manner.

luis-cd · 2025-08-14T14:42:08Z

Given the feedback from the maintainers, I understand that the approach in this PR (modifying remove_page and adding _rebuild_via_io) is not the preferred solution due to potential performance impact and maintenance concerns.

I will close this PR and instead prepare a new one focusing on improving compress_identical_objects(remove_orphans=True) so that it properly removes fonts, images, and other unreferenced objects, as suggested in the discussion.

Thanks for the detailed review and guidance!

Fix py-pdf#3418: Clean remove_page removes orphan objects by rebuildi…

164b8f3

…ng writer after writes

Fix py-pdf#3418: move imports to top-level to satisfy ruff

a8fdfad

luis-cd changed the title ~~Fix #3418: Properly clean up orphan objects when removing pages in PdfWriter~~ fix(writer): properly clean up orphan objects when removing pages (fixes #3418) Aug 13, 2025

luis-cd changed the title ~~fix(writer): properly clean up orphan objects when removing pages (fixes #3418)~~ BUG: properly clean up orphan objects when removing pages Aug 13, 2025

ROB: Log warning when PdfWriter rebuild fails in remove_page(clean=True)

25a9684

This change ensures that failures during the internal rebuild process after removing a page with clean=True do not crash the operation but instead log a warning. Closes py-pdf#3418

luis-cd mentioned this pull request Aug 13, 2025

PdfWriter remove_page/compress_identical_objects does not clean up properly #3418

Open

stefan6419846 added the needs-discussion The PR/issue needs more discussion before we can continue label Aug 13, 2025

luis-cd closed this Aug 14, 2025

luis-cd deleted the fix-remove-page-cleanup branch August 14, 2025 14:45

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

BUG: properly clean up orphan objects when removing pages #3432

BUG: properly clean up orphan objects when removing pages #3432

Uh oh!

luis-cd commented Aug 13, 2025 •

edited

Loading

Uh oh!

codecov bot commented Aug 13, 2025 •

edited

Loading

Uh oh!

stefan6419846 commented Aug 13, 2025

Uh oh!

luis-cd commented Aug 13, 2025

Uh oh!

stefan6419846 commented Aug 13, 2025

Uh oh!

luis-cd commented Aug 14, 2025

Uh oh!

Uh oh!

BUG: properly clean up orphan objects when removing pages #3432

BUG: properly clean up orphan objects when removing pages #3432

Uh oh!

Conversation

luis-cd commented Aug 13, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

codecov bot commented Aug 13, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Codecov Report

Uh oh!

stefan6419846 commented Aug 13, 2025

Uh oh!

luis-cd commented Aug 13, 2025

Uh oh!

stefan6419846 commented Aug 13, 2025

Uh oh!

luis-cd commented Aug 14, 2025

Uh oh!

Uh oh!

luis-cd commented Aug 13, 2025 •

edited

Loading

codecov bot commented Aug 13, 2025 •

edited

Loading