
PdfWriter remove_page/compress_identical_objects does not clean up properly #3418

@illumi

Description


I'm creating PDF documents that must stay under a maximum size. I add pages to a PdfWriter object and, once the output goes over the maximum, I remove the last added page and write the final document. I noticed that my documents were still over the maximum, which was causing issues in my system. The examples here show only a small increase, but with more complex pages containing images, megabytes of data can be left in the final output.
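A minimal sketch of the workflow described above, assuming a writer object that exposes `add_page`, `write`, `remove_page`, and `pages` the way `pypdf.PdfWriter` does. The helper names `written_size` and `add_pages_up_to_limit` are hypothetical and only for illustration; they are not part of pypdf.

```python
import io


def written_size(writer) -> int:
    """Serialize the writer into memory and return the number of bytes written."""
    buffer = io.BytesIO()
    writer.write(buffer)
    return buffer.tell()


def add_pages_up_to_limit(writer, pages, max_bytes: int) -> int:
    """Add pages one by one and roll back the page that exceeds max_bytes.

    Returns the number of pages kept in the writer.
    """
    kept = 0
    for page in pages:
        writer.add_page(page)
        if written_size(writer) > max_bytes:
            # Remove the page that pushed the output over the limit
            # (second argument is clean=True, as in the example in this issue).
            writer.remove_page(len(writer.pages) - 1, True)
            break
        kept += 1
    return kept
```

Given the behaviour reported in this issue, the size after the rollback may still be larger than before the page was added, so it seems prudent to call `written_size(writer)` again after the loop instead of assuming the output fits.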

Environment

Which environment were you using when you encountered the problem? This occurs on both Windows and Linux, with pypdf 5.7.0 and 5.9.0.

$ python -m platform
Linux-5.10.0-23-cloud-amd64-x86_64-with-glibc2.31

$ python -c "import pypdf;print(pypdf._debug_versions)"
pypdf==5.9.0, crypt_provider=('local_crypt_fallback', '0.0.0'), PIL=none

Code + PDF

This is a minimal, complete example that shows the issue:

import io

from pypdf import PdfReader, PdfWriter

writer = PdfWriter()

# Append the document once and measure the written size.
stream = io.BytesIO()
reader1 = PdfReader("minimal-document.pdf")
writer.append(reader1)
writer.write(stream)
print(f"1 page:\t\t\t{stream.tell()} bytes")

# Append the same document a second time and measure again.
stream = io.BytesIO()
reader2 = PdfReader("minimal-document.pdf")
writer.append(reader2)
writer.write(stream)
print(f"2 pages:\t\t{stream.tell()} bytes")

# Remove the last page (clean=True), then try to drop duplicate objects.
writer.remove_page(len(writer.pages) - 1, True)
writer.compress_identical_objects()

# Expected: the same size as the first measurement.
stream = io.BytesIO()
writer.write(stream)
print(f"1 page (again):\t{stream.tell()} bytes")

First run with: https://github.com/py-pdf/sample-files/blob/main/001-trivial/minimal-document.pdf

Second run with: https://github.com/py-pdf/sample-files/blob/main/011-google-doc-document/google-doc-document.pdf

Traceback

First run:

1 page:			17387 bytes
2 pages:		34412 bytes
1 page (again):	18077 bytes

Second run:

1 page:			80356 bytes
2 pages:		160178 bytes
1 page (again):	94343 bytes
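To make the leftover data explicit, the following sketch computes how much larger the final output is than the original one-page output. The byte counts are copied from the two runs above; the `leftover` helper and the dictionary layout are only illustrative.

```python
# Sizes in bytes, copied from the two runs reported above.
runs = {
    "minimal-document.pdf": {"one_page": 17387, "after_removal": 18077},
    "google-doc-document.pdf": {"one_page": 80356, "after_removal": 94343},
}


def leftover(one_page: int, after_removal: int) -> tuple[int, float]:
    """Return the extra bytes and the relative increase in percent."""
    extra = after_removal - one_page
    return extra, 100 * extra / one_page


for name, sizes in runs.items():
    extra, percent = leftover(**sizes)
    print(f"{name}: {extra} extra bytes ({percent:.1f}% larger)")
```

For the first run this gives 690 extra bytes (about 4.0%), and for the second run 13987 extra bytes (about 17.4%).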

Labels

PdfWriter: The PdfWriter component is affected