Decode all streams in a PDF #3635

j-t-1 · 2026-02-10T17:21:19Z

What is a good way to do the following:

Iterate over all streams (including unreferenced ones) and decode each one using its filters to produce the original, unencoded data.
Once the streams are decoded, also decode any inline images. (This part is more difficult.)
Save the result as a new PDF.

Python code in this discussion would be ideal. Preferably using this function as it already exists:

pypdf/pypdf/filters.py

Line 766 in 219153e

def decode_stream_data(stream: StreamObject) -> bytes:

Optionally, are there any files that contain all or most of the filter types and inline images to test this with?

stefan6419846 · 2026-02-10T19:13:32Z

Maintainer

As pypdf is written, this will only work for filters that are not image-only and therefore do not rely on external libraries such as Pillow or jbig2dec.
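To make the image-only distinction concrete, here is a minimal sketch of a check that could be used to skip such streams. The helper name `uses_image_only_filter` is hypothetical and not part of pypdf, and treating exactly these four filters as image-only is an assumption based on the PDF specification's filter list.

```python
# Filters that the PDF specification restricts to image data. Treating
# exactly these four as "image-only" is an assumption of this sketch.
IMAGE_ONLY_FILTERS = frozenset(
    {"/DCTDecode", "/JPXDecode", "/JBIG2Decode", "/CCITTFaxDecode"}
)


def uses_image_only_filter(filter_entry) -> bool:
    """Return True if a /Filter entry names at least one image-only filter.

    `filter_entry` may be None (no filter), a single name, or a list of names.
    """
    if filter_entry is None:
        return False
    if isinstance(filter_entry, str):
        names = [filter_entry]
    else:
        names = list(filter_entry)
    return any(str(name) in IMAGE_ONLY_FILTERS for name in names)
```

Because pypdf name objects behave like strings and array objects like lists, the value of a stream's `/Filter` entry should be usable here directly, although that has not been verified against every pypdf version.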

If you do not care about using internal APIs, something like this works:

from pypdf import PdfWriter
from pypdf.generic import DecodedStreamObject, EncodedStreamObject


writer = PdfWriter(clone_from="resources/crazyones.pdf")
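# writer._objects is an internal list holding every object of the cloned file.
for index, obj in enumerate(writer._objects):
    if not isinstance(obj, EncodedStreamObject):
        continue
    new_stream = DecodedStreamObject()
    # get_data() applies the stream's filters and returns the decoded bytes.
    new_stream.set_data(obj.get_data())
    for key, value in dict(obj).items():
        # /DecodeParms only parameterizes /Filter, so drop both.
        if key not in {"/Filter", "/DecodeParms"}:
            new_stream[key] = value
    writer._objects[index] = new_stream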
writer.write("out.pdf")

I have not tested this with inline images or similar though, and relying on internal APIs is not recommended.

> Preferably using this function as it already exists:

It should not be necessary to call this explicitly; EncodedStreamObject.get_data already takes care of it.

> Optionally, are there any files that contain all or most of the filter types and inline images to test this with?

I am not aware of such a file, and it is rather uncommon to have one, except for explicit testing purposes.

2 replies

j-t-1 Feb 10, 2026
Author

Great, this is what I was hoping for. I am fine with internal APIs.

> I am not aware of this and it is rather uncommon to have such a file - except for explicit testing purposes.

Yes, such a file would likely exist for testing purposes. Alternatively, there could be a set of small files, one for each filter type, including inline images.

I was surprised that PDF32000_2008.pdf has a CCITTFaxDecode filter. Maybe it also has other filters as it needs to demonstrate them? #3609 (comment). Where is this file located in the repository?

Aside: I think we could rename some of the files in the resources folder to be more descriptive like sample-files.

stefan6419846 Feb 11, 2026
Maintainer

> Where is this file located in the repository?

In a proper development environment, you can find the file at tests/pdf_cache/PDF32000_2008.pdf. Apart from this, the GitHub search is your friend.

> Aside: I think we could rename some of the files in the resources folder to be more descriptive like sample-files.

The generic files do not need to be more descriptive IMHO.
