Hello! I'm working on an application to clean up scanned PDF files. The idea is to remove shadows in the background and straighten the text so the PDFs can be fed into OCR software or printed. For testing, I loaded a PDF document and saved it as a new file. However, I can see that the saved files are much larger than the ones that were read (input ~5MB, output ~18MB). I uploaded the generated file to an online PDF optimizer and investigated the result. It is around 8MB, so much better than my 18MB output. The difference is that the image files have a
I wonder if there is a way to force the use of FlateDecode for the PDF images in pypdfium?
Replies: 1 comment 2 replies
So basically, you're extracting images from an existing PDF, processing them, and then putting them into a new PDF? The thing is, adding flate compression on top of DCT usually results in only marginal size improvements, so it sounds as if the optimizer might have been lossy.
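To illustrate why stacking a lossless compressor on already-compressed data tends to gain little, here is a rough standard-library sketch. It is only an analogy: both passes use zlib, so the first pass merely stands in for the existing DCT stream (it is not JPEG encoding), and the input is synthetic data rather than a real scan.

```python
import random
import zlib

# Synthetic, compressible input: 200 kB drawn from an 8-symbol alphabet.
# This is a stand-in for image data, not a real scanned page.
rng = random.Random(0)
raw = bytes(rng.choices(b"abcdefgh", k=200_000))

# First pass: plays the role of the stream that is already compressed.
once = zlib.compress(raw, 9)

# Second pass: plays the role of FlateDecode layered on top.
twice = zlib.compress(once, 9)

print(f"raw:          {len(raw)} bytes")
print(f"one pass:     {len(once)} bytes")
print(f"two passes:   {len(twice)} bytes")
print(f"second-pass ratio: {len(twice) / len(once):.3f}")
```

The second pass has very little redundancy left to remove, so its output stays close to the size of its input. A large drop in file size therefore more likely comes from re-encoding the images than from the extra filter.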
Unfortunately pdfium's public API is rather limited when it comes to images. It works fine for JPEG, but otherwise it only provides the `FPDF_BITMAP` entrypoint, which does not support binary or CMYK images, or images with higher bit depth. If you're using `PdfBitmap.from_pil()` and `PdfImage.set_bitmap()`, these will be transcoded to grayscale, RGB, or 8-bit respectively.[1] Also, you can't choose the encoding (IIRC pdfium will just flate compress the bitmap data).

The wrappers are just there to expose what pdfium can do, but again, I agree they're limited, so you may be better off with img2pdf or similar, especially when working with binary images (which seems to be your use case).
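For the img2pdf route, a minimal sketch could look like the following. It assumes the processed pages have already been saved as image files and that img2pdf is installed; the helper name `images_to_pdf` and the file names in the usage note are hypothetical and not part of any library. `img2pdf.convert()` returns the PDF as bytes when no output stream is passed.

```python
from pathlib import Path


def images_to_pdf(image_paths, out_path):
    """Wrap already-processed page images into a single PDF via img2pdf.

    Hypothetical helper for illustration. Returns the size of the PDF in bytes.
    """
    paths = [str(p) for p in image_paths]
    if not paths:
        raise ValueError("need at least one image")

    # Imported here so this module still loads where img2pdf is not installed.
    import img2pdf

    # With no output stream given, convert() returns the finished PDF as bytes.
    pdf_bytes = img2pdf.convert(paths)
    Path(out_path).write_bytes(pdf_bytes)
    return len(pdf_bytes)
```

Usage would be along the lines of `images_to_pdf(["page-001.png", "page-002.png"], "cleaned.pdf")`, where the page files are whatever your cleanup step wrote out. Saving binary pages as 1-bit images before this step should avoid the 8-bit transcoding described above.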
Footnotes
T…