Strange pdf with every page referring every image #3160
Hello. This is a problem I have been looking at for a while: I have a bunch of PDFs (apparently from the Internet Archive) that were generated from scans, and every page is, at least visually, a background image with a mask. The mask is the high-definition text and the background is just yellowing paper, so extracting the mask as an image results in a more readable PDF anyway, and a faster and smaller one too. So I have a script which does this:
This works okay for every PDF in that collection so far, except one. That particular one somehow has every page referring to every image. I.e. Granted the documentation of |
Replies: 2 comments
Argh, there is a comment buried in the issue about |
Argh, `page.clean_contents()` does work. I now get a PDF which is about 6 MB instead of the 20 MB original. Oddly, if I use qpdf to join the first and last pages of the original with pages 2 through n-1 of the modified file, I get a 22 MB result. It seems that the first page still refers to every image. Anyway, I think I have got this answered.
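For anyone landing here later, a minimal sketch of what that fix could look like. The helper name `clean_and_save` and the save options `garbage=4, deflate=True` are my own assumptions, not something taken from the original script; as far as I understand the PyMuPDF docs, `clean_contents()` also drops resources a page does not actually use, which would explain the size drop.

```python
import sys


def clean_and_save(doc, out_path):
    """Call clean_contents() on every page of doc, then save to out_path.

    `doc` is expected to be an opened PyMuPDF document.
    Returns the number of pages that were processed.
    """
    count = 0
    for page in doc:
        # Rewrites the page's content stream; the size drop reported
        # above was observed after this call.
        page.clean_contents()
        count += 1
    # Assumption: garbage=4 and deflate=True are chosen here so that
    # objects no longer referenced are dropped and streams compressed.
    doc.save(out_path, garbage=4, deflate=True)
    return count


if __name__ == "__main__" and len(sys.argv) == 3:
    import fitz  # PyMuPDF

    with fitz.open(sys.argv[1]) as document:
        n = clean_and_save(document, sys.argv[2])
    print(f"cleaned {n} pages")
```

Run it as `python clean.py in.pdf out.pdf` (the script name is made up). This only covers the cleanup step, not the mask extraction.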