Description
Camelot contains this code: https://github.com/camelot-dev/camelot/blob/master/camelot/utils.py#L1391
That code would appear to respect the "extractable" flag on encrypted PDFs and refuse to process them. Why you would actually want to do that is a mystery to me, but I can see the benefit in some compliance-related situations.
Unfortunately it does no such thing, because Camelot also decrypts PDFs while splitting them into individual pages with pypdf. The per-page files it then processes are no longer encrypted, so there is no way to know whether the author wanted you to extract text or not (this flag only exists on encrypted PDFs).
So the code above does nothing. If you want to respect the permissions, you'll need to check the user_access_permissions property on the PdfReader at the point where you do that decryption:
https://pypdf.readthedocs.io/en/stable/modules/PdfDocCommon.html#pypdf._doc_common.PdfDocCommon.user_access_permissions
https://pypdf.readthedocs.io/en/stable/modules/constants.html#pypdf.constants.UserAccessPermissions
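For illustration, here is a rough sketch of what such a check might look like. This is not Camelot's code: the function names `extraction_allowed` and `pdf_allows_extraction` are made up for this example, and it assumes the PDF opens with an empty user password (the usual case when only an owner password is set). The bit tested is bit 5 of the PDF permissions entry (value 16), which as far as I can tell is what `UserAccessPermissions.EXTRACT` represents in pypdf.

```python
from typing import Optional

# Bit 5 of the PDF /P permissions entry (value 16): "copy or otherwise
# extract text and graphics". Assumed to match pypdf's
# UserAccessPermissions.EXTRACT.
EXTRACT_BIT = 1 << 4


def extraction_allowed(permissions: Optional[int]) -> bool:
    """Hypothetical helper: decide whether extraction is permitted.

    `permissions` is the value of PdfReader.user_access_permissions,
    which is None when the document is not encrypted.
    """
    if permissions is None:
        # Unencrypted PDFs carry no permission flags at all.
        return True
    return bool(int(permissions) & EXTRACT_BIT)


def pdf_allows_extraction(path: str, password: str = "") -> bool:
    """Hypothetical wrapper: read the flags at the point of decryption."""
    # Imported lazily so the pure helper above works without pypdf.
    from pypdf import PdfReader

    reader = PdfReader(path)
    if reader.is_encrypted:
        # Assumption: the empty user password opens the file.
        reader.decrypt(password)
    return extraction_allowed(reader.user_access_permissions)
```

The point is that the flags have to be read from the original encrypted file, before the split pages are written out, and the result carried along to wherever the "extractable" check happens.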
Or, alternately, you could apply #589 and have one fewer PDF parser to deal with ;-)
There is no way to reproduce this because it's an expected behaviour that doesn't occur, but I noticed it with this PDF from the test set:
https://github.com/camelot-dev/camelot/blob/master/tests/files/birdisland.pdf