-
Notifications
You must be signed in to change notification settings - Fork 246
Accept binary bytes on the PDF header line #480
Copy link
Copy link
Closed
Description
I have a pdf vittel_rdc.pdf that fails to load in lopdf with
Error: Parse(invalid file header)
but otherwise opens fine in other pdf viewers.
After some investigations it looks like it has a header with something like
%PDF-1.3 \xb0\x9f\x92\x9c\x9f\xd4\xe0\xce\xd0\xd0\xd0\r
and was produced by ImageMill Imaging Library v3.000.
If I understand correctly, the pdf spec (ISO 32000, §7.5.2) would requires a newline before the binary comment line and that's what lopdf implements. However looking at what other pdf libraries implement they are a bit less strict,
- pypdf: Only validates the 5-byte
%PDF-prefix, doesn't parse the rest of the line - pdf.js (Mozilla): Reads version chars until whitespace (byte ≤ 0x20) or 7 chars max, ignores the rest
- qpdf: Uses regex
[0-9]+\.[0-9]+to extract version digits, ignores trailing bytes
so I would propose to be a bit less strict about this header parsing, to not error on such files. There should be little downside.
Made a PR in #481
Reactions are currently unavailable
Metadata
Metadata
Assignees
Labels
No labels