
Accept binary bytes on the PDF header line #480

@rth


I have a pdf vittel_rdc.pdf that fails to load in lopdf with

Error: Parse(invalid file header)

but otherwise opens fine in other pdf viewers.

After some investigation, it looks like the file has a header like

%PDF-1.3 \xb0\x9f\x92\x9c\x9f\xd4\xe0\xce\xd0\xd0\xd0\r

and was produced by ImageMill Imaging Library v3.000.

If I understand correctly, the PDF spec (ISO 32000, §7.5.2) requires a newline before the binary comment line, and that's what lopdf implements. However, looking at what other PDF libraries implement, they are a bit less strict:

  • pypdf: Only validates the 5-byte %PDF- prefix, doesn't parse the rest of the line
  • pdf.js (Mozilla): Reads version chars until whitespace (byte ≤ 0x20) or 7 chars max, ignores the rest
  • qpdf: Uses regex [0-9]+\.[0-9]+ to extract version digits, ignores trailing bytes

so I propose making this header parsing a bit less strict, so that lopdf does not error on such files. There should be little downside.
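
To make the proposal concrete, here is a minimal sketch of a lenient check in the spirit of the qpdf approach above: require the `%PDF-` prefix, read `digits.digits` as the version, and ignore whatever follows on the line. The function name `parse_pdf_version_lenient` is hypothetical and is not lopdf's actual API or the code in the PR.

```rust
/// Hypothetical helper (not part of lopdf): extract the PDF version from the
/// start of a file, tolerating arbitrary trailing bytes on the header line.
fn parse_pdf_version_lenient(input: &[u8]) -> Option<String> {
    const PREFIX: &[u8] = b"%PDF-";

    // The file must start with the 5-byte "%PDF-" marker.
    if !input.starts_with(PREFIX) {
        return None;
    }
    let rest = &input[PREFIX.len()..];

    // Major version: one or more ASCII digits.
    let major_len = rest.iter().take_while(|b| b.is_ascii_digit()).count();
    if major_len == 0 {
        return None;
    }

    // A single '.' must separate major and minor.
    if rest.get(major_len) != Some(&b'.') {
        return None;
    }

    // Minor version: one or more ASCII digits.
    let minor = &rest[major_len + 1..];
    let minor_len = minor.iter().take_while(|b| b.is_ascii_digit()).count();
    if minor_len == 0 {
        return None;
    }

    // Everything after the version (spaces, binary bytes, '\r') is ignored.
    let version = &rest[..major_len + 1 + minor_len];
    Some(String::from_utf8_lossy(version).into_owned())
}
```

With this sketch, the header from the failing file would be accepted and read as version 1.3, while input without the `%PDF-` prefix or without a `digits.digits` version would still be rejected.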

Made a PR in #481
