
fix: accept binary bytes on the PDF header line in non strict mode #481

Merged: J-F-Liu merged 1 commit into J-F-Liu:main from rth:fix-header-binary-bytes on Mar 22, 2026


rth commented Mar 19, 2026

Closes #480

The parser now captures only version-like characters (digits and '.') from the header, then skips any remaining bytes on the line. This is in line with what pdf.js does (read until whitespace, 7 characters at most) and what qpdf does (a regex that matches digits only).

Also added more unit tests for parsing various PDF headers I saw in the dataset I was working on.

In lenient mode (default), only capture version digits from the header
line, skipping any trailing binary marker bytes that some generators
(e.g. ImageMill) place before the newline. In strict mode, reject
headers with trailing bytes after the version string.
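
For illustration, the lenient/strict behaviour described above could be sketched roughly as follows. This is not the code from the PR: `parse_header_version` is a hypothetical helper, the `strict` flag stands in for however the parser is actually configured, and rejecting a header with no version characters at all is an assumption of this sketch.

```rust
/// Hypothetical helper (not lopdf's actual API): extract the version from
/// a PDF header line. `line` is the first line of the file without its
/// line terminator. Returns e.g. "1.7", or `None` if the header is rejected.
fn parse_header_version(line: &[u8], strict: bool) -> Option<String> {
    // The header must start with the "%PDF-" magic bytes.
    let rest = line.strip_prefix(b"%PDF-")?;

    // Capture only version-like bytes: ASCII digits and '.'.
    let len = rest
        .iter()
        .take_while(|b| b.is_ascii_digit() || **b == b'.')
        .count();
    let (version, trailing) = rest.split_at(len);

    // Assumption of this sketch: a header with no version characters
    // is rejected in both modes.
    if version.is_empty() {
        return None;
    }

    // Strict mode rejects any bytes after the version string;
    // lenient mode simply skips them.
    if strict && !trailing.is_empty() {
        return None;
    }

    // The captured bytes are ASCII by construction, so this cannot fail.
    String::from_utf8(version.to_vec()).ok()
}
```

With this sketch, a header such as `%PDF-1.4` followed by binary marker bytes yields `1.4` in lenient mode and is rejected in strict mode, while a plain `%PDF-1.7` is accepted in both.
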
rth force-pushed the fix-header-binary-bytes branch from d56765e to a52c7f9 on March 21, 2026 13:31

rth commented Mar 21, 2026

@J-F-Liu this should be ready for a review.

rth changed the title from "fix(parser): accept binary bytes on the PDF header line" to "fix: accept binary bytes on the PDF header line in non strict mode" on Mar 21, 2026
J-F-Liu merged commit 0526740 into J-F-Liu:main on Mar 22, 2026
10 checks passed

Development

Successfully merging this pull request may close these issues.

Accept binary bytes on the PDF header line