Better support for the non strict mode #484
Open
Description
I'm opening this issue to work on better support for non-strict mode in this package. It follows up on #41, which requested a non-strict mode; that mode was added in afe79a4.
I work on a dataset of 300k PDFs (public tender documents from various sources), and here are the parsing errors I see, ordered by number of occurrences:
- Invalid PDF structure (756 documents): possibly related to "Exceeded allocation size when reading pdf" (#433) and "Add relaxed mode (ignores things like false byte offsets in xref table)" (#41)
- Invalid file header (296 documents): will be addressed by "fix: accept binary bytes on the PDF header line in non strict mode" (#481)
- Invalid file trailer (215 documents): possibly related to "15_EventMaxiumSpeed_Qualifying.PDF can't be loaded because of overly strict startxref parsing" (#318)
- Invalid content stream (125 documents): possibly related to "Content decoding does not handle inline images" (#78)
- Invalid cross reference table (3 documents): possibly related to "Parse error when encoding is an indirect reference" (#463)
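For anyone who wants to produce a similar breakdown on their own files, here is a minimal sketch of how counts like the ones above could be tallied. It is not the script I actually ran: `tally_errors` is a hypothetical helper, and the `load` closure is a stand-in for whatever loading call you use, assumed here to return the error message as a `String` on failure.

```rust
use std::collections::HashMap;

/// Try to load every path with `load` and count how often each error
/// message occurs. Returns (message, count) pairs sorted from most to
/// least frequent; ties are broken alphabetically by message.
///
/// `load` is a hypothetical stand-in for a real PDF loading call. It is
/// assumed to return `Ok(())` on success and the error message on failure.
fn tally_errors<I, F>(paths: I, mut load: F) -> Vec<(String, usize)>
where
    I: IntoIterator<Item = String>,
    F: FnMut(&str) -> Result<(), String>,
{
    let mut counts: HashMap<String, usize> = HashMap::new();
    for path in paths {
        // Documents that load successfully are skipped; only failures are counted.
        if let Err(message) = load(&path) {
            *counts.entry(message).or_insert(0) += 1;
        }
    }

    let mut sorted: Vec<(String, usize)> = counts.into_iter().collect();
    // Highest count first, then alphabetical order for a stable report.
    sorted.sort_by(|a, b| b.1.cmp(&a.1).then_with(|| a.0.cmp(&b.0)));
    sorted
}
```

In practice the closure would wrap the real loader and convert its error to a string, and each returned pair can be printed as `message (count documents)`. Grouping by the full message only works if the messages carry no per-file details such as byte offsets; otherwise you would group by error kind instead.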
I will try to open PRs at least for the cases where a small fix in non-strict mode is enough. @J-F-Liu
FYI @abimaelmartell