You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
fix: Correctly patch pdfminer to avoid unnecessarily and unsuccessfully repairing PDFs with long content streams, causing needless and endless OCR (#3822)
Fixes: #3815
Verified on my very large documents that it doesn't unnecessarily and
unsuccessfully "repair" them.
You may or may not wish to keep the version check in `patch_psparser`.
Since ~you're pinning the version of pdfminer.six and since it isn't
guaranteed that the bug in question will be fixed in the next
pdfminer.six release (but it is rather serious, so I should hope so),
then perhaps you just want to unconditionally patch it.~ it seems like
pinning of versions is only operative when running from Docker (good!)
so never mind! Keep that version check!
Also corrected an import so that if you do feel like using a newer
version of pdfminer.six, it won't break on you.
---------
Authored-by: David Huggins-Daines <[email protected]>
Copy file name to clipboardExpand all lines: CHANGELOG.md
+2-1Lines changed: 2 additions & 1 deletion
Display the source diff
Display the rich diff
Original file line number
Diff line number
Diff line change
@@ -1,11 +1,12 @@
1
-
## 0.16.16-dev1
1
+
## 0.16.16-dev2
2
2
3
3
### Enhancements
4
4
5
5
### Features
6
6
-**Vectorize layout (inferred, extracted, and OCR) data structure** Using `np.ndarray` to store a group of layout elements or text regions instead of using a list of objects. This improves the memory efficiency and compute speed around layout merging and deduplication.
7
7
8
8
### Fixes
9
+
-**Correctly patch pdfminer to avoid PDF repair**. The patch applied to pdfminer's parser caused it to occasionally split tokens in content streams, throwing `PDFSyntaxError`. Repairing these PDFs sometimes failed (since they were not actually invalid) resulting in unnecessary OCR fallback.
0 commit comments