fix: skip extracted images during element merge (#175)
When merging extracted elements with inferred elements, the extracted elements take priority because they tend to have the exact text and bounding boxes. However, this does not hold when the extracted element is an image: an image element gives no insight into what text it contains (if any) or where in the region that text sits.
This is particularly bad when the PDF contains a single full-page embedded image.
This PR fixes that by skipping the embedded images during merging.
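
As a rough illustration of the idea (not the code in this PR), the merge can be sketched as below. The names `Region`, `mostly_inside`, and `merge_elements`, the `kind` field, and the 0.5 overlap threshold are all hypothetical and chosen only for this sketch.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Region:
    """Hypothetical stand-in for a layout element with a bounding box."""
    x1: float
    y1: float
    x2: float
    y2: float
    kind: str  # "text" or "image" (assumed labels)
    text: Optional[str] = None


def area(r: Region) -> float:
    return max(r.x2 - r.x1, 0.0) * max(r.y2 - r.y1, 0.0)


def intersection_area(a: Region, b: Region) -> float:
    w = min(a.x2, b.x2) - max(a.x1, b.x1)
    h = min(a.y2, b.y2) - max(a.y1, b.y1)
    return max(w, 0.0) * max(h, 0.0)


def mostly_inside(inner: Region, outer: Region, threshold: float = 0.5) -> bool:
    """True if at least `threshold` of inner's area lies within outer."""
    a = area(inner)
    return a > 0 and intersection_area(inner, outer) / a >= threshold


def merge_elements(inferred: List[Region], extracted: List[Region]) -> List[Region]:
    """Let extracted text regions override overlapping inferred ones,
    but skip extracted images so they cannot override anything."""
    # Extracted images carry no text or text position, so drop them up front.
    usable = [ext for ext in extracted if ext.kind != "image"]
    # Keep only inferred regions that no usable extracted region covers.
    kept = [
        inf for inf in inferred
        if not any(mostly_inside(inf, ext) for ext in usable)
    ]
    return kept + usable
```

With a full-page extracted image in `extracted`, the inferred regions it covers survive the merge, while an extracted text region still replaces the inferred region it overlaps.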
Note that the unintended behavior did provide one benefit in certain cases. When a document contains embedded images that contain text and the detection model is unable to find that text (for example, some scanned documents or 2023-Jan-economic-outlook.pdf), keeping the Image element forced us to OCR the appropriate area, which captured more of the text.
We should be able to get this behavior intentionally by using the full-page OCR mode: when OCR finds text elements that have no corresponding elements in the detected layout, we add them to the list. See #176.
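
A minimal sketch of that follow-up idea, under stated assumptions: full-page OCR is taken to return text boxes, and any OCR box not sufficiently covered by a detected layout element is appended. `Box`, `overlap_fraction`, `add_unmatched_ocr`, and the 0.5 threshold are hypothetical and do not describe what #176 implements.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Box:
    """Hypothetical text box: coordinates plus recognized text."""
    x1: float
    y1: float
    x2: float
    y2: float
    text: str = ""


def overlap_fraction(inner: Box, outer: Box) -> float:
    """Fraction of inner's area that lies within outer (0.0 if inner is empty)."""
    inner_area = max(inner.x2 - inner.x1, 0.0) * max(inner.y2 - inner.y1, 0.0)
    if inner_area == 0:
        return 0.0
    w = min(inner.x2, outer.x2) - max(inner.x1, outer.x1)
    h = min(inner.y2, outer.y2) - max(inner.y1, outer.y1)
    return max(w, 0.0) * max(h, 0.0) / inner_area


def add_unmatched_ocr(layout: List[Box], ocr: List[Box], threshold: float = 0.5) -> List[Box]:
    """Append OCR boxes that no detected layout element covers."""
    unmatched = [
        o for o in ocr
        if all(overlap_fraction(o, el) < threshold for el in layout)
    ]
    return list(layout) + unmatched
```

Text that the detection model missed, such as text inside a scanned figure, would then be recovered from OCR without relying on the Image element.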