Commit de19ace

fix: skip extracted images during element merge (#175)
When merging extracted elements with inferred elements, extracted elements are given a certain priority because they tend to have the exact text and bounding boxes. However, this is not true when the extracted element is an image: an image element gives no insight into where in the region any text might be (if the image contains text at all), or what that text is. This is particularly bad when the PDF contains a single full-page embedded image. This PR fixes that by skipping embedded images during merging.

Note that the unintended behavior did provide one benefit in certain cases. When a document contains embedded images that contain text, and the detection model is unable to find that text (examples are some scanned documents or 2023-Jan-economic-outlook.pdf), keeping the Image element forced us to OCR the appropriate area, which captured more of the text. We should be able to get this behavior intentionally by using the full-page OCR mode and, when OCR finds text elements that have no corresponding elements in the detected layout, adding them to the list. See #176.
1 parent ae73cf8 commit de19ace
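
The follow-up idea in the commit message (run full-page OCR, then add any OCR text that has no counterpart in the detected layout) could look roughly like the sketch below. This is only one possible reading of that idea, not the implementation tracked in #176: `Box`, `overlaps`, and `add_unmatched_ocr_regions` are hypothetical names, and treating any overlap as a match is an assumption.

```python
from typing import List, NamedTuple


class Box(NamedTuple):
    # Hypothetical stand-in for a layout or OCR element: corner coordinates plus text.
    x1: float
    y1: float
    x2: float
    y2: float
    text: str


def overlaps(a: Box, b: Box) -> bool:
    # True when the two axis-aligned boxes share a region of positive area.
    return a.x1 < b.x2 and b.x1 < a.x2 and a.y1 < b.y2 and b.y1 < a.y2


def add_unmatched_ocr_regions(layout: List[Box], ocr: List[Box]) -> List[Box]:
    # Keep every detected layout element, then append the OCR elements that
    # overlap none of them (assumption: any overlap counts as "corresponding").
    unmatched = [o for o in ocr if not any(overlaps(o, l) for l in layout)]
    return list(layout) + unmatched
```

With a layout that found only a heading, OCR text located elsewhere on the page (for example, text inside an embedded image) is appended, while OCR text that overlaps the heading is dropped as already covered.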

3 files changed: +9 -1 lines changed


CHANGELOG.md

Lines changed: 4 additions & 0 deletions
@@ -1,3 +1,7 @@
+## 0.5.13
+
+* Fix extracted image elements being included in layout merge
+
 ## 0.5.12
 
 * Fix a pdfminer error when using `process_data_with_model`

Lines changed: 1 addition & 1 deletion
@@ -1 +1 @@
-__version__ = "0.5.12" # pragma: no cover
+__version__ = "0.5.13" # pragma: no cover

unstructured_inference/inference/layoutelement.py

Lines changed: 4 additions & 0 deletions
@@ -87,6 +87,10 @@ def merge_inferred_layout_with_extracted_layout(
     extracted_elements_to_add: List[TextRegion] = []
     inferred_regions_to_remove = []
     for extracted_region in extracted_layout:
+        if isinstance(extracted_region, ImageTextRegion):
+            # Skip extracted images for this purpose, we don't have the text from them and they
+            # don't provide good text bounding boxes.
+            continue
         region_matched = False
         for inferred_region in inferred_layout:
             if inferred_region.intersects(extracted_region):
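
To see why the `isinstance` guard matters, here is a minimal, self-contained sketch of the effect. It is not the library's code: `TextRegion` and `ImageTextRegion` are simplified stand-ins that only borrow the names from the diff, `merge_sketch` is a hypothetical helper, and the rule that an extracted region replaces every inferred region it intersects is an assumed simplification of the real merge.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class TextRegion:
    # Simplified stand-in: corner coordinates plus the region's text.
    x1: float
    y1: float
    x2: float
    y2: float
    text: str = ""

    def intersects(self, other: "TextRegion") -> bool:
        # True when the two axis-aligned boxes overlap with positive area.
        return (
            self.x1 < other.x2
            and other.x1 < self.x2
            and self.y1 < other.y2
            and other.y1 < self.y2
        )


@dataclass
class ImageTextRegion(TextRegion):
    # Stand-in for an embedded image extracted from the PDF; it carries no text.
    pass


def merge_sketch(extracted: List[TextRegion], inferred: List[TextRegion]) -> List[TextRegion]:
    # Assumed toy policy: an extracted region replaces every inferred region it intersects.
    to_add: List[TextRegion] = []
    to_remove: List[TextRegion] = []
    for extracted_region in extracted:
        if isinstance(extracted_region, ImageTextRegion):
            # Same guard as in the diff: images say nothing about text position or content.
            continue
        matched = [r for r in inferred if r.intersects(extracted_region)]
        if matched:
            to_remove.extend(matched)
            to_add.append(extracted_region)
    # Compare by identity so equal-looking regions are not dropped by accident.
    kept = [r for r in inferred if not any(r is m for m in to_remove)]
    return kept + to_add
```

Under this toy policy, a full-page `ImageTextRegion` would intersect, and so displace, every inferred region if the guard were removed; with the guard in place the inferred layout passes through untouched, while a genuine extracted text region still replaces the inferred region it overlaps.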
