fix: skip extracted images during element merge (#175)

qued · web-flow · commit de19ace10163 · 2023-08-16T21:06:41.000-07:00
When merging extracted elements with inferred elements, extracted elements have a certain priority because they tend to have the exact text and bounding boxes. However, this is not true when the extracted element is an image. In that case the element tells us neither what the text is (if the image contains any) nor where in the region it sits. This is particularly bad when the PDF contains a single full-page embedded image. This PR fixes that by skipping embedded images during merging.

Note that the unintended behavior did provide one benefit in certain cases: when a document contains embedded images that contain text, and the detection model is unable to find that text (examples are some scanned documents or 2023-Jan-economic-outlook.pdf), keeping the Image element forced us to OCR the appropriate area, which captured more of the text. We should be able to get this behavior intentionally by using the full-page OCR mode and, when OCR finds text elements that have no corresponding elements in the detected layout, adding them to the list. See #176.
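To make the effect of the skip concrete, here is a minimal, self-contained sketch. It does not use the library: `TextRegion` and `ImageTextRegion` below are simplified stand-ins for the real classes, `merge_sketch` is a hypothetical helper, and its rule ("an extracted region replaces every inferred region it intersects") is a deliberate simplification of the real merge logic. Under that simplified rule, a full-page image would replace every inferred region if it were not skipped.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class TextRegion:
    """Stand-in for the library's region type: a bounding box plus text."""

    x1: float
    y1: float
    x2: float
    y2: float
    text: str = ""

    def intersects(self, other: "TextRegion") -> bool:
        # Axis-aligned boxes overlap when they overlap on both axes.
        return (
            self.x1 < other.x2
            and other.x1 < self.x2
            and self.y1 < other.y2
            and other.y1 < self.y2
        )


class ImageTextRegion(TextRegion):
    """Stand-in for an extracted embedded image (no usable text)."""


def merge_sketch(
    inferred: List[TextRegion], extracted: List[TextRegion]
) -> List[TextRegion]:
    """Simplified merge: extracted text regions replace overlapping inferred ones."""
    to_add: List[TextRegion] = []
    remove_ids = set()
    for ext in extracted:
        if isinstance(ext, ImageTextRegion):
            # Same idea as the fix: images carry neither text nor tight boxes.
            continue
        for inf in inferred:
            if inf.intersects(ext):
                remove_ids.add(id(inf))
        to_add.append(ext)
    kept = [r for r in inferred if id(r) not in remove_ids]
    return kept + to_add


inferred = [
    TextRegion(10, 10, 90, 20, "Title"),
    TextRegion(10, 30, 90, 60, "Body"),
]
extracted = [
    ImageTextRegion(0, 0, 100, 100),  # full-page embedded image
    TextRegion(10, 10, 90, 20, "Exact Title"),
]
merged = merge_sketch(inferred, extracted)
print([r.text for r in merged])
```

In this toy example the inferred "Body" region survives and the inferred "Title" is replaced by the extracted text region; the full-page image never takes part in the merge.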
diff --git a/CHANGELOG.md b/CHANGELOG.md
@@ -1,3 +1,7 @@
+## 0.5.13
+
+* Fix extracted image elements being included in layout merge
+
 ## 0.5.12
 
 * Fix a pdfminer error when using `process_data_with_model`
diff --git a/unstructured_inference/__version__.py b/unstructured_inference/__version__.py
@@ -1 +1 @@
-__version__ = "0.5.12"  # pragma: no cover
+__version__ = "0.5.13"  # pragma: no cover
diff --git a/unstructured_inference/inference/layoutelement.py b/unstructured_inference/inference/layoutelement.py
@@ -87,6 +87,10 @@ def merge_inferred_layout_with_extracted_layout(
     extracted_elements_to_add: List[TextRegion] = []
     inferred_regions_to_remove = []
     for extracted_region in extracted_layout:
+        if isinstance(extracted_region, ImageTextRegion):
+            # Skip extracted images for this purpose, we don't have the text from them and they
+            # don't provide good text bounding boxes.
+            continue
         region_matched = False
         for inferred_region in inferred_layout:
             if inferred_region.intersects(extracted_region):
