Commit 0cd07d7

feat: partition_pdf() add ability to get cid ratio (#2970)
This PR adds the ability to get the ratio of `cid` characters in embedded text extracted by `pdfminer`. This PR is the second part of moving `cid` related code from `unstructured-inference` to `unstructured` and works together with Unstructured-IO/unstructured-inference#342.
1 parent cb55245 commit 0cd07d7

4 files changed: +49 -2 lines changed

CHANGELOG.md

Lines changed: 3 additions & 1 deletion
@@ -1,4 +1,4 @@
-## 0.13.7-dev6
+## 0.13.7-dev7
 
 ### Enhancements
 
@@ -7,6 +7,8 @@
 
 ### Features
 
+* **add ability to get ratio of `cid` characters in embedded text extracted by `pdfminer`**.
+
 ### Fixes
 
 * **`partition_docx()` handles short table rows.** The DOCX format allows a table row to start late and/or end early, meaning cells at the beginning or end of a row can be omitted. While there are legitimate uses for this capability, using it in practice is relatively rare. However, it can happen unintentionally when adjusting cell borders with the mouse. Accommodate this case and generate accurate `.text` and `.metadata.text_as_html` for these tables.

test_unstructured/partition/pdf_image/test_pdf_image_utils.py

Lines changed: 27 additions & 0 deletions
@@ -179,6 +179,33 @@ def test_valid_text(text, outcome):
     assert pdf_image_utils.valid_text(text) == outcome
 
 
+@pytest.mark.parametrize(
+    ("text", "expected"),
+    [
+        ("base", 0.0),
+        ("", 0.0),
+        ("(cid:2)", 1.0),
+        ("(cid:1)a", 0.5),
+        ("c(cid:1)ab", 0.25),
+    ],
+)
+def test_cid_ratio(text, expected):
+    assert pdf_image_utils.cid_ratio(text) == expected
+
+
+@pytest.mark.parametrize(
+    ("text", "expected"),
+    [
+        ("base", False),
+        ("(cid:2)", True),
+        ("(cid:1234567890)", True),
+        ("jkl;(cid:12)asdf", True),
+    ],
+)
+def test_is_cid_present(text, expected):
+    assert pdf_image_utils.is_cid_present(text) == expected
+
+
 def test_pad_bbox():
     bbox = (100, 100, 200, 200)
     padding = (10, 20)  # Horizontal padding 10, Vertical padding 20
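
The expected values in `test_cid_ratio` follow from counting each `(cid:N)` token as a single character. Below is a minimal standalone sketch of that arithmetic. It re-implements the counting locally instead of importing `unstructured`, and the helper name `cid_breakdown` is hypothetical and not part of this commit:

```python
import re
from typing import Tuple

# Same pattern the commit uses to match a pdfminer "(cid:N)" placeholder.
CID_PATTERN = r"\(cid\:(\d+)\)"


def cid_breakdown(text: str) -> Tuple[int, int, float]:
    """Return (cid tokens, other characters, ratio). Hypothetical helper for illustration."""
    # re.subn returns the text with matches removed plus the number of matches.
    remainder, n_cid = re.subn(CID_PATTERN, "", text)
    total = n_cid + len(remainder)
    # Guard the empty-string case so we never divide by zero.
    ratio = n_cid / total if total else 0.0
    return n_cid, len(remainder), ratio


for sample in ["base", "(cid:2)", "(cid:1)a", "c(cid:1)ab"]:
    print(repr(sample), cid_breakdown(sample))
```

For `"c(cid:1)ab"` this gives one cid token and three ordinary characters, so the ratio is 1 / 4 = 0.25, matching the parametrized case.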

unstructured/__version__.py

Lines changed: 1 addition & 1 deletion
@@ -1 +1 @@
-__version__ = "0.13.7-dev6"  # pragma: no cover
+__version__ = "0.13.7-dev7"  # pragma: no cover

unstructured/partition/pdf_image/pdf_image_utils.py

Lines changed: 18 additions & 0 deletions
@@ -1,5 +1,6 @@
 import base64
 import os
+import re
 import tempfile
 from copy import deepcopy
 from io import BytesIO
@@ -230,6 +231,23 @@ def valid_text(text: str) -> bool:
     return "(cid:" not in text
 
 
+def cid_ratio(text: str) -> float:
+    """Gets ratio of unknown 'cid' characters extracted from text to all characters."""
+    if not is_cid_present(text):
+        return 0.0
+    cid_pattern = r"\(cid\:(\d+)\)"
+    unmatched, n_cid = re.subn(cid_pattern, "", text)
+    total = n_cid + len(unmatched)
+    return n_cid / total
+
+
+def is_cid_present(text: str) -> bool:
+    """Checks if a cid code is present in a text selection."""
+    if len(text) < len("(cid:x)"):
+        return False
+    return text.find("(cid:") != -1
+
+
 def annotate_layout_elements_with_image(
     inferred_page_layout: "PageLayout",
     extracted_page_layout: Optional["PageLayout"],
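
One way a caller might use the new ratio is to decide whether embedded text is too garbled to trust. The sketch below is an assumption about usage, not something this commit implements: `should_discard_embedded_text` and the `0.1` threshold are hypothetical, and `cid_ratio` is re-declared locally (without the `is_cid_present` short-circuit) so the example runs without `unstructured` installed.

```python
import re


def cid_ratio(text: str) -> float:
    """Local copy of the commit's logic: (cid:N) tokens over total characters."""
    remainder, n_cid = re.subn(r"\(cid\:(\d+)\)", "", text)
    total = n_cid + len(remainder)
    return n_cid / total if total else 0.0


def should_discard_embedded_text(text: str, threshold: float = 0.1) -> bool:
    """Hypothetical policy: reject text whose cid ratio exceeds the threshold."""
    return cid_ratio(text) > threshold


# Clean text has no cid tokens, so it is kept.
print(should_discard_embedded_text("Invoice total: 42"))
# Three cid tokens against six ordinary characters gives a ratio of 3 / 9.
print(should_discard_embedded_text("(cid:3)(cid:17)(cid:9) total"))
```

A ratio is more forgiving than the boolean `valid_text` check, since a single stray `(cid:N)` in a long passage no longer condemns the whole extraction. The right threshold depends on the documents being processed.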
