Commit 749f9c6

fix: avoid divide by zero in exceeds_cap_ratio (#160)
1 parent 5d9183d commit 749f9c6

4 files changed (+8 -1 lines changed)

CHANGELOG.md

Lines changed: 4 additions & 0 deletions
@@ -1,3 +1,7 @@
+## 0.4.3-dev0
+
+* Fix in `exceeds_cap_ratio` so the function doesn't break with empty text
+
 ## 0.4.2
 
 * Added `partition_image` to process documents in an image format.

test_unstructured/partition/test_text_type.py

Lines changed: 1 addition & 0 deletions
@@ -135,6 +135,7 @@ def test_contains_verb(text, expected, monkeypatch):
         ("Intellectual Property in the United States", True),
         ("Intellectual property helps incentivize innovation.", False),
         ("THIS IS ALL CAPS. BUT IT IS TWO SENTENCES.", False),
+        ("", False),
     ],
 )
 def test_contains_exceeds_cap_ratio(text, expected, monkeypatch):

unstructured/__version__.py

Lines changed: 1 addition & 1 deletion
@@ -1 +1 @@
-__version__ = "0.4.2"  # pragma: no cover
+__version__ = "0.4.3-dev0"  # pragma: no cover

unstructured/partition/text_type.py

Lines changed: 2 additions & 0 deletions
@@ -119,6 +119,8 @@ def exceeds_cap_ratio(text: str, threshold: float = 0.3) -> bool:
         return False
 
     tokens = word_tokenize(text)
+    if len(tokens) == 0:
+        return False
     capitalized = sum([word.istitle() or word.isupper() for word in tokens])
     ratio = capitalized / len(tokens)
     return ratio > threshold
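
The effect of the new guard can be checked in isolation. What follows is a minimal sketch under stated assumptions, not the library's code: `exceeds_cap_ratio_sketch` is a hypothetical name, `str.split` stands in for `word_tokenize` so the example runs without NLTK, and the earlier checks in the real function (the ones ending in the first `return False` of the hunk) are not visible in this diff and are left out.

```python
def exceeds_cap_ratio_sketch(text: str, threshold: float = 0.3) -> bool:
    """Illustrative stand-in for the patched tail of exceeds_cap_ratio."""
    # Assumption: whitespace splitting replaces word_tokenize here,
    # so token boundaries may differ from the real function.
    tokens = text.split()

    # The guard added by this commit: with no tokens there is nothing
    # to measure, and the division below would otherwise divide by zero.
    if len(tokens) == 0:
        return False

    capitalized = sum(word.istitle() or word.isupper() for word in tokens)
    ratio = capitalized / len(tokens)
    return ratio > threshold


if __name__ == "__main__":
    for sample in ["", "Intellectual Property in the United States"]:
        print(repr(sample), exceeds_cap_ratio_sketch(sample))
```

Without the `len(tokens) == 0` check, the empty-string case would reach `capitalized / len(tokens)` with both values equal to zero and raise `ZeroDivisionError`, which is the failure the new `("", False)` test case covers.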
