Commit 030c56f

enhancement: better leaf element string check in XML parsing (#734)
* Enhance leaf element string check in XML parsing
* fix is_string check
* changelog and version

---------

Co-authored-by: Matt Robinson <[email protected]>
Co-authored-by: Matt Robinson <[email protected]>
1 parent a8a19ce commit 030c56f

3 files changed: +9 -5 lines changed

CHANGELOG.md

Lines changed: 3 additions & 3 deletions
@@ -1,9 +1,9 @@
-## 0.7.9-dev0
+## 0.7.9-dev1

 ### Enhancements

-* Adds --partition-ocr-languages to unstructured-ingest
-
+* Improvements to string check for leafs in `partition_xml`.
+* Adds --partition-ocr-languages to unstructured-ingest.

 ### Features


unstructured/__version__.py

Lines changed: 1 addition & 1 deletion
@@ -1 +1 @@
-__version__ = "0.7.9-dev0"  # pragma: no cover
+__version__ = "0.7.9-dev1"  # pragma: no cover

unstructured/partition/xml.py

Lines changed: 5 additions & 1 deletion
@@ -13,6 +13,10 @@ def is_leaf(elem):
     return not bool(elem)


+def is_string(elem):
+    return isinstance(elem, str) or (hasattr(elem, "text") and isinstance(elem.text, str))
+
+
 def get_leaf_elements(
     filename: Optional[str] = None,
     file: Optional[Union[IO, SpooledTemporaryFile]] = None,
@@ -33,7 +37,7 @@ def get_leaf_elements(

     for elem in root.findall(xml_path):
         for subelem in elem.iter():
-            if is_leaf(subelem):
+            if is_leaf(subelem) and is_string(subelem.text):
                 leaf_elements.append(subelem.text)

     return "\n".join(leaf_elements)  # type: ignore
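
To illustrate why the extra `is_string` check matters, here is a minimal, self-contained sketch. It assumes the standard library's `xml.etree.ElementTree` as the parser (the parser actually used by `partition_xml` is not shown in this diff). `leaf_text` and `SAMPLE` are hypothetical names introduced only for this example. A childless element such as `<br/>` has `text` set to `None`, and without the guard that `None` would reach `"\n".join(...)`, which raises a `TypeError`.

```python
import xml.etree.ElementTree as ET


def is_leaf(elem):
    # Same intent as the original `not bool(elem)`: an element with no
    # children is a leaf. `len(elem) == 0` is used here because truth-testing
    # an Element is deprecated in recent Python versions.
    return len(elem) == 0


def is_string(elem):
    # Copied from the diff: accepts a str, or any object whose `.text` is a str.
    return isinstance(elem, str) or (hasattr(elem, "text") and isinstance(elem.text, str))


def leaf_text(xml_string, xml_path="."):
    # Hypothetical stand-in for the loop in get_leaf_elements, reading from a
    # string instead of a filename or file object.
    root = ET.fromstring(xml_string)
    leaf_elements = []
    for elem in root.findall(xml_path):
        for subelem in elem.iter():
            # <br/> is a leaf, but its text is None, so it is skipped here.
            if is_leaf(subelem) and is_string(subelem.text):
                leaf_elements.append(subelem.text)
    return "\n".join(leaf_elements)


SAMPLE = "<doc><title>Hello</title><br/><body><p>World</p></body></doc>"
result = leaf_text(SAMPLE)
print(result)
```

With the pre-change condition (`is_leaf(subelem)` alone), the `None` text of `<br/>` would be appended to `leaf_elements` and the final `join` would fail.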
