Commit 7a74cdd

feat: add partition_email cleaning brick (#104)
* fix for processing deeply embedded list elements
* fix types in mime encodings cleaner
* first pass on partition_email
* tests for email
* test for mime encodings
* changelog bump
* added note about \n=
* linting, linting, linting
* added email docs
* add partition_email to the readme
* add one more test
1 parent 1d68bb2 commit 7a74cdd

10 files changed: +260 -4 lines changed

CHANGELOG.md

Lines changed: 6 additions & 0 deletions
@@ -1,3 +1,9 @@
+## 0.3.3-dev1
+
+* Adds the `partition_email` partitioning brick
+* Adds the `replace_mime_encodings` cleaning brick
+* Small fix to HTML parsing related to processing list items with sub-tags
+
 ## 0.3.2
 
 * Added `translate_text` brick for translating text between languages

README.md

Lines changed: 43 additions & 1 deletion
@@ -19,7 +19,7 @@
 <div align="center">
 <img src="https://user-images.githubusercontent.com/38184042/205945013-99670127-0bf3-4851-b4ac-0bc23e357476.gif" title="unstructured in action!">
 </div>
-
+
 <h3 align="center">
 <p>Open-Source Pre-Processing Tools for Unstructured Data</p>
 </h3>
@@ -148,6 +148,48 @@ has an `element` attribute consisting of `Element` objects. Sub-types of the `El
 represent different components of a document, such as `NarrativeText` and `Title`. You can use
 these normalized elements to zero in on the components of a document you most care about.
 
+### E-mail Parsing
+
+The `partition_email` function within `unstructured` is helpful for parsing `.eml` files. Common
+e-mail clients such as Microsoft Outlook and Gmail support exporting e-mails as `.eml` files.
+`partition_email` accepts filenames, file-like objects, and raw text as input. The following
+three snippets for parsing `.eml` files are equivalent:
+
+```python
+from unstructured.partition.email import partition_email
+
+elements = partition_email(filename="example-docs/fake-email.eml")
+
+with open("example-docs/fake-email.eml", "r") as f:
+    elements = partition_email(file=f)
+
+with open("example-docs/fake-email.eml", "r") as f:
+    text = f.read()
+elements = partition_email(text=text)
+```
+
+The `elements` output will look like the following:
+
+```python
+[<unstructured.documents.html.HTMLNarrativeText at 0x13ab14370>,
+<unstructured.documents.html.HTMLTitle at 0x106877970>,
+<unstructured.documents.html.HTMLListItem at 0x1068776a0>,
+<unstructured.documents.html.HTMLListItem at 0x13fe4b0a0>]
+```
+
+Run `print("\n\n".join([str(el) for el in elements]))` to get a string representation of the
+output, which looks like:
+
+```python
+This is a test email to use for unit tests.
+
+Important points:
+
+Roses are red
+
+Violets are blue
+```
+
 ## :guardsman: Security Policy
 
 See our [security policy](https://github.com/Unstructured-IO/unstructured/security/policy) for
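
The README's claim that the three input modes are equivalent can be checked with the standard library alone. The sketch below is an illustration and not part of the commit: `RAW_EML` is a hypothetical hand-written stand-in for `example-docs/fake-email.eml`, and the code only exercises `email.message_from_file` and `email.message_from_string`, the two parsers that `partition_email` calls in this commit.

```python
import email
import io
import os
import tempfile

# Hypothetical minimal message, standing in for example-docs/fake-email.eml.
RAW_EML = (
    "Subject: Test Email\n"
    'Content-Type: text/plain; charset="UTF-8"\n'
    "\n"
    "This is a test email to use for unit tests.\n"
)

# 1. From a filename: write the message to disk, then parse the open file.
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "example.eml")
    with open(path, "w") as f:
        f.write(RAW_EML)
    with open(path, "r") as f:
        from_filename = email.message_from_file(f)

# 2. From a file-like object: read it, then parse the resulting string.
from_file = email.message_from_string(io.StringIO(RAW_EML).read())

# 3. From raw text.
from_text = email.message_from_string(RAW_EML)

# All three routes should yield the same subject and body.
payloads = {m.get_payload() for m in (from_filename, from_file, from_text)}
subjects = {m["Subject"] for m in (from_filename, from_file, from_text)}
print(len(payloads), len(subjects))
```

Because every route ends in the same parsed message, everything downstream of parsing sees identical input, which is why the three README snippets return the same elements.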

docs/source/bricks.rst

Lines changed: 24 additions & 0 deletions
@@ -54,6 +54,30 @@ Examples:
   elements = partition_pdf("example-docs/layout-parser-paper-fast.pdf")
 
 
+``partition_email``
+---------------------
+
+The ``partition_email`` function partitions ``.eml`` documents and works with exports
+from email clients such as Microsoft Outlook and Gmail. The function
+takes a filename, file-like object, or raw text as input and produces a list of
+document ``Element`` objects as output.
+
+Examples:
+
+.. code:: python
+
+  from unstructured.partition.email import partition_email
+
+  elements = partition_email(filename="example-docs/fake-email.eml")
+
+  with open("example-docs/fake-email.eml", "r") as f:
+      elements = partition_email(file=f)
+
+  with open("example-docs/fake-email.eml", "r") as f:
+      text = f.read()
+  elements = partition_email(text=text)
+
+
 ``is_bulleted_text``
 ----------------------
 

example-docs/fake-email.eml

Lines changed: 24 additions & 0 deletions
@@ -0,0 +1,24 @@
+MIME-Version: 1.0
+Date: Fri, 16 Dec 2022 17:04:16 -0500
+Message-ID: <CADc-_xaLB2FeVQ7mNsoX+NJb_7hAJhBKa_zet-rtgPGenj0uVw@mail.gmail.com>
+Subject: Test Email
+From: Matthew Robinson <[email protected]>
+To: Matthew Robinson <[email protected]>
+Content-Type: multipart/alternative; boundary="00000000000095c9b205eff92630"
+
+--00000000000095c9b205eff92630
+Content-Type: text/plain; charset="UTF-8"
+
+This is a test email to use for unit tests.
+
+Important points:
+
+- Roses are red
+- Violets are blue
+
+--00000000000095c9b205eff92630
+Content-Type: text/html; charset="UTF-8"
+
+<div dir="ltr"><div>This is a test email to use for unit tests.</div><div><br></div><div>Important points:</div><div><ul><li>Roses are red</li><li>Violets are blue</li></ul></div></div>
+
+--00000000000095c9b205eff92630--
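
A `multipart/alternative` message like this fixture carries the same content twice, once as `text/plain` and once as `text/html`. As a rough sketch of how those parts surface when the message is walked, the snippet below parses a shortened, hypothetical copy of the fixture (`RAW_EML`, with a made-up `BOUNDARY` marker) using only the standard library `email` package.

```python
import email

# Shortened, hypothetical copy of the fixture; "BOUNDARY" is a made-up marker.
RAW_EML = """\
MIME-Version: 1.0
Subject: Test Email
Content-Type: multipart/alternative; boundary="BOUNDARY"

--BOUNDARY
Content-Type: text/plain; charset="UTF-8"

Important points:

- Roses are red

--BOUNDARY
Content-Type: text/html; charset="UTF-8"

<div><div>Important points:</div><ul><li>Roses are red</li></ul></div>

--BOUNDARY--
"""

msg = email.message_from_string(RAW_EML)

# walk() yields the container first, then each sub-part in document order.
content_types = [part.get_content_type() for part in msg.walk()]
print(content_types)

# Pick out the HTML alternative, the only source this commit accepts.
html_part = next(p for p in msg.walk() if p.get_content_type() == "text/html")
print(html_part.get_payload().strip())
```

The container itself shows up in the walk with content type `multipart/alternative`, so any mapping keyed on content type will include an entry for it alongside the two leaf parts.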

test_unstructured/cleaners/test_core.py

Lines changed: 8 additions & 0 deletions
@@ -29,6 +29,14 @@ def test_replace_unicode_quotes(text, expected):
     assert core.replace_unicode_quotes(text=text) == expected
 
 
+@pytest.mark.parametrize(
+    "text, expected",
+    [("5 w=E2=80=99s", "5 w’s")],
+)
+def test_replace_mime_encodings(text, expected):
+    assert core.replace_mime_encodings(text=text) == expected
+
+
 @pytest.mark.parametrize(
     "text, expected",
     [
Lines changed: 61 additions & 0 deletions
@@ -0,0 +1,61 @@
+import os
+import pathlib
+import pytest
+
+from unstructured.documents.elements import NarrativeText, Title, ListItem
+from unstructured.partition.email import partition_email
+
+
+DIRECTORY = pathlib.Path(__file__).parent.resolve()
+
+
+EXPECTED_OUTPUT = [
+    NarrativeText(text="This is a test email to use for unit tests."),
+    Title(text="Important points:"),
+    ListItem(text="Roses are red"),
+    ListItem(text="Violets are blue"),
+]
+
+
+def test_partition_email_from_filename():
+    filename = os.path.join(DIRECTORY, "..", "..", "example-docs", "fake-email.eml")
+    elements = partition_email(filename=filename)
+    assert len(elements) > 0
+    assert elements == EXPECTED_OUTPUT
+
+
+def test_partition_email_from_file():
+    filename = os.path.join(DIRECTORY, "..", "..", "example-docs", "fake-email.eml")
+    with open(filename, "r") as f:
+        elements = partition_email(file=f)
+    assert len(elements) > 0
+    assert elements == EXPECTED_OUTPUT
+
+
+def test_partition_email_from_text():
+    filename = os.path.join(DIRECTORY, "..", "..", "example-docs", "fake-email.eml")
+    with open(filename, "r") as f:
+        text = f.read()
+    elements = partition_email(text=text)
+    assert len(elements) > 0
+    assert elements == EXPECTED_OUTPUT
+
+
+def test_partition_email_raises_with_none_specified():
+    with pytest.raises(ValueError):
+        partition_email()
+
+
+def test_partition_email_raises_with_too_many_specified():
+    filename = os.path.join(DIRECTORY, "..", "..", "example-docs", "fake-email.eml")
+    with open(filename, "r") as f:
+        text = f.read()
+
+    with pytest.raises(ValueError):
+        partition_email(filename=filename, text=text)
+
+
+def test_partition_email_raises_with_invalid_content_type():
+    filename = os.path.join(DIRECTORY, "..", "..", "example-docs", "fake-email.eml")
+    with pytest.raises(ValueError):
+        partition_email(filename=filename, content_source="application/json")
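
The two `raises` tests above pin down an "exactly one input" contract. A minimal sketch of that contract in isolation follows; `check_exactly_one` is a hypothetical helper that does not exist in `unstructured`, and it compares against `None`, whereas the commit's own checks rely on truthiness, so the two differ for inputs such as an empty string.

```python
from typing import IO, Optional


def check_exactly_one(
    filename: Optional[str] = None,
    file: Optional[IO] = None,
    text: Optional[str] = None,
) -> str:
    """Return the name of the single supplied input, or raise ValueError.

    Hypothetical helper for illustration; not part of the commit.
    """
    supplied = {
        name: value
        for name, value in (("filename", filename), ("file", file), ("text", text))
        if value is not None
    }
    if not supplied:
        raise ValueError("One of filename, file, or text must be specified.")
    if len(supplied) > 1:
        raise ValueError("Only one of filename, file, or text can be specified.")
    return next(iter(supplied))


print(check_exactly_one(text="Subject: hi\n\nbody"))
```

Returning the name of the chosen input is a design choice of this sketch only; it lets a caller dispatch on the result instead of repeating the three-way `if`/`elif` chain.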

unstructured/__version__.py

Lines changed: 1 addition & 1 deletion
@@ -1 +1 @@
-__version__ = "0.3.2" # pragma: no cover
+__version__ = "0.3.3-dev1" # pragma: no cover

unstructured/cleaners/core.py

Lines changed: 12 additions & 0 deletions
@@ -1,6 +1,8 @@
 import re
 import sys
 import unicodedata
+import quopri
+
 from unstructured.nlp.patterns import UNICODE_BULLETS_RE
 
 
@@ -81,6 +83,16 @@ def clean_trailing_punctuation(text: str) -> str:
     return text.strip().rstrip(".,:;")
 
 
+def replace_mime_encodings(text: str) -> str:
+    """Replaces MIME encodings with their UTF-8 equivalent characters.
+
+    Example
+    -------
+    5 w=E2=80=99s -> 5 w’s
+    """
+    return quopri.decodestring(text.encode()).decode("utf-8")
+
+
 def clean_prefix(text: str, pattern: str, ignore_case: bool = False, strip: bool = True) -> str:
     """Removes prefixes from a string according to the specified pattern. Strips leading
     whitespace if the strip parameter is set to True.
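
To make the behaviour of `replace_mime_encodings` concrete, here is a standalone sketch of the same `quopri` round trip on the string used in the unit test. It reproduces the one-line body of the function outside the library, so nothing here depends on `unstructured` being installed.

```python
import quopri

# "=E2=80=99" is the quoted-printable form of the three UTF-8 bytes that
# encode U+2019 (right single quotation mark).
encoded = "5 w=E2=80=99s"

# Same steps as the function body: str -> bytes -> decoded bytes -> str.
decoded = quopri.decodestring(encoded.encode()).decode("utf-8")

print(decoded)
print(decoded == "5 w\u2019s")
```

Text that contains no `=XX` sequences passes through unchanged, which is why the function can be applied to every text element without first checking whether it was encoded.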

unstructured/documents/html.py

Lines changed: 7 additions & 2 deletions
@@ -225,8 +225,13 @@ def _construct_text(tag_elem: etree.Element) -> str:
     return text.strip()
 
 
-def _is_text_tag(tag_elem: etree.Element) -> bool:
+def _is_text_tag(tag_elem: etree.Element, max_predecessor_len: int = 5) -> bool:
     """Determines if a tag potentially contains narrative text."""
+    # NOTE(robinson) - Only consider elements with limited depth. Otherwise,
+    # it could be the text representation of a giant div
+    if len(tag_elem) > max_predecessor_len:
+        return False
+
     if tag_elem.tag in TEXT_TAGS + HEADING_TAGS:
         return True
 
@@ -250,7 +255,7 @@ def _process_list_item(
     we can skip processing if bullets are found in a div element."""
     if tag_elem.tag in LIST_ITEM_TAGS:
         text = _construct_text(tag_elem)
-        return HTMLListItem(text=text, tag=tag_elem.tag), None
+        return HTMLListItem(text=text, tag=tag_elem.tag), tag_elem
 
     elif tag_elem.tag == "div":
         text = _construct_text(tag_elem)
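
The new guard in `_is_text_tag` relies on `len(tag_elem)`. For element trees, `len()` returns the number of direct children, not the nesting depth, so despite the "limited depth" wording in the comment the check caps how many child tags an element may have. The sketch below illustrates this with the standard library's `xml.etree.ElementTree` as a stand-in for `lxml.etree` (an assumption made so the example has no third-party dependency; `within_child_limit` is a hypothetical name).

```python
import xml.etree.ElementTree as ET


def within_child_limit(elem: ET.Element, max_children: int = 5) -> bool:
    """Hypothetical mirror of the guard: True when elem has few direct children."""
    return len(elem) <= max_children


# One direct child (<b>), so this would still be considered a text tag.
small = ET.fromstring("<p>Roses are <b>red</b></p>")

# Six direct children, so the guard rejects it as a "giant div".
wide = ET.fromstring("<div>" + "<p>item</p>" * 6 + "</div>")

# Deeply nested but only one direct child at the top: the guard does not fire.
deep = ET.fromstring("<div><div><div><p>deep</p></div></div></div>")

for name, elem in (("small", small), ("wide", wide), ("deep", deep)):
    print(name, len(elem), within_child_limit(elem))
```

If a true depth limit were wanted, it would need a recursive or iterative walk of the subtree; the commit does not do that.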

unstructured/partition/email.py

Lines changed: 74 additions & 0 deletions
@@ -0,0 +1,74 @@
+import email
+from typing import Dict, Final, IO, List, Optional
+
+from unstructured.cleaners.core import replace_mime_encodings
+from unstructured.documents.elements import Element, Text
+from unstructured.partition.html import partition_html
+
+
+VALID_CONTENT_SOURCES: Final[List[str]] = ["text/html"]
+
+
+def partition_email(
+    filename: Optional[str] = None,
+    file: Optional[IO] = None,
+    text: Optional[str] = None,
+    content_source: str = "text/html",
+) -> List[Element]:
+    """Partitions an .eml document into its constituent elements.
+    Parameters
+    ----------
+    filename
+        A string defining the target filename path.
+    file
+        A file-like object using "r" mode --> open(filename, "r").
+    text
+        The string representation of the .eml document.
+    """
+    if content_source not in VALID_CONTENT_SOURCES:
+        raise ValueError(
+            f"{content_source} is not a valid value for content_source. "
+            f"Valid content sources are: {VALID_CONTENT_SOURCES}"
+        )
+
+    if not any([filename, file, text]):
+        raise ValueError("One of filename, file, or text must be specified.")
+
+    if filename is not None and not file and not text:
+        with open(filename, "r") as f:
+            msg = email.message_from_file(f)
+
+    elif file is not None and not filename and not text:
+        file_text = file.read()
+        msg = email.message_from_string(file_text)
+
+    elif text is not None and not filename and not file:
+        _text: str = str(text)
+        msg = email.message_from_string(_text)
+
+    else:
+        raise ValueError("Only one of filename, file, or text can be specified.")
+
+    content_map: Dict[str, str] = {
+        part.get_content_type(): part.get_payload() for part in msg.walk()
+    }
+
+    content = content_map.get(content_source, "")
+    if not content:
+        raise ValueError("text/html content not found in email")
+
+    # NOTE(robinson) - In the .eml files, the HTML content gets stored in a format that
+    # looks like the following, resulting in extraneous "=" characters in the output if
+    # you don't clean it up
+    # <ul> =
+    # <li>Item 1</li>=
+    # <li>Item 2</li>=
+    # </ul>
+    content = "".join(content.split("=\n"))
+
+    elements = partition_html(text=content)
+    for element in elements:
+        if isinstance(element, Text):
+            element.apply(replace_mime_encodings)
+
+    return elements
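
The trailing `=` characters described in the `NOTE` are quoted-printable soft line breaks. The sketch below shows the commit's approach (joining on `"=\n"`) next to one possible alternative, `get_payload(decode=True)`, which decodes according to the part's `Content-Transfer-Encoding` header. The alternative is not what the commit does, and `RAW_EML` is a hypothetical message that, unlike the fixture, declares `quoted-printable` explicitly.

```python
import email

# Hypothetical single-part message with an explicit transfer encoding.
RAW_EML = (
    'Content-Type: text/html; charset="UTF-8"\n'
    "Content-Transfer-Encoding: quoted-printable\n"
    "\n"
    "<ul>=\n"
    "<li>5 w=E2=80=99s</li>=\n"
    "</ul>\n"
)

msg = email.message_from_string(RAW_EML)

# Approach used in the commit: take the raw payload and drop soft line breaks.
# Encoded bytes such as =E2=80=99 are still present at this point and are
# cleaned later, element by element, with replace_mime_encodings.
raw = msg.get_payload()
joined = "".join(raw.split("=\n"))
print(joined)

# Possible alternative: let the email package decode the payload in one step.
# It returns bytes, so the charset still has to be applied by the caller.
decoded = msg.get_payload(decode=True).decode("utf-8")
print(decoded)
```

The one-step decode only helps when the header is present; for parts without a `Content-Transfer-Encoding` header, `get_payload(decode=True)` returns the payload bytes undecoded, which may be one reason to prefer the explicit cleanup used in the commit.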
