Commit 1eceac2

rfctr(email): eml partitioner rewrite (#3694)
**Summary**

Initial attempts to incrementally refactor `partition_email()` into a shape that allows pluggable partitioning quickly became too complex for ready code review. Prepare a separate rewritten module and tests and swap them out whole.

**Additional Context**

- Uses the modern stdlib `email` module to reliably accomplish several decoding steps that the legacy code performed manually.
- Remove obsolete email-specific element-types, which were replaced 18 months or so ago with email-specific metadata fields for things like Cc: addresses, subject, etc.
- Remove accepting an email as `text: str` because MIME email is inherently a binary format which can and often does contain multiple and contradictory character encodings.
- Remove `encoding` parameters, as they are now unused. An email file is not a text file and as such does not have a single overall encoding. Character encoding is specified individually for each MIME part within the message and often varies from one part to another in the same message.
- Remove the need for a caller to specify `attachment_partitioner`. There is only one reasonable choice for this, which is `auto.partition()`, consistent with the same interface and operation in `partition_msg()`.
- Fixes #3671 along the way by silently skipping attachments with a file-type for which there is no partitioner.
- Substantially extend the test-suite to cover multiple transport-encoding/charset combinations.

---------

Co-authored-by: ryannikolaidis <[email protected]>
Co-authored-by: scanny <[email protected]>
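The point about per-part character encodings can be illustrated with a minimal sketch using only the stdlib `email` package. The message bytes, the `decode_text_parts` helper, and the `PARTS` name are hypothetical and are not taken from the rewritten module; the sketch only shows why a single `encoding` argument cannot describe a whole message.

```python
from email import message_from_bytes, policy

# Hypothetical two-part message: the same word encoded with two different charsets.
RAW = (
    b"MIME-Version: 1.0\r\n"
    b'Content-Type: multipart/mixed; boundary="b"\r\n'
    b"\r\n"
    b"--b\r\n"
    b'Content-Type: text/plain; charset="utf-8"\r\n'
    b"Content-Transfer-Encoding: 8bit\r\n"
    b"\r\n"
    + "caf\u00e9".encode("utf-8")
    + b"\r\n--b\r\n"
    b'Content-Type: text/plain; charset="iso-8859-1"\r\n'
    b"Content-Transfer-Encoding: 8bit\r\n"
    b"\r\n"
    + "caf\u00e9".encode("iso-8859-1")
    + b"\r\n--b--\r\n"
)


def decode_text_parts(raw: bytes) -> list[tuple[str, str]]:
    """Return (charset, text) for each text part, decoded with that part's own charset."""
    msg = message_from_bytes(raw, policy=policy.default)
    return [
        (part.get_content_charset() or "us-ascii", part.get_content().strip())
        for part in msg.walk()
        if part.get_content_maintype() == "text"
    ]


PARTS = decode_text_parts(RAW)
```

Both parts decode to the same text even though their bytes differ. Decoding `RAW` as a whole with `RAW.decode("utf-8")` raises `UnicodeDecodeError`, which is the situation the removed `text: str` and `encoding` parameters could not represent.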
1 parent 9049e4e commit 1eceac2

29 files changed, +2061 -1195 lines changed

CHANGELOG.md

Lines changed: 10 additions & 0 deletions
@@ -1,3 +1,13 @@
+## 0.16.1-dev0
+
+### Enhancements
+
+### Features
+
+### Fixes
+
+* **Rewrite of `partition.email` module and tests.** Use modern Python stdlib `email` module interface to parse email messages and attachments. This change shortens and simplifies the code, and makes it more robust and maintainable. Several historical problems were remedied in the process.
+
 ## 0.16.0
 
 ### Enhancements

example-docs/eml/empty.eml

Lines changed: 3 additions & 0 deletions
@@ -0,0 +1,3 @@
+
+
+

example-docs/eml/mime-attach-mp3.eml

Lines changed: 934 additions & 0 deletions
Large diffs are not rendered by default.
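The commit message says attachments whose file type has no partitioner are now skipped silently. A minimal sketch of that behavior with the stdlib follows. `SUPPORTED_TYPES`, `build_sample`, and `partitionable_attachments` are hypothetical names; the real set of partitionable types and the real dispatch logic are not visible on this page.

```python
from email import message_from_bytes, policy
from email.message import EmailMessage

# Hypothetical allow-list, for illustration only; not the library's actual list.
SUPPORTED_TYPES = {"text/plain", "text/html", "application/pdf"}


def build_sample() -> bytes:
    """Build a message with a body, an MP3 attachment, and a text attachment."""
    msg = EmailMessage()
    msg["Subject"] = "Attachments"
    msg.set_content("Body text.")
    msg.add_attachment(b"ID3", maintype="audio", subtype="mpeg", filename="song.mp3")
    msg.add_attachment("A note.", filename="note.txt")
    return msg.as_bytes()


def partitionable_attachments(raw: bytes) -> list[str]:
    """Return filenames of attachments whose MIME type is in the allow-list."""
    msg = message_from_bytes(raw, policy=policy.default)
    kept = []
    for part in msg.iter_attachments():
        if part.get_content_type() not in SUPPORTED_TYPES:
            continue  # skip silently instead of raising
        kept.append(part.get_filename())
    return kept


KEPT = partitionable_attachments(build_sample())
```

In this sketch the `audio/mpeg` attachment is dropped without an error and only `note.txt` is kept.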
Lines changed: 34 additions & 0 deletions
@@ -0,0 +1,34 @@
+
+
+Date: Tue, 01 Oct 2024 12:34:56 -0500
+Subject: Example MIME Email
+MIME-Version: 1.0
+Content-Type: multipart/alternative; boundary="boundary123"
+
+--boundary123
+Content-Type: text/plain; charset="UTF-8"
+Content-Transfer-Encoding: 7bit
+
+This is the text/plain part.
+
+Did you know that the first email was sent by Ray Tomlinson in 1971? He used the "@" symbol to separate the user's name from the computer name, a practice that is still in use today.
+
+Another interesting fact is that the first known instance of email spam occurred in 1978. A marketing message was sent to 393 recipients on ARPANET, marking the beginning of what we now know as email spam.
+
+--boundary123
+Content-Type: text/html; charset="UTF-8"
+Content-Transfer-Encoding: 7bit
+
+<!DOCTYPE html>
+<html>
+<head>
+<title>Example MIME Email</title>
+</head>
+<body>
+<p>This is the <code>text/html</code> part.</p>
+<p>Did you know that the first <b>networked email</b> was sent by Ray Tomlinson in 1971? He used the "@" symbol to separate the user's name from the computer name, a practice that is still in use today.</p>
+<p>Another interesting fact is that the first known instance of <i>email spam</i> occurred in 1978. A marketing message was sent to 393 recipients on ARPANET, marking the beginning of what we now know as email spam.</p>
+</body>
+</html>
+
+--boundary123--
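A `multipart/alternative` message like the one above carries the same content in two renderings. As a sketch of how the stdlib can select one of them, `EmailMessage.get_body()` takes a `preferencelist`. The `build_alternative` and `body_type` helpers are hypothetical, and which rendering the rewritten partitioner actually prefers is not stated on this page.

```python
from email import message_from_bytes, policy
from email.message import EmailMessage


def build_alternative() -> bytes:
    """Build a small multipart/alternative message with plain and HTML bodies."""
    msg = EmailMessage()
    msg["Subject"] = "Example MIME Email"
    msg.set_content("This is the text/plain part.")
    msg.add_alternative("<p>This is the text/html part.</p>", subtype="html")
    return msg.as_bytes()


def body_type(raw: bytes, preference: tuple[str, ...]) -> str:
    """Return the content type of the body chosen for the given preference order."""
    msg = message_from_bytes(raw, policy=policy.default)
    body = msg.get_body(preferencelist=preference)
    return body.get_content_type() if body is not None else "none"


RAW_ALT = build_alternative()
HTML_FIRST = body_type(RAW_ALT, ("html", "plain"))
PLAIN_FIRST = body_type(RAW_ALT, ("plain", "html"))
```

The order of `preferencelist` alone decides which part is returned, so the choice stays in one place rather than in hand-written part-walking code.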
Lines changed: 14 additions & 0 deletions
@@ -0,0 +1,14 @@
+MIME-Version: 1.0
+
+
+Date: Tue, 01 Oct 2024 12:34:56 -0500
+Subject: Example HTML Only MIME Email
+Content-Type: text/html; charset="ISO-8859-1"
+Content-Transfer-Encoding: base64
+
+PHA+VGhpcyBpcyBhIHRleHQvaHRtbCBwYXJ0LjwvcD4KPGRpdiBpZD0iY29udGVudCI+PHA+VGhl
+IGZpcnN0IGVtb3RpY29uLCA6KSAsIHdhcyBwcm9wb3NlZCBieSBTY290dCBGYWhsbWFuIGluIDE5
+ODIgdG8gaW5kaWNhdGUganVzdCBvciBzYXJjYXNtIGluIHRleHQgZW1haWxzLjwvcD4KPHA+R21h
+aWwgd2FzIGxhdW5jaGVkIGJ5IEdvb2dsZSBpbiAyMDA0IHdpdGggMSBHQiBvZiBmcmVlIHN0b3Jh
+Z2UsIHNpZ25pZmljYW50bHkgbW9yZSB0aGFuIHdoYXQgb3RoZXIgc2VydmljZXMgb2ZmZXJlZCBh
+dCB0aGUgdGltZS48L3A+PC9kaXY+
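This example file layers a transfer encoding (base64) over a charset (ISO-8859-1). The sketch below compares undoing both layers by hand with letting the stdlib do it through `get_content()`. The `HTML` body and all variable names are hypothetical stand-ins, not the payload of the example file.

```python
import base64
from email import message_from_bytes, policy

HTML = "<p>caf\u00e9</p>"  # hypothetical body, not the payload from the example file

RAW_B64 = (
    b"MIME-Version: 1.0\r\n"
    b'Content-Type: text/html; charset="ISO-8859-1"\r\n'
    b"Content-Transfer-Encoding: base64\r\n"
    b"\r\n"
    + base64.b64encode(HTML.encode("iso-8859-1"))
    + b"\r\n"
)

b64_msg = message_from_bytes(RAW_B64, policy=policy.default)

# Manual route: undo the transfer encoding, then apply the declared charset.
manual = base64.b64decode(b64_msg.get_payload()).decode(b64_msg.get_content_charset())

# stdlib route: get_content() performs both steps for text parts.
automatic = b64_msg.get_content()
```

Both routes yield the same string; the stdlib route avoids branching on `Content-Transfer-Encoding` in caller code.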
Lines changed: 10 additions & 0 deletions
@@ -0,0 +1,10 @@
+
+
+
+
+Subject: Example Plain-Text MIME Message
+Message-ID: <[email protected]>
+MIME-Version: 1.0
+Content-Type: text/plain; charset="UTF-8"
+
+This is a plain-text message.
Lines changed: 37 additions & 0 deletions
@@ -0,0 +1,37 @@
+
+
+
+
+Subject: Example Multipart Digest Email
+Message-ID: <[email protected]>
+MIME-Version: 1.0
+Content-Type: multipart/digest; boundary="boundary123"
+
+--boundary123
+Content-Type: message/rfc822
+
+
+
+Subject: First Message
+
+This is the first message in the digest.
+
+--boundary123
+Content-Type: message/rfc822
+
+
+
+Subject: Second Message
+
+This is the second message in the digest.
+
+--boundary123
+Content-Type: message/rfc822
+
+
+
+Subject: Third Message
+
+This is the third message in the digest.
+
+--boundary123--
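In a `multipart/digest` message each part is itself a complete message of type `message/rfc822`. A minimal sketch of reading such a structure with the stdlib follows. The shortened `RAW_DIGEST` bytes and the `digest_entries` helper are hypothetical; how the rewritten partitioner treats digests is not shown on this page.

```python
from email import message_from_bytes, policy

# Shortened stand-in for a digest: two embedded messages instead of three.
RAW_DIGEST = (
    b"MIME-Version: 1.0\r\n"
    b"Subject: Example Multipart Digest Email\r\n"
    b'Content-Type: multipart/digest; boundary="boundary123"\r\n'
    b"\r\n"
    b"--boundary123\r\n"
    b"Content-Type: message/rfc822\r\n"
    b"\r\n"
    b"Subject: First Message\r\n"
    b"\r\n"
    b"This is the first message in the digest.\r\n"
    b"\r\n"
    b"--boundary123\r\n"
    b"Content-Type: message/rfc822\r\n"
    b"\r\n"
    b"Subject: Second Message\r\n"
    b"\r\n"
    b"This is the second message in the digest.\r\n"
    b"\r\n"
    b"--boundary123--\r\n"
)


def digest_entries(raw: bytes) -> list[tuple[str, str]]:
    """Return (subject, body text) for each embedded message in a digest."""
    msg = message_from_bytes(raw, policy=policy.default)
    entries = []
    for part in msg.iter_parts():
        if part.get_content_type() != "message/rfc822":
            continue
        inner = part.get_content()  # the embedded message object
        entries.append((str(inner["Subject"]), inner.get_content().strip()))
    return entries


ENTRIES = digest_entries(RAW_DIGEST)
```

For a `message/rfc822` part, `get_content()` returns the embedded message object, which can then be queried like any top-level message.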

example-docs/eml/mime-no-body.eml

Lines changed: 22 additions & 0 deletions
@@ -0,0 +1,22 @@
+
+
+Date: Tue, 01 Oct 2024 12:34:56 -0500
+Subject: Image Only Email
+MIME-Version: 1.0
+Content-Type: multipart/mixed; boundary="boundary123"
+
+--boundary123
+Content-Type: image/jpeg
+Content-Disposition: attachment; filename="image.jpg"
+Content-Transfer-Encoding: base64
+
+/9j/4AAQSkZJRgABAQAAAQABAAD/2wCEAAkGBxISEBAQEhISEBAWFRUVFhUVFRUWFRUWFhUWFhUV
+FRUYHSggGBolGxUVITEhJSkrLi4uFx8zODMtNygtLisBCgoKDg0OGhAQGi0fHx8rLS0rLS0rLS0t
+LS0rLS0rLS0rLS0rLS0rLS0rLS0rLS0rLS0tLS0rLS0rLS0rLS0rLf/AABEIAMgAyAMBIgACEQED
+EQH/xAAbAAEAAgMBAQAAAAAAAAAAAAAABAUCAwYBB//EAD0QAAIBAwMBBgQEBgIDCQAAAAECAwAE
+ERIhBTFBBhMiUWFxgZEykaGxFCNCUrHB0fAUM2JygpLwFySTwsL/xAAYAQEBAQEBAAAAAAAAAAAA
+AAAABQEDBP/EAB8RAQEBAQEBAQEBAQEAAAAAAAABEQIhEjEEQVFhcf/aAAwDAQACEQMRAD8A+6qK
+CiiggqCiiCooIKgqCiiCooIKgqCiiCooIKgqCiiCooIKgqCiiCooIKgqCiiCooIKgqCiiCooIKgq
+CiiCooIKgqCiiCooIKgqCiiCooIKgqCiiCooIKgqCiiCooIKgqCiiCooIKgqCiiCooIKgqCiiCo
+[Base64 encoded image data continues]
+--boundary123--
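A message with no body, such as the one above, is an edge case worth handling explicitly. The sketch below shows what the stdlib reports for an attachment-only message: `get_body()` returns `None` while `iter_attachments()` still yields the image. The `RAW_IMAGE_ONLY` bytes and variable names are hypothetical, and the base64 payload is a short stand-in rather than a valid JPEG.

```python
from email import message_from_bytes, policy

RAW_IMAGE_ONLY = (
    b"Subject: Image Only Email\r\n"
    b"MIME-Version: 1.0\r\n"
    b'Content-Type: multipart/mixed; boundary="boundary123"\r\n'
    b"\r\n"
    b"--boundary123\r\n"
    b"Content-Type: image/jpeg\r\n"
    b'Content-Disposition: attachment; filename="image.jpg"\r\n'
    b"Content-Transfer-Encoding: base64\r\n"
    b"\r\n"
    b"/9j/4AAQ\r\n"  # truncated stand-in bytes, not a valid JPEG
    b"--boundary123--\r\n"
)

image_msg = message_from_bytes(RAW_IMAGE_ONLY, policy=policy.default)

# No text/plain or text/html candidate exists, so there is no body to return.
body = image_msg.get_body(preferencelist=("html", "plain"))

# The image is still reachable as an attachment.
attachment_names = [part.get_filename() for part in image_msg.iter_attachments()]
```

Code that calls `get_body()` therefore needs a `None` check before reading content from the result.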
Lines changed: 6 additions & 0 deletions
@@ -0,0 +1,6 @@
+
+
+MIME-Version: 1.0
+Content-Type: text/plain; charset="UTF-8"
+
+This is a simple email message without a subject.

example-docs/eml/mime-no-to.eml

Lines changed: 8 additions & 0 deletions
@@ -0,0 +1,8 @@
+
+
+
+Subject: Example Plain-Text MIME Message
+MIME-Version: 1.0
+Content-Type: text/plain; charset="UTF-8"
+
+This is a plain-text message.
