Commit 147514f

feat: msg and email metadata (#3444)
Update partition_eml and partition_msg to capture cc, bcc, and message id fields.

Docs PR: https://github.com/Unstructured-IO/docs/pull/135/files

Testing

```
from unstructured.partition.email import partition_email
from test_unstructured.unit_utils import example_doc_path

elements = partition_email(filename=example_doc_path("eml/fake-email-header.eml"), include_headers=True)
print(elements)
elements[0].metadata.to_dict()
```

Note to reviewers: Tests in `test_unstructured/partition/test_email.py` were refactored and rearranged to group similar tests together, so it will be easiest to review those changes commit by commit.

---------

Co-authored-by: ryannikolaidis <[email protected]>
Co-authored-by: Coniferish <[email protected]>
1 parent 0f05718 commit 147514f
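
For orientation, the three headers this commit surfaces (Cc, Bcc, and Message-ID) can be read from a raw message with Python's standard-library `email` package. The sketch below is not the `unstructured` implementation: the helper `header_metadata`, the dictionary keys, and the sample addresses are all hypothetical and chosen only for illustration.

```python
from email import message_from_string
from email.utils import formataddr, getaddresses

# Hypothetical sample message; the addresses are placeholders.
RAW = """\
MIME-Version: 1.0
Date: Fri, 16 Dec 2022 17:04:16 -0500
Bcc: Hello <hello@example.com>
Cc: Fake Email <fake-email@example.com>
Message-ID: <sample-id@mail.example.com>
Subject: Test Email
From: Sender <sender@example.com>
To: Receiver <receiver@example.com>

Body text.
"""


def header_metadata(raw: str) -> dict:
    """Collect Cc, Bcc, and Message-ID from a raw RFC 5322 message string."""
    msg = message_from_string(raw)

    def addresses(name: str) -> list:
        # get_all returns every occurrence of the header; getaddresses
        # splits them into (display name, address) pairs.
        return [formataddr(pair) for pair in getaddresses(msg.get_all(name, []))]

    message_id = msg.get("Message-ID")
    return {
        # Illustrative keys, not the library's metadata field names.
        "cc": addresses("Cc"),
        "bcc": addresses("Bcc"),
        "message_id": message_id.strip("<> ") if message_id else None,
    }
```

Calling `header_metadata(RAW)` returns a plain dictionary, which makes it easy to compare by eye against the output of `elements[0].metadata.to_dict()` from the testing snippet in the commit message.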

14 files changed (+416, -325 lines changed)

CHANGELOG.md

Lines changed: 1 addition & 1 deletion

@@ -6,8 +6,8 @@
 
 ### Features
 
+* **Update partition_eml and partition_msg to capture cc, bcc, and message_id fields** Cc, bcc, and message_id information is captured in element metadata for both msg and email partitioning and `Recipient` elements are generated for cc and bcc when `include_headers=True` for email partitioning.
 * **Mark ingest as deprecated** Begin sunset of ingest code in this repo as it's been moved to a dedicated repo.
-
 * **Add `pdf_hi_res_max_pages` argument for partitioning, which allows rejecting PDF files that exceed this page number limit, when the `high_res` strategy is chosen.** By default, it will allow parsing PDF files with an unlimited number of pages.
 
 ### Fixes

example-docs/eml/fake-email-header.eml

Lines changed: 3 additions & 1 deletion

@@ -1,13 +1,15 @@
 Received: from ABCDEFG-000.ABC.guide (00.0.0.00) by ABCDEFG-000.ABC.guide
 ([ba23::58b5:2236:45g2:88h2]) with Unstructured TTTT Server (version=ABC0_0,
 cipher=ABC_ABCDE_ABC_NOPE_ABC_000_ABC_ABC000) id 00.0.000.0 via Techbox
-Transport; Wed, 20 Feb 2023 10:03:18 +1200
+Transport; Wed, 20 Feb 2023 10:03:18 +1200
 MIME-Version: 1.0
 Date: Fri, 16 Dec 2022 17:04:16 -0500
+Bcc: Hello <[email protected]>
 Message-ID: <CADc-_xaLB2FeVQ7mNsoX+NJb_7hAJhBKa_zet-rtgPGenj0uVw@mail.gmail.com>
 Subject: Test Email
 From: Matthew Robinson <[email protected]>
 To: Matthew Robinson <[email protected]>
+
 Content-Type: multipart/alternative; boundary="00000000000095c9b205eff92630"
 
 --00000000000095c9b205eff92630
13.5 KB
Binary file not shown.
