Commit e4158de

fix(msg): use python-oxmsg for MSG email parsing (#3142)
**Summary**

`partition_msg()` previously used the `msg_parser` library for parsing Outlook MSG email files (.msg files). The `msg_parser` library is unmaintained and has several major shortcomings, such as not being able to parse MSG files with 8-bit encoded strings and not reliably extracting attachments. Use the new and permissively licenced `python-oxmsg` library instead.

**Additional Context**

For reviewability purposes, this PR temporarily places the new `partition_msg()` implementation in `new_msg.py` and references that implementation from `msg.py`. `new_msg.py` will be renamed to `msg.py` in a closely following PR. This avoids a very messy interleaving of hunks in a diff between the old and re-written `partition_msg()` implementation.

Fixes #2481
Fixes #3006
1 parent b777864 commit e4158de
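
As a rough illustration of the call path the commit message describes, the sketch below partitions one MSG file and lists each resulting element's type name and text. It assumes the `unstructured` package is installed with its `msg` extra; `summarize_msg` is a hypothetical helper introduced here for illustration and is not part of this commit.

```python
from typing import List, Tuple


def summarize_msg(path: str) -> List[Tuple[str, str]]:
    """Return (element type name, text) pairs for one Outlook MSG file.

    `summarize_msg` is a hypothetical helper, not part of `unstructured`.
    """
    # Imported lazily because MSG support is an optional extra of `unstructured`.
    from unstructured.partition.msg import partition_msg

    elements = partition_msg(filename=path)
    return [(type(element).__name__, element.text) for element in elements]
```

Calling `summarize_msg("example.msg")` (a placeholder file name) returns one pair per partitioned element; which element types appear depends on the message content.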

10 files changed: +593 -374 lines changed

CHANGELOG.md

Lines changed: 5 additions & 1 deletion
@@ -1,13 +1,17 @@
-## 0.14.5-dev1
+## 0.14.5-dev2

 ### Enhancements

 * **Filtering for tar extraction** Adds tar filtering to the compression module for connectors to avoid decompression malicious content in `.tar.gz` files. This was added to the Python `tarfile` lib in Python 3.12. The change only applies when using Python 3.12 and above.
+* **Use `python-oxmsg` for `partition_msg()`.** Outlook MSG emails are now partitioned using the `python-oxmsg` package which resolves some shortcomings of the prior MSG parser.

 ### Features

 ### Fixes

+* **8-bit string Outlook MSG files are parsed.** `partition_msg()` is now able to parse non-unicode Outlook MSG emails.
+* **Attachments to Outlook MSG files are extracted intact.** `partition_msg()` is now able to extract attachments without corruption.
+
 ## 0.14.4

 ### Enhancements
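
The tar-filtering entry in the changelog context above refers to the extraction filters in Python's `tarfile` module. A minimal sketch of that idea, assuming the standard-library `"data"` filter and the same Python 3.12 version gate the entry mentions, might look like the following. `extract_tar_gz` is a hypothetical name, and this is not the project's compression-module code.

```python
import sys
import tarfile


def extract_tar_gz(archive_path: str, dest_dir: str) -> None:
    """Extract a .tar.gz archive, applying tarfile's "data" filter on Python 3.12+.

    Hypothetical helper for illustration only.
    """
    with tarfile.open(archive_path, "r:gz") as tar:
        if sys.version_info >= (3, 12):
            # The "data" filter refuses members that would land outside
            # dest_dir and links that point outside it.
            tar.extractall(path=dest_dir, filter="data")
        else:
            # Older interpreters: this sketch applies no filtering.
            tar.extractall(path=dest_dir)
```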

Makefile

Lines changed: 3 additions & 1 deletion
@@ -343,7 +343,9 @@ test-extra-markdown:

 .PHONY: test-extra-msg
 test-extra-msg:
-	PYTHONPATH=. CI=$(CI) pytest test_unstructured/partition/test_msg.py
+	# NOTE(scanny): exclude attachment test because partitioning attachments requires other extras
+	PYTHONPATH=. CI=$(CI) pytest test_unstructured/partition/test_msg.py \
+		-k "not test_partition_msg_can_process_attachments"

 .PHONY: test-extra-odt
 test-extra-odt:

pyproject.toml

Lines changed: 1 addition & 0 deletions
@@ -31,6 +31,7 @@ lint.select = [
 ]
 lint.ignore = [
     "COM812", # -- over aggressively insists on trailing commas where not desireable --
+    "PT001", # -- wants empty parens on @pytest.fixture where not used (essentially always) --
     "PT005", # -- flags mock fixtures with names intentionally matching private method name --
     "PT011", # -- pytest.raises({exc}) too broad, use match param or more specific exception --
     "PT012", # -- pytest.raises() block should contain a single simple statement --

requirements/extra-msg.in

Lines changed: 1 addition & 1 deletion
@@ -1,4 +1,4 @@
 -c ./deps/constraints.txt
 -c base.txt

-msg_parser
+python-oxmsg

requirements/extra-msg.txt

Lines changed: 11 additions & 3 deletions
@@ -4,7 +4,15 @@
 #
 #    pip-compile ./extra-msg.in
 #
-msg-parser==1.2.0
-    # via -r ./extra-msg.in
+click==8.1.7
+    # via
+    #   -c ./base.txt
+    #   python-oxmsg
 olefile==0.47
-    # via msg-parser
+    # via python-oxmsg
+python-oxmsg==0.0.1
+    # via -r ./extra-msg.in
+typing-extensions==4.12.1
+    # via
+    #   -c ./base.txt
+    #   python-oxmsg
