Commit e684f52

Merge pull request #374 from TeamMsgExtractor/next-release
Next release
2 parents 3cffc2e + bea3ea8 commit e684f52

83 files changed

+5702
-4220
lines changed

.gitignore

Lines changed: 0 additions & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -19,7 +19,6 @@
1919
__pycache__/
2020

2121
# Ignore new .msg files added from testing
22-
2322
/example-msg-files/expected-outputs/
2423
/example-msg-files/*.msg
2524

CHANGELOG.md

Lines changed: 56 additions & 0 deletions
@@ -1,3 +1,59 @@
1+
**v0.42.0**
2+
* [[TeamMsgExtractor #372](https://github.com/TeamMsgExtractor/msg-extractor/issues/372)] Changed the way that the save functions return a value. This makes the return value from all save functions much more informative, allowing a user to tell whether a file or a folder (or more than one) was saved by the function. It also guarantees that all classes from this module will return the relevant path(s) if data is actually saved.
3+
* [[TeamMsgExtractor #288](https://github.com/TeamMsgExtractor/msg-extractor/issues/288)] Added feature to allow attachment save functions to simply overwrite existing files of the same name. This can be done with the `overwriteExisting` keyword argument from code or the `--overwrite-existing` option from the command line.
4+
* [[TeamMsgExtractor #40](https://github.com/TeamMsgExtractor/msg-extractor/issues/40)] Added new submodule `custom_attachments`. This submodule provides an extendable way to handle custom attachment types, attachment types whose structure and formatting are not defined in the Microsoft documentation for MSG files. This includes a handler to at least partially cover support for Outlook images.
5+
* [[TeamMsgExtractor #373](https://github.com/TeamMsgExtractor/msg-extractor/issues/373)] Added the `encoding` submodule for encoding tasks, including proper support for Microsoft's implementation of CP950. This gets added to the codecs list as "windows-950".
6+
* Added infrastructure to make it easy to add variable-byte (up to two bytes) encodings and single-byte encodings.
7+
* Added the following encodings:
8+
* windows-874
9+
* x-mac-ce
10+
* x-mac-cyrillic
11+
* x-mac-greek
12+
* x-mac-icelandic
13+
* x-mac-turkish
14+
* Fixed an issue in the save functions that left the possibility for the zip files to not end up closing if the save function created it and then had an exception.
15+
* Added new property `AttachmentBase.clsid` which returns the listed CLSID value of the data stream/storage of the attachment.
16+
* Changed internal behavior of `MSGFile.attachments`. This should not cause any noticeable changes to the output.
17+
* Refactored code significantly to make it more organized.
18+
* Changed the exports from the main module to only include an important subset of the module. For other items, you'll have to import the submodule that it falls under to access it. Submodules export all important pieces, so it will be easier to find.
19+
* This includes having many modules be under entirely new paths. Some of these changes have been done with no deprecation, something I generally try to avoid. This is happening at the same time as the public API is significantly changing, which makes it more acceptable.
20+
* Fixed `__main__` using the wrong enum for error behavior.
21+
* Fixed `Named.get` being severely out of date (it's not used anywhere by the module which is why it wasn't noticed).
22+
* Fixed `Named.__getitem__` being entirely case-sensitive.
23+
* Switched much of the internal code (and the `treePath` property of all classes that have it) to using `weakref.ReferenceType` to avoid hard cyclic references.
24+
* Fixed `Recipient._getTypedStream` never returning a value.
25+
* Added additional type hints in various places.
26+
* Modified tests.py to only run if it is run as a file instead of imported.
27+
* Changed `knownMsgClass` to a private function since it is explicitly not being exported by any part of the module.
28+
* Removed unused function `getFullClassName`.
29+
* Fixes to the HTML body when saving as HTML will no longer require the `preparedHtml`/`--prepared-html` option.
30+
* Removed unused exceptions.
31+
* Entirely reorganized the way attachments are initialized, including the class that will be used in various circumstances. Embedded MSG files, custom attachments, and web attachments will all use dedicated classes that are subclasses of `AttachmentBase`.
32+
* With this change, the way to specify a new Attachment class is to override the function used when creating attachments. This can be done by passing `attachmentInit = myFunction` as an option to `openMsg`. This function MUST return an instance of AttachmentBase.
33+
* Added first implementation of web attachments. Saving is not currently possible, but basic relevant property access is now possible. Saving will not be stopped by this attachment if `skipNotImplemented = True` is passed to the save function.
34+
* Changed the option to suppress `RTFDE` errors to fall under the `ErrorBehavior` enum. Usage of the original option will be allowable, but is being marked as deprecated. However, it is still a dedicated option from the command line.
35+
* Also fixed the option not properly ignoring some RTFDE errors, specifically the ones that it is normal for the module to throw.
36+
* Removed some constants that are not used by the module.
37+
* Updated to support `RTFDE` version `0.1.0`. Users encountering random errors from that module should find that those errors have disappeared. If you still get errors from it, bring up the issue on their GitHub.
38+
* Fixed bug that would cause weird behavior if you gave an empty string as the path for an MSG file.
39+
* Added support for `IPM.StickyNote`.
40+
* Fixed an issue that would cause MSG file to never close if an error happened during any of the `__init__` functions for MSG classes.
41+
* Removed unneeded `chardet` dependency.
42+
* Removed `Contact.__init__` as it didn't provide any unique behavior.
43+
* Changed the documentation of `openMsg` to specify that it accepts all options recognized by MSGFile subclasses, allowing the doc string to not be modified every time one of them is changed.
44+
* Changed the documentation of various `__init__` methods to do the same thing.
45+
* Added `dataType` property to `AttachmentBase` and `SignedAttachment` for checking the class that the data will be, if accessible. Returns `None` if the data is inaccessible, including because accessing it would throw an exception.
46+
* Added new enum `InsecureFeatures` and option `insecureFeatures`. This option will allow certain features with security implications to be used for files that you trust. Currently the only feature it supports is the usage of `PIL`/`Pillow` to open and modify images. All features like this will be opt-in to reduce possible vulnerabilities.
47+
* Modified all custom exceptions the module uses to derive from a single base class for better organization.
48+
* Added new exceptions to handle some of the situations previously handled by base Python exceptions.
49+
* Changed internal handling of the `prefix` option for `MSGFile.__init__` (and therefore `openMsg`). If you are not setting this manually, you should notice little difference.
50+
* Made enums less strict and converted all using `fromBits` to be `IntFlag` enums.
51+
* Fixed `CalendarBase.keywords` being blatantly incorrect (it was so bad I don't know how it slipped through).
52+
* Fixed `Contact.gender` being blatantly incorrect.
53+
* Fixed sender not being properly decoded in some circumstances.
54+
* Changed behavior of `MSGFile` to have olefile raise defects of type `DEFECT_INCORRECT` and above instead of just `DEFECT_FATAL`. Uncaught issues of `DEFECT_INCORRECT` can often cause the module to have parsing issues that may be misleading; this change ensures the underlying issue is made clear. The previous behavior can be restored with `ErrorBehavior.OLE_DEFECT_INCORRECT`.
55+
* Fixed potential issues that may have made it possible for certain attachments to ignore filename conflict resolution code.
56+
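The v0.42.0 entry above mentions switching `treePath` and much of the internal code to `weakref.ReferenceType` to avoid hard cyclic references. The following is only a minimal sketch of that general technique, not the module's actual implementation: the `Node` class and its `tree_path` property are hypothetical names invented for illustration.

```python
import weakref
from typing import List, Optional


class Node:
    """Hypothetical stand-in for an object that knows its place in a tree."""

    def __init__(self, name: str, parent: Optional["Node"] = None) -> None:
        self.name = name
        # Hold only a weak reference to the parent, so that the chain
        # parent -> child -> parent never forms a hard reference cycle.
        self._parent = weakref.ref(parent) if parent is not None else None
        self.children: List["Node"] = []
        if parent is not None:
            parent.children.append(self)

    @property
    def parent(self) -> Optional["Node"]:
        # Calling a weak reference returns the target, or None if the
        # target has already been garbage collected.
        return self._parent() if self._parent is not None else None

    @property
    def tree_path(self) -> List["weakref.ReferenceType[Node]"]:
        """Weak references from the root down to this node."""
        path = [weakref.ref(self)]
        node = self.parent
        while node is not None:
            path.insert(0, weakref.ref(node))
            node = node.parent
        return path


root = Node("root")
child = Node("attachment", parent=root)
names = [ref().name for ref in child.tree_path]
```

Because the child holds no strong reference to its parent, dropping the last strong reference to the root lets it be reclaimed without waiting for the cycle collector.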
157
**v0.41.5**
258
* Fixed an issue from version `0.41.3` where the header being present but missing the `From` field would cause an exception.
359
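The v0.42.0 notes say that enums previously using `fromBits` were converted to `IntFlag` and made less strict. As a rough illustration of what that buys in plain Python, here is a small sketch using the standard library only; `SaveFlags` and its members are hypothetical and do not correspond to any enum in the module.

```python
import enum


class SaveFlags(enum.IntFlag):
    """Hypothetical flag set; names and values are illustrative only."""
    NONE = 0
    SKIP_HIDDEN = 1
    OVERWRITE = 2
    SKIP_EMBEDDED = 4


# Flags combine with bitwise OR and can be tested with `in`.
flags = SaveFlags.SKIP_HIDDEN | SaveFlags.OVERWRITE

# An IntFlag can be built directly from a raw integer, so a separate
# fromBits-style helper is not needed.
from_raw = SaveFlags(6)

# IntFlag keeps bits that no member defines instead of raising, which
# is one sense in which it is less strict than a plain Enum.
unknown = SaveFlags(8)
```

A plain `enum.Enum` with the same members would raise `ValueError` for both `6` and `8`, since neither is the exact value of a defined member.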

README.rst

Lines changed: 50 additions & 58 deletions
@@ -59,62 +59,54 @@ Currently, the README is in the process of being redone. For now, please
5959
refer to the usage information provided from the program's help dialog:
6060
::
6161

62-
usage: extract_msg [-h] [--use-content-id] [--validate] [--json] [--file-logging] [-v] [--log LOG] [--config CONFIGPATH]
63-
[--out OUTPATH] [--use-filename] [--dump-stdout] [--html] [--pdf] [--wk-path WKPATH]
64-
[--wk-options [WKOPTIONS ...]] [--prepared-html] [--charset CHARSET] [--raw] [--rtf] [--allow-fallback]
65-
[--skip-body-not-found] [--zip ZIP] [--save-header] [--attachments-only] [--skip-hidden] [--no-folders]
66-
[--skip-embedded] [--extract-embedded] [--skip-not-implemented] [--out-name OUTNAME | --glob] [--ignore-rtfde]
67-
[--progress]
68-
msg [msg ...]
69-
70-
extract_msg: Extracts emails and attachments saved in Microsoft Outlook's .msg files. https://github.com/TeamMsgExtractor/msg-
71-
extractor
72-
73-
positional arguments:
74-
msg An MSG file to be parsed.
75-
76-
optional arguments:
77-
-h, --help show this help message and exit
78-
--use-content-id, --cid
79-
Save attachments by their Content ID, if they have one. Useful when working with the HTML body.
80-
--validate Turns on file validation mode. Turns off regular file output.
81-
--json Changes to write output files as json.
82-
--file-logging Enables file logging. Implies --verbose level 1.
83-
-v, --verbose Turns on console logging. Specify more than once for higher verbosity.
84-
--log LOG Set the path to write the file log to.
85-
--config CONFIGPATH Set the path to load the logging config from.
86-
--out OUTPATH Set the folder to use for the program output. (Default: Current directory)
87-
--use-filename Sets whether the name of each output is based on the msg filename.
88-
--dump-stdout Tells the program to dump the message body (plain text) to stdout. Overrides saving arguments.
89-
--html Sets whether the output should be HTML. If this is not possible, will error.
90-
--pdf Saves the body as a PDF. If this is not possible, will error.
91-
--wk-path WKPATH Overrides the path for finding wkhtmltopdf.
92-
--wk-options [WKOPTIONS ...]
93-
Sets additional options to be used in wkhtmltopdf. Should be a series of options and values, replacing the -
94-
or -- in the beginning with + or ++, respectively. For example: --wk-options "+O Landscape"
95-
--prepared-html When used in conjunction with --html, sets whether the HTML output should be prepared for embedded
96-
attachments.
97-
--charset CHARSET Character set to use for the prepared HTML in the added tag. (Default: utf-8)
98-
--raw Sets whether the output should be raw. If this is not possible, will error.
99-
--rtf Sets whether the output should be RTF. If this is not possible, will error.
100-
--allow-fallback Tells the program to fallback to a different save type if the selected one is not possible.
101-
--skip-body-not-found
102-
Skips saving the body if the body cannot be found, rather than throwing an error.
103-
--zip ZIP Path to use for saving to a zip file.
104-
--save-header Store the header in a separate file.
105-
--attachments-only Specify to only save attachments from an msg file.
106-
--skip-hidden Skips any attachment marked as hidden (usually ones embedded in the body).
107-
--no-folders Stores everything in the location specified by --out. Requires --attachments-only and is incompatible with
108-
--out-name.
109-
--skip-embedded Skips all embedded MSG files when saving attachments.
110-
--extract-embedded Extracts the embedded MSG files as MSG files instead of running their save functions.
111-
--skip-not-implemented, --skip-ni
112-
Skips any attachments that are not implemented, allowing saving of the rest of the message.
113-
--out-name OUTNAME Name to be used with saving the file output. Cannot be used if you are saving more than one file.
114-
--glob, --wildcard Interpret all paths as having wildcards. Incompatible with --out-name.
115-
--ignore-rtfde Ignores all errors thrown from RTFDE when trying to save. Useful for allowing fallback to continue when an
116-
exception happens.
117-
--progress Shows what file the program is currently working on during it's progress.
62+
usage: extract_msg [-h] [--use-content-id] [--json] [--file-logging] [-v] [--log LOG] [--config CONFIGPATH] [--out OUTPATH] [--use-filename] [--dump-stdout] [--html] [--pdf] [--wk-path WKPATH] [--wk-options [WKOPTIONS ...]]
63+
[--prepared-html] [--charset CHARSET] [--raw] [--rtf] [--allow-fallback] [--skip-body-not-found] [--zip ZIP] [--save-header] [--attachments-only] [--skip-hidden] [--no-folders] [--skip-embedded] [--extract-embedded]
64+
[--overwrite-existing] [--skip-not-implemented] [--out-name OUTNAME | --glob] [--ignore-rtfde] [--progress]
65+
msg [msg ...]
66+
67+
extract_msg: Extracts emails and attachments saved in Microsoft Outlook's .msg files. https://github.com/TeamMsgExtractor/msg-extractor
68+
69+
positional arguments:
70+
msg An MSG file to be parsed.
71+
72+
options:
73+
-h, --help show this help message and exit
74+
--use-content-id, --cid
75+
Save attachments by their Content ID, if they have one. Useful when working with the HTML body.
76+
--json Changes to write output files as json.
77+
--file-logging Enables file logging. Implies --verbose level 1.
78+
-v, --verbose Turns on console logging. Specify more than once for higher verbosity.
79+
--log LOG Set the path to write the file log to.
80+
--config CONFIGPATH Set the path to load the logging config from.
81+
--out OUTPATH Set the folder to use for the program output. (Default: Current directory)
82+
--use-filename Sets whether the name of each output is based on the msg filename.
83+
--dump-stdout Tells the program to dump the message body (plain text) to stdout. Overrides saving arguments.
84+
--html Sets whether the output should be HTML. If this is not possible, will error.
85+
--pdf Saves the body as a PDF. If this is not possible, will error.
86+
--wk-path WKPATH Overrides the path for finding wkhtmltopdf.
87+
--wk-options [WKOPTIONS ...]
88+
Sets additional options to be used in wkhtmltopdf. Should be a series of options and values, replacing the - or -- in the beginning with + or ++, respectively. For example: --wk-options "+O Landscape"
89+
--prepared-html When used in conjunction with --html, sets whether the HTML output should be prepared for embedded attachments.
90+
--charset CHARSET Character set to use for the prepared HTML in the added tag. (Default: utf-8)
91+
--raw Sets whether the output should be raw. If this is not possible, will error.
92+
--rtf Sets whether the output should be RTF. If this is not possible, will error.
93+
--allow-fallback Tells the program to fallback to a different save type if the selected one is not possible.
94+
--skip-body-not-found
95+
Skips saving the body if the body cannot be found, rather than throwing an error.
96+
--zip ZIP Path to use for saving to a zip file.
97+
--save-header Store the header in a separate file.
98+
--attachments-only Specify to only save attachments from an msg file.
99+
--skip-hidden Skips any attachment marked as hidden (usually ones embedded in the body).
100+
--no-folders Stores everything in the location specified by --out. Requires --attachments-only and is incompatible with --out-name.
101+
--skip-embedded Skips all embedded MSG files when saving attachments.
102+
--extract-embedded Extracts the embedded MSG files as MSG files instead of running their save functions.
103+
--overwrite-existing Disables filename conflict resolution code for attachments when saving a file, causing files to be overwriten if two attachments with the same filename are on an MSG file.
104+
--skip-not-implemented, --skip-ni
105+
Skips any attachments that are not implemented, allowing saving of the rest of the message.
106+
--out-name OUTNAME Name to be used with saving the file output. Cannot be used if you are saving more than one file.
107+
--glob, --wildcard Interpret all paths as having wildcards. Incompatible with --out-name.
108+
--ignore-rtfde Ignores all errors thrown from RTFDE when trying to save. Useful for allowing fallback to continue when an exception happens.
109+
--progress Shows what file the program is currently working on during it's progress.
118110

119111
**To use this in your own script**, start by using:
120112

@@ -250,8 +242,8 @@ your access to the newest major version of extract-msg.
250242
.. |License: GPL v3| image:: https://img.shields.io/badge/License-GPLv3-blue.svg
251243
:target: LICENSE.txt
252244

253-
.. |PyPI3| image:: https://img.shields.io/badge/pypi-0.41.5-blue.svg
254-
:target: https://pypi.org/project/extract-msg/0.41.5/
245+
.. |PyPI3| image:: https://img.shields.io/badge/pypi-0.42.0-blue.svg
246+
:target: https://pypi.org/project/extract-msg/0.42.0/
255247

256248
.. |PyPI2| image:: https://img.shields.io/badge/python-3.8+-brightgreen.svg
257249
:target: https://www.python.org/downloads/release/python-3816/
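For scripts that drive the command line rather than the Python API, the new `--overwrite-existing` option from the help text above can be passed like any other flag. The sketch below only assembles the argument list; it assumes the package can be invoked as `python -m extract_msg`, which the help output does not state explicitly, and `build_command` is a hypothetical helper name.

```python
import subprocess
import sys
from typing import List, Sequence


def build_command(msg_paths: Sequence[str], out_dir: str,
                  overwrite: bool = False) -> List[str]:
    """Assemble an extract_msg invocation from options in the help text."""
    # Assumption: the package is runnable with `-m extract_msg`.
    cmd = [sys.executable, "-m", "extract_msg", "--out", out_dir]
    if overwrite:
        # Disables filename conflict resolution for attachments.
        cmd.append("--overwrite-existing")
    # Positional MSG paths go last, matching `msg [msg ...]` in the usage.
    cmd.extend(msg_paths)
    return cmd


def run(cmd: Sequence[str]) -> None:
    """Run the assembled command, raising if it exits with an error."""
    subprocess.run(list(cmd), check=True)


command = build_command(["example.msg"], "output", overwrite=True)
```

Nothing is executed until `run(command)` is called, so the list can be logged or inspected first.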

extract_msg/__init__.py

Lines changed: 22 additions & 33 deletions
@@ -27,53 +27,42 @@
2727
# along with this program. If not, see <http://www.gnu.org/licenses/>.
2828

2929
__author__ = 'Destiny Peterson & Matthew Walker'
30-
__date__ = '2023-06-11'
31-
__version__ = '0.41.5'
30+
__date__ = '2023-07-29'
31+
__version__ = '0.42.0'
3232

3333
__all__ = [
3434
# Modules:
35-
'constants',
35+
'attachments',
3636
'enums',
3737
'exceptions',
38+
'msg_classes',
39+
'properties',
3840

3941
# Classes:
40-
'AppointmentMeeting',
4142
'Attachment',
42-
'Contact',
43-
'MeetingForwardNotification',
44-
'MeetingRequest',
45-
'MeetingResponse',
43+
'AttachmentBase',
4644
'Message',
47-
'MessageBase',
48-
'MessageSigned',
49-
'MessageSignedBase',
5045
'MSGFile',
51-
'Post',
52-
'Properties',
46+
'Named',
47+
'NamedProperties',
48+
'OleWriter',
49+
'PropertiesStore',
5350
'Recipient',
54-
'Task',
51+
'SignedAttachment',
5552

56-
#Functions:
57-
'createProp',
53+
# Functions:
5854
'openMsg',
5955
'openMsgBulk',
6056
]
6157

58+
59+
# Ensure these are imported before anything else.
6260
from . import constants, enums, exceptions
63-
from .appointment import AppointmentMeeting
64-
from .attachment import Attachment
65-
from .contact import Contact
66-
from .meeting_forward import MeetingForwardNotification
67-
from .meeting_request import MeetingRequest
68-
from .meeting_response import MeetingResponse
69-
from .message import Message
70-
from .message_base import MessageBase
71-
from .message_signed import MessageSigned
72-
from .message_signed_base import MessageSignedBase
73-
from .msg import MSGFile
74-
from .post import Post
75-
from .prop import createProp
76-
from .properties import Properties
77-
from .recipient import Recipient
78-
from .task import Task
79-
from .utils import openMsg, openMsgBulk
61+
62+
from . import attachments, msg_classes, properties
63+
from .attachments import Attachment, AttachmentBase, SignedAttachment
64+
from .msg_classes import Message, MSGFile
65+
from .ole_writer import OleWriter
66+
from .open_msg import openMsg, openMsgBulk
67+
from .properties import Named, NamedProperties, PropertiesStore
68+
from .recipient import Recipient
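Since names such as `AppointmentMeeting` are no longer imported at the top level in this diff, code that must work on both 0.41.x and 0.42.0 may need to try more than one import path. The following is a hedged sketch of one way to do that with the standard library. The old top-level location is taken from the removed imports above; the new `extract_msg.msg_classes` location is an assumption based on the changelog's note that submodules export their own pieces, and `resolve_first` is a hypothetical helper.

```python
import importlib
from typing import Any, Iterable, Tuple


def resolve_first(candidates: Iterable[Tuple[str, str]]) -> Any:
    """Return the first attribute that can be imported from the candidates."""
    errors = []
    for module_name, attr in candidates:
        try:
            module = importlib.import_module(module_name)
            return getattr(module, attr)
        except (ImportError, AttributeError) as exc:
            errors.append(f"{module_name}.{attr}: {exc}")
    raise ImportError(
        "none of the candidates could be imported: " + "; ".join(errors)
    )


# Newest layout first. The first entry is an assumed location; the second
# is where the name lived before this commit.
APPOINTMENT_CANDIDATES = [
    ("extract_msg.msg_classes", "AppointmentMeeting"),
    ("extract_msg", "AppointmentMeeting"),
]
```

A caller would then write `AppointmentMeeting = resolve_first(APPOINTMENT_CANDIDATES)` once at import time. Names that remain in `__all__`, such as `openMsg` and `Message`, can still be imported from `extract_msg` directly.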
