-
-
Notifications
You must be signed in to change notification settings - Fork 33.2k
GH-115512: Optimize peak memory usage and runtime for large emails #132709
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
Most changes to Python require a NEWS entry. Add one using the blurb_it web app or the blurb command-line tool. If this change has little impact on Python users, wait for a maintainer to apply the |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Before reviewing this PR, please
- Remove all type annotations. They are left to https://github.com/python/typeshed.
- Wrap all lines under 80 characters.
- Avoid comments that state what the code does. There are some trivial comments here and there. Some are not very useful.
- Do not check statistical overheads. They entirely depend on the host machine and other parameters that are hard to guarantee. We only test functionalities but we don't want to necessarily test that X or Y takes more or less time than this.
- If possible, make smaller PRs, targeting either time or memory improvements, and if possible, only one function or method at a time.
Note: it'd be a good idea to provide the full benchmarking script so that others can also verify the results.
A Python core developer has requested some changes be made to your pull request before we can consider merging it. If you could please address their requests along with any other requests in other reviews from core developers that would be appreciated. Once you have made the requested changes, please leave a comment on this pull request containing the phrase |
I have made the requested changes; please review again Re: Failing build check: This looks unrelated to my changes, and was not failing in previous commits. Re: Testing for memory usage: I removed the memory usage tests, but I do think there's some value to testing something of that nature in some sort of automated way. Memory usage tests are easier to get consistency out of than time-based tests. Maybe some subset of tests could be run in a controlled environment (i.e: the Ubuntu tests check), and skipped otherwise. Maybe it merits its own separate test suite. In general though, the way it was being tested was intentionally as close to deterministic as possible. Repeatedly running the same tests on the same machine seemed to be consistently producing the same results, as best as I could tell, at least. I understand if that's entirely out of the scope of this issue & PR, though. Re: Splitting the PR: The Re: Benchmark script: I'll have to look into releasing some variation of the benchmark script. It may take a fair bit of time (at least a week), and it's unfortunately not a zero-effort endeavor. It hadn't occurred to me that it might be helpful to include. Let me follow up on this. |
Thanks for making the requested changes! @picnixz: please review the changes made to this pull request. |
yield NeedMoreData | ||
continue | ||
self._cur.epilogue = EMPTYSTRING.join(epilogue) | ||
self._cur.epilogue = '' |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Note: I kept the behavior here the same, but I'm not actually certain whether it's correct. The previous code appeared as though it intended to assign the remainder of the message to the epilogue, but then did not.
So, this data is discarded. This is the only place where we discard the rest of the message like this. It's out of the scope of this PR, and it's very narrow in scope, but it's interesting.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it's just to make sure that the attribute is initialized.
We have the
No, now it's fine. Splitting the PR is worth only if the splitted parts can be used and are not just dead code. I still want to know: is it possible to improve only the time complexity without touching the memory complexity and then improve the memory complexity in a second phase without too many changes, are the improvements entangled? |
Thanks, I wasn't aware of this, and I'm happy to poke at this in a separate PR. The main 3 test cases I was focusing in on were "1 large non-multipart plaintext email" (the simplest case), "1 large multipart email with 1 properly body-formatted large attachment" (the ideal case for attachments), and "1 large multipart email with 1 non-body-formatted large attachment on a single line" (the worst case), since each of those had slightly different memory usage characteristics.
Short answer: Long answer: If these absolutely had to be separated, it would be far less work to have a PR for the memory optimizations first (which introduces a small time performance regression, as splitting on universal newlines in Python is slower than using That said, I can point out the functions which exist primarily for those time optimizations:
|
I would be happy if we can manage to land this before beta but I'm away for the next 10 days so I won't be able to review this. Maybe @bitdancer has time to review this, or @vadmium / @serhiy-storchaka but otherwise it will need to wait until 3.15. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's still hard to review as there are many things that happen. I can remember one or two booleans but maybe we could have a better real state machine with named states to remember what we are doing with having to add comments everywhere.
This PR will need more eyes and I'm sorry in advance that we'll likely have a lot of back and forth just for me to correctly understand the implementaion, though I have my idea now.
Ok, avoid having blank lines. We add blank lines to separate logical sections in the function but standard library likes to be more compact. I also stopped correcting this, but if your comment starts with a capital letter, add a period at the end (or just do everything in lower case). It may seem silly, but without a period I'm actually expecting something to follow the line after so, keep in mind the possible accessibility issues.
Now, you can also add a note in whatsnew/3.15.rst
because this is substantial improvement.
# We previously pushed a dangling line expecting \n to follow, | ||
# however we received other data instead. Therefore, that \r | ||
# does actually terminate a line. Go ahead and push it. | ||
self._flush_partial() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
One way to reduce indentation, and make it easier to read perhaps, is to actually handle self._dangling_partial
in flush_partial
directly, or in another method, say self._try_flush_partial()
.
self._partial.append(data) | ||
self._dangling_partial = True |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would suggest having a method doing the partial.append() and the dangling_partial=True (AFAICT, doing the append should always set the flag)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This would make sense; the thought also later occurred to me that we can just have a property which checks "is there a partial, and does it end in '\r'
?", but I didn't want to re-benchmark this at the time to see if there was any measurable difference. I'm a little less worried now though, and tracking less explicit state may be more maintainable. It is admittedly error-prone.
We only set the flag though when the line is ending on a '\r'
specifically, not on every append(), but on every append ending in '\r'
. Here, we know the last character is '\r'
because the universal newline is present, but lacks an end (so we need to see if a '\n'
follows).
if data[-1] == '\r': | ||
self._dangling_partial = True |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is a recurrent check, so how about having a "self._check_if_danling_partial()".
needs_more_data = False | ||
for line in self: | ||
if line is NeedMoreData: | ||
needs_more_data = True | ||
break | ||
_dump_destination.append(line) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You can use itertools.takewhile
:
iself = iter(self)
x = itertools.takewhile(operator.is_(NeedMoreData), iself)
_dump_destination.extend(x)
if next(iself, None) is None:
# self was consumed
else:
# we can pull more data
if needs_more_data: | ||
# Flush our partial, if we can | ||
if self._partial and self._can_dump_partial(self._partial[0]): | ||
_dump_destination.extend(self._partial) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should it be self._partial
or ''.join(self._partial)
in this case? also, why are we only checking the first partial[0]
content but still extend _dump_destination
with the entire partial?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since we're only checking the partial, we only need to check the very first character of the partial (to see if it's a -
), as well as the state of the parser. If it starts with a potential boundary (-
), and we care about boundary matches, we assume we can't dump this at all, as we don't know the content of the rest of the line, and so we can't do the boundary matches. The rest of the line's content isn't looked at. It's worth noting that the partial will never contain a complete line here.
The extend() here just ensures we're definitely grabbing everything, since the partial could be many parts if it's a very long partial line.
This is fine; the more eyes, the better 👍
This sounds fine. I will do a pass to apply all of the stylistic comments soon, but I've gone ahead and made sure to at least reply to the functional comments. At some point I picked up "don't leave trailing periods in comments" as part of my personal style, but I'm not really sure when or why. I think it might've had something to do with inconsistent trailing periods in comments. In any case, it's easy to change these here to match the project.
Ack; will do alongside the other comments. |
I can understand, and I also debated with myself whether to be pedantic or not. However, with a period, I'm sure that people will read "sentences" and not just half-sentences. I just observed that sometimes, without a final period, I'm expecting for more comment to follow which sometimes breaks the reading flow. For now, you can just forget about the style comments and we can focus on functionality only. |
Thank you all for your work so far on this! Is it still active? |
It's still active, but I need to set aside some time to incorporate the requested changes and re-validate that the behavior is unchanged with those requested changes. |
# We were dumping, but we've now reached the end of the dump. | ||
self._dump_destination = None | ||
self._lines.append(line) | ||
else: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For my personal style, I'm not like to use if else
here. I belive there are for different status here
- Do Nothing
- flush partial
- end of the dump
- duming
So I prefer to create a enum here to represent four different status and use a match case statement here.
I think this will help us to sementic the status and make the code more readable
if not self._eofstack: | ||
# There's nothing that will ever tell us to stop dumping. | ||
# This does absolute wonders for large non-multipart emails. | ||
assert not self._lines |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure we need to use the assert here. Normally the assert will raise an exception here.
So in other words, I'm not sure we need to raise an exception here.
If we don't need to raise an exception here, I think we simplify the code here as return not self._lines and not self._dangling_partial and not self._partial
If we need to raise an exception here, I think it would be better to cumtomize a new Exception and raise it with some meaningful message not just a simple assert error
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The assertion here is to protect against future changes, as it would be indicative of a programmatic error in the email library specifically. If self._dump_destination
is present (i.e: we are dumping) and there is no _eofstack
(i.e: nothing to prevent dumping arbitrary data at all, because we're dumping to end of stream), then there must not be any pending lines (because they should already be dumped), and there must not be a partial line (because it should've already been dumped). Hence why an assertion seems the most appropriate in these few places specifically.
The assertion itself could be removed entirely, as it should never fail, but I fear that could introduce risk of programmatic errors later.
Edit: forgot to mention the _eofstack
specifically
def _can_dump_partial(self, line, start=0, end=sys.maxsize): | ||
# Very similar to _can_dump_data above, except we can make some | ||
# additional assumptions for partials/lines. | ||
assert not self._partial or line is self._partial[0] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The same with previous comment
separator = '--' + boundary | ||
def boundarymatch(line): | ||
if not line.startswith(separator): | ||
def boundarymatch(line, pos = 0, endpos = sys.maxsize): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure why we need endpos here? In my thought all of the boundarymatch
is called without endpos. So all of the string endpos will be sys.maxsiz.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The endpos
value is provided by _check_eofstack
, which is leveraged by one of the callers.
return None | ||
return boundaryendRE.match(line, len(separator)) | ||
return boundaryendRE.match(line, pos + len(separator), endpos) | ||
boundarymatch.is_boundary_match = True |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If you want to combine an independent status var with a function. I prefer we need to use a class instead of it.
Co-authored-by: Bénédikt Tran <[email protected]>
GH-115512: email.message_from_bytes heavy memory use
Note: In the rest of this,
time taken
,peak overhead
, andoverhead ratio
refer to the similarly named variables in the below snippet:Changes:
text.decode
inparser.BytesParser.parsebytes
, by decoding one chunk of bytes at a time instead.StringIO
use inparser.Parser.parsestr
, slicing into the string instead. This also impactedparser.BytesParser.parsebytes
StringIO
use infeedparser.BufferedSubFile
, by reverting back to using a list, while still retaining the universal newline behavior exhibited byStringIO
BufferedSubFile
. For multi part messages, it dumps every chunk lacking any potential boundaries (i.e: no-
character), as well as every line lacking any boundary, up until the boundary we are looking for as indicated by_eofstack
. Without this change, the above changes would've introduced a noticeable runtime performance regression. With this change, runtime performance is significantly improved.Benchmarking
As a part of internal testing, I performed some benchmarking by directly measuring the time to parse ~907k email files using
message_from_bytes
. For each blob, a script calledemail.message_from_bytes
, measured the memory usage usingtracemalloc
as well as the time taken usingtime.perf_counter()
, and then did the same function call and measurements using a fork of the email library which at the time included only these changes. It then deep-compared the output of each, to validate that they're exactly equal.General information:
Without the changes, these were the stats of the Python 3.12.9 email parser:
Time stats:
With the changes, these were the stats of this email parser when using Python 3.12.9:
Time stats:
Focusing in on the totals, this represents: