-
-
Notifications
You must be signed in to change notification settings - Fork 33.2k
gh-116477: Improve performance of range for the single argument case #116478
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
f3e6ab9 to
bf88016
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In general LGTM. But please remove redundant spaces.
Misc/NEWS.d/next/Core and Builtins/2024-03-07-22-38-07.gh-issue-116477.AqwRAV.rst
Outdated
Show resolved
Hide resolved
…e-116477.AqwRAV.rst Co-authored-by: Serhiy Storchaka <[email protected]>
Objects/rangeobject.c
Outdated
| lstop = PyLong_AsLong(r->stop); | ||
| if (lstop == -1 && PyErr_Occurred()) { | ||
|
|
||
| if (r->start == _PyLong_GetZero() && r->step == _PyLong_GetOne() ) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How large is overhead of two PyLong_AsLong() calls for 0 and 1? Is it noticeable?
Run something like:
./python -m timeit -s 'r = range(10)' 'for i in r: break'
AFAIK it is the fastest code that is affected by the iterator creation time.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The overhead for the two PyLong_AsLong calls (plus some bits and pieces) are from the first benchmark above:
range(1): Mean +- std dev: [main_pgo] 44.4 ns +- 1.0 ns -> [pr_pgo] 34.7 ns +- 4.0 ns: 1.28x faster
So about 10 ns.
The last benchmark from the first comment:
for loop: Mean +- std dev: [main_pgo] 295 ns +- 9 ns -> [pr_pgo] 267 ns +- 5 ns: 1.10x faster
executes:
def g():
x=0
for ii in range(10):
x += 1
So a small for loop with minimal work becomes 10% faster. (timings may vary, my system is noisy)
Co-authored-by: Serhiy Storchaka <[email protected]>
Co-authored-by: Erlend E. Aasland <[email protected]>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good to me, but I'd like Serhiy's thumbs up before landing.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'll merge this later at the sprints. Thanks!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am not convinced that this change is worth to do, but I will not object against committing it if other core developer likes it.
Co-authored-by: Serhiy Storchaka <[email protected]>
|
@eendebakpt, are the benchmarks up to date? |
|
Here is a set of benchmarks for the PR rebased to main: Benchmark script(compiled with PGO, executed with There is some noise in the data, but there are enough datapoints to make the trends clear. Most relevant for real applications are the |
|
So, if we exclude the iter optimisation, there will be no slowdowns (unlesss I misread the benchmark)? |
Both optimizations use the same check (two pointer comparisons to Two runs of benchmarks of main vs. the updated PR |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM. @serhiy-storchaka, any objections?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please repeat benchmarks with the current code.
|
New benchmarks (Ubuntu, with the new tail calling interpreter, non-PGO): |
|
Pieter, can you fix the linting CI? Let's land this! |
I understand your concern. For example, we did not optimise |
Benchmark
The fast paths eliminate several calls to
PyLong_AsLongAndOverflowand the construction of a new Python integer withPyLong_FromLong.The fast path for the range object is in
compute_range_length. That implies a small performance penalty (two pointer comparisons) for the cases wherestart != 0orstep != 1. We could move the fast path torange_from_array(thecase 1:part), but that leads to a bit more duplicated code.