@eendebakpt
Contributor

@eendebakpt eendebakpt commented Mar 7, 2024

Benchmark

range(1): Mean +- std dev: [main_pgo] 44.4 ns +- 1.0 ns -> [pr_pgo] 34.7 ns +- 4.0 ns: 1.28x faster
iter(range(1)): Mean +- std dev: [main_pgo] 78.7 ns +- 1.8 ns -> [pr_pgo] 61.7 ns +- 3.8 ns: 1.28x faster
list(iter(range(1))): Mean +- std dev: [main_pgo] 219 ns +- 2 ns -> [pr_pgo] 199 ns +- 2 ns: 1.10x faster
range(2, 1): Mean +- std dev: [main_pgo] 51.8 ns +- 1.3 ns -> [pr_pgo] 50.1 ns +- 2.1 ns: 1.03x faster
iter(range(2, 1)): Mean +- std dev: [main_pgo] 84.1 ns +- 1.2 ns -> [pr_pgo] 86.2 ns +- 3.7 ns: 1.02x slower
range(10): Mean +- std dev: [main_pgo] 44.0 ns +- 0.6 ns -> [pr_pgo] 34.5 ns +- 1.6 ns: 1.28x faster
iter(range(10)): Mean +- std dev: [main_pgo] 78.5 ns +- 2.0 ns -> [pr_pgo] 61.6 ns +- 3.4 ns: 1.27x faster
list(iter(range(10))): Mean +- std dev: [main_pgo] 267 ns +- 2 ns -> [pr_pgo] 244 ns +- 5 ns: 1.09x faster
range(2, 10): Mean +- std dev: [main_pgo] 52.3 ns +- 3.7 ns -> [pr_pgo] 57.5 ns +- 1.5 ns: 1.10x slower
iter(range(2, 10)): Mean +- std dev: [main_pgo] 82.5 ns +- 3.0 ns -> [pr_pgo] 100 ns +- 1 ns: 1.21x slower
list(iter(range(2, 10))): Mean +- std dev: [main_pgo] 267 ns +- 4 ns -> [pr_pgo] 277 ns +- 2 ns: 1.04x slower
range(100): Mean +- std dev: [main_pgo] 44.1 ns +- 1.0 ns -> [pr_pgo] 34.1 ns +- 1.9 ns: 1.30x faster
iter(range(100)): Mean +- std dev: [main_pgo] 78.5 ns +- 1.8 ns -> [pr_pgo] 60.6 ns +- 2.2 ns: 1.29x faster
list(iter(range(100))): Mean +- std dev: [main_pgo] 707 ns +- 29 ns -> [pr_pgo] 681 ns +- 26 ns: 1.04x faster
range(2, 100): Mean +- std dev: [main_pgo] 52.6 ns +- 2.3 ns -> [pr_pgo] 57.3 ns +- 1.2 ns: 1.09x slower
iter(range(2, 100)): Mean +- std dev: [main_pgo] 83.3 ns +- 2.6 ns -> [pr_pgo] 101 ns +- 2 ns: 1.21x slower
list(iter(range(2, 100))): Mean +- std dev: [main_pgo] 705 ns +- 23 ns -> [pr_pgo] 718 ns +- 28 ns: 1.02x slower
range(400): Mean +- std dev: [main_pgo] 58.4 ns +- 0.8 ns -> [pr_pgo] 36.1 ns +- 1.1 ns: 1.62x faster
iter(range(400)): Mean +- std dev: [main_pgo] 93.9 ns +- 1.6 ns -> [pr_pgo] 62.3 ns +- 1.6 ns: 1.51x faster
list(iter(range(400))): Mean +- std dev: [main_pgo] 3.92 us +- 0.06 us -> [pr_pgo] 3.87 us +- 0.05 us: 1.01x faster
range(2, 400): Mean +- std dev: [main_pgo] 66.1 ns +- 1.1 ns -> [pr_pgo] 71.2 ns +- 1.0 ns: 1.08x slower
iter(range(2, 400)): Mean +- std dev: [main_pgo] 103 ns +- 6 ns -> [pr_pgo] 115 ns +- 1 ns: 1.11x slower
for loop: Mean +- std dev: [main_pgo] 295 ns +- 9 ns -> [pr_pgo] 267 ns +- 5 ns: 1.10x faster

Benchmark hidden because not significant (2): list(iter(range(2, 1))), list(iter(range(2, 400)))

Geometric mean: 1.08x faster
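For readers who want to recompute ratios from lines like the ones above, here is a minimal sketch. It assumes the line layout shown in this comment; the names `LINE_RE`, `speedup` and `geometric_mean_speedup` are made up for illustration and are not part of pyperf. It does not try to reproduce how pyperf treats the hidden "not significant" benchmarks in its own summary, so the result for a subset of lines need not match the reported geometric mean.

```python
import re
from statistics import geometric_mean

# Matches comparison lines in the layout shown above, for example:
#   range(1): Mean +- std dev: [main_pgo] 44.4 ns +- 1.0 ns -> [pr_pgo] 34.7 ns +- 4.0 ns: 1.28x faster
LINE_RE = re.compile(
    r"^(?P<name>.+?): Mean \+- std dev: "
    r"\[\w+\] (?P<base>[\d.]+) (?P<base_unit>ns|us|ms) \+- .+? -> "
    r"\[\w+\] (?P<new>[\d.]+) (?P<new_unit>ns|us|ms) \+- "
)

# Conversion factors to nanoseconds for the units that appear in the output.
UNIT_TO_NS = {"ns": 1.0, "us": 1e3, "ms": 1e6}


def speedup(line: str) -> tuple[str, float]:
    """Return (benchmark name, baseline time / new time) for one line."""
    match = LINE_RE.match(line.strip())
    if match is None:
        raise ValueError(f"unrecognised benchmark line: {line!r}")
    base_ns = float(match["base"]) * UNIT_TO_NS[match["base_unit"]]
    new_ns = float(match["new"]) * UNIT_TO_NS[match["new_unit"]]
    return match["name"], base_ns / new_ns


def geometric_mean_speedup(lines: list[str]) -> float:
    """Geometric mean of the per-benchmark ratios (values above 1 mean faster)."""
    return geometric_mean(speedup(line)[1] for line in lines)


if __name__ == "__main__":
    sample = [
        "range(1): Mean +- std dev: [main_pgo] 44.4 ns +- 1.0 ns -> [pr_pgo] 34.7 ns +- 4.0 ns: 1.28x faster",
        "iter(range(2, 10)): Mean +- std dev: [main_pgo] 82.5 ns +- 3.0 ns -> [pr_pgo] 100 ns +- 1 ns: 1.21x slower",
        "for loop: Mean +- std dev: [main_pgo] 295 ns +- 9 ns -> [pr_pgo] 267 ns +- 5 ns: 1.10x faster",
    ]
    for line in sample:
        print(speedup(line))
    print(geometric_mean_speedup(sample))
```

A ratio below 1 corresponds to a line reported as "slower"; the geometric mean is used so that a 2x speedup and a 2x slowdown cancel out.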

The fast paths eliminate several calls to PyLong_AsLongAndOverflow and the construction of a new Python integer with PyLong_FromLong.

The fast path for the range object is in compute_range_length. That implies a small performance penalty (two pointer comparisons) for the cases where start != 0 or step != 1. We could move the fast path to range_from_array (the case 1: part), but that leads to a bit more duplicated code.
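To make the idea concrete, here is a rough Python model of a length computation with such a fast path. It is only a sketch: the real code is C, works on Python integer objects, and checks `start` and `step` by pointer comparison against the cached 0 and 1 objects, whereas this model compares values. The function name `range_length` is hypothetical.

```python
def range_length(start: int, stop: int, step: int = 1) -> int:
    """Model of computing len(range(start, stop, step))."""
    if step == 0:
        raise ValueError("step must not be zero")

    # Fast path: for range(stop) the length is simply stop, clamped at 0.
    # No subtraction or division is needed.
    if start == 0 and step == 1:
        return stop if stop > 0 else 0

    # General path: normalise to a positive step, then count the elements.
    if step > 0:
        lo, hi = start, stop
    else:
        lo, hi, step = stop, start, -step
    if lo >= hi:
        return 0
    return (hi - lo - 1) // step + 1


if __name__ == "__main__":
    print(range_length(0, 10))      # fast path
    print(range_length(2, 10))      # general path
    print(range_length(10, 0, -3))  # general path, negative step
```

Calls that do not match the fast-path condition pay for the extra check before falling through to the general computation, which is the small penalty mentioned above.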

Member

@serhiy-storchaka serhiy-storchaka left a comment

In general LGTM. But please remove redundant spaces.

lstop = PyLong_AsLong(r->stop);
if (lstop == -1 && PyErr_Occurred()) {

if (r->start == _PyLong_GetZero() && r->step == _PyLong_GetOne() ) {
Member

How large is the overhead of the two PyLong_AsLong() calls for 0 and 1? Is it noticeable?

Run something like:

./python -m timeit -s 'r = range(10)' 'for i in r: break'

AFAIK it is the fastest code that is affected by the iterator creation time.
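The same measurement can be scripted with the standard `timeit` module, which is convenient when comparing two builds. This is a minimal sketch; the helper name `time_iter_creation` and the default repeat counts are arbitrary choices, and it measures whichever interpreter runs the script, so it has to be run once per build.

```python
import timeit


def time_iter_creation(n: int = 10, number: int = 1_000_000, repeat: int = 5) -> float:
    """Return the best time per loop, in nanoseconds, for 'for i in r: break'."""
    timer = timeit.Timer(stmt="for i in r: break", setup=f"r = range({n})")
    # Take the minimum over several repeats to reduce the effect of noise.
    best = min(timer.repeat(repeat=repeat, number=number))
    return best / number * 1e9


if __name__ == "__main__":
    print(f"{time_iter_creation():.1f} ns per loop")
```

The loop breaks after the first element, so the time is dominated by creating the range iterator rather than by iterating.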

Contributor Author

The overhead of the two PyLong_AsLong calls (plus some bits and pieces) can be read from the first benchmark above:

range(1): Mean +- std dev: [main_pgo] 44.4 ns +- 1.0 ns -> [pr_pgo] 34.7 ns +- 4.0 ns: 1.28x faster

So about 10 ns.
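The arithmetic behind that estimate, using the means and standard deviations from the `range(1)` line. The combined uncertainty is a rough figure that assumes the two measurements are independent, which is an assumption made here and not something the benchmark output states.

```python
import math

# Values taken from the range(1) benchmark line above.
main_ns, main_std = 44.4, 1.0
pr_ns, pr_std = 34.7, 4.0

saved_ns = main_ns - pr_ns
speedup = main_ns / pr_ns
# Rough uncertainty of the difference, assuming independent measurements.
saved_std = math.hypot(main_std, pr_std)

print(f"saved {saved_ns:.1f} ns per call ({speedup:.2f}x faster)")
print(f"rough uncertainty of the difference: +- {saved_std:.1f} ns")
```

This gives a saving of 9.7 ns per call with an uncertainty of roughly 4 ns, so "about 10 ns" is a fair summary but not a precise one.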

The last benchmark from the first comment:

for loop: Mean +- std dev: [main_pgo] 295 ns +- 9 ns -> [pr_pgo] 267 ns +- 5 ns: 1.10x faster

executes:

def g():
    x=0
    for ii in range(10):
        x += 1

So a small for loop with minimal work becomes 10% faster. (timings may vary, my system is noisy)

Co-authored-by: Serhiy Storchaka <[email protected]>
Contributor

@erlend-aasland erlend-aasland left a comment

Looks good to me, but I'd like Serhiy's thumbs up before landing.

@erlend-aasland erlend-aasland changed the title from "gh-116477: Improve performance of range" to "gh-116477: Improve performance of range for single arguments" on Apr 4, 2024
@erlend-aasland erlend-aasland changed the title from "gh-116477: Improve performance of range for single arguments" to "gh-116477: Improve performance of range for the single argument case" on Apr 4, 2024
Contributor

@erlend-aasland erlend-aasland left a comment

I'll merge this later at the sprints. Thanks!

@erlend-aasland erlend-aasland self-assigned this May 20, 2024
Member

@serhiy-storchaka serhiy-storchaka left a comment

I am not convinced that this change is worth doing, but I will not object to committing it if another core developer likes it.

@erlend-aasland
Contributor

@eendebakpt, are the benchmarks up to date?

@eendebakpt
Contributor Author

Here is a set of benchmarks for the PR rebased to main:

range(1): Mean +- std dev: [main_v2] 49.9 ns +- 1.1 ns -> [pr_v2] 39.7 ns +- 1.1 ns: 1.26x faster
iter(range(1)): Mean +- std dev: [main_v2] 82.1 ns +- 0.6 ns -> [pr_v2] 68.4 ns +- 1.7 ns: 1.20x faster
list(range(1)): Mean +- std dev: [main_v2] 151 ns +- 1 ns -> [pr_v2] 134 ns +- 4 ns: 1.13x faster
list(iter(range(1))): Mean +- std dev: [main_v2] 220 ns +- 1 ns -> [pr_v2] 204 ns +- 1 ns: 1.08x faster
range(2, 1): Mean +- std dev: [main_v2] 56.8 ns +- 0.4 ns -> [pr_v2] 54.5 ns +- 1.1 ns: 1.04x faster
iter(range(2, 1)): Mean +- std dev: [main_v2] 90.1 ns +- 1.7 ns -> [pr_v2] 93.0 ns +- 3.5 ns: 1.03x slower
list(iter(range(2, 1))): Mean +- std dev: [main_v2] 213 ns +- 6 ns -> [pr_v2] 218 ns +- 3 ns: 1.02x slower
for loop length 1: Mean +- std dev: [main_v2] 140 ns +- 7 ns -> [pr_v2] 118 ns +- 3 ns: 1.19x faster
range(10): Mean +- std dev: [main_v2] 49.8 ns +- 1.4 ns -> [pr_v2] 39.7 ns +- 1.2 ns: 1.25x faster
iter(range(10)): Mean +- std dev: [main_v2] 82.2 ns +- 0.8 ns -> [pr_v2] 69.3 ns +- 3.3 ns: 1.19x faster
list(range(10)): Mean +- std dev: [main_v2] 192 ns +- 4 ns -> [pr_v2] 176 ns +- 4 ns: 1.09x faster
list(iter(range(10))): Mean +- std dev: [main_v2] 262 ns +- 6 ns -> [pr_v2] 248 ns +- 6 ns: 1.05x faster
range(2, 10): Mean +- std dev: [main_v2] 57.5 ns +- 0.8 ns -> [pr_v2] 62.0 ns +- 1.8 ns: 1.08x slower
iter(range(2, 10)): Mean +- std dev: [main_v2] 90.1 ns +- 2.2 ns -> [pr_v2] 106 ns +- 2 ns: 1.18x slower
list(iter(range(2, 10))): Mean +- std dev: [main_v2] 264 ns +- 12 ns -> [pr_v2] 282 ns +- 11 ns: 1.07x slower
for loop length 10: Mean +- std dev: [main_v2] 313 ns +- 18 ns -> [pr_v2] 292 ns +- 15 ns: 1.07x faster
range(100): Mean +- std dev: [main_v2] 50.1 ns +- 2.7 ns -> [pr_v2] 40.6 ns +- 2.7 ns: 1.23x faster
iter(range(100)): Mean +- std dev: [main_v2] 82.7 ns +- 2.5 ns -> [pr_v2] 69.7 ns +- 4.2 ns: 1.19x faster
list(range(100)): Mean +- std dev: [main_v2] 599 ns +- 26 ns -> [pr_v2] 581 ns +- 10 ns: 1.03x faster
list(iter(range(100))): Mean +- std dev: [main_v2] 682 ns +- 45 ns -> [pr_v2] 663 ns +- 38 ns: 1.03x faster
range(2, 100): Mean +- std dev: [main_v2] 57.4 ns +- 0.6 ns -> [pr_v2] 62.3 ns +- 1.3 ns: 1.08x slower
iter(range(2, 100)): Mean +- std dev: [main_v2] 90.3 ns +- 3.9 ns -> [pr_v2] 107 ns +- 5 ns: 1.19x slower
list(iter(range(2, 100))): Mean +- std dev: [main_v2] 689 ns +- 41 ns -> [pr_v2] 681 ns +- 16 ns: 1.01x faster
for loop length 100: Mean +- std dev: [main_v2] 1.97 us +- 0.12 us -> [pr_v2] 1.90 us +- 0.04 us: 1.04x faster
range(400): Mean +- std dev: [main_v2] 63.2 ns +- 3.9 ns -> [pr_v2] 43.5 ns +- 3.3 ns: 1.45x faster
iter(range(400)): Mean +- std dev: [main_v2] 100 ns +- 7 ns -> [pr_v2] 73.1 ns +- 4.3 ns: 1.37x faster
range(2, 400): Mean +- std dev: [main_v2] 71.6 ns +- 3.1 ns -> [pr_v2] 76.5 ns +- 3.6 ns: 1.07x slower
iter(range(2, 400)): Mean +- std dev: [main_v2] 107 ns +- 5 ns -> [pr_v2] 123 ns +- 4 ns: 1.15x slower
list(iter(range(2, 400))): Mean +- std dev: [main_v2] 3.76 us +- 0.12 us -> [pr_v2] 3.72 us +- 0.08 us: 1.01x faster

Benchmark hidden because not significant (3): list(range(400)), list(iter(range(400))), for loop length 400

Geometric mean: 1.06x faster

Benchmark script (compiled with PGO, executed with --rigorous):

import pyperf

runner = pyperf.Runner()

# Setup code for the "for loop" benchmarks: a loop body with minimal work.
loop = """
def g(n):
    x = 0
    for ii in range(n):
        x += 1
"""

for s in [1, 10, 100, 400]:
    # Single-argument form (start=0, step=1): the case this PR optimizes.
    runner.timeit(name=f'range({s})', stmt=f"range({s})")
    runner.timeit(name=f'iter(range({s}))', stmt=f"iter(range({s}))")
    runner.timeit(name=f'list(range({s}))', stmt=f"list(range({s}))")
    runner.timeit(name=f'list(iter(range({s})))', stmt=f"list(iter(range({s})))")

    # Two-argument form (start=2): exercises the slow path.
    runner.timeit(name=f'range(2, {s})', stmt=f"range(2, {s})")
    runner.timeit(name=f'iter(range(2, {s}))', stmt=f"iter(range(2, {s}))")
    runner.timeit(name=f'list(iter(range(2, {s})))', stmt=f"list(iter(range(2, {s})))")

    # End-to-end for loop over range(s).
    runner.timeit(name=f'for loop length {s}', stmt=f"g({s})", setup=loop)

There is some noise in the data, but there are enough data points to make the trends clear. The "for loop length *" benchmarks are the most relevant for real applications.

@erlend-aasland
Contributor

So, if we exclude the iter optimisation, there will be no slowdowns (unless I misread the benchmark)?

@eendebakpt
Contributor Author

So, if we exclude the iter optimisation, there will be no slowdowns (unless I misread the benchmark)?

Both optimizations use the same check (two pointer comparisons to _PyLong_GetZero() and _PyLong_GetOne()), so I initially expected both to have the same (small) slowdown in the slow path. But the test results above indeed suggest otherwise (although part of it is noise). I did more tests, both with this PR and with a version that excludes the iter optimization, and I think we can drop the iter optimization. The gain from the iter optimization is smaller than the gain from the constructor optimization, and in the range constructor _PyLong_GetZero() is already available, whereas in the iter constructor it has to be called.

Two benchmark runs of main vs. the updated PR:

range(10): Mean +- std dev: [main] 51.4 ns +- 2.8 ns -> [v3] 37.4 ns +- 1.7 ns: 1.38x faster
iter(range(10)): Mean +- std dev: [main] 83.8 ns +- 2.9 ns -> [v3] 70.3 ns +- 2.6 ns: 1.19x faster
list(range(10)): Mean +- std dev: [main] 154 ns +- 6 ns -> [v3] 128 ns +- 6 ns: 1.21x faster
list(iter(range(10))): Mean +- std dev: [main] 204 ns +- 12 ns -> [v3] 186 ns +- 7 ns: 1.10x faster
list(iter(range(2, 10))): Mean +- std dev: [main] 206 ns +- 10 ns -> [v3] 198 ns +- 8 ns: 1.04x faster
for loop length 10: Mean +- std dev: [main] 304 ns +- 14 ns -> [v3] 284 ns +- 14 ns: 1.07x faster
for loop range(2, 10): Mean +- std dev: [main] 279 ns +- 9 ns -> [v3] 274 ns +- 11 ns: 1.02x faster
range(400): Mean +- std dev: [main] 59.5 ns +- 2.6 ns -> [v3] 37.8 ns +- 1.5 ns: 1.58x faster
iter(range(400)): Mean +- std dev: [main] 96.0 ns +- 4.0 ns -> [v3] 70.5 ns +- 3.3 ns: 1.36x faster
list(range(400)): Mean +- std dev: [main] 2.37 us +- 0.10 us -> [v3] 2.20 us +- 0.09 us: 1.08x faster
list(iter(range(400))): Mean +- std dev: [main] 2.48 us +- 0.10 us -> [v3] 2.32 us +- 0.12 us: 1.07x faster
iter(range(2, 400)): Mean +- std dev: [main] 105 ns +- 4 ns -> [v3] 103 ns +- 4 ns: 1.02x faster
list(iter(range(2, 400))): Mean +- std dev: [main] 2.51 us +- 0.14 us -> [v3] 2.38 us +- 0.12 us: 1.06x faster

Benchmark hidden because not significant (5): range(2, 10), iter(range(2, 10)), range(2, 400), for loop length 400, for loop range(2, 400)

Geometric mean: 1.11x faster

range(10): Mean +- std dev: [maint] 47.5 ns +- 2.1 ns -> [v3tt] 37.3 ns +- 1.8 ns: 1.27x faster
iter(range(10)): Mean +- std dev: [maint] 82.2 ns +- 3.5 ns -> [v3tt] 70.6 ns +- 2.7 ns: 1.16x faster
list(range(10)): Mean +- std dev: [maint] 152 ns +- 5 ns -> [v3tt] 131 ns +- 5 ns: 1.16x faster
list(iter(range(10))): Mean +- std dev: [maint] 200 ns +- 8 ns -> [v3tt] 186 ns +- 6 ns: 1.07x faster
range(2, 10): Mean +- std dev: [maint] 59.1 ns +- 2.6 ns -> [v3tt] 58.4 ns +- 2.8 ns: 1.01x faster
iter(range(2, 10)): Mean +- std dev: [maint] 92.8 ns +- 3.2 ns -> [v3tt] 93.8 ns +- 3.2 ns: 1.01x slower
list(iter(range(2, 10))): Mean +- std dev: [maint] 206 ns +- 8 ns -> [v3tt] 202 ns +- 10 ns: 1.02x faster
for loop length 10: Mean +- std dev: [maint] 311 ns +- 17 ns -> [v3tt] 288 ns +- 10 ns: 1.08x faster
for loop range(2, 10): Mean +- std dev: [maint] 285 ns +- 13 ns -> [v3tt] 278 ns +- 12 ns: 1.02x faster
range(400): Mean +- std dev: [maint] 81.6 ns +- 23.9 ns -> [v3tt] 38.3 ns +- 1.8 ns: 2.13x faster
iter(range(400)): Mean +- std dev: [maint] 115 ns +- 20 ns -> [v3tt] 72.5 ns +- 3.0 ns: 1.58x faster
list(range(400)): Mean +- std dev: [maint] 2.52 us +- 0.14 us -> [v3tt] 2.34 us +- 0.11 us: 1.08x faster
list(iter(range(400))): Mean +- std dev: [maint] 2.46 us +- 0.13 us -> [v3tt] 2.43 us +- 0.13 us: 1.01x faster
range(2, 400): Mean +- std dev: [maint] 70.6 ns +- 2.9 ns -> [v3tt] 73.9 ns +- 3.5 ns: 1.05x slower
list(iter(range(2, 400))): Mean +- std dev: [maint] 2.59 us +- 0.23 us -> [v3tt] 2.51 us +- 0.14 us: 1.03x faster
for loop range(2, 400): Mean +- std dev: [maint] 9.91 us +- 0.55 us -> [v3tt] 11.1 us +- 1.5 us: 1.12x slower

Benchmark hidden because not significant (2): iter(range(2, 400)), for loop length 400

Geometric mean: 1.11x faster

Contributor

@erlend-aasland erlend-aasland left a comment

LGTM. @serhiy-storchaka, any objections?

Member

@serhiy-storchaka serhiy-storchaka left a comment

Please repeat benchmarks with the current code.

@eendebakpt
Contributor Author

eendebakpt commented Feb 6, 2025

New benchmarks (Ubuntu, with the new tail calling interpreter, non-PGO):

range(1): Mean +- std dev: [main] 52.0 ns +- 1.1 ns -> [pr] 38.8 ns +- 0.9 ns: 1.34x faster
iter(range(1)): Mean +- std dev: [main] 99.4 ns +- 2.6 ns -> [pr] 88.2 ns +- 2.9 ns: 1.13x faster
list(range(1)): Mean +- std dev: [main] 177 ns +- 2 ns -> [pr] 160 ns +- 2 ns: 1.11x faster
iter(range(2, 1)): Mean +- std dev: [main] 101 ns +- 1 ns -> [pr] 103 ns +- 2 ns: 1.03x slower
for loop length 1: Mean +- std dev: [main] 155 ns +- 8 ns -> [pr] 149 ns +- 5 ns: 1.04x faster
range(10): Mean +- std dev: [main] 52.2 ns +- 1.2 ns -> [pr] 41.2 ns +- 2.8 ns: 1.27x faster
iter(range(10)): Mean +- std dev: [main] 99.5 ns +- 2.5 ns -> [pr] 86.9 ns +- 1.5 ns: 1.15x faster
list(range(10)): Mean +- std dev: [main] 226 ns +- 12 ns -> [pr] 201 ns +- 3 ns: 1.13x faster
iter(range(2, 10)): Mean +- std dev: [main] 113 ns +- 4 ns -> [pr] 110 ns +- 2 ns: 1.03x faster
for loop length 10: Mean +- std dev: [main] 332 ns +- 11 ns -> [pr] 326 ns +- 15 ns: 1.02x faster
range(100): Mean +- std dev: [main] 54.0 ns +- 2.2 ns -> [pr] 39.6 ns +- 1.4 ns: 1.36x faster
iter(range(100)): Mean +- std dev: [main] 98.0 ns +- 1.0 ns -> [pr] 88.4 ns +- 2.6 ns: 1.11x faster
for loop length 100: Mean +- std dev: [main] 2.09 us +- 0.11 us -> [pr] 1.99 us +- 0.11 us: 1.05x faster
range(400): Mean +- std dev: [main] 65.9 ns +- 2.4 ns -> [pr] 40.9 ns +- 1.2 ns: 1.61x faster
iter(range(400)): Mean +- std dev: [main] 118 ns +- 10 ns -> [pr] 92.5 ns +- 5.2 ns: 1.28x faster
list(range(400)): Mean +- std dev: [main] 3.05 us +- 0.13 us -> [pr] 3.19 us +- 0.31 us: 1.05x slower
range(2, 400): Mean +- std dev: [main] 73.9 ns +- 7.7 ns -> [pr] 71.2 ns +- 2.0 ns: 1.04x faster
iter(range(2, 400)): Mean +- std dev: [main] 132 ns +- 7 ns -> [pr] 124 ns +- 3 ns: 1.07x faster
for loop length 400: Mean +- std dev: [main] 9.37 us +- 0.34 us -> [pr] 9.54 us +- 0.25 us: 1.02x slower

Benchmark hidden because not significant (5): range(2, 1), range(2, 10), list(range(100)), range(2, 100), iter(range(2, 100))

Geometric mean: 1.10x faster

@erlend-aasland
Contributor

Pieter, can you fix the linting CI? Let's land this!

@erlend-aasland
Contributor

I am not convinced that this change is worth doing, but I will not object to committing it if another core developer likes it.

I understand your concern. For example, we did not optimise zip creation using vector calls, since normally the loop itself would eat up the small performance gain. However, contrary to vectorcall optimisations, this is a trivial modification. It may be worth it for smaller loops. Benchmarks for such cases could be helpful.
