
Conversation

@mdickinson (Member) commented Jan 2, 2023

Addresses #100687; see the issue for discussion and rationale.

This PR does some extra work to get a tighter upper bound on the number of digits needed for the addition of two like-signed multidigit ints (or the subtraction of two oppositely-signed multidigit ints).
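
For context (this is not the C code in the PR, just a rough Python sketch of the kind of bound check involved, assuming CPython's usual 30-bit digits, little-endian digit order, and illustrative names):

PyLong_SHIFT = 30                     # assumption: a build with 30-bit digits
PyLong_MASK = (1 << PyLong_SHIFT) - 1

def digits_needed_upper_bound(a_digits, b_digits):
    """Upper bound on the result size when adding two like-signed ints.

    a_digits and b_digits are little-endian lists of 30-bit digits with
    len(a_digits) >= len(b_digits) >= 2.  The naive bound is
    len(a_digits) + 1; the extra digit is only needed when a carry can
    propagate out of the top digit.
    """
    size_a, size_b = len(a_digits), len(b_digits)
    if size_a == size_b:
        # Worst case: the two top digits plus a carry-in of 1 from below.
        may_carry_out = a_digits[-1] + b_digits[-1] + 1 > PyLong_MASK
    else:
        # b never reaches a's top digit directly; only a carry-in of 1 can.
        may_carry_out = a_digits[-1] == PyLong_MASK
    return size_a + 1 if may_carry_out else size_a

A check of this shape is conservative: it can still overallocate when the pessimistically assumed carry-in never materialises, but it never underallocates.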

Here's a benchmark script, tracking both memory allocations and performance in the most significant case, where both addends have exactly two digits (single-digit ints take a fast path and don't use x_add):

import random
import timeit
import tracemalloc

def sum_pairs(pairs):
    for x, y in pairs:
        x + y

int_samples = 10**5
timeit_samples = 1000
seed = 12345

# Create test pairs, keeping track of memory allocations
tracemalloc.start()
snapshot1 = tracemalloc.take_snapshot()
random.seed(seed)
test_pairs = [
    (random.randrange(2**30, 2**60), random.randrange(2**30, 2**60))
    for _ in range(int_samples)
]
snapshot2 = tracemalloc.take_snapshot()
stats = snapshot2.compare_to(snapshot1, "lineno")
tracemalloc.stop()

# Get iteration time
time = timeit.timeit(stmt="sum_pairs(test_pairs)", globals=globals(), number=timeit_samples) / (timeit_samples * int_samples)

# Report
for stat in stats:
    print(stat)
print(f"Time per iteration (ns): {time * 1e9:.1f}")

Some sample results on my macOS / Intel machine (cpython is main, cpython-modified is this branch; both were built with --enable-optimizations). Surprisingly, I actually get a minor speed increase where I was expecting a decrease.

mdickinson@lovelace python % ./cpython/python.exe time_int_sum.py         
/Users/mdickinson/Repositories/python/cpython/Lib/random.py:314: size=7031 KiB (+7031 KiB), count=200000 (+200000), average=36 B
/Users/mdickinson/Repositories/python/time_int_sum.py:18: size=6250 KiB (+6250 KiB), count=99984 (+99984), average=64 B
/Users/mdickinson/Repositories/python/cpython/Lib/tracemalloc.py:560: size=312 B (+312 B), count=2 (+2), average=156 B
/Users/mdickinson/Repositories/python/cpython/Lib/tracemalloc.py:423: size=312 B (+312 B), count=2 (+2), average=156 B
/Users/mdickinson/Repositories/python/cpython/Lib/tracemalloc.py:558: size=56 B (+56 B), count=1 (+1), average=56 B
Time per iteration (ns): 39.7
mdickinson@lovelace python % ./cpython-modified/python.exe time_int_sum.py
/Users/mdickinson/Repositories/python/cpython-modified/Lib/random.py:314: size=6250 KiB (+6250 KiB), count=200000 (+200000), average=32 B
/Users/mdickinson/Repositories/python/time_int_sum.py:18: size=6250 KiB (+6250 KiB), count=99984 (+99984), average=64 B
/Users/mdickinson/Repositories/python/cpython-modified/Lib/tracemalloc.py:560: size=312 B (+312 B), count=2 (+2), average=156 B
/Users/mdickinson/Repositories/python/cpython-modified/Lib/tracemalloc.py:423: size=312 B (+312 B), count=2 (+2), average=156 B
/Users/mdickinson/Repositories/python/cpython-modified/Lib/tracemalloc.py:558: size=56 B (+56 B), count=1 (+1), average=56 B
Time per iteration (ns): 37.5

To do:

  • More benchmarking, particularly the case of two digits + single digit; we should also check subtractions
  • Verify that there are unit tests that exercise the case where we still overallocate (one such input pair is sketched below)
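
For illustration only (not a test from this PR): assuming 30-bit digits and a conservative check along the lines sketched above, a pair of two-digit ints whose top digits sum to exactly PyLong_MASK still overallocates, because the check pessimistically assumes a carry-in from the low digits that never happens.

# Hypothetical inputs, assuming a build with 30-bit digits.
a = (2**29) * 2**30        # two digits: high digit 2**29, low digit 0
b = (2**29 - 1) * 2**30    # two digits: high digit 2**29 - 1, low digit 0

# The high digits sum to 2**30 - 1 == PyLong_MASK, so the conservative
# check reserves an extra digit, but the low digits generate no carry and
# the sum still fits in two 30-bit digits: an overallocation.
assert (a + b).bit_length() == 60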

@eendebakpt (Contributor) commented:

I tested the PR with the pidigits benchmark (from the pyperformance package). Performance impact is neutral:

Mean +- std dev: [main] 209 ms +- 1 ms -> [pr] 209 ms +- 1 ms: 1.00x slower
Not significant!

(Some profiling with valgrind shows that only 11% of the time is spent in x_add, and within x_add most time is spent in the for loops performing the actual additions.)

In the pidigits calculation there are roughly 156200 calls to x_add, of which only 760 set extra_digit (and none of those were false positives).

@mdickinson mdickinson marked this pull request as draft October 25, 2023 07:59
@mdickinson (Member, Author) commented:

This needs a couple more changes as a result of the recent changes to the internal long object layout. Converting to draft while I investigate.

@mdickinson (Member, Author) commented:

> Converting to draft while I investigate.

Should be good to go now. I've removed the special-case handling of normalization: it seems safer to call long_normalize just like everything else does, in case normalization rules change in the future.
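
For readers less familiar with the internals: normalization simply trims high-order zero digits so that the stored size matches the value. Conceptually (a Python sketch, not the actual C code), a generic normalize step handles any number of leading zero digits rather than special-casing a single possibly-zero top digit:

def normalize(digits):
    # Conceptual sketch of what long_normalize does: drop high-order
    # zero digits from a little-endian digit list.
    n = len(digits)
    while n > 0 and digits[n - 1] == 0:
        n -= 1
    return digits[:n]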

@mdickinson mdickinson marked this pull request as ready for review October 25, 2023 09:06
@mdickinson (Member, Author) commented Oct 25, 2023

@eendebakpt Thank you for looking into this, and for the suggestions and test! I fear that my changes have invalidated your benchmarks, but I suspect the take-away that performance isn't significantly affected either way remains valid.

@eendebakpt (Contributor) commented:

> @eendebakpt Thank you for looking into this, and for the suggestions and test! I fear that my changes have invalidated your benchmarks, but I suspect the take-away that performance isn't significantly affected either way remains valid.

The pidigits benchmark results are indeed unaffected. The computation time is dominated by the actual addition (size_a is large for the additions in pidigits). The gain of this PR is in the slightly improved memory allocation for small size_a >= 2.

@serhiy-storchaka (Member) left a comment:

LGTM.
