Skip to content

Conversation

@nascheme
Copy link
Member

@nascheme nascheme commented Jan 22, 2025

This PR implements prefetching as suggested in the issue. The benchmarks that are part of pyperformance generally don't show much difference since they don't use enough memory to make prefetch effective.

Source code for the two "big" benchmarks. These create a fairly large object graph and then call gc.collect() to time it.

gc big tree
gc big

Updated benchmarks for commit abfc49a. The 5ff2fbc commit made previous versions of this PR look much worse on certain benchmarks. That's because that commit improves the "mark alive" pass so that it can find many more alive objects in some cases (fixes finding objects referred to only from stack).

I tried optimizing this PR so that it was at least as fast as current "main" when the number of objects was relatively small. However, I was not successful with that so I instead made it so that the prefetch approach is only used if the number of long-lived objects is over 200k. Based on tests on my Ryzen 5, that seems about the point were the prefetch approach is faster.

Run on a Macbook Pro M3, clang 19, PGO/LTO enabled. Both "base" and "prefetch" are configured with ./configure --enable-optimizations --disable-gil --with-lto.

benchmark build prefetch time [ms] speedup
gc_big base - 787 1.00
gc_big prefetch on 651 1.21
gc_big prefetch off 787 1.00
gc_big_tree base - 334 1.00
gc_big_tree prefetch on 293 1.14
gc_big_tree prefetch off 318 1.05
bm_gc_collect base - 1200 1.00
bm_gc_collect prefetch on 1199 1.00
bm_gc_collect prefetch off 1191 1.01
bm_gc_traversal base - 619 1.00
bm_gc_traversal prefetch on 620 1.00
bm_gc_traversal prefetch off 619 1.00

Benchmarks below run on a Ryzen 5 7600X.

benchmark build prefetch time [ms] speedup
gc_big base - 634 1.00
gc_big prefetch on 452 1.40
gc_big prefetch off 1013 0.63
gc_big_tree base - 425 1.00
gc_big_tree prefetch on 346 1.23
gc_big_tree prefetch off 502 0.85

pyperformance results comparing to merge base.

Some notes about the implementation:

  • On my Macbook M3, it seems that using PREFETCH_T0 is slightly faster whereas on my AMD Ryzen 5 desktop the PREFETCH_T1 is a bit faster. The difference is pretty small. There are also prefetch instructions that indicate you are going to write to the word (second parameter in the __builtin_prefetch() variation). If we are visiting a GC object for the first time, we will be writing to it to set the "alive" bit in the flags. However if it's not a GC object or we have visited it already, we don't write. So reads would seem quite a bit more common.
  • cross-platform prefetch preprocessor logic was inspired by the zstd source code. I haven't tested the Windows or __aarch64__ conditions of that block.
  • The BUFFER_SIZE, BUFFER_HI, and BUFFER_LO have been moderately tuned based on benchmarking on my Ryzen 5 machine. I'm kind of surprised that BUFFER_HI is so small but that's what works best. There could be some additional tuning done on the logic of when to put objects on the stack vs into the buffer, when to push a new "span" vs adding part of it to the buffer. I did what I thought was logical but didn't benchmark all different approaches.
  • I run some tests that recorded a trace of the prefetch and access (following pointer) and then computed some statistics from the trace log. Based on this, it seems like the prefetch buffer is working fairly well (the median delay between prefetch and access was 14 on one benchmark run, ASCII histogram below).
Prefetch window stats
mean 52.07285263157895
median 14.0
max 256
   25.60 |65,304  | ******************************
   51.20 |5,590   | **
   76.80 |3,562   | *
  102.40 |2,683   | *
  128.00 |2,278   | *
  153.60 |2,285   | *
  179.20 |2,377   | *
  204.80 |2,238   | *
  230.40 |2,753   | *
  256.00 |5,930   | **
-------- |------- | -------
      N= |95,000 
  • Having the specialized gc_mark_traverse_list and gc_mark_traverse_tuple functions seems like a win. Even if the program doesn't use many long lists and tuples it doesn't cost much to have them.
  • The _PyTuple_MaybeUntrack() call is a bit expensive and it gets done on every collection. I considered adding a _PyGC_BITS_NEW bit and set it on newly created tuples. Then we would only need to check a tuple once rather than once per collection. I tried disabling the tuple untracking but things got slower.
  • As currently implemented, all the object pointers revealed by tp_traverse go through the mark buffer/stack. That seems wasteful given that a good fraction of them will be non-GC objects and some of them are known things like None or True. I tried adding a check for BSS allocated objects (using __bss_start and end) but that doesn't seem like a win. Maybe with some more refinement this would work. We could allocate a bunch of known non-GC objects into a certain region of memory and then gc_propagate_alive() could just quickly skip them based on the pointer address.
  • I considered adding a new number field to the type object, to allow gc_propagate_alive() to use a switch statement to use the correct traverse function. Something like switch (Py_TYPE(ob)->tp_gc_ordinal) where the tp_gc_ordinal value on known types is set on startup. That way, the switch statement can use an efficient jump table or similar. However, with only list and tuple specialized at this time, it doesn't seem worth it.

@nascheme nascheme added type-feature A feature request or enhancement performance Performance or resource usage topic-free-threading labels Jan 22, 2025
When traversing a list or tuple, use a "span" if the buffer can't hold
all the items from the collection.  This reduces the size of the object
stack needed if large collections are encountered.  It also helps
keeps the buffer size optimal for prefetching.
@nascheme nascheme force-pushed the gh-129201-gc-mark-prefetch branch from fe9898a to 7f51104 Compare January 24, 2025 07:49
It's possible for lists or tuples to have a NULL item.  Handle that
in the case that all item elements fit into the buffer.
@nascheme
Copy link
Member Author

Benchmarks for an earlier version of the PR, without the "span" for lists/tuple.

The benchmark results below come from my own workstation (AMD Ryzen 5 7600X, Linux, GCC 12.2.0). I'm using --enable-optimizations but LTO is turned off. The "prefetch off" means the prefetch CPU instruction is omitted, otherwise code is the same.

There is something bad happening with the default build GC is gc.disable() is not used when creating a big data structure. When enabled, the "gc big" benchmark takes 5x as long. My guess is that the generation 0 collections are shuffling the objects in the gc next/prev linked list but that's only a guess.

The "bm_gc_collect" benchmark was taken from pyperformance and the constants adjusted: CYCLES = 100_000 and LINKS = 40.

The "prefetch (7f756eb0)" code branch is essentially the same as this PR (1b4e8c3). I just rebased it on the current main and removed some dead code.

code branch build prefetch on benchmark time [ms]
benchmark: bm_async_tree_io_tg
prefetch (7f756eb0) nogil yes bm_async_tree_io_tg 6,752
prefetch (7f756eb0) nogil no bm_async_tree_io_tg 6,964
main (3829104) nogil - bm_async_tree_io_tg 6,960
main (3829104) default - bm_async_tree_io_tg 7,323
benchmark: bm_gc_collect
prefetch (7f756eb0) nogil yes bm_gc_collect 1,418
prefetch (7f756eb0) nogil no bm_gc_collect 1,517
main (3829104) nogil - bm_gc_collect 1,473
main (3829104) default - bm_gc_collect 20,966
benchmark: gc big
prefetch (7f756eb0) nogil yes gc big 587
prefetch (7f756eb0) nogil no gc big 1,064
main (3829104) nogil - gc big 963
main (3829104) default - gc big 5,785
benchmark: gc big tree
prefetch (7f756eb0) nogil yes gc big tree 363
prefetch (7f756eb0) nogil no gc big tree 519
main (3829104) nogil - gc big tree 723
main (3829104) default - gc big tree 3,443
benchmark: gc big (gc disabled)
main (3829104) default - gc big (gc disabled) 641

@nascheme nascheme marked this pull request as ready for review January 27, 2025 20:49
@corona10
Copy link
Member

corona10 commented Jan 29, 2025

Just saying this in passing, I recommend using the word "freethreaded" intead of "nogil". :)

@nascheme nascheme requested a review from mpage January 30, 2025 21:12
Need to clear the "alive" bit.
@nascheme nascheme marked this pull request as draft January 31, 2025 08:58
@nascheme
Copy link
Member Author

Merging with 'main' has pretty significantly impacted the gc_traversal benchmark, for the worse. I'll do some investigation on that. Something to do with 5ff2fbc would be my guess.

Using the prefetch buffer only helps if there are enough objects.  Use
the long-lived count to decide if it's worth enabling.  If not, fallback
to the previous method of marking objects alive (dereference object
pointers as we encounter them).

Improve code by adding some additional helper functions, adding comments
and general tidying.  The buffer logic has been changed to use a mask
for size rather than the % operator.  Some other small optimizations
that only help a little.
@nascheme nascheme marked this pull request as ready for review February 3, 2025 21:18
Copy link
Contributor

@mpage mpage left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. The use of spans is a nice touch :) I'd love to see this on a real world application with a larger heap than what's currently available in the benchmark suite. Might also be good to get another set of eyes who are familiar with the GC (Dino or Sam?) to take a look as well.

@nascheme nascheme merged commit cdcacec into python:main Feb 5, 2025
42 checks passed
@bedevere-bot
Copy link

⚠️⚠️⚠️ Buildbot failure ⚠️⚠️⚠️

Hi! The buildbot aarch64 RHEL8 LTO 3.x has failed when building commit cdcacec.

What do you need to do:

  1. Don't panic.
  2. Check the buildbot page in the devguide if you don't know what the buildbots are or how they work.
  3. Go to the page of the buildbot that failed (https://buildbot.python.org/#/builders/338/builds/7997) and take a look at the build logs.
  4. Check if the failure is related to this commit (cdcacec) or if it is a false positive.
  5. If the failure is related to this commit, please, reflect that on the issue and make a new Pull Request with a fix.

You can take a look at the buildbot page here:

https://buildbot.python.org/#/builders/338/builds/7997

Summary of the results of the build (if available):

==

Click to see traceback logs
Traceback (most recent call last):
  File "/home/buildbot/buildarea/3.x.cstratak-RHEL8-aarch64.lto/build/Lib/threading.py", line 1054, in _bootstrap_inner
    self.run()
    ~~~~~~~~^^
  File "/home/buildbot/buildarea/3.x.cstratak-RHEL8-aarch64.lto/build/Lib/threading.py", line 996, in run
    self._target(*self._args, **self._kwargs)
    ~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/buildbot/buildarea/3.x.cstratak-RHEL8-aarch64.lto/build/Lib/test/test_interpreters/test_stress.py", line 47, in run
    interp = interpreters.create()
  File "/home/buildbot/buildarea/3.x.cstratak-RHEL8-aarch64.lto/build/Lib/test/support/interpreters/__init__.py", line 76, in create
    id = _interpreters.create(reqrefs=True)
interpreters.InterpreterError: interpreter creation failed
k

srinivasreddy pushed a commit to srinivasreddy/cpython that referenced this pull request Feb 7, 2025
For the free-threaded version of the cyclic GC, restructure the "mark alive" phase to use software prefetch instructions.  This gives a speedup in most cases when the number of objects is large enough.  The prefetching is enabled conditionally based on the number of long-lived objects the GC finds.
cmaloney pushed a commit to cmaloney/cpython that referenced this pull request Feb 8, 2025
For the free-threaded version of the cyclic GC, restructure the "mark alive" phase to use software prefetch instructions.  This gives a speedup in most cases when the number of objects is large enough.  The prefetching is enabled conditionally based on the number of long-lived objects the GC finds.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

performance Performance or resource usage topic-free-threading type-feature A feature request or enhancement

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants