gh-129201: Use prefetch in GC mark alive phase. #129203
When traversing a list or tuple, use a "span" if the buffer can't hold all the items from the collection. This reduces the size of the object stack needed if large collections are encountered. It also helps keep the buffer size optimal for prefetching.
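To make the "span" idea concrete, here is a minimal sketch of a stack whose entries are ranges of items rather than individual objects. It is not the PR's code: `obj_t`, `span_t`, `span_stack_t`, `span_push`, `span_take` and `SPAN_STACK_MAX` are all hypothetical names chosen for illustration, and `void *` stands in for `PyObject *`.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for PyObject *. */
typedef void *obj_t;

/* One stack entry describes a half-open range [start, end) of items
   that still have to be visited. */
typedef struct {
    obj_t *start;   /* next item to visit */
    obj_t *end;     /* one past the last item */
} span_t;

#define SPAN_STACK_MAX 64   /* illustrative fixed capacity */

typedef struct {
    span_t spans[SPAN_STACK_MAX];
    size_t depth;
} span_stack_t;

/* Record a whole collection as one entry: constant space no matter
   how many items the list or tuple holds. Returns -1 if the stack is full. */
static int span_push(span_stack_t *s, obj_t *items, size_t n)
{
    if (n == 0) {
        return 0;               /* nothing to visit */
    }
    if (s->depth == SPAN_STACK_MAX) {
        return -1;
    }
    s->spans[s->depth].start = items;
    s->spans[s->depth].end = items + n;
    s->depth++;
    return 0;
}

/* Move up to `max` items from the top spans into `out`, skipping NULL
   items. A span is popped only once it is exhausted, so a large
   collection is consumed a buffer-full at a time. */
static size_t span_take(span_stack_t *s, obj_t *out, size_t max)
{
    size_t n = 0;
    while (s->depth > 0 && n < max) {
        span_t *top = &s->spans[s->depth - 1];
        while (top->start < top->end && n < max) {
            obj_t item = *top->start++;
            if (item != NULL) {
                out[n++] = item;
            }
        }
        if (top->start == top->end) {
            s->depth--;
        }
    }
    return n;
}
```

The design point is that the stack cost of a collection no longer depends on its length, and the consumer decides how many items to pull at once, which is what lets the prefetch buffer stay at its preferred fill level.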
fe9898a to 7f51104
It's possible for lists or tuples to have a NULL item. Handle that in the case that all item elements fit into the buffer.
Benchmarks for an earlier version of the PR, without the "span" for lists/tuple. The benchmark results below come from my own workstation (AMD Ryzen 5 7600X, Linux, GCC 12.2.0). I'm using There is something bad happening with the default build GC is The "bm_gc_collect" benchmark was taken from pyperformance and the constants adjusted: The "prefetch (7f756eb0)" code branch is essentially the same as this PR (1b4e8c3). I just rebased it on the current main and removed some dead code.
Just saying this in passing, I recommend using the word "freethreaded" instead of "nogil". :)
Need to clear the "alive" bit.
Merging with 'main' has pretty significantly impacted the gc_traversal benchmark, for the worse. I'll do some investigation on that. Something to do with 5ff2fbc would be my guess.
Using the prefetch buffer only helps if there are enough objects. Use the long-lived count to decide if it's worth enabling. If not, fall back to the previous method of marking objects alive (dereference object pointers as we encounter them). Improve code by adding some additional helper functions, adding comments and general tidying. The buffer logic has been changed to use a mask for size rather than the % operator. Some other small optimizations that only help a little.
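The mask-versus-`%` change can be illustrated with a small FIFO ring buffer. This is only a sketch under the assumption that the buffer size is a power of two; `ring_t`, `ring_push`, `ring_pop`, `ring_len`, `RING_SIZE` and `RING_MASK` are hypothetical names, not the identifiers used in the PR.

```c
#include <assert.h>
#include <stddef.h>

#define RING_SIZE 256u                 /* assumed to be a power of two */
#define RING_MASK (RING_SIZE - 1u)     /* low bits select the slot */

typedef struct {
    void *slots[RING_SIZE];
    unsigned int in;    /* total number of pushes so far */
    unsigned int out;   /* total number of pops so far */
} ring_t;

/* Number of queued entries; unsigned subtraction stays correct even
   after the counters wrap around. */
static unsigned int ring_len(const ring_t *r)
{
    return r->in - r->out;
}

static void ring_push(ring_t *r, void *p)
{
    assert(ring_len(r) < RING_SIZE);
    /* `in & RING_MASK` equals `in % RING_SIZE` for a power-of-two size,
       but needs no division. */
    r->slots[r->in & RING_MASK] = p;
    r->in++;
}

static void *ring_pop(ring_t *r)
{
    assert(ring_len(r) > 0);
    void *p = r->slots[r->out & RING_MASK];
    r->out++;
    return p;
}
```

Keeping free-running counters and masking only at the point of access also makes the "how full is the buffer" check a single subtraction.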
LGTM. The use of spans is a nice touch :) I'd love to see this on a real world application with a larger heap than what's currently available in the benchmark suite. Might also be good to get another set of eyes who are familiar with the GC (Dino or Sam?) to take a look as well.
For the free-threaded version of the cyclic GC, restructure the "mark alive" phase to use software prefetch instructions. This gives a speedup in most cases when the number of objects is large enough. The prefetching is enabled conditionally based on the number of long-lived objects the GC finds.
This PR implements prefetching as suggested in the issue. The benchmarks that are part of pyperformance generally don't show much difference since they don't use enough memory to make prefetch effective.
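As a rough illustration of why prefetching needs a restructured marking loop, the sketch below marks a toy object graph. A pointer is prefetched when it is first seen and only dereferenced later, after it has spent some time in a small FIFO buffer. Everything here is an assumption for illustration: `node_t`, `marker_t`, `enqueue`, `mark_alive`, `BUF_SIZE`, `STACK_MAX` and the `PREFETCH` macro are hypothetical, and the real code's `BUFFER_HI`/`BUFFER_LO` thresholds and spans are left out.

```c
#include <assert.h>
#include <stddef.h>

/* Toy object: an "alive" flag plus outgoing references. */
typedef struct node {
    int alive;
    size_t n_children;
    struct node **children;
} node_t;

#define BUF_SIZE 16u                /* power of two, illustrative only */
#define BUF_MASK (BUF_SIZE - 1u)
#define STACK_MAX 1024              /* illustrative fixed capacity */

#if defined(__GNUC__) || defined(__clang__)
/* Second argument 0 = prefetch for read; third argument 3 = keep in
   all cache levels (highest temporal locality). */
#define PREFETCH(p) __builtin_prefetch((p), 0, 3)
#else
#define PREFETCH(p) ((void)(p))
#endif

/* Must be zero-initialized before use. */
typedef struct {
    node_t *buf[BUF_SIZE];          /* prefetched, not yet dereferenced */
    unsigned int in, out;
    node_t *stack[STACK_MAX];       /* overflow, not yet prefetched */
    size_t sp;
} marker_t;

static void enqueue(marker_t *m, node_t *n)
{
    if (n == NULL) {
        return;
    }
    if (m->in - m->out < BUF_SIZE) {
        PREFETCH(n);                        /* start the load now */
        m->buf[m->in++ & BUF_MASK] = n;     /* dereference it later */
    }
    else {
        assert(m->sp < STACK_MAX);
        m->stack[m->sp++] = n;
    }
}

/* Mark everything reachable from root; returns how many nodes were
   newly marked alive. */
static size_t mark_alive(marker_t *m, node_t *root)
{
    size_t marked = 0;
    enqueue(m, root);
    for (;;) {
        /* Top up the buffer from the overflow stack so that loads
           stay in flight. */
        while (m->sp > 0 && m->in - m->out < BUF_SIZE) {
            enqueue(m, m->stack[--m->sp]);
        }
        if (m->in == m->out) {
            break;                          /* buffer and stack empty */
        }
        node_t *n = m->buf[m->out++ & BUF_MASK];
        if (n->alive) {
            continue;                       /* already visited */
        }
        n->alive = 1;
        marked++;
        for (size_t i = 0; i < n->n_children; i++) {
            enqueue(m, n->children[i]);
        }
    }
    return marked;
}
```

On a graph this small the prefetch has no measurable effect; the benefit described in the PR appears only when the heap is large enough that most dereferences would otherwise miss the cache.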
Source code for the two "big" benchmarks. These create a fairly large object graph and then call gc.collect() to time it.
gc big tree
gc big
Updated benchmarks for commit abfc49a. The 5ff2fbc commit made previous versions of this PR look much worse on certain benchmarks. That's because that commit improves the "mark alive" pass so that it can find many more alive objects in some cases (fixes finding objects referred to only from stack).
I tried optimizing this PR so that it was at least as fast as current "main" when the number of objects was relatively small. However, I was not successful with that, so I instead made it so that the prefetch approach is only used if the number of long-lived objects is over 200k. Based on tests on my Ryzen 5, that seems to be about the point where the prefetch approach becomes faster.
Run on a MacBook Pro M3, clang 19, PGO/LTO enabled. Both "base" and "prefetch" are configured with ./configure --enable-optimizations --disable-gil --with-lto.
Benchmarks below run on a Ryzen 5 7600X.
pyperformance results comparing to merge base.
Some notes about the implementation:
- [...] PREFETCH_T0 is slightly faster whereas on my AMD Ryzen 5 desktop the PREFETCH_T1 is a bit faster. The difference is pretty small. There are also prefetch instructions that indicate you are going to write to the word (second parameter in the __builtin_prefetch() variation). If we are visiting a GC object for the first time, we will be writing to it to set the "alive" bit in the flags. However, if it's not a GC object or we have visited it already, we don't write. So reads would seem quite a bit more common.
- [...] __aarch64__ conditions of that block.
- [...] BUFFER_SIZE, BUFFER_HI, and BUFFER_LO have been moderately tuned based on benchmarking on my Ryzen 5 machine. I'm kind of surprised that BUFFER_HI is so small but that's what works best. There could be some additional tuning done on the logic of when to put objects on the stack vs into the buffer, and when to push a new "span" vs adding part of it to the buffer. I did what I thought was logical but didn't benchmark all the different approaches.
- [...] gc_mark_traverse_list and gc_mark_traverse_tuple functions seems like a win. Even if the program doesn't use many long lists and tuples, it doesn't cost much to have them.
- [...] _PyTuple_MaybeUntrack() call is a bit expensive and it gets done on every collection. I considered adding a _PyGC_BITS_NEW bit and setting it on newly created tuples. Then we would only need to check a tuple once rather than once per collection. I tried disabling the tuple untracking but things got slower.
- [...] tp_traverse go through the mark buffer/stack. That seems wasteful given that a good fraction of them will be non-GC objects and some of them are known things like None or True. I tried adding a check for BSS allocated objects (using __bss_start and end) but that doesn't seem like a win. Maybe with some more refinement this would work. We could allocate a bunch of known non-GC objects into a certain region of memory and then gc_propagate_alive() could just quickly skip them based on the pointer address.
- [...] gc_propagate_alive() to use a switch statement to use the correct traverse function. Something like switch (Py_TYPE(ob)->tp_gc_ordinal) where the tp_gc_ordinal value on known types is set on startup. That way, the switch statement can use an efficient jump table or similar. However, with only list and tuple specialized at this time, it doesn't seem worth it.
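The switch-on-ordinal idea from the last note could look roughly like the following toy sketch. Note that `tp_gc_ordinal` is only a proposal in the note and does not exist in CPython; the `toy_type_t`, `toy_object_t`, `gc_ordinal` field, `propagate` function and the three traverse functions below are all hypothetical stand-ins, and the return values exist only so the dispatch can be observed.

```c
#include <assert.h>
#include <stddef.h>

/* Small dense integers assigned to known types at startup, so the
   compiler can lower the switch to a jump table. */
enum gc_ordinal {
    GC_ORD_OTHER = 0,
    GC_ORD_LIST = 1,
    GC_ORD_TUPLE = 2
};

typedef struct toy_type {
    int gc_ordinal;                 /* stand-in for the proposed tp_gc_ordinal */
    int (*traverse)(void *ob);      /* generic fallback, in the role of tp_traverse */
} toy_type_t;

typedef struct {
    const toy_type_t *type;
} toy_object_t;

/* Each function returns a distinct tag so the chosen path is visible. */
static int traverse_list(void *ob)    { (void)ob; return 1; }
static int traverse_tuple(void *ob)   { (void)ob; return 2; }
static int traverse_generic(void *ob) { (void)ob; return 3; }

/* Direct calls for specialized types, indirect call for the rest. */
static int propagate(toy_object_t *ob)
{
    switch (ob->type->gc_ordinal) {
    case GC_ORD_LIST:
        return traverse_list(ob);
    case GC_ORD_TUPLE:
        return traverse_tuple(ob);
    default:
        return ob->type->traverse(ob);
    }
}
```

With only two specialized cases, a pair of pointer comparisons against the list and tuple type objects does the same job, which is consistent with the note's conclusion that the switch is not worth it yet.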