Commit 31a500a (parent e414a2d)

Add internal docs about the free-threaded GC. (gh-132562)

1 file changed: InternalDocs/garbage_collector.md (181 additions, 9 deletions)
@@ -477,17 +477,24 @@ specifically in a generation by calling `gc.collect(generation=NUM)`.
```


Optimization: excluding reachable objects
=========================================

An object cannot be garbage if it can be reached. To avoid having to identify
reference cycles across the whole heap, we can reduce the amount of work done
considerably by first identifying objects reachable from objects known to be
alive. These objects are excluded from the normal cyclic detection process.

The default and free-threaded builds both implement this optimization, but in
slightly different ways.

Finding reachable objects for the default build GC
--------------------------------------------------

This works by first moving most reachable objects to the `visited` space.
Empirically, most reachable objects can be reached from a small set of global
objects and local variables. This step does much less work per object, so it
reduces the time spent performing garbage collection by at least half.

> [!NOTE]
> Objects that are not determined to be reachable by this pass are not necessarily
@@ -515,6 +522,171 @@ increment. All objects directly referred to from those stack frames are
added to the working set.
Then the above algorithm is repeated, starting from step 2.

Finding reachable objects for the free-threaded GC
--------------------------------------------------

Within the `gc_free_threading.c` implementation, this is known as the "mark
alive" pass or phase. It is similar in concept to what is done for the default
build GC. Rather than moving objects between doubly linked lists, the
free-threaded GC uses a flag in `ob_gc_bits` to track whether an object has
been found to be definitely alive (not garbage).

To find objects reachable from known alive objects, known as the "roots", the
`gc_mark_alive_from_roots()` function is used. Root objects include
`interp->sysdict` (the `sys` module dictionary), `interp->builtins`, and
`interp->types`. Also included are all objects referred to by active Python
frames. These objects and the transitive closure of objects reachable from
them have `_PyGC_BITS_ALIVE` set. Any object with that bit set is excluded
from the rest of the cyclic garbage detection process, since we know it cannot
be garbage.
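
As a minimal sketch of the idea (using a toy object type and a made-up bit
value rather than the real `PyObject`, `ob_gc_bits`, and `_PyGC_BITS_ALIVE`,
and recursion where the real code is iterative):

```c
#include <stddef.h>

/* Toy stand-in for PyObject: a flags byte plus outgoing references. */
typedef struct obj {
    unsigned char gc_bits;      /* stand-in for ob_gc_bits */
    size_t n_refs;
    struct obj **refs;          /* objects directly reachable from this one */
} obj;

#define GC_BITS_ALIVE 0x01      /* hypothetical bit value */

/* Mark everything reachable from one root as definitely alive. */
static void
mark_alive(obj *op)
{
    if (op == NULL || (op->gc_bits & GC_BITS_ALIVE)) {
        return;                 /* already known to be alive */
    }
    op->gc_bits |= GC_BITS_ALIVE;
    for (size_t i = 0; i < op->n_refs; i++) {
        mark_alive(op->refs[i]);
    }
}
```

Objects with the bit set are then simply skipped by the later stages of the
collection; only the remainder of the heap needs cycle detection.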

> [!NOTE]
> If the `gc.freeze()` function has been used, this phase of the collector is
> skipped. That is done for two reasons. First, it is unlikely to be a
> performance win if most of the objects have been marked as frozen. As such,
> they would be excluded from the cyclic garbage detection without this extra
> work anyhow. Second, one of the purposes of using `gc.freeze()` is to avoid
> dirtying the memory pages holding frozen objects. If this phase were executed,
> the toggling of the `ob_gc_bits` flags would dirty pages and defeat that.

Software prefetch hinting
-------------------------

To speed up the "mark alive" phase of the free-threaded GC, an additional
optimization, known as software prefetching, is used. The GC executes
explicit prefetch CPU instructions in order to reduce the latency due to
loading data from main memory. This added complexity can pay off since main
memory is so much slower to access than the CPU cache. Prefetching is
enabled only if the number of long-lived objects exceeds a threshold. If
the set of objects being garbage collected is small, the accessed memory is
likely to fit entirely in the CPU cache and software prefetch is not helpful.

The details of this optimization are intricate, with the source code being the
best reference. However, the rest of this section gives a high-level
description of how it works and explains some design decisions.

Software prefetching is used only during the "mark alive" phase of the GC,
specifically while computing the transitive closure of the "alive" status of
objects (that is, objects reachable from known alive objects, the roots). For
each object we find, we need to traverse all objects directly reachable from
that object. If that set of referred objects is in scattered locations of
memory, the hardware prefetcher is unlikely to predict the next accessed
memory location.

Making software prefetch work well hinges on a key principle: allow enough time
between issuing the prefetch instruction for a memory location and actually
accessing that location's data. We can call that time difference the prefetch
window. If the window is too large, we fill up the CPU caches with data that's
not needed yet. Worse, the data might be discarded from the cache before we
actually use it. If the window is too small then the memory system does not
have enough time to finish loading the memory and the CPU will have to wait.
The window is indirectly tuned by the prefetch buffer parameters. The buffer
will be explained next.
587+
The prefetch buffer is a FIFO queue of fixed size. When we enqueue an object
588+
reference into the buffer, we also issue a software prefetch instruction for
589+
that memory location. When we dequeue an object reference from the buffer, we
590+
assume or hope that enough time has elapsed so that the memory has been loaded
591+
into the cache. This is the mechanism that provides the window.
592+
593+
When performing the transitive closure of "alive" status, the set of objects
594+
yet to visit are stored in one of two places. First, they can be stored in the
595+
prefech buffer. Second, there is a LIFO stack, of unlimited size. When object
596+
references are found using `tp_traverse`, they are enqueued in the buffer if
597+
it is not full, otherwise they are pushed to the stack.
598+

We must take special care not to access the memory referred to by an object
pointer until we take that reference out of the prefetch buffer. That means we
cannot check if an object is a "container" (if the `PyObject_IS_GC()` test is
true) or if the object already has the "alive" flag set. Both of those tests
would require that the object memory is accessed. There are some additional
elaborations that try to keep the buffer filled to the optimal size, but to
keep things simple we omit those details here. Not discussed in detail are
"spans", which help reduce the amount of stack used and make it easier to
control the size of the buffer.
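
A sketch of the processing loop, building on the toy types above (the helper
names `is_gc_container()`, `obj_stack`, `stack_push()`, `stack_pop()`, and
`buffer_prime()` are hypothetical stand-ins; `buffer_prime()` and the
thresholds are sketched after the next two paragraphs):

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical unlimited LIFO stack of object pointers. */
typedef struct { obj **items; size_t len; size_t cap; } obj_stack;
void stack_push(obj_stack *s, obj *op);     /* grows as needed */
obj *stack_pop(obj_stack *s);
bool is_gc_container(const obj *op);        /* like PyObject_IS_GC() */

#define BUFFER_LO 8                         /* low threshold, see below */
void buffer_prime(prefetch_buffer *buf, obj_stack *stack);

static void
mark_alive_transitive(prefetch_buffer *buf, obj_stack *stack)
{
    while (buffer_len(buf) > 0) {
        obj *op = buffer_dequeue(buf);

        /* Only after dequeue is it safe (cheap) to touch the object. */
        if (!is_gc_container(op) || (op->gc_bits & GC_BITS_ALIVE)) {
            continue;
        }
        op->gc_bits |= GC_BITS_ALIVE;

        /* New references go to the buffer if it has room, else the stack. */
        for (size_t i = 0; i < op->n_refs; i++) {
            if (buffer_len(buf) < BUFFER_SIZE) {
                buffer_enqueue(buf, op->refs[i]);
            }
            else {
                stack_push(stack, op->refs[i]);
            }
        }

        /* Refill from the stack when the buffer runs low. */
        if (buffer_len(buf) < BUFFER_LO && stack->len > 0) {
            buffer_prime(buf, stack);
        }
    }
}
```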

As mentioned, the prefetch window is the time delay between the issue of the
prefetch instruction, on buffer enqueue, and the memory access, after buffer
dequeue. It is tuned by adjusting some buffer parameters. If processing time
were equal for every object then the buffer length would be proportional to the
window. Since processing each object can actually take a variable amount of
time, the relation between the buffer length and the prefetch window is only
approximate. However, this proportional relationship is assumed to hold true
and we avoid the overhead of actually measuring object processing times.

The relevant parameters are the maximum buffer size and the low and high
thresholds for filling. The buffer parameters are set as follows: maximum
length is 256, low threshold is 8, high threshold is 16. These parameters are
used as follows. If the buffer has reached the maximum size, new object
pointers found while following references are pushed to the stack rather than
put in the buffer. When dequeuing objects from the buffer, we "prime" the
buffer if its current length drops below the low threshold. Priming means
popping objects from the stack and enqueueing them into the buffer. While
priming, we fill it only until the high threshold is reached.
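
A sketch of priming under those parameters (`BUFFER_HI` and `buffer_prime()`
are the hypothetical names used in the earlier sketch, not the real
identifiers):

```c
#define BUFFER_HI 16    /* stop priming once this length is reached */

/* Pop objects from the stack and enqueue (and so prefetch) them until
 * the buffer is back up to the high threshold. Stopping well short of
 * BUFFER_SIZE keeps the prefetch window from growing too large. */
void
buffer_prime(prefetch_buffer *buf, obj_stack *stack)
{
    while (buffer_len(buf) < BUFFER_HI && stack->len > 0) {
        buffer_enqueue(buf, stack_pop(stack));
    }
}
```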

To measure the effectiveness of the buffer, some benchmark programs were run
with a trace log of memory location prefetch and access instructions. The
prefetch window for each object processed was computed from the trace log.
Each enqueue and dequeue operation was treated as taking one unit of time,
and the time to actually process the object was assumed to be zero. A
histogram of the windows is shown below. These traces suggest the buffer
length is mostly being kept between the low and high thresholds, which is
what we want. Variations of the buffer parameters were benchmarked and the
best performing parameters were chosen. Obviously it is unlikely these
parameters will be optimal for all hardware and programs.

```
Prefetch window stats
mean 52.1
median 14.0
max 256
25.60 |65,304 | ******************************
51.20 |5,590 | **
76.80 |3,562 | *
102.40 |2,683 | *
128.00 |2,278 | *
153.60 |2,285 | *
179.20 |2,377 | *
204.80 |2,238 | *
230.40 |2,753 | *
256.00 |5,930 | **
-------- |------- | -------
N= |95,000
```

Using software prefetch instructions is only a win if the set of objects being
examined by the GC does not fit into CPU caches. Otherwise, using the buffer
and prefetch instructions is just overhead. The long-lived object count seems
a good estimate of whether things will fit in the cache. On 64-bit platforms,
the minimum object size is 32 bytes, so a 4 MB L2 cache would hold about
130,000 objects.

The current threshold for enabling prefetch is that the number of long-lived
objects must exceed 200,000. Based on benchmarking, this seems to be in the
range where prefetch becomes a net gain. Obviously it depends on hardware
details and also on the "shape" of the object graph. For example, your object
graph may be constructed by linearly allocating objects in memory. Then,
traversing the object graph might naturally result in mostly ordered memory
access. In that case, the hardware prefetcher is likely to do a nearly perfect
job without any software prefetch hints.
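
The enabling check itself is simple; a sketch (the constant name and the
source of the count are hypothetical):

```c
#include <stddef.h>

#define PREFETCH_THRESHOLD 200000   /* long-lived object count */

/* A 4 MB L2 cache holds at most 4 * 1024 * 1024 / 32 = 131,072 objects
 * of the 32-byte minimum size, so at 200,000 long-lived objects the
 * working set has clearly outgrown a cache of that size. */
static inline int
use_software_prefetch(size_t long_lived_count)
{
    return long_lived_count > PREFETCH_THRESHOLD;
}
```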

This optimization, as of March 2025, was tuned on the following hardware
platforms:

- Apple M3 Pro, 32 GB RAM, 192+128 KB L1, 16 MB L2, compiled with Clang 19
- AMD Ryzen 5 7600X, 64 GB RAM, 384 KB L1, 6 MB L2, 32 MB L3, compiled with GCC 12.2.0

Benchmarking the effectiveness of this optimization is particularly difficult.
It depends both on hardware details, like CPU cache sizes and memory latencies,
and on the specifics of the program's memory access patterns: where objects are
located in memory and in what order they are accessed during the mark alive
phase. When the program's memory access patterns are favourable (a working set
of data larger than the CPU cache, and objects allocated in such a way that
access order is not linear), the speedup from using software prefetching is in
the range of 20% to 40% for the entire full GC collection.

Optimization: reusing fields to save memory
===========================================