diff --git a/images/2025_05_allocation_sampling_images_2/infocalltree.png b/images/2025_05_allocation_sampling_images_2/infocalltree.png
new file mode 100644
index 000000000..7d9c65f5f
Binary files /dev/null and b/images/2025_05_allocation_sampling_images_2/infocalltree.png differ
diff --git a/images/2025_05_allocation_sampling_images_2/infoflamegraph.png b/images/2025_05_allocation_sampling_images_2/infoflamegraph.png
new file mode 100644
index 000000000..4c9840287
Binary files /dev/null and b/images/2025_05_allocation_sampling_images_2/infoflamegraph.png differ
diff --git a/images/2025_05_allocation_sampling_images_2/object.svg b/images/2025_05_allocation_sampling_images_2/object.svg
new file mode 100644
index 000000000..86de06840
--- /dev/null
+++ b/images/2025_05_allocation_sampling_images_2/object.svg
@@ -0,0 +1,351 @@
[SVG markup not reproduced: object layout diagram with the labels "Object Content", "type id (16 bit)", "GC flags (16 bit)", "Padding (32 bit)"]
diff --git a/images/2025_05_allocation_sampling_images_2/pure_from_args_calltree.png b/images/2025_05_allocation_sampling_images_2/pure_from_args_calltree.png
new file mode 100644
index 000000000..743bf2334
Binary files /dev/null and b/images/2025_05_allocation_sampling_images_2/pure_from_args_calltree.png differ
diff --git a/posts/2025/05/allocation_sampling2.md b/posts/2025/05/allocation_sampling2.md
new file mode 100644
index 000000000..beb5cf6d2
--- /dev/null
+++ b/posts/2025/05/allocation_sampling2.md
@@ -0,0 +1,251 @@

## Introduction

Not long ago, we wrote a [blogpost](https://pypy.org/posts/2025/02/pypy-gc-sampling.html) describing our approach to low-overhead allocation sampling for PyPy.
In this post we want to discuss extending the allocation sampling profiler to also store two extra pieces of information for every allocation sample:

- the (RPython-level) type of the allocated object
- whether the object lived for a long time, or died relatively quickly

We also want to store some general statistics about the memory use of the PyPy GC.

Additionally, we want to elaborate on how we tested the interaction of those
features with the GC logic. Finally, we discuss the overhead of those features
and provide a case study that demonstrates how allocation sampling can be used.

## Getting more information from PyPy's GC

First, let us talk about the technical details. Hopefully you still remember
how nursery allocations and sampling worked, as described in the previous
blog post. Now we want to extend that approach by also finding out what type of
allocation (i.e., object) triggered a sample, and whether it survived the next
minor collection. So let us dive directly into how PyPy's GC stores the type of
an object, by looking at an example depiction of an object together with its metadata.

![object layout](images/2025_05_allocation_sampling_images_2/object.svg)

Every object allocated by the GC has a header, composed of a 16-bit type ID and
16 bits for GC flags. (The padding is only present on 64-bit platforms and omitted on
32-bit PyPy.)
This means that for every allocated object, as soon as we have its address, we can
simply read the type from its header. Unfortunately, those type IDs correspond to
RPython types, not Python (app-level) types, and the RPython-level types may not
be very intuitive for non-PyPy developers. Getting the type at the Python level,
however, would be either computationally costly or outright impossible, because
the objects are not yet completely initialized at the point where the GC would
like to find out the type.
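To make the header layout concrete, here is a minimal Python sketch of what reading those two fields out of a header word could look like. The constants and helper names are ours for illustration (assuming the type ID sits in the low 16 bits, as in the figure); PyPy's actual GC does this with RPython-level helpers.

```python
# Illustrative sketch only, not PyPy's real GC code.
TYPE_ID_MASK = 0xFFFF    # low 16 bits of the header word: RPython type ID
GC_FLAGS_SHIFT = 16      # next 16 bits: GC flags

def read_type_id(header_word):
    """Extract the RPython type ID from an object's header word."""
    return header_word & TYPE_ID_MASK

def read_gc_flags(header_word):
    """Extract the GC flag bits from an object's header word."""
    return (header_word >> GC_FLAGS_SHIFT) & 0xFFFF

# Example: a header word with type ID 42 and a single flag bit set
header = (0b1 << GC_FLAGS_SHIFT) | 42
assert read_type_id(header) == 42
assert read_gc_flags(header) == 0b1
```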
Let's talk about the second piece of information we want to get our hands on:
did an allocated object survive the next minor collection after it was
allocated, or did it not? During minor collections, there is a point in time
where the surviving objects are known and marked via one GC header flag. This is
only a rough indication of how long an object survived, but since most objects
die young (i.e., before the first minor collection), it is still interesting to
identify call sites where objects get allocated that survive longer than one
minor collection.

A quick recap of how sampling worked: every `n` allocated bytes, a sample is
triggered. Afterwards, the allocation request is fulfilled, returning the
address of free memory for the newly allocated object to the caller. That
is where our new modifications start. Before returning the address of the
freshly allocated object, we put that address into a list. In this way, we
maintain a list `sampled_objects` of the sampled objects that were allocated
since the last minor collection.

This list is then used during the next minor collection. After the heap has
been traced and all surviving objects are known, we can take every address
inside `sampled_objects` and look into the corresponding GC header to read the
type ID and to find out whether the object survived or not. These two pieces of
information are then stored in the profile on disk, together with some
statistics about the current state of the GC.
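As a rough sketch, the bookkeeping just described might look like the following Python pseudocode. All names here (`GCFLAG_SURVIVED`, `fulfill_sampled_allocation`, the fake heap) are illustrative assumptions; the real logic lives inside PyPy's RPython GC.

```python
# Hedged sketch of the sampling bookkeeping, not PyPy's actual GC code.
GCFLAG_SURVIVED = 1 << 0  # assumed flag bit meaning "survived the minor collection"

sampled_objects = []  # addresses of sampled allocations since the last minor collection

def fulfill_sampled_allocation(obj_addr):
    # called right after a sampled allocation request was fulfilled,
    # just before the address is handed back to the caller
    sampled_objects.append(obj_addr)
    return obj_addr

def after_minor_collection(read_header, profile, gc_stats):
    # once the heap has been traced and survivors are marked, resolve
    # every sampled address into (type ID, survived?) and dump the GC stats
    for addr in sampled_objects:
        type_id, flags = read_header(addr)
        profile.append((type_id, bool(flags & GCFLAG_SURVIVED)))
    profile.append(gc_stats)
    del sampled_objects[:]

# Toy demo with a fake heap mapping addresses to (type ID, flags) headers:
heap = {0x1000: (7, GCFLAG_SURVIVED), 0x2000: (9, 0)}
profile = []
fulfill_sampled_allocation(0x1000)
fulfill_sampled_allocation(0x2000)
after_minor_collection(heap.__getitem__, profile, {"gc_state": "marking"})
print(profile)  # [(7, True), (9, False), {'gc_state': 'marking'}]
```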
That information is the following: `total_size_of_arenas` tells us how much space
the GC actually has for allocating tenured objects, while `total_memory_used`
tells us how much of that is already occupied. But there is more to a VM than
just the memory the GC manages, so `vmRSS` tells us how much memory PyPy
consumes from the point of view of the operating system. Finally, the `GC state`
gives the current major collection phase, which is one of `scanning`, `marking`,
`sweeping`, `finalizing`.

To map the numeric type IDs of RPython types to something human-readable,
we also dump a mapping of RPython type IDs to their respective names into the profile, so that a UI tool like the
[vmprof-firefox-converter](https://github.com/Cskorpion/vmprof-firefox-converter)
can use it to display the actual type names. As said earlier, those RPython
names may not be super useful for non-PyPy developers, so we plan to add
descriptions for the most common RPython types to the vmprof-firefox-converter.

## Visualising the New Information in the Firefox Profiler

To make the newly collected information usable, we adapted the vmprof-firefox-converter
to include the sampled object types and GC statistics in the converted profile.
The sampled object types and their survival information are displayed by
placing an additional frame on top of the corresponding sampled call stack,
which contains the (RPython) type of the sampled object.
Furthermore, the category (color of the frame) shows whether the object was collected (green) or survived (red).

![call tree with object type frames](images/2025_05_allocation_sampling_images_2/infocalltree.png)

The little ticks on top of the GC-sampled thread timeline mark minor collections.

There is also the flame graph view, which gives a quick overview of which functions ran the most, and now also shows which sampled object types died before the first minor collection and which were tenured.

![flame graph with object types](images/2025_05_allocation_sampling_images_2/infoflamegraph.png)

## Evaluation

### Verification

After talking about how it works, we now have to talk about how we verified
the correctness of the extracted information.

Of course, the first thing we did was write some unit tests, but a few unit
tests don't really guarantee the correctness of the complex interaction between
the GC and the allocation sampling logic.

Secondly, we added allocation sampling to the already existing
[randomized testing facility (fuzzer)](https://pypy.org/posts/2024/03/fixing-bug-incremental-gc.html) for PyPy's GC.

Fuzzing PyPy's GC with hypothesis has two phases. The first phase generates
random action sequences. Those actions consist of object, string, or array
allocations, freeing allocated objects, accessing an object, and now
also activating and deactivating allocation sampling with a random sampling
rate. In the second phase, these actions are executed against the GC
implementation and their intermediate results are asserted.

If there is a bug in the GC, e.g. freeing an object too early, the fuzzer
could produce a random action sequence that leads to an error when accessing an
already freed object.

When generating these actions, we also keep track of how much memory will be
allocated when they are executed. With this information, we can decide whether each
generated allocation action will trigger a sample. When these actions are
executed in the second phase, we can check for each allocation whether it should
trigger a sample and whether it actually did. For a failing check we then get the
sequence of actions that led to the failure, so we can track the bug down.

Fuzzing was very helpful for getting rid of many bugs in the allocation
sampling logic, because it exposed interactions between sampling and
garbage collection that we hadn't foreseen properly.
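To make the prediction side of that check concrete, here is a minimal sketch, assuming a fixed sampling rate of `n` bytes; the class and its structure are ours for illustration, not the fuzzer's actual code.

```python
# Sketch: predict which allocations should trigger a sample, given that the
# GC samples once every `sample_n_bytes` allocated bytes.
class SamplePredictor:
    def __init__(self, sample_n_bytes):
        self.sample_n_bytes = sample_n_bytes
        self.allocated = 0  # bytes allocated since the last sample

    def allocate(self, size):
        """Record an allocation; return True if it should trigger a sample."""
        self.allocated += size
        if self.allocated >= self.sample_n_bytes:
            self.allocated -= self.sample_n_bytes
            return True
        return False

# During the execution phase, the fuzzer can then assert for every allocation
# that `predictor.allocate(size)` matches whether the GC actually sampled.
predictor = SamplePredictor(sample_n_bytes=1024)
print([predictor.allocate(400) for _ in range(6)])
# [False, False, True, False, False, True]
```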
### Overhead

We already measured the overhead last time, but adding new features could add more
overhead, so we re-ran the benchmarks from the last post. We won't go into
further detail here, but it turns out that the new features of persisting
the RPython types and whether the objects survived the next minor
collection did not change the overhead of sampling in a measurable way. The
reason is that the main cost of allocation sampling is the cost of
walking the call stack at the allocation point.

Every time we sample, we also store the address of the allocated object in a
list. That list, together with some GC stats, is then dumped to VMProf at every
minor collection. This introduces only a little overhead while sampling is
enabled, and only one new pointer comparison at every minor collection when
sampling is turned off.

## Case Study

Here we want to show a real-world example of how we used allocation
sampling with the vmprof-firefox-converter to profile a real program
and reduced the allocation rate of the JIT optimizer in the process.

### PyPy VM Optimization

We profiled a benchmark executing SymPy functions and looked at
sampled list allocations.

![call tree of the sampled list allocations](images/2025_05_allocation_sampling_images_2/pure_from_args_calltree.png)

There we stumbled upon PyPy's JIT optimizer functions `postprocess_INT_ADD` and
`pure_from_args`. Both of these functions take part in optimizing pure integer
operations.

If PyPy's JIT encounters an `INT_ADD` operation that cannot be [optimized away](https://pypy.org/posts/2024/10/jit-peephole-dsl.html),
`postprocess_INT_ADD` will be called to also cache some arithmetic rewrites of
that addition. For example, if the JIT emits an operation `x = a + b`, it will remember
that `x - a` can be optimized to `b` from then on.
This is done by calling an API `pure_from_args(INT_SUB, [arg0, arg1])`. Similar
logic exists for other integer operations like multiplication or xor.

The reason why these `postprocess_...` functions of the JIT appear in the
memory profile is that they all allocate a list. This list is extremely
short-lived: one call level deeper, its elements are read out again and the list
is discarded, and yet another list is built, which is discarded one function
call further down. Additionally, in all of these postprocess methods, there are
never more than two arguments passed to `pure_from_args` inside that list.

After seeing `pure_from_args` in the memory allocation profile, we decided to
rewrite it to make the list allocations unnecessary. To achieve this, we split
it up into two functions, `pure_from_args1` and `pure_from_args2`, that take
the elements of the list as extra arguments directly, foregoing the allocation.
Then all call sites were
[adapted](https://github.com/pypy/pypy/commit/ef590f639e529ebe319c7d5ff8f5e03e31bcc304)
to call either `pure_from_args1` or `pure_from_args2` directly, thereby saving
two list allocations per `pure_from_args` call.
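In spirit, the interface change looks something like the following heavily simplified sketch; the dictionary stand-in for the optimizer's cache and all function bodies are our own invention, and the real RPython code in the linked commit differs.

```python
# Simplified sketch of the refactoring, not the real JIT optimizer code.
pure_cache = {}
INT_SUB = "int_sub"  # stand-in for the real operation number

def pure_from_args(opnum, args):
    # Before: every call allocates a short-lived list just to carry
    # at most two arguments.
    pure_cache[(opnum,) + tuple(args)] = True

def pure_from_args1(opnum, arg0):
    # After: one variant per arity, taking the arguments directly ...
    pure_cache[(opnum, arg0)] = True

def pure_from_args2(opnum, arg0, arg1):
    # ... so the intermediate list never needs to exist.
    pure_cache[(opnum, arg0, arg1)] = True

# A call site like pure_from_args(INT_SUB, [x, a]) becomes:
pure_from_args2(INT_SUB, "x", "a")
```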
This experience is maybe typical for allocation profiling: the internal
RPython class names are much more useful for PyPy core developers; we will have
to see whether application developers can also make use of the results.

## Summary

In this and the previous blog post, we introduced and explained the technical
details of allocation sampling in PyPy's GC with VMProf. Using this tool to
do allocation and time sampling simultaneously can give insight into where a
program spends its time and which functions allocate a lot of memory, leading
to garbage collections. The tool is aimed at both PyPy developers and non-PyPy
developers, with the goal of being easy to use while introducing little
overhead.

Right now the tool is still in development; we hope to merge and release it with
a PyPy release some time soon.

## Future Work

We have a bunch of ideas for features that could be added to VMProf in the
future.

One very important aspect of profiling is the overhead. One idea for reducing
the per-sample overhead is to not walk the entire stack every time, but to mark
already-walked stack frames, so that the next sample only needs to walk up to
the first marked stack frame. This could reduce the overhead if the
stacks do not change completely from sample to sample and there is indeed a
significant correlation between stack depth and overhead.

PyPy has a logging interface, which is used to log GC and JIT events with a
timestamp. Unfortunately those timestamps are the clock counts read from the
CPU's `TSC` (Time Stamp Counter, roughly the number of cycles since the last reset)
register (at least on x86/x86_64), which are not perfectly suitable for measuring time.
Our modified version of VMProf, on the other hand, uses timestamps retrieved with
Unix's `CLOCK_MONOTONIC`. This means we cannot exactly associate pypylog events
with VMProf samples. An easy fix would be to use the same timestamps for
pypylog as we do for VMProf, but this could introduce more overhead. A better
way of associating them could be to record the `TSC` value with each sample, so
we'd get an approximate alignment of logged events and samples.

The very last thing to add to our ‘future work’ list is syscalls. Using
ptrace to trace syscalls could give insight into where, when, and how much time
your code spent opening files, reading from files, waiting for subprocesses to
finish, etc.

We are also hoping to transfer some of the techniques used here for profiling
PyPy to profiling other RPython languages, such as the CPU simulators generated
by Pydrofoil.

-- Christoph Jung and CF Bolz-Tereick