| 1 | +--- |
| 2 | +title: Introduction to Arm_SPE and False Sharing |
| 3 | +weight: 2 |
| 4 | + |
| 5 | +### FIXED, DO NOT MODIFY |
| 6 | +layout: learningpathall |
| 7 | +--- |
| 8 | + |
| 9 | +## Introduction to Arm Statistical Profiling Extension (SPE) |
| 10 | + |
| 11 | +Standard performance tracing relies on counting whole instructions. It captures only architectural instructions and does not reveal the actual memory addresses or pipeline latencies, nor does it account for micro-operations in flight. Moreover, the “skid” phenomenon, where events are attributed to later instructions than the ones that caused them, can mislead developers. |
| 12 | + |
| 13 | +The Arm Statistical Profiling Extension (SPE) integrates sampling directly into the CPU pipeline, triggering on individual micro-operations rather than retired instructions, thereby eliminating skid and blind spots. Each SPE sample record includes relevant metadata, such as data addresses, per-µop pipeline latency, triggered PMU event masks, and the memory hierarchy source, enabling fine-grained and precise cache analysis. |
| 14 | + |
| 15 | +This enables software developers to tune user-space software for characteristics such as memory latency and cache accesses. Importantly, SPE is the mechanism that enables cache statistics on Arm with the Linux `perf` cache-to-cache utility, referred to as `perf c2c`. Please refer to the [Arm_SPE whitepaper](https://developer.arm.com/documentation/109429/latest/) for more details. |
| 16 | + |
| 17 | +In this learning path we will use `arm_spe` and `perf c2c` to diagnose a cache issue for an application running on a Neoverse server. |
| 18 | + |
| 19 | +## False Sharing within the Cache |
| 20 | + |
| 21 | +Modern processors move data in fixed-size cache lines (nominally 64 bytes), so two threads can interfere with each other even when they touch entirely separate variables. If those distinct variables happen to occupy bytes within the same line, every time one thread writes its variable the core’s cache must gain exclusive ownership of the whole line, forcing the other core’s copy to be invalidated. The second thread, still working on its own variable, then triggers a coherence miss to fetch the line back, and the ping-pong repeats. Please see the illustration below, taken from the [Arm_SPE whitepaper](https://developer.arm.com/documentation/109429/latest/), for a visual explanation. |
| 22 | + |
| 23 | + |
| 24 | + |
| 25 | +Because false sharing hides behind ordinary writes, the easiest time to eliminate it is while reading or refactoring the source code, by padding or realigning the offending variables before compilation. In large, highly concurrent code-bases, however, data structures are often accessed through several layers of abstraction, and many threads touch memory via indirection, so the subtle cache-line overlap may not surface until profiling or performance counters reveal unexpected coherence misses. |
| 26 | + |
| 27 | +From a source-code perspective nothing is “shared,” but at the hardware level both variables are implicitly coupled by their physical colocation. |
| 28 | + |
| 29 | +## Alignment to Cache Lines |
| 30 | + |
| 31 | +In C++11, we can manually specify the alignment of an object using the `alignas` specifier. For example, in the C++11 source code below, we align the `struct` to 64 bytes (the typical cache line size on a modern processor). This ensures that each instance of `AlignedType` starts on its own cache line. |
| 32 | + |
| 33 | +```cpp |
| 34 | +#include <atomic> |
| 35 | +#include <iostream> |
| 36 | + |
| 37 | +struct alignas(64) AlignedType { |
| 38 | + AlignedType() { val = 0; } |
| 39 | + std::atomic<int> val; |
| 40 | +}; |
| 41 | + |
| 42 | + |
| 43 | +int main() { |
| 44 | + // If we create four atomic integers like this, there's a high probability |
| 45 | + // they'll wind up next to each other in memory |
| 46 | + std::atomic<int> a; |
| 47 | + std::atomic<int> b; |
| 48 | + std::atomic<int> c; |
| 49 | + std::atomic<int> d; |
| 50 | + |
| 51 | + std::cout << "\n\nWithout Alignment can occupy same cache line\n\n"; |
| 52 | + // Print out the addresses |
| 53 | + std::cout << "Address of atomic<int> a - " << &a << '\n'; |
| 54 | + std::cout << "Address of atomic<int> b - " << &b << '\n'; |
| 55 | + std::cout << "Address of atomic<int> c - " << &c << '\n'; |
| 56 | + std::cout << "Address of atomic<int> d - " << &d << '\n'; |
| 57 | + |
| 58 | + AlignedType e{}; |
| 59 | + AlignedType f{}; |
| 60 | + AlignedType g{}; |
| 61 | + AlignedType h{}; |
| 62 | + |
| 63 | + std::cout << "\n\nMin 1 cache-line* spacing between variables"; |
| 64 | + std::cout << "\n*64 bytes = minimum 0x40 address increments\n\n"; |
| 65 | + |
| 66 | + std::cout << "Address of AlignedType e - " << &e << '\n'; |
| 67 | + std::cout << "Address of AlignedType f - " << &f << '\n'; |
| 68 | + std::cout << "Address of AlignedType g - " << &g << '\n'; |
| 69 | + std::cout << "Address of AlignedType h - " << &h << '\n'; |
| 70 | + |
| 71 | + return 0; |
| 72 | +} |
| 73 | +``` |
| 74 | + |
| 75 | +The example output below shows that the variables `e`, `f`, `g` and `h` are placed at least 64 bytes apart in our byte-addressable architecture, whereas the variables `a`, `b`, `c` and `d` are placed 8 bytes apart (that is, they occupy the same cache line). |
| 76 | + |
| 77 | +Although this is a contrived example, in a production workload there may be several layers of indirection that unintentionally result in false sharing. For these complex cases, to understand the root cause we will use `perf c2c`. |
| 78 | + |
| 79 | +```output |
| 80 | +Without Alignment can occupy same cache line |
| 81 | +
| 82 | +Address of atomic<int> a - 0xffffeb6c61b8 |
| 83 | +Address of atomic<int> b - 0xffffeb6c61b0 |
| 84 | +Address of atomic<int> c - 0xffffeb6c61a8 |
| 85 | +Address of atomic<int> d - 0xffffeb6c61a0 |
| 86 | +
| 87 | +
| 88 | +Min 1 cache-line* spacing between variables |
| 89 | +*64 bytes = minimum 0x40 address increments |
| 90 | +
| 91 | +Address of AlignedType e - 0xffffeb6c6140 |
| 92 | +Address of AlignedType f - 0xffffeb6c6100 |
| 93 | +Address of AlignedType g - 0xffffeb6c60c0 |
| 94 | +Address of AlignedType h - 0xffffeb6c6080 |
| 95 | +``` |