File: content/learning-paths/servers-and-cloud-computing/false-sharing-arm-spe/how-to-1.md
+5 −5: 5 additions & 5 deletions
@@ -8,11 +8,11 @@ layout: learningpathall
## Introduction to Arm Statistical Profiling Extension (SPE)

- Traditional performance tracing relies on counting whole instructions, capturing only architectural instructions without revealing the actual memory addresses, pipeline latencies, or micro-operations in flight; moreover, the “skid” phenomenon where events are falsely attributed to later instructions can mislead developers.
+ Standard performance tracing relies on counting whole instructions, capturing only architectural instructions without revealing the actual memory addresses, pipeline latencies, or micro-operations in flight. Moreover, the “skid” phenomenon, where events are falsely attributed to later instructions, can mislead developers.
- The Arm Statistical Profiling Extension (SPE) integrates sampling directly into the CPU pipeline, triggering on individual micro-operations rather than retired instructions, thereby eliminating skid and blind spots. Each SPE sample record includes relevant meta data, such as data addresses, per-µop pipeline latency, triggered PMU event masks, and the memory hierarchy source (L1 cache, last-level cache, or remote socket), enabling fine-grained and precise cache analysis.
+ The Arm Statistical Profiling Extension (SPE) integrates sampling directly into the CPU pipeline, triggering on individual micro-operations rather than retired instructions, thereby eliminating skid and blind spots. Each SPE sample record includes relevant metadata, such as data addresses, per-µop pipeline latency, triggered PMU event masks, and the memory hierarchy source, enabling fine-grained and precise cache analysis.
- This enables software developers to tune user-space software for characteristics such as memory latency, and cache statistics. Importantly, it is the mechanism on Arm to enable cache statistics with the Linux `perf` cache-to-cache utility, referred to as `perf c2c`. Please refer to the [Arm_SPE whitepaper](https://developer.arm.com/documentation/109429/latest/) for more details.
+ This enables software developers to tune user-space software for characteristics such as memory latency and cache accesses. Importantly, it is the mechanism on Arm that enables cache statistics with the Linux `perf` cache-to-cache utility, referred to as `perf c2c`. Refer to the [Arm_SPE whitepaper](https://developer.arm.com/documentation/109429/latest/) for more details.
In this learning path, we will use `arm_spe` and `perf c2c` to diagnose a cache issue for an application running on a Neoverse server.
@@ -22,15 +22,15 @@ Even when two threads touch entirely separate variables, modern processors move
- Because false sharing hides behind ordinary writes, the easiest time to eliminate it is while reading or refactoring the source code—padding or realigning the offending variables before compilation. In large, highly concurrent code-bases, however, data structures are often accessed through several layers of abstraction, and many threads touch memory via indirection, so the subtle cache-line overlap may not surface until profiling or performance counters reveal unexpected coherence misses.
+ Because false sharing hides behind ordinary writes, the easiest time to eliminate it is while reading or refactoring the source code: padding or realigning the offending variables before compilation. In large, highly concurrent code-bases, however, data structures are often accessed through several layers of abstraction, and many threads touch memory via indirection, so the subtle cache-line overlap may not surface until profiling or performance counters reveal unexpected coherence misses.
From a source-code perspective nothing is “shared,” but at the hardware level both variables are implicitly coupled by their physical colocation.
## Alignment to Cache Lines
In C++11, we can manually specify the alignment of an object using the `alignas` specifier. For example, in the C++11 source code below, we manually align the `struct` to 64 bytes (the typical cache line size on a modern processor). This ensures that each instance of `AlignedType` is on a separate cache line.
File: content/learning-paths/servers-and-cloud-computing/false-sharing-arm-spe/how-to-2.md
+1 −1: 1 addition & 1 deletion
@@ -39,7 +39,7 @@ sudo modprobe arm_spe_pmu
## Run Sysreport
- A handy python script is available to summarise your systems capabilities with regard to performance profiling. Install and run System Report python script ('sysreport`) using the [instructions in the learning path](https://learn.arm.com/learning-paths/servers-and-cloud-computing/sysreport/).
+ A handy Python script is available to summarise your system's capabilities with regard to performance profiling. Install and run the System Report Python script (`sysreport`) using the [instructions in the learning path](https://learn.arm.com/learning-paths/servers-and-cloud-computing/sysreport/).
To check that SPE is available on your system, look at the `perf sampling` field. It should read `SPE`, highlighted in green.
File: content/learning-paths/servers-and-cloud-computing/false-sharing-arm-spe/how-to-3.md
+3 −4: 3 additions & 4 deletions
@@ -11,7 +11,6 @@ layout: learningpathall
Copy and paste the `C++/C` example below into a file named `false_sharing_example.cpp`. The code example below has been adapted from [Joe Mario](https://github.com/joemario/perf-c2c-usage-files) and is discussed thoroughly in the [Arm Statistical Profiling Extension Whitepaper](https://developer.arm.com/documentation/109429/latest/).
-
```cpp
/*
 * This is an example program to show false sharing between
```
@@ -285,7 +284,7 @@ int main ( int argc, char *argv[] )
### Code Explanation
- The key data structure that occupies the cache is the `struct Buf`. With our system using a 64-byte cache line, each line can hold 8, 8-byte `long` integers. If we do **not** pass in the `NO_FALSE_SHARING` macro during compilation our `Buf` data structure will have the elements below. Where each structure neatly occupies the entire 64-byte cache line. However, the 4 readers and 2 locks are now on the same cache line.
+ The key data structure that occupies the cache is `struct Buf`. With our system using a 64-byte cache line, each line can hold eight 8-byte `long` integers. If we do **not** pass in the `NO_FALSE_SHARING` macro during compilation, our `Buf` data structure will contain the elements below, where each structure neatly occupies an entire 64-byte cache line. However, the 4 readers and 2 locks are now on the same cache line.
```output
typedef struct _buf {
@@ -300,7 +299,7 @@ typedef struct _buf {
} buf __attribute__((aligned (64)));
```
- Alternatively if we pass in the `NO_FALSE_SHARING` macro during compilation, our `Buf` structure has a different shape. The `(5*8-byte)` padding pushes the reader variables onto a different cache line. However, notice that this is with the tradeoff that our new `Buf` structures occupies 1 and a half cache lines (12 `long`s). Therefore we have unused cache space of ~25% per `Buf` structure.
+ Alternatively, if we pass in the `NO_FALSE_SHARING` macro during compilation, our `Buf` structure has a different shape. The `(5*8-byte)` padding pushes the reader variables onto a different cache line. However, this comes with a tradeoff: each new `Buf` structure occupies one and a half cache lines (12 `long`s), so roughly 25% of the cache space per `Buf` structure goes unused.
- Running both binaries with the command like argument of 1 will show the following, with both binaries successfully return a 0 exit status.
+ Running both binaries with a command-line argument of 1 will show the following: both binaries successfully return a 0 exit status, but the `false_sharing` binary runs almost 2x slower!
File: content/learning-paths/servers-and-cloud-computing/false-sharing-arm-spe/how-to-4.md
+5 −5: 5 additions & 5 deletions
@@ -54,7 +54,7 @@ Rerunning with the `no_false_sharing` shows the following.
6.4942219 +- 0.0000428 seconds time elapsed ( +- 0.00% )
```
- Manually comparing we observe the run time is significantly different (13.01s to 6.49s). Additionally, the instructions per cycle (IPC) is notably different, (0.74 and 1.70) and looks commensurate to the run time.
+ Comparing manually, we observe that the run times differ significantly (13.01s versus 6.49s). Additionally, the instructions per cycle (IPC) differs notably (0.74 versus 1.70), commensurate with the run times.
## Understanding the Root Cause
@@ -78,7 +78,7 @@ The output below clearly shows there are disproportionately more backend stall c
### Skid when using Perf Record
- The naive approach would be to record the events using the `perf record` subcommand. Running the following commands can be used to demonstrate skid, mentioned in the previous section.
+ The naive approach would be to record the events using the `perf record` subcommand. The following commands demonstrate the skid mentioned in the "Introduction to Arm_SPE and False Sharing" section.
- The left screenshot shows the canonical `perf record` command, here the `adrp` instruction falsely reports 52% of the time. However, using `perf c2c` that leverages `arm_spe`, we observe 99% of time associated with the `ldr`, load register command.
+ The left screenshot shows the canonical `perf record` command, where the `adrp` instruction falsely reports 52% of the time. However, using `perf c2c`, which leverages `arm_spe`, we observe 99% of the time associated with the `ldr` (load register) instruction. The standard `perf record` data could be quite misleading for a developer!
- Clearly `perf c2c` is more accurate. We are able to observe the instructure that is being used most frequently. Now let's find the specific variable so observe what in our source cause is causing this.
+ Clearly `perf c2c` is more accurate: we are able to observe the instruction that is executed most frequently. Now let's find the specific variable to see what in our source code is causing this.
Next, compile a debug version of both applications with the following command.