Commit 238930f

Author: Your Name
Commit message: final review before PR
1 parent 922d19a commit 238930f

4 files changed: +14 -15 lines changed

content/learning-paths/servers-and-cloud-computing/false-sharing-arm-spe/how-to-1.md

Lines changed: 5 additions & 5 deletions
@@ -8,11 +8,11 @@ layout: learningpathall
 
 ## Introduction to Arm Statistical Profiling Extension (SPE)
 
-Traditional performance tracing relies on counting whole instructions, capturing only architectural instructions without revealing the actual memory addresses, pipeline latencies, or micro-operations in flight; moreover, the “skid” phenomenon where events are falsely attributed to later instructions can mislead developers.
+Standard performance tracing relies on counting whole instructions, capturing only architectural instructions without revealing the actual memory addresses, pipeline latencies, or considering micro-operations in flight. Moreover, the “skid” phenomenon where events are falsely attributed to later instructions can mislead developers.
 
-The Arm Statistical Profiling Extension (SPE) integrates sampling directly into the CPU pipeline, triggering on individual micro-operations rather than retired instructions, thereby eliminating skid and blind spots. Each SPE sample record includes relevant meta data, such as data addresses, per-µop pipeline latency, triggered PMU event masks, and the memory hierarchy source (L1 cache, last-level cache, or remote socket), enabling fine-grained and precise cache analysis.
+The Arm Statistical Profiling Extension (SPE) integrates sampling directly into the CPU pipeline, triggering on individual micro-operations rather than retired instructions, thereby eliminating skid and blind spots. Each SPE sample record includes relevant meta data, such as data addresses, per-µop pipeline latency, triggered PMU event masks, and the memory hierarchy source, enabling fine-grained and precise cache analysis.
 
-This enables software developers to tune user-space software for characteristics such as memory latency, and cache statistics. Importantly, it is the mechanism on Arm to enable cache statistics with the Linux `perf` cache-to-cache utility, referred to as `perf c2c`. Please refer to the [Arm_SPE whitepaper](https://developer.arm.com/documentation/109429/latest/) for more details.
+This enables software developers to tune user-space software for characteristics such as memory latency and cache accesses. Importantly, it is the mechanism on Arm to enable cache statistics with the Linux `perf` cache-to-cache utility, referred to as `perf c2c`. Please refer to the [Arm_SPE whitepaper](https://developer.arm.com/documentation/109429/latest/) for more details.
 
 In this learning path we will use the `arm_spe` and `perf c2c` to diagnose a cache issue for an application running on a Neoverse server.
 
@@ -22,15 +22,15 @@ Even when two threads touch entirely separate variables, modern processors move
 
 ![false_sharing_diagram](./false_sharing_diagram.png)
 
-Because false sharing hides behind ordinary writes, the easiest time to eliminate it is while reading or refactoring the source codepadding or realigning the offending variables before compilation. In large, highly concurrent code-bases, however, data structures are often accessed through several layers of abstraction, and many threads touch memory via indirection, so the subtle cache-line overlap may not surface until profiling or performance counters reveal unexpected coherence misses.
+Because false sharing hides behind ordinary writes, the easiest time to eliminate it is while reading or refactoring the source code padding or realigning the offending variables before compilation. In large, highly concurrent code-bases, however, data structures are often accessed through several layers of abstraction, and many threads touch memory via indirection, so the subtle cache-line overlap may not surface until profiling or performance counters reveal unexpected coherence misses.
 
 From a source-code perspective nothing is “shared,” but at the hardware level both variables are implicitly coupled by their physical colocation.
 
 ## Alignment to Cache Lines
 
 In C++11, we can manually specify the alignment of an object using the `alignas` function. For example, in the C++11 source code below, we manually align the the `struct` every 64 bytes (typical cache line size on a modern processor). This ensures that each instance of `AlignedType` is on a separate cache line.
 
-```c++
+```cpp
 #include <atomic>
 #include <iostream>
 

content/learning-paths/servers-and-cloud-computing/false-sharing-arm-spe/how-to-2.md

Lines changed: 1 addition & 1 deletion
@@ -39,7 +39,7 @@ sudo modprobe arm_spe_pmu
 
 ## Run Sysreport
 
-A handy python script is available to summarise your systems capabilities with regard to performance profiling. Install and run System Report python script ('sysreport`) using the [instructions in the learning path](https://learn.arm.com/learning-paths/servers-and-cloud-computing/sysreport/).
+A handy python script is available to summarise your systems capabilities with regard to performance profiling. Install and run System Report python script (`sysreport`) using the [instructions in the learning path](https://learn.arm.com/learning-paths/servers-and-cloud-computing/sysreport/).
 
 To check SPE is available on your system look at the `perf sampling` field. It should read `SPE` highlighted in green.
 

content/learning-paths/servers-and-cloud-computing/false-sharing-arm-spe/how-to-3.md

Lines changed: 3 additions & 4 deletions
@@ -11,7 +11,6 @@ layout: learningpathall
 Copy and paste the `C++/C` example below into a file named `false_sharing_example.cpp`. The code example below has been adapted from [Joe Mario](https://github.com/joemario/perf-c2c-usage-files) and is discussed thoroughly in the [Arm Statistical Profiling Extension Whitepaper](https://developer.arm.com/documentation/109429/latest/).
 
 
-
 ```cpp
 /*
 * This is an example program to show false sharing between
@@ -285,7 +284,7 @@ int main ( int argc, char *argv[] )
 
 ### Code Explanation
 
-The key data structure that occupies the cache is the `struct Buf`. With our system using a 64-byte cache line, each line can hold 8, 8-byte `long` integers. If we do **not** pass in the `NO_FALSE_SHARING` macro during compilation our `Buf` data structure will have the elements below. Where each structure neatly occupies the entire 64-byte cache line. However, the 4 readers and 2 locks are now on the same cache line.
+The key data structure that occupies the cache is the `struct Buf`. With our system using a 64-byte cache line, each line can hold 8, 8-byte `long` integers. If we do **not** pass in the `NO_FALSE_SHARING` macro during compilation our `Buf` data structure will contain the elements below. Where each structure neatly occupies the entire 64-byte cache line. However, the 4 readers and 2 locks are now on the same cache line.
 
 ```output
 typedef struct _buf {
@@ -300,7 +299,7 @@ typedef struct _buf {
 } buf __attribute__((aligned (64)));
 ```
 
-Alternatively if we pass in the `NO_FALSE_SHARING` macro during compilation, our `Buf` structure has a different shape. The `(5*8-byte)` padding pushes the reader variables onto a different cache line. However, notice that this is with the tradeoff that our new `Buf` structures occupies 1 and a half cache lines (12 `long`s). Therefore we have unused cache space of ~25% per `Buf` structure.
+Alternatively if we pass in the `NO_FALSE_SHARING` macro during compilation, our `Buf` structure has a different shape. The `(5*8-byte)` padding pushes the reader variables onto a different cache line. However, notice that this is with the tradeoff that our new `Buf` structures occupies 1 and a half cache lines (12 `long`s). Therefore we have unused cache space of 25% per `Buf` structure.
 
 ```output
 typedef struct _buf {
@@ -322,7 +321,7 @@ gcc -lnuma -pthread false_sharing_example.c -o false_sharing
 gcc -lnuma -pthread false_sharing_example.c -DNO_FALSE_SHARING -o no_false_sharing
 ```
 
-Running both binaries with the command like argument of 1 will show the following, with both binaries successfully return a 0 exit status.
+Running both binaries with the command like argument of 1 will show the following, with both binaries successfully return a 0 exit status but the `false_sharing` binary runs almost 2x slower!
 
 ```bash
 time ./false_sharing 1

content/learning-paths/servers-and-cloud-computing/false-sharing-arm-spe/how-to-4.md

Lines changed: 5 additions & 5 deletions
@@ -54,7 +54,7 @@ Rerunning with the `no_false_sharing` shows the following.
 6.4942219 +- 0.0000428 seconds time elapsed ( +- 0.00% )
 ```
 
-Manually comparing we observe the run time is significantly different (13.01s to 6.49s). Additionally, the instructions per cycle (IPC) is notably different, (0.74 and 1.70) and looks commensurate to the run time.
+Manually comparing we observe the run time is significantly different (13.01s to 6.49s). Additionally, the instructions per cycle (IPC) is notably different, (0.74 and 1.70) and looks to be commensurate to run time.
 
 ## Understanding the Root Cause
 
@@ -78,7 +78,7 @@ The output below clearly shows there are disproportionately more backend stall c
 
 ### Skid when using Perf Record
 
-The naive approach would be to record the events using the `perf record` subcommand. Running the following commands can be used to demonstrate skid, mentioned in the previous section.
+The naive approach would be to record the events using the `perf record` subcommand. Running the following commands can be used to demonstrate skid, mentioned in the "Introduction to Arm_SPE and False Sharing" section.
 
 ```bash
 # record using canonical counters
@@ -92,14 +92,14 @@ sudo perf c2c record -g ./false_sharing 1
 sudo perf annotate
 ```
 
-The left screenshot shows the canonical `perf record` command, here the `adrp` instruction falsely reports 52% of the time. However, using `perf c2c` that leverages `arm_spe`, we observe 99% of time associated with the `ldr`, load register command.
+The left screenshot shows the canonical `perf record` command, here the `adrp` instruction falsely reports 52% of the time. However, using `perf c2c` that leverages `arm_spe`, we observe 99% of time associated with the `ldr`, load register command. The standard `perf record` data could be quite misleading for a developer!
 
 ![perf-record-annotate](./perf-record-error-skid.png)
 ![perf-c2c-record-annotate](./perf-c2c-record.png)
 
 ### Using Perf C2C
 
-Clearly `perf c2c` is more accurate. We are able to observe the instructure that is being used most frequently. Now let's find the specific variable so observe what in our source cause is causing this.
+Clearly `perf c2c` is more accurate. We are able to observe the instructure that is being used most frequently. Now let's find the specific variable to observe what in our source cause is causing this.
 
 Next, compile a debug version of both applications with the following command.
 
@@ -108,7 +108,7 @@ gcc -g -lnuma -pthread false_sharing_example.c -o false_sharing_.debug
 gcc -g -lnuma -pthread false_sharing_example.c -DNO_FALSE_SHARING -o no_false_sharing.debug
 ```
 
-Next, we record our application with call stats using the `perf c2c` subcommand.
+Next, we record our application with call stacks using the `perf c2c` subcommand with the `-g` flag.
 
 ```bash
 sudo perf c2c record -g ./false_sharing.debug 1
