Commit 8fdda1a

authored
Merge pull request #1970 from ArmDeveloperEcosystem/main
Production update
2 parents a55861e + cd35259 commit 8fdda1a

19 files changed

+546
-422
lines changed

content/learning-paths/servers-and-cloud-computing/false-sharing-arm-spe/_index.md

Lines changed: 8 additions & 7 deletions
Original file line numberDiff line numberDiff line change
@@ -7,17 +7,17 @@ cascade:
77

88
minutes_to_complete: 15
99

10-
who_is_this_for: Cloud developers who are looking to debug and optimize cache access patterns on cloud servers with perf c2c.
10+
who_is_this_for: This topic is for developers who want to optimize cache access patterns on Arm servers using Perf C2C.
1111

1212
learning_objectives:
13-
- Learn basic C++ techniques to avoid false sharing with alignas().
14-
- Learn how to enable and use Arm_SPE.
15-
- Learn how to investigate cache line performance with perf c2c.
13+
- Avoid false sharing in C++ using memory alignment.
14+
- Enable and use the Arm Statistical Profiling Extension (SPE) on Linux systems.
15+
- Investigate cache line performance with Perf C2C.
1616

1717
prerequisites:
18-
- Arm-based cloud instance with Arm Statistical Profiling Extension support.
19-
- basic understanding on cache hierarchy and how efficient cache accessing impact performance..
20-
- Familiarity with the Linux Perf tool.
18+
- Access to an Arm-based cloud instance with support for the Arm Statistical Profiling Extension (SPE).
19+
- A basic understanding of cache coherency and its impact on performance.
20+
- Familiarity with Linux Perf tools.
2121

2222
author: Kieran Hejmadi
2323

@@ -28,6 +28,7 @@ armips:
2828
- Neoverse
2929
tools_software_languages:
3030
- Perf
31+
- Runbook
3132
operatingsystems:
3233
- Linux
3334

content/learning-paths/servers-and-cloud-computing/false-sharing-arm-spe/how-to-1.md

Lines changed: 20 additions & 16 deletions
@@ -1,34 +1,36 @@
11
---
2-
title: Introduction to Arm_SPE and False Sharing
2+
title: Introduction to Arm SPE and false sharing
33
weight: 2
44

55
### FIXED, DO NOT MODIFY
66
layout: learningpathall
77
---
88

9-
## Introduction to Arm Statistical Profiling Extension (SPE)
9+
## Introduction to the Arm Statistical Profiling Extension (SPE)
1010

11-
Standard performance tracing relies on counting whole instructions, capturing only architectural instructions without revealing the actual memory addresses, pipeline latencies, or considering micro-operations in flight. Moreover, the “skid” phenomenon where events are falsely attributed to later instructions can mislead developers.
11+
Standard performance tracing relies on counting completed instructions, capturing only architectural instructions without revealing the actual memory addresses, pipeline latencies, or considering micro-operations in flight. Moreover, the “skid” phenomenon where events are falsely attributed to later instructions can mislead developers.
1212

13-
The Arm Statistical Profiling Extension (SPE) integrates sampling directly into the CPU pipeline, triggering on individual micro-operations rather than retired instructions, thereby eliminating skid and blind spots. Each SPE sample record includes relevant meta data, such as data addresses, per-µop pipeline latency, triggered PMU event masks, and the memory hierarchy source, enabling fine-grained and precise cache analysis.
13+
SPE integrates sampling directly into the CPU pipeline, triggering on individual micro-operations rather than retired instructions, thereby eliminating skid and blind spots. Each SPE sample record includes relevant metadata, such as data addresses, per-µop pipeline latency, triggered PMU event masks, and the memory hierarchy source, enabling fine-grained and precise cache analysis.
1414

15-
This enables software developers to tune user-space software for characteristics such as memory latency and cache accesses. Importantly, it is the mechanism on Arm to enable cache statistics with the Linux `perf` cache-to-cache utility, referred to as `perf c2c`. Please refer to the [Arm_SPE whitepaper](https://developer.arm.com/documentation/109429/latest/) for more details.
15+
This enables software developers to tune user-space software for characteristics such as memory latency and cache accesses. Importantly, SPE is the mechanism on Arm that enables cache statistics in the Linux Perf cache-to-cache (C2C) utility.
1616

17-
In this learning path we will use the `arm_spe` and `perf c2c` to diagnose a cache issue for an application running on a Neoverse server.
17+
Please refer to the [Arm SPE whitepaper](https://developer.arm.com/documentation/109429/latest/) for more details.
1818

19-
## False Sharing within the Cache
19+
In this Learning Path, you will use SPE and Perf C2C to diagnose a cache issue for an application running on a Neoverse server.
2020

21-
Even when two threads touch entirely separate variables, modern processors move data in fixed-size cache lines (nominally 64-bytes). If those distinct variables happen to occupy bytes within the same line, every time one thread writes its variable the core’s cache must gain exclusive ownership of the whole line, forcing the other core’s copy to be invalidated. The second thread, still working on its own variable, then triggers a coherence miss to fetch the line back, and the ping-pong repeats. Please see the illustration below, taken from the [Arm_SPE whitepaper](https://developer.arm.com/documentation/109429/latest/), for a visual explanation.
21+
## False sharing within the cache
22+
23+
Even when two threads touch entirely separate variables, modern processors move data in fixed-size cache lines (nominally 64 bytes). If those distinct variables happen to occupy bytes within the same line, every time one thread writes its variable, the core’s cache must gain exclusive ownership of the whole line, forcing the other core’s copy to be invalidated. The second thread, still working on its own variable, then triggers a coherence miss to fetch the line back, and the ping-pong pattern repeats. Please see the illustration below, taken from the Arm SPE whitepaper, for a visual explanation.
2224

2325
![false_sharing_diagram](./false_sharing_diagram.png)
2426

25-
Because false sharing hides behind ordinary writes, the easiest time to eliminate it is while reading or refactoring the source code padding or realigning the offending variables before compilation. In large, highly concurrent code-bases, however, data structures are often accessed through several layers of abstraction, and many threads touch memory via indirection, so the subtle cache-line overlap may not surface until profiling or performance counters reveal unexpected coherence misses.
27+
Because false sharing hides behind ordinary writes, the easiest time to eliminate it is while reading or refactoring the source code by padding or realigning the offending variables before compilation. In large, highly concurrent codebases, however, data structures are often accessed through several layers of abstraction, and many threads touch memory via indirection, so the subtle cache-line overlap may not surface until profiling or performance counters reveal unexpected coherence misses.
2628

27-
From a source-code perspective nothing is “shared,” but at the hardware level both variables are implicitly coupled by their physical colocation.
29+
From a source-code perspective nothing is “shared,” but at the hardware level both variables are implicitly coupled by their physical location.
2830

29-
## Alignment to Cache Lines
31+
## Alignment to cache lines
3032

31-
In C++11, we can manually specify the alignment of an object using the `alignas` function. For example, in the C++11 source code below, we manually align the the `struct` every 64 bytes (typical cache line size on a modern processor). This ensures that each instance of `AlignedType` is on a separate cache line.
33+
In C++11, you can manually specify the alignment of an object with the `alignas` specifier. For example, the C++11 source code below manually aligns the `struct` to 64 bytes (the typical cache line size on a modern processor). This ensures that each instance of `AlignedType` is on a separate cache line.
3234

3335
```cpp
3436
#include <atomic>
@@ -48,7 +50,7 @@ int main() {
4850
std::atomic<int> c;
4951
std::atomic<int> d;
5052

51-
std::cout << "\n\nWithout Alignment can occupy same cache line\n\n";
53+
std::cout << "\n\nWithout alignment, variables can occupy the same cache line\n\n";
5254
// Print out the addresses
5355
std::cout << "Address of atomic<int> a - " << &a << '\n';
5456
std::cout << "Address of atomic<int> b - " << &b << '\n';
@@ -72,9 +74,9 @@ int main() {
7274
}
7375
```
7476

75-
Example output below shows the variables e, f, g and h occur at least 64-bytes addreses apart in our byte-addressable architecture. Whereas variables a, b, c and d occur 8 bytes apart (i.e. occupy the same cache line).
77+
The example output below shows that the variables e, f, g, and h are at least 64 bytes apart in the byte-addressable architecture, whereas variables a, b, c, and d are 8 bytes apart and occupy the same cache line.
7678

77-
Although this is a contrived example, in a production workload there may be several layers of indirection that unintentionally result in false sharing. For these complex cases, to understand the root cause we will use `perf c2c`.
79+
Although this is a contrived example, in a production workload there may be several layers of indirection that unintentionally result in false sharing. For these complex cases, to understand the root cause you will use Perf C2C.
7880

7981
```output
8082
Without Alignment can occupy same cache line
@@ -92,4 +94,6 @@ Address of AlignedType e - 0xffffeb6c6140
9294
Address of AlignedType f - 0xffffeb6c6100
9395
Address of AlignedType g - 0xffffeb6c60c0
9496
Address of AlignedType h - 0xffffeb6c6080
95-
```
97+
```
98+
99+
Continue to the next section to learn how to set up a system to run Perf C2C.
Lines changed: 62 additions & 17 deletions
@@ -1,47 +1,73 @@
11
---
2-
title: Setup
2+
title: Configure your environment for Arm SPE profiling
33
weight: 3
44

55
### FIXED, DO NOT MODIFY
66
layout: learningpathall
77
---
88

9-
## Setup
9+
## Select a system with SPE support
1010

11-
For this tutorial, I will use a `c6g.metal` instances running Amazon linux 2023 (AL23). Since `SPE` requires support both in hardware and the operating system, instances running specific distributions or kernels may not allow SPE-based profiling.
11+
SPE requires both hardware and operating system support. Many cloud instances running Linux do not enable SPE-based profiling.
1212

13-
We can check the underlying Neoverse IP and operating system kernel version with the following commands.
13+
You need to identify a system that supports SPE using the information below.
14+
15+
If you are looking for an AWS system, you can use a `c6g.metal` instance running Amazon Linux 2023 (AL2023).
16+
17+
Check the underlying Neoverse processor and operating system kernel version with the following commands.
1418

1519
```bash
1620
lscpu | grep -i "model name"
1721
uname -r
1822
```
1923

20-
Here we observe
24+
The output includes the CPU type and kernel release version:
2125

2226
```output
2327
Model name: Neoverse-N1
24-
6.1.134-150.224.amzn2023.aarch64
28+
6.1.134-152.225.amzn2023.aarch64
2529
```
2630

27-
Next install the prerequisite packages with the following command.
31+
Next, install the prerequisite packages using the package manager:
2832

2933
```bash
3034
sudo dnf update -y
3135
sudo dnf install perf git gcc cmake numactl-devel -y
3236
```
3337

34-
Since the `linux` perf utility is a userspace process and SPE is a hardware feature in silicon, we use a built-in kernel module `arm_spe_pmu` to interact. Run the following command.
38+
Linux Perf is a userspace process and SPE is a hardware feature. The Linux kernel must be compiled with SPE support or the kernel module named `arm_spe_pmu` must be loaded.
39+
40+
Run the following command to load the SPE kernel module if it is not already loaded:
3541

3642
```bash
3743
sudo modprobe arm_spe_pmu
3844
```
3945

46+
If the command prints nothing, it succeeded. If the module cannot be loaded, SPE may still be available when support is built into the kernel.
47+
48+
Run this command to check if SPE is included in the kernel:
49+
50+
```bash
51+
ls /sys/bus/event_source/devices/ | grep arm_spe
52+
```
53+
54+
If SPE is available, the output is:
55+
56+
```output
57+
arm_spe_0
58+
```
59+
60+
If the output is blank then SPE is not available.
61+
4062
## Run Sysreport
4163

42-
A handy python script is available to summarise your systems capabilities with regard to performance profiling. Install and run System Report python script (`sysreport`) using the [instructions in the learning path](https://learn.arm.com/learning-paths/servers-and-cloud-computing/sysreport/).
64+
You can install and run a Python script named Sysreport to summarize your system's performance profiling capabilities.
65+
66+
Refer to [Get ready for performance analysis with Sysreport](https://learn.arm.com/learning-paths/servers-and-cloud-computing/sysreport/) to learn how to install and run it.
67+
68+
Look at the Sysreport output and confirm SPE is available by checking the `perf sampling` field.
4369

44-
To check SPE is available on your system look at the `perf sampling` field. It should read `SPE` highlighted in green.
70+
If the printed value is SPE then SPE is available.
4571

4672
```output
4773
...
@@ -57,29 +83,48 @@ Performance features:
5783
perf in userspace: disabled
5884
```
5985

60-
## Confirm Arm_SPE Availability
86+
## Confirm Arm SPE is available to Perf
6187

62-
Running the following command will confirm the availability of `arm_spe`.
88+
Run the following command to confirm SPE is available to Perf:
6389

64-
```output
90+
```bash
6591
sudo perf list "arm_spe*"
6692
```
6793

68-
You should observe the following.
94+
You should see the output below indicating the PMU event is available.
6995

7096
```output
7197
List of pre-defined events (to be used in -e or -M):
7298
7399
arm_spe_0// [Kernel PMU event]
74100
```
75101

76-
If `arm_spe` is not available on your configuration, the `perf c2c` workload without `SPE` will fail. For example you will observe the following.
102+
Assign capabilities to Perf by running:
103+
104+
```bash
105+
sudo setcap cap_perfmon,cap_sys_ptrace,cap_sys_admin+ep $(which perf)
106+
```
107+
108+
If `arm_spe` is not available because of your system configuration or if you don't have PMU permission, the `perf c2c` command will fail.
109+
110+
To confirm Perf can access SPE run:
111+
112+
```bash
113+
perf c2c record
114+
```
115+
116+
The output showing the failure is:
77117

78118
```output
79-
$ perf c2c record
80119
failed: memory events not supported
81120
```
82121

83122
{{% notice Note %}}
84-
If you are unable to use Arm SPE. It may be a restriction based on your cloud instance size or operating system. Generally, access to a full server (also known as metal instances) with a relatively new kernel is needed for Arm_SPE support. For more information, see the [perf-arm-spe manual page](https://man7.org/linux/man-pages/man1/perf-arm-spe.1.html)
123+
If you are unable to use SPE, it may be a restriction based on your cloud instance size or operating system.
124+
125+
Generally, access to a full server (also known as a metal instance) with a relatively new kernel is needed for Arm SPE support.
126+
127+
For more information about enabling SPE, see the [perf-arm-spe manual page](https://man7.org/linux/man-pages/man1/perf-arm-spe.1.html).
85128
{{% /notice %}}
129+
130+
Continue to learn how to use Perf C2C on an example application.

content/learning-paths/servers-and-cloud-computing/false-sharing-arm-spe/how-to-3.md

Lines changed: 16 additions & 7 deletions
@@ -6,10 +6,11 @@ weight: 4
66
layout: learningpathall
77
---
88

9-
## Example
9+
## Example code
1010

11-
Copy and paste the `C++/C` example below into a file named `false_sharing_example.cpp`. The code example below has been adapted from [Joe Mario](https://github.com/joemario/perf-c2c-usage-files) and is discussed thoroughly in the [Arm Statistical Profiling Extension Whitepaper](https://developer.arm.com/documentation/109429/latest/).
11+
Use a text editor to copy and paste the C example code below into a file named `false_sharing_example.c`.
1212

13+
The code is adapted from [Joe Mario](https://github.com/joemario/perf-c2c-usage-files) and is discussed thoroughly in the Arm Statistical Profiling Extension Whitepaper.
1314

1415
```cpp
1516
/*
@@ -282,9 +283,13 @@ int main ( int argc, char *argv[] )
282283
}
283284
```
284285
285-
### Code Explanation
286+
### Code explanation
286287
287-
The key data structure that occupies the cache is the `struct Buf`. With our system using a 64-byte cache line, each line can hold 8, 8-byte `long` integers. If we do **not** pass in the `NO_FALSE_SHARING` macro during compilation our `Buf` data structure will contain the elements below. Where each structure neatly occupies the entire 64-byte cache line. However, the 4 readers and 2 locks are now on the same cache line.
288+
The key data structure that occupies the cache is `struct Buf`. With a 64-byte cache line size, each line can hold eight 8-byte `long` integers.
289+
290+
If you do not pass in the `NO_FALSE_SHARING` macro during compilation, the `Buf` data structure will contain the elements below. Each structure neatly occupies the entire 64-byte cache line.
291+
292+
However, the 4 readers and 2 locks are now accessing the same cache line.
288293
289294
```output
290295
typedef struct _buf {
@@ -299,7 +304,9 @@ typedef struct _buf {
299304
} buf __attribute__((aligned (64)));
300305
```
301306

302-
Alternatively if we pass in the `NO_FALSE_SHARING` macro during compilation, our `Buf` structure has a different shape. The `(5*8-byte)` padding pushes the reader variables onto a different cache line. However, notice that this is with the tradeoff that our new `Buf` structures occupies 1 and a half cache lines (12 `long`s). Therefore we have unused cache space of 25% per `Buf` structure.
307+
Alternatively, if you pass in the `NO_FALSE_SHARING` macro during compilation, the `Buf` structure has a different shape.
308+
309+
The 40 bytes of padding push the reader variables onto a different cache line. The tradeoff is that the new `Buf` structure occupies multiple cache lines (12 long integers), which leaves 25% of the cache space unused per `Buf` structure.
303310

304311
```output
305312
typedef struct _buf {
@@ -314,14 +321,14 @@ typedef struct _buf {
314321
} buf __attribute__((aligned (64)));
315322
```
316323

317-
Compile the example with the command below.
324+
Compile the example with the commands:
318325

319326
```bash
320327
gcc -lnuma -pthread false_sharing_example.c -o false_sharing
321328
gcc -lnuma -pthread false_sharing_example.c -DNO_FALSE_SHARING -o no_false_sharing
322329
```
323330

324-
Running both binaries with the command like argument of 1 will show the following, with both binaries successfully return a 0 exit status but the `false_sharing` binary runs almost 2x slower!
331+
Run both binaries with a command-line argument of 1. Both binaries return a 0 exit status, but the binary with false sharing runs almost 2x slower!
325332

326333
```bash
327334
time ./false_sharing 1
@@ -338,3 +345,5 @@ user 0m8.869s
338345
sys 0m0.000s
339346
```
340347

348+
Continue to the next section to learn how to use Perf C2C to analyze the example code.
349+
