Commit 01919b7

Merge pull request #1953 from kieranhejmadi01/false-sharing-spe
Analyse-cache-behaviour-with-perf-c2c-on-Arm
2 parents d0e4078 + 238930f commit 01919b7

File tree: 10 files changed (+713, -0 lines changed)

Lines changed: 46 additions & 0 deletions
@@ -0,0 +1,46 @@
---
title: Analyse cache behaviour with Perf C2C on Arm

minutes_to_complete: 15

who_is_this_for: Cloud developers who are looking to debug and optimise cache access patterns on cloud servers with perf c2c.

learning_objectives:
- Learn basic C++ techniques to avoid false sharing with alignas()
- Learn how to enable and use Arm_SPE
- Learn how to investigate cache line performance with perf c2c

prerequisites:
- Arm-based cloud instance with Arm Statistical Profiling Extension support
- A basic understanding of the cache hierarchy and how efficient cache access impacts performance
- Familiarity with the Linux Perf tool

author: Kieran Hejmadi

### Tags
skilllevels: Introductory
subjects: Performance
armips:
- Neoverse
tools_software_languages:
- Perf
operatingsystems:
- Linux


further_reading:
- resource:
    title: Arm Statistical Profiling Extension Whitepaper
    link: https://developer.arm.com/documentation/109429/latest/
    type: documentation
- resource:
    title: Arm Topdown Methodology
    link: https://developer.arm.com/documentation/109542/0100/Arm-Topdown-methodology
    type: documentation


### FIXED, DO NOT MODIFY
# ================================================================================
weight: 1 # _index.md always has weight of 1 to order correctly
layout: "learningpathall" # All files under learning paths have this same wrapper
learning_path_main_page: "yes" # This should be surfaced when looking for related content. Only set for _index.md of learning path content.
---
Lines changed: 8 additions & 0 deletions
@@ -0,0 +1,8 @@
---
# ================================================================================
# FIXED, DO NOT MODIFY THIS FILE
# ================================================================================
weight: 21 # Set to always be larger than the content in this path to be at the end of the navigation.
title: "Next Steps" # Always the same, html page title.
layout: "learningpathall" # All files under learning paths have this same wrapper for Hugo processing.
---
Binary image file added (118 KB)
Lines changed: 95 additions & 0 deletions
@@ -0,0 +1,95 @@
---
title: Introduction to Arm_SPE and False Sharing
weight: 2

### FIXED, DO NOT MODIFY
layout: learningpathall
---

## Introduction to Arm Statistical Profiling Extension (SPE)

Standard performance tracing relies on counting whole, retired instructions: it captures only architectural instructions, without revealing the actual memory addresses, per-operation pipeline latencies, or the micro-operations in flight. Moreover, the “skid” phenomenon, where an event is attributed to a later instruction than the one that caused it, can mislead developers.

The Arm Statistical Profiling Extension (SPE) integrates sampling directly into the CPU pipeline, triggering on individual micro-operations rather than retired instructions, thereby eliminating skid and blind spots. Each SPE sample record includes relevant metadata, such as data addresses, per-µop pipeline latency, triggered PMU event masks, and the memory hierarchy source, enabling fine-grained and precise cache analysis.

This enables software developers to tune user-space software for characteristics such as memory latency and cache accesses. Importantly, SPE is the mechanism on Arm that enables cache statistics with the Linux `perf` cache-to-cache utility, `perf c2c`. Refer to the [Arm_SPE whitepaper](https://developer.arm.com/documentation/109429/latest/) for more details.

In this learning path we will use `arm_spe` and `perf c2c` to diagnose a cache issue in an application running on a Neoverse server.

## False Sharing within the Cache

Even when two threads touch entirely separate variables, modern processors move data in fixed-size cache lines (typically 64 bytes). If those distinct variables happen to occupy bytes within the same line, every time one thread writes its variable the core’s cache must gain exclusive ownership of the whole line, forcing the other core’s copy to be invalidated. The second thread, still working on its own variable, then triggers a coherence miss to fetch the line back, and the ping-pong repeats. See the illustration below, taken from the [Arm_SPE whitepaper](https://developer.arm.com/documentation/109429/latest/), for a visual explanation.

![false_sharing_diagram](./false_sharing_diagram.png)

Because false sharing hides behind ordinary writes, the easiest time to eliminate it is while reading or refactoring the source code, by padding or realigning the offending variables before compilation. In large, highly concurrent code-bases, however, data structures are often accessed through several layers of abstraction, and many threads touch memory via indirection, so the subtle cache-line overlap may not surface until profiling or performance counters reveal unexpected coherence misses.

From a source-code perspective nothing is “shared,” but at the hardware level both variables are implicitly coupled by their physical colocation.
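
To make the cost concrete, the program below is a minimal sketch (not part of the original learning path sources) that pits the two layouts against each other: two threads repeatedly increment their own counters, first when the counters sit on one cache line and then when each counter is forced onto its own line with the `alignas` specifier covered in the next section. The file name, iteration count, and build command in the comments are illustrative assumptions.

```cpp
// false_sharing_demo.cpp (illustrative file name)
// Example build: g++ -std=c++17 -O2 -pthread false_sharing_demo.cpp -o false_sharing_demo
#include <atomic>
#include <chrono>
#include <cstdint>
#include <iostream>
#include <thread>

// Two counters that will typically land on the same 64-byte cache line.
struct SharedLine {
    std::atomic<std::uint64_t> a{0};
    std::atomic<std::uint64_t> b{0};
};

// The same two counters, each forced onto its own cache line.
struct PaddedLines {
    alignas(64) std::atomic<std::uint64_t> a{0};
    alignas(64) std::atomic<std::uint64_t> b{0};
};

// Each thread increments only its own counter; any slowdown in the
// unpadded case comes purely from cache-line ping-pong.
template <typename Counters>
double run(Counters& c, std::uint64_t iterations) {
    const auto start = std::chrono::steady_clock::now();
    std::thread t1([&] {
        for (std::uint64_t i = 0; i < iterations; ++i)
            c.a.fetch_add(1, std::memory_order_relaxed);
    });
    std::thread t2([&] {
        for (std::uint64_t i = 0; i < iterations; ++i)
            c.b.fetch_add(1, std::memory_order_relaxed);
    });
    t1.join();
    t2.join();
    return std::chrono::duration<double>(std::chrono::steady_clock::now() - start).count();
}

int main() {
    constexpr std::uint64_t iterations = 100000000;

    SharedLine shared;
    PaddedLines padded;

    std::cout << "Counters on one cache line:       " << run(shared, iterations) << " s\n";
    std::cout << "Counters on separate cache lines: " << run(padded, iterations) << " s\n";
    return 0;
}
```

On a multi-core system the padded layout typically completes noticeably faster, even though the two threads never touch each other's data.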
28+
29+
## Alignment to Cache Lines
30+
31+
In C++11, we can manually specify the alignment of an object using the `alignas` function. For example, in the C++11 source code below, we manually align the the `struct` every 64 bytes (typical cache line size on a modern processor). This ensures that each instance of `AlignedType` is on a separate cache line.
32+
33+
```cpp
#include <atomic>
#include <iostream>

struct alignas(64) AlignedType {
  AlignedType() { val = 0; }
  std::atomic<int> val;
};

int main() {
  // If we create four atomic integers like this, there's a high probability
  // they'll wind up next to each other in memory
  std::atomic<int> a;
  std::atomic<int> b;
  std::atomic<int> c;
  std::atomic<int> d;

  std::cout << "\n\nWithout Alignment can occupy same cache line\n\n";
  // Print out the addresses
  std::cout << "Address of atomic<int> a - " << &a << '\n';
  std::cout << "Address of atomic<int> b - " << &b << '\n';
  std::cout << "Address of atomic<int> c - " << &c << '\n';
  std::cout << "Address of atomic<int> d - " << &d << '\n';

  AlignedType e{};
  AlignedType f{};
  AlignedType g{};
  AlignedType h{};

  std::cout << "\n\nMin 1 cache-line* spacing between variables";
  std::cout << "\n*64 bytes = minimum 0x40 address increments\n\n";

  std::cout << "Address of AlignedType e - " << &e << '\n';
  std::cout << "Address of AlignedType f - " << &f << '\n';
  std::cout << "Address of AlignedType g - " << &g << '\n';
  std::cout << "Address of AlignedType h - " << &h << '\n';

  return 0;
}
```

The example output below shows that the variables e, f, g and h are spaced at least 64 bytes apart in our byte-addressable architecture, whereas variables a, b, c and d are only 8 bytes apart (that is, they occupy the same cache line).

Although this is a contrived example, in a production workload there may be several layers of indirection that unintentionally result in false sharing. For these complex cases, we will use `perf c2c` to understand the root cause; a sketch of such a hidden pattern follows the example output below.

```output
Without Alignment can occupy same cache line

Address of atomic<int> a - 0xffffeb6c61b8
Address of atomic<int> b - 0xffffeb6c61b0
Address of atomic<int> c - 0xffffeb6c61a8
Address of atomic<int> d - 0xffffeb6c61a0


Min 1 cache-line* spacing between variables
*64 bytes = minimum 0x40 address increments

Address of AlignedType e - 0xffffeb6c6140
Address of AlignedType f - 0xffffeb6c6100
Address of AlignedType g - 0xffffeb6c60c0
Address of AlignedType h - 0xffffeb6c6080
```
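
As an illustration of how indirection can hide the problem, the hypothetical sketch below stores per-thread counters contiguously in a `std::vector`: each worker only ever sees a reference to “its own” element, yet the 8-byte elements are packed onto one 64-byte line. The file name, thread count, and iteration count are assumptions for illustration only; this is the kind of pattern `perf c2c` can expose when the source code gives no hint of sharing.

```cpp
// hidden_false_sharing.cpp (illustrative file name)
// Example build: g++ -std=c++17 -O2 -pthread hidden_false_sharing.cpp
#include <cstdint>
#include <iostream>
#include <thread>
#include <vector>

// A per-thread statistics slot. Nothing in this declaration suggests that
// neighbouring slots will interfere with one another.
struct Stats {
    std::uint64_t events = 0;
};

// The worker receives a reference to "its own" slot, so the cache-line
// overlap with other workers is invisible at this level of the code.
void worker(Stats& s, std::uint64_t iterations) {
    for (std::uint64_t i = 0; i < iterations; ++i) {
        s.events++;  // each write needs exclusive ownership of the line
    }
}

int main() {
    constexpr int num_threads = 4;
    constexpr std::uint64_t iterations = 50000000;

    // The vector packs the 8-byte Stats objects back to back, so all four
    // slots typically share a single 64-byte cache line.
    std::vector<Stats> per_thread(num_threads);

    std::vector<std::thread> threads;
    for (int t = 0; t < num_threads; ++t) {
        threads.emplace_back(worker, std::ref(per_thread[t]), iterations);
    }
    for (auto& t : threads) {
        t.join();
    }

    std::uint64_t total = 0;
    for (const auto& s : per_thread) {
        total += s.events;
    }
    std::cout << "Total events: " << total << '\n';
    return 0;
}
```

Padding `Stats` to a full cache line, or giving each thread a private local counter that is merged at the end, removes the contention without changing program behaviour.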
Lines changed: 85 additions & 0 deletions
@@ -0,0 +1,85 @@
---
title: Setup
weight: 3

### FIXED, DO NOT MODIFY
layout: learningpathall
---

## Setup

For this tutorial, I will use a `c6g.metal` instance running Amazon Linux 2023 (AL2023). Since SPE requires support in both the hardware and the operating system, instances running certain distributions or kernels may not allow SPE-based profiling.

We can check the underlying Neoverse IP and operating system kernel version with the following commands:

```bash
lscpu | grep -i "model name"
uname -r
```

Here we observe:

```output
Model name: Neoverse-N1
6.1.134-150.224.amzn2023.aarch64
```

Next, install the prerequisite packages with the following commands:

```bash
sudo dnf update -y
sudo dnf install perf git gcc cmake numactl-devel -y
```

Since the Linux `perf` utility is a userspace process and SPE is a hardware feature in silicon, the built-in kernel module `arm_spe_pmu` is used to interact with it. Load the module with the following command:

```bash
sudo modprobe arm_spe_pmu
```

## Run Sysreport

A handy Python script is available to summarise your system's capabilities with regard to performance profiling. Install and run the System Report Python script (`sysreport`) using the [instructions in this learning path](https://learn.arm.com/learning-paths/servers-and-cloud-computing/sysreport/).

To check that SPE is available on your system, look at the `perf sampling` field. It should read `SPE`, highlighted in green.

```output
...
Performance features:
perf tools: True
perf installed at: /usr/bin/perf
perf with OpenCSD: False
perf counters: 6
perf sampling: SPE
perf HW trace: None
perf paranoid: -1
kptr_restrict: 0
perf in userspace: disabled
```

## Confirm Arm_SPE Availability

Running the following command will confirm the availability of `arm_spe`:

```bash
sudo perf list "arm_spe*"
```

You should observe the following.

```output
List of pre-defined events (to be used in -e or -M):

arm_spe_0// [Kernel PMU event]
```

If `arm_spe` is not available on your configuration, running `perf c2c` without SPE will fail. For example, you will observe the following:

```output
$ perf c2c record
failed: memory events not supported
```

{{% notice Note %}}
If you are unable to use Arm SPE, it may be a restriction based on your cloud instance size or operating system. Generally, access to a full server (also known as a metal instance) with a relatively new kernel is needed for Arm_SPE support. For more information, see the [perf-arm-spe manual page](https://man7.org/linux/man-pages/man1/perf-arm-spe.1.html).
{{% /notice %}}
