
Commit b51bac0

Merge pull request #1994 from madeline-underwood/cache
false_sharing_JA to review
2 parents 432e265 + 4696bb2 commit b51bac0


5 files changed

+100
-55
lines changed


content/learning-paths/servers-and-cloud-computing/false-sharing-arm-spe/_index.md

Lines changed: 3 additions & 7 deletions
@@ -1,16 +1,12 @@
11
---
2-
title: Analyze cache behavior with Perf C2C on Arm
3-
4-
draft: true
5-
cascade:
6-
draft: true
2+
title: Analyze cache behavior with perf c2c on Arm
73

84
minutes_to_complete: 15
95

10-
who_is_this_for: This topic is for developers who want to optimize cache access patterns on Arm servers using Perf C2C.
6+
who_is_this_for: This topic is for performance-oriented developers working on Arm-based cloud or server systems who want to optimize memory access patterns and investigate cache inefficiencies using Perf C2C and Arm SPE.
117

128
learning_objectives:
13-
- Avoid false sharing in C++ using memory alignment.
9+
- Identify and fix false sharing issues using Perf C2C, a cache line analysis tool.
1410
- Enable and use the Arm Statistical Profiling Extension (SPE) on Linux systems.
1511
- Investigate cache line performance with Perf C2C.
1612

content/learning-paths/servers-and-cloud-computing/false-sharing-arm-spe/how-to-1.md

Lines changed: 47 additions & 16 deletions
@@ -1,36 +1,61 @@
11
---
2-
title: Introduction to Arm SPE and false sharing
2+
title: Arm Statistical Profiling Extension and false sharing
33
weight: 2
44

55
### FIXED, DO NOT MODIFY
66
layout: learningpathall
77
---
88

9-
## Introduction to the Arm Statistical Profiling Extension (SPE)
9+
## What is the Arm Statistical Profiling Extension (SPE), and what does it do?
1010

11-
Standard performance tracing relies on counting completed instructions, capturing only architectural instructions without revealing the actual memory addresses, pipeline latencies, or considering micro-operations in flight. Moreover, the “skid” phenomenon where events are falsely attributed to later instructions can mislead developers.
11+
{{% notice Learning goal %}}
12+
In this section, you’ll learn how to use SPE to gain low-level insight into how your applications interact with the CPU. You’ll explore how to detect and resolve false sharing. By combining cache line alignment techniques with `perf c2c`, you can identify inefficient memory access patterns and significantly boost CPU performance on Arm-based systems.
13+
{{% /notice %}}
1214

13-
SPE integrates sampling directly into the CPU pipeline, triggering on individual micro-operations rather than retired instructions, thereby eliminating skid and blind spots. Each SPE sample record includes relevant metadata, such as data addresses, per-µop pipeline latency, triggered PMU event masks, and the memory hierarchy source, enabling fine-grained and precise cache analysis.
15+
Arm’s Statistical Profiling Extension (SPE) gives you a powerful way to understand what’s really happening inside your applications at the microarchitecture level.
1416

15-
This enables software developers to tune user-space software for characteristics such as memory latency and cache accesses. Importantly, cache statistics are enabled with the Linux Perf cache-to-cache (C2C) utility.
17+
Introduced in Armv8.2, SPE captures a statistical view of how instructions move through the CPU, which allows you to dig into issues like memory access latency, cache misses, and pipeline behavior.
1618

17-
Please refer to the [Arm SPE white paper](https://developer.arm.com/documentation/109429/latest/) for more details.
19+
Most Linux profiling tools focus on retired instruction counts, which means they miss key details like memory addresses, cache latency, and micro-operation behavior. This can lead to misleading results, especially due to a phenomenon called “skid,” where events are falsely attributed to later instructions.
1820

19-
In this Learning Path, you will use SPE and Perf C2C to diagnose a cache issue for an application running on a Neoverse server.
21+
SPE integrates sampling directly into the CPU pipeline, triggering on individual micro-operations instead of retired instructions. This approach eliminates skid and blind spots. Each SPE sample record includes relevant metadata, such as:
2022

21-
## False sharing within the cache
23+
* Data addresses
24+
* Per-µop pipeline latency
25+
* Triggered PMU event masks
26+
* Memory hierarchy source
2227

23-
Even when two threads touch entirely separate variables, modern processors move data in fixed-size cache lines (nominally 64-bytes). If those distinct variables happen to occupy bytes within the same line, every time one thread writes its variable the core’s cache must gain exclusive ownership of the whole line, forcing the other core’s copy to be invalidated. The second thread, still working on its own variable, then triggers a coherence miss to fetch the line back, and the ping-pong pattern repeats. Please see the illustration below, taken from the Arm SPE white paper, for a visual explanation.
28+
This enables fine-grained, precise cache analysis.
2429

25-
![false_sharing_diagram](./false_sharing_diagram.png)
30+
SPE helps developers optimize user-space applications by showing where cache latency or memory access delays are happening. Importantly, cache statistics are enabled with the Linux `perf` cache-to-cache (C2C) utility.
2631

27-
Because false sharing hides behind ordinary writes, the easiest time to eliminate it is while reading or refactoring the source code by padding or realigning the offending variables before compilation. In large, highly concurrent codebases, however, data structures are often accessed through several layers of abstraction, and many threads touch memory via indirection, so the subtle cache-line overlap may not surface until profiling or performance counters reveal unexpected coherence misses.
32+
For more information, see the [*Arm Statistical Profiling Extension: Performance Analysis Methodology White Paper*](https://developer.arm.com/documentation/109429/latest/).
33+
34+
In this Learning Path, you will use SPE and `perf c2c` to diagnose a cache issue for an application running on a Neoverse server.
35+
36+
## What is false sharing and why should I care about it?
37+
38+
In large-scale, multithreaded applications, false sharing can degrade performance by introducing hundreds of unnecessary cache line invalidations per second, often with no visible red flags in the source code.
39+
40+
Even when two threads touch entirely separate variables, modern processors move data in fixed-size cache lines, which is typically 64 bytes. If those distinct variables happen to occupy bytes within the same line, every time one thread writes its variable the core’s cache must gain exclusive ownership of the whole line, forcing the other core’s copy to be invalidated.
41+
42+
The second thread, still working on its own variable, then triggers a coherence miss to fetch the line back, and the ping-pong pattern repeats.
43+
44+
The diagram below, taken from the Arm SPE white paper, provides a visual representation of two threads on separate cores alternately gaining exclusive access to the same cache line.
45+
46+
![false_sharing_diagram alt-text#center](./false_sharing_diagram.png "Two threads on separate cores alternately gain exclusive access to the same cache line.")
47+
48+
## Why false sharing is hard to spot and fix
49+
50+
False sharing often hides behind seemingly ordinary writes, making it tricky to catch without tooling. The best time to eliminate it is early, while reading or refactoring code, by padding or realigning variables before compilation. But in large, highly concurrent C++ codebases, memory is frequently accessed through multiple layers of abstraction. Threads may interact with shared data indirectly, causing subtle cache line overlaps that don’t become obvious until performance profiling reveals unexpected coherence misses. Tools like `perf c2c` can help uncover these issues by tracing cache-to-cache transfers and identifying hot memory locations affected by false sharing.
2851

2952
From a source-code perspective nothing is “shared,” but at the hardware level both variables are implicitly coupled by their physical location.
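
To make the ping-pong pattern concrete, the minimal sketch below runs two threads that each update their own atomic counter, with both counters almost certainly landing on the same 64-byte cache line. It is not part of the Learning Path example; the struct name, field names, and iteration count are illustrative.

```cpp
// Minimal false-sharing sketch. Assumes a 64-byte cache line and an LP64
// platform; compile with, for example: g++ -O2 -std=c++11 -pthread sketch.cpp
#include <atomic>
#include <chrono>
#include <iostream>
#include <thread>

struct Counters {
    std::atomic<long> a{0};  // written only by thread 1
    std::atomic<long> b{0};  // written only by thread 2, 8 bytes after 'a'
};

int main() {
    Counters c;

    // Each thread touches only its own variable, yet every write forces the
    // shared cache line to bounce between the two cores.
    auto work = [](std::atomic<long>& counter) {
        for (long i = 0; i < 50000000; ++i) {
            counter.fetch_add(1, std::memory_order_relaxed);
        }
    };

    auto start = std::chrono::steady_clock::now();
    std::thread t1(work, std::ref(c.a));
    std::thread t2(work, std::ref(c.b));
    t1.join();
    t2.join();
    std::chrono::duration<double> elapsed =
        std::chrono::steady_clock::now() - start;

    std::cout << "a=" << c.a.load() << " b=" << c.b.load()
              << " elapsed=" << elapsed.count() << "s\n";
    return 0;
}
```

On most systems this runs noticeably slower than a version in which the two counters sit on separate cache lines, which is exactly what the alignment technique described next provides.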
3053

3154
## Alignment to cache lines
3255

33-
In C++11, you can manually specify the alignment of an object with the `alignas` specifier. For example, the C++11 source code below manually aligns the the `struct` every 64 bytes (typical cache line size on a modern processor). This ensures that each instance of `AlignedType` is on a separate cache line.
56+
In C++11, you can manually specify the alignment of an object with the `alignas` specifier.
57+
58+
For example, the C++11 source code below aligns the `struct` on a 64-byte boundary (the typical cache line size on a modern processor). This ensures that each instance of `AlignedType` starts on a separate cache line.
3459

3560
```cpp
3661
#include <atomic>
@@ -43,7 +68,7 @@ struct alignas(64) AlignedType {
4368

4469

4570
int main() {
46-
// If we create four atomic integers like this, there's a high probability
71+
// If you create four atomic integers like this, there's a high probability
4772
// they'll wind up next to each other in memory
4873
std::atomic<int> a;
4974
std::atomic<int> b;
@@ -74,9 +99,9 @@ int main() {
7499
}
75100
```
76101

77-
The example output below shows the variables e, f, g and h occur at least 64-bytes apart in the byte-addressable architecture. Whereas variables a, b, c and d occur 8 bytes apart, occupying the same cache line.
102+
The output below shows that the variables e, f, g, and h are at least 64 bytes apart in the byte-addressable architecture, whereas variables a, b, c, and d are only 8 bytes apart and therefore occupy the same cache line.
78103

79-
Although this is a contrived example, in a production workload there may be several layers of indirection that unintentionally result in false sharing. For these complex cases, to understand the root cause you will use Perf C2C.
104+
Although this is a simplified example, in a production workload there might be several layers of indirection that unintentionally result in false sharing. For these complex cases, use `perf c2c` to trace cache line interactions and pinpoint the root cause of performance issues.
80105

81106
```output
82107
Without Alignment can occupy same cache line
@@ -96,4 +121,10 @@ Address of AlignedType g - 0xffffeb6c60c0
96121
Address of AlignedType h - 0xffffeb6c6080
97122
```
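
Applying the same `alignas` specifier to the false-sharing sketch shown earlier gives each counter its own cache line. The fragment below is a minimal illustration; the `PaddedCounter` name is not from the example.

```cpp
// Each PaddedCounter is forced onto its own 64-byte cache line.
#include <atomic>
#include <cstdint>
#include <iostream>

struct alignas(64) PaddedCounter {
    std::atomic<long> value{0};
};

int main() {
    PaddedCounter counters[2];

    // The stride between adjacent elements is now at least 64 bytes, so two
    // threads updating counters[0] and counters[1] no longer contend for the
    // same cache line.
    std::cout << "sizeof(PaddedCounter) = " << sizeof(PaddedCounter) << " bytes\n";
    std::cout << "stride = "
              << reinterpret_cast<std::uintptr_t>(&counters[1]) -
                 reinterpret_cast<std::uintptr_t>(&counters[0])
              << " bytes\n";
    return 0;
}
```

The printed stride of 64 bytes mirrors the address spacing of e, f, g, and h in the output above.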
98123

99-
Continue to the next section to learn how to set up a system to run Perf C2C.
124+
## Summary
125+
126+
In this section, you explored what Arm SPE is and why it offers a deeper, more accurate view of application performance. You also examined how a subtle issue like false sharing can impact multithreaded code, and how to mitigate it using data alignment techniques in C++.
127+
128+
Next, you'll set up your environment and use `perf c2c` to capture and analyze real-world cache behavior on an Arm Neoverse system.
129+
130+

content/learning-paths/servers-and-cloud-computing/false-sharing-arm-spe/how-to-2.md

Lines changed: 24 additions & 19 deletions
@@ -1,20 +1,23 @@
11
---
2-
title: Configure your environment for Arm SPE profiling
2+
title: Set up your environment for Arm SPE and perf c2c profiling
33
weight: 3
44

55
### FIXED, DO NOT MODIFY
66
layout: learningpathall
77
---
8-
98
## Select a system with SPE support
109

11-
SPE requires both hardware and operating system support. Many cloud instances running Linux do not enable SPE-based profiling.
10+
{{% notice Learning goal %}}
11+
Before you can start profiling cache behavior with Arm SPE and `perf c2c`, your system needs to meet a few requirements. In this section, you’ll learn how to check whether your hardware and kernel support Arm SPE, install the necessary tools, and validate that Linux perf can access the correct performance monitoring events. By the end, your environment will be ready to record and analyze memory access patterns using `perf c2c` on an Arm Neoverse system.
12+
{{% /notice %}}
13+
14+
SPE requires support from both your hardware and the operating system. Many cloud instances running Linux do not enable SPE-based profiling.
1215

1316
You need to identify a system that supports SPE using the information below.
1417

1518
If you are looking for an AWS system, you can use a `c6g.metal` instance running Amazon Linux 2023 (AL2023).
1619

17-
Check the underlying Neoverse processor and operating system kernel version with the following commands.
20+
Check the underlying Neoverse processor and operating system kernel version with the following commands:
1821

1922
```bash
2023
lscpu | grep -i "model name"
@@ -23,7 +26,7 @@ uname -r
2326

2427
The output includes the CPU type and kernel release version:
2528

26-
```ouput
29+
```output
2730
Model name: Neoverse-N1
2831
6.1.134-152.225.amzn2023.aarch64
2932
```
@@ -35,23 +38,23 @@ sudo dnf update -y
3538
sudo dnf install perf git gcc cmake numactl-devel -y
3639
```
3740

38-
Linux Perf is a userspace process and SPE is a hardware feature. The Linux kernel must be compiled with SPE support or the kernel module named `arm_spe_pmu` must be loaded.
41+
Linux perf is a userspace process and SPE is a hardware feature. The Linux kernel must be compiled with SPE support or the kernel module named `arm_spe_pmu` must be loaded.
3942

4043
Run the following command to confirm if the SPE kernel module is loaded:
4144

4245
```bash
4346
sudo modprobe arm_spe_pmu
4447
```
4548

46-
If the module is not loaded (blank output), SPE may still be available.
49+
If the module is not loaded (and there is blank output), SPE might still be available.
4750

4851
Run this command to check if SPE is included in the kernel:
4952

5053
```bash
5154
ls /sys/bus/event_source/devices/ | grep arm_spe
5255
```
5356

54-
If SPE is available, the output is:
57+
If SPE is available, the output you will see is:
5558

5659
```output
5760
arm_spe_0
@@ -63,11 +66,11 @@ If the output is blank then SPE is not available.
6366

6467
You can install and run a Python script named Sysreport to summarize your system's performance profiling capabilities.
6568

66-
Refer to [Get ready for performance analysis with Sysreport](https://learn.arm.com/learning-paths/servers-and-cloud-computing/sysreport/) to learn how to install and run it.
69+
See the Learning Path [Get ready for performance analysis with Sysreport](https://learn.arm.com/learning-paths/servers-and-cloud-computing/sysreport/) to learn how to install and run it.
6770

6871
Look at the Sysreport output and confirm SPE is available by checking the `perf sampling` field.
6972

70-
If the printed value is SPE then SPE is available.
73+
If the printed value is SPE, then SPE is available.
7174

7275
```output
7376
...
@@ -83,9 +86,9 @@ Performance features:
8386
perf in userspace: disabled
8487
```
8588

86-
## Confirm Arm SPE is available to Perf
89+
## Confirm Arm SPE is available to perf
8790

88-
Run the following command to confirm SPE is available to Perf:
91+
Run the following command to confirm SPE is available to `perf`:
8992

9093
```bash
9194
sudo perf list "arm_spe*"
@@ -99,32 +102,34 @@ List of pre-defined events (to be used in -e or -M):
99102
arm_spe_0// [Kernel PMU event]
100103
```
101104

102-
Assign capabilities to Perf by running:
105+
Assign capabilities to `perf` by running:
103106

104107
```bash
105108
sudo setcap cap_perfmon,cap_sys_ptrace,cap_sys_admin+ep $(which perf)
106109
```
107110

108-
If `arm_spe` is not available because of your system configuration or if you don't have PMU permission, the `perf c2c` command will fail.
111+
If `arm_spe` isn’t available due to your system configuration or limited PMU access, the `perf c2c` command will fail.
109112

110-
To confirm Perf can access SPE run:
113+
To confirm `perf` can access SPE, run:
111114

112115
```bash
113116
perf c2c record
114117
```
115118

116-
The output showing the failure is:
119+
If SPE access is blocked, you’ll see output like this:
117120

118121
```output
119122
failed: memory events not supported
120123
```
121124

122125
{{% notice Note %}}
123-
If you are unable to use SPE it may be a restriction based on your cloud instance size or operating system.
126+
If you are unable to use SPE it might be a restriction based on your cloud instance size or operating system.
124127

125-
Generally, access to a full server (also known as metal instances) with a relatively new kernel is needed for Arm SPE support.
128+
Generally, access to a full server (also known as metal instances) with a relatively new kernel is required for Arm SPE support.
126129

127130
For more information about enabling SPE, see the [perf-arm-spe manual page](https://man7.org/linux/man-pages/man1/perf-arm-spe.1.html).
128131
{{% /notice %}}
129132

130-
Continue to learn how to use Perf C2C on an example application.
133+
## Summary
134+
135+
You've confirmed that your system supports Arm SPE, installed the necessary tools, and verified that `perf` can access SPE events. You're now ready to start collecting detailed performance data using `perf c2c`. In the next section, you’ll run a real application and use `perf c2c` to capture cache sharing behavior and uncover memory performance issues.

content/learning-paths/servers-and-cloud-computing/false-sharing-arm-spe/how-to-3.md

Lines changed: 9 additions & 4 deletions
@@ -1,5 +1,5 @@
11
---
2-
title: False Sharing Example
2+
title: False sharing example
33
weight: 4
44

55
### FIXED, DO NOT MODIFY
@@ -8,6 +8,10 @@ layout: learningpathall
88

99
## Example code
1010

11+
{{% notice Learning goal %}}
12+
The example code in this section demonstrates how false sharing affects performance by comparing two multithreaded programs: one with cache-aligned data structures and one without. You’ll compile and run both versions, observe the runtime difference, and learn how memory layout affects cache behavior. This sets the stage for analyzing performance with `perf c2c` in the next section.
13+
{{% /notice %}}
14+
1115
Use a text editor to copy and paste the C example code below into a file named `false_sharing_example.c`.
1216

1317
The code is adapted from [Joe Mario](https://github.com/joemario/perf-c2c-usage-files) and is discussed thoroughly in the Arm Statistical Profiling Extension Whitepaper.
@@ -285,7 +289,7 @@ int main ( int argc, char *argv[] )
285289
286290
### Code explanation
287291
288-
The key data structure that occupies the cache is `struct Buf`. With a 64-byte cache line size, each line can hold 8, 8-byte `long` integers.
292+
The key data structure that occupies the cache is `struct _buf`. With a 64-byte cache line size, each line can hold eight 8-byte `long` integers.
289293
290294
If you do not pass in the `NO_FALSE_SHARING` macro during compilation, the `Buf` data structure contains the elements below. Each structure neatly occupies an entire 64-byte cache line.
291295
@@ -306,7 +310,7 @@ typedef struct _buf {
306310

307311
Alternatively, if you pass in the `NO_FALSE_SHARING` macro during compilation, the `Buf` structure has a different shape.
308312

309-
The 40 bytes of padding pushes the reader variables onto a different cache line. However, notice that this is with the tradeoff the new `Buf` structures occupies multiple cache lines (12 long integers). Therefore it leaves unused cache space of 25% per `Buf` structure.
313+
The 40 bytes of padding pushes the reader variables onto a different cache line. The trade-off is that the new `Buf` structure occupies multiple cache lines (12 long integers), leaving about 25% of the cache space per `Buf` structure unused. This uses more memory but eliminates false sharing, improving performance by reducing cache line contention.
310314

311315
```output
312316
typedef struct _buf {
@@ -345,5 +349,6 @@ user 0m8.869s
345349
sys 0m0.000s
346350
```
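
As a quick check of the arithmetic above, the sketch below reconstructs the padded layout with illustrative field names (the identifiers in the real example differ): three writer `long` values plus 40 bytes of padding fill the first 64-byte line, and the four reader `long` values start on the second, giving 12 `long`-sized slots in total. It assumes an 8-byte `long` and that the structure itself is allocated on a cache-line boundary, as in the example.

```cpp
// Layout check only; field names are placeholders, not the example's own.
#include <cstddef>

struct PaddedBuf {
    long writer[3];  // 24 bytes, updated by the writer thread
    char pad[40];    // padding fills the remainder of the first cache line
    long reader[4];  // 32 bytes, read by the reader threads
};

static_assert(offsetof(PaddedBuf, reader) == 64,
              "reader fields begin on the second cache line");
static_assert(sizeof(PaddedBuf) == 96,
              "12 long-sized slots: two cache lines with 32 bytes (25%) unused");
```

If the assertions hold, padding alone has moved the reader fields off the writers' cache line, which is the effect measured in the timing comparison above.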
347351

348-
Continue to the next section to learn how to use Perf C2C to analyze the example code.
352+
## Summary
353+
In this section, you ran a hands-on C example to see how false sharing can significantly degrade performance in multithreaded applications. By comparing two versions of the same program, one with aligned memory access and one without, you saw how something as subtle as cache line layout can result in a 2x difference in runtime. This practical example sets the foundation for using `perf c2c` to capture and analyze real cache line sharing behavior in the next section.
349354
