content/learning-paths/servers-and-cloud-computing/false-sharing-arm-spe/how-to-1.md
+6 -6 (6 additions & 6 deletions)
@@ -9,7 +9,7 @@ layout: learningpathall
 ## What is the Arm Statistical Profiling Extension (SPE), and what does it do?

 {{% notice Learning goal%}}
-In this section, you’ll learn how to use SPE to gain low-level insight into how your applications interact with the CPU. You’ll explore how to detect and resolve false sharing. By combining cache line alignment techniques with `perf c2c`, you can identify inefficient memory access patterns and significantly boost CPU performance on Arm-based systems.
+In this section, you’ll learn how to use SPE to gain low-level insight into how your applications interact with the CPU. You’ll explore how to detect and resolve false sharing. By combining cache line alignment techniques with Perf C2C, you can identify inefficient memory access patterns and significantly boost CPU performance on Arm-based systems.
 {{% /notice %}}

 Arm’s Statistical Profiling Extension (SPE) gives you a powerful way to understand what’s really happening inside your applications at the microarchitecture level.
@@ -27,11 +27,11 @@ SPE integrates sampling directly into the CPU pipeline, triggering on individual

 This enables fine-grained, precise cache analysis.

-SPE helps developers optimize user-space applications by showing where cache latency or memory access delays are happening. Importantly, cache statistics are enabled with the Linux `perf` cache-to-cache (C2C) utility.
+SPE helps developers optimize user-space applications by showing where cache latency or memory access delays are happening. Importantly, cache statistics are enabled with the Linux Perf Cache-to-Cache (C2C) utility.

 For more information, see the [*Arm Statistical Profiling Extension: Performance Analysis Methodology White Paper*](https://developer.arm.com/documentation/109429/latest/).

-In this Learning Path, you will use SPE and `perf c2c` to diagnose a cache issue for an application running on a Neoverse server.
+In this Learning Path, you will use SPE and Perf C2C to diagnose a cache issue for an application running on a Neoverse server.

 ## What is false sharing and why should I care about it?

@@ -47,7 +47,7 @@ The diagram below, taken from the Arm SPE white paper, provides a visual represe

 ## Why false sharing is hard to spot and fix

-False sharing often hides behind seemingly ordinary writes, making it tricky to catch without tooling. The best time to eliminate it is early, while reading or refactoring code, by padding or realigning variables before compilation. But in large, highly concurrent C++ codebases, memory is frequently accessed through multiple layers of abstraction. Threads may interact with shared data indirectly, causing subtle cache line overlaps that don’t become obvious until performance profiling reveals unexpected coherence misses. Tools like `perf c2c` can help uncover these issues by tracing cache-to-cache transfers and identifying hot memory locations affected by false sharing.
+False sharing often hides behind seemingly ordinary writes, making it tricky to catch without tooling. The best time to eliminate it is early, while reading or refactoring code, by padding or realigning variables before compilation. But in large, highly concurrent C++ codebases, memory is frequently accessed through multiple layers of abstraction. Threads may interact with shared data indirectly, causing subtle cache line overlaps that don’t become obvious until performance profiling reveals unexpected coherence misses. Tools like Perf C2C can help uncover these issues by tracing cache-to-cache transfers and identifying hot memory locations affected by false sharing.

 From a source-code perspective nothing is “shared,” but at the hardware level both variables are implicitly coupled by their physical location.

@@ -101,7 +101,7 @@ int main() {

 The output below shows that the variables e, f, g and h occur at least 64 bytes apart in the byte-addressable architecture. Whereas variables a, b, c, and d occur 8 bytes apart, occupying the same cache line.

-Although this is a simplified example, in a production workload there might be several layers of indirection that unintentionally result in false sharing. For these complex cases, use `perf c2c` to trace cache line interactions and pinpoint the root cause of performance issues.
+Although this is a simplified example, in a production workload there might be several layers of indirection that unintentionally result in false sharing. For these complex cases, use Perf C2C to trace cache line interactions and pinpoint the root cause of performance issues.

 ```output
 Without Alignment can occupy same cache line
@@ -125,6 +125,6 @@ Address of AlignedType h - 0xffffeb6c6080

 In this section, you explored what Arm SPE is and why it offers a deeper, more accurate view of application performance. You also examined how a subtle issue like false sharing can impact multithreaded code, and how to mitigate it using data alignment techniques in C++.

-Next, you'll set up your environment and use `perf c2c` to capture and analyze real-world cache behavior on an Arm Neoverse system.
+Next, you'll set up your environment and use Perf C2C to capture and analyze real-world cache behavior on an Arm Neoverse system.
content/learning-paths/servers-and-cloud-computing/false-sharing-arm-spe/how-to-2.md
+5 -5 (5 additions & 5 deletions)
@@ -1,5 +1,5 @@
 ---
-title: Set up your environment for Arm SPE and perf c2c profiling
+title: Set up your environment for Arm SPE and Perf C2C profiling
 weight: 3

 ### FIXED, DO NOT MODIFY
@@ -8,7 +8,7 @@ layout: learningpathall
 ## Select a system with SPE support

 {{% notice Learning goal%}}
-Before you can start profiling cache behavior with Arm SPE and `perf c2c`, your system needs to meet a few requirements. In this section, you’ll learn how to check whether your hardware and kernel support Arm SPE, install the necessary tools, and validate that Linux perf can access the correct performance monitoring events. By the end, your environment will be ready to record and analyze memory access patterns using `perf c2c` on an Arm Neoverse system.
+Before you can start profiling cache behavior with Arm SPE and Perf C2C, your system needs to meet a few requirements. In this section, you’ll learn how to check whether your hardware and kernel support Arm SPE, install the necessary tools, and validate that Linux Perf can access the correct performance monitoring events. By the end, your environment will be ready to record and analyze memory access patterns using `perf c2c` on an Arm Neoverse system.
 {{% /notice %}}

 SPE requires support from both your hardware and the operating system. Many cloud instances running Linux do not enable SPE-based profiling.
-Linux perf is a userspace process and SPE is a hardware feature. The Linux kernel must be compiled with SPE support or the kernel module named `arm_spe_pmu` must be loaded.
+Linux Perf is a userspace process and SPE is a hardware feature. The Linux kernel must be compiled with SPE support or the kernel module named `arm_spe_pmu` must be loaded.

 Run the following command to confirm if the SPE kernel module is loaded:

@@ -86,7 +86,7 @@ Performance features:
 perf in userspace: disabled
 ```

-## Confirm Arm SPE is available to perf
+## Confirm Arm SPE is available to Perf

 Run the following command to confirm SPE is available to `perf`:

@@ -132,4 +132,4 @@ For more information about enabling SPE, see the [perf-arm-spe manual page](http

 ## Summary

-You've confirmed that your system supports Arm SPE, installed the necessary tools, and verified that `perf`can access SPE events. You're now ready to start collecting detailed performance data using `perf c2c`. In the next section, you’ll run a real application and use `perf c2c` to capture cache sharing behavior and uncover memory performance issues.
+You've confirmed that your system supports Arm SPE, installed the necessary tools, and verified that Perf C2C can access SPE events. You're now ready to start collecting detailed performance data using Perf C2C. In the next section, you’ll run a real application and use Perf C2C to capture cache sharing behavior and uncover memory performance issues.
content/learning-paths/servers-and-cloud-computing/false-sharing-arm-spe/how-to-3.md
+1 -1 (1 addition & 1 deletion)
@@ -9,7 +9,7 @@ layout: learningpathall
 ## Example code

 {{% notice Learning Goal%}}
-The example code in this section demonstrates how false sharing affects performance by comparing two multithreaded programs; one with cache-aligned data structures, and one without. You’ll compile and run both versions, observe the runtime difference, and learn how memory layout affects cache behavior. This sets the stage for analyzing performance with `perf c2c` in the next section.
+The example code in this section demonstrates how false sharing affects performance by comparing two multithreaded programs; one with cache-aligned data structures, and one without. You’ll compile and run both versions, observe the runtime difference, and learn how memory layout affects cache behavior. This sets the stage for analyzing performance with Perf C2C in the next section.
 {{% /notice %}}

 Use a text editor to copy and paste the C example code below into a file named `false_sharing_example.c`