Commit 1f5bb56

Merge pull request #2004 from madeline-underwood/perf-naming-changes
terminology fixes
2 parents 9c03b8c + 3646ade commit 1f5bb56

4 files changed: +13 -13 lines changed

content/learning-paths/servers-and-cloud-computing/false-sharing-arm-spe/_index.md

Lines changed: 1 addition & 1 deletion
@@ -1,5 +1,5 @@
 ---
-title: Analyze cache behavior with perf c2c on Arm
+title: Analyze cache behavior with Perf C2C on Arm

 minutes_to_complete: 15

content/learning-paths/servers-and-cloud-computing/false-sharing-arm-spe/how-to-1.md

Lines changed: 6 additions & 6 deletions
@@ -9,7 +9,7 @@ layout: learningpathall
 ## What is the Arm Statistical Profiling Extension (SPE), and what does it do?

 {{% notice Learning goal%}}
-In this section, you’ll learn how to use SPE to gain low-level insight into how your applications interact with the CPU. You’ll explore how to detect and resolve false sharing. By combining cache line alignment techniques with `perf c2c`, you can identify inefficient memory access patterns and significantly boost CPU performance on Arm-based systems.
+In this section, you’ll learn how to use SPE to gain low-level insight into how your applications interact with the CPU. You’ll explore how to detect and resolve false sharing. By combining cache line alignment techniques with Perf C2C, you can identify inefficient memory access patterns and significantly boost CPU performance on Arm-based systems.
 {{% /notice %}}

 Arm’s Statistical Profiling Extension (SPE) gives you a powerful way to understand what’s really happening inside your applications at the microarchitecture level.
@@ -27,11 +27,11 @@ SPE integrates sampling directly into the CPU pipeline, triggering on individual

 This enables fine-grained, precise cache analysis.

-SPE helps developers optimize user-space applications by showing where cache latency or memory access delays are happening. Importantly, cache statistics are enabled with the Linux `perf` cache-to-cache (C2C) utility.
+SPE helps developers optimize user-space applications by showing where cache latency or memory access delays are happening. Importantly, cache statistics are enabled with the Linux Perf Cache-to-Cache (C2C) utility.

 For more information, see the [*Arm Statistical Profiling Extension: Performance Analysis Methodology White Paper*](https://developer.arm.com/documentation/109429/latest/).

-In this Learning Path, you will use SPE and `perf c2c` to diagnose a cache issue for an application running on a Neoverse server.
+In this Learning Path, you will use SPE and Perf C2C to diagnose a cache issue for an application running on a Neoverse server.

 ## What is false sharing and why should I care about it?

@@ -47,7 +47,7 @@ The diagram below, taken from the Arm SPE white paper, provides a visual represe

 ## Why false sharing is hard to spot and fix

-False sharing often hides behind seemingly ordinary writes, making it tricky to catch without tooling. The best time to eliminate it is early, while reading or refactoring code, by padding or realigning variables before compilation. But in large, highly concurrent C++ codebases, memory is frequently accessed through multiple layers of abstraction. Threads may interact with shared data indirectly, causing subtle cache line overlaps that don’t become obvious until performance profiling reveals unexpected coherence misses. Tools like `perf c2c` can help uncover these issues by tracing cache-to-cache transfers and identifying hot memory locations affected by false sharing.
+False sharing often hides behind seemingly ordinary writes, making it tricky to catch without tooling. The best time to eliminate it is early, while reading or refactoring code, by padding or realigning variables before compilation. But in large, highly concurrent C++ codebases, memory is frequently accessed through multiple layers of abstraction. Threads may interact with shared data indirectly, causing subtle cache line overlaps that don’t become obvious until performance profiling reveals unexpected coherence misses. Tools like Perf C2C can help uncover these issues by tracing cache-to-cache transfers and identifying hot memory locations affected by false sharing.

 From a source-code perspective nothing is “shared,” but at the hardware level both variables are implicitly coupled by their physical location.

@@ -101,7 +101,7 @@ int main() {

 The output below shows that the variables e, f, g and h occur at least 64 bytes apart in the byte-addressable architecture. Whereas variables a, b, c, and d occur 8 bytes apart, occupying the same cache line.

-Although this is a simplified example, in a production workload there might be several layers of indirection that unintentionally result in false sharing. For these complex cases, use `perf c2c` to trace cache line interactions and pinpoint the root cause of performance issues.
+Although this is a simplified example, in a production workload there might be several layers of indirection that unintentionally result in false sharing. For these complex cases, use Perf C2C to trace cache line interactions and pinpoint the root cause of performance issues.

 ```output
 Without Alignment can occupy same cache line
@@ -125,6 +125,6 @@ Address of AlignedType h - 0xffffeb6c6080

 In this section, you explored what Arm SPE is and why it offers a deeper, more accurate view of application performance. You also examined how a subtle issue like false sharing can impact multithreaded code, and how to mitigate it using data alignment techniques in C++.

-Next, you'll set up your environment and use `perf c2c` to capture and analyze real-world cache behavior on an Arm Neoverse system.
+Next, you'll set up your environment and use Perf C2C to capture and analyze real-world cache behavior on an Arm Neoverse system.

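
Note on how-to-1.md: the alignment technique that the changed paragraphs refer to can be illustrated with a minimal C++ sketch. This is not the Learning Path's own example code, which the diff elides. The names `PackedCounters`, `AlignedCounters`, `same_cache_line`, and `check_layout` are hypothetical, and the 64-byte cache line size is an assumption based on the 64-byte spacing mentioned in the text.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Assumption: 64-byte cache lines, matching the spacing described in the text.
constexpr std::size_t kCacheLine = 64;

// Two counters stored 8 bytes apart: they can land in the same cache line.
struct PackedCounters {
    std::uint64_t a;
    std::uint64_t b;
};

// Each counter is forced onto its own cache line with alignas.
struct AlignedCounters {
    alignas(kCacheLine) std::uint64_t a;
    alignas(kCacheLine) std::uint64_t b;
};

// Returns true when both addresses fall inside the same cache line.
inline bool same_cache_line(const void* p, const void* q) {
    auto line = [](const void* x) {
        return reinterpret_cast<std::uintptr_t>(x) / kCacheLine;
    };
    return line(p) == line(q);
}

// Compile-time checks on the aligned layout.
static_assert(alignof(AlignedCounters) == kCacheLine,
              "struct must be cache-line aligned");
static_assert(sizeof(AlignedCounters) == 2 * kCacheLine,
              "each member must occupy its own cache line");

// Hypothetical self-check that exercises both layouts at run time.
inline void check_layout() {
    // Start the packed struct on a line boundary so the result is deterministic.
    alignas(kCacheLine) PackedCounters packed{};
    AlignedCounters aligned{};
    assert(same_cache_line(&packed.a, &packed.b));     // candidates for false sharing
    assert(!same_cache_line(&aligned.a, &aligned.b));  // isolated from each other
}
```

Calling `check_layout()` from `main` exercises both cases. If two threads each updated one member of `PackedCounters`, their writes could contend for a single cache line, which is the pattern Perf C2C is meant to expose. The trade-off of the aligned layout is memory: 128 bytes instead of 16 for the same two counters.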

content/learning-paths/servers-and-cloud-computing/false-sharing-arm-spe/how-to-2.md

Lines changed: 5 additions & 5 deletions
@@ -1,5 +1,5 @@
 ---
-title: Set up your environment for Arm SPE and perf c2c profiling
+title: Set up your environment for Arm SPE and Perf C2C profiling
 weight: 3

 ### FIXED, DO NOT MODIFY
@@ -8,7 +8,7 @@ layout: learningpathall
 ## Select a system with SPE support

 {{% notice Learning goal%}}
-Before you can start profiling cache behavior with Arm SPE and `perf c2c`, your system needs to meet a few requirements. In this section, you’ll learn how to check whether your hardware and kernel support Arm SPE, install the necessary tools, and validate that Linux perf can access the correct performance monitoring events. By the end, your environment will be ready to record and analyze memory access patterns using `perf c2c` on an Arm Neoverse system.
+Before you can start profiling cache behavior with Arm SPE and Perf C2C, your system needs to meet a few requirements. In this section, you’ll learn how to check whether your hardware and kernel support Arm SPE, install the necessary tools, and validate that Linux Perf can access the correct performance monitoring events. By the end, your environment will be ready to record and analyze memory access patterns using `perf c2c` on an Arm Neoverse system.
 {{% /notice %}}

 SPE requires support from both your hardware and the operating system. Many cloud instances running Linux do not enable SPE-based profiling.
@@ -38,7 +38,7 @@ sudo dnf update -y
 sudo dnf install perf git gcc cmake numactl-devel -y
 ```

-Linux perf is a userspace process and SPE is a hardware feature. The Linux kernel must be compiled with SPE support or the kernel module named `arm_spe_pmu` must be loaded.
+Linux Perf is a userspace process and SPE is a hardware feature. The Linux kernel must be compiled with SPE support or the kernel module named `arm_spe_pmu` must be loaded.

 Run the following command to confirm if the SPE kernel module is loaded:

@@ -86,7 +86,7 @@ Performance features:
 perf in userspace: disabled
 ```

-## Confirm Arm SPE is available to perf
+## Confirm Arm SPE is available to Perf

 Run the following command to confirm SPE is available to `perf`:

@@ -132,4 +132,4 @@ For more information about enabling SPE, see the [perf-arm-spe manual page](http

 ## Summary

-You've confirmed that your system supports Arm SPE, installed the necessary tools, and verified that `perf` can access SPE events. You're now ready to start collecting detailed performance data using `perf c2c`. In the next section, you’ll run a real application and use `perf c2c` to capture cache sharing behavior and uncover memory performance issues.
+You've confirmed that your system supports Arm SPE, installed the necessary tools, and verified that Perf C2C can access SPE events. You're now ready to start collecting detailed performance data using Perf C2C. In the next section, you’ll run a real application and use Perf C2C to capture cache sharing behavior and uncover memory performance issues.

content/learning-paths/servers-and-cloud-computing/false-sharing-arm-spe/how-to-3.md

Lines changed: 1 addition & 1 deletion
@@ -9,7 +9,7 @@ layout: learningpathall
 ## Example code

 {{% notice Learning Goal%}}
-The example code in this section demonstrates how false sharing affects performance by comparing two multithreaded programs; one with cache-aligned data structures, and one without. You’ll compile and run both versions, observe the runtime difference, and learn how memory layout affects cache behavior. This sets the stage for analyzing performance with `perf c2c` in the next section.
+The example code in this section demonstrates how false sharing affects performance by comparing two multithreaded programs; one with cache-aligned data structures, and one without. You’ll compile and run both versions, observe the runtime difference, and learn how memory layout affects cache behavior. This sets the stage for analyzing performance with Perf C2C in the next section.
 {{% /notice %}}

 Use a text editor to copy and paste the C example code below into a file named `false_sharing_example.c`
