File: content/learning-paths/servers-and-cloud-computing/false-sharing-arm-spe/how-to-1.md
+5 −5: 5 additions & 5 deletions
@@ -8,11 +8,11 @@ layout: learningpathall
## Introduction to Arm Statistical Profiling Extension (SPE)

- Traditional performance tracing relies on counting whole instructions, capturing only architectural instructions without revealing the actual memory addresses, pipeline latencies, or micro-operations in flight; moreover, the “skid” phenomenon where events are falsely attributed to later instructions can mislead developers.
+ Standard performance tracing relies on counting whole instructions, capturing only architectural instructions without revealing the actual memory addresses, pipeline latencies, or micro-operations in flight. Moreover, the “skid” phenomenon, where events are falsely attributed to later instructions, can mislead developers.
- The Arm Statistical Profiling Extension (SPE) integrates sampling directly into the CPU pipeline, triggering on individual micro-operations rather than retired instructions, thereby eliminating skid and blind spots. Each SPE sample record includes relevant meta data, such as data addresses, per-µop pipeline latency, triggered PMU event masks, and the memory hierarchy source (L1 cache, last-level cache, or remote socket), enabling fine-grained and precise cache analysis.
+ The Arm Statistical Profiling Extension (SPE) integrates sampling directly into the CPU pipeline, triggering on individual micro-operations rather than retired instructions, thereby eliminating skid and blind spots. Each SPE sample record includes relevant metadata, such as data addresses, per-µop pipeline latency, triggered PMU event masks, and the memory hierarchy source, enabling fine-grained and precise cache analysis.
- This enables software developers to tune user-space software for characteristics such as memory latency, and cache statistics. Importantly, it is the mechanism on Arm to enable cache statistics with the Linux `perf` cache-to-cache utility, referred to as `perf c2c`. Please refer to the [Arm_SPE whitepaper](https://developer.arm.com/documentation/109429/latest/) for more details.
+ This enables software developers to tune user-space software for characteristics such as memory latency and cache accesses. Importantly, it is the mechanism on Arm that enables cache statistics with the Linux `perf` cache-to-cache utility, referred to as `perf c2c`. Refer to the [Arm_SPE whitepaper](https://developer.arm.com/documentation/109429/latest/) for more details.
In this learning path, we will use `arm_spe` and `perf c2c` to diagnose a cache issue for an application running on a Neoverse server.
@@ -22,15 +22,15 @@ Even when two threads touch entirely separate variables, modern processors move
- Because false sharing hides behind ordinary writes, the easiest time to eliminate it is while reading or refactoring the source code—padding or realigning the offending variables before compilation. In large, highly concurrent code-bases, however, data structures are often accessed through several layers of abstraction, and many threads touch memory via indirection, so the subtle cache-line overlap may not surface until profiling or performance counters reveal unexpected coherence misses.
+ Because false sharing hides behind ordinary writes, the easiest time to eliminate it is while reading or refactoring the source code: padding or realigning the offending variables before compilation. In large, highly concurrent code-bases, however, data structures are often accessed through several layers of abstraction, and many threads touch memory via indirection, so the subtle cache-line overlap may not surface until profiling or performance counters reveal unexpected coherence misses.
From a source-code perspective nothing is “shared,” but at the hardware level both variables are implicitly coupled by their physical colocation.
## Alignment to Cache Lines
In C++11, we can manually specify the alignment of an object using the `alignas` specifier. For example, in the C++11 source code below, we manually align the `struct` to 64 bytes (the typical cache line size on a modern processor). This ensures that each instance of `AlignedType` is on a separate cache line.
File: content/learning-paths/servers-and-cloud-computing/false-sharing-arm-spe/how-to-2.md
+1 −1: 1 addition & 1 deletion
@@ -39,7 +39,7 @@ sudo modprobe arm_spe_pmu
## Run Sysreport
- A handy python script is available to summarise your systems capabilities with regard to performance profiling. Install and run System Report python script ('sysreport`) using the [instructions in the learning path](https://learn.arm.com/learning-paths/servers-and-cloud-computing/sysreport/).
+ A handy Python script is available to summarise your system's capabilities with regard to performance profiling. Install and run the System Report Python script (`sysreport`) using the [instructions in the learning path](https://learn.arm.com/learning-paths/servers-and-cloud-computing/sysreport/).
To check that SPE is available on your system, look at the `perf sampling` field. It should read `SPE`, highlighted in green.
File: content/learning-paths/servers-and-cloud-computing/false-sharing-arm-spe/how-to-3.md
+3 −4: 3 additions & 4 deletions
@@ -11,7 +11,6 @@ layout: learningpathall
Copy and paste the `C++/C` example below into a file named `false_sharing_example.cpp`. The code example below has been adapted from [Joe Mario](https://github.com/joemario/perf-c2c-usage-files) and is discussed thoroughly in the [Arm Statistical Profiling Extension Whitepaper](https://developer.arm.com/documentation/109429/latest/).
-
```cpp
/*
 * This is an example program to show false sharing between
```
@@ -285,7 +284,7 @@ int main ( int argc, char *argv[] )
### Code Explanation
- The key data structure that occupies the cache is the `struct Buf`. With our system using a 64-byte cache line, each line can hold 8, 8-byte `long` integers. If we do **not** pass in the `NO_FALSE_SHARING` macro during compilation our `Buf` data structure will have the elements below. Where each structure neatly occupies the entire 64-byte cache line. However, the 4 readers and 2 locks are now on the same cache line.
+ The key data structure that occupies the cache is `struct Buf`. With our system using a 64-byte cache line, each line can hold eight 8-byte `long` integers. If we do **not** pass in the `NO_FALSE_SHARING` macro during compilation, our `Buf` data structure will contain the elements below, where each structure neatly occupies an entire 64-byte cache line. However, the 4 readers and 2 locks are now on the same cache line.
```output
typedef struct _buf {
@@ -300,7 +299,7 @@ typedef struct _buf {
} buf __attribute__((aligned (64)));
```
- Alternatively if we pass in the `NO_FALSE_SHARING` macro during compilation, our `Buf` structure has a different shape. The `(5*8-byte)` padding pushes the reader variables onto a different cache line. However, notice that this is with the tradeoff that our new `Buf` structures occupies 1 and a half cache lines (12 `long`s). Therefore we have unused cache space of ~25% per `Buf` structure.
+ Alternatively, if we pass in the `NO_FALSE_SHARING` macro during compilation, our `Buf` structure has a different shape. The `(5*8-byte)` padding pushes the reader variables onto a different cache line. However, this comes with a tradeoff: each new `Buf` structure occupies one and a half cache lines (12 `long`s), so roughly 25% of the cache space per `Buf` structure goes unused.
- Running both binaries with the command like argument of 1 will show the following, with both binaries successfully return a 0 exit status.
+ Running both binaries with a command-line argument of 1 will show the following: both binaries successfully return a 0 exit status, but the `false_sharing` binary runs almost 2x slower!
File: content/learning-paths/servers-and-cloud-computing/false-sharing-arm-spe/how-to-4.md
+5 −5: 5 additions & 5 deletions
@@ -54,7 +54,7 @@ Rerunning with the `no_false_sharing` shows the following.
6.4942219 +- 0.0000428 seconds time elapsed ( +- 0.00% )
```
- Manually comparing we observe the run time is significantly different (13.01s to 6.49s). Additionally, the instructions per cycle (IPC) is notably different, (0.74 and 1.70) and looks commensurate to the run time.
+ Comparing manually, we observe that the run times differ significantly (13.01s versus 6.49s). Additionally, the instructions per cycle (IPC) differs notably (0.74 versus 1.70), commensurate with the run times.
## Understanding the Root Cause
@@ -78,7 +78,7 @@ The output below clearly shows there are disproportionately more backend stall c
### Skid when using Perf Record
- The naive approach would be to record the events using the `perf record` subcommand. Running the following commands can be used to demonstrate skid, mentioned in the previous section.
+ The naive approach would be to record the events using the `perf record` subcommand. The following commands demonstrate the skid mentioned in the "Introduction to Arm_SPE and False Sharing" section.
- The left screenshot shows the canonical `perf record` command, here the `adrp` instruction falsely reports 52% of the time. However, using `perf c2c` that leverages `arm_spe`, we observe 99% of time associated with the `ldr`, load register command.
+ The left screenshot shows the canonical `perf record` command, where the `adrp` instruction falsely reports 52% of the time. However, using `perf c2c`, which leverages `arm_spe`, we observe 99% of the time associated with the `ldr` (load register) instruction. The standard `perf record` data could be quite misleading for a developer!
- Clearly `perf c2c` is more accurate. We are able to observe the instructure that is being used most frequently. Now let's find the specific variable so observe what in our source cause is causing this.
+ Clearly `perf c2c` is more accurate: we are able to observe the instruction that is executed most frequently. Now let's find the specific variable to see what in our source code is causing this.
Next, compile a debug version of both applications with the following command.