A language’s memory model defines how operations on shared data interleave at runtime, providing rules on what reorderings are allowed by compilers and hardware. In C++, the memory model specifies how threads interact with shared variables, ensuring consistent behavior across different compilers and architectures. You can think of memory ordering in four broad categories.
- **Source Code Order**: The exact sequence in which you write statements. This is the most intuitive view because it directly reflects how code appears to you.
```output
int x = 5;     // A
int z = x * 5; // B
int y = 42;    // C
```
- **Program Order**: The logical sequence recognized by the compiler, which may rearrange or optimize instructions under certain constraints to create a program that takes fewer cycles. Although the statements may appear in a particular order in your source code, the compiler could restructure them if it deems it safe. For example, the pseudo assembly below reorders the source line instructions above.
```output
LDR R1, #5      // A
LDR R2, #42     // C
MUL R3, R1, #5  // B
```
- **Execution Order**: How instructions are actually issued and executed by the hardware. Modern CPUs often employ techniques to improve instruction-level parallelism, such as out-of-order execution and speculation, for performance. For instance, on an Arm-based system, you might see instructions issued in a different order at runtime. The subtle difference between program order and execution order is that program order refers to the sequence seen in the binary, whereas execution order is the order in which those instructions are actually issued and retired. Even though the instructions are listed in one order, the CPU might reorder their micro-operations as long as it respects dependencies.
- **Hardware Perceived Order**: The perspective observed by other devices in the system, which can differ if the hardware buffers writes or merges memory operations. Crucially, the hardware-perceived order can vary between CPU architectures, for example between x86 and Arm, and this should be considered when porting applications. An abstract diagram in the academic paper [Maranget et al., 2012] depicts this: a write operation in one of the five threads, arranged as a pentagon, may propagate to the other threads in any order.
## High-level differences between the Arm memory model and the x86 memory model
The memory models of Arm and x86 architectures differ in terms of ordering guarantees and required synchronization. x86 processors implement a relatively strong memory model, commonly referred to as Total Store Order (TSO). Under TSO, loads and stores appear to execute in program order, with only limited reordering permitted. This strong ordering means that software running on x86 generally relies on fewer memory barrier instructions, making it easier to reason about concurrency.
In contrast, Arm’s memory model is more relaxed, allowing greater reordering of memory operations to optimize performance and energy efficiency. This relaxed model provides less intuitive ordering guarantees, meaning that loads and stores may be observed out of order by other processors. This means that source code needs to correctly follow the language standard to ensure reliable behavior.
---
title: The C++ memory model and atomics
weight: 3

### FIXED, DO NOT MODIFY
layout: learningpathall
---
## The C++ memory model for single threads
For a long time, writing C++ programs on single-core systems was relatively straightforward. The compiler could reorder instructions however it wished, so long as the program’s observable behavior remained unchanged. This optimization freedom is commonly referred to as the “as-if” rule. Essentially, compilers can optimize away or move instructions around as if the code had not changed, provided they do not affect inputs, outputs, or volatile accesses.
The single-threaded world was simpler: you wrote code, the compiler made it faster by safely reordering or eliminating instructions, and performance benefited. Over time, multi-core processors and multi-threaded applications became the norm. Suddenly, reordering instructions was no longer only about performance: it could change the meaning of programs whose threads read and write shared data simultaneously.
### Expanding the memory model for multiple threads
When multi-threaded programming gained traction, compilers and CPUs needed more precise rules about what reordering is allowed. This is where the formalized C++ memory model, introduced in C++11, steps in. Prior to C++11, concurrency in C++ was only partially specified and relied on platform-specific behavior. Now, the language standard includes well-defined semantics, ensuring that concurrent code can rely on a set of guaranteed rules.
Under the new model, if a piece of data is shared between threads without proper synchronization, you can no longer assume it behaves like single-threaded code. Instead, operations on this shared data may be reordered unless you explicitly prevent it using atomic operations or other synchronization primitives such as mutexes. To ensure correctness, C++ provides an array of memory ordering options (such as `std::memory_order_relaxed`, `std::memory_order_acquire`, and `std::memory_order_release`) that govern how loads and stores can be observed in a multi-threaded environment. Details can be found in the C++ reference manual.
## C++ atomic memory ordering
In C++, `std::memory_order` lets developers specify how memory accesses, including regular non-atomic memory accesses, are ordered around atomic operations. Choosing the right memory order is crucial for balancing performance and correctness. Assume you have two atomic integers with initial values of 0:
Sequential consistency, `memory_order_seq_cst`, is the strongest ordering and the default if nothing is specified.

There are several other memory ordering possibilities. For information on all of the memory orderings in the C++11 standard and their nuances, refer to the [C++ reference](https://en.cppreference.com/w/cpp/atomic/memory_order).
---
title: Race condition example
weight: 4

### FIXED, DO NOT MODIFY
layout: learningpathall
---
## Example of a race condition when porting from x86 to Arm
Due to the differences in hardware-perceived ordering explained in the earlier sections, source code written for x86 may behave differently when ported to Arm. To demonstrate this, create a trivial example and run it on both an x86 and an Arm cloud instance.
Start an Arm-based cloud instance. This example uses a `t4g.xlarge` AWS instance running Ubuntu 22.04 LTS, but other instance types are possible.

If you are new to cloud-based virtual machines, refer to [Get started with Servers and Cloud Computing](/learning-paths/servers-and-cloud-computing/intro/).
First, confirm you are using an Arm-based instance with the following command.
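The command itself is elided in this view; printing the machine hardware name is presumably what is intended:

```shell
uname -m
```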
You should see the following output:

```output
aarch64
```
Next, install the required software packages.
```bash
sudo apt update
sudo apt install g++ clang -y
```
Use a text editor to copy and paste the following code snippet into a file named `relaxed_memory_ordering.cpp`.
```cpp
#include<iostream>
// ... (the full listing is not shown in this view) ...
```
The code above is a trivial example of a data race condition. Thread A creates a `Node` and assigns its member the value 42. Thread B checks that the member of the `Node` is equal to 42. Both functions use the `memory_order_relaxed` model, which allows the possibility that thread B reads an uninitialized value before the member has been assigned 42 in thread A.
Run the program and wait about 5-30 seconds for the output:

```bash
./relaxed_memory_ordering
```
The output is:

```output
Race condition detected: n->x = 42
terminate called without an active exception
Aborted (core dumped)
```
It is worth noting that a race condition is only a probability. This contrived example is designed to trigger frequently. In production workloads the probability can be far more subtle, surfacing only under specific conditions, which is why race conditions are difficult to spot.
### Behavior on an x86 instance
Due to the stronger memory model of x86 processors, programs that do not adhere to the C++ standard may give programmers a false sense of security. To demonstrate this, create and connect to an AWS `t2.2xlarge` instance that uses the x86 architecture.
Run the following command to observe that the underlying hardware is an Intel Xeon E5-2686 processor:
```bash
lscpu | grep -i "Model"
```

The output is similar to:

```output
Model name: Intel(R) Xeon(R) CPU E5-2686 v4 @ 2.30GHz
Model: 79
```
Follow the same instructions to compile and run the application. Observe that there are no race conditions on the x86-based machine.
The output is:
```output
No race condition occurred in this run
```
## Using correct memory ordering of atomics
As the example above shows, not adhering to the C++ standard can lead to a false sense of security when running on x86 platforms. To fix the race condition when porting, you need to use the correct memory ordering for each thread. The code below updates `threadA` to use `memory_order_release`, `threadB` to use `memory_order_acquire`, and the `runTest` function to use `memory_order_release` on the Node object.
Use an editor to copy and paste the adjusted code below into a file named `correct_memory_ordering.cpp`.
```cpp
#include<iostream>
// ... (the full listing is not shown in this view) ...
```
Compile with the following command and run it on an AArch64-based machine.