Commit e6ce6af

Merge pull request #1618 from kieranhejmadi01/arm-cpp-mem-model
Cpp-Memory-Model-on-Arm-Learning-Path
---
title: Introduction to Memory Models
weight: 2

### FIXED, DO NOT MODIFY
layout: learningpathall
---

## What is a memory model?

A language’s memory model defines how operations on shared data interleave at runtime, providing rules on which reorderings compilers and hardware are allowed to perform. In C++, the memory model specifies how threads interact with shared variables, ensuring consistent behavior across different compilers and architectures. A developer can think of memory ordering in four broad categories.

- **Source Code Order**: The exact sequence in which you write statements. This is the most intuitive view because it directly reflects how the code appears to you.

```cpp
int x = 5;      // A
int z = x * 5;  // B
int y = 42;     // C
```

- **Program Order**: The logical sequence recognized by the compiler, which may rearrange or optimize instructions, under certain constraints, to produce an output binary (that is, a program) that takes fewer cycles. Although the statements appear in a particular order in your source, the compiler can restructure them if it deems this safe. For example, the pseudo-assembly below has reordered the source lines above.

```output
MOV  R1, #5       // A
MOV  R2, #42      // C
MUL  R3, R1, #5   // B
```

- **Execution Order**: How instructions are actually issued and executed by the hardware. Modern CPUs employ techniques such as out-of-order execution and speculation to improve instruction-level parallelism. For instance, on an Arm-based system, you might see instructions issued in a different order at runtime. The subtle difference between program order and execution order is that program order refers to the sequence seen in the binary, whereas execution order is the order in which those instructions are actually issued and retired. Even though the instructions are listed in one order, the CPU might reorder their micro-operations as long as it respects dependencies.

- **Hardware Perceived Order**: The order observed by other devices or the rest of the system, which can differ if the hardware buffers writes or merges memory operations. Crucially, the hardware-perceived order can vary between CPU architectures, for example between x86 and AArch64 - this should be considered when porting applications. An abstract diagram from the academic paper [Maranget et al., 2012] is shown below. A write operation in one of the five threads in the pentagon may propagate to the other threads in any order.

![abstract_model](./Abstract_model.png)

## High-level difference between the Arm and x86 memory models

The memory models of the Arm and x86 architectures differ in their ordering guarantees and required synchronization. x86 processors implement a relatively strong memory model, commonly referred to as Total Store Order (TSO). Under TSO, loads and stores appear to execute in program order, with only limited reordering permitted. This strong ordering means that software running on x86 generally relies on fewer memory barrier instructions, making it easier to reason about concurrency.

In contrast, the Arm memory model is more relaxed, allowing greater reordering of memory operations to optimize performance and energy efficiency. Loads and stores may therefore be observed out of order by other processors. As a result, source code must correctly follow the language standard to ensure reliable behaviour.
---
title: C++ Memory Model and Atomics
weight: 3

### FIXED, DO NOT MODIFY
layout: learningpathall
---

## C++ Memory Model for Single Threads

For a long time, writing C++ programs on single-core systems was relatively straightforward. The compiler could reorder instructions however it wished, so long as the program’s observable behavior remained unchanged. This optimization freedom is commonly referred to as the “as-if” rule. Essentially, compilers can optimize away or move instructions around as if the code had not changed, provided they do not affect inputs, outputs, or volatile accesses.

That single-threaded world was simpler: you wrote code, the compiler made it faster (by reordering or eliding instructions where safe), and everyone benefited. But then multi-core processors and multi-threaded applications became the norm. Suddenly, reordering instructions was no longer merely about performance: it could change the meaning of programs whose threads read and write shared data simultaneously.

### Expanding the Memory Model for Multiple Threads

When multithreading gained traction, compilers and CPUs needed more precise rules about what reordering is allowed. This is where the formalized C++ memory model, introduced in C++11, steps in. Prior to C++11, concurrency in C++ was only partially specified and relied on platform-specific behavior. Now, the language standard includes well-defined semantics, ensuring that developers writing concurrent code can rely on a set of guaranteed rules.

Under the new model, if a piece of data is shared between threads without proper synchronization, you can no longer assume it behaves like single-threaded code. Instead, operations on this shared data may be reordered unless you explicitly prevent it using atomic operations or other synchronization primitives such as mutexes. To ensure correctness, C++ provides a set of memory orders (such as `std::memory_order_relaxed`, `std::memory_order_acquire`, and `std::memory_order_release`) that govern how loads and stores can be observed in a multi-threaded environment. Details can be found in the C++ reference manual.

## C++ Atomic Memory Ordering

In C++, `std::memory_order` lets developers specify how memory accesses, including regular non-atomic accesses, are ordered around atomic operations. Choosing the right memory order is crucial for balancing performance and correctness. Assume we have two atomic integers with initial values of 0:

```cpp
std::atomic<int> x{0};
std::atomic<int> y{0};
```

Below are a few of C++’s atomic memory orders, along with a short code snippet illustrating what might or might not be reordered.

- `memory_order_relaxed`

Relaxed operations impose no ordering constraints beyond atomicity. They can be freely reordered with respect to other operations. This provides maximum performance but can lead to visibility issues if used incorrectly.

```cpp
// Thread A:
r1 = y.load(std::memory_order_relaxed); // A
x.store(r1, std::memory_order_relaxed); // B

// Thread B:
r2 = x.load(std::memory_order_relaxed); // C
y.store(42, std::memory_order_relaxed); // D
// Within each thread, these operations may be reordered relative to each other.
```

In the snippet above, nothing prevents D from becoming visible before A, or B before C. The outcome `r1 == 42 && r2 == 42` is therefore permitted, even though no simple interleaving of the statements in program order produces it.


- `memory_order_acquire` and `memory_order_release`

Acquire and release are used to synchronize through atomic variables. In the example below, thread A writes to memory (allocating the string and setting `data`) and then uses a release-store to publish these updates. Thread B repeatedly performs an acquire-load until it sees the updated pointer. The acquire ensures that once thread B sees a non-null pointer, all writes made by thread A (including the update to `data`) are visible, synchronizing the two threads.

```cpp
std::atomic<std::string*> ptr{nullptr};
int data = 0;

// Thread A
std::string* p = new std::string("Hello");
data = 42;
ptr.store(p, std::memory_order_release); // Release: publish writes (p, data)

// Thread B
std::string* q = nullptr;
while ((q = ptr.load(std::memory_order_acquire)) == nullptr) { } // Acquire: wait until the pointer is published
// Now *q == "Hello" and data == 42 (synchronized with Thread A)
```

Sequential consistency, `memory_order_seq_cst`, is the strongest ordering and the default if none is specified. For information on all of the memory orderings in the C++11 standard and their nuances, refer to the [C++ reference](https://en.cppreference.com/w/cpp/atomic/memory_order).
---
title: Example of a Race Condition
weight: 4

### FIXED, DO NOT MODIFY
layout: learningpathall
---

## Example of a Race Condition when porting from x86 to AArch64

Due to the differences in hardware-perceived ordering explained in the earlier sections, source code written for x86 may behave differently when ported to Arm. To demonstrate this, we will create a trivial example and run it on both an x86 and an Arm cloud instance.

Start an Arm-based cloud instance. In this example, I am using a `t4g.xlarge` AWS instance running Ubuntu 22.04 LTS. If you are new to using cloud-based virtual machines, please see our [getting started guide](https://learn.arm.com/learning-paths/servers-and-cloud-computing/intro/).

First, confirm you are using an Arm-based instance with the following command.

```bash
uname -m
```

You should see the following output.

```output
aarch64
```

Next, install the prerequisite packages.

```bash
sudo apt update
sudo apt install -y g++ clang
```

Copy and paste the following code snippet into a file named `relaxed_memory_ordering.cpp`.

```cpp
#include <iostream>
#include <atomic>
#include <thread>
#include <cassert>
#include <chrono>

struct Node {
    int x;
};
std::atomic<Node*> node{nullptr};

void threadA() {
    auto n = new Node();
    n->x = 42;
    node.store(n, std::memory_order_relaxed);
}

void threadB() {
    Node* n = nullptr;
    while ((n = node.load(std::memory_order_relaxed)) == nullptr) {
        std::this_thread::sleep_for(std::chrono::nanoseconds(50)); // Small sleep to improve scheduling
    }
    if (n->x != 42) {
        std::cerr << "Race condition detected: n->x = " << n->x << std::endl;
        std::terminate();
    }
}

void runTest() {
    for (int i = 0; i < 100000; ++i) { // Run many iterations but eventually time out
        node.store(nullptr, std::memory_order_relaxed);
        std::thread t1(threadA);
        std::thread t2(threadB);
        std::thread t3(threadA);
        std::thread t4(threadA);
        t1.join();
        t2.join();
        t3.join();
        t4.join();
        delete node.load();
    }
}

int main() {
    runTest();
    std::cout << "No Race Condition Occurred in this run" << std::endl;
    return 0;
}
```

The code snippet above is a trivial example of a data race. Thread A creates a `Node` and assigns the value 42 to its member `x`. Thread B, on the other hand, checks that the member of the published `Node` equals 42. Both functions use the `memory_order_relaxed` ordering, which allows thread B to read the member before thread A's write of 42 becomes visible.

```bash
g++ relaxed_memory_ordering.cpp -o relaxed_memory_ordering -O3
```

```output
./relaxed_memory_ordering
...
~ 5-30 second wait
...
Race condition detected: n->x = 0
terminate called without an active exception
Aborted (core dumped)
```

It is worth noting that a race condition is only triggered with some probability. Our contrived example is designed to trigger frequently; in real applications the probability may be far smaller, so a race may surface only in production or under specific workloads. This is why race conditions are difficult to spot.

### Behaviour on an x86 instance

Due to the stronger memory model of x86 processors, programs that do not adhere to the C++ standard may give programmers a false sense of security. To demonstrate this, I connected to an AWS `t2.2xlarge` instance, which uses the x86 architecture.

Running the following command, I can observe that the underlying hardware is an Intel Xeon E5-2686 processor.

```bash
lscpu | grep -i "Model"
```

```output
Model name: Intel(R) Xeon(R) CPU E5-2686 v4 @ 2.30GHz
Model: 79
```

Following the instructions above and recompiling leads to no race conditions on this x86-based machine.

```output
./relaxed_memory_ordering
No Race Condition Occurred in this run
```

## Using correct memory ordering of atomics

As the example above shows, not adhering to the C++ standard can create a false sense of security when running on x86 platforms. To fix the race condition when porting, we need to use the correct memory ordering in each thread. The following C++ updates `threadA` to use `memory_order_release`, `threadB` to use `memory_order_acquire`, and the `runTest` function to use `memory_order_release` when resetting the `Node` pointer.

Save the adjusted code snippet below into a file named `correct_memory_ordering.cpp`.

```cpp
#include <iostream>
#include <atomic>
#include <thread>
#include <cassert>
#include <chrono>

struct Node {
    int x;
};
std::atomic<Node*> node{nullptr};

void threadA() {
    auto n = new Node();
    n->x = 42;
    node.store(n, std::memory_order_release);
}

void threadB() {
    Node* n = nullptr;
    while ((n = node.load(std::memory_order_acquire)) == nullptr) {
        std::this_thread::sleep_for(std::chrono::nanoseconds(50)); // Small sleep to improve scheduling
    }
    if (n->x != 42) {
        std::cerr << "Race condition detected: n->x = " << n->x << std::endl;
        std::terminate();
    }
}

void runTest() {
    for (int i = 0; i < 100000; ++i) { // Run many iterations but eventually time out
        node.store(nullptr, std::memory_order_release);
        std::thread t1(threadA);
        std::thread t2(threadB);
        std::thread t3(threadA);
        std::thread t4(threadA);
        t1.join();
        t2.join();
        t3.join();
        t4.join();
        delete node.load();
    }
}

int main() {
    runTest();
    std::cout << "No Race Condition Occurred in this run" << std::endl;
    return 0;
}
```

Compile with the following command and run the binary on an AArch64-based machine.

```bash
g++ correct_memory_ordering.cpp -o correct_memory_ordering -O3
```

```output
./correct_memory_ordering
No Race Condition Occurred in this run
```
---
title: Detecting Race Conditions
weight: 5

### FIXED, DO NOT MODIFY
layout: learningpathall
---

## How to detect infrequent race conditions?

ThreadSanitizer, commonly referred to as TSan, is a concurrency bug detection tool that identifies data races in multi-threaded programs. By instrumenting code at compile time, TSan dynamically tracks memory operations, monitoring lock usage and detecting inconsistencies in thread synchronization. When it finds a potential data race, it reports detailed information to aid debugging. TSan’s overhead can be significant, but it provides valuable insights into concurrency issues often missed by static analysis.

TSan is available in both recent `clang` and `gcc` compilers. Using the `clang++` compiler in this example, compile the `relaxed_memory_ordering` example with the following command and run the output binary.

```bash
clang++ relaxed_memory_ordering.cpp -fsanitize=thread -fPIE -pie -g
```

```output
==================
WARNING: ThreadSanitizer: data race (pid=2892958)
  Read of size 4 at 0xfffff42007b0 by thread T2:
...
...
...
SUMMARY: ThreadSanitizer: data race /home/ubuntu/src/relaxed_memory_ordering.cpp:23:12 in threadB()
==================
```

The summary output highlights a potential data race in the `threadB` function, corresponding to the source code expression `n->x != 42`.

## Limitations of TSan

ThreadSanitizer is powerful for detecting data races but has notable drawbacks. First, it only identifies concurrency issues at runtime, meaning any problematic code that isn’t exercised during testing goes unnoticed. Additionally, if race conditions exist in third-party binaries or libraries, TSan can’t instrument or fix them without access to their source code. Another major limitation is performance overhead: TSan can slow programs down by 2x to 20x and requires extra memory, making it challenging to use on large-scale or real-time systems.

For further information, please refer to the [Google documentation](https://github.com/google/sanitizers/wiki/threadsanitizercppmanual).
---
title: Learn about the C++ Memory Model when Porting to Arm

minutes_to_complete: 45

who_is_this_for: Intermediate C++ developers who are looking to port and optimise their application from x86 to AArch64.

learning_objectives:
- Learn about the C++ memory model
- Learn about the differences between the Arm and x86 memory model
- Learn best practices for writing C++ on Arm to avoid race conditions

prerequisites:
- Access to an x86 and AArch64 cloud instance
- Intermediate understanding of C++

author_primary: Kieran Hejmadi

### Tags
skilllevels: Introductory
subjects: Performance and Architecture
armips:
- Neoverse
tools_software_languages:
- C++
- ThreadSanitizer (TSan)
operatingsystems:
- Linux


### FIXED, DO NOT MODIFY
# ================================================================================
weight: 1                       # _index.md always has weight of 1 to order correctly
layout: "learningpathall"       # All files under learning paths have this same wrapper
learning_path_main_page: "yes"  # This should be surfaced when looking for related content. Only set for _index.md of learning path content.
---
