content/learning-paths/servers-and-cloud-computing/arm-cpp-memory-model/1.md
Lines changed: 4 additions & 4 deletions
@@ -18,22 +18,22 @@ int z = x * 5 // B
int y = 42 // C
```

- -**Program Order**: The logical sequence recognized by the compiler, which may rearrange or optimize instructions under certain constraints for an output binary (i.e. program) that takes fewer cycles. Although the statements may appear in a particular order in your source, the compiler could restructure them if it deems it safe. For example the pseudo assembly below has reordered the source line instructions above to the assembly instructions below.
+ -**Program Order**: The logical sequence recognized by the compiler, which may rearrange or optimize instructions under certain constraints for an output binary (i.e. program) that takes fewer cycles. Although the statements may appear in a particular order in your source, the compiler could restructure them if it deems it safe. For example the pseudo assembly below has reordered the source line instructions above.

```output
LDR R1 #5 // A
LDR R2 #42 // C
MULT R3, #R1, #5 // B
```

- -**Execution Order**: How instructions are actually issued and executed by the hardware. Modern CPUs often employ techniques to improve instruction-level parallelism such as out-of-order execution and speculation for performance. For instance, on an ARM-based system, you might see instructions issued in the different order during runtime. The subtle difference between program order and execution order is that program order refers to the sequence seen in the binary whereas execution is the order in which those instructions are actually issued and retired. Even though the instructions are listed in one order, the CPU might reorder their micro-operations as long as it respects dependencies.
+ -**Execution Order**: How instructions are actually issued and executed by the hardware. Modern CPUs often employ techniques to improve instruction-level parallelism such as out-of-order execution and speculation for performance. For instance, on an ARM-based system, you might see instructions issued in different order during runtime. The subtle difference between program order and execution order is that program order refers to the sequence seen in the binary whereas execution is the order in which those instructions are actually issued and retired. Even though the instructions are listed in one order, the CPU might reorder their micro-operations as long as it respects dependencies.

- -**Hardware Perceived Order**: The perspective observed by other devices or the rest of the system, which can differ if the hardware buffers writes or merges memory operations. Crucially, the hardware-perceived order can vary between CPU architectures, for example between x86 and AArch64 which must be considered when porting applications. An abstract diagram from the academic paper is shown below [Maranget et. al, 2012]. A write operation in one of the 5 threads in the pentagon below may propagate to the other threads in any order.
+ -**Hardware Perceived Order**: The perspective observed by other devices or the rest of the system, which can differ if the hardware buffers writes or merges memory operations. Crucially, the hardware-perceived order can vary between CPU architectures, for example between x86 and AArch64 - this should be considered when porting applications. An abstract diagram from the academic paper is shown below [Maranget et. al, 2012]. A write operation in one of the 5 threads in the pentagon below may propagate to the other threads in any order.

## High-level difference between Arm Memory Model and x86 Memory Model

- he memory models of ARM and x86 architectures differ in terms of ordering guarantees and required synchronization. x86 processors implement a relatively strong memory model, commonly referred to as Total Store Order (TSO). Under TSO, loads and stores appear to execute in program order, with only limited reordering permitted. This strong ordering means that software running on x86 generally relies on fewer memory barrier instructions, making it easier to reason about concurrency.
+ The memory models of ARM and x86 architectures differ in terms of ordering guarantees and required synchronization. x86 processors implement a relatively strong memory model, commonly referred to as Total Store Order (TSO). Under TSO, loads and stores appear to execute in program order, with only limited reordering permitted. This strong ordering means that software running on x86 generally relies on fewer memory barrier instructions, making it easier to reason about concurrency.

In contrast, ARM’s memory model is more relaxed, allowing greater reordering of memory operations to optimize performance and energy efficiency. This relaxed model provides less intuitive ordering guarantees, meaning that loads and stores may be observed out of order by other processors. This means that source code needs to correctly follow the language standard to ensure reliable behaviour.
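
For reference, a minimal sketch of what "correctly following the language standard" can look like for the relaxed-hardware case described above. It is not part of the changed files; the function name `run_message_passing` and the variables `payload` and `ready` are hypothetical, chosen only for illustration. The release store and acquire load are what make the result well-defined on both x86 and Arm.

```cpp
#include <atomic>
#include <thread>

// Hypothetical helper (not from the learning path): one thread publishes a
// plain value through an atomic flag, another thread waits for the flag.
int run_message_passing() {
    int payload = 0;                 // ordinary, non-atomic data
    std::atomic<bool> ready{false};  // flag used to publish the payload
    int observed = -1;

    std::thread producer([&] {
        payload = 42;                                  // write the data first
        ready.store(true, std::memory_order_release);  // then publish it
    });

    std::thread consumer([&] {
        // The acquire load pairs with the release store above, so once the
        // flag reads true the earlier write to payload is visible here.
        while (!ready.load(std::memory_order_acquire)) {
            std::this_thread::yield();
        }
        observed = payload;
    });

    producer.join();
    consumer.join();
    return observed;
}
```

Calling `run_message_passing()` from `main` should return 42 on either architecture. If both operations on `ready` were weakened to `std::memory_order_relaxed`, the language would no longer guarantee that, and a weakly ordered CPU is where the difference is most likely to show up.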
content/learning-paths/servers-and-cloud-computing/arm-cpp-memory-model/2.md
Lines changed: 4 additions & 4 deletions
@@ -17,13 +17,13 @@ That single-threaded world was simpler: you wrote code, the compiler made it fas
When multi threading gained traction, compilers and CPUs need more precise rules about what reordering is allowed. This is where the formalized C++ memory model, introduced in C++11, steps in. Prior to C++11, concurrency in C++ was partially specified and relied on platform-specific behavior. Now, the language standard includes well-defined semantics ensuring that developers writing concurrent code can rely on a set of guaranteed rules.

- Under the new model, if a piece of data is shared between threads without proper synchronization, you can no longer assume it behaves like single-threaded code. Instead, operations on this shared data may be reordered unless you explicitly prevent it using atomic operations or other synchronization primitives such as mutexes. To ensure correctness, C++ provides an array of memory orders (such as `std::memory_order_relaxed`, `std::memory_order_acquire`, `std::memory_order_release`, etc.) that govern how loads and stores can be observed in a multi-threaded environment. Details can be found on cppreference.com’s section on memory ordering.
+ Under the new model, if a piece of data is shared between threads without proper synchronization, you can no longer assume it behaves like single-threaded code. Instead, operations on this shared data may be reordered unless you explicitly prevent it using atomic operations or other synchronization primitives such as mutexes. To ensure correctness, C++ provides an array of memory orders (such as `std::memory_order_relaxed`, `std::memory_order_acquire`, `std::memory_order_release`, etc.) that govern how loads and stores can be observed in a multi-threaded environment. Details can be found on the C++ reference manual.

## C++ Atomic Memory Ordering

- In C++, `std::memory_order` atomic operations allow developers to specify how memory accesses, including regular, non-atomic memory accesses are order among atomic operation. Choosing the right memory order is crucial for balancing performance and correctness. Assume we have 2 atomic integers with initial values of 0:
+ In C++, `std::memory_order` atomic operations allow developers to specify how memory accesses, including regular, non-atomic memory accesses are ordered among atomic operation. Choosing the right memory order is crucial for balancing performance and correctness. Assume we have 2 atomic integers with initial values of 0:

- ```c++
+ ```cpp
std::atomic<int> x{0};
std::atomic<int> y{0};
```
@@ -34,7 +34,7 @@ Below are a few of C++’s atomic memory orders, along with a short code snippet
Relaxed operations do not impose ordering constraints beyond atomicity. They can be freely reordered with respect to other operations. This provides maximum performance but can lead to visibility issues if used incorrectly.
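
As a point of reference for the relaxed-ordering paragraph above, here is a small sketch of a case where relaxed operations are sufficient: a shared counter where only atomicity matters, not ordering relative to other memory. The function `count_with_relaxed` and its parameters are hypothetical names, not taken from the learning path.

```cpp
#include <atomic>
#include <thread>
#include <vector>

// Hypothetical helper: several threads bump one counter with relaxed
// increments. Each fetch_add is still atomic, so no increment is lost,
// but the operation imposes no ordering on surrounding loads and stores.
int count_with_relaxed(int num_threads, int increments_per_thread) {
    std::atomic<int> counter{0};
    std::vector<std::thread> workers;

    for (int t = 0; t < num_threads; ++t) {
        workers.emplace_back([&counter, increments_per_thread] {
            for (int i = 0; i < increments_per_thread; ++i) {
                counter.fetch_add(1, std::memory_order_relaxed);
            }
        });
    }

    // join() synchronizes with each thread's completion, so the final
    // relaxed load below sees every increment.
    for (auto& worker : workers) {
        worker.join();
    }
    return counter.load(std::memory_order_relaxed);
}
```

With 4 threads and 1000 increments each, the function returns 4000. The visibility issues mentioned above arise when a relaxed atomic is used as a signal that some other, non-atomic data is ready, which is a job for release and acquire instead.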
content/learning-paths/servers-and-cloud-computing/arm-cpp-memory-model/3.md
Lines changed: 3 additions & 1 deletion
@@ -10,7 +10,9 @@ layout: learningpathall
Due to the differences in the hardware perceived ordering as explained in the earlier sections, source code written for x86 may behave differently when ported to Arm. To demonstrate this we will create a trivial example and run it both on an x86 and Arm cloud instance.

- Start an Arm-based cloud instance, in this example I am using `t4g.xlarge` AWS instance running Ubuntu 22.04 LTS. If you are new to using cloud-based virtual machines, please see our [getting started guide](https://learn.arm.com/learning-paths/servers-and-cloud-computing/intro/). First confirm you are using a Arm-based instance with the following command.
+ Start an Arm-based cloud instance, in this example I am using `t4g.xlarge` AWS instance running Ubuntu 22.04 LTS. If you are new to using cloud-based virtual machines, please see our [getting started guide](https://learn.arm.com/learning-paths/servers-and-cloud-computing/intro/).
+
+ First confirm you are using a Arm-based instance with the following command.
content/learning-paths/servers-and-cloud-computing/arm-cpp-memory-model/4.md
Lines changed: 1 addition & 1 deletion
@@ -32,7 +32,7 @@ SUMMARY: ThreadSanitizer: data race /home/ubuntu/src/relaxed_memory_ordering.cpp
The summary output highlights a potential data race in the `threadB` function corresponding to the source code expression `n->x != 42`.

- ## TSan's limitations
+ ## Limitations of TSan

Thread Sanitizer (TSan) is powerful for detecting data races but has notable drawbacks. First, it only identifies concurrency issues at runtime, meaning any problematic code that isn’t exercised during testing goes unnoticed. Additionally, if race conditions exist in third-party binaries or libraries, TSan can’t instrument or fix them without access to their source code. Another major limitation is performance overhead: TSan can slow programs by 2 to 20x and requires extra memory, making it challenging for large-scale or real-time systems.