A language’s memory model defines how operations on shared data interleave at runtime, providing rules on what reorderings are allowed by compilers and hardware. In C++, the memory model specifies how threads interact with shared variables, ensuring consistent behavior across different compilers and architectures. You can think of memory ordering in four broad categories.
- **Source Code Order**: The exact sequence in which you write statements. This is the most intuitive view because it directly reflects how code appears to you.
```output
int x = 5;     // A
int z = x * 5; // B
int y = 42;    // C
```
- **Program Order**: The logical sequence recognized by the compiler, which may rearrange or optimize instructions under certain constraints to create a program that takes fewer cycles. Although the statements may appear in a particular order in your source code, the compiler could restructure them if it deems it safe. For example, the pseudo assembly below reorders the source line instructions above.
```output
LDR R1, #5      // A
LDR R2, #42     // C
MUL R3, R1, #5  // B
```
- **Execution Order**: How instructions are actually issued and executed by the hardware. Modern CPUs often employ techniques to improve instruction-level parallelism, such as out-of-order execution and speculation, for performance. For instance, on an Arm-based system, you might see instructions issued in a different order at runtime. The subtle difference between program order and execution order is that program order refers to the sequence seen in the binary, whereas execution order is the order in which those instructions are actually issued and retired. Even though the instructions are listed in one order, the CPU might reorder their micro-operations as long as it respects dependencies.
- **Hardware Perceived Order**: The perspective observed by other devices in the system, which can differ if the hardware buffers writes or merges memory operations. Crucially, the hardware-perceived order can vary between CPU architectures, for example between x86 and Arm, and this should be considered when porting applications. An abstract diagram in the academic paper [Maranget et al., 2012] depicts this: a write operation in one of the five threads, arranged as a pentagon, may propagate to the other threads in any order.
## High-level differences between the Arm memory model and the x86 memory model
The memory models of Arm and x86 architectures differ in terms of ordering guarantees and required synchronization. x86 processors implement a relatively strong memory model, commonly referred to as Total Store Order (TSO). Under TSO, loads and stores appear to execute in program order, with only limited reordering permitted. This strong ordering means that software running on x86 generally relies on fewer memory barrier instructions, making it easier to reason about concurrency.
In contrast, Arm’s memory model is more relaxed, allowing greater reordering of memory operations to optimize performance and energy efficiency. This relaxed model provides less intuitive ordering guarantees, meaning that loads and stores may be observed out of order by other processors. This means that source code needs to correctly follow the language standard to ensure reliable behavior.
---
title: The C++ memory model and atomics
weight: 3

### FIXED, DO NOT MODIFY
layout: learningpathall
---
## The C++ memory model for single threads
For a long time, writing C++ programs on single-core systems was relatively straightforward. The compiler could reorder instructions however it wished, so long as the program’s observable behavior remained unchanged. This optimization freedom is commonly referred to as the “as-if” rule. Essentially, compilers can optimize away or move instructions around as if the code had not changed, provided they do not affect inputs, outputs, or volatile accesses.
The single-threaded world was simpler: you wrote code, the compiler made it faster by safely reordering or eliminating instructions, and performance benefited. Over time, multi-core processors and multi-threaded applications became the norm. Suddenly, reordering instructions was no longer only about performance: it could change the meaning of programs whose threads read and write shared data simultaneously.
### Expanding the memory model for multiple threads
When multi-threaded programming gained traction, compilers and CPUs needed more precise rules about what reordering is allowed. This is where the formalized C++ memory model, introduced in C++11, steps in. Prior to C++11, concurrency in C++ was only partially specified and relied on platform-specific behavior. Now, the language standard includes well-defined semantics, ensuring that concurrent code can rely on a set of guaranteed rules.
Under the new model, if a piece of data is shared between threads without proper synchronization, you can no longer assume it behaves like single-threaded code. Instead, operations on this shared data may be reordered unless you explicitly prevent it using atomic operations or other synchronization primitives such as mutexes. To ensure correctness, C++ provides an array of memory ordering options (such as `std::memory_order_relaxed`, `std::memory_order_acquire`, and `std::memory_order_release`) that govern how loads and stores can be observed in a multi-threaded environment. Details can be found in the C++ reference manual.
## C++ atomic memory ordering
In C++, `std::memory_order` lets developers specify how memory accesses, including regular non-atomic memory accesses, are ordered around atomic operations. Choosing the right memory order is crucial for balancing performance and correctness. Assume you have two atomic integers with initial values of 0:
Sequential consistency, `memory_order_seq_cst`, is the strongest ordering and the default if nothing is specified.

There are several other memory ordering possibilities. For information on all of the memory orderings in the C++11 standard and their nuances, refer to the [C++ reference](https://en.cppreference.com/w/cpp/atomic/memory_order).
---
title: Race condition example
weight: 4

### FIXED, DO NOT MODIFY
layout: learningpathall
---
## Example of a race condition when porting from x86 to Arm
Due to the differences in hardware-perceived ordering explained in the earlier sections, source code written for x86 may behave differently when ported to Arm. To demonstrate this, create a trivial example and run it on both an x86 and an Arm cloud instance.
Start an Arm-based cloud instance. This example uses a `t4g.xlarge` AWS instance running Ubuntu 22.04 LTS, but other instance types are possible.

If you are new to cloud-based virtual machines, refer to [Get started with Servers and Cloud Computing](/learning-paths/servers-and-cloud-computing/intro/).
First, confirm you are using an Arm-based instance with the following command.
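The command itself is elided in this view; printing the machine hardware name is presumably what is intended:

```shell
uname -m
```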
You should see the following output:

```output
aarch64
```
Next, install the required software packages.
```bash
sudo apt update
sudo apt install g++ clang -y
```
Use a text editor to copy and paste the following code snippet into a file named `relaxed_memory_ordering.cpp`.
```cpp
#include<iostream>
// ... (the full listing is not shown in this view) ...
```
The code above is a trivial example of a data race condition. Thread A creates a `Node` and assigns its member the value 42. Thread B checks that the member of the `Node` is equal to 42. Both functions use the `memory_order_relaxed` model, which allows the possibility that thread B reads an uninitialized value before the member has been assigned 42 in thread A.
Run the program and wait about 5-30 seconds for the output:

```bash
./relaxed_memory_ordering
```
The output is:

```output
Race condition detected: n->x = 42
terminate called without an active exception
Aborted (core dumped)
```
It is worth noting that a race condition is only a probability. This contrived example is designed to trigger frequently. In production workloads the probability can be far more subtle, surfacing only under specific conditions, which is why race conditions are difficult to spot.
### Behavior on an x86 instance
Due to the stronger memory model of x86 processors, programs that do not adhere to the C++ standard may give programmers a false sense of security. To demonstrate this, create and connect to an AWS `t2.2xlarge` instance that uses the x86 architecture.
Run the following command to observe that the underlying hardware is an Intel Xeon E5-2686 processor:
```bash
lscpu | grep -i "Model"
```

The output is similar to:

```output
Model name: Intel(R) Xeon(R) CPU E5-2686 v4 @ 2.30GHz
Model: 79
```
Follow the same instructions to compile and run the application. Observe that there are no race conditions on the x86-based machine.
The output is:
```output
No race condition occurred in this run
```
## Using correct memory ordering of atomics
As the example above shows, not adhering to the C++ standard can lead to a false sense of security when running on x86 platforms. To fix the race condition when porting, you need to use the correct memory ordering for each thread. The code below updates `threadA` to use `memory_order_release`, `threadB` to use `memory_order_acquire`, and the `runTest` function to use `memory_order_release` on the Node object.
Use an editor to copy and paste the adjusted code below into a file named `correct_memory_ordering.cpp`.
```cpp
#include<iostream>
// ... (the full listing is not shown in this view) ...
```
Compile with the following command and run it on an AArch64-based machine.