Commit bacb0c8

Merge pull request #1662 from jasonrandrews/review2
Review C++ memory model Learning Path
2 parents 07e5ff9 + 6b0cd1e
5 files changed, +88 -64 lines changed
Lines changed: 9 additions & 9 deletions
@@ -1,5 +1,5 @@
---
- title: Introduction to Memory Models
+ title: Introduction to C++ memory models
weight: 2

### FIXED, DO NOT MODIFY
@@ -8,32 +8,32 @@ layout: learningpathall

## What is a memory model?

- A language’s memory model defines how operations on shared data interleave at runtime, providing rules on what reorderings are allowed by compilers and hardware. In C++, the memory model specifies how threads interact with shared variables, ensuring consistent behavior across different compilers and architectures. A developer can think of memory ordering in 4 broad categories.
+ A language’s memory model defines how operations on shared data interleave at runtime, providing rules on what reorderings are allowed by compilers and hardware. In C++, the memory model specifies how threads interact with shared variables, ensuring consistent behavior across different compilers and architectures. You can think of memory ordering in four broad categories.

- - **Source Code Order** The exact sequence in which you write statements. This is the most intuitive view because it directly reflects how code appears to you.
+ - **Source Code Order**: The exact sequence in which you write statements. This is the most intuitive view because it directly reflects how code appears to you.

```output
int x = 5;     // A
int z = x * 5; // B
int y = 42;    // C
```

- - **Program Order**: The logical sequence recognized by the compiler, which may rearrange or optimize instructions under certain constraints for an output binary (i.e. program) that takes fewer cycles. Although the statements may appear in a particular order in your source, the compiler could restructure them if it deems it safe. For example the pseudo assembly below has reordered the source line instructions above.
+ - **Program Order**: The logical sequence recognized by the compiler, which may rearrange or optimize instructions under certain constraints to create a program that takes fewer cycles. Although the statements may appear in a particular order in your source code, the compiler could restructure them if it deems it safe. For example, the pseudo assembly below reorders the source line instructions above.


```output
LDR R1, #5      // A
LDR R2, #42     // C
MUL R3, R1, #5  // B
```

- - **Execution Order**: How instructions are actually issued and executed by the hardware. Modern CPUs often employ techniques to improve instruction-level parallelism such as out-of-order execution and speculation for performance. For instance, on an ARM-based system, you might see instructions issued in different order during runtime. The subtle difference between program order and execution order is that program order refers to the sequence seen in the binary whereas execution is the order in which those instructions are actually issued and retired. Even though the instructions are listed in one order, the CPU might reorder their micro-operations as long as it respects dependencies.
+ - **Execution Order**: How instructions are actually issued and executed by the hardware. Modern CPUs often employ techniques to improve instruction-level parallelism such as out-of-order execution and speculation for performance. For instance, on an Arm-based system, you might see instructions issued in a different order at runtime. The subtle difference between program order and execution order is that program order refers to the sequence seen in the binary whereas execution order is the order in which those instructions are actually issued and retired. Even though the instructions are listed in one order, the CPU might reorder their micro-operations as long as it respects dependencies.

- - **Hardware Perceived Order**: The perspective observed by other devices or the rest of the system, which can differ if the hardware buffers writes or merges memory operations. Crucially, the hardware-perceived order can vary between CPU architectures, for example between x86 and AArch64 - this should be considered when porting applications. An abstract diagram from the academic paper is shown below [Maranget et. al, 2012]. A write operation in one of the 5 threads in the pentagon below may propagate to the other threads in any order.
+ - **Hardware Perceived Order**: This is the perspective observed by other devices in the system, which can differ if the hardware buffers writes or merges memory operations. Crucially, the hardware-perceived order can vary between CPU architectures, for example between x86 and Arm, and this should be considered when porting applications. An abstract diagram from the academic paper is shown below [Maranget et al., 2012]. A write operation in one of the 5 threads in the pentagon below may propagate to the other threads in any order.

![abstract_model](./Abstract_model.png)
- ## High-level difference between Arm Memory Model and x86 Memory Model
+ ## High-level differences between the Arm memory model and the x86 memory model

- The memory models of ARM and x86 architectures differ in terms of ordering guarantees and required synchronization. x86 processors implement a relatively strong memory model, commonly referred to as Total Store Order (TSO). Under TSO, loads and stores appear to execute in program order, with only limited reordering permitted. This strong ordering means that software running on x86 generally relies on fewer memory barrier instructions, making it easier to reason about concurrency.
+ The memory models of Arm and x86 architectures differ in terms of ordering guarantees and required synchronization. x86 processors implement a relatively strong memory model, commonly referred to as Total Store Order (TSO). Under TSO, loads and stores appear to execute in program order, with only limited reordering permitted. This strong ordering means that software running on x86 generally relies on fewer memory barrier instructions, making it easier to reason about concurrency.

- In contrast, ARM’s memory model is more relaxed, allowing greater reordering of memory operations to optimize performance and energy efficiency. This relaxed model provides less intuitive ordering guarantees, meaning that loads and stores may be observed out of order by other processors. This means that source code needs to correctly follow the language standard to ensure reliable behaviour.
+ In contrast, Arm’s memory model is more relaxed, allowing greater reordering of memory operations to optimize performance and energy efficiency. This relaxed model provides less intuitive ordering guarantees, meaning that loads and stores may be observed out of order by other processors. This means that source code needs to correctly follow the language standard to ensure reliable behavior.

content/learning-paths/servers-and-cloud-computing/arm-cpp-memory-model/2.md

Lines changed: 9 additions & 11 deletions
@@ -1,25 +1,24 @@
---
- title: C++ Memory Model and Atomics
+ title: The C++ memory model and atomics
weight: 3

### FIXED, DO NOT MODIFY
layout: learningpathall
---

- ## C++ Memory Model for Single Threads
-
+ ## The C++ memory model for single threads

For a long time, writing C++ programs on single-core systems was relatively straightforward. The compiler could reorder instructions however it wished, so long as the program’s observable behavior remained unchanged. This optimization freedom is commonly referred to as the “as-if” rule. Essentially, compilers can optimize away or move instructions around as if the code had not changed, provided they do not affect inputs, outputs, or volatile accesses.

- That single-threaded world was simpler: you wrote code, the compiler made it faster (by reordering or eliding instructions if safe), and everyone benefited. But then multi-core processors and multi-threaded applications became the norm. Suddenly, reordering instructions was not merely about performance; it could change the meaning of programs with threads reading and writing shared data simultaneously.
+ The single-threaded world was simpler: you wrote code, the compiler made it faster (by safely reordering or eliminating instructions), and performance benefited. Over time, multi-core processors and multi-threaded applications became the norm. Suddenly, reordering instructions was not only about performance because it could change the meaning of programs with threads reading and writing shared data simultaneously.

- ### Expanding Memory Model for Multiple Threads
+ ### Expanding the memory model for multiple threads

- When multi threading gained traction, compilers and CPUs need more precise rules about what reordering is allowed. This is where the formalized C++ memory model, introduced in C++11, steps in. Prior to C++11, concurrency in C++ was partially specified and relied on platform-specific behavior. Now, the language standard includes well-defined semantics ensuring that developers writing concurrent code can rely on a set of guaranteed rules.
+ When multi-threaded programming gained traction, compilers and CPUs needed precise rules about what reordering is allowed. This is where the formalized C++ memory model, introduced in C++11, steps in. Prior to C++11, concurrency in C++ was partially specified and relied on platform-specific behavior. Now, the language standard includes well-defined semantics ensuring that concurrent code can rely on a set of guaranteed rules.

- Under the new model, if a piece of data is shared between threads without proper synchronization, you can no longer assume it behaves like single-threaded code. Instead, operations on this shared data may be reordered unless you explicitly prevent it using atomic operations or other synchronization primitives such as mutexes. To ensure correctness, C++ provides an array of memory orders (such as `std::memory_order_relaxed`, `std::memory_order_acquire`, `std::memory_order_release`, etc.) that govern how loads and stores can be observed in a multi-threaded environment. Details can be found on the C++ reference manual.
+ Under the new model, if a piece of data is shared between threads without proper synchronization, you can no longer assume it behaves like single-threaded code. Instead, operations on this shared data may be reordered unless you explicitly prevent it using atomic operations or other synchronization primitives such as mutexes. To ensure correctness, C++ provides an array of memory ordering options (such as `std::memory_order_relaxed`, `std::memory_order_acquire`, and `std::memory_order_release`) that govern how loads and stores can be observed in a multi-threaded environment. Details can be found in the C++ reference manual.

- ## C++ Atomic Memory Ordering
+ ## C++ atomic memory ordering

In C++, `std::memory_order` lets developers specify how memory accesses, including regular non-atomic memory accesses, are ordered around atomic operations. Choosing the right memory order is crucial for balancing performance and correctness. Assume there are two atomic integers with initial values of 0:

@@ -63,8 +62,7 @@ while (atomic_load(ptr, memory_order_acquire) is null) { } // Acquire: wait unti

```

- Sequential consistency, `memory_order_seq_cst` is the strongest order and the default ordering if nothing is specified. There are several other memory ordering possibilities, for information on all possible memory ordering possibilities in the C++11 standard and their nuances, please refer to the [C++ reference](https://en.cppreference.com/w/cpp/atomic/memory_order).
-
-
+ Sequential consistency, `memory_order_seq_cst`, is the strongest ordering and the default if nothing is specified.
+
+ There are several other memory ordering options. For information on all of them in the C++11 standard and their nuances, refer to the [C++ reference](https://en.cppreference.com/w/cpp/atomic/memory_order).

content/learning-paths/servers-and-cloud-computing/arm-cpp-memory-model/3.md

Lines changed: 43 additions & 25 deletions
@@ -1,16 +1,18 @@
---
- title: Example of Race Condition
+ title: Race condition example
weight: 4

### FIXED, DO NOT MODIFY
layout: learningpathall
---

- ## Example of a Race Condition when porting from x86 to AArch64
+ ## Example of a race condition when porting from x86 to Arm

Due to the differences in hardware-perceived ordering explained in the earlier sections, source code written for x86 may behave differently when ported to Arm. To demonstrate this, create a trivial example and run it on both x86 and Arm cloud instances.

- Start an Arm-based cloud instance, in this example I am using `t4g.xlarge` AWS instance running Ubuntu 22.04 LTS. If you are new to using cloud-based virtual machines, please see our [getting started guide](https://learn.arm.com/learning-paths/servers-and-cloud-computing/intro/).
+ Start an Arm-based cloud instance. This example uses a `t4g.xlarge` AWS instance running Ubuntu 22.04 LTS, but other instance types are possible.
+
+ If you are new to cloud-based virtual machines, refer to [Get started with Servers and Cloud Computing](/learning-paths/servers-and-cloud-computing/intro/).

First confirm you are using an Arm-based instance with the following command.

@@ -23,14 +25,14 @@ You should see the following output.
aarch64
```

- Next, we will install the prerequisitve packages.
+ Next, install the required software packages.

```bash
sudo apt update
- sudo apt install g++ clang
+ sudo apt install g++ clang -y
```

- Copy and paste the following code snippet into a file named `relaxed_memory_model.cpp`.
+ Use a text editor to copy and paste the following code snippet into a file named `relaxed_memory_ordering.cpp`.

```cpp
#include <iostream>
@@ -83,27 +85,33 @@ int main() {
}
```
- The code snippet above is a trivial example of a data race condition. Thread A creates a node variable and assigns it the number 42. On the otherhand, thread B checks than the variable assigned to the Node is equal to 42. Both functions use the `memory_order_relaxed` model, which allows the possibility for thread B to read an unintialised variable before it has been assigned the value 42 in thread A.
+ The code above is a trivial example of a data race condition. Thread A creates a node variable and assigns it the number 42. Thread B checks that the variable assigned to the Node is equal to 42. Both functions use the `memory_order_relaxed` model, which allows the possibility for thread B to read an uninitialized variable before it has been assigned the value 42 in thread A.
+
+ Compile the program using the GNU compiler:

```bash
g++ relaxed_memory_ordering.cpp -o relaxed_memory_ordering -O3
```

- ```output
- ./relaxed_memory_ordering
- ...
- ~ 5-30 second wait
- ...
- Race condition detected: n->x = 42
- terminate called without an active exception
- Aborted (core dumped)
- ```
-
- It is worth noting that this is only a probability of a race condition. Our contrived example is designed to trigger frequently. Unfortunately, in production workloads there may be a more subtle probability that may surface in production or under specific workloads. This is the reason race conditions are difficult to spot.
-
- ### Behaviour on x86 instance
+ Run the program and wait about 5-30 seconds for the output:
+
+ ```bash
+ ./relaxed_memory_ordering
+ ```
+
+ The output is:
+
+ ```output
+ Race condition detected: n->x = 42
+ terminate called without an active exception
+ Aborted (core dumped)
+ ```
+
+ It is worth noting that this demonstrates only a probability of a race condition. This contrived example is designed to trigger frequently. Unfortunately, in production workloads the probability may be far more subtle, surfacing only under specific loads. This is why race conditions are difficult to spot.
+
+ ### Behavior on an x86 instance

- Due to the more strong memory model associated with x86 processors, programs that do not adhere to the C++ standard may give programmers a false sense of security. To demonstrate this I connected to an AWS `t2.2xlarge` instance that uses the x86 architecture.
+ Due to the stronger memory model associated with x86 processors, programs that do not adhere to the C++ standard may give programmers a false sense of security. To demonstrate this, create and connect to an AWS `t2.2xlarge` instance that uses the x86 architecture.

Run the following command to observe that the underlying hardware is an Intel Xeon E5-2686 processor:

@@ -115,19 +123,27 @@ lscpu | grep -i "Model"
Model name: Intel(R) Xeon(R) CPU E5-2686 v4 @ 2.30GHz
Model: 79
```
- Follow the instructions above and recompiling leads to no race conditions on this x86-based machine.
-
- ```output
- ./relaxed_memory_ordering
- No race condition occurred in this run
- ```
+ Follow the same instructions to compile and run the application.
+
+ ```bash
+ g++ relaxed_memory_ordering.cpp -o relaxed_memory_ordering -O3
+ ./relaxed_memory_ordering
+ ```
+
+ Observe there are no race conditions on the x86-based machine.

- ## Using correct memory ordering of Atomics
-
- As the example above shows, not adhering to the C++ standard can lead to a false sensitivity when running on x86 platforms. To fix the race condition when porting we need to use the correct memory ordering for each thread. The following snippet of C++ updates `threadA` to use the `memory_order_release`, `threadB` to use `memory_order_acquire` and the `runTest` fuction to use `memory_order_release` on the Node object.
-
- Save the adjusted code snippet below into a file named `correct_memory_ordering.cpp`.
+ The output is:
+
+ ```output
+ No race condition occurred in this run
+ ```
+
+ ## Using correct memory ordering of atomics
+
+ As the example above shows, not adhering to the C++ standard can give a false sense of security when running on x86 platforms. To fix the race condition when porting, you need to use the correct memory ordering for each thread. The code below updates `threadA` to use `memory_order_release`, `threadB` to use `memory_order_acquire`, and the `runTest` function to use `memory_order_release` on the Node object.
+
+ Use an editor to copy and paste the adjusted code below into a file named `correct_memory_ordering.cpp`.

```cpp
#include <iostream>
@@ -181,14 +197,16 @@ int main() {

```

- Compiling with the following command and run on an Aarch64 based machine.
+ Compile and run on the Arm-based machine:

```bash
g++ correct_memory_ordering.cpp -o correct_memory_ordering -O3
+ ./correct_memory_ordering
```

+ Observe the race condition is gone and the output is:
+
```output
- ./correct_memory_ordering
No Race Condition Occurred in this run
```
