A programming language’s memory model defines how operations on shared data can interleave at runtime. It sets rules for how compilers and hardware might reorder these operations.
In C++, the memory model specifically defines how threads interact with shared variables, ensuring consistent behavior across different compilers and architectures.
You can think of memory ordering as falling into four broad categories:
1. **Source Code Order** - the exact sequence in which you write statements. This is the most intuitive view because it directly reflects how code appears to you.
Here is an example:
```cpp
int x = 5;       // A
int z = x * 5;   // B
int y = 42;      // C
```
2. **Program Order** - the logical sequence that the compiler recognizes. The compiler might rearrange or optimize instructions under certain constraints to produce a program that executes in fewer cycles. Although your source code lists statements in a particular order, the compiler can restructure them if it deems it safe. For example, the pseudo-assembly below reorders the source instructions:
```asm
LDR R1, #5       // A
LDR R2, #42      // C
MUL R3, R1, #5   // B
```
3. **Execution Order** - the order in which the hardware actually issues and executes instructions. Modern CPUs often employ techniques such as out-of-order execution and speculation to improve instruction-level parallelism. For instance, on an Arm-based system, you might see instructions issued in a different order at runtime. The subtle difference between program order and execution order is that program order refers to the sequence seen in the binary, whereas execution order is the order in which those instructions are actually issued and retired. Even though the instructions are listed in one order, the CPU might reorder their micro-operations as long as it respects dependencies.
4. **Hardware Perceived Order** - the perspective observed by other devices in the system, which can differ if the hardware buffers writes or merges memory operations. Crucially, the hardware-perceived order can vary between CPU architectures, for example between x86 and Arm, and this should be considered when porting applications.
## High-level differences between the Arm Memory Model and the x86 Memory Model
The memory models of Arm and x86 architectures differ in terms of ordering guarantees and required synchronizations.
x86 processors implement a relatively strong memory model, commonly referred to as Total Store Order (TSO). Under TSO, loads and stores appear to execute in program order, with only limited reordering permitted. This strong ordering means that software running on x86 generally relies on fewer memory barrier instructions, making it easier to reason about concurrency.
In contrast, Arm’s memory model is more relaxed, allowing greater reordering of memory operations to optimize performance and energy efficiency. This relaxed model provides less intuitive ordering guarantees, meaning that loads and stores can be observed out of order by other processors. This means that source code needs to correctly follow the language standard to ensure reliable behavior.
File: `content/learning-paths/servers-and-cloud-computing/arm-cpp-memory-model/2.md`

---
title: The C++ Memory Model and Atomics
weight: 3
### FIXED, DO NOT MODIFY
layout: learningpathall
---
## The C++ memory model for single threads
For a long time, writing C++ programs on single-core systems was straightforward. Compilers could reorder instructions freely, as long as the program’s observable behavior remained unchanged. This flexibility is commonly referred to as the “as-if” rule. Essentially, compilers could optimize away or move instructions around as if the code had not changed, provided the changes did not affect inputs, outputs, or volatile memory accesses.
The single-threaded world was simpler: you wrote code, the compiler safely reordered or eliminated instructions to make it faster, and your program performed better. But as multi-core processors and multi-threaded applications became common, instruction reordering was not only about improving performance - it could actually change the meaning of programs, especially when multiple threads accessed shared data simultaneously.
### Expanding the memory model for multiple threads
File: `content/learning-paths/servers-and-cloud-computing/arm-cpp-memory-model/3.md`

---
title: Walk Through a Race Condition Example
weight: 4
### FIXED, DO NOT MODIFY
layout: learningpathall
---
## Example of a race condition when porting from x86 to Arm
Due to differences in hardware memory ordering, as explained in earlier sections, source code written for x86 can behave differently when ported to Arm.
To demonstrate this, this Learning Path walks you through a simple example that runs on both x86 and Arm cloud instances.
### Get Started
Start an Arm-based cloud instance. This example uses a `t4g.xlarge` AWS instance running Ubuntu 22.04 LTS, but you can use other instance types.
If you are new to cloud-based virtual machines, see [Get started with Servers and Cloud Computing](/learning-paths/servers-and-cloud-computing/intro/).
First, confirm that you are using an Arm-based instance with the following command:
```bash
uname -m
```
You should see the following output:
```output
aarch64
```
Next, install the required software packages:
```bash
sudo apt update
sudo apt install g++ clang -y
```
Use a text editor to copy and paste the following code snippet into a file named `relaxed_memory_ordering.cpp`:
```cpp
#include <iostream>
// ... (remainder of the example is elided in this view) ...

int main() {
    // ...
}
```
The code above demonstrates a data race condition. Thread A creates a node and assigns it the value `42`. Thread B checks that the node's variable equals `42`. Both threads use the `memory_order_relaxed` model, which allows thread B to read an uninitialized variable before thread A assigns the value `42`.
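The compile step is not shown in this view; a typical invocation (an assumption, not taken from the original text) is:

```bash
# Assumed compile command; the original Learning Path's exact flags are not
# shown here. -pthread links the C++ thread support library.
g++ -std=c++17 -O2 -pthread relaxed_memory_ordering.cpp -o relaxed_memory_ordering
```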
Run the binary 10 times to increase the chance of observing a race condition:
```bash
for i in {1..10}; do ./relaxed_memory_ordering; done
```
If you do not see a race condition, the animation below shows a race condition being triggered on the third run:

As the graphic above illustrates, a race condition is probabilistic, not guaranteed.
Subtle issues can surface under specific workloads, making them challenging to detect.
### Behavior on an x86 instance
Due to the stronger memory model in x86 processors, programs not adhering to the C++ standard might give programmers a false sense of security. To demonstrate this, create and connect to an AWS `t2.2xlarge` instance that uses the x86 architecture.
Run the following command to confirm that the underlying hardware is an Intel Xeon E5-2686 processor.
File: `content/learning-paths/servers-and-cloud-computing/arm-cpp-memory-model/4.md`

## How can I detect infrequent race conditions?
ThreadSanitizer (TSan) is a concurrency bug detection tool that identifies data races in multithreaded programs. By instrumenting code at compile time, `TSan` dynamically tracks memory operations, monitors lock usage, and detects inconsistencies in thread synchronization. When a potential data race is found, `TSan` provides detailed reports to help you debug.
Although its runtime overhead can be significant, `TSan` provides valuable insights into concurrency issues often missed by static analysis tools.
`TSan` is available in recent versions of the `clang` and `gcc` compilers.
Compile and run the following example using the `clang++` compiler:
```output
...
SUMMARY: ThreadSanitizer: data race /home/ubuntu/src/relaxed_memory_ordering.cpp
==================
```
This output highlights a potential data race in the `threadB` function, corresponding to the source code expression `n->x != 42`.
## Does TSan have any limitations?
While powerful, `TSan` has some notable drawbacks:
* It identifies concurrency issues only at runtime, meaning code paths not exercised during testing remain unchecked.
* It cannot instrument or fix race conditions in third-party binaries or libraries without source code access.
* It introduces significant performance overhead, typically slowing programs by 2 to 20 times and requiring additional memory. This makes it challenging to use in large-scale or real-time systems.
For further information, see the [ThreadSanitizer documentation](https://github.com/google/sanitizers/wiki/threadsanitizercppmanual).