Commit afaa6d7

author
Your Name
committed
final review
1 parent 5d0954e commit afaa6d7

4 files changed

+12
-10
lines changed

content/learning-paths/servers-and-cloud-computing/arm-cpp-memory-model/1.md

Lines changed: 4 additions & 4 deletions
@@ -18,22 +18,22 @@ int z = x * 5 // B
1818
int y = 42 // C
1919
```
2020

21-
- **Program Order**: The logical sequence recognized by the compiler, which may rearrange or optimize instructions under certain constraints for an output binary (i.e. program) that takes fewer cycles. Although the statements may appear in a particular order in your source, the compiler could restructure them if it deems it safe. For example the pseudo assembly below has reordered the source line instructions above to the assembly instructions below.
21+
- **Program Order**: The logical sequence recognized by the compiler, which may rearrange or optimize instructions under certain constraints for an output binary (i.e. program) that takes fewer cycles. Although the statements may appear in a particular order in your source, the compiler could restructure them if it deems it safe. For example the pseudo assembly below has reordered the source line instructions above.
2222

2323
```output
2424
LDR R1 #5 // A
2525
LDR R2 #42 // C
2626
MULT R3, #R1, #5 // B
2727
```
2828

29-
- **Execution Order**: How instructions are actually issued and executed by the hardware. Modern CPUs often employ techniques to improve instruction-level parallelism such as out-of-order execution and speculation for performance. For instance, on an ARM-based system, you might see instructions issued in the different order during runtime. The subtle difference between program order and execution order is that program order refers to the sequence seen in the binary whereas execution is the order in which those instructions are actually issued and retired. Even though the instructions are listed in one order, the CPU might reorder their micro-operations as long as it respects dependencies.
29+
- **Execution Order**: How instructions are actually issued and executed by the hardware. Modern CPUs often employ techniques to improve instruction-level parallelism such as out-of-order execution and speculation for performance. For instance, on an ARM-based system, you might see instructions issued in different order during runtime. The subtle difference between program order and execution order is that program order refers to the sequence seen in the binary whereas execution is the order in which those instructions are actually issued and retired. Even though the instructions are listed in one order, the CPU might reorder their micro-operations as long as it respects dependencies.
3030

31-
- **Hardware Perceived Order**: The perspective observed by other devices or the rest of the system, which can differ if the hardware buffers writes or merges memory operations. Crucially, the hardware-perceived order can vary between CPU architectures, for example between x86 and AArch64 which must be considered when porting applications. An abstract diagram from the academic paper is shown below [Maranget et. al, 2012]. A write operation in one of the 5 threads in the pentagon below may propagate to the other threads in any order.
31+
- **Hardware Perceived Order**: The perspective observed by other devices or the rest of the system, which can differ if the hardware buffers writes or merges memory operations. Crucially, the hardware-perceived order can vary between CPU architectures, for example between x86 and AArch64 - this should be considered when porting applications. An abstract diagram from the academic paper is shown below [Maranget et. al, 2012]. A write operation in one of the 5 threads in the pentagon below may propagate to the other threads in any order.
3232

3333
![abstract_model](./Abstract_model.png)
3434

3535
## High-level difference between Arm Memory Model and x86 Memory Model
3636

37-
he memory models of ARM and x86 architectures differ in terms of ordering guarantees and required synchronization. x86 processors implement a relatively strong memory model, commonly referred to as Total Store Order (TSO). Under TSO, loads and stores appear to execute in program order, with only limited reordering permitted. This strong ordering means that software running on x86 generally relies on fewer memory barrier instructions, making it easier to reason about concurrency.
37+
The memory models of ARM and x86 architectures differ in terms of ordering guarantees and required synchronization. x86 processors implement a relatively strong memory model, commonly referred to as Total Store Order (TSO). Under TSO, loads and stores appear to execute in program order, with only limited reordering permitted. This strong ordering means that software running on x86 generally relies on fewer memory barrier instructions, making it easier to reason about concurrency.
3838

3939
In contrast, ARM’s memory model is more relaxed, allowing greater reordering of memory operations to optimize performance and energy efficiency. This relaxed model provides less intuitive ordering guarantees, meaning that loads and stores may be observed out of order by other processors. This means that source code needs to correctly follow the language standard to ensure reliable behaviour.
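As a minimal sketch of what "correctly following the language standard" can look like, the message-passing pattern below publishes ordinary data through an atomic flag. The names `payload`, `ready`, `producer`, `consumer`, and `run_message_passing` are hypothetical and not taken from the learning path; the memory orders used are the ones introduced in the next file of this commit. With the release/acquire pair the result is the same on x86 and AArch64, whereas a version that used a plain `bool` flag would be a data race that may only appear to work on the stronger x86 model.

```cpp
#include <atomic>
#include <cassert>
#include <thread>

// Hypothetical names for illustration only.
int payload = 0;                  // ordinary, non-atomic data
std::atomic<bool> ready{false};   // flag that publishes the payload

void producer() {
    payload = 42;                                  // A: plain write
    ready.store(true, std::memory_order_release);  // B: A may not be reordered after B
}

int consumer() {
    // C: the acquire load pairs with the release store in B.
    while (!ready.load(std::memory_order_acquire)) {
        std::this_thread::yield();
    }
    return payload;  // D: once C has read true, the write in A is visible here
}

// Runs one producer and one consumer and returns what the consumer saw.
int run_message_passing() {
    payload = 0;
    ready.store(false, std::memory_order_relaxed);
    int seen = 0;
    std::thread t1(producer);
    std::thread t2([&seen] { seen = consumer(); });
    t1.join();
    t2.join();
    return seen;
}
```

Calling `run_message_passing()` from `main` should yield 42 on either architecture, because the standard guarantees that everything written before the release store is visible after the matching acquire load.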

content/learning-paths/servers-and-cloud-computing/arm-cpp-memory-model/2.md

Lines changed: 4 additions & 4 deletions
@@ -17,13 +17,13 @@ That single-threaded world was simpler: you wrote code, the compiler made it fas
1717

1818
When multi threading gained traction, compilers and CPUs need more precise rules about what reordering is allowed. This is where the formalized C++ memory model, introduced in C++11, steps in. Prior to C++11, concurrency in C++ was partially specified and relied on platform-specific behavior. Now, the language standard includes well-defined semantics ensuring that developers writing concurrent code can rely on a set of guaranteed rules.
1919

20-
Under the new model, if a piece of data is shared between threads without proper synchronization, you can no longer assume it behaves like single-threaded code. Instead, operations on this shared data may be reordered unless you explicitly prevent it using atomic operations or other synchronization primitives such as mutexes. To ensure correctness, C++ provides an array of memory orders (such as `std::memory_order_relaxed`, `std::memory_order_acquire`, `std::memory_order_release`, etc.) that govern how loads and stores can be observed in a multi-threaded environment. Details can be found on cppreference.com’s section on memory ordering.
20+
Under the new model, if a piece of data is shared between threads without proper synchronization, you can no longer assume it behaves like single-threaded code. Instead, operations on this shared data may be reordered unless you explicitly prevent it using atomic operations or other synchronization primitives such as mutexes. To ensure correctness, C++ provides an array of memory orders (such as `std::memory_order_relaxed`, `std::memory_order_acquire`, `std::memory_order_release`, etc.) that govern how loads and stores can be observed in a multi-threaded environment. Details can be found on the C++ reference manual.
2121

2222
## C++ Atomic Memory Ordering
2323

24-
In C++, `std::memory_order` atomic operations allow developers to specify how memory accesses, including regular, non-atomic memory accesses are order among atomic operation. Choosing the right memory order is crucial for balancing performance and correctness. Assume we have 2 atomic integers with initial values of 0:
24+
In C++, `std::memory_order` atomic operations allow developers to specify how memory accesses, including regular, non-atomic memory accesses are ordered among atomic operation. Choosing the right memory order is crucial for balancing performance and correctness. Assume we have 2 atomic integers with initial values of 0:
2525

26-
```c++
26+
```cpp
2727
std::atomic<int> x{0};
2828
std::atomic<int> y{0};
2929
```
@@ -34,7 +34,7 @@ Below are a few of C++’s atomic memory orders, along with a short code snippet
3434
3535
Relaxed operations do not impose ordering constraints beyond atomicity. They can be freely reordered with respect to other operations. This provides maximum performance but can lead to visibility issues if used incorrectly.
3636
37-
```c++
37+
```cpp
3838
// Thread A:
3939
r1 = y.load(std::memory_order_relaxed); // A
4040
x.store(r1, std::memory_order_relaxed); // B

content/learning-paths/servers-and-cloud-computing/arm-cpp-memory-model/3.md

Lines changed: 3 additions & 1 deletion
@@ -10,7 +10,9 @@ layout: learningpathall
1010

1111
Due to the differences in the hardware perceived ordering as explained in the earlier sections, source code written for x86 may behave differently when ported to Arm. To demonstrate this we will create a trivial example and run it both on an x86 and Arm cloud instance.
1212

13-
Start an Arm-based cloud instance, in this example I am using `t4g.xlarge` AWS instance running Ubuntu 22.04 LTS. If you are new to using cloud-based virtual machines, please see our [getting started guide](https://learn.arm.com/learning-paths/servers-and-cloud-computing/intro/). First confirm you are using a Arm-based instance with the following command.
13+
Start an Arm-based cloud instance, in this example I am using `t4g.xlarge` AWS instance running Ubuntu 22.04 LTS. If you are new to using cloud-based virtual machines, please see our [getting started guide](https://learn.arm.com/learning-paths/servers-and-cloud-computing/intro/).
14+
15+
First confirm you are using a Arm-based instance with the following command.
1416

1517
```bash
1618
uname -m

content/learning-paths/servers-and-cloud-computing/arm-cpp-memory-model/4.md

Lines changed: 1 addition & 1 deletion
@@ -32,7 +32,7 @@ SUMMARY: ThreadSanitizer: data race /home/ubuntu/src/relaxed_memory_ordering.cpp
3232

3333
The summary output highlights a potential data race in the `threadB` function corresponding to the source code expression `n->x != 42`.
3434

35-
## TSan's limitations
35+
## Limitations of TSan
3636

3737
Thread Sanitizer (TSan) is powerful for detecting data races but has notable drawbacks. First, it only identifies concurrency issues at runtime, meaning any problematic code that isn’t exercised during testing goes unnoticed. Additionally, if race conditions exist in third-party binaries or libraries, TSan can’t instrument or fix them without access to their source code. Another major limitation is performance overhead: TSan can slow programs by 2 to 20x and requires extra memory, making it challenging for large-scale or real-time systems.
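The runtime-only limitation can be illustrated with a small sketch. The names `hits`, `record_hit`, and `handle_requests` are hypothetical and unrelated to the `relaxed_memory_ordering.cpp` example above. Assuming the program is built with `-fsanitize=thread`, a test that only ever calls `handle_requests(false)` never executes the racy branch, so TSan has nothing to report even though the race is plainly in the source.

```cpp
#include <cassert>
#include <thread>

// Hypothetical names for illustration only.
int hits = 0;  // shared and deliberately not atomic

void record_hit() { ++hits; }  // unsynchronized read-modify-write

// The parallel branch races on `hits`; the serial branch does not.
int handle_requests(bool parallel) {
    hits = 0;
    if (parallel) {
        std::thread a(record_hit);
        std::thread b(record_hit);  // races with thread a on `hits`
        a.join();
        b.join();
    } else {
        record_hit();
        record_hit();
    }
    return hits;
}
```

Only a run that reaches `handle_requests(true)` gives TSan a chance to observe the conflicting accesses, which is why test coverage of the concurrent paths matters as much as enabling the sanitizer.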
3838
