Commit c003eda
Starting DOP post
1 parent 9a30179 commit c003eda
4 files changed: +251 -9 lines changed

_posts/2025-01-19-DOPvsOOP.md

Lines changed: 249 additions & 4 deletions

title: Data-Oriented Programming vs Object-Oriented Programming
tags: [C++, programming]
style: dark
color: danger
description: A practical introduction to the useful programming concept of DOP
---

## Introduction

**Memory alignment matters. And it gets worse at large scale. As someone said, 1 ms could make the difference between getting frustrated or not while waiting for Word to open.**

In this post, we will experiment with how the alignment of attributes in classes/structures affects the computational cost of code (in C++) in terms of execution time.

The idea is that we will create many instances of various classes with multiple attributes and a method that updates or _does something_ with those attributes, many times. We will evaluate and compare the execution time of each one. We will introduce concepts from DOP (Data-Oriented Programming) and put them into practice to see how they can help us in our daily life as ~~**_high-performance_**~~ programmers.

1. We will create a class ```Entity_OOP_Bad``` as any innocent subscriber to OOP would do.
2. We will paint the previous class with our knowledge of DOP and turn it into ```Entity_OOP_Good```.
3. We will further maximize the efficiency of our code in ```Entity_OOP_GoodWithFooPadding```.

* How will we evaluate performance? The assessment will be purely based on execution time with ```chrono``` and CPU cycles with ```__rdtsc``` from the ```x86intrin.h``` header.
For the curious, this is my machine:
```bash
$ uname -a
Linux pop-os 6.9.3-76060903-generic #202405300957~1732141768~22.04~f2697e1 SMP PREEMPT_DYNAMIC Wed N x86_64 x86_64 x86_64 GNU/Linux
```
```bash
$ lscpu | grep -E 'Architecture|Cache|CPU|Model|NUMA|cache|L1|L2|L3'
Architecture:         x86_64
CPU op-mode(s):       32-bit, 64-bit
CPU(s):               4
On-line CPU(s) list:  0-3
Model name:           12th Gen Intel(R) Core(TM) i7-12650H
CPU family:           6
Model:                154
L1d cache:            192 KiB (4 instances)
L1i cache:            128 KiB (4 instances)
L2 cache:             5 MiB (4 instances)
L3 cache:             96 MiB (4 instances)
NUMA node(s):         1
NUMA node0 CPU(s):    0-3
Vulnerability L1tf:   Not affected
```

## Hands-on code
Let us define the number of class instances (entities) we will create, so we can test at large scale. Better to be a large number, e.g.:
```cpp
const int num_entities = 1000000;
```

### ```Entity_OOP_Bad```
An OOP class with poor attribute ordering, causing padding.
The attribute order is arbitrary and does not take into account the size of each data type.
This causes the compiler to add padding (filler bytes) to properly align the attributes in memory. As a result, memory is wasted.
```cpp
class Entity_OOP_Bad {
public:
    struct atributes {
        double dx, dz;       // 8 bytes each (16 bytes total)
        float x, y;          // 4 bytes each (8 bytes total)
        uint16_t something;  // 2 bytes (+6 bytes of padding before dy)
        double dy;           // 8 bytes
        uint16_t something1; // 2 bytes
        uint16_t something2; // 2 bytes
        int score;           // 4 bytes
        int score1;          // 4 bytes
        int score2;          // 4 bytes
        char id;             // 1 byte (+3 bytes of padding before z)
        float z;             // 4 bytes
        bool active;         // 1 byte (+7 bytes of tail padding)
        // _______
        // 72 bytes total, alignment 8 bytes
    };

    atributes mAtributes;

    void modifyParams() {
        this->mAtributes.x = this->mAtributes.y = this->mAtributes.z = 0.0f;
        this->mAtributes.dx = this->mAtributes.dy = this->mAtributes.dz = 0.1;
        this->mAtributes.active = true;
        this->mAtributes.id = 'A';
        this->mAtributes.score = 100;
        this->mAtributes.score1 = 100;
        this->mAtributes.score2 = 100;
        this->mAtributes.something *= 2;
        this->mAtributes.something1 *= 2;
        this->mAtributes.something2 *= 2;
    }
};
```
Now, let's test it:
```cpp
std::chrono::duration<double> elapsedOOPBad;
std::vector<Entity_OOP_Bad> entities(num_entities);
auto start = std::chrono::high_resolution_clock::now();
unsigned long long start_cycles = __rdtsc();
for (auto& entity : entities) entity.modifyParams();
unsigned long long end_cycles = __rdtsc();
elapsedOOPBad = std::chrono::high_resolution_clock::now() - start;
std::cout << "OOP (Bad Order) CPU cycles: " << (end_cycles - start_cycles) << "\n";
std::cout << "OOP (Bad Order) Execution time: " << elapsedOOPBad.count() << " seconds\n";
```
```text
OOP (Bad Order) CPU cycles: 17961504
OOP (Bad Order) Execution time: 0.0066837 seconds
```

...but that's pretty fast, right? Well... yes, but let's continue going deeper.

### ```Entity_OOP_Good```
An OOP class with proper attribute ordering, minimizing padding.
Here, padding is reduced by grouping similar types together.
The attributes are reordered from largest to smallest size (first double, then float, followed by int, char, and finally bool).
This minimizes the amount of padding required, making the structure more compact in memory.
On a more technical level, when the attributes are properly ordered, the machine code can address them at small, regular offsets from the base register (rax+4, rax+20...), and thus more efficiently.
```cpp
class Entity_OOP_Good {
public:
    struct atributes {
        double dx, dy, dz;   // 8 bytes each (24 bytes total)
        float x, y, z;       // 4 bytes each (12 bytes total)
        int score;           // 4 bytes
        int score1;          // 4 bytes
        int score2;          // 4 bytes
        uint16_t something;  // 2 bytes
        uint16_t something1; // 2 bytes
        uint16_t something2; // 2 bytes
        char id;             // 1 byte
        bool active;         // 1 byte
        // _______
        // 56 bytes total, alignment 8 bytes
    };

    atributes mAtributes;

    void modifyParams() {
        this->mAtributes.x = this->mAtributes.y = this->mAtributes.z = 0.0f;
        this->mAtributes.dx = this->mAtributes.dy = this->mAtributes.dz = 0.1;
        this->mAtributes.active = true;
        this->mAtributes.id = 'A';
        this->mAtributes.score = 100;
        this->mAtributes.score1 = 100;
        this->mAtributes.score2 = 100;
        this->mAtributes.something *= 2;
        this->mAtributes.something1 *= 2;
        this->mAtributes.something2 *= 2;
    }
};
```
Now, let's test it:
```cpp
std::chrono::duration<double> elapsedOOPDOP;
std::vector<Entity_OOP_Good> entities(num_entities);
auto start = std::chrono::high_resolution_clock::now();
unsigned long long start_cycles = __rdtsc();
for (auto& entity : entities) entity.modifyParams();
unsigned long long end_cycles = __rdtsc();
elapsedOOPDOP = std::chrono::high_resolution_clock::now() - start;
std::cout << "OOP (Good Order by DOP) CPU cycles: " << (end_cycles - start_cycles) << "\n";
std::cout << "OOP (Good Order by DOP) Execution time: " << elapsedOOPDOP.count() << " seconds\n";
```
```text
OOP (Good Order by DOP) CPU cycles: 15459546
OOP (Good Order by DOP) Execution time: 0.00575244 seconds
```

Again a better result. This indicates that we are not thinking nonsense, but we can go even further, and this is just translating naive knowledge about CPU architecture into code...

> [!NOTE]
> With the ```$ lscpu``` command you can view the information about my CPU, to see the size in bytes that the CPU fetches in each cycle, in order to know how to maximize the efficiency of my structure, avoid unnecessary gaps, and perform operations in the fewest number of cycles (L1 and L2 cache sizes, 64-bit data bus size, etc.).
### ```Entity_OOP_GoodWithFooPadding```
Now we manually add the padding needed to round each instance up to a full 64-byte block, matching the block granularity of our CPU's memory architecture:
```cpp
class Entity_OOP_GoodWithFooPadding {
public:
    struct atributes {
        double dx, dy, dz;   // 8 bytes each (24 bytes total)
        float x, y, z;       // 4 bytes each (12 bytes total)
        int score;           // 4 bytes
        int score1;          // 4 bytes
        int score2;          // 4 bytes
        uint16_t something;  // 2 bytes
        uint16_t something1; // 2 bytes
        uint16_t something2; // 2 bytes
        char id;             // 1 byte
        bool active;         // 1 byte
        char padding[8];     // 8 bytes of padding to complete the 64-byte block
        // _______
        // 64 bytes total, alignment 8 bytes
    };

    atributes mAtributes;

    void modifyParams() {
        this->mAtributes.x = this->mAtributes.y = this->mAtributes.z = 0.0f;
        this->mAtributes.dx = this->mAtributes.dy = this->mAtributes.dz = 0.1;
        this->mAtributes.active = true;
        this->mAtributes.id = 'A';
        this->mAtributes.score = 100;
        this->mAtributes.score1 = 100;
        this->mAtributes.score2 = 100;
        this->mAtributes.something *= 2;
        this->mAtributes.something1 *= 2;
        this->mAtributes.something2 *= 2;
    }
};
```
Now, let's test it:
```cpp
std::chrono::duration<double> elapsedOOPDOP_GoodWithFooPadding;
std::vector<Entity_OOP_GoodWithFooPadding> entities(num_entities);
auto start = std::chrono::high_resolution_clock::now();
unsigned long long start_cycles = __rdtsc();
for (auto& entity : entities) entity.modifyParams();
unsigned long long end_cycles = __rdtsc();
elapsedOOPDOP_GoodWithFooPadding = std::chrono::high_resolution_clock::now() - start;
std::cout << "OOP (Good Order by DOP and Foo Padding) CPU cycles: " << (end_cycles - start_cycles) << "\n";
std::cout << "OOP (Good Order by DOP and Foo Padding) Execution time: " << elapsedOOPDOP_GoodWithFooPadding.count() << " seconds\n";
```
```text
OOP (Good Order by DOP and Foo Padding) CPU cycles: 14294218
OOP (Good Order by DOP and Foo Padding) Execution time: 0.00531921 seconds
```

Even faster. We have found evidence for the presented hypothesis. Let's summarize the results:

```cpp
std::cout << "With DOP, the processing is " << (elapsedOOPBad.count() - elapsedOOPDOP.count()) * 1e3 << " ms faster\n";
std::cout << "With DOP and Foo Padding, the processing is " << (elapsedOOPBad.count() - elapsedOOPDOP_GoodWithFooPadding.count()) * 1e3 << " ms faster\n";
```
```text
With DOP, the processing is 0.931258 ms faster
With DOP and Foo Padding, the processing is 1.36449 ms faster
```

### Larger scale

One may wonder: what if this was something more casual than causal, just a quick coincidence? We can run this $$n$$ times to see if Gauss is on our side (is it true that DOP works or not?).

Graph results after running the test many (1000) times and analyzing which methods were the fastest:

![podium_comparison_ms](../assets/blog_images/2025-01-19-DOPvsOOP/podium_comparison_ms.png)
![podium_comparison_ticks](../assets/blog_images/2025-01-19-DOPvsOOP/podium_comparison_ticks.png)

Mostly, the results align with what we saw before: careful structuring of variables in memory enhances performance at both small and large scales, even with the optimizations that modern compilers may add.

**GRAPH**

### Conclusion
Modern CPUs access memory in blocks (typically 8 bytes or more). If the data is properly aligned in memory, access is faster because the CPU can load and store it in a single memory cycle. If it is not properly aligned, the CPU may have to perform extra memory accesses, which introduces performance penalties from correcting the alignment at runtime.

One lesson learned is that programming often needs to be approached with a statistical mindset: structure your code in the most probabilistically favorable way for it to execute under normal conditions. If a switch is likely to hit a specific case most of the time, place that one first. If you can do something at compile time with reasonable performance, do it there instead of at runtime. Reduce the number of calls by studying which cases are more probable in your problem; save work for the CPU, whose threads you can't really control deterministically.

We painted the paradigms as "good" or "bad", but this goes no further than satire. Each one has its field of application, and it cannot be said that one is better than the other without establishing a particular framework, because their philosophies are different. They are not mutually exclusive; rather, they reinforce each other, especially in the sense that DOP **reinforces OOP**.

_Still in progress..._

### Reference

[Mike Acton CppCon 2014](https://www.youtube.com/watch?v=rX0ItVEVjHc)

_posts/2025-01-20-shape-normalization.md

Lines changed: 2 additions & 5 deletions

@@ -226,10 +226,7 @@ while (error > tolerance && iteration < maxIterations)
     error = std::abs((double)whitePixelCountOriginal - (double)whitePixelCountTransformed) / (double)whitePixelCountOriginal;
     std::cout << "Iteration: " << iteration << ", Error: " << error << std::endl;

-    if (whitePixelCountTransformed < whitePixelCountOriginal)
-        k *= 1.1;
-    else
-        k *= 0.9;
+    k *= whitePixelCountTransformed < whitePixelCountOriginal ? 1.1 : 0.9;

     ++iteration;
 }
@@ -262,7 +259,7 @@ We will understand the above code:
 5. To further refine the transformed image, we crop the bounding box of the object:

 <table>
-  <caption>Above: before cropping hte bounding box. Below: cropped.</caption>
+  <caption>Above: before cropping the bounding box. Below: cropped.</caption>
   <tr>
     <td><img src="../assets/blog_images/2025-01-20-shape-normalization/normalized_not_cropped_0.png" alt="normalized_not_cropped" width="64" height="64"></td>
     <td><img src="../assets/blog_images/2025-01-20-shape-normalization/normalized_not_cropped_0_1.png" alt="normalized_not_cropped" width="64" height="64"></td>
(2 image assets updated: 40.8 KB and 41.6 KB)