
Commit fa3c61f

More tests with flags and g++
1 parent a0c5884 commit fa3c61f

16 files changed (+77 −4 lines)

_posts/2025-01-19-DOPvsOOP.md

Lines changed: 77 additions & 4 deletions
@@ -47,6 +47,15 @@ NUMA node(s): 1
 NUMA node0 CPU(s): 0-3
 Vulnerability L1tf: Not affected
 ```
+```bash
+$ g++ --version
+```
+```text
+g++ (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0
+Copyright (C) 2021 Free Software Foundation, Inc.
+This is free software; see the source for copying conditions. There is NO
+warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
+```
 
 ## Hands-on code
 Let us define the number of class instances (entities) we will create, so that we can run the test at a large scale. It is better to use a large number, e.g.:
@@ -95,7 +104,7 @@ public:
     }
 };
 ```
-Now, lets test it:
+Now, let’s test it:
 ```cpp
 std::chrono::duration<double> elapsedOOPDOP;
 std::vector<Entity_OOP_Bad> entities(num_entities);
@@ -154,7 +163,7 @@ public:
     }
 };
 ```
-Now, lets test it:
+Now, let’s test it:
 ```cpp
 std::chrono::duration<double> elapsedOOPDOP;
 std::vector<Entity_OOP_Good> entities(num_entities);
@@ -212,7 +221,7 @@ public:
     }
 };
 ```
-Now, lets test it:
+Now, let’s test it:
 ```cpp
 std::chrono::duration<double> elapsedOOPDOP_GoodWithFooPadding;
 std::vector<Entity_OOP_GoodWithFooPadding> entities(num_entities);
@@ -229,7 +238,7 @@ OOP (Good Order by DOP and Foo Padding) CPU cycles: 14294218
 OOP (Good Order by DOP and Foo Padding) Execution time: 0.00531921 seconds
 ```
 
-Even faster. We have found an evidence to the presented hypotesis. Lets summarize the resultd:
+Even faster. We have found evidence for the presented hypothesis. Let’s summarize the results:
 
 ```cpp
 std::cout << "With DOP, the processing is " << (elapsedOOPBad.count() - elapsedOOPDOP.count()) * 1e3 << " ms faster\n";
@@ -251,6 +260,70 @@ Below are the graph results after running the test many (1000) times and analyzi
 
 Mostly, the results align with what was experienced before; careful structuring of variables in memory enhances performance on both small and large scales, even with the optimizations that modern compilers may add.
 
+### Compiler customization
+Let’s be more rigorous. In the following, we will enable some [compiler flags](https://caiorss.github.io/C-Cpp-Notes/compiler-flags-options.html) for the g++ (GCC) compiler and analyze whether the graphs vary significantly. We are using ```Qt 6.8.1``` and specifying the flags in the ```.pro``` file via the ```QMAKE_CXXFLAGS``` variable.
+
+1. With no flags:
+![podium_comparison_ms_1](../assets/blog_images/2025-01-19-DOPvsOOP/podium_comparison_ms_1.png)
+![podium_comparison_ticks_1](../assets/blog_images/2025-01-19-DOPvsOOP/podium_comparison_ticks_1.png)
+
+2. No optimization:
+```.pro
+QMAKE_CXXFLAGS += -O0
+```
+Faster compilation and easier debugging.
+![podium_comparison_ms_2](../assets/blog_images/2025-01-19-DOPvsOOP/podium_comparison_ms_2.png)
+![podium_comparison_ticks_2](../assets/blog_images/2025-01-19-DOPvsOOP/podium_comparison_ticks_2.png)
+
+3. O2 optimization:
+```.pro
+QMAKE_CXXFLAGS += -O2
+```
+A high level of optimization: slower compilation, better suited for release builds.
+![podium_comparison_ms_3](../assets/blog_images/2025-01-19-DOPvsOOP/podium_comparison_ms_3.png)
+![podium_comparison_ticks_3](../assets/blog_images/2025-01-19-DOPvsOOP/podium_comparison_ticks_3.png)
+
+4. O3 optimization:
+```.pro
+QMAKE_CXXFLAGS += -O3
+```
+The highest (most aggressive) standard level of optimization: slower compilation, better suited for release builds.
+![podium_comparison_ms_4](../assets/blog_images/2025-01-19-DOPvsOOP/podium_comparison_ms_4.png)
+![podium_comparison_ticks_4](../assets/blog_images/2025-01-19-DOPvsOOP/podium_comparison_ticks_4.png)
+
+5. No optimization, `-march=native`:
+```.pro
+QMAKE_CXXFLAGS += -march=native
+```
+Generates code that uses all the instruction-set features of the host CPU.
+![podium_comparison_ms_5](../assets/blog_images/2025-01-19-DOPvsOOP/podium_comparison_ms_5.png)
+![podium_comparison_ticks_5](../assets/blog_images/2025-01-19-DOPvsOOP/podium_comparison_ticks_5.png)
+
+6. O3 optimization, `-march=native`:
+```.pro
+QMAKE_CXXFLAGS += -O3 -march=native
+```
+![podium_comparison_ms_6](../assets/blog_images/2025-01-19-DOPvsOOP/podium_comparison_ms_6.png)
+![podium_comparison_ticks_6](../assets/blog_images/2025-01-19-DOPvsOOP/podium_comparison_ticks_6.png)
+
+7. Vectorizing:
+```.pro
+QMAKE_CXXFLAGS += -ftree-vectorize -mavx -mavx2 -msse4.2
+```
+Leverages parallel data processing via SIMD (SSE4.2, AVX, and AVX2) instructions.
+![podium_comparison_ms_7](../assets/blog_images/2025-01-19-DOPvsOOP/podium_comparison_ms_7.png)
+![podium_comparison_ticks_7](../assets/blog_images/2025-01-19-DOPvsOOP/podium_comparison_ticks_7.png)
+
+8. All for one and one for all:
+```.pro
+QMAKE_CXXFLAGS += -O3 -march=native -funroll-loops -fomit-frame-pointer -finline-functions -ftree-vectorize -mavx -mavx2 -msse4.2
+```
+```-funroll-loops```: Unrolls loops, which can speed up repetitive iterations at the cost of larger code.
+```-fomit-frame-pointer```: Frees the frame pointer register for general use, improving register allocation.
+```-finline-functions```: Lets the compiler inline functions even when they are not marked inline, which can improve performance.
+![podium_comparison_ms_8](../assets/blog_images/2025-01-19-DOPvsOOP/podium_comparison_ms_8.png)
+![podium_comparison_ticks_8](../assets/blog_images/2025-01-19-DOPvsOOP/podium_comparison_ticks_8.png)
 ### Conclusion
 Modern CPUs access memory in blocks (typically 8 bytes or more). If the data is properly aligned in memory, access is faster because it can load and store the data in a single memory cycle. If the data is not properly aligned, the CPU may have to perform more memory accesses, which introduces performance penalties due to the need to correct the alignment at runtime.

