
Commit fa3c61f

More tests with flags and g++
1 parent a0c5884 commit fa3c61f

16 files changed (+77 −4 lines)

_posts/2025-01-19-DOPvsOOP.md

Lines changed: 77 additions & 4 deletions
@@ -47,6 +47,15 @@ NUMA node(s): 1
 NUMA node0 CPU(s): 0-3
 Vulnerability L1tf: Not affected
 ```
+```bash
+$ g++ --version
+```
+```text
+g++ (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0
+Copyright (C) 2021 Free Software Foundation, Inc.
+This is free software; see the source for copying conditions. There is NO
+warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
+```
 
 ## Hands-on code
 Let us define the number of class instances (entities) we will create, so that we can run the test at a large scale. It is better to use a large number, e.g.:
@@ -95,7 +104,7 @@ public:
     }
 };
 ```
-Now, lets test it:
+Now, let’s test it:
 ```cpp
 std::chrono::duration<double> elapsedOOPDOP;
 std::vector<Entity_OOP_Bad> entities(num_entities);
@@ -154,7 +163,7 @@ public:
     }
 };
 ```
-Now, lets test it:
+Now, let’s test it:
 ```cpp
 std::chrono::duration<double> elapsedOOPDOP;
 std::vector<Entity_OOP_Good> entities(num_entities);
@@ -212,7 +221,7 @@ public:
     }
 };
 ```
-Now, lets test it:
+Now, let’s test it:
 ```cpp
 std::chrono::duration<double> elapsedOOPDOP_GoodWithFooPadding;
 std::vector<Entity_OOP_GoodWithFooPadding> entities(num_entities);
@@ -229,7 +238,7 @@ OOP (Good Order by DOP and Foo Padding) CPU cycles: 14294218
 OOP (Good Order by DOP and Foo Padding) Execution time: 0.00531921 seconds
 ```
 
-Even faster. We have found an evidence to the presented hypotesis. Lets summarize the resultd:
+Even faster. We have found evidence for the presented hypothesis. Let’s summarize the results:
 
 ```cpp
 std::cout << "With DOP, the processing is " << (elapsedOOPBad.count() - elapsedOOPDOP.count()) * 1e3 << " ms faster\n";
@@ -251,6 +260,70 @@ Below are the graph results after running the test many (1000) times and analyzi
 
 Mostly, the results align with what was experienced before; careful structuring of variables in memory enhances performance on both small and large scales, even with the optimizations that modern compilers may add.
 
+### Compiler customization
+Let’s be more rigorous. In the following, we will enable some [compiler flags](https://caiorss.github.io/C-Cpp-Notes/compiler-flags-options.html) for the g++ (GCC) compiler and analyze whether the graphs vary significantly. We are using ```Qt 6.8.1``` and specifying the flags in the ```.pro``` file via the ```QMAKE_CXXFLAGS``` variable.
+
+1. With no flags:
+![podium_comparison_ms_1](../assets/blog_images/2025-01-19-DOPvsOOP/podium_comparison_ms_1.png)
+![podium_comparison_ticks_1](../assets/blog_images/2025-01-19-DOPvsOOP/podium_comparison_ticks_1.png)
+
+2. No optimization:
+```.pro
+QMAKE_CXXFLAGS += -O0
+```
+Faster compilation and easier debugging.
+![podium_comparison_ms_2](../assets/blog_images/2025-01-19-DOPvsOOP/podium_comparison_ms_2.png)
+![podium_comparison_ticks_2](../assets/blog_images/2025-01-19-DOPvsOOP/podium_comparison_ticks_2.png)
+
+3. O2 optimization:
+```.pro
+QMAKE_CXXFLAGS += -O2
+```
+A high level of optimization: slower compilation, better suited for release builds.
+![podium_comparison_ms_3](../assets/blog_images/2025-01-19-DOPvsOOP/podium_comparison_ms_3.png)
+![podium_comparison_ticks_3](../assets/blog_images/2025-01-19-DOPvsOOP/podium_comparison_ticks_3.png)
+
+4. O3 optimization:
+```.pro
+QMAKE_CXXFLAGS += -O3
+```
+The highest (most aggressive) standard level of optimization: slower compilation, better suited for release builds.
+![podium_comparison_ms_4](../assets/blog_images/2025-01-19-DOPvsOOP/podium_comparison_ms_4.png)
+![podium_comparison_ticks_4](../assets/blog_images/2025-01-19-DOPvsOOP/podium_comparison_ticks_4.png)
+
+5. No optimization, `-march=native`:
+```.pro
+QMAKE_CXXFLAGS += -march=native
+```
+Generates code that uses all the instruction-set features of the host CPU.
+![podium_comparison_ms_5](../assets/blog_images/2025-01-19-DOPvsOOP/podium_comparison_ms_5.png)
+![podium_comparison_ticks_5](../assets/blog_images/2025-01-19-DOPvsOOP/podium_comparison_ticks_5.png)
+
+6. O3 optimization, `-march=native`:
+```.pro
+QMAKE_CXXFLAGS += -O3 -march=native
+```
+![podium_comparison_ms_6](../assets/blog_images/2025-01-19-DOPvsOOP/podium_comparison_ms_6.png)
+![podium_comparison_ticks_6](../assets/blog_images/2025-01-19-DOPvsOOP/podium_comparison_ticks_6.png)
+
+7. Vectorizing:
+```.pro
+QMAKE_CXXFLAGS += -ftree-vectorize -mavx -mavx2 -msse4.2
+```
+Leverages parallel data processing via SIMD (SSE4.2, AVX, and AVX2) instructions.
+![podium_comparison_ms_7](../assets/blog_images/2025-01-19-DOPvsOOP/podium_comparison_ms_7.png)
+![podium_comparison_ticks_7](../assets/blog_images/2025-01-19-DOPvsOOP/podium_comparison_ticks_7.png)
+
+8. All for one and one for all:
+```.pro
+QMAKE_CXXFLAGS += -O3 -march=native -funroll-loops -fomit-frame-pointer -finline-functions -ftree-vectorize -mavx -mavx2 -msse4.2
+```
+```-funroll-loops```: Unrolls loops, which can speed up repetitive iterations at the cost of larger code.
+```-fomit-frame-pointer```: Frees the frame pointer register for general use, improving register allocation.
+```-finline-functions```: Lets the compiler inline functions even when they are not marked inline, which can improve performance.
+![podium_comparison_ms_8](../assets/blog_images/2025-01-19-DOPvsOOP/podium_comparison_ms_8.png)
+![podium_comparison_ticks_8](../assets/blog_images/2025-01-19-DOPvsOOP/podium_comparison_ticks_8.png)
 ### Conclusion
 Modern CPUs access memory in blocks (typically 8 bytes or more). If the data is properly aligned in memory, access is faster because it can load and store the data in a single memory cycle. If the data is not properly aligned, the CPU may have to perform more memory accesses, which introduces performance penalties due to the need to correct the alignment at runtime.

