some text-related changes + conclusion addition about data layout

gMerm · gMerm · commit 73808619b871 · 2024-11-08T12:25:54.000+02:00
diff --git a/content/learning-paths/cross-platform/simd-info-demo/conclusion.md b/content/learning-paths/cross-platform/simd-info-demo/conclusion.md
@@ -8,7 +8,7 @@ layout: learningpathall
 
 ### Conclusion and Additional Resources
 
-Porting SIMD code between architecture can be a daunting process, in many cases requiring many hours of studying multiple ISAs in online resources or ISA manuals of thousands pages.
+Porting SIMD code between architectures can be a daunting process, in many cases requiring many hours of studying multiple ISAs in online resources or ISA manuals that run to thousands of pages. This learning path focused on optimizing the existing algorithm directly with SIMD intrinsics, without altering the algorithm or its data layout. Reordering data to suit native ARM instructions could offer further performance benefits, but that was outside our scope. For those interested in data layout strategies to further enhance performance on ARM, the [vectorization-friendly data layout learning path](https://learn.arm.com/learning-paths/cross-platform/vectorization-friendly-data-layout/) offers valuable insights.
 
 Using **[SIMD.info](https://simd.info)** can be instrumental in reducing the amount of time spent in this process, providing a centralized and user-friendly resource for finding **NEON** equivalents to intrinsics of other architectures. It saves considerable time and effort by offering detailed descriptions, prototypes, and comparisons directly, eliminating the need for extensive web searches and manual lookups.
 
diff --git a/content/learning-paths/cross-platform/simd-info-demo/simdinfo-example1-porting.md b/content/learning-paths/cross-platform/simd-info-demo/simdinfo-example1-porting.md
@@ -95,5 +95,5 @@ Square Root Result: 1.41 3.46 6.00 8.94
 You can see that the results are the same as in the **SSE4.2** example.
 
 {{% notice Note %}} 
-We initialized the vectors in reverse order compared to the SSE4.2 version because the array initialization and vld1q_f32 function load vectors from LSB to MSB, whereas _mm_set_ps loads elements MSB to LSB.
+We initialized the vectors in reverse order compared to the **SSE4.2** version because the array initialization and **`vld1q_f32`** load elements from LSB to MSB (the first array element lands in the lowest lane), whereas **`_mm_set_ps`** takes its arguments from MSB to LSB.
 {{% /notice %}}
diff --git a/content/learning-paths/cross-platform/simd-info-demo/simdinfo-example2.md b/content/learning-paths/cross-platform/simd-info-demo/simdinfo-example2.md
@@ -63,7 +63,7 @@ _mm_madd_epi16(a, b)          : a4d8    0 56b8    0 2198    0  578    0
 
 You will note that the result of the first element is a negative number, even though we added two positive products (`130*140` and `150*160`). That is because the sum has to fit in a 16-bit signed integer element, and when it exceeds the largest representable value (32767) it overflows and wraps around. The bit pattern is the same in binary arithmetic, but when interpreted as a signed integer it reads as a negative number.
 
-The rest of the values are as expected. Notice how each pair has a zero element next to it. The results are correct, but they are not in the correct order. In this example, we chose to use vmovl to zero-extend values, which achieves the correct order with zero elements in place. While both vmovl and zip could be used for this purpose, we opted for **vmovl** in this implementation. For more details, see the ARM Software Optimization Guides, such as the [Neoverse V2 guide](https://developer.arm.com/documentation/109898/latest/).
+The rest of the values are as expected. Notice how each pair has a zero element next to it. The results are correct, but they are not in the correct order. In this example, we chose to use **`vmovl`** to zero-extend values, which achieves the correct order with zero elements in place. While both **`vmovl`** and **`zip`** could be used for this purpose, we opted for **`vmovl`** in this implementation. For more details, see the ARM Software Optimization Guides, such as the [Neoverse V2 guide](https://developer.arm.com/documentation/109898/latest/).
 
 ```C
 #include <arm_neon.h>