
Commit b429e68

Based on the comments, we altered the data loading, used global scope, and made some minor fixes in other places.

1 parent 720bcbe commit b429e68

File tree

4 files changed, +35 -37 lines changed

content/learning-paths/cross-platform/simd-info-demo/simdinfo-description.md

Lines changed: 1 addition & 9 deletions

@@ -64,12 +64,4 @@ When you need to port code from one architecture to another, the advanced search
 
 Furthermore, **SIMD.info**’s comparison tools enhance this process by enabling side-by-side comparisons of instructions from various platforms. This feature highlights the similarities and differences between instructions, which is crucial for accurately adapting your code. By understanding how similar operations are implemented across architectures, you can ensure that your ported code performs optimally.
 
-Let's look at an actual example.
-
-
-
-
-
-
-<!-- IMAGE HERE:
-![example image alt-text#center](example-picture.png "Figure 1. Example image caption") -->
+Let's look at an actual example.

content/learning-paths/cross-platform/simd-info-demo/simdinfo-example1-porting.md

Lines changed: 6 additions & 10 deletions

@@ -20,9 +20,12 @@ Create a new file for the ported NEON code named `calculation_neon.c` with the c
 #include <arm_neon.h>
 #include <stdio.h>
 
+float32_t a_array[4] = {1.0f, 4.0f, 9.0f, 16.0f};
+float32_t b_array[4] = {1.0f, 2.0f, 3.0f, 4.0f};
+
 int main() {
-    float32x4_t a = {1.0f, 4.0f, 9.0f, 16.0f};
-    float32x4_t b = {1.0f, 2.0f, 3.0f, 4.0f};
+    float32x4_t a = vld1q_f32(a_array);
+    float32x4_t b = vld1q_f32(b_array);
 
     uint32x4_t cmp_result = vcgtq_f32(a, b);
 
@@ -89,11 +92,4 @@ Multiplication Result: 2.00 12.00 36.00 80.00
 Square Root Result: 1.41 3.46 6.00 8.94
 ```
 
-You can see that the results are the same as in the **SSE4.2** example.
-
-{{% notice Note %}}
-We initialized the vectors in reverse order compared to the **SSE4.2** version because **{}** bracket initialization loads vectors from LSB to MSB, whereas **`_mm_set_ps`** loads the elements MSB to LSB.
-{{% /notice %}}
-
-
-
+You can see that the results are the same as in the **SSE4.2** example.

content/learning-paths/cross-platform/simd-info-demo/simdinfo-example1.md

Lines changed: 5 additions & 2 deletions

@@ -14,9 +14,12 @@ Create a file named `calculation_sse.c` with the contents shown below.
 #include <xmmintrin.h>
 #include <stdio.h>
 
+float a_array[4] = {1.0f, 4.0f, 9.0f, 16.0f};
+float b_array[4] = {1.0f, 2.0f, 3.0f, 4.0f};
+
 int main() {
-    __m128 a = _mm_set_ps(16.0f, 9.0f, 4.0f, 1.0f);
-    __m128 b = _mm_set_ps(4.0f, 3.0f, 2.0f, 1.0f);
+    __m128 a = _mm_loadu_ps(a_array);
+    __m128 b = _mm_loadu_ps(b_array);
 
     __m128 cmp_result = _mm_cmpgt_ps(a, b);
 
content/learning-paths/cross-platform/simd-info-demo/simdinfo-example2.md

Lines changed: 23 additions & 16 deletions

@@ -27,9 +27,13 @@ void print_s16x8(char *label, __m128i v) {
     printf("\n");
 }
 
+int16_t a_array[8] = {10, 30, 50, 70, 90, 110, 130, 150};
+int16_t b_array[8] = {20, 40, 60, 80, 100, 120, 140, 160};
+
 int main() {
-    __m128i a = _mm_set_epi16(10, 30, 50, 70, 90, 110, 130, 150);
-    __m128i b = _mm_set_epi16(20, 40, 60, 80, 100, 120, 140, 160);
+
+    __m128i a = _mm_loadu_si128((__m128i*)a_array);
+    __m128i b = _mm_loadu_si128((__m128i*)b_array);
     // 130 * 140 = 18200, 150 * 160 = 24000
     // adding them as 32-bit signed integers -> 42200
     // adding them as 16-bit signed integers -> -23336 (overflow!)
@@ -56,14 +60,14 @@ Now run the program:
 
 The output should look like:
 ```output
-a                             : 96 82 6e 5a 46 32 1e a
-b                             : a0 8c 78 64 50 3c 28 14
-_mm_madd_epi16(a, b)          : a4d8 0 56b8 0 2198 0 578 0
+a                             : a 1e 32 46 5a 6e 82 96
+b                             : 14 28 3c 50 64 78 8c a0
+_mm_madd_epi16(a, b)          : 578 0 2198 0 56b8 0 a4d8 0
 ```
 
-You will note that the result of the first element is a negative number, even though we added 2 positive results (`130*140` and `150*160`). That is because the result of the addition has to occupy a 16-bit signed integer element and when the first is larger we have the effect of an negative overflow. The result is the same in binary arithmetic, but when interpreted into a signed integer, it turns the number into a negative.
+You will note that the result of the last element is a negative number, even though we added two positive products (`130*140` and `150*160`). That is because the sum has to fit into a 16-bit signed integer element, and when it exceeds that range it wraps around. The bit pattern is the same in binary arithmetic, but interpreted as a signed integer the value becomes negative.
 
-The rest of the values are as expected. Notice how each pair has a zero element next to it. The results are correct, but they are not in the correct order. You could get the correct order in multiple ways, using the widening intrinsics **`vmovl`** to zero-extend or using the **`zip`** ones to merge with zero elements. The fastest way is the **`vmovl`** intrinsics, as you can see in the next example:
+The rest of the values are as expected. Notice how each pair has a zero element next to it. The results are correct, but they are not in the correct order. In this example we use the widening **`vmovl`** intrinsics to zero-extend the values, which achieves the correct order with the zero elements in place; the **`zip`** intrinsics, which merge with zero elements, would also work. For more details, see the Arm Software Optimization Guides, such as the [Neoverse V2 guide](https://developer.arm.com/documentation/109898/latest/).
 
 ```C
 #include <arm_neon.h>
@@ -74,13 +78,16 @@ void print_s16x8(char *label, int16x8_t v) {
     int16_t out[8];
     vst1q_s16(out, v);
     printf("%-*s: ", 30, label);
-    for (size_t i=0; i < 8; i++) printf("%4x ", (uint16_t) out[i]);
+    for (size_t i = 0; i < 8; i++) printf("%4x ", (uint16_t)out[i]);
     printf("\n");
 }
 
+int16_t a_array[8] = {10, 30, 50, 70, 90, 110, 130, 150};
+int16_t b_array[8] = {20, 40, 60, 80, 100, 120, 140, 160};
+
 int main() {
-    int16x8_t a = { 150, 130, 110, 90, 70, 50, 30, 10 };
-    int16x8_t b = { 160, 140, 120, 100, 80, 60, 40, 20 };
+    int16x8_t a = vld1q_s16(a_array);
+    int16x8_t b = vld1q_s16(b_array);
     int16x8_t zero = vdupq_n_s16(0);
     // 130 * 140 = 18200, 150 * 160 = 24000
     // adding them as 32-bit signed integers -> 42200
@@ -94,7 +101,7 @@ int main() {
     res = vpaddq_s16(res, zero);
     print_s16x8("vpaddq_s16(a, b)", res);
 
-    // vmovl_s16 would sign-extend; we just want to zero-extend
+    // vmovl_s16 would sign-extend; we just want to zero-extend,
     // so we need to cast to uint16, vmovl_u16 and then cast back to int16
     uint16x4_t res_u16 = vget_low_u16(vreinterpretq_u16_s16(res));
     res = vreinterpretq_s16_u32(vmovl_u16(res_u16));
@@ -117,11 +124,11 @@ Now run the program:
 
 The output should look like:
 ```output
-a                             : 96 82 6e 5a 46 32 1e a
-b                             : a0 8c 78 64 50 3c 28 14
-vmulq_s16(a, b)               : 5dc0 4718 3390 2328 15e0 bb8 4b0 c8
-vpaddq_s16(a, b)              : a4d8 56b8 2198 578 0 0 0 0
-final                         : a4d8 0 56b8 0 2198 0 578 0
+a                             : a 1e 32 46 5a 6e 82 96
+b                             : 14 28 3c 50 64 78 8c a0
+vmulq_s16(a, b)               : c8 4b0 bb8 15e0 2328 3390 4718 5dc0
+vpaddq_s16(a, b)              : 578 2198 56b8 a4d8 0 0 0 0
+final                         : 578 0 2198 0 56b8 0 a4d8 0
 ```
 
 As you can see, the results of both versions match. **SIMD.info** was especially helpful in this process, providing detailed descriptions and examples that guided the translation of complex intrinsics between different SIMD architectures.
