INTEL remains unchanged

gMerm · gMerm · commit 936a73ca376c · 2024-11-07T15:59:06.000+02:00
diff --git a/content/learning-paths/cross-platform/simd-info-demo/simdinfo-example1-porting.md b/content/learning-paths/cross-platform/simd-info-demo/simdinfo-example1-porting.md
@@ -92,4 +92,8 @@ Multiplication Result: 2.00 12.00 36.00 80.00
 Square Root Result: 1.41 3.46 6.00 8.94
 ```
 
-You can see that the results are the same as in the **SSE4.2** example.
+You can see that the results are the same as in the **SSE4.2** example.
+
+{{% notice Note %}} 
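+We initialized the vectors in reverse order compared to the SSE4.2 version because array initialization followed by `vld1q_f32` fills the vector from the least significant element to the most significant, whereas `_mm_set_ps` takes its arguments from the most significant element down to the least significant.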
+{{% /notice %}}
diff --git a/content/learning-paths/cross-platform/simd-info-demo/simdinfo-example1.md b/content/learning-paths/cross-platform/simd-info-demo/simdinfo-example1.md
@@ -14,12 +14,9 @@ Create a file named `calculation_sse.c` with the contents shown below.
 #include <xmmintrin.h>
 #include <stdio.h>
 
-float a_array[4] = {1.0f, 4.0f, 9.0f, 16.0f};
-float b_array[4] = {1.0f, 2.0f, 3.0f, 4.0f};
-
 int main() {
-    __m128 a = _mm_loadu_ps(a_array);
-    __m128 b = _mm_loadu_ps(b_array);
+    __m128 a = _mm_set_ps(16.0f, 9.0f, 4.0f, 1.0f);
+    __m128 b = _mm_set_ps(4.0f, 3.0f, 2.0f, 1.0f);
 
     __m128 cmp_result = _mm_cmpgt_ps(a, b);
 
diff --git a/content/learning-paths/cross-platform/simd-info-demo/simdinfo-example2.md b/content/learning-paths/cross-platform/simd-info-demo/simdinfo-example2.md
@@ -27,13 +27,9 @@ void print_s16x8(char *label, __m128i v) {
     printf("\n");
 }
 
-int16_t a_array[8] = {10, 30, 50, 70, 90, 110, 130, 150};
-int16_t b_array[8] = {20, 40, 60, 80, 100, 120, 140, 160};
-
 int main() {
-    
-    __m128i a = _mm_loadu_si128((__m128i*)a_array);
-    __m128i b = _mm_loadu_si128((__m128i*)b_array);
+    __m128i a = _mm_set_epi16(10, 30, 50, 70, 90, 110, 130, 150);
+    __m128i b = _mm_set_epi16(20, 40, 60, 80, 100, 120, 140, 160);
     // 130 * 140 = 18200, 150 * 160 = 24000
     // adding them as 32-bit signed integers -> 42200
     // adding them as 16-bit signed integers -> -23336 (overflow!)
@@ -60,12 +56,12 @@ Now run the program:
 
 The output should look like: 
 ```output
-a                             :    a   1e   32   46   5a   6e   82   96
-b                             :   14   28   3c   50   64   78   8c   a0
-_mm_madd_epi16(a, b)          :  578    0 2198    0 56b8    0 a4d8    0
+a                             :   96   82   6e   5a   46   32   1e    a
+b                             :   a0   8c   78   64   50   3c   28   14
+_mm_madd_epi16(a, b)          : a4d8    0 56b8    0 2198    0  578    0
 ```
 
-You will note that the result of the last element is a negative number, even though we added 2 positive results (`130*140` and `150*160`). That is because the result of the addition has to occupy a 16-bit signed integer element and when the first is larger we have the effect of an negative overflow. The result is the same in binary arithmetic, but when interpreted into a signed integer, it turns the number into a negative.
+You will note that the first element is a negative number, even though it is the sum of two positive products (`130*140` and `150*160`). That is because the sum, 42200, has to occupy a 16-bit signed integer element, whose maximum value is 32767, so it overflows. The bit pattern (`a4d8`) is the same in binary arithmetic, but when interpreted as a signed 16-bit integer it becomes -23336.
 
 The rest of the values are as expected. Notice how each pair has a zero element next to it. The results are correct, but they are not in the correct order. Both `vmovl` and `zip` could be used to fix this; in this implementation we opted for **vmovl** to zero-extend the values, which achieves the correct order with zero elements in place. For more details, see the Arm Software Optimization Guides, such as the [Neoverse V2 guide](https://developer.arm.com/documentation/109898/latest/).
 
@@ -82,8 +78,8 @@ void print_s16x8(char *label, int16x8_t v) {
     printf("\n");
 }
 
-int16_t a_array[8] = {10, 30, 50, 70, 90, 110, 130, 150};
-int16_t b_array[8] = {20, 40, 60, 80, 100, 120, 140, 160};
+int16_t a_array[8] = {150, 130, 110, 90, 70, 50, 30, 10};
+int16_t b_array[8] = {160, 140, 120, 100, 80, 60, 40, 20};
 
 int main() {
     int16x8_t a = vld1q_s16(a_array);
@@ -124,11 +120,11 @@ Now run the program:
 
 The output should look like: 
 ```output
-a                             :    a   1e   32   46   5a   6e   82   96
-b                             :   14   28   3c   50   64   78   8c   a0
-vmulq_s16(a, b)               :   c8  4b0  bb8 15e0 2328 3390 4718 5dc0
-vpaddq_s16(a, b)              :  578 2198 56b8 a4d8    0    0    0    0
-final                         :  578    0 2198    0 56b8    0 a4d8    0
+a                             :   96   82   6e   5a   46   32   1e    a
+b                             :   a0   8c   78   64   50   3c   28   14
+vmulq_s16(a, b)               : 5dc0 4718 3390 2328 15e0  bb8  4b0   c8
+vpaddq_s16(a, b)              : a4d8 56b8 2198  578    0    0    0    0
+final                         : a4d8    0 56b8    0 2198    0  578    0
 ```
 
 As you can see, the results of both match. **SIMD.info** was especially helpful in this process, providing detailed descriptions and examples that guided the translation of complex intrinsics between different SIMD architectures.