Commit 936a73c

committed
INTEL remains unchanged
1 parent b429e68 commit 936a73c

3 files changed

+20
-23
lines changed

content/learning-paths/cross-platform/simd-info-demo/simdinfo-example1-porting.md

Lines changed: 5 additions & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -92,4 +92,8 @@ Multiplication Result: 2.00 12.00 36.00 80.00
9292
Square Root Result: 1.41 3.46 6.00 8.94
9393
```
9494

95-
You can see that the results are the same as in the **SSE4.2** example.
95+
You can see that the results are the same as in the **SSE4.2** example.
96+
97+
{{% notice Note %}}
98+
We initialized the vectors in reverse order compared to the SSE4.2 version because the array initialization and vld1q_f32 function load vectors from LSB to MSB, whereas _mm_set_ps loads elements MSB to LSB.
99+
{{% /notice %}}

content/learning-paths/cross-platform/simd-info-demo/simdinfo-example1.md

Lines changed: 2 additions & 5 deletions
@@ -14,12 +14,9 @@ Create a file named `calculation_sse.c` with the contents shown below.
1414
#include <xmmintrin.h>
1515
#include <stdio.h>
1616

17-
float a_array[4] = {1.0f, 4.0f, 9.0f, 16.0f};
18-
float b_array[4] = {1.0f, 2.0f, 3.0f, 4.0f};
19-
2017
int main() {
21-
__m128 a = _mm_loadu_ps(a_array);
22-
__m128 b = _mm_loadu_ps(b_array);
18+
__m128 a = _mm_set_ps(16.0f, 9.0f, 4.0f, 1.0f);
19+
__m128 b = _mm_set_ps(4.0f, 3.0f, 2.0f, 1.0f);
2320

2421
__m128 cmp_result = _mm_cmpgt_ps(a, b);
2522

content/learning-paths/cross-platform/simd-info-demo/simdinfo-example2.md

Lines changed: 13 additions & 17 deletions
@@ -27,13 +27,9 @@ void print_s16x8(char *label, __m128i v) {
2727
printf("\n");
2828
}
2929

30-
int16_t a_array[8] = {10, 30, 50, 70, 90, 110, 130, 150};
31-
int16_t b_array[8] = {20, 40, 60, 80, 100, 120, 140, 160};
32-
3330
int main() {
34-
35-
__m128i a = _mm_loadu_si128((__m128i*)a_array);
36-
__m128i b = _mm_loadu_si128((__m128i*)b_array);
31+
__m128i a = _mm_set_epi16(10, 30, 50, 70, 90, 110, 130, 150);
32+
__m128i b = _mm_set_epi16(20, 40, 60, 80, 100, 120, 140, 160);
3733
// 130 * 140 = 18200, 150 * 160 = 24000
3834
// adding them as 32-bit signed integers -> 42200
3935
// adding them as 16-bit signed integers -> -23336 (overflow!)
@@ -60,12 +56,12 @@ Now run the program:
6056

6157
The output should look like:
6258
```output
63-
a : a 1e 32 46 5a 6e 82 96
64-
b : 14 28 3c 50 64 78 8c a0
65-
_mm_madd_epi16(a, b) : 578 0 2198 0 56b8 0 a4d8 0
59+
a : 96 82 6e 5a 46 32 1e a
60+
b : a0 8c 78 64 50 3c 28 14
61+
_mm_madd_epi16(a, b) : a4d8 0 56b8 0 2198 0 578 0
6662
```
6763

68-
You will note that the result of the last element is a negative number, even though we added 2 positive results (`130*140` and `150*160`). That is because the result of the addition has to occupy a 16-bit signed integer element and when the first is larger we have the effect of an negative overflow. The result is the same in binary arithmetic, but when interpreted into a signed integer, it turns the number into a negative.
64+
You will note that the first element is a negative number, even though we added two positive products (`130*140` and `150*160`). That is because the sum has to fit in a 16-bit signed integer element, and when it is larger than the maximum value that element can hold, it overflows and wraps around. The bits are the same in binary arithmetic, but when interpreted as a signed integer, the number becomes negative.
6965

7066
The rest of the values are as expected. Notice how each pair has a zero element next to it. The results are correct, but they are not in the correct order. In this example, we chose to use vmovl to zero-extend values, which achieves the correct order with zero elements in place. While both vmovl and zip could be used for this purpose, we opted for **vmovl** in this implementation. For more details, see the ARM Software Optimization Guides, such as the [Neoverse V2 guide](https://developer.arm.com/documentation/109898/latest/).
7167

@@ -82,8 +78,8 @@ void print_s16x8(char *label, int16x8_t v) {
8278
printf("\n");
8379
}
8480

85-
int16_t a_array[8] = {10, 30, 50, 70, 90, 110, 130, 150};
86-
int16_t b_array[8] = {20, 40, 60, 80, 100, 120, 140, 160};
81+
int16_t a_array[8] = {150, 130, 110, 90, 70, 50, 30, 10};
82+
int16_t b_array[8] = {160, 140, 120, 100, 80, 60, 40, 20};
8783

8884
int main() {
8985
int16x8_t a = vld1q_s16(a_array);
@@ -124,11 +120,11 @@ Now run the program:
124120

125121
The output should look like:
126122
```output
127-
a : a 1e 32 46 5a 6e 82 96
128-
b : 14 28 3c 50 64 78 8c a0
129-
vmulq_s16(a, b) : c8 4b0 bb8 15e0 2328 3390 4718 5dc0
130-
vpaddq_s16(a, b) : 578 2198 56b8 a4d8 0 0 0 0
131-
final : 578 0 2198 0 56b8 0 a4d8 0
123+
a : 96 82 6e 5a 46 32 1e a
124+
b : a0 8c 78 64 50 3c 28 14
125+
vmulq_s16(a, b) : 5dc0 4718 3390 2328 15e0 bb8 4b0 c8
126+
vpaddq_s16(a, b) : a4d8 56b8 2198 578 0 0 0 0
127+
final : a4d8 0 56b8 0 2198 0 578 0
132128
```
133129

134130
As you can see the results of both match, **SIMD.info** was especially helpful in this process, providing detailed descriptions and examples that guided the translation of complex intrinsics between different SIMD architectures.
