95
+
You can see that the results are the same as in the **SSE4.2** example.
96
+
97
+
{{% notice Note %}}
98
+
We initialized the vectors in reverse order compared to the SSE4.2 version because array initialization and the `vld1q_f32` function load elements from LSB to MSB, whereas `_mm_set_ps` takes its elements from MSB to LSB.
// adding them as 16-bit signed integers -> -23336 (overflow!)
@@ -60,12 +56,12 @@ Now run the program:
60
56
61
57
The output should look like:
62
58
```output
63
-
a : a 1e 32 46 5a 6e 8296
64
-
b : 14283c5064788ca0
65
-
_mm_madd_epi16(a, b) : 578 0 2198 0 56b8 0 a4d8 0
59
+
a : 96 82 6e 5a 46 32 1e a
60
+
b : a08c7864503c2814
61
+
_mm_madd_epi16(a, b) : a4d8 0 56b8 0 2198 0 578 0
66
62
```
67
63
68
-
You will note that the result of the last element is a negative number, even though we added 2 positive results (`130*140` and `150*160`). That is because the result of the addition has to occupy a 16-bit signed integer element and when the first is larger we have the effect of an negative overflow. The result is the same in binary arithmetic, but when interpreted into a signed integer, it turns the number into a negative.
64
+
Note that the first element of the result is a negative number, even though we added two positive products (`130*140` and `150*160`). That is because the sum has to fit in a 16-bit signed integer element, and when it is larger than the maximum value of that type (32767) it overflows into the negative range. The bit pattern is the same in binary arithmetic, but when interpreted as a signed integer, it becomes a negative number.
69
65
70
66
The rest of the values are as expected. Notice how each result has a zero element next to it. The results are correct, but they are not in the correct order. In this example, we use **vmovl** to zero-extend the values, which achieves the correct order with the zero elements in place. Both vmovl and zip could be used for this purpose. For more details, see the Arm Software Optimization Guides, such as the [Neoverse V2 guide](https://developer.arm.com/documentation/109898/latest/).
As you can see, the results of both match. **SIMD.info** was especially helpful in this process, providing detailed descriptions and examples that guided the translation of complex intrinsics between different SIMD architectures.