Skip to content

Commit 172dc75

Browse files
Editorial.
1 parent 11c768b commit 172dc75

File tree

1 file changed

+5
-5
lines changed

1 file changed

+5
-5
lines changed

content/learning-paths/cross-platform/simd-info-demo/simdinfo-example2.md

Lines changed: 5 additions & 5 deletions
Original file line numberDiff line numberDiff line change
@@ -10,7 +10,7 @@ layout: learningpathall
1010

1111
During the porting process, you can see that certain instructions translate seamlessly. However, there are cases where direct equivalents for some intrinsics might not be readily available across architectures.
1212

13-
For example, the [**`_mm_madd_epi16`**](https://simd.info/c_intrinsic/_mm_madd_epi16/) intrinsic from SSE2, which performs multiplication of 16-bit signed integer elements in a vector and then does a pairwise addition of adjacent elements increasing the element width, does not have a direct counterpart in NEON. However it can be emulated using another intrinsic. Similarly its 256 and 512-bit counterparts, [**`_mm256_madd_epi16`**](https://simd.info/c_intrinsic/_mm256_madd_epi16/) and [**`_mm512_madd_epi16`**](https://simd.info/c_intrinsic/_mm512_madd_epi16/), can be emulated by a sequence of instructions, but here you will see the 128-bit variant.
13+
For example, the [**`_mm_madd_epi16`**](https://simd.info/c_intrinsic/_mm_madd_epi16/) intrinsic from SSE2, which performs multiplication of 16-bit signed integer elements in a vector and then does a pairwise addition of adjacent elements increasing the element width, does not have a direct counterpart in NEON. However, it can be emulated using another intrinsic. Similarly its 256 and 512-bit counterparts, [**`_mm256_madd_epi16`**](https://simd.info/c_intrinsic/_mm256_madd_epi16/) and [**`_mm512_madd_epi16`**](https://simd.info/c_intrinsic/_mm512_madd_epi16/), can be emulated by a sequence of instructions, but here you will see the 128-bit variant.
1414

1515
You might already know the equivalent operations for this particular intrinsic, but let's assume that you don't. In this particular use case, reading **`_mm_madd_epi16`** on **SIMD.info** might indicate that a key characteristic of the instruction involved is the widening of the result elements, from 16-bit to 32-bit signed integers. Unfortunately, this is not the case. This particular instruction does not increase the size of the element holding the result values. You will see how this affects the result in the example.
1616

@@ -63,11 +63,11 @@ b : a0 8c 78 64 50 3c 28 14
6363
_mm_madd_epi16(a, b) : a4d8 0 56b8 0 2198 0 578 0
6464
```
6565

66-
You will note that the result of the first element is a negative number, even though we added 2 positive results (`130*140` and `150*160`). That is because the result of the addition has to occupy a 16-bit signed integer element and when the first is larger we have the effect of an negative overflow. The result is the same in binary arithmetic, but when interpreted into a signed integer, it turns the number into a negative.
66+
You will note that the result of the first element is a negative number, even though you added 2 positive results (`130*140` and `150*160`). This is because the result of the addition has to occupy a 16-bit signed integer element, and when the first is larger we have the effect of an negative overflow. The result is the same in binary arithmetic, but when interpreted into a signed integer, it turns the number into a negative.
6767

68-
The rest of the values are as expected. Notice how each pair has a zero element next to it. The results are correct, but they are not in the correct order. In this example, you used **`vmovl`** to zero-extend values, which achieves the correct order with zero elements in place. While both **`vmovl`** and **`zip`** could be used for this purpose, **`vmovl`** was chosen in this implementation. For more details, see the Arm Software Optimization Guides, such as the [Neoverse V2 guide](https://developer.arm.com/documentation/109898/latest/).
68+
The rest of the values are as expected. Notice how each pair has a zero element next to it. The results are correct, but they are not in the correct order. In this example, you used **`vmovl`** to zero-extend values, which achieves the correct order with zero elements in place. While both **`vmovl`** and **`zip`** can be used for this purpose, **`vmovl`** was chosen in this implementation. For more details, see the Arm Software Optimization Guides, such as the [Neoverse V2 guide](https://developer.arm.com/documentation/109898/latest/).
6969

70-
Now switch your Linux Arm machine and create a file called `_mm_madd_epi16_neon.c` with the contents below:
70+
Now switch to your Linux Arm machine and create a file called `_mm_madd_epi16_neon.c`, populating it with the contents below:
7171
```C
7272
#include <arm_neon.h>
7373
#include <stdint.h>
@@ -130,5 +130,5 @@ vpaddq_s16(a, b) : a4d8 56b8 2198 578 0 0 0 0
130130
final : a4d8 0 56b8 0 2198 0 578 0
131131
```
132132

133-
As you can see the results of both executions on different architectures match. You used SIMD.info to help with the translation of complex intrinsics between different SIMD architectures.
133+
As you can see, the results of both executions on different architectures match. You used SIMD.info to help with the translation of complex intrinsics between different SIMD architectures.
134134

0 commit comments

Comments
 (0)