
Commit b429e68

Based on the comments, we altered the data loading, used global scope, and made some minor fixes in other places.

1 parent 720bcbe commit b429e68

File tree

4 files changed, +35 -37 lines changed

content/learning-paths/cross-platform/simd-info-demo/simdinfo-description.md

Lines changed: 1 addition & 9 deletions

@@ -64,12 +64,4 @@ When you need to port code from one architecture to another, the advanced search
 
 Furthermore, **SIMD.info**’s comparison tools enhance this process by enabling side-by-side comparisons of instructions from various platforms. This feature highlights the similarities and differences between instructions, which is crucial for accurately adapting your code. By understanding how similar operations are implemented across architectures, you can ensure that your ported code performs optimally.
 
-Let's look at an actual example.
-
-
-
-
-
-
-<!-- IMAGE HERE:
-![example image alt-text#center](example-picture.png "Figure 1. Example image caption") -->
+Let's look at an actual example.

content/learning-paths/cross-platform/simd-info-demo/simdinfo-example1-porting.md

Lines changed: 6 additions & 10 deletions

@@ -20,9 +20,12 @@ Create a new file for the ported NEON code named `calculation_neon.c` with the c
 #include <arm_neon.h>
 #include <stdio.h>
 
+float32_t a_array[4] = {1.0f, 4.0f, 9.0f, 16.0f};
+float32_t b_array[4] = {1.0f, 2.0f, 3.0f, 4.0f};
+
 int main() {
-    float32x4_t a = {1.0f, 4.0f, 9.0f, 16.0f};
-    float32x4_t b = {1.0f, 2.0f, 3.0f, 4.0f};
+    float32x4_t a = vld1q_f32(a_array);
+    float32x4_t b = vld1q_f32(b_array);
 
     uint32x4_t cmp_result = vcgtq_f32(a, b);
 
@@ -89,11 +92,4 @@ Multiplication Result: 2.00 12.00 36.00 80.00
 Square Root Result: 1.41 3.46 6.00 8.94
 ```
 
-You can see that the results are the same as in the **SSE4.2** example.
-
-{{% notice Note %}}
-We initialized the vectors in reverse order compared to the **SSE4.2** version because **{}** bracket initialization loads vectors from LSB to MSB, whereas **`_mm_set_ps`** loads the elements MSB to LSB.
-{{% /notice %}}
-
-
-
+You can see that the results are the same as in the **SSE4.2** example.

content/learning-paths/cross-platform/simd-info-demo/simdinfo-example1.md

Lines changed: 5 additions & 2 deletions

@@ -14,9 +14,12 @@ Create a file named `calculation_sse.c` with the contents shown below.
 #include <xmmintrin.h>
 #include <stdio.h>
 
+float a_array[4] = {1.0f, 4.0f, 9.0f, 16.0f};
+float b_array[4] = {1.0f, 2.0f, 3.0f, 4.0f};
+
 int main() {
-    __m128 a = _mm_set_ps(16.0f, 9.0f, 4.0f, 1.0f);
-    __m128 b = _mm_set_ps(4.0f, 3.0f, 2.0f, 1.0f);
+    __m128 a = _mm_loadu_ps(a_array);
+    __m128 b = _mm_loadu_ps(b_array);
 
     __m128 cmp_result = _mm_cmpgt_ps(a, b);
 
content/learning-paths/cross-platform/simd-info-demo/simdinfo-example2.md

Lines changed: 23 additions & 16 deletions

@@ -27,9 +27,13 @@ void print_s16x8(char *label, __m128i v) {
     printf("\n");
 }
 
+int16_t a_array[8] = {10, 30, 50, 70, 90, 110, 130, 150};
+int16_t b_array[8] = {20, 40, 60, 80, 100, 120, 140, 160};
+
 int main() {
-    __m128i a = _mm_set_epi16(10, 30, 50, 70, 90, 110, 130, 150);
-    __m128i b = _mm_set_epi16(20, 40, 60, 80, 100, 120, 140, 160);
+
+    __m128i a = _mm_loadu_si128((__m128i*)a_array);
+    __m128i b = _mm_loadu_si128((__m128i*)b_array);
     // 130 * 140 = 18200, 150 * 160 = 24000
     // adding them as 32-bit signed integers -> 42200
     // adding them as 16-bit signed integers -> -23336 (overflow!)
@@ -56,14 +60,14 @@ Now run the program:
 
 The output should look like:
 ```output
-a                             : 96 82 6e 5a 46 32 1e a
-b                             : a0 8c 78 64 50 3c 28 14
-_mm_madd_epi16(a, b)          : a4d8 0 56b8 0 2198 0 578 0
+a                             : a 1e 32 46 5a 6e 82 96
+b                             : 14 28 3c 50 64 78 8c a0
+_mm_madd_epi16(a, b)          : 578 0 2198 0 56b8 0 a4d8 0
 ```
 
-You will note that the result of the first element is a negative number, even though we added 2 positive results (`130*140` and `150*160`). That is because the result of the addition has to occupy a 16-bit signed integer element and when the first is larger we have the effect of an negative overflow. The result is the same in binary arithmetic, but when interpreted into a signed integer, it turns the number into a negative.
+You will note that the result of the last element is a negative number, even though we added two positive products (`130*140` and `150*160`). That is because the sum has to fit into a 16-bit signed integer element, and when it exceeds that range it wraps around. The bit pattern is the same in binary arithmetic, but interpreted as a signed integer the value becomes negative.
 
-The rest of the values are as expected. Notice how each pair has a zero element next to it. The results are correct, but they are not in the correct order. You could get the correct order in multiple ways, using the widening intrinsics **`vmovl`** to zero-extend or using the **`zip`** ones to merge with zero elements. The fastest way is the **`vmovl`** intrinsics, as you can see in the next example:
+The rest of the values are as expected. Notice how each pair has a zero element next to it. The results are correct, but they are not in the correct order. In this example we use the widening **`vmovl`** intrinsics to zero-extend the values, which achieves the correct order with the zero elements in place; the **`zip`** intrinsics, which merge with zero elements, would also work. For more details, see the Arm Software Optimization Guides, such as the [Neoverse V2 guide](https://developer.arm.com/documentation/109898/latest/).
 
 ```C
 #include <arm_neon.h>
@@ -74,13 +78,16 @@ void print_s16x8(char *label, int16x8_t v) {
     int16_t out[8];
     vst1q_s16(out, v);
     printf("%-*s: ", 30, label);
-    for (size_t i=0; i < 8; i++) printf("%4x ", (uint16_t) out[i]);
+    for (size_t i = 0; i < 8; i++) printf("%4x ", (uint16_t)out[i]);
     printf("\n");
 }
 
+int16_t a_array[8] = {10, 30, 50, 70, 90, 110, 130, 150};
+int16_t b_array[8] = {20, 40, 60, 80, 100, 120, 140, 160};
+
 int main() {
-    int16x8_t a = { 150, 130, 110, 90, 70, 50, 30, 10 };
-    int16x8_t b = { 160, 140, 120, 100, 80, 60, 40, 20 };
+    int16x8_t a = vld1q_s16(a_array);
+    int16x8_t b = vld1q_s16(b_array);
     int16x8_t zero = vdupq_n_s16(0);
     // 130 * 140 = 18200, 150 * 160 = 24000
     // adding them as 32-bit signed integers -> 42200
@@ -94,7 +101,7 @@ int main() {
     res = vpaddq_s16(res, zero);
     print_s16x8("vpaddq_s16(a, b)", res);
 
-    // vmovl_s16 would sign-extend; we just want to zero-extend
+    // vmovl_s16 would sign-extend; we just want to zero-extend,
     // so we need to cast to uint16, vmovl_u16 and then cast back to int16
     uint16x4_t res_u16 = vget_low_u16(vreinterpretq_u16_s16(res));
     res = vreinterpretq_s16_u32(vmovl_u16(res_u16));
@@ -117,11 +124,11 @@ Now run the program:
 
 The output should look like:
 ```output
-a                             : 96 82 6e 5a 46 32 1e a
-b                             : a0 8c 78 64 50 3c 28 14
-vmulq_s16(a, b)               : 5dc0 4718 3390 2328 15e0 bb8 4b0 c8
-vpaddq_s16(a, b)              : a4d8 56b8 2198 578 0 0 0 0
-final                         : a4d8 0 56b8 0 2198 0 578 0
+a                             : a 1e 32 46 5a 6e 82 96
+b                             : 14 28 3c 50 64 78 8c a0
+vmulq_s16(a, b)               : c8 4b0 bb8 15e0 2328 3390 4718 5dc0
+vpaddq_s16(a, b)              : 578 2198 56b8 a4d8 0 0 0 0
+final                         : 578 0 2198 0 56b8 0 a4d8 0
 ```
 
 As you can see, the results of both versions match. **SIMD.info** was especially helpful in this process, providing detailed descriptions and examples that guided the translation of complex intrinsics between different SIMD architectures.
