Commit 9208c3a

Merge pull request #2 from gMerm/LP_Corrections
Alterations based on the comments
2 parents: 720bcbe + 7380861

File tree: 4 files changed, +18 −23 lines

content/learning-paths/cross-platform/simd-info-demo/conclusion.md (1 addition, 1 deletion)

```diff
@@ -8,7 +8,7 @@ layout: learningpathall
 
 ### Conclusion and Additional Resources
 
-Porting SIMD code between architectures can be a daunting process, in many cases requiring many hours of studying multiple ISAs in online resources or ISA manuals of thousands of pages.
+Porting SIMD code between architectures can be a daunting process, in many cases requiring many hours of studying multiple ISAs in online resources or ISA manuals of thousands of pages. Our primary focus in this work was to optimize the existing algorithm directly with SIMD intrinsics, without altering the algorithm or data layout. While reordering data to align with native ARM instructions could offer performance benefits, our scope remained within the constraints of the current data layout and algorithm. For those interested in data layout strategies to further enhance performance on ARM, the [vectorization-friendly data layout learning path](https://learn.arm.com/learning-paths/cross-platform/vectorization-friendly-data-layout/) offers valuable insights.
 
 Using **[SIMD.info](https://simd.info)** can be instrumental in reducing the amount of time spent in this process, providing a centralized and user-friendly resource for finding **NEON** equivalents to intrinsics of other architectures. It saves considerable time and effort by offering detailed descriptions, prototypes, and comparisons directly, eliminating the need for extensive web searches and manual lookups.
 
```

content/learning-paths/cross-platform/simd-info-demo/simdinfo-description.md (1 addition, 9 deletions)

```diff
@@ -64,12 +64,4 @@ When you need to port code from one architecture to another, the advanced search
 
 Furthermore, **SIMD.info**’s comparison tools enhance this process by enabling side-by-side comparisons of instructions from various platforms. This feature highlights the similarities and differences between instructions, which is crucial for accurately adapting your code. By understanding how similar operations are implemented across architectures, you can ensure that your ported code performs optimally.
 
-Let's look at an actual example.
-
-
-
-
-
-<!-- IMAGE HERE:
-![example image alt-text#center](example-picture.png "Figure 1. Example image caption") -->
+Let's look at an actual example.
```

content/learning-paths/cross-platform/simd-info-demo/simdinfo-example1-porting.md (8 additions, 8 deletions)

```diff
@@ -20,9 +20,12 @@ Create a new file for the ported NEON code named `calculation_neon.c` with the c
 #include <arm_neon.h>
 #include <stdio.h>
 
+float32_t a_array[4] = {1.0f, 4.0f, 9.0f, 16.0f};
+float32_t b_array[4] = {1.0f, 2.0f, 3.0f, 4.0f};
+
 int main() {
-    float32x4_t a = {1.0f, 4.0f, 9.0f, 16.0f};
-    float32x4_t b = {1.0f, 2.0f, 3.0f, 4.0f};
+    float32x4_t a = vld1q_f32(a_array);
+    float32x4_t b = vld1q_f32(b_array);
 
     uint32x4_t cmp_result = vcgtq_f32(a, b);
 
@@ -91,9 +94,6 @@ Square Root Result: 1.41 3.46 6.00 8.94
 
 You can see that the results are the same as in the **SSE4.2** example.
 
-{{% notice Note %}}
-We initialized the vectors in reverse order compared to the **SSE4.2** version because **{}** bracket initialization loads vectors from LSB to MSB, whereas **`_mm_set_ps`** loads the elements MSB to LSB.
-{{% /notice %}}
-
-
-
+{{% notice Note %}}
+We initialized the vectors in reverse order compared to the **SSE4.2** version because the array initialization and the **`vld1q_f32`** function load vectors from LSB to MSB, whereas **`_mm_set_ps`** loads elements MSB to LSB.
+{{% /notice %}}
```

content/learning-paths/cross-platform/simd-info-demo/simdinfo-example2.md (8 additions, 5 deletions)

````diff
@@ -63,7 +63,7 @@ _mm_madd_epi16(a, b) : a4d8 0 56b8 0 2198 0 578 0
 
 You will note that the result of the first element is a negative number, even though we added two positive products (`130*140` and `150*160`). That is because the result of the addition has to fit in a 16-bit signed integer element, and when the sum exceeds the signed 16-bit range it overflows. The bit pattern is the same in binary arithmetic, but when interpreted as a signed integer the number becomes negative.
 
-The rest of the values are as expected. Notice how each pair has a zero element next to it. The results are correct, but they are not in the correct order. You could get the correct order in multiple ways, using the widening intrinsics **`vmovl`** to zero-extend or using the **`zip`** ones to merge with zero elements. The fastest way is the **`vmovl`** intrinsics, as you can see in the next example:
+The rest of the values are as expected. Notice how each pair has a zero element next to it. The results are correct, but they are not in the correct order. In this example, we chose to use **`vmovl`** to zero-extend values, which achieves the correct order with zero elements in place. While both **`vmovl`** and **`zip`** could be used for this purpose, we opted for **`vmovl`** in this implementation. For more details, see the ARM Software Optimization Guides, such as the [Neoverse V2 guide](https://developer.arm.com/documentation/109898/latest/).
 
 ```C
 #include <arm_neon.h>
````
```diff
@@ -74,13 +74,16 @@ void print_s16x8(char *label, int16x8_t v) {
     int16_t out[8];
     vst1q_s16(out, v);
     printf("%-*s: ", 30, label);
-    for (size_t i=0; i < 8; i++) printf("%4x ", (uint16_t) out[i]);
+    for (size_t i = 0; i < 8; i++) printf("%4x ", (uint16_t)out[i]);
     printf("\n");
 }
 
+int16_t a_array[8] = {150, 130, 110, 90, 70, 50, 30, 10};
+int16_t b_array[8] = {160, 140, 120, 100, 80, 60, 40, 20};
+
 int main() {
-    int16x8_t a = { 150, 130, 110, 90, 70, 50, 30, 10 };
-    int16x8_t b = { 160, 140, 120, 100, 80, 60, 40, 20 };
+    int16x8_t a = vld1q_s16(a_array);
+    int16x8_t b = vld1q_s16(b_array);
     int16x8_t zero = vdupq_n_s16(0);
     // 130 * 140 = 18200, 150 * 160 = 24000
     // adding them as 32-bit signed integers -> 42200
```
```diff
@@ -94,7 +97,7 @@ int main() {
     res = vpaddq_s16(res, zero);
     print_s16x8("vpaddq_s16(a, b)", res);
 
-    // vmovl_s16 would sign-extend we just want to zero-extend
+    // vmovl_s16 would sign-extend; we just want to zero-extend
     // so we need to cast to uint16, vmovl_u16 and then cast back to int16
     uint16x4_t res_u16 = vget_low_u16(vreinterpretq_u16_s16(res));
     res = vreinterpretq_s16_u32(vmovl_u16(res_u16));
```
