content/learning-paths/cross-platform/simd-info-demo/_index.md: 1 addition & 1 deletion
@@ -3,7 +3,7 @@ title: Introduction to SIMD.info
minutes_to_complete: 30
-who_is_this_for: This is for software developers interested in porting SIMD code across platforms.
+who_is_this_for: This is an advanced topic for software developers interested in porting SIMD code across Arm platforms.
learning_objectives:
- Learn how to use SIMD.info’s tools and features, such as navigation, search, and comparison, to simplify the process of finding equivalent SIMD intrinsics between architectures and improving code portability.
content/learning-paths/cross-platform/simd-info-demo/conclusion.md: 1 addition & 1 deletion
@@ -8,7 +8,7 @@ layout: learningpathall
### Conclusion and Additional Resources
-Porting SIMD code between architecture can be a daunting process, in many cases requiring many hours of studying multiple ISAs in online resources or ISA manuals of thousands pages. Our primary focus in this work was to optimize the existing algorithm directly with SIMD intrinsics, without altering the algorithm or data layout. While reordering data to align with native ARM instructions could offer performance benefits, our scope remained within the constraints of the current data layout and algorithm. For those interested in data layout strategies to further enhance performance on ARM, the [vectorization-friendly data layout learning path](https://learn.arm.com/learning-paths/cross-platform/vectorization-friendly-data-layout/) offers valuable insights.
+Porting SIMD code between architectures can be a daunting process, in many cases requiring many hours of studying multiple ISAs in online resources or in ISA manuals thousands of pages long. Our primary focus in this work was to optimize the existing algorithm directly with SIMD intrinsics, without altering the algorithm or data layout. While reordering data to align with native Arm instructions could offer performance benefits, our scope remained within the constraints of the current data layout and algorithm. For those interested in data layout strategies to further enhance performance on Arm, the [vectorization-friendly data layout learning path](https://learn.arm.com/learning-paths/cross-platform/vectorization-friendly-data-layout/) offers valuable insights.
Using **[SIMD.info](https://simd.info)** can be instrumental in reducing the amount of time spent in this process, providing a centralized and user-friendly resource for finding **NEON** equivalents to intrinsics of other architectures. It saves considerable time and effort by offering detailed descriptions, prototypes, and comparisons directly, eliminating the need for extensive web searches and manual lookups.
One of the biggest challenges developers face when working with SIMD code is making it portable across different platforms. SIMD instructions are designed to increase performance by executing the same operation on multiple data elements in parallel. However, each architecture has its own set of SIMD instructions, making it difficult to write code that works on all of them without major changes to the code and/or algorithm.
-Consider you have the task of porting a software written using Intel intrinsics, like SSE/AVX/AVX512, to Arm Neon.
-The differences in instruction sets and data handling require careful attention.
+To port software written using Intel intrinsics, such as SSE/AVX/AVX512, to Arm Neon, you must pay careful attention to how the different instruction sets handle data.
-This lack of portability increases development time and introduces the risk of errors during the porting process. Currently, developers rely on ISA documentation and manually search across various vendor platforms like [ARM Developer](https://developer.arm.com/architectures/instruction-sets/intrinsics/) and [Intel Intrinsics Guide](https://www.intel.com/content/www/us/en/docs/intrinsics-guide/index.html) to find equivalent instructions.
+Porting code between architectures increases development time and introduces the risk of errors. Currently, developers rely on ISA documentation and manually search across various vendor platforms like [Arm Developer](https://developer.arm.com/architectures/instruction-sets/intrinsics/) and [Intel Intrinsics Guide](https://www.intel.com/content/www/us/en/docs/intrinsics-guide/index.html) to find equivalent instructions.
[SIMD.info](https://simd.info) aims to solve this by helping you find equivalent instructions and providing a more streamlined way to adapt your code for different architectures.
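As a concrete illustration of the kind of equivalence SIMD.info surfaces (a minimal sketch, not an example taken from the learning path itself): the SSE intrinsic `_mm_add_ps` and the NEON intrinsic `vaddq_f32` both add four single-precision floats lane by lane.

```C
/* Illustrative sketch: the same four-lane float addition expressed with
   each architecture's intrinsics. Compiles on either platform. */
#if defined(__SSE__)
  #include <xmmintrin.h>
  __m128 add_four_floats(__m128 a, __m128 b) {
      return _mm_add_ps(a, b);    /* SSE: add 4 x float32 */
  }
#elif defined(__ARM_NEON)
  #include <arm_neon.h>
  float32x4_t add_four_floats(float32x4_t a, float32x4_t b) {
      return vaddq_f32(a, b);     /* NEON: add 4 x float32 */
  }
#endif
```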
content/learning-paths/cross-platform/simd-info-demo/simdinfo-description.md: 2 additions & 16 deletions
@@ -46,22 +46,8 @@ An example of how the tree structure looks like:
-**Advanced search functionality:** With its robust search engine, **SIMD.info** allows you to either search for a specific intrinsic (e.g. `vaddq_f64`) or enter more general terms (e.g. *How to add 2 vectors*), and it will return a list of the corresponding intrinsics. You can also filter results based on the specific engine you're working with, such as **NEON**, **SSE4.2**, **AVX**, **AVX512**, **VSX**. This functionality streamlines the process of finding the right commands tailored to your needs.
--**Comparison tools:** This feature lets you directly compare SIMD instructions from different (or the same) platforms side by side, offering a clear view of the similarities and differences. It’s an invaluable tool for porting code across architectures, as it ensures accuracy and efficiency.
+-**Comparison tools:** This feature lets you directly compare SIMD instructions from different (or the same) platforms side by side, offering a clear view of the similarities and differences. It’s a very helpful tool for porting code across architectures, as it ensures accuracy and efficiency.
-**Discussion forum (like StackOverflow):** The integrated discussion forum, powered by **[Disqus](https://disqus.com/)**, allows users to ask questions, share insights, and troubleshoot problems together. This community-driven space ensures that you’re never stuck on a complex issue without support, fostering collaboration and knowledge-sharing among SIMD developers. Imagine something like **StackOverflow**, but specific to SIMD intrinsics.
-### Work in Progress & Future Development
-
--**Pseudo-code:** Currently under development, this feature will enable users to generate high-level pseudo-code based on specific SIMD instructions. This tool aims to enable better understanding of the SIMD instructions, in a *common language*. This will also be used in the next feature, **Intrinsics Diagrams**.
-
--**Intrinsics Diagrams:** A feature under progress, creating detailed diagrams for each intrinsic to visualize how it operates on a low level using registers. These diagrams will help you grasp the mechanics of SIMD instructions more clearly, aiding in optimization and debugging.
-
--**[SIMD.ai](https://simd.ai/):** SIMD.ai is an upcoming feature that promises to bring AI-assisted insights and recommendations to the SIMD development process, making it faster and more efficient to port SIMD code between architectures.
-
-### How These Features Aid in SIMD Development
-**[SIMD.info](https://simd.info/)** offers a range of features that streamline the process of porting SIMD code across different architectures. The hierarchical structure of tree-based navigation allows you to easily locate instructions within a clear framework. This organization into broad categories and specific subcategories, such as **Arithmetic** and **Boolean Logic**, makes it straightforward to identify the relevant SIMD instructions.
-
-When you need to port code from one architecture to another, the advanced search functionality proves invaluable. You can either search for specific intrinsics or use broader terms to find equivalent instructions across platforms. This capability ensures that you quickly find the right intrinsics for Arm, Intel or Power architectures.
-
-Furthermore, **SIMD.info**’s comparison tools enhance this process by enabling side-by-side comparisons of instructions from various platforms. This feature highlights the similarities and differences between instructions, which is crucial for accurately adapting your code. By understanding how similar operations are implemented across architectures, you can ensure that your ported code performs optimally.
-
-Let's look at an actual example.
+You can now learn how to use these features in the context of an actual example.
You can see that the results are the same as in the **SSE4.2** example.
{{% notice Note %}}
-We initialized the vectors in reverse order compared to the **SSE4.2** version because the array initialization and vld1q_f32 function load vectors from LSB to MSB, whereas **`_mm_set_ps`** loads elements MSB to LSB.
+You initialized the vectors in reverse order compared to the **SSE4.2** version because array initialization and the `vld1q_f32` function load vectors from LSB to MSB, whereas **`_mm_set_ps`** loads elements from MSB to LSB.
{{% /notice %}}
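To make the ordering concrete, here is a minimal sketch with assumed values (not the learning path's listing); both branches produce a vector whose lanes are {1, 2, 3, 4}:

```C
#include <stdio.h>
#if defined(__SSE__)
  #include <xmmintrin.h>
#elif defined(__ARM_NEON)
  #include <arm_neon.h>
#endif

int main(void) {
    float out[4];
#if defined(__SSE__)
    /* _mm_set_ps lists arguments from the most-significant lane down,
       so the literals appear "reversed". */
    __m128 v = _mm_set_ps(4.0f, 3.0f, 2.0f, 1.0f);
    _mm_storeu_ps(out, v);
#elif defined(__ARM_NEON)
    /* vld1q_f32 fills lanes from the lowest address upward,
       so the array is written in natural order. */
    float data[4] = {1.0f, 2.0f, 3.0f, 4.0f};
    float32x4_t v = vld1q_f32(data);
    vst1q_f32(out, v);
#endif
    printf("%g %g %g %g\n", out[0], out[1], out[2], out[3]); /* 1 2 3 4 on both */
    return 0;
}
```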
content/learning-paths/cross-platform/simd-info-demo/simdinfo-example1.md: 5 additions & 3 deletions
@@ -8,7 +8,7 @@ layout: learningpathall
Consider the following C example that uses Intel SSE4.2 intrinsics.
-Create a file named `calculation_sse.c` with the contents shown below.
+On an x86_64 Linux development machine, create a file named `calculation_sse.c` with the contents shown below:
```C
#include<xmmintrin.h>
@@ -54,12 +54,14 @@ int main() {
The program first compares the elements of one vector with those of another to see which are greater and prints the result; it then adds the two vectors, multiplies the sum by one of the vectors, and finally takes the element-wise square root of the product:
-Compile the code as follows on an Intel system that supports **SSE4.2**:
+Compile the code on your Linux x86_64 system that supports **SSE4.2**:
-It is imperative that you run the code first on the reference platform (here Intel), to make sure you understand how it works and what kind of results are being expected.
+It is imperative that you run the code first on an Intel x86_64 reference platform, to make sure you understand how it works and what results to expect.
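The full listing and the compile command are elided in this diff view. For orientation, the following is a minimal sketch of the kind of program the text describes; the values and output format are assumptions, not the actual `calculation_sse.c`:

```C
#include <stdio.h>
#include <string.h>
#include <xmmintrin.h>

int main(void) {
    __m128 a = _mm_set_ps(4.0f, 3.0f, 2.0f, 1.0f);  /* lanes {1,2,3,4} */
    __m128 b = _mm_set_ps(1.0f, 2.0f, 3.0f, 4.0f);  /* lanes {4,3,2,1} */
    float out[4];
    unsigned int mask[4];

    /* Per-lane a > b: each lane becomes all-ones (true) or all-zeros (false). */
    _mm_storeu_ps(out, _mm_cmpgt_ps(a, b));
    memcpy(mask, out, sizeof mask);
    printf("cmpgt: %08x %08x %08x %08x\n", mask[0], mask[1], mask[2], mask[3]);

    /* sqrt((a + b) * a), element-wise. */
    _mm_storeu_ps(out, _mm_sqrt_ps(_mm_mul_ps(_mm_add_ps(a, b), a)));
    printf("sqrt : %f %f %f %f\n", out[0], out[1], out[2], out[3]);
    return 0;
}
```

A typical compile-and-run invocation, assuming GCC, would be `gcc -O2 -msse4.2 -o calculation_sse calculation_sse.c && ./calculation_sse`.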
content/learning-paths/cross-platform/simd-info-demo/simdinfo-example2.md: 6 additions & 5 deletions
@@ -12,7 +12,7 @@ During the porting process, you will observe that certain instructions translate
You may already know the equivalent operations for this particular intrinsic, but let's assume you don't. In this use case, reading the **`_mm_madd_epi16`** entry on **SIMD.info** might suggest that a key characteristic of the instruction is the *widening* of the result elements, from 16-bit to 32-bit signed integers. Unfortunately, that is not the case, as this particular instruction does not actually increase the size of the element holding the result values. You will see how that affects the result in the example.
-Consider the following code for **SSE2**. Create a new file for the code named `_mm_madd_epi16_test.c` with the contents shown below:
+Consider the following code for **SSE2**. Create a new file on your x86_64 Linux machine named `_mm_madd_epi16_test.c` with the contents shown below:
```C
#include<stdint.h>
@@ -44,7 +44,7 @@ int main() {
}
```
-Compile the code as follows on an x86 system (no extra flags required as **SSE2** is assumed by default on all 64-bit x86 systems):
+Compile the code as follows on the x86_64 system (no extra flags required as **SSE2** is assumed by default on all 64-bit x86 systems):
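The command itself is elided from this diff view; a typical invocation, assuming GCC, would be `gcc -O2 -o _mm_madd_epi16_test _mm_madd_epi16_test.c`.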
You will note that the result of the first element is a negative number, even though we added two positive products (`130*140` and `150*160`). That is because the result of the addition has to occupy a 16-bit signed integer element, and when the sum exceeds the largest value that element can hold, it overflows. The bit pattern is the same in binary arithmetic, but when it is interpreted as a signed integer, the number turns negative.
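A quick check of the arithmetic makes the wrap-around concrete (plain C, not part of the learning path's listing):

```C
#include <stdint.h>
#include <stdio.h>

int main(void) {
    int32_t full = 130 * 140 + 150 * 160;  /* 18200 + 24000 = 42200 */
    int16_t wrapped = (int16_t)full;       /* 42200 - 65536 = -23336 */
    printf("%d -> %d\n", full, wrapped);
    return 0;
}
```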
-The rest of the values are as expected. Notice how each pair has a zero element next to it. The results are correct, but they are not in the correct order. In this example, we chose to use **`vmovl`** to zero-extend values, which achieves the correct order with zero elements in place. While both **`vmovl`** and **`zip`** could be used for this purpose, we opted for **`vmovl`** in this implementation. For more details, see the ARM Software Optimization Guides, such as the [Neoverse V2 guide](https://developer.arm.com/documentation/109898/latest/).
+The rest of the values are as expected. Notice how each pair has a zero element next to it. The results are correct, but they are not in the correct order. In this example, you used **`vmovl`** to zero-extend values, which achieves the correct order with zero elements in place. While both **`vmovl`** and **`zip`** could be used for this purpose, **`vmovl`** was chosen in this implementation. For more details, see the Arm Software Optimization Guides, such as the [Neoverse V2 guide](https://developer.arm.com/documentation/109898/latest/).
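As a sketch of what such a zero-extension step looks like (assuming unsigned 16-bit input lanes; the actual NEON listing of this example follows below and may differ):

```C
#include <arm_neon.h>

/* Widen eight u16 lanes into two u32x4 vectors. vmovl_u16 zero-extends,
   which is why a zero half appears next to each value. */
void widen_u16x8(uint16x8_t v, uint32x4_t *lo, uint32x4_t *hi) {
    *lo = vmovl_u16(vget_low_u16(v));   /* lanes 0-3 */
    *hi = vmovl_u16(vget_high_u16(v));  /* lanes 4-7 */
}
```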
+Now switch to your Arm Linux machine and create a file called `_mm_madd_epi16_neon.c` with the contents below:
```C
#include<arm_neon.h>
#include<stdint.h>
@@ -107,7 +108,7 @@ int main() {
}
```
-Write the above program to a file called `_mm_madd_epi16_neon.c` and compile it:
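The compile step is elided here; a typical invocation on the Arm machine, assuming GCC, would be `gcc -O2 -o _mm_madd_epi16_neon _mm_madd_epi16_neon.c`. No extra flags are needed, as NEON is part of the AArch64 baseline.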
-As you can see the results of both match, **SIMD.info** was especially helpful in this process, providing detailed descriptions and examples that guided the translation of complex intrinsics between different SIMD architectures.
+As you can see, the results of both executions on the different architectures match. You were able to use **SIMD.info** to help with the translation of complex intrinsics between different SIMD architectures.