Commit 9778131

Merge pull request #1383 from pareenaverma/content_review
Tech review of SIMD.info LP
2 parents cab68ba + 17d033b commit 9778131

File tree

7 files changed (+22, -34 lines)
content/learning-paths/cross-platform/simd-info-demo/_index.md

Lines changed: 1 addition & 1 deletion
@@ -3,7 +3,7 @@ title: Introduction to SIMD.info
  minutes_to_complete: 30

- who_is_this_for: This is for software developers interested in porting SIMD code across platforms.
+ who_is_this_for: This is an advanced topic for software developers interested in porting SIMD code across Arm platforms.

  learning_objectives:
  - Learn how to use SIMD.info’s tools and features, such as navigation, search, and comparison, to simplify the process of finding equivalent SIMD intrinsics between architectures and improving code portability.

content/learning-paths/cross-platform/simd-info-demo/conclusion.md

Lines changed: 1 addition & 1 deletion
@@ -8,7 +8,7 @@ layout: learningpathall
  ### Conclusion and Additional Resources

- Porting SIMD code between architecture can be a daunting process, in many cases requiring many hours of studying multiple ISAs in online resources or ISA manuals of thousands pages. Our primary focus in this work was to optimize the existing algorithm directly with SIMD intrinsics, without altering the algorithm or data layout. While reordering data to align with native ARM instructions could offer performance benefits, our scope remained within the constraints of the current data layout and algorithm. For those interested in data layout strategies to further enhance performance on ARM, the [vectorization-friendly data layout learning path](https://learn.arm.com/learning-paths/cross-platform/vectorization-friendly-data-layout/) offers valuable insights.
+ Porting SIMD code between architectures can be a daunting process, in many cases requiring many hours of studying multiple ISAs in online resources or ISA manuals thousands of pages long. Our primary focus in this work was to optimize the existing algorithm directly with SIMD intrinsics, without altering the algorithm or data layout. While reordering data to align with native Arm instructions could offer performance benefits, our scope remained within the constraints of the current data layout and algorithm. For those interested in data layout strategies to further enhance performance on Arm, the [vectorization-friendly data layout learning path](https://learn.arm.com/learning-paths/cross-platform/vectorization-friendly-data-layout/) offers valuable insights.

  Using **[SIMD.info](https://simd.info)** can be instrumental in reducing the amount of time spent in this process, providing a centralized and user-friendly resource for finding **NEON** equivalents to intrinsics of other architectures. It saves considerable time and effort by offering detailed descriptions, prototypes, and comparisons directly, eliminating the need for extensive web searches and manual lookups.

Lines changed: 3 additions & 4 deletions
@@ -1,5 +1,5 @@
  ---
- title: Overview & Context
+ title: Overview
  weight: 2

### FIXED, DO NOT MODIFY
@@ -9,9 +9,8 @@ layout: learningpathall
  ### The Challenge of SIMD Code Portability
  One of the biggest challenges developers face when working with SIMD code is making it portable across different platforms. SIMD instructions are designed to increase performance by executing the same operation on multiple data elements in parallel. However, each architecture has its own set of SIMD instructions, making it difficult to write code that works on all of them without major changes to the code and/or algorithm.

- Consider you have the task of porting a software written using Intel intrinsics, like SSE/AVX/AVX512, to Arm Neon.
- The differences in instruction sets and data handling require careful attention.
+ To port software written using Intel intrinsics, like SSE/AVX/AVX512, to Arm Neon, you have to pay attention to data handling with the different instruction sets.

- This lack of portability increases development time and introduces the risk of errors during the porting process. Currently, developers rely on ISA documentation and manually search across various vendor platforms like [ARM Developer](https://developer.arm.com/architectures/instruction-sets/intrinsics/) and [Intel Intrinsics Guide](https://www.intel.com/content/www/us/en/docs/intrinsics-guide/index.html) to find equivalent instructions.
+ Having to port the code between architectures can increase development time and introduce the risk of errors during the porting process. Currently, developers rely on ISA documentation and manually search across various vendor platforms like [Arm Developer](https://developer.arm.com/architectures/instruction-sets/intrinsics/) and [Intel Intrinsics Guide](https://www.intel.com/content/www/us/en/docs/intrinsics-guide/index.html) to find equivalent instructions.

  [SIMD.info](https://simd.info) aims to solve this by helping you find equivalent instructions and providing a more streamlined way to adapt your code for different architectures.

content/learning-paths/cross-platform/simd-info-demo/simdinfo-description.md

Lines changed: 2 additions & 16 deletions
@@ -46,22 +46,8 @@ An example of how the tree structure looks like:
  - **Advanced search functionality:** With its robust search engine, **SIMD.info** allows you to either search for a specific intrinsic (e.g. `vaddq_f64`) or enter more general terms (e.g. *How to add 2 vectors*), and it will return a list of the corresponding intrinsics. You can also filter results based on the specific engine you're working with, such as **NEON**, **SSE4.2**, **AVX**, **AVX512**, **VSX**. This functionality streamlines the process of finding the right commands tailored to your needs.

- - **Comparison tools:** This feature lets you directly compare SIMD instructions from different (or the same) platforms side by side, offering a clear view of the similarities and differences. It’s an invaluable tool for porting code across architectures, as it ensures accuracy and efficiency.
+ - **Comparison tools:** This feature lets you directly compare SIMD instructions from different (or the same) platforms side by side, offering a clear view of the similarities and differences. It’s a very helpful tool for porting code across architectures, as it ensures accuracy and efficiency.

  - **Discussion forum (like StackOverflow):** The integrated discussion forum, powered by **[Disqus](https://disqus.com/)**, allows users to ask questions, share insights, and troubleshoot problems together. This community-driven space ensures that you’re never stuck on a complex issue without support, fostering collaboration and knowledge-sharing among SIMD developers. Imagine something like **StackOverflow** but specific to SIMD intrinsics.

- ### Work in Progress & Future Development
- - **Pseudo-code:** Currently under development, this feature will enable users to generate high-level pseudo-code based on specific SIMD instructions. This tool aims to enable better understanding of the SIMD instructions, in a *common language*. This will also be used in the next feature, **Intrinsics Diagrams**.
-
- - **Intrinsics Diagrams:** A feature under progress, creating detailed diagrams for each intrinsic to visualize how it operates on a low level using registers. These diagrams will help you grasp the mechanics of SIMD instructions more clearly, aiding in optimization and debugging.
-
- - **[SIMD.ai](https://simd.ai/):** SIMD.ai is an upcoming feature that promises to bring AI-assisted insights and recommendations to the SIMD development process, making it faster and more efficient to port SIMD code between architectures.
-
- ### How These Features Aid in SIMD Development
- **[SIMD.info](https://simd.info/)** offers a range of features that streamline the process of porting SIMD code across different architectures. The hierarchical structure of tree-based navigation allows you to easily locate instructions within a clear framework. This organization into broad categories and specific subcategories, such as **Arithmetic** and **Boolean Logic**, makes it straightforward to identify the relevant SIMD instructions.
-
- When you need to port code from one architecture to another, the advanced search functionality proves invaluable. You can either search for specific intrinsics or use broader terms to find equivalent instructions across platforms. This capability ensures that you quickly find the right intrinsics for Arm, Intel or Power architectures.
-
- Furthermore, **SIMD.info**’s comparison tools enhance this process by enabling side-by-side comparisons of instructions from various platforms. This feature highlights the similarities and differences between instructions, which is crucial for accurately adapting your code. By understanding how similar operations are implemented across architectures, you can ensure that your ported code performs optimally.
-
- Let's look at an actual example.
+ You can now learn how to use these features in the context of an actual example.

content/learning-paths/cross-platform/simd-info-demo/simdinfo-example1-porting.md

Lines changed: 4 additions & 4 deletions
@@ -14,7 +14,7 @@ layout: learningpathall
  After identifying the **NEON** intrinsics you will need in the ported program, it's time to actually write the code.

- Create a new file for the ported NEON code named `calculation_neon.c` with the contents shown below:
+ This time, on your Arm Linux machine, create a new file for the ported NEON code named `calculation_neon.c` with the contents shown below:

  ```C
  #include <arm_neon.h>
@@ -68,7 +68,7 @@ int main() {

  It's time to verify that the functionality remains the same, which means you get the same results and similar performance.

- Compile the above code as follows on an Arm system:
+ Compile the above code as follows on your Arm Linux machine:

  ```bash
  gcc -O3 calculation_neon.c -o calculation_neon
@@ -95,5 +95,5 @@ Square Root Result: 1.41 3.46 6.00 8.94
  You can see that the results are the same as in the **SSE4.2** example.

  {{% notice Note %}}
- We initialized the vectors in reverse order compared to the **SSE4.2** version because the array initialization and vld1q_f32 function load vectors from LSB to MSB, whereas **`_mm_set_ps`** loads elements MSB to LSB.
- {{% /notice %}}
+ You initialized the vectors in reverse order compared to the **SSE4.2** version because array initialization and the `vld1q_f32` function load vectors from LSB to MSB, whereas **`_mm_set_ps`** loads elements from MSB to LSB.
+ {{% /notice %}}

content/learning-paths/cross-platform/simd-info-demo/simdinfo-example1.md

Lines changed: 5 additions & 3 deletions
@@ -8,7 +8,7 @@ layout: learningpathall
  Consider the following C example that uses Intel SSE4.2 intrinsics.

- Create a file named `calculation_sse.c` with the contents shown below.
+ On an x86_64 Linux development machine, create a file named `calculation_sse.c` with the contents shown below:

  ```C
  #include <xmmintrin.h>
@@ -54,12 +54,14 @@ int main() {

  The program first compares whether elements in one vector are greater than those in another vector, prints the result, and then proceeds to compute the addition of two vectors, multiplies the result by one of the vectors, and finally takes the square root of the multiplication result:

- Compile the code as follows on an Intel system that supports **SSE4.2**:
+ Compile the code on your Linux x86_64 system that supports **SSE4.2**:
+
  ```bash
  gcc -O3 calculation_sse.c -o calculation_sse -msse4.2
  ```

  Now run the program:
+
  ```bash
  ./calculation_sse
  ```
@@ -76,4 +78,4 @@ Multiplication Result: 2.00 12.00 36.00 80.00
  Square Root Result: 1.41 3.46 6.00 8.94
  ```

- It is imperative that you run the code first on the reference platform (here Intel), to make sure you understand how it works and what kind of results are being expected.
+ It is imperative that you run the code first on an Intel x86_64 reference platform, to make sure you understand how it works and what results are expected.

content/learning-paths/cross-platform/simd-info-demo/simdinfo-example2.md

Lines changed: 6 additions & 5 deletions
@@ -12,7 +12,7 @@ During the porting process, you will observe that certain instructions translate
  You may already know the equivalent operations for this particular intrinsic, but let's assume you don't. In this use case, reading the **`_mm_madd_epi16`** entry on **SIMD.info** might indicate that a key characteristic of the instruction involved is the *widening* of the result elements, from 16-bit to 32-bit signed integers. Unfortunately, that is not the case, as this particular instruction does not actually increase the size of the element holding the result values. You will see how that affects the result in the example.

- Consider the following code for **SSE2**. Create a new file for the code named `_mm_madd_epi16_test.c` with the contents shown below:
+ Consider the following code for **SSE2**. Create a new file on your x86_64 Linux machine named `_mm_madd_epi16_test.c` with the contents shown below:

  ```C
  #include <stdint.h>
@@ -44,7 +44,7 @@ int main() {
  }
  ```

- Compile the code as follows on an x86 system (no extra flags required as **SSE2** is assumed by default on all 64-bit x86 systems):
+ Compile the code as follows on the x86_64 system (no extra flags required as **SSE2** is assumed by default on all 64-bit x86 systems):
  ```bash
  gcc -O3 _mm_madd_epi16_test.c -o _mm_madd_epi16_test
  ```
@@ -63,8 +63,9 @@ _mm_madd_epi16(a, b) : a4d8 0 56b8 0 2198 0 578 0

  You will note that the result of the first element is a negative number, even though we added two positive products (`130*140` and `150*160`). That is because the result of the addition has to occupy a 16-bit signed integer element, and when the sum is larger than that range we get the effect of a negative overflow. The bit pattern is the same in binary arithmetic, but when interpreted as a signed integer, the number turns negative.

- The rest of the values are as expected. Notice how each pair has a zero element next to it. The results are correct, but they are not in the correct order. In this example, we chose to use **`vmovl`** to zero-extend values, which achieves the correct order with zero elements in place. While both **`vmovl`** and **`zip`** could be used for this purpose, we opted for **`vmovl`** in this implementation. For more details, see the ARM Software Optimization Guides, such as the [Neoverse V2 guide](https://developer.arm.com/documentation/109898/latest/).
+ The rest of the values are as expected. Notice how each pair has a zero element next to it. The results are correct, but they are not in the correct order. In this example, you used **`vmovl`** to zero-extend values, which achieves the correct order with zero elements in place. While both **`vmovl`** and **`zip`** could be used for this purpose, **`vmovl`** was chosen in this implementation. For more details, see the Arm Software Optimization Guides, such as the [Neoverse V2 guide](https://developer.arm.com/documentation/109898/latest/).

+ Now switch to your Arm Linux machine and create a file called `_mm_madd_epi16_neon.c` with the contents below:
  ```C
  #include <arm_neon.h>
  #include <stdint.h>
@@ -107,7 +108,7 @@ int main() {
  }
  ```

- Write the above program to a file called `_mm_madd_epi16_neon.c` and compile it:
+ Compile the code on your Arm Linux machine:

  ```bash
  gcc -O3 _mm_madd_epi16_neon.c -o _mm_madd_epi16_neon
@@ -127,5 +128,5 @@ vpaddq_s16(a, b) : a4d8 56b8 2198 578 0 0 0 0
  final : a4d8 0 56b8 0 2198 0 578 0
  ```

- As you can see the results of both match, **SIMD.info** was especially helpful in this process, providing detailed descriptions and examples that guided the translation of complex intrinsics between different SIMD architectures.
+ As you can see, the results of both executions on different architectures match. You were able to use **SIMD.info** to help with the translation of complex intrinsics between different SIMD architectures.