content/learning-paths/servers-and-cloud-computing/sve2-match/_index.md
6 additions & 10 deletions
@@ -1,23 +1,19 @@
1
1
---
2
-
title: Accelerate Search Operations with SVE2 MATCH Instruction on Arm servers
2
+
title: Accelerate search performance with SVE2 MATCH on Arm servers
3
3
4
-
draft: true
5
-
cascade:
6
-
draft: true
7
4
8
5
minutes_to_complete: 20
9
6
10
7
who_is_this_for: This is an introductory topic for database developers, performance engineers, and anyone optimizing data processing workloads on Arm-based cloud instances.
11
8
12
9
13
10
learning_objectives:
14
-
- Understand how SVE2 MATCH instructions work
15
-
- Implement search algorithms using scalar and SVE2 implementations using the MATCH instruction
16
-
- Compare performance between different implementations
17
-
- Measure performance improvements on Graviton4 instances
11
+
- Understand the purpose and function of SVE2 MATCH instructions
12
+
- Implement a search algorithm using both scalar and SVE2-based MATCH approaches
13
+
- Benchmark and compare performance between scalar and vectorized implementations
14
+
- Analyze speedups and efficiency gains on Arm Neoverse-based Graviton4 instances
18
15
prerequisites:
19
-
- An [Arm based instance](/learning-paths/servers-and-cloud-computing/csp/) from an appropriate
20
-
cloud service provider.
16
+
- Access to an [Arm-based instance](/learning-paths/servers-and-cloud-computing/csp/) from a supported cloud service provider
content/learning-paths/servers-and-cloud-computing/sve2-match/sve2-match-search.md
31 additions & 33 deletions
@@ -1,6 +1,6 @@
1
1
---
2
2
# User change
3
-
title: "Compare performance of different Search implementations"
3
+
title: "Compare search performance using scalar and SVE2 MATCH on Arm Servers"
4
4
5
5
weight: 2
6
6
@@ -10,54 +10,51 @@ layout: "learningpathall"
10
10
---
11
11
## Introduction
12
12
13
-
Searching for specific values in large arrays is a fundamental operation in many applications, from databases to text processing. The performance of these search operations can significantly impact overall application performance, especially when dealing with large datasets.
14
-
15
-
In this learning path, you will learn how to use the SVE2 MATCH instructions available on Arm Neoverse V2 based AWS Graviton4 processors to optimize search operations in byte and half word arrays. You will compare the performance of scalar and SVE2 MATCH implementations to demonstrate the significant performance benefits of using specialized vector instructions.
13
+
Searching large arrays for specific values is a core task in performance-sensitive applications—from filtering records in a database to detecting patterns in text or images. On Arm Neoverse-based servers, SVE2 MATCH instructions unlock massive performance gains by vectorizing these operations. In this Learning Path, you’ll implement and benchmark both scalar and vectorized versions of search functions to see just how much faster your workloads can run.
16
14
17
15
## What is SVE2 MATCH?
18
16
19
17
SVE2 (Scalable Vector Extension 2) is an extension to the Arm architecture that provides vector processing capabilities with a length-agnostic programming model. The MATCH instruction is a specialized SVE2 instruction that efficiently searches for elements in a vector that match any element in another vector.
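To make the MATCH semantics concrete, the following is a minimal scalar model for 8-bit elements, not the instruction itself. It assumes the architectural behavior that each active element of the first vector is compared against every element in the corresponding 128-bit segment of the second vector, and that inactive elements produce false. The name `match_model_u8` and its parameters are hypothetical and exist only for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* MATCH compares within 128-bit segments, that is 16 bytes for 8-bit elements. */
#define SEG_BYTES 16

/*
 * Hypothetical scalar model of MATCH for 8-bit elements.
 * zn:       elements to test (first source vector)
 * zm:       candidate values (second source vector)
 * active:   governing predicate, one flag per element
 * out:      result predicate, one flag per element
 * vl_bytes: vector length in bytes, assumed to be a multiple of 16
 */
void match_model_u8(const uint8_t *zn, const uint8_t *zm,
                    const bool *active, bool *out, size_t vl_bytes)
{
    assert(vl_bytes % SEG_BYTES == 0);

    for (size_t i = 0; i < vl_bytes; i++) {
        /* Start of the 128-bit segment that element i belongs to. */
        size_t seg = (i / SEG_BYTES) * SEG_BYTES;
        bool hit = false;

        if (active[i]) {
            /* Compare zn[i] against all 16 bytes of the same segment in zm. */
            for (size_t j = 0; j < SEG_BYTES; j++) {
                if (zn[i] == zm[seg + j]) {
                    hit = true;
                    break;
                }
            }
        }
        /* Inactive elements are left false. */
        out[i] = hit;
    }
}
```

In this model, a segment of `zm` holds up to 16 distinct byte keys, so one MATCH tests every haystack element in the segment against all of them at once. That is the work the scalar version spreads over a nested loop.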
20
18
21
-
## Set Up Your Environment
19
+
## Set up your environment
22
20
23
-
To follow this learning path, you will need:
21
+
To work through these examples, you require:
24
22
25
-
1. An AWS Graviton4 instance running `Ubuntu 24.04`.
26
-
2. GCC compiler with SVE support
23
+
* An AWS Graviton4 instance running `Ubuntu 24.04`
24
+
* GCC compiler with SVE support
27
25
28
-
Let's start by setting up our environment:
26
+
Start by setting up your environment:
29
27
30
28
```bash
31
29
sudo apt-get update
32
30
sudo apt-get install -y build-essential gcc g++
33
31
```
34
-
An effective way to achieve optimal performance on Arm is not only through optimal flag usage, but also by using the most
35
-
recent compiler version. This Learning path was tested with GCC 13 which is the default version on `Ubuntu 24.04` but you
36
-
can run it with newer versions of GCC as well.
32
+
An effective way to achieve optimal performance on Arm is not only through optimal flag usage, but also by using the most recent compiler version. This Learning Path was tested with GCC 13, which is the default version on `Ubuntu 24.04`, but you can run it with newer versions of GCC as well.
33
+
34
+
Create a directory for your implementations:
37
35
38
-
Create a directory for our implementations:
39
36
```bash
40
37
mkdir -p sve2_match_demo
41
38
cd sve2_match_demo
42
39
```
43
-
## Understanding the Problem
40
+
## Understanding the problem
44
41
45
-
Our goal is to implement a function that searches for any occurrence of a set of keys in an array. The function should return true if any element in the array matches any of the keys, and false otherwise.
42
+
Your goal is to implement a function that searches for any occurrence of a set of keys in an array. The function should return true if any element in the array matches any of the keys, and false otherwise.
46
43
47
44
This type of search operation is common in many applications:
48
45
49
-
1. **Database Systems**: Checking if a value exists in a column
50
-
2. **Text Processing**: Finding specific characters in a text
51
-
3. **Network Packet Inspection**: Looking for specific byte patterns
52
-
4. **Image Processing**: Finding specific pixel values
46
+
* **Database systems**: checking if a value exists in a column
47
+
* **Text processing**: finding specific characters in a text
48
+
* **Network packet inspection**: looking for specific byte patterns
49
+
* **Image processing**: finding specific pixel values
53
50
54
-
## Implementing Search Algorithms
51
+
## Implementing search algorithms
55
52
56
53
Let's implement three versions of our search function:
57
54
58
-
### 1. Generic Scalar Implementation
55
+
### 1. Generic scalar implementation
59
56
60
-
Create a generic implementation in C, checking each element individually against each key. Open a editor of your choice and copy the code shown into a file named `sve2_match_demo.c`:
57
+
Create a generic implementation in C that checks each element individually against each key. Open an editor of your choice and copy the code shown into a file named `sve2_match_demo.c`:
61
58
62
59
```c
63
60
#include <arm_sve.h>
@@ -91,7 +88,7 @@ int search_generic_u16(const uint16_t *hay, size_t n, const uint16_t *keys,
91
88
```
92
89
The `search_generic_u8()` and `search_generic_u16()` functions both return 1 immediately when a match is found in the inner loop.
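As a rough illustration of that early-return pattern, here is a minimal sketch of a nested-loop "any of" search. It is not the Learning Path's own code: the name `any_match_u8` and the `nkeys` parameter are assumptions made for this example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical sketch: returns 1 if any element of hay equals any key,
 * and 0 otherwise. Worst-case cost is n * nkeys comparisons.
 */
int any_match_u8(const uint8_t *hay, size_t n,
                 const uint8_t *keys, size_t nkeys)
{
    assert(hay != NULL || n == 0);
    assert(keys != NULL || nkeys == 0);

    for (size_t i = 0; i < n; i++) {
        for (size_t k = 0; k < nkeys; k++) {
            if (hay[i] == keys[k]) {
                return 1; /* leave both loops on the first match */
            }
        }
    }
    return 0; /* scanned everything without a match */
}
```

Because the function exits on the first hit, its run time depends strongly on where the first match sits in the array, which is why the benchmark controls the hit rate.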
93
90
94
-
### 2. SVE2 MATCH Implementation
91
+
### 2. SVE2 MATCH implementation
95
92
96
93
Now create an implementation that uses SVE2 MATCH instructions to process multiple elements in parallel. Copy the code shown into the same file:
97
94
@@ -145,9 +142,11 @@ The SVE MATCH implementation with the `search_sve2_match_u8()` and `search_sve2_
145
142
- Processes data in vector-sized chunks with early termination when matches are found. Stops immediately when any element in the vector matches.
146
143
- Falls back to scalar code for remainder elements
147
144
148
-
### 3. Optimized SVE2 MATCH Implementation
145
+
### 3. Optimized SVE2 MATCH implementation
149
146
150
-
In this next SVE2 implementation you will add loop unrolling and prefetching to further improve performance. Copy the code shown into the same source file:
147
+
In this next SVE2 implementation you will add loop unrolling and prefetching to further improve performance.
148
+
149
+
Copy the code shown into the same source file:
151
150
152
151
```c
153
152
int search_sve2_match_u8_unrolled(const uint8_t *hay, size_t n, const uint8_t *keys,
- Processes 4 vectors per iteration instead of just one and stops immediately when any match is found in any of the 4 vectors.
248
+
- Processes four vectors per iteration instead of just one and stops immediately when any match is found in any of the four vectors.
250
249
- Uses prefetching (__builtin_prefetch) to reduce memory latency
251
250
- Leverages the svmatch_u8/u16 instruction to efficiently compare each element against multiple keys in a single operation
252
251
- Aligns memory to 64-byte boundaries for better memory access performance
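The unrolling, prefetching, and alignment ideas in this list can be sketched without SVE2 intrinsics. The example below is a portable scalar stand-in, assuming GCC or Clang for `__builtin_prefetch` and a C11 library for `aligned_alloc`. The names `alloc_aligned64`, `chunk_has_key`, and `contains_key_unrolled`, the 16-byte chunk size, and the 256-byte prefetch distance are all hypothetical choices for illustration, not values taken from the Learning Path.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

#define CHUNK 16           /* stand-in for one vector of bytes */
#define UNROLL 4           /* chunks handled per loop iteration */
#define PREFETCH_AHEAD 256 /* assumed prefetch distance in bytes */

/* Allocate at least n bytes on a 64-byte boundary; release with free(). */
uint8_t *alloc_aligned64(size_t n)
{
    /* aligned_alloc expects a size that is a multiple of the alignment. */
    size_t rounded = (n + 63) / 64 * 64;
    if (rounded == 0) {
        rounded = 64;
    }
    return aligned_alloc(64, rounded);
}

/* Scalar stand-in for one vector compare: 1 if key occurs in p[0..len). */
static int chunk_has_key(const uint8_t *p, size_t len, uint8_t key)
{
    int hit = 0;
    for (size_t i = 0; i < len; i++) {
        hit |= (p[i] == key);
    }
    return hit;
}

int contains_key_unrolled(const uint8_t *hay, size_t n, uint8_t key)
{
    assert(hay != NULL || n == 0);

    const size_t step = (size_t)CHUNK * UNROLL;
    size_t i = 0;

    for (; i + step <= n; i += step) {
        /* Hint the cache about data needed soon, staying inside the buffer. */
        if (i + PREFETCH_AHEAD < n) {
            __builtin_prefetch(hay + i + PREFETCH_AHEAD, 0, 0);
        }
        /* Test four chunks, then branch once on the combined result. */
        int hit = chunk_has_key(hay + i, CHUNK, key)
                | chunk_has_key(hay + i + CHUNK, CHUNK, key)
                | chunk_has_key(hay + i + 2 * CHUNK, CHUNK, key)
                | chunk_has_key(hay + i + 3 * CHUNK, CHUNK, key);
        if (hit) {
            return 1;
        }
    }
    /* Scalar tail for the elements that do not fill a whole unrolled step. */
    for (; i < n; i++) {
        if (hay[i] == key) {
            return 1;
        }
    }
    return 0;
}
```

Branching once per four chunks cuts loop overhead, but a hit in the first chunk still pays for the other three compares. That trade-off is one plausible reason for a smaller gain from unrolling at higher hit rates.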
253
252
254
-
## Benchmarking Framework
253
+
## Benchmarking framework
255
254
256
-
To compare the performance of the three implementations, you will use a benchmarking framework that measures the execution time of each implementation. You will also add helper functions for membership testing that are needed to setup the test data with controlled hit rates:
255
+
To compare the performance of the three implementations, use a benchmarking framework that measures the execution time of each implementation. You will also add helper functions for membership testing that are needed to set up the test data with controlled hit rates:
257
256
258
257
```c
259
258
// Timing function
@@ -425,7 +424,7 @@ You can now compile the different search implementations:
Now run the benchmark on a dataset of 65,536 elements (2^16) with a 0.001% hit rate:
427
+
Run the benchmark on a dataset of 65,536 elements (2^16) with a 0.001% hit rate:
429
428
430
429
```bash
431
430
./sve2_match_demo $((1<<16)) 3 0.00001
@@ -457,7 +456,7 @@ You can experiment with different haystack lengths, iterations and hit probabili
457
456
```
458
457
## Performance Results
459
458
460
-
When running on a Graviton4 instance with Ubuntu 24.04 and a dataset of 65,536 elements (2^16), you will observe the following results for different hit probabilities:
459
+
When running on a Graviton4 instance with Ubuntu 24.04 and a dataset of 65,536 elements (2^16), you will see the following results for different hit probabilities:
461
460
462
461
### Latency (ns per iteration) for Different Hit Rates (8-bit)
463
462
@@ -524,8 +523,8 @@ This pattern makes SVE2 MATCH particularly well-suited for applications where ma
524
523
525
524
The unrolled implementation consistently outperforms the basic SVE2 MATCH implementation:
526
525
527
-
1. **Low Hit Rates**: Up to 30% additional speedup
528
-
2. **Higher Hit Rates**: 5-20% additional speedup
526
+
* **Low Hit Rates**: Up to 30% additional speedup
527
+
* **Higher Hit Rates**: 5-20% additional speedup
529
528
530
529
This demonstrates the value of combining algorithmic optimizations (loop unrolling, prefetching) with hardware-specific instructions for maximum performance.
531
530
@@ -565,4 +564,3 @@ For image processing, MATCH can accelerate:
565
564
566
565
The SVE2 MATCH instruction provides a powerful way to accelerate search operations in byte and half word arrays. By implementing these optimizations on Graviton4 instances, you can achieve significant performance improvements for your applications.