Commit 514e9ea

Merge pull request #2026 from madeline-underwood/SVE
Sve_JA to review
2 parents fc108e8 + ea800d1 commit 514e9ea

2 files changed

+37
-43
lines changed

content/learning-paths/servers-and-cloud-computing/sve2-match/_index.md

Lines changed: 6 additions & 10 deletions
@@ -1,23 +1,19 @@
11
---
2-
title: Accelerate Search Operations with SVE2 MATCH Instruction on Arm servers
2+
title: Accelerate search performance with SVE2 MATCH on Arm servers
33

4-
draft: true
5-
cascade:
6-
draft: true
74

85
minutes_to_complete: 20
96

107
who_is_this_for: This is an introductory topic for database developers, performance engineers, and anyone optimizing data processing workloads on Arm-based cloud instances.
118

129

1310
learning_objectives:
14-
- Understand how SVE2 MATCH instructions work
15-
- Implement search algorithms using scalar and SVE2 implementations using the MATCH instruction
16-
- Compare performance between different implementations
17-
- Measure performance improvements on Graviton4 instances
11+
- Understand the purpose and function of SVE2 MATCH instructions
12+
- Implement a search algorithm using both scalar and SVE2-based MATCH approaches
13+
- Benchmark and compare performance between scalar and vectorized implementations
14+
- Analyze speedups and efficiency gains on Arm Neoverse-based Graviton4 instances
1815
prerequisites:
19-
- An [Arm based instance](/learning-paths/servers-and-cloud-computing/csp/) from an appropriate
20-
cloud service provider.
16+
- Access to an [Arm-based instance](/learning-paths/servers-and-cloud-computing/csp/) from a supported cloud service provider
2117

2218
author: Pareena Verma
2319

content/learning-paths/servers-and-cloud-computing/sve2-match/sve2-match-search.md

Lines changed: 31 additions & 33 deletions
@@ -1,6 +1,6 @@
11
---
22
# User change
3-
title: "Compare performance of different Search implementations"
3+
title: "Compare search performance using scalar and SVE2 MATCH on Arm Servers"
44

55
weight: 2
66

@@ -10,54 +10,51 @@ layout: "learningpathall"
1010
---
1111
## Introduction
1212

13-
Searching for specific values in large arrays is a fundamental operation in many applications, from databases to text processing. The performance of these search operations can significantly impact overall application performance, especially when dealing with large datasets.
14-
15-
In this learning path, you will learn how to use the SVE2 MATCH instructions available on Arm Neoverse V2 based AWS Graviton4 processors to optimize search operations in byte and half word arrays. You will compare the performance of scalar and SVE2 MATCH implementations to demonstrate the significant performance benefits of using specialized vector instructions.
13+
Searching large arrays for specific values is a core task in performance-sensitive applications—from filtering records in a database to detecting patterns in text or images. On Arm Neoverse-based servers, SVE2 MATCH instructions unlock massive performance gains by vectorizing these operations. In this Learning Path, you’ll implement and benchmark both scalar and vectorized versions of search functions to see just how much faster your workloads can run.
1614

1715
## What is SVE2 MATCH?
1816

1917
SVE2 (Scalable Vector Extension 2) is an extension to the Arm architecture that provides vector processing capabilities with a length-agnostic programming model. The MATCH instruction is a specialized SVE2 instruction that efficiently searches for elements in a vector that match any element in another vector.
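To make the "match any element" behaviour concrete, here is a minimal scalar model in portable C. It is only an illustration, not code from this Learning Path: the function name `match_model_u8` and the `active` array (standing in for the governing predicate) are hypothetical. It assumes the architectural behaviour that MATCH compares each active element with every element in the corresponding 128-bit segment of the second vector.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define SEGMENT_BYTES 16 /* MATCH compares within 128-bit segments */

/* Hypothetical scalar model of a byte-wise MATCH over a vector of vl bytes.
 * out[i] becomes 1 when data[i] is active and equals any byte in the
 * 16-byte segment of keys that covers position i; otherwise it becomes 0. */
void match_model_u8(const uint8_t *active, const uint8_t *data,
                    const uint8_t *keys, uint8_t *out, size_t vl)
{
    assert(vl % SEGMENT_BYTES == 0); /* vector lengths are multiples of 128 bits */
    for (size_t i = 0; i < vl; i++) {
        size_t seg = (i / SEGMENT_BYTES) * SEGMENT_BYTES; /* start of this segment */
        out[i] = 0;
        if (!active[i])
            continue; /* inactive elements never match */
        for (size_t k = 0; k < SEGMENT_BYTES; k++) {
            if (data[i] == keys[seg + k]) {
                out[i] = 1;
                break;
            }
        }
    }
}
```

In hardware, the inner loop over the 16 key bytes is performed by a single instruction for every element of the vector, which is where the speedup over scalar code comes from.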
2018

21-
## Set Up Your Environment
19+
## Set up your environment
2220

23-
To follow this learning path, you will need:
21+
To work through these examples, you require:
2422

25-
1. An AWS Graviton4 instance running `Ubuntu 24.04`.
26-
2. GCC compiler with SVE support
23+
* An AWS Graviton4 instance running `Ubuntu 24.04`
24+
* GCC compiler with SVE support
2725

28-
Let's start by setting up our environment:
26+
Start by setting up your environment:
2927

3028
```bash
3129
sudo apt-get update
3230
sudo apt-get install -y build-essential gcc g++
3331
```
34-
An effective way to achieve optimal performance on Arm is not only through optimal flag usage, but also by using the most
35-
recent compiler version. This Learning path was tested with GCC 13 which is the default version on `Ubuntu 24.04` but you
36-
can run it with newer versions of GCC as well.
32+
An effective way to achieve optimal performance on Arm is not only through optimal flag usage, but also by using the most recent compiler version. This Learning Path was tested with GCC 13, which is the default version on `Ubuntu 24.04`, but you can run it with newer versions of GCC as well.
33+
34+
Create a directory for your implementations:
3735

38-
Create a directory for our implementations:
3936
```bash
4037
mkdir -p sve2_match_demo
4138
cd sve2_match_demo
4239
```
43-
## Understanding the Problem
40+
## Understanding the problem
4441

45-
Our goal is to implement a function that searches for any occurrence of a set of keys in an array. The function should return true if any element in the array matches any of the keys, and false otherwise.
42+
Your goal is to implement a function that searches for any occurrence of a set of keys in an array. The function should return true if any element in the array matches any of the keys, and false otherwise.
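The contract described above can be sketched in a few lines of portable C. This is a minimal illustration of the any-of search, not the Learning Path's own listing; the name `any_key_present_u8` is hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical reference: returns 1 if any of the n haystack bytes equals
 * any of the nkeys key bytes, and 0 otherwise. */
int any_key_present_u8(const uint8_t *hay, size_t n,
                       const uint8_t *keys, size_t nkeys)
{
    assert(hay != NULL || n == 0);
    assert(keys != NULL || nkeys == 0);
    for (size_t i = 0; i < n; i++) {
        for (size_t k = 0; k < nkeys; k++) {
            if (hay[i] == keys[k])
                return 1; /* stop at the first match */
        }
    }
    return 0; /* no element matched any key */
}
```

The worst case costs `n * nkeys` comparisons, which is the work a vectorized version aims to reduce.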
4643

4744
This type of search operation is common in many applications:
4845

49-
1. **Database Systems**: Checking if a value exists in a column
50-
2. **Text Processing**: Finding specific characters in a text
51-
3. **Network Packet Inspection**: Looking for specific byte patterns
52-
4. **Image Processing**: Finding specific pixel values
46+
* **Database systems**: checking if a value exists in a column
47+
* **Text processing**: finding specific characters in a text
48+
* **Network packet inspection**: looking for specific byte patterns
49+
* **Image processing**: finding specific pixel values
5350

54-
## Implementing Search Algorithms
51+
## Implementing search algorithms
5552

5653
Let's implement three versions of our search function:
5754

58-
### 1. Generic Scalar Implementation
55+
### 1. Generic scalar implementation
5956

60-
Create a generic implementation in C, checking each element individually against each key. Open a editor of your choice and copy the code shown into a file named `sve2_match_demo.c`:
57+
Create a generic implementation in C that checks each element individually against each key. Open an editor of your choice and copy the code shown into a file named `sve2_match_demo.c`:
6158

6259
```c
6360
#include <arm_sve.h>
@@ -91,7 +88,7 @@ int search_generic_u16(const uint16_t *hay, size_t n, const uint16_t *keys,
9188
```
9289
The `search_generic_u8()` and `search_generic_u16()` functions both return 1 immediately when a match is found in the inner loop.
9390
94-
### 2. SVE2 MATCH Implementation
91+
### 2. SVE2 MATCH implementation
9592
9693
Now create an implementation that uses SVE2 MATCH instructions to process multiple elements in parallel. Copy the code shown into the same file:
9794
@@ -145,9 +142,11 @@ The SVE MATCH implementation with the `search_sve2_match_u8()` and `search_sve2_
145142
- Processes data in vector-sized chunks with early termination when matches are found. Stops immediately when any element in the vector matches.
146143
- Falls back to scalar code for remainder elements
147144

148-
### 3. Optimized SVE2 MATCH Implementation
145+
### 3. Optimized SVE2 MATCH implementation
149146

150-
In this next SVE2 implementation you will add loop unrolling and prefetching to further improve performance. Copy the code shown into the same source file:
147+
In this next SVE2 implementation you will add loop unrolling and prefetching to further improve performance.
148+
149+
Copy the code shown into the same source file:
151150

152151
```c
153152
int search_sve2_match_u8_unrolled(const uint8_t *hay, size_t n, const uint8_t *keys,
@@ -246,14 +245,14 @@ if (svptest_any(pg, match1) || svptest_any(pg, match2) ||
246245
}
247246
```
248247
The main highlights of this implementation are:
249-
- Processes 4 vectors per iteration instead of just one and stops immediately when any match is found in any of the 4 vectors.
248+
- Processes four vectors per iteration instead of just one and stops immediately when any match is found in any of the four vectors.
250249
- Uses prefetching (__builtin_prefetch) to reduce memory latency
251250
- Leverages the svmatch_u8/u16 instruction to efficiently compare each element against multiple keys in a single operation
252251
- Aligns memory to 64-byte boundaries for better memory access performance
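The structure of these optimizations can be shown without SVE2 intrinsics. The sketch below is a portable scalar stand-in, not the Learning Path's implementation: `chunk_has_key`, `search_unrolled_sketch`, `alloc_haystack`, the 16-byte chunk size, and the 256-byte prefetch distance are all assumptions chosen for illustration. It assumes GCC or Clang for `__builtin_prefetch` and a C11 library for `aligned_alloc`.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

#define CHUNK 16           /* stand-in for one vector of bytes (assumed) */
#define PREFETCH_AHEAD 256 /* assumed prefetch distance in bytes; tune per system */

/* Scalar stand-in for one vector MATCH plus predicate test. */
static int chunk_has_key(const uint8_t *p, size_t len,
                         const uint8_t *keys, size_t nkeys)
{
    for (size_t i = 0; i < len; i++)
        for (size_t k = 0; k < nkeys; k++)
            if (p[i] == keys[k])
                return 1;
    return 0;
}

/* Four chunks per iteration, one combined early-exit test, scalar tail. */
int search_unrolled_sketch(const uint8_t *hay, size_t n,
                           const uint8_t *keys, size_t nkeys)
{
    assert(hay != NULL && keys != NULL);
    size_t i = 0;
    for (; i + 4 * CHUNK <= n; i += 4 * CHUNK) {
        if (i + PREFETCH_AHEAD < n) /* stay inside the buffer */
            __builtin_prefetch(hay + i + PREFETCH_AHEAD, 0, 0);
        int m1 = chunk_has_key(hay + i, CHUNK, keys, nkeys);
        int m2 = chunk_has_key(hay + i + CHUNK, CHUNK, keys, nkeys);
        int m3 = chunk_has_key(hay + i + 2 * CHUNK, CHUNK, keys, nkeys);
        int m4 = chunk_has_key(hay + i + 3 * CHUNK, CHUNK, keys, nkeys);
        if (m1 | m2 | m3 | m4)
            return 1;
    }
    return chunk_has_key(hay + i, n - i, keys, nkeys); /* remainder */
}

/* 64-byte aligned buffer; aligned_alloc needs a size that is a multiple
 * of the alignment, so round up. The caller releases it with free(). */
uint8_t *alloc_haystack(size_t n)
{
    size_t bytes = (n + 63) / 64 * 64;
    if (bytes == 0)
        bytes = 64;
    return aligned_alloc(64, bytes);
}
```

Evaluating all four chunks before a single test mirrors the combined `svptest_any` check in the unrolled version: it trades a little wasted work on a hit for fewer branches on the much more common miss.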
253252
254-
## Benchmarking Framework
253+
## Benchmarking framework
255254
256-
To compare the performance of the three implementations, you will use a benchmarking framework that measures the execution time of each implementation. You will also add helper functions for membership testing that are needed to setup the test data with controlled hit rates:
255+
To compare the performance of the three implementations, use a benchmarking framework that measures the execution time of each implementation. You will also add helper functions for membership testing that are needed to set up the test data with controlled hit rates:
257256
258257
```c
259258
// Timing function
@@ -425,7 +424,7 @@ You can now compile the different search implementations:
425424
gcc -O3 -march=armv9-a+sve2 -mcpu=neoverse-v2 sve2_match_demo.c -o sve2_match_demo
426425
```
427426

428-
Now run the benchmark on a dataset of 65,536 elements (2^16) with a 0.001% hit rate:
427+
Run the benchmark on a dataset of 65,536 elements (2^16) with a 0.001% hit rate:
429428

430429
```bash
431430
./sve2_match_demo $((1<<16)) 3 0.00001
@@ -457,7 +456,7 @@ You can experiment with different haystack lengths, iterations and hit probabili
457456
```
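The third command-line argument is a hit probability expressed as a fraction, so `0.00001` corresponds to 0.001%. As a rough illustration of how such a value can drive test data and timing, here is a minimal sketch. It is not the Learning Path's benchmark code: `fill_haystack_u8` and `now_ns` are hypothetical helpers, the single-key setup is a simplification, and `clock_gettime` assumes a POSIX system.

```c
#define _POSIX_C_SOURCE 200809L /* for clock_gettime under strict -std modes */
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>
#include <time.h>

/* Hypothetical helper: writes `key` with probability hit_prob and `filler`
 * otherwise, and returns how many hits were planted. */
size_t fill_haystack_u8(uint8_t *hay, size_t n, uint8_t key, uint8_t filler,
                        double hit_prob, unsigned seed)
{
    assert(key != filler);
    assert(hit_prob >= 0.0 && hit_prob <= 1.0);
    srand(seed); /* fixed seed keeps runs reproducible */
    size_t hits = 0;
    for (size_t i = 0; i < n; i++) {
        double r = (double)rand() / ((double)RAND_MAX + 1.0); /* r in [0, 1) */
        if (r < hit_prob) {
            hay[i] = key;
            hits++;
        } else {
            hay[i] = filler;
        }
    }
    return hits;
}

/* Hypothetical helper: monotonic timestamp in nanoseconds. */
double now_ns(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (double)ts.tv_sec * 1e9 + (double)ts.tv_nsec;
}
```

With 65,536 elements and a probability of 0.00001, the expected number of hits is about 0.66, so most generated haystacks contain no key at all and the search has to scan the whole array.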
458457
## Performance Results
459458

460-
When running on a Graviton4 instance with Ubuntu 24.04 and a dataset of 65,536 elements (2^16), you will observe the following results for different hit probabilities:
459+
When running on a Graviton4 instance with Ubuntu 24.04 and a dataset of 65,536 elements (2^16), you will see results similar to the following for different hit probabilities:
461460

462461
### Latency (ns per iteration) for Different Hit Rates (8-bit)
463462

@@ -524,8 +523,8 @@ This pattern makes SVE2 MATCH particularly well-suited for applications where ma
524523

525524
The unrolled implementation consistently outperforms the basic SVE2 MATCH implementation:
526525

527-
1. **Low Hit Rates**: Up to 30% additional speedup
528-
2. **Higher Hit Rates**: 5-20% additional speedup
526+
* **Low Hit Rates**: Up to 30% additional speedup
527+
* **Higher Hit Rates**: 5-20% additional speedup
529528

530529
This demonstrates the value of combining algorithmic optimizations (loop unrolling, prefetching) with hardware-specific instructions for maximum performance.
531530

@@ -565,4 +564,3 @@ For image processing, MATCH can accelerate:
565564

566565
The SVE2 MATCH instruction provides a powerful way to accelerate search operations in byte and half word arrays. By implementing these optimizations on Graviton4 instances, you can achieve significant performance improvements for your applications.
567566

568-
