Merge pull request #2026 from madeline-underwood/SVE

jasonrandrews · web-flow · commit 514e9ea0a447 · 2025-06-11T13:09:54.000-05:00
Sve_JA to review
diff --git a/content/learning-paths/servers-and-cloud-computing/sve2-match/_index.md b/content/learning-paths/servers-and-cloud-computing/sve2-match/_index.md
@@ -1,23 +1,19 @@
 ---
-title: Accelerate Search Operations with SVE2 MATCH Instruction on Arm servers
+title: Accelerate search performance with SVE2 MATCH on Arm servers
 
-draft: true
-cascade:
-    draft: true
     
 minutes_to_complete: 20
 
 who_is_this_for: This is an introductory topic for database developers, performance engineers, and anyone optimizing data processing workloads on Arm-based cloud instances.
 
 
 learning_objectives:
-  - Understand how SVE2 MATCH instructions work
-  - Implement search algorithms using scalar and  SVE2 implementations using the MATCH instruction
-  - Compare performance between different implementations
-  - Measure performance improvements on Graviton4 instances
+  - Understand the purpose and function of SVE2 MATCH instructions
+  - Implement a search algorithm using both scalar and SVE2-based MATCH approaches
+  - Benchmark and compare performance between scalar and vectorized implementations
+  - Analyze speedups and efficiency gains on Arm Neoverse-based Graviton4 instances
 prerequisites:
-- An [Arm based instance](/learning-paths/servers-and-cloud-computing/csp/) from an appropriate
-  cloud service provider.
+- Access to an [Arm-based instance](/learning-paths/servers-and-cloud-computing/csp/) from a supported cloud service provider
 
 author: Pareena Verma
 
diff --git a/content/learning-paths/servers-and-cloud-computing/sve2-match/sve2-match-search.md b/content/learning-paths/servers-and-cloud-computing/sve2-match/sve2-match-search.md
@@ -1,6 +1,6 @@
 ---
 # User change
-title: "Compare performance of different Search implementations"
+title: "Compare search performance using scalar and SVE2 MATCH on Arm servers"
 
 weight: 2
 
@@ -10,54 +10,51 @@ layout: "learningpathall"
 ---
 ## Introduction
 
-Searching for specific values in large arrays is a fundamental operation in many applications, from databases to text processing. The performance of these search operations can significantly impact overall application performance, especially when dealing with large datasets.
-
-In this learning path, you will learn how to use the SVE2 MATCH instructions available on Arm Neoverse V2 based AWS Graviton4 processors to optimize search operations in byte and half word arrays. You will compare the performance of scalar and SVE2 MATCH implementations to demonstrate the significant performance benefits of using specialized vector instructions.
+Searching large arrays for specific values is a core task in performance-sensitive applications, from filtering records in a database to detecting patterns in text or images. On Arm Neoverse-based servers, SVE2 MATCH instructions can speed up these searches significantly by comparing many elements against a set of keys at once. In this Learning Path, you'll implement and benchmark both scalar and vectorized versions of search functions to measure how much faster your workloads can run.
 
 ## What is SVE2 MATCH?
 
 SVE2 (Scalable Vector Extension 2) is an extension to the Arm architecture that provides vector processing capabilities with a length-agnostic programming model. The MATCH instruction is a specialized SVE2 instruction that efficiently searches for elements in a vector that match any element in another vector.
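
To make the "match any element" behavior concrete, here is a minimal scalar model in portable C. It is a sketch of the semantics only, not the SVE2 intrinsic: the names `match_model_u8` and `any_set` are hypothetical, and the model ignores predicates and the hardware vector length.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Scalar model of a MATCH-style comparison (illustrative only):
 * out[i] is set to 1 when data[i] equals ANY of the keys, else 0. */
void match_model_u8(const uint8_t *data, size_t n,
                    const uint8_t *keys, size_t nkeys,
                    uint8_t *out)
{
    assert(data != NULL && keys != NULL && out != NULL);
    for (size_t i = 0; i < n; ++i) {
        out[i] = 0;
        for (size_t k = 0; k < nkeys; ++k) {
            if (data[i] == keys[k]) {
                out[i] = 1;
                break; /* one matching key is enough for this element */
            }
        }
    }
}

/* Reduce the per-element flags to a single yes/no answer. */
int any_set(const uint8_t *flags, size_t n)
{
    for (size_t i = 0; i < n; ++i) {
        if (flags[i]) {
            return 1;
        }
    }
    return 0;
}
```

For example, with data `{1, 7, 3, 9}` and keys `{9, 7}`, the model sets the flags for the second and fourth elements. A vector instruction produces the equivalent per-element result for a whole vector in one operation rather than looping element by element.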
 
-## Set Up Your Environment
+## Set up your environment
 
-To follow this learning path, you will need:
+To work through these examples, you need:
 
-1. An AWS Graviton4 instance running `Ubuntu 24.04`. 
-2. GCC compiler with SVE support
+* An AWS Graviton4 instance running `Ubuntu 24.04`
+* GCC compiler with SVE support
 
-Let's start by setting up our environment:
+Start by setting up your environment:
 
 ```bash
 sudo apt-get update
 sudo apt-get install -y build-essential gcc g++
 ```
-An effective way to achieve optimal performance on Arm is not only through optimal flag usage, but also by using the most 
-recent compiler version. This Learning path was tested with GCC 13 which is the default version on `Ubuntu 24.04` but you 
-can run it with newer versions of GCC as well.
+To get the best performance on Arm, use a recent compiler version as well as appropriate compiler flags. This Learning Path was tested with GCC 13, which is the default version on `Ubuntu 24.04`, but you can also run it with newer versions of GCC.
+
+Create a directory for your implementations:
 
-Create a directory for our implementations:
 ```bash
 mkdir -p sve2_match_demo
 cd sve2_match_demo
 ```
-## Understanding the Problem
+## Understanding the problem
 
-Our goal is to implement a function that searches for any occurrence of a set of keys in an array. The function should return true if any element in the array matches any of the keys, and false otherwise.
+Your goal is to implement a function that searches for any occurrence of a set of keys in an array. The function should return true if any element in the array matches any of the keys, and false otherwise.
 
 This type of search operation is common in many applications:
 
-1. **Database Systems**: Checking if a value exists in a column
-2. **Text Processing**: Finding specific characters in a text
-3. **Network Packet Inspection**: Looking for specific byte patterns
-4. **Image Processing**: Finding specific pixel values
+* **Database systems**: checking if a value exists in a column
+* **Text processing**: finding specific characters in a text
+* **Network packet inspection**: looking for specific byte patterns
+* **Image processing**: finding specific pixel values
 
-## Implementing Search Algorithms
+## Implementing search algorithms
 
 Let's implement three versions of our search function:
 
-### 1. Generic Scalar Implementation
+### 1. Generic scalar implementation
 
-Create a generic implementation in C, checking each element individually against each key. Open a editor of your choice and copy the code shown into a file named `sve2_match_demo.c`:
+Create a generic implementation in C that checks each element individually against each key. Open an editor of your choice and copy the code shown into a file named `sve2_match_demo.c`:
 
 ```c
 #include <arm_sve.h>
@@ -91,7 +88,7 @@ int search_generic_u16(const uint16_t *hay, size_t n, const uint16_t *keys,
 ```
 The `search_generic_u8()` and `search_generic_u16()` functions both return 1 immediately when a match is found in the inner loop.
 
-### 2. SVE2 MATCH Implementation
+### 2. SVE2 MATCH implementation
 
 Now create an implementation that uses SVE2 MATCH instructions to process multiple elements in parallel. Copy the code shown into the same file:
 
@@ -145,9 +142,11 @@ The SVE MATCH implementation with the `search_sve2_match_u8()` and `search_sve2_
    - Processes data in vector-sized chunks with early termination when matches are found. Stops immediately when any element in the vector matches.
    - Falls back to scalar code for remainder elements
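
The chunk-plus-remainder structure described above can be sketched in portable C. This is not the SVE2 code from the Learning Path: `CHUNK_BYTES` is a hypothetical stand-in for the hardware vector length, and `block_has_key` and `search_chunked_u8` are illustrative names.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for the hardware vector length in bytes. */
#define CHUNK_BYTES 16

/* Returns 1 if any of the len bytes at p equals any key, else 0. */
static int block_has_key(const uint8_t *p, size_t len,
                         const uint8_t *keys, size_t nkeys)
{
    for (size_t i = 0; i < len; ++i) {
        for (size_t k = 0; k < nkeys; ++k) {
            if (p[i] == keys[k]) {
                return 1;
            }
        }
    }
    return 0;
}

int search_chunked_u8(const uint8_t *hay, size_t n,
                      const uint8_t *keys, size_t nkeys)
{
    if (n == 0) {
        return 0;
    }
    assert(hay != NULL && keys != NULL);

    size_t i = 0;
    /* Main loop: whole chunks only, stopping at the first chunk with a hit. */
    for (; i + CHUNK_BYTES <= n; i += CHUNK_BYTES) {
        if (block_has_key(hay + i, CHUNK_BYTES, keys, nkeys)) {
            return 1;
        }
    }
    /* Tail: fewer than CHUNK_BYTES elements remain. */
    return block_has_key(hay + i, n - i, keys, nkeys);
}
```

In a vectorized version, the per-chunk check would be a single vector comparison followed by a test for any set lane. The loop shape and the tail handling stay the same.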
 
-### 3. Optimized SVE2 MATCH Implementation
+### 3. Optimized SVE2 MATCH implementation
 
-In this next SVE2 implementation you will add loop unrolling and prefetching to further improve performance. Copy the code shown into the same source file:
+In this next SVE2 implementation, you will add loop unrolling and prefetching to further improve performance.
+
+Copy the code shown into the same source file:
 
 ```c
 int search_sve2_match_u8_unrolled(const uint8_t *hay, size_t n, const uint8_t *keys,
@@ -246,14 +245,14 @@ if (svptest_any(pg, match1) || svptest_any(pg, match2) ||
 }
 ```
 The main highlights of this implementation are:
-   - Processes 4 vectors per iteration instead of just one and stops immediately when any match is found in any of the 4 vectors.
+   - Processes four vectors per iteration instead of just one and stops immediately when any match is found in any of the four vectors.
    - Uses prefetching (__builtin_prefetch) to reduce memory latency
    - Leverages the svmatch_u8/u16 instruction to efficiently compare each element against multiple keys in a single operation
    - Aligns memory to 64-byte boundaries for better memory access performance
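
The alignment, unrolling, and prefetching ideas in this list can be illustrated without SVE2. The following is a rough sketch under stated assumptions: `alloc_aligned64` and `search_unrolled4_u8` are hypothetical helpers, the loop handles four scalar elements per iteration rather than four vectors, it searches for a single key, and the prefetch distance of 64 bytes is an arbitrary choice. `__builtin_prefetch` is a GCC and Clang builtin.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Allocate at least nbytes aligned to a 64-byte boundary.
 * C11 aligned_alloc expects the size to be a multiple of the alignment,
 * so the request is rounded up. Release the memory with free(). */
void *alloc_aligned64(size_t nbytes)
{
    if (nbytes > SIZE_MAX - 63u) {
        return NULL; /* rounding up would overflow */
    }
    size_t rounded = (nbytes + 63u) & ~(size_t)63u;
    if (rounded == 0) {
        rounded = 64;
    }
    return aligned_alloc(64, rounded);
}

/* Search for one key, four elements per iteration, with early exit. */
int search_unrolled4_u8(const uint8_t *hay, size_t n, uint8_t key)
{
    if (n == 0) {
        return 0;
    }
    assert(hay != NULL);

    size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        /* Hint that data further ahead will be read soon (in bounds only). */
        if (i + 64 < n) {
            __builtin_prefetch(hay + i + 64);
        }
        /* Combine four comparisons, then branch once. */
        int hit = (hay[i] == key) | (hay[i + 1] == key) |
                  (hay[i + 2] == key) | (hay[i + 3] == key);
        if (hit) {
            return 1;
        }
    }
    /* Scalar tail for the last zero to three elements. */
    for (; i < n; ++i) {
        if (hay[i] == key) {
            return 1;
        }
    }
    return 0;
}
```

Combining the four comparisons before branching reduces the number of branches per element, which is the same motivation as testing four match results in one condition in the SVE2 version.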
 
-## Benchmarking Framework
+## Benchmarking framework
 
-To compare the performance of the three implementations, you will use a benchmarking framework that measures the execution time of each implementation. You will also add helper functions for membership testing that are needed to setup the test data with controlled hit rates:
+To compare the performance of the three implementations, use a benchmarking framework that measures the execution time of each implementation. You will also add helper functions for membership testing, which are needed to set up the test data with controlled hit rates:
 
 ```c
 // Timing function
@@ -425,7 +424,7 @@ You can now compile the different search implementations:
 gcc -O3 -march=armv9-a+sve2 -mcpu=neoverse-v2 sve2_match_demo.c -o sve2_match_demo
 ```
 
-Now run the benchmark on a dataset of 65,536 elements (2^16) with a 0.001% hit rate:
+Run the benchmark on a dataset of 65,536 elements (2^16) with a 0.001% hit rate:
 
 ```bash
 ./sve2_match_demo $((1<<16)) 3 0.00001
@@ -457,7 +456,7 @@ You can experiment with different haystack lengths, iterations and hit probabili
 ```
 ## Performance Results
 
-When running on a Graviton4 instance with Ubuntu 24.04 and a dataset of 65,536 elements (2^16), you will observe the following results for different hit probabilities:
+When running on a Graviton4 instance with Ubuntu 24.04 and a dataset of 65,536 elements (2^16), you will see results similar to the following for different hit probabilities:
 
 ### Latency (ns per iteration) for Different Hit Rates (8-bit)
 
@@ -524,8 +523,8 @@ This pattern makes SVE2 MATCH particularly well-suited for applications where ma
 
 The unrolled implementation consistently outperforms the basic SVE2 MATCH implementation:
 
-1. **Low Hit Rates**: Up to 30% additional speedup
-2. **Higher Hit Rates**: 5-20% additional speedup
+* **Low Hit Rates**: Up to 30% additional speedup
+* **Higher Hit Rates**: 5-20% additional speedup
 
 This demonstrates the value of combining algorithmic optimizations (loop unrolling, prefetching) with hardware-specific instructions for maximum performance.
 
@@ -565,4 +564,3 @@ For image processing, MATCH can accelerate:
 
 The SVE2 MATCH instruction provides a powerful way to accelerate search operations in byte and half word arrays. By implementing these optimizations on Graviton4 instances, you can achieve significant performance improvements for your applications.
 
-