Commit 98fd0b5

Merge pull request #1847 from jasonrandrews/review2
Use GitHub Copilot to write NEON intrinsics
2 parents 7547111 + 87be06c commit 98fd0b5

File tree

12 files changed

+1640
-0
lines changed

Lines changed: 60 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,60 @@
1+
---
2+
title: Write NEON intrinsics using GitHub Copilot to improve Adler32 performance
3+
4+
draft: true
5+
cascade:
6+
draft: true
7+
8+
minutes_to_complete: 45
9+
10+
who_is_this_for: This is an introductory topic for C/C++ developers who are interested in using GitHub Copilot to improve performance using NEON intrinsics.
11+
12+
learning_objectives:
13+
- Use GitHub Copilot to write NEON intrinsics to improve performance of the Adler32 checksum algorithm.
14+
15+
prerequisites:
16+
- An Arm computer running Linux with the GNU compiler (gcc) installed.
17+
- VS Code with GitHub Copilot installed.
18+
19+
author: Jason Andrews
20+
21+
### Tags
22+
skilllevels: Introductory
23+
subjects: Performance and Architecture
24+
armips:
25+
- Neoverse
26+
- Cortex-A
27+
tools_software_languages:
28+
- GCC
29+
- Runbook
30+
31+
operatingsystems:
32+
- Linux
33+
shared_path: true
34+
shared_between:
35+
- servers-and-cloud-computing
36+
- laptops-and-desktops
37+
- mobile-graphics-and-gaming
38+
39+
40+
further_reading:
41+
- resource:
42+
title: Arm C Language Extensions
43+
link: https://arm-software.github.io/acle/
44+
type: Documentation
45+
- resource:
46+
title: Adler-32 Checksum Algorithm
47+
link: https://en.wikipedia.org/wiki/Adler-32
48+
type: Article
49+
- resource:
50+
title: NEON Programming Quick Reference
51+
link: https://developer.arm.com/documentation/den0018/a
52+
type: Documentation
53+
54+
55+
### FIXED, DO NOT MODIFY
56+
# ================================================================================
57+
weight: 1 # _index.md always has weight of 1 to order correctly
58+
layout: "learningpathall" # All files under learning paths have this same wrapper
59+
learning_path_main_page: "yes" # This should be surfaced when looking for related content. Only set for _index.md of learning path content.
60+
---
Lines changed: 8 additions & 0 deletions
@@ -0,0 +1,8 @@
1+
---
2+
# ================================================================================
3+
# FIXED, DO NOT MODIFY THIS FILE
4+
# ================================================================================
5+
weight: 21 # Set to always be larger than the content in this path to be at the end of the navigation.
6+
title: "Next Steps" # Always the same, html page title.
7+
layout: "learningpathall" # All files under learning paths have this same wrapper for Hugo processing.
8+
---
Lines changed: 52 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,52 @@
1+
---
2+
title: About NEON and Adler32
3+
weight: 2
4+
5+
### FIXED, DO NOT MODIFY
6+
layout: learningpathall
7+
---
8+
9+
## Introduction
10+
11+
Optimizing performance is crucial for applications that process large amounts of data. This Learning Path focuses on implementing and optimizing the Adler32 checksum algorithm using the Arm Advanced SIMD (Single Instruction, Multiple Data) extension, also known as NEON. You'll learn how to use GitHub Copilot to simplify the development process while achieving significant performance improvements.
12+
13+
## Simplifying Arm NEON Development with GitHub Copilot
14+
15+
Developers recognize that Arm NEON SIMD instructions can significantly boost performance for computationally intensive applications, particularly in areas like image processing, audio/video codecs, and machine learning. However, writing NEON intrinsics directly requires specialized knowledge of the instruction set, careful consideration of data alignment, and complex vector operations that can be error-prone and time-consuming. Many developers avoid implementing these optimizations due to the steep learning curve and development overhead.
16+
17+
The good news is that AI developer tools such as GitHub Copilot make working with NEON intrinsics much more accessible. By providing intelligent code suggestions, automated vectorization hints, and contextual examples tailored to your specific use case, GitHub Copilot can help bridge the knowledge gap and accelerate the development of NEON-optimized code. This allows developers to harness the full performance potential of Arm processors without the traditional complexity and time-consuming effort.
18+
19+
This Learning Path demonstrates writing NEON intrinsics with GitHub Copilot by creating a complete project from scratch and comparing a plain C implementation with a NEON implementation.
20+
21+
While you may not create complete projects from scratch, and you shouldn't blindly trust the generated code, it's helpful to see what's possible using an example so you can apply the principles to your own projects.
22+
23+
## Accelerating Adler32 Checksum with Arm NEON Instructions
24+
25+
This project demonstrates how to significantly improve the performance of Adler32 checksum calculations using Arm NEON instructions.
26+
27+
### What is Arm NEON?
28+
29+
Arm NEON is an advanced SIMD architecture extension for Arm processors. It provides a set of instructions that can process multiple data elements in parallel using specialized vector registers. NEON technology enables developers to accelerate computationally intensive algorithms by performing the same operation on multiple data points simultaneously, rather than processing them one at a time. This parallelism is particularly valuable for multimedia processing, scientific calculations, and cryptographic operations where the same operation needs to be applied to large datasets.
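
To make the idea of processing multiple data elements at once concrete, here is a minimal sketch that sums the bytes of a buffer 16 at a time. It is an illustration rather than code from this project: the function name `sum_bytes_sketch` and the `SKETCH_HAVE_NEON` macro are hypothetical, and the NEON path is assumed to be built for AArch64, where the `vaddlvq_u8` intrinsic is available. On other targets the same function falls back to a scalar loop.

```c
#include <stddef.h>
#include <stdint.h>

/* Use the NEON path only when building for AArch64 with NEON enabled. */
#if defined(__aarch64__) && defined(__ARM_NEON)
#include <arm_neon.h>
#define SKETCH_HAVE_NEON 1
#endif

/* Hypothetical helper: returns the sum of all bytes in data[0..len). */
uint32_t sum_bytes_sketch(const uint8_t *data, size_t len)
{
    uint32_t total = 0;
    size_t i = 0;

#ifdef SKETCH_HAVE_NEON
    /* Load 16 bytes into one 128-bit vector register per iteration. */
    for (; i + 16 <= len; i += 16) {
        uint8x16_t v = vld1q_u8(data + i);
        /* Widening horizontal add: 16 x uint8 lanes -> one uint16 result. */
        total += vaddlvq_u8(v);
    }
#endif

    /* Scalar loop handles the tail (or everything when NEON is unavailable). */
    for (; i < len; i++) {
        total += data[i];
    }
    return total;
}
```

The vector loop replaces 16 scalar additions with one load and one reduction instruction, which is the kind of transformation the NEON version of Adler32 relies on.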
30+
31+
### What is Adler32?
32+
33+
Adler32 is a checksum algorithm that was invented by Mark Adler in 1995. It's used in the zlib compression library and is faster than CRC32 but provides less reliable error detection.
34+
35+
The algorithm works by calculating two 16-bit sums:
36+
37+
- s1: The sum of all bytes, starting from 1 and taken modulo 65521
38+
- s2: The sum of the s1 values after each byte, starting from 0 and taken modulo 65521
39+
The final checksum is `(s2 << 16) | s1`.
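
The two sums can be sketched in a few lines of C. This is a minimal reference version for illustration, not the code GitHub Copilot generates later in this Learning Path; the name `adler32_reference` is hypothetical. It applies the modulo after every byte, which is simple but slow.

```c
#include <stddef.h>
#include <stdint.h>

#define ADLER32_MOD 65521u /* largest prime below 2^16 */

/* Hypothetical name: one-shot Adler-32 over data[0..len). */
uint32_t adler32_reference(const uint8_t *data, size_t len)
{
    uint32_t s1 = 1; /* running sum of bytes, starts at 1 */
    uint32_t s2 = 0; /* running sum of s1 values, starts at 0 */

    for (size_t i = 0; i < len; i++) {
        s1 = (s1 + data[i]) % ADLER32_MOD;
        s2 = (s2 + s1) % ADLER32_MOD;
    }

    /* s2 occupies the high 16 bits, s1 the low 16 bits. */
    return (s2 << 16) | s1;
}
```

For example, the three bytes of `"abc"` give s1 = 295 and s2 = 589, so the checksum is `0x024d0127`.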
40+
41+
## Project Overview
42+
43+
This project explains how you can use GitHub Copilot to create everything listed below:
44+
45+
- Standard C implementation of Adler32
46+
- Test program to confirm Adler32 works correctly for inputs of various sizes
47+
- Makefile to build and run the program
48+
- Performance measurement code to record how long the algorithm takes
49+
- NEON version of Adler32 to increase performance
50+
- Tables showing performance comparison between the standard C version and the NEON version
51+
52+
Continue to the next section to start creating the project.
Lines changed: 62 additions & 0 deletions
@@ -0,0 +1,62 @@
1+
---
2+
title: Build and run the test program
3+
weight: 6
4+
5+
### FIXED, DO NOT MODIFY
6+
layout: learningpathall
7+
---
8+
9+
## How can I build and run the test program?
10+
11+
All the files required to test the Adler32 algorithm are now complete:
12+
- Adler32 C function
13+
- Test program to call the Adler32 function to test for correctness and measure performance
14+
- Makefile to build and run
15+
16+
Copy the prompt below into your GitHub Copilot Agent session:
17+
18+
```console
19+
Use the Makefile to build the project and run to make sure the checksum results are correct for all data sizes.
20+
```
21+
22+
The output is:
23+
24+
```output
25+
26+
GitHub Copilot: I'll help you build and run the Adler-32 checksum project to verify the results. Let me do this step-by-step.
27+
28+
First, let's build the project using the provided Makefile:Great! The build completed successfully without any errors or warnings. The Makefile compiled both source files (`adler32-simple.c` and adler32-test.c) and linked them into the executable adler32-test.
29+
30+
Now let's run the tests to verify that the Adler-32 checksum implementation works correctly for all data sizes:I ran the Adler-32 checksum project, and here are the results:
31+
32+
The build completed successfully without any errors or warnings. The Makefile correctly compiled both source files (`adler32-simple.c` and adler32-test.c) and linked them into the executable adler32-test.
33+
34+
When running the tests, I got the following results:
35+
36+
### Verification Tests
37+
All 6 verification tests passed successfully:
38+
1. Empty string: PASS
39+
2. "a": PASS
40+
3. "abc": PASS
41+
4. "message digest": PASS
42+
5. "abcdefghijklmnopqrstuvwxyz": PASS
43+
6. "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789": PASS
44+
45+
### Incremental API Test
46+
The incremental API test also passed, showing that both direct and incremental calculation methods produce the same checksum (0x3d094d1a).
47+
48+
### Performance Tests
49+
The implementation was benchmarked with different data sizes:
50+
- 1 KB: 195.31 MB/s
51+
- 10 KB: 128.50 MB/s
52+
- 100 KB: 261.11 MB/s
53+
- 1 MB: 297.80 MB/s
54+
- 10 MB: 298.17 MB/s
55+
56+
The results confirm that your Adler-32 checksum implementation is correct for all data sizes and performs efficiently. The optimizations for the Neoverse N1 architecture specified in the Makefile appear to be working properly, with throughput reaching nearly 300 MB/s for larger data sizes.
57+
58+
```
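
The incremental API test in the output above checks that feeding the data in pieces gives the same checksum as a single call. A minimal sketch of how such an API can work is shown below. The function names are hypothetical and may not match the ones in the generated `adler32-simple.c`; the key point is that the 32-bit checksum itself carries the state (s1 in the low half, s2 in the high half) from one call to the next.

```c
#include <stddef.h>
#include <stdint.h>

#define ADLER32_MOD 65521u

/* Hypothetical API: the initial state is s1 = 1, s2 = 0. */
uint32_t adler32_init_sketch(void)
{
    return 1u;
}

/* Hypothetical API: continue a checksum with more data. */
uint32_t adler32_update_sketch(uint32_t adler, const uint8_t *data, size_t len)
{
    /* Unpack the running sums from the previous result. */
    uint32_t s1 = adler & 0xffffu;
    uint32_t s2 = (adler >> 16) & 0xffffu;

    for (size_t i = 0; i < len; i++) {
        s1 = (s1 + data[i]) % ADLER32_MOD;
        s2 = (s2 + s1) % ADLER32_MOD;
    }
    return (s2 << 16) | s1;
}
```

Calling the update function once with `"abc"`, or first with `"a"` and then with `"bc"`, yields the same value, which is the property the incremental test verifies.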
59+
60+
GitHub Copilot reports that the Adler32 checksum calculations are correct and provides some initial performance results. The numbers don't mean much yet because there is nothing to compare them with.
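
Throughput figures like the ones above are typically derived by timing repeated calls and dividing the bytes processed by the elapsed time. The sketch below shows one way to do that, and is not the generated test program. All names are hypothetical, `byte_sum_workload` is only a stand-in for the checksum function, and one MB is assumed to be 1024 × 1024 bytes because the Copilot output does not say which definition it uses.

```c
#include <stddef.h>
#include <stdint.h>
#include <time.h>

/* Any function with this shape can be measured. */
typedef uint32_t (*checksum_fn)(const uint8_t *data, size_t len);

/* Stand-in workload so the sketch is self-contained. */
uint32_t byte_sum_workload(const uint8_t *data, size_t len)
{
    uint32_t sum = 0;
    for (size_t i = 0; i < len; i++) {
        sum += data[i];
    }
    return sum;
}

/* Returns MB/s, or 0.0 if the run was too short for clock() to measure. */
double throughput_mb_per_s(checksum_fn fn, const uint8_t *data,
                           size_t len, int iterations)
{
    volatile uint32_t sink = 0; /* keeps the calls from being optimized away */

    clock_t start = clock();
    for (int i = 0; i < iterations; i++) {
        sink ^= fn(data, len);
    }
    clock_t end = clock();
    (void)sink;

    double seconds = (double)(end - start) / CLOCKS_PER_SEC;
    if (seconds <= 0.0) {
        return 0.0;
    }
    double megabytes = ((double)len * iterations) / (1024.0 * 1024.0);
    return megabytes / seconds;
}
```

Small buffers finish quickly, so use enough iterations for the elapsed time to be well above the timer resolution. Otherwise the result is noisy, which may explain uneven numbers at small sizes.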
61+
62+
Continue to the next section to implement Adler32 using NEON and compare the performance.
Lines changed: 114 additions & 0 deletions
@@ -0,0 +1,114 @@
1+
---
2+
title: Create a Makefile
3+
weight: 5
4+
5+
### FIXED, DO NOT MODIFY
6+
layout: learningpathall
7+
---
8+
9+
## How can I create a Makefile to build and run the test program?
10+
11+
To create a Makefile, copy and paste the prompt below into GitHub Copilot. The prompt asks for a Makefile that uses `gcc` as the compiler and targets the Neoverse N1 processor.
12+
13+
```console
14+
Read the .c files in my project and
15+
create a Makefile to build and run the adler32-test program using gcc.
16+
Select the best compiler flags for Neoverse N1 to optimize for performance.
17+
```
18+
19+
The output is:
20+
21+
```output
22+
23+
GitHub Copilot: I'll create a Makefile for your Adler-32 project that's optimized for the Neoverse N1 architecture. First, let's look at the project structure and understand the code files better.I see that the file search didn't return results. Let's try again with a different approach.I can see the project contains three files: adler32-simple.c, adler32-simple.h, and adler32-test.c. Let's examine the code in adler32-simple.c to understand the implementation before creating the Makefile.I've examined the code files in your project:
24+
25+
1. adler32-simple.c - Contains the implementation of the Adler-32 checksum algorithm with functions for initialization, update, and finalization
26+
2. adler32-simple.h - Presumably contains the function declarations
27+
3. adler32-test.c - Contains test code for verifying and benchmarking the Adler-32 implementation
28+
29+
Now I'll create a Makefile optimized for performance on the Neoverse N1 architecture. The Neoverse N1 is an ARM architecture that benefits from specific compiler optimizations.
30+
31+
# Makefile for Adler-32 checksum implementation
32+
# Optimized for Neoverse N1 architecture
33+
34+
# Compiler and flags
35+
CC = gcc
36+
# Neoverse N1 specific optimization flags:
37+
# -march=armv8.2-a+crypto: Target ARMv8.2-A architecture with crypto extensions (Neoverse N1)
38+
# -mcpu=neoverse-n1: Specifically optimize for Neoverse N1
39+
# -O3: Highest standard optimization level
40+
# -flto: Link-time optimization
41+
# -fomit-frame-pointer: Don't keep the frame pointer in a register for functions that don't need one
42+
# -ftree-vectorize: Vectorize loops when possible
43+
# -ffast-math: Enable fast math optimizations
44+
# Additional useful flags:
45+
# -Wall -Wextra: Enable warnings
46+
# -g: Include debugging symbols
47+
CFLAGS = -march=armv8.2-a+crypto -mcpu=neoverse-n1 -O3 -flto -fomit-frame-pointer -ftree-vectorize -ffast-math -Wall -Wextra -g
48+
49+
# Source files
50+
SRCS = adler32-simple.c adler32-test.c
51+
OBJS = $(SRCS:.c=.o)
52+
TARGET = adler32-test
53+
54+
# Default target
55+
all: $(TARGET)
56+
57+
# Link the executable
58+
$(TARGET): $(OBJS)
59+
$(CC) $(CFLAGS) -o $@ $^
60+
61+
# Compile source files
62+
%.o: %.c
63+
$(CC) $(CFLAGS) -c $< -o $@
64+
65+
# Run the tests
66+
run: $(TARGET)
67+
./$(TARGET)
68+
69+
# Clean generated files
70+
clean:
71+
rm -f $(OBJS) $(TARGET)
72+
73+
# Rebuild the project
74+
rebuild: clean all
75+
76+
.PHONY: all run clean rebuild
77+
78+
79+
Made changes.
80+
81+
I've created a Makefile for your Adler-32 project, optimized specifically for the Neoverse N1 architecture. Let me highlight some key features of this Makefile:
82+
83+
### Optimization Flags for Neoverse N1
84+
85+
The Makefile includes several compiler flags tailored for maximum performance on the Neoverse N1:
86+
87+
- `-march=armv8.2-a+crypto`: Targets ARMv8.2-A architecture with crypto extensions
88+
- `-mcpu=neoverse-n1`: Specifically optimizes for the Neoverse N1 processor
89+
- `-O3`: Enables the highest standard optimization level
90+
- `-flto`: Enables link-time optimization for whole-program optimization
91+
- `-fomit-frame-pointer`: Frees up a register by not keeping frame pointers when not needed
92+
- `-ftree-vectorize`: Automatically vectorizes loops to leverage SIMD instructions
93+
- `-ffast-math`: Enables optimizations that might affect floating-point precision but improve performance
94+
95+
### Building and Testing
96+
97+
To use the Makefile, you can run the following commands:
98+
99+
1. To build the project:
100+
make
101+
102+
2. To run the tests after building:
103+
make run
104+
105+
3. To clean the build files:
106+
make clean
107+
108+
4. To rebuild everything from scratch:
109+
make rebuild
110+
111+
The Makefile is designed to be simple but effective, automatically handling dependencies and providing different targets for common development tasks.
112+
```
113+
114+
You are now ready to use the Makefile to build and run the test application.
Lines changed: 18 additions & 0 deletions
@@ -0,0 +1,18 @@
1+
---
2+
title: Other ideas for GitHub Copilot
3+
weight: 11
4+
5+
### FIXED, DO NOT MODIFY
6+
layout: learningpathall
7+
---
8+
9+
## What else can I do with GitHub Copilot on this project?
10+
11+
You can investigate more topics using GitHub Copilot.
12+
13+
- Direct GitHub Copilot to try different compiler flags and use Agent mode to iterate through the options to find the best solution.
14+
- Add support for the Clang compiler to the Makefile and compare the results to GCC. Depending on the application code, changing the compiler can result in improved performance.
15+
- Use GitHub Copilot to generate different data sizes and random data patterns to further investigate correct functionality and performance.
16+
- Try different algorithm implementations that use compiler autovectorization instead of NEON intrinsics or break down the Adler32 checksum into smaller blocks of data. It may be possible to get even better performance without NEON using the compiler and a better structure for the C code.
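
As a starting point for the last idea, the sketch below restructures the C code so that the modulo runs once per block instead of once per byte. This is the same deferral technique zlib uses: with 32-bit accumulators, up to 5552 bytes can be added before the sums could overflow. The function name `adler32_blocked` is hypothetical, and whether the compiler autovectorizes the inner loop depends on the compiler version and flags.

```c
#include <stddef.h>
#include <stdint.h>

#define ADLER32_MOD  65521u
#define ADLER32_NMAX 5552u /* max bytes before 32-bit sums could overflow */

/* Hypothetical name: Adler-32 with the modulo deferred to block ends. */
uint32_t adler32_blocked(const uint8_t *data, size_t len)
{
    uint32_t s1 = 1;
    uint32_t s2 = 0;

    while (len > 0) {
        size_t chunk = len < ADLER32_NMAX ? len : ADLER32_NMAX;
        len -= chunk;

        /* No modulo in the hot loop: only additions per byte. */
        for (size_t i = 0; i < chunk; i++) {
            s1 += data[i];
            s2 += s1;
        }
        data += chunk;

        /* Reduce once per block to keep the sums in range. */
        s1 %= ADLER32_MOD;
        s2 %= ADLER32_MOD;
    }
    return (s2 << 16) | s1;
}
```

You can ask GitHub Copilot to add a function like this to the test program and compare its throughput with both the original C version and the NEON version.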
17+
18+
While AI tools don't produce high-performance code for every programming problem, they can be a big help in getting started in a new area of programming, such as performance optimization using NEON intrinsics.
