60 changes: 60 additions & 0 deletions content/learning-paths/cross-platform/adler32/_index.md
@@ -0,0 +1,60 @@
---
title: Write NEON intrinsics using GitHub Copilot to improve Adler32 performance

draft: true
cascade:
  draft: true

minutes_to_complete: 45

who_is_this_for: This is an introductory topic for C/C++ developers who are interested in using GitHub Copilot to improve performance using NEON intrinsics.

learning_objectives:
- Use GitHub Copilot to write NEON intrinsics to improve performance of the Adler32 checksum algorithm.

prerequisites:
- An Arm computer running Linux with the GNU compiler (gcc) installed.
- VS Code with GitHub Copilot installed.

author: Jason Andrews

### Tags
skilllevels: Introductory
subjects: Performance and Architecture
armips:
- Neoverse
- Cortex-A
tools_software_languages:
- GCC
- Runbook

operatingsystems:
- Linux
shared_path: true
shared_between:
- servers-and-cloud-computing
- laptops-and-desktops
- mobile-graphics-and-gaming


further_reading:
    - resource:
        title: Arm C Language Extensions
        link: https://arm-software.github.io/acle/
        type: Documentation
    - resource:
        title: Adler-32 Checksum Algorithm
        link: https://en.wikipedia.org/wiki/Adler-32
        type: Article
    - resource:
        title: NEON Programming Quick Reference
        link: https://developer.arm.com/documentation/den0018/a
        type: Documentation


### FIXED, DO NOT MODIFY
# ================================================================================
weight: 1 # _index.md always has weight of 1 to order correctly
layout: "learningpathall" # All files under learning paths have this same wrapper
learning_path_main_page: "yes" # This should be surfaced when looking for related content. Only set for _index.md of learning path content.
---
8 changes: 8 additions & 0 deletions content/learning-paths/cross-platform/adler32/_next-steps.md
@@ -0,0 +1,8 @@
---
# ================================================================================
# FIXED, DO NOT MODIFY THIS FILE
# ================================================================================
weight: 21 # Set to always be larger than the content in this path to be at the end of the navigation.
title: "Next Steps" # Always the same, html page title.
layout: "learningpathall" # All files under learning paths have this same wrapper for Hugo processing.
---
52 changes: 52 additions & 0 deletions content/learning-paths/cross-platform/adler32/about-2.md
@@ -0,0 +1,52 @@
---
title: About NEON and Adler32
weight: 2

### FIXED, DO NOT MODIFY
layout: learningpathall
---

## Introduction

Performance optimization matters most in applications that process large amounts of data. This Learning Path focuses on implementing and optimizing the Adler32 checksum algorithm using Arm Advanced SIMD (Single Instruction, Multiple Data) capabilities. You'll learn how to use GitHub Copilot to simplify the development process while achieving significant performance improvements.

## Simplifying Arm NEON Development with GitHub Copilot

Developers recognize that Arm NEON SIMD instructions can significantly boost performance for computationally intensive applications, particularly in areas like image processing, audio/video codecs, and machine learning. However, writing NEON intrinsics directly requires specialized knowledge of the instruction set, careful consideration of data alignment, and complex vector operations that can be error-prone and time-consuming. Many developers avoid implementing these optimizations due to the steep learning curve and development overhead.

The good news is that AI developer tools such as GitHub Copilot make working with NEON intrinsics much more accessible. By providing intelligent code suggestions, automated vectorization hints, and contextual examples tailored to your specific use case, GitHub Copilot can help bridge the knowledge gap and accelerate the development of NEON-optimized code. This allows developers to harness the full performance potential of Arm processors without the traditional complexity and time-consuming effort.

This Learning Path demonstrates writing NEON intrinsics with GitHub Copilot by creating a complete project from scratch and comparing the C implementation with the NEON implementation.

While you may not create complete projects from scratch, and you shouldn't blindly trust the generated code, it's helpful to see what's possible using an example so you can apply the principles to your own projects.

## Accelerating Adler32 Checksum with Arm NEON Instructions

This project demonstrates how to significantly improve the performance of Adler32 checksum calculations using Arm NEON instructions.

### What is Arm NEON?

Arm NEON is an advanced SIMD architecture extension for Arm processors. It provides a set of instructions that can process multiple data elements in parallel using specialized vector registers. NEON technology enables developers to accelerate computationally intensive algorithms by performing the same operation on multiple data points simultaneously, rather than processing them one at a time. This parallelism is particularly valuable for multimedia processing, scientific calculations, and cryptographic operations where the same operation needs to be applied to large datasets.
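
To make the idea of processing several elements at once concrete, here is a minimal sketch that sums the bytes of a buffer 16 at a time. The helper name `sum_bytes` and the `HAVE_NEON` macro are hypothetical and chosen only for this illustration. The sketch assumes an AArch64 target for the NEON path (the `vaddlvq_u8` intrinsic is AArch64-only) and falls back to a plain loop elsewhere. It is not the Adler32 code that GitHub Copilot generates later.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#if defined(__aarch64__) && defined(__ARM_NEON)
#include <arm_neon.h>
#define HAVE_NEON 1   /* hypothetical macro used only in this sketch */
#else
#define HAVE_NEON 0
#endif

/* Sum all bytes in a buffer. The 32-bit total wraps for very large inputs. */
uint32_t sum_bytes(const uint8_t *data, size_t len)
{
    assert(data != NULL || len == 0);
    uint32_t total = 0;
    size_t i = 0;

#if HAVE_NEON
    /* 16 bytes per iteration: one vector load, then one widening
       horizontal add across all 16 lanes. */
    for (; i + 16 <= len; i += 16) {
        uint8x16_t v = vld1q_u8(data + i);
        total += vaddlvq_u8(v);
    }
#endif

    /* Scalar loop: handles the tail, or the whole buffer without NEON. */
    for (; i < len; i++) {
        total += data[i];
    }
    return total;
}
```

The scalar loop at the end is what the function would look like without SIMD. The NEON loop produces the same result but handles 16 bytes per iteration.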

### What is Adler32?

Adler32 is a checksum algorithm that was invented by Mark Adler in 1995. It's used in the zlib compression library and is faster than CRC32 but provides less reliable error detection.

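The algorithm maintains two 16-bit sums, each computed modulo 65521 (the largest prime below 2^16):

- s1: The sum of all bytes, starting from an initial value of 1
- s2: The running sum of the s1 values after each byte, starting from 0

The final checksum is `(s2 << 16) | s1`.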
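
The two running sums can be sketched in C as follows. This is a minimal, unoptimized illustration, and the function name `adler32_reference` is hypothetical rather than the name GitHub Copilot generates later. It follows the standard Adler-32 definition: s1 starts at 1, s2 starts at 0, and both are reduced modulo 65521.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define ADLER_MOD 65521u  /* largest prime below 2^16 */

/* Byte-at-a-time Adler-32 with a modulo after every byte. */
uint32_t adler32_reference(const uint8_t *data, size_t len)
{
    assert(data != NULL || len == 0);
    uint32_t s1 = 1;  /* sum of bytes, seeded with 1 */
    uint32_t s2 = 0;  /* sum of the successive s1 values */

    for (size_t i = 0; i < len; i++) {
        s1 = (s1 + data[i]) % ADLER_MOD;
        s2 = (s2 + s1) % ADLER_MOD;
    }
    return (s2 << 16) | s1;
}
```

For example, the nine ASCII bytes of `Wikipedia` give the checksum `0x11E60398`, and an empty input gives `0x00000001`.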

## Project Overview

This project explains how you can use GitHub Copilot to create everything listed below:

- Standard C implementation of Adler32
- Test program to confirm Adler32 works correctly for inputs of various sizes
- Makefile to build and run the program
- Performance measurement code to record how long the algorithm takes
- NEON version of Adler32 to increase performance
- Tables showing performance comparison between the standard C version and the NEON version

Continue to the next section to start creating the project.
62 changes: 62 additions & 0 deletions content/learning-paths/cross-platform/adler32/build-6.md
@@ -0,0 +1,62 @@
---
title: Build and run the test program
weight: 6

### FIXED, DO NOT MODIFY
layout: learningpathall
---

## How can I build and run the test program?

The files needed to test the Adler32 algorithm are now complete:

- Adler32 C function
- Test program that calls the Adler32 function to check correctness and measure performance
- Makefile to build and run
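
As a rough idea of what a correctness check in such a test program can look like, the sketch below compares checksums against known values and confirms that computing a checksum in two pieces gives the same result as computing it in one call. The names `adler32_update` and `check_adler32` are hypothetical. The generated project has its own function names and test structure.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical update-style function: continues a checksum from `adler`.
   Pass 1 as the starting value for a new checksum. */
uint32_t adler32_update(uint32_t adler, const uint8_t *data, size_t len)
{
    assert(data != NULL || len == 0);
    uint32_t s1 = adler & 0xFFFFu;
    uint32_t s2 = (adler >> 16) & 0xFFFFu;

    for (size_t i = 0; i < len; i++) {
        s1 = (s1 + data[i]) % 65521u;
        s2 = (s2 + s1) % 65521u;
    }
    return (s2 << 16) | s1;
}

/* Returns 1 if the one-shot checksum of `text` equals `expected` and
   every two-piece split of the input produces the same value. */
int check_adler32(const char *text, uint32_t expected)
{
    const uint8_t *bytes = (const uint8_t *)text;
    size_t len = strlen(text);

    if (adler32_update(1, bytes, len) != expected) {
        return 0;
    }
    for (size_t split = 0; split <= len; split++) {
        uint32_t part = adler32_update(1, bytes, split);
        part = adler32_update(part, bytes + split, len - split);
        if (part != expected) {
            return 0;
        }
    }
    return 1;
}
```

A call such as `check_adler32("abc", 0x024D0127)` returns 1, while a wrong expected value returns 0.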

Copy the information below to your GitHub Copilot Agent session:

```console
Use the Makefile to build the project and run to make sure the checksum results are correct for all data sizes.
```

The output is:

```output

GitHub Copilot: I'll help you build and run the Adler-32 checksum project to verify the results. Let me do this step-by-step.

First, let's build the project using the provided Makefile:Great! The build completed successfully without any errors or warnings. The Makefile compiled both source files (`adler32-simple.c` and adler32-test.c) and linked them into the executable adler32-test.

Now let's run the tests to verify that the Adler-32 checksum implementation works correctly for all data sizes:I ran the Adler-32 checksum project, and here are the results:

The build completed successfully without any errors or warnings. The Makefile correctly compiled both source files (`adler32-simple.c` and adler32-test.c) and linked them into the executable adler32-test.

When running the tests, I got the following results:

### Verification Tests
All 6 verification tests passed successfully:
1. Empty string: PASS
2. "a": PASS
3. "abc": PASS
4. "message digest": PASS
5. "abcdefghijklmnopqrstuvwxyz": PASS
6. "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789": PASS

### Incremental API Test
The incremental API test also passed, showing that both direct and incremental calculation methods produce the same checksum (0x3d094d1a).

### Performance Tests
The implementation was benchmarked with different data sizes:
- 1 KB: 195.31 MB/s
- 10 KB: 128.50 MB/s
- 100 KB: 261.11 MB/s
- 1 MB: 297.80 MB/s
- 10 MB: 298.17 MB/s

The results confirm that your Adler-32 checksum implementation is correct for all data sizes and performs efficiently. The optimizations for the Neoverse N1 architecture specified in the Makefile appear to be working properly, with throughput reaching nearly 300 MB/s for larger data sizes.

```

GitHub Copilot reports that the Adler32 checksum calculations are correct and provides some initial performance results. These numbers don't mean much yet because there is nothing to compare them with.
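
Throughput figures like the ones above are typically obtained by timing repeated calls and dividing the bytes processed by the elapsed time. The sketch below shows one way to do that with `clock_gettime`. The names `measure_throughput`, `mb_per_second`, and `byte_sum` are hypothetical, `byte_sum` is only a stand-in for the function under test, and the sketch assumes 1 MB means 1024 * 1024 bytes, which may differ from the generated test program.

```c
#define _POSIX_C_SOURCE 199309L
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <time.h>

typedef uint32_t (*checksum_fn)(const uint8_t *data, size_t len);

/* Stand-in workload so the sketch is self-contained. */
uint32_t byte_sum(const uint8_t *data, size_t len)
{
    uint32_t total = 0;
    for (size_t i = 0; i < len; i++) {
        total += data[i];
    }
    return total;
}

/* Convert a byte count and a duration into MB/s (1 MB = 1024 * 1024 bytes). */
double mb_per_second(size_t bytes, double seconds)
{
    assert(seconds > 0.0);
    return ((double)bytes / (1024.0 * 1024.0)) / seconds;
}

/* Time `iterations` calls of `fn` over the same buffer. */
double measure_throughput(checksum_fn fn, const uint8_t *data,
                          size_t len, int iterations)
{
    assert(fn != NULL && iterations > 0);
    struct timespec start, end;
    volatile uint32_t sink = 0;  /* keeps the calls from being optimized away */

    clock_gettime(CLOCK_MONOTONIC, &start);
    for (int i = 0; i < iterations; i++) {
        sink ^= fn(data, len);
    }
    clock_gettime(CLOCK_MONOTONIC, &end);
    (void)sink;

    double seconds = (double)(end.tv_sec - start.tv_sec)
                   + (double)(end.tv_nsec - start.tv_nsec) / 1e9;
    if (seconds <= 0.0) {
        seconds = 1e-9;  /* guard against a zero reading on coarse clocks */
    }
    return mb_per_second(len * (size_t)iterations, seconds);
}
```

Running the same measurement on two implementations with identical buffers and iteration counts gives numbers that can be compared directly.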

Continue to the next section to implement Adler32 using NEON and compare the performance.
114 changes: 114 additions & 0 deletions content/learning-paths/cross-platform/adler32/makefile-5.md
@@ -0,0 +1,114 @@
---
title: Create a Makefile
weight: 5

### FIXED, DO NOT MODIFY
layout: learningpathall
---

## How can I create a Makefile to build and run the test program?

To create a Makefile, copy and paste the prompt below into GitHub Copilot. The prompt asks for a Makefile that uses `gcc` as the compiler and targets the Neoverse N1 processor.

```console
Read the .c files in my project and
create a Makefile to build and run the adler32-test program using gcc.
Select the best compiler flags for Neoverse N1 to optimize for performance.
```

The output is:

```output

GitHub Copilot: I'll create a Makefile for your Adler-32 project that's optimized for the Neoverse N1 architecture. First, let's look at the project structure and understand the code files better.I see that the file search didn't return results. Let's try again with a different approach.I can see the project contains three files: adler32-simple.c, adler32-simple.h, and adler32-test.c. Let's examine the code in adler32-simple.c to understand the implementation before creating the Makefile.I've examined the code files in your project:

1. adler32-simple.c - Contains the implementation of the Adler-32 checksum algorithm with functions for initialization, update, and finalization
2. adler32-simple.h - Presumably contains the function declarations
3. adler32-test.c - Contains test code for verifying and benchmarking the Adler-32 implementation

Now I'll create a Makefile optimized for performance on the Neoverse N1 architecture. The Neoverse N1 is an ARM architecture that benefits from specific compiler optimizations.

# Makefile for Adler-32 checksum implementation
# Optimized for Neoverse N1 architecture

# Compiler and flags
CC = gcc
# Neoverse N1 specific optimization flags:
# -march=armv8.2-a+crypto: Target ARMv8.2-A architecture with crypto extensions (Neoverse N1)
# -mcpu=neoverse-n1: Specifically optimize for Neoverse N1
# -O3: Highest standard optimization level
# -flto: Link-time optimization
# -fomit-frame-pointer: Don't keep the frame pointer in a register for functions that don't need one
# -ftree-vectorize: Vectorize loops when possible
# -ffast-math: Enable fast math optimizations
# Additional useful flags:
# -Wall -Wextra: Enable warnings
# -g: Include debugging symbols
CFLAGS = -march=armv8.2-a+crypto -mcpu=neoverse-n1 -O3 -flto -fomit-frame-pointer -ftree-vectorize -ffast-math -Wall -Wextra -g

# Source files
SRCS = adler32-simple.c adler32-test.c
OBJS = $(SRCS:.c=.o)
TARGET = adler32-test

# Default target
all: $(TARGET)

# Link the executable
$(TARGET): $(OBJS)
	$(CC) $(CFLAGS) -o $@ $^

# Compile source files
%.o: %.c
	$(CC) $(CFLAGS) -c $< -o $@

# Run the tests
run: $(TARGET)
	./$(TARGET)

# Clean generated files
clean:
	rm -f $(OBJS) $(TARGET)

# Rebuild the project
rebuild: clean all

.PHONY: all run clean rebuild


Made changes.

I've created a Makefile for your Adler-32 project, optimized specifically for the Neoverse N1 architecture. Let me highlight some key features of this Makefile:

### Optimization Flags for Neoverse N1

The Makefile includes several compiler flags tailored for maximum performance on the Neoverse N1:

- `-march=armv8.2-a+crypto`: Targets ARMv8.2-A architecture with crypto extensions
- `-mcpu=neoverse-n1`: Specifically optimizes for the Neoverse N1 processor
- `-O3`: Enables the highest standard optimization level
- `-flto`: Enables link-time optimization for whole-program optimization
- `-fomit-frame-pointer`: Frees up a register by not keeping frame pointers when not needed
- `-ftree-vectorize`: Automatically vectorizes loops to leverage SIMD instructions
- `-ffast-math`: Enables optimizations that might affect floating-point precision but improve performance

### Building and Testing

To use the Makefile, you can run the following commands:

1. To build the project:
make

2. To run the tests after building:
make run

3. To clean the build files:
make clean

4. To rebuild everything from scratch:
make rebuild

The Makefile is designed to be simple but effective, automatically handling dependencies and providing different targets for common development tasks.
```

You are now ready to use the Makefile to build and run the test application.
18 changes: 18 additions & 0 deletions content/learning-paths/cross-platform/adler32/more-11.md
@@ -0,0 +1,18 @@
---
title: Other ideas for GitHub Copilot
weight: 11

### FIXED, DO NOT MODIFY
layout: learningpathall
---

## What else can I do with GitHub Copilot on this project?

You can investigate more topics using GitHub Copilot.

- Direct GitHub Copilot to try different compiler flags and use Agent mode to iterate through the options to find the best solution.
- Add support for the Clang compiler to the Makefile and compare the results to GCC. Depending on the application code, changing the compiler can result in improved performance.
- Use GitHub Copilot to generate different data sizes and random data patterns to further investigate correct functionality and performance.
- Try different algorithm implementations that use compiler autovectorization instead of NEON intrinsics or break down the Adler32 checksum into smaller blocks of data. It may be possible to get even better performance without NEON using the compiler and a better structure for the C code.
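
As one illustration of the last idea, the sketch below restructures the C code so that the expensive modulo runs once per block instead of once per byte. The function name `adler32_blocked` is hypothetical. The block size of 5552 is the value zlib uses as the largest number of bytes that can be accumulated before the 32-bit sums could overflow. Whether the compiler autovectorizes the inner loop depends on the compiler version and flags, so treat this as a starting point for experiments rather than a guaranteed speedup.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define ADLER_MOD  65521u  /* largest prime below 2^16 */
#define ADLER_NMAX 5552u   /* max bytes per block without 32-bit overflow */

/* Adler-32 with the modulo deferred to the end of each block. */
uint32_t adler32_blocked(const uint8_t *data, size_t len)
{
    assert(data != NULL || len == 0);
    uint32_t s1 = 1;
    uint32_t s2 = 0;

    while (len > 0) {
        size_t n = len < ADLER_NMAX ? len : ADLER_NMAX;

        /* Tight inner loop with no division. */
        for (size_t i = 0; i < n; i++) {
            s1 += data[i];
            s2 += s1;
        }

        /* Reduce once per block. */
        s1 %= ADLER_MOD;
        s2 %= ADLER_MOD;

        data += n;
        len -= n;
    }
    return (s2 << 16) | s1;
}
```

This version returns the same checksums as a byte-at-a-time implementation, so the existing correctness tests can be reused to validate it before measuring performance.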

While AI tools do not produce high-performance code for every programming problem, they can be a big help when you are getting started in a new area of programming, such as performance optimization using NEON intrinsics.