diff --git a/content/learning-paths/cross-platform/adler32/_index.md b/content/learning-paths/cross-platform/adler32/_index.md new file mode 100644 index 0000000000..e5f83f07c4 --- /dev/null +++ b/content/learning-paths/cross-platform/adler32/_index.md @@ -0,0 +1,60 @@ +--- +title: Write NEON intrinsics using GitHub Copilot to improve Adler32 performance + +draft: true +cascade: + draft: true + +minutes_to_complete: 45 + +who_is_this_for: This is an introductory topic for C/C++ developers who are interested in using GitHub Copilot to improve performance using NEON intrinsics. + +learning_objectives: + - Use GitHub Copilot to write NEON intrinsics to improve performance of the Adler32 checksum algorithm. + +prerequisites: + - An Arm computer running Linux with the GNU compiler (gcc) installed. + - VS Code with GitHub Copilot installed. + +author: Jason Andrews + +### Tags +skilllevels: Introductory +subjects: Performance and Architecture +armips: + - Neoverse + - Cortex-A +tools_software_languages: + - GCC + - Runbook + +operatingsystems: + - Linux +shared_path: true +shared_between: + - servers-and-cloud-computing + - laptops-and-desktops + - mobile-graphics-and-gaming + + +further_reading: + - resource: + title: Arm C Language Extensions + link: https://arm-software.github.io/acle/ + type: Documentation + - resource: + title: Adler-32 Checksum Algorithm + link: https://en.wikipedia.org/wiki/Adler-32 + type: Article + - resource: + title: NEON Programming Quick Reference + link: https://developer.arm.com/documentation/den0018/a + type: Documentation + + +### FIXED, DO NOT MODIFY +# ================================================================================ +weight: 1 # _index.md always has weight of 1 to order correctly +layout: "learningpathall" # All files under learning paths have this same wrapper +learning_path_main_page: "yes" # This should be surfaced when looking for related content. Only set for _index.md of learning path content. 
+--- diff --git a/content/learning-paths/cross-platform/adler32/_next-steps.md b/content/learning-paths/cross-platform/adler32/_next-steps.md new file mode 100644 index 0000000000..c3db0de5a2 --- /dev/null +++ b/content/learning-paths/cross-platform/adler32/_next-steps.md @@ -0,0 +1,8 @@ +--- +# ================================================================================ +# FIXED, DO NOT MODIFY THIS FILE +# ================================================================================ +weight: 21 # Set to always be larger than the content in this path to be at the end of the navigation. +title: "Next Steps" # Always the same, html page title. +layout: "learningpathall" # All files under learning paths have this same wrapper for Hugo processing. +--- diff --git a/content/learning-paths/cross-platform/adler32/about-2.md b/content/learning-paths/cross-platform/adler32/about-2.md new file mode 100644 index 0000000000..04fd4fcc0e --- /dev/null +++ b/content/learning-paths/cross-platform/adler32/about-2.md @@ -0,0 +1,52 @@ +--- +title: About NEON and Adler32 +weight: 2 + +### FIXED, DO NOT MODIFY +layout: learningpathall +--- + +## Introduction + +Optimizing performance is crucial for applications that process large amounts of data. This Learning Path focuses on implementing and optimizing the Adler32 checksum algorithm using Arm Advanced SIMD (Single Instruction, Multiple Data) capabilities. You'll learn how to leverage GitHub Copilot to simplify the development process while achieving significant performance improvements. + +## Simplifying Arm NEON Development with GitHub Copilot + +Developers recognize that Arm NEON SIMD instructions can significantly boost performance for computationally intensive applications, particularly in areas like image processing, audio/video codecs, and machine learning. 
However, writing NEON intrinsics directly requires specialized knowledge of the instruction set, careful attention to data alignment, and complex vector operations that are error-prone and time-consuming to write. Many developers avoid implementing these optimizations due to the steep learning curve and development overhead. + +The good news is that AI developer tools such as GitHub Copilot make working with NEON intrinsics much more accessible. By providing intelligent code suggestions, automated vectorization hints, and contextual examples tailored to your specific use case, GitHub Copilot can help bridge the knowledge gap and accelerate the development of NEON-optimized code. This allows developers to harness more of the performance potential of Arm processors without the traditional complexity and effort. + +This Learning Path demonstrates writing NEON intrinsics with GitHub Copilot by creating a complete project from scratch and comparing the C implementation with the NEON implementation. + +While you may not create complete projects from scratch, and you shouldn't blindly trust the generated code, it's helpful to see what's possible using an example so you can apply the principles to your own projects. + +## Accelerating Adler32 Checksum with Arm NEON Instructions + +This project demonstrates how to significantly improve the performance of Adler32 checksum calculations using Arm NEON instructions. + +### What is Arm NEON? + +Arm NEON is the Advanced SIMD architecture extension for Arm processors. It provides a set of instructions that can process multiple data elements in parallel using specialized vector registers. NEON technology enables developers to accelerate computationally intensive algorithms by performing the same operation on multiple data points simultaneously, rather than processing them one at a time. 
This parallelism is particularly valuable for multimedia processing, scientific calculations, and cryptographic operations where the same operation needs to be applied to large datasets. + +### What is Adler32? + +Adler32 is a checksum algorithm that was invented by Mark Adler in 1995. It's used in the zlib compression library and is faster than CRC32 but provides less reliable error detection. + +The algorithm works by calculating two 16-bit sums, each taken modulo 65521 (the largest prime number less than 2^16): + +- s1: The sum of all bytes, starting from an initial value of 1 +- s2: The running total of the s1 values after each byte, starting from 0 +- The final checksum is (s2 << 16) | s1. + +## Project Overview + +This project explains how you can use GitHub Copilot to create everything listed below: + +- Standard C implementation of Adler32 +- Test program to confirm Adler32 works correctly for inputs of various sizes +- Makefile to build and run the program +- Performance measurement code to record how long the algorithm takes +- NEON version of Adler32 to increase performance +- Tables showing performance comparison between the standard C version and the NEON version + +Continue to the next section to start creating the project. \ No newline at end of file diff --git a/content/learning-paths/cross-platform/adler32/build-6.md b/content/learning-paths/cross-platform/adler32/build-6.md new file mode 100644 index 0000000000..d0fac641c2 --- /dev/null +++ b/content/learning-paths/cross-platform/adler32/build-6.md @@ -0,0 +1,62 @@ +--- +title: Build and run the test program +weight: 6 + +### FIXED, DO NOT MODIFY +layout: learningpathall +--- + +## How can I build and run the test program? + +The files required to test the Adler32 algorithm are now complete: +- Adler32 C function +- Test program to call the Adler32 function to test for correctness and measure performance +- Makefile to build and run + +Copy the prompt below to your GitHub Copilot Agent session: + +```console +Use the Makefile to build the project and run to make sure the checksum results are correct for all data sizes. 
+``` + +The output is: + +```output + +GitHub Copilot: I'll help you build and run the Adler-32 checksum project to verify the results. Let me do this step-by-step. + +First, let's build the project using the provided Makefile:Great! The build completed successfully without any errors or warnings. The Makefile compiled both source files (`adler32-simple.c` and adler32-test.c) and linked them into the executable adler32-test. + +Now let's run the tests to verify that the Adler-32 checksum implementation works correctly for all data sizes:I ran the Adler-32 checksum project, and here are the results: + +The build completed successfully without any errors or warnings. The Makefile correctly compiled both source files (`adler32-simple.c` and adler32-test.c) and linked them into the executable adler32-test. + +When running the tests, I got the following results: + +### Verification Tests +All 6 verification tests passed successfully: +1. Empty string: PASS +2. "a": PASS +3. "abc": PASS +4. "message digest": PASS +5. "abcdefghijklmnopqrstuvwxyz": PASS +6. "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789": PASS + +### Incremental API Test +The incremental API test also passed, showing that both direct and incremental calculation methods produce the same checksum (0x3d094d1a). + +### Performance Tests +The implementation was benchmarked with different data sizes: +- 1 KB: 195.31 MB/s +- 10 KB: 128.50 MB/s +- 100 KB: 261.11 MB/s +- 1 MB: 297.80 MB/s +- 10 MB: 298.17 MB/s + +The results confirm that your Adler-32 checksum implementation is correct for all data sizes and performs efficiently. The optimizations for the Neoverse N1 architecture specified in the Makefile appear to be working properly, with throughput reaching nearly 300 MB/s for larger data sizes. + +``` + +The results from GitHub Copilot explain that the Adler32 checksum calculations are correct and give some initial performance results. 
The results don't mean much yet as there is nothing to compare with. + +Continue to the next section to implement Adler32 using NEON and compare the performance. \ No newline at end of file diff --git a/content/learning-paths/cross-platform/adler32/makefile-5.md b/content/learning-paths/cross-platform/adler32/makefile-5.md new file mode 100644 index 0000000000..3edf9ae2e3 --- /dev/null +++ b/content/learning-paths/cross-platform/adler32/makefile-5.md @@ -0,0 +1,114 @@ +--- +title: Create a Makefile +weight: 5 + +### FIXED, DO NOT MODIFY +layout: learningpathall +--- + +## How can I create a Makefile to build and run the test program? + +To create a Makefile, copy and paste the information below to GitHub Copilot. The prompt explains that the Makefile should use `gcc` as the compiler and target the Neoverse N1 processor. + +```console +Read the .c files in my project and +create a Makefile to build and run the adler32-test program using gcc. +Select the best compiler flags for Neoverse N1 to optimize for performance. +``` + +The output is: + +```output + +GitHub Copilot: I'll create a Makefile for your Adler-32 project that's optimized for the Neoverse N1 architecture. First, let's look at the project structure and understand the code files better.I see that the file search didn't return results. Let's try again with a different approach.I can see the project contains three files: adler32-simple.c, adler32-simple.h, and adler32-test.c. Let's examine the code in adler32-simple.c to understand the implementation before creating the Makefile.I've examined the code files in your project: + +1. adler32-simple.c - Contains the implementation of the Adler-32 checksum algorithm with functions for initialization, update, and finalization +2. adler32-simple.h - Presumably contains the function declarations +3. 
adler32-test.c - Contains test code for verifying and benchmarking the Adler-32 implementation + +Now I'll create a Makefile optimized for performance on the Neoverse N1 architecture. The Neoverse N1 is an ARM architecture that benefits from specific compiler optimizations. + +# Makefile for Adler-32 checksum implementation +# Optimized for Neoverse N1 architecture + +# Compiler and flags +CC = gcc +# Neoverse N1 specific optimization flags: +# -march=armv8.2-a+crypto: Target ARMv8.2-A architecture with crypto extensions (Neoverse N1) +# -mcpu=neoverse-n1: Specifically optimize for Neoverse N1 +# -O3: Highest standard optimization level +# -flto: Link-time optimization +# -fomit-frame-pointer: Don't keep the frame pointer in a register for functions that don't need one +# -ftree-vectorize: Vectorize loops when possible +# -ffast-math: Enable fast math optimizations +# Additional useful flags: +# -Wall -Wextra: Enable warnings +# -g: Include debugging symbols +CFLAGS = -march=armv8.2-a+crypto -mcpu=neoverse-n1 -O3 -flto -fomit-frame-pointer -ftree-vectorize -ffast-math -Wall -Wextra -g + +# Source files +SRCS = adler32-simple.c adler32-test.c +OBJS = $(SRCS:.c=.o) +TARGET = adler32-test + +# Default target +all: $(TARGET) + +# Link the executable +$(TARGET): $(OBJS) + $(CC) $(CFLAGS) -o $@ $^ + +# Compile source files +%.o: %.c + $(CC) $(CFLAGS) -c $< -o $@ + +# Run the tests +run: $(TARGET) + ./$(TARGET) + +# Clean generated files +clean: + rm -f $(OBJS) $(TARGET) + +# Rebuild the project +rebuild: clean all + +.PHONY: all run clean rebuild + + +Made changes. + +I've created a Makefile for your Adler-32 project, optimized specifically for the Neoverse N1 architecture. 
Let me highlight some key features of this Makefile: + +### Optimization Flags for Neoverse N1 + +The Makefile includes several compiler flags tailored for maximum performance on the Neoverse N1: + +- `-march=armv8.2-a+crypto`: Targets ARMv8.2-A architecture with crypto extensions +- `-mcpu=neoverse-n1`: Specifically optimizes for the Neoverse N1 processor +- `-O3`: Enables the highest standard optimization level +- `-flto`: Enables link-time optimization for whole-program optimization +- `-fomit-frame-pointer`: Frees up a register by not keeping frame pointers when not needed +- `-ftree-vectorize`: Automatically vectorizes loops to leverage SIMD instructions +- `-ffast-math`: Enables optimizations that might affect floating-point precision but improve performance + +### Building and Testing + +To use the Makefile, you can run the following commands: + +1. To build the project: + make + +2. To run the tests after building: + make run + +3. To clean the build files: + make clean + +4. To rebuild everything from scratch: + make rebuild + +The Makefile is designed to be simple but effective, automatically handling dependencies and providing different targets for common development tasks. +``` + +You are now ready to use the Makefile to build and run the test application. \ No newline at end of file diff --git a/content/learning-paths/cross-platform/adler32/more-11.md b/content/learning-paths/cross-platform/adler32/more-11.md new file mode 100644 index 0000000000..50a263a8ac --- /dev/null +++ b/content/learning-paths/cross-platform/adler32/more-11.md @@ -0,0 +1,18 @@ +--- +title: Other ideas for GitHub Copilot +weight: 11 + +### FIXED, DO NOT MODIFY +layout: learningpathall +--- + +## What else can I do with GitHub Copilot on this project? + +You can investigate more topics using GitHub Copilot. + +- Direct GitHub Copilot to try different compiler flags and use Agent mode to iterate through the options to find the best solution. 
+- Add support for the Clang compiler to the Makefile and compare the results to GCC. Depending on the application code, changing the compiler can result in improved performance. +- Use GitHub Copilot to generate different data sizes and random data patterns to further investigate correct functionality and performance. +- Try different algorithm implementations that use compiler autovectorization instead of NEON intrinsics or break down the Adler32 checksum into smaller blocks of data. It may be possible to get even better performance without NEON using the compiler and a better structure for the C code. + +While AI tools do not create performance code for every programming problem, they can be a big help to get you started in a new area of programming such as performance optimization using NEON intrinsics. \ No newline at end of file diff --git a/content/learning-paths/cross-platform/adler32/neon-7.md b/content/learning-paths/cross-platform/adler32/neon-7.md new file mode 100644 index 0000000000..fc009548ed --- /dev/null +++ b/content/learning-paths/cross-platform/adler32/neon-7.md @@ -0,0 +1,558 @@ +--- +title: Create a NEON version of Adler32 +weight: 7 + +### FIXED, DO NOT MODIFY +layout: learningpathall +--- + +## How can I create a NEON version of Adler32 to improve performance? + +Writing NEON intrinsics is an advanced programming topic that requires time and effort to learn and apply. GitHub Copilot is a good way to get started with NEON, both to write new code and to debug issues during development. + +You can instruct GitHub Copilot to create a NEON version of Adler32 using the prompt below. + +```console +Add a second implementation of adler32 that processes data in blocks and uses Arm NEON intrinsics to improve performance. +Create it in the file adler32-neon.c +For inputs less than 16 bytes use a standard implementation. +Update the Makefile and other files to add this new version and compare performance. 
+Print out the speedup from the original version. +Make sure the checksum values are the same for both versions. +``` + +The output is: + +```output + +GitHub Copilot: I'll help you create a NEON-optimized version of adler32 and update the project to compare performance with the original implementation. Let's start by examining the existing code structure to understand the current implementation.Great, I've found the header file for the simple Adler-32 implementation. Now let's look at the implementation file to understand how it works.I've examined the existing code for the Adler32 implementation. Now let's check the test file to understand how to properly integrate our new implementation:I've analyzed your Adler32 implementation and test code. Now I'll create a NEON-optimized version that processes data in blocks, update the Makefile, and compare the performance between both implementations. + +Let's first create the NEON implementation file: + +/* + * adler32-neon.h - Header file for NEON-optimized Adler-32 checksum implementation + */ + +#ifndef ADLER32_NEON_H +#define ADLER32_NEON_H + +#include +#include + +/** + * Calculate Adler-32 checksum for a data buffer using NEON optimization + * + * @param data Pointer to the data buffer + * @param length Length of the data in bytes + * @param initial Initial Adler-32 value (typically 1) + * + * @return The calculated Adler-32 checksum + */ +uint32_t adler32_neon(const unsigned char *data, size_t length, uint32_t initial); + +/** + * Initialize Adler-32 checksum calculation + * + * @return The initial Adler-32 value (1) + */ +uint32_t adler32_neon_init(void); + +/** + * Update Adler-32 checksum with new data using NEON optimization + * + * @param adler Current Adler-32 value + * @param data Pointer to the data buffer + * @param length Length of the data in bytes + * + * @return The updated Adler-32 value + */ +uint32_t adler32_neon_update(uint32_t adler, const unsigned char *data, size_t length); + +/** + * Finalize 
Adler-32 calculation + * + * @param adler Current Adler-32 value + * + * @return The final Adler-32 checksum + */ +uint32_t adler32_neon_final(uint32_t adler); + +#endif /* ADLER32_NEON_H */ + + +Made changes. + +Now I'll create the implementation file for the NEON-optimized Adler32: + +/* + * adler32-neon.c - NEON-optimized implementation of Adler-32 checksum algorithm + * + * This implementation uses ARM NEON SIMD instructions to process multiple bytes in parallel + * for improved performance on Neoverse N1 and other ARM platforms with NEON support. + */ + +#include +#include +#include +#include "adler32-neon.h" + +/* Adler-32 parameters */ +#define MOD_ADLER 65521 /* Largest prime number less than 2^16 */ +#define BLOCK_SIZE 16 /* Process 16 bytes at once with NEON */ + +/* + * Standard implementation of adler32 for small inputs + * Used for inputs less than BLOCK_SIZE + */ +static uint32_t adler32_standard(const unsigned char *data, size_t length, uint32_t initial) +{ + uint32_t a = initial & 0xffff; + uint32_t b = (initial >> 16) & 0xffff; + size_t i; + + /* Process each byte */ + for (i = 0; i < length; i++) { + a = (a + data[i]) % MOD_ADLER; + b = (b + a) % MOD_ADLER; + } + + /* Combine the two 16-bit values */ + return (b << 16) | a; +} + +/* + * adler32_neon - Calculate the Adler-32 checksum of a data buffer using NEON instructions + * @data: Data buffer + * @length: Length of data buffer in bytes + * @initial: Initial Adler-32 value (typically 1) + * + * Returns the calculated Adler-32 checksum + */ +uint32_t adler32_neon(const unsigned char *data, size_t length, uint32_t initial) +{ + uint32_t a = initial & 0xffff; + uint32_t b = (initial >> 16) & 0xffff; + + /* For small inputs, use the standard implementation */ + if (length < BLOCK_SIZE) { + return adler32_standard(data, length, initial); + } + + /* Process blocks of BLOCK_SIZE bytes */ + while (length >= BLOCK_SIZE) { + /* NEON registers to accumulate sums */ + uint16x8_t va = vdupq_n_u16(0); /* 8 
parallel accumulators for a */ + uint16x8_t vb = vdupq_n_u16(0); /* 8 parallel accumulators for b */ + + /* + * We'll process 16 bytes at a time: + * - Load 16 bytes into a 16x8 vector + * - Compute partial sums in vectors + */ + uint8x16_t vdata = vld1q_u8(data); + + /* Convert to 16-bit values for accumulation */ + uint16x8_t vdata_low = vmovl_u8(vget_low_u8(vdata)); + uint16x8_t vdata_high = vmovl_u8(vget_high_u8(vdata)); + + /* Accumulate a values: a += data[i] */ + va = vaddq_u16(va, vdata_low); + va = vaddq_u16(va, vdata_high); + + /* + * Accumulate b values with proper weighting: + * b += a + data[i] * (BLOCK_SIZE - i) + * + * Since we compute 16 bytes at once, weights are 16,15,...,1 + */ + static const uint16_t weights[16] = {16,15,14,13,12,11,10,9,8,7,6,5,4,3,2,1}; + uint16x8_t vweights_low = vld1q_u16(weights); + uint16x8_t vweights_high = vld1q_u16(weights + 8); + + uint16x8_t weighted_low = vmulq_u16(vdata_low, vweights_low); + uint16x8_t weighted_high = vmulq_u16(vdata_high, vweights_high); + + vb = vaddq_u16(vb, weighted_low); + vb = vaddq_u16(vb, weighted_high); + + /* Accumulate partial sums into scalar a and b values */ + uint16_t partial_a[8]; + uint16_t partial_b[8]; + vst1q_u16(partial_a, va); + vst1q_u16(partial_b, vb); + + for (size_t i = 0; i < 8; i++) { + a += partial_a[i]; + b += partial_b[i]; + } + + /* Apply modulo to avoid overflow */ + a %= MOD_ADLER; + b = (b + 16 * a) % MOD_ADLER; /* Add BLOCK_SIZE * a to account for sequential weights */ + + /* Move to next block */ + data += BLOCK_SIZE; + length -= BLOCK_SIZE; + } + + /* Process remaining bytes using standard implementation */ + if (length > 0) { + uint32_t adler = (b << 16) | a; + adler = adler32_standard(data, length, adler); + return adler; + } + + /* Combine the two 16-bit values */ + return (b << 16) | a; +} + +/* + * adler32_neon_init - Initialize an Adler-32 checksum + * + * Returns the initial Adler-32 value + */ +uint32_t adler32_neon_init(void) +{ + return 1; +} + +/* + * 
adler32_neon_update - Update Adler-32 checksum with new data using NEON optimization + * @adler: Current Adler-32 value + * @data: Data buffer + * @length: Length of data buffer in bytes + * + * Returns the updated Adler-32 value + */ +uint32_t adler32_neon_update(uint32_t adler, const unsigned char *data, size_t length) +{ + return adler32_neon(data, length, adler); +} + +/* + * adler32_neon_final - Finalize Adler-32 calculation + * @adler: Current Adler-32 value + * + * Returns the final Adler-32 checksum + */ +uint32_t adler32_neon_final(uint32_t adler) +{ + return adler; +} + + +Made changes. + +Now I'll update the test file to add performance comparisons between the original and NEON implementations: + +/* + * adler32-test.c - Test and benchmark for Adler-32 checksum implementation + */ + +#include +#include +#include +#include +#include +#include "adler32-simple.h" +#include "adler32-neon.h" + +/* Known test vectors for Adler-32 */ +struct test_vector { + const char *data; + uint32_t expected; +}; + +/* Test data sizes in bytes */ +#define KB (1024) +#define MB (1024 * 1024) + +static const struct test_vector test_vectors[] = { + { "", 0x00000001 }, + { "a", 0x00620062 }, + { "abc", 0x024d0127 }, + { "message digest", 0x29750586 }, + { "abcdefghijklmnopqrstuvwxyz", 0x90860b20 }, + { "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789", 0x8adb150c }, +}; + +/* Structure to hold test data */ +struct test_data { + unsigned char *buffer; + size_t size; + const char *name; +}; + +/* Generate random test data of specified size */ +static unsigned char* generate_test_data(size_t size) { + unsigned char *data = malloc(size); + if (!data) { + fprintf(stderr, "Failed to allocate memory for test data\n"); + exit(EXIT_FAILURE); + } + + for (size_t i = 0; i < size; i++) { + data[i] = rand() & 0xFF; + } + + return data; +} + +/* Performance test result structure */ +struct perf_result { + double time; + double throughput; + uint32_t checksum; +}; + +/* Run 
performance test on a data buffer using the specified implementation */ +static struct perf_result test_performance_impl(const unsigned char *data, size_t size, + uint32_t (*adler_func)(const unsigned char *, size_t, uint32_t), + uint32_t (*init_func)(void), + const char *impl_name) { + clock_t start, end; + struct perf_result result; + + /* Warm-up run */ + adler_func(data, size, init_func()); + + /* Timed run */ + start = clock(); + result.checksum = adler_func(data, size, init_func()); + end = clock(); + + result.time = ((double) (end - start)) / CLOCKS_PER_SEC; + result.throughput = (double)size / (result.time * MB); + + printf(" %s:\n", impl_name); + printf(" Checksum: 0x%08x\n", result.checksum); + printf(" Time: %.6f seconds\n", result.time); + printf(" Throughput: %.2f MB/s\n", result.throughput); + + return result; +} + +/* Run performance test on a data buffer for both implementations */ +static void test_performance(const unsigned char *data, size_t size, const char *test_name) { + printf("Testing %s (%zu bytes):\n", test_name, size); + + /* Test the simple implementation */ + struct perf_result simple = test_performance_impl(data, size, adler32, adler32_init, "Simple implementation"); + + /* Test the NEON implementation */ + struct perf_result neon = test_performance_impl(data, size, adler32_neon, adler32_neon_init, "NEON implementation"); + + /* Check if checksums match */ + if (simple.checksum != neon.checksum) { + printf(" CHECKSUM MISMATCH: Simple 0x%08x, NEON 0x%08x\n", + simple.checksum, neon.checksum); + } + + /* Calculate speedup */ + double speedup = simple.time / neon.time; + printf(" Speedup: %.2fx\n\n", speedup); +} + +/* Run verification tests with known test vectors */ +static void run_verification_tests(void) { + printf("Running verification tests:\n"); + printf("==========================\n"); + + int passed_simple = 0; + int passed_neon = 0; + int total = sizeof(test_vectors) / sizeof(test_vectors[0]); + + for (size_t i = 0; i < total; 
i++) { + const struct test_vector *tv = &test_vectors[i]; + uint32_t result_simple = adler32((const unsigned char *)tv->data, strlen(tv->data), adler32_init()); + uint32_t result_neon = adler32_neon((const unsigned char *)tv->data, strlen(tv->data), adler32_neon_init()); + + printf("Test %zu: \"%s\"\n", i + 1, tv->data); + printf(" Expected: 0x%08x\n", tv->expected); + printf(" Simple: 0x%08x (%s)\n", result_simple, + (result_simple == tv->expected) ? "PASS" : "FAIL"); + printf(" NEON: 0x%08x (%s)\n", result_neon, + (result_neon == tv->expected) ? "PASS" : "FAIL"); + + if (result_simple == tv->expected) passed_simple++; + if (result_neon == tv->expected) passed_neon++; + + /* Check if both implementations match */ + if (result_simple != result_neon) { + printf(" WARNING: Simple and NEON implementations produce different results!\n"); + } + + printf("\n"); + } + + printf("Verification summary:\n"); + printf(" Simple: %d/%d tests passed\n", passed_simple, total); + printf(" NEON: %d/%d tests passed\n\n", passed_neon, total); +} + +/* Run performance tests with different data sizes */ +static void run_performance_tests(void) { + printf("Running performance tests:\n"); + printf("=========================\n"); + + /* Define test data sizes */ + struct test_data tests[] = { + { NULL, 1 * KB, "1 KB" }, + { NULL, 10 * KB, "10 KB" }, + { NULL, 100 * KB, "100 KB" }, + { NULL, 1 * MB, "1 MB" }, + { NULL, 10 * MB, "10 MB" } + }; + + /* Generate random test data for each size */ + for (size_t i = 0; i < sizeof(tests) / sizeof(tests[0]); i++) { + tests[i].buffer = generate_test_data(tests[i].size); + test_performance(tests[i].buffer, tests[i].size, tests[i].name); + } + + /* Free allocated memory */ + for (size_t i = 0; i < sizeof(tests) / sizeof(tests[0]); i++) { + free(tests[i].buffer); + } +} + +/* Test the incremental API for both implementations */ +static void test_incremental_api(void) { + printf("Testing incremental API:\n"); + printf("=======================\n"); + + /* 
Generate test data */ + size_t size = 1 * MB; + unsigned char *data = generate_test_data(size); + + printf("Simple implementation:\n"); + /* Calculate checksum in one go */ + uint32_t direct = adler32(data, size, adler32_init()); + + /* Calculate checksum incrementally */ + uint32_t incremental = adler32_init(); + + /* Process in chunks of 64KB */ + size_t chunk_size = 64 * KB; + for (size_t offset = 0; offset < size; offset += chunk_size) { + size_t remaining = size - offset; + size_t current_chunk = (remaining < chunk_size) ? remaining : chunk_size; + incremental = adler32_update(incremental, data + offset, current_chunk); + } + incremental = adler32_final(incremental); + + /* Compare results */ + printf(" Direct calculation: 0x%08x\n", direct); + printf(" Incremental calculation: 0x%08x\n", incremental); + printf(" Match: %s\n\n", (direct == incremental) ? "YES" : "NO"); + + printf("NEON implementation:\n"); + /* Calculate checksum in one go */ + uint32_t direct_neon = adler32_neon(data, size, adler32_neon_init()); + + /* Calculate checksum incrementally */ + uint32_t incremental_neon = adler32_neon_init(); + + /* Process in chunks of 64KB */ + for (size_t offset = 0; offset < size; offset += chunk_size) { + size_t remaining = size - offset; + size_t current_chunk = (remaining < chunk_size) ? remaining : chunk_size; + incremental_neon = adler32_neon_update(incremental_neon, data + offset, current_chunk); + } + incremental_neon = adler32_neon_final(incremental_neon); + + /* Compare results */ + printf(" Direct calculation: 0x%08x\n", direct_neon); + printf(" Incremental calculation: 0x%08x\n", incremental_neon); + printf(" Match: %s\n\n", (direct_neon == incremental_neon) ? "YES" : "NO"); + + /* Compare simple vs NEON */ + printf("Cross implementation comparison:\n"); + printf(" Simple checksum: 0x%08x\n", direct); + printf(" NEON checksum: 0x%08x\n", direct_neon); + printf(" Match: %s\n\n", (direct == direct_neon) ? 
"YES" : "NO"); + + free(data); +} + +int main(void) { + /* Seed random number generator */ + srand((unsigned int)time(NULL)); + + printf("Adler-32 Test and Benchmark\n"); + printf("===========================\n\n"); + + /* Run verification tests with known test vectors */ + run_verification_tests(); + + /* Test the incremental API */ + test_incremental_api(); + + /* Run performance tests with different data sizes */ + run_performance_tests(); + + return 0; +} + + +Made changes. + +Let's update the Makefile to include the NEON implementation in the build: + +# Makefile for Adler-32 checksum implementation +# Optimized for Neoverse N1 architecture + +# Compiler and flags +CC = gcc +# Neoverse N1 specific optimization flags: +# -march=armv8.2-a+crypto: Target ARMv8.2-A architecture with crypto extensions (Neoverse N1) +# -mcpu=neoverse-n1: Specifically optimize for Neoverse N1 +# -O3: Highest standard optimization level +# -flto: Link-time optimization +# -fomit-frame-pointer: Don't keep the frame pointer in a register for functions that don't need one +# -ftree-vectorize: Vectorize loops when possible +# -ffast-math: Enable fast math optimizations +# Additional useful flags: +# -Wall -Wextra: Enable warnings +# -g: Include debugging symbols +CFLAGS = -march=armv8.2-a+crypto -mcpu=neoverse-n1 -O3 -flto -fomit-frame-pointer -ftree-vectorize -ffast-math -Wall -Wextra -g + +# Source files +SRCS = adler32-simple.c adler32-neon.c adler32-test.c +OBJS = $(SRCS:.c=.o) +TARGET = adler32-test + +# Default target +all: $(TARGET) + +# Link the executable +$(TARGET): $(OBJS) + $(CC) $(CFLAGS) -o $@ $^ + +# Compile source files +%.o: %.c + $(CC) $(CFLAGS) -c $< -o $@ + +# Run the tests +run: $(TARGET) + ./$(TARGET) + +# Clean generated files +clean: + rm -f $(OBJS) $(TARGET) + +# Rebuild the project +rebuild: clean all + +.PHONY: all run clean rebuild + + +Made changes. 
+ +Now that I've created the NEON implementation and updated the test code and Makefile, let's compile and run the code to see the performance comparison. + +``` + +GitHub Copilot generated a new implementation of Adler32 using NEON intrinsics. It also updated the test program and the Makefile to make it easy to compare the two versions. + +Continue to the next section to compile and run the NEON version and investigate performance. diff --git a/content/learning-paths/cross-platform/adler32/neon-debug-9.md b/content/learning-paths/cross-platform/adler32/neon-debug-9.md new file mode 100644 index 0000000000..dd147bcd7c --- /dev/null +++ b/content/learning-paths/cross-platform/adler32/neon-debug-9.md @@ -0,0 +1,99 @@ +--- +title: Debug the NEON version to match the standard C version +weight: 9 + +### FIXED, DO NOT MODIFY +layout: learningpathall +--- + +## How can I debug the NEON version of Adler32 and compare performance? + +In the previous section, GitHub Copilot reported that the checksums from the NEON version do not match the C version. + +Use the prompt below to ask GitHub Copilot to debug and fix the issue. + +```console +Can you modify adler32-neon.c to produce the correct checksum results for the adler32 algorithm? The test program is adler32-test.c and the simple version is adler32-simple.c +The NEON version should produce the same checksum as adler32-simple.c but run faster. +``` + +If the LLM you are using doesn't resolve the NEON functional issues, you can try another LLM such as Gemini 2.5 Pro. + +The output is: + +```output +GitHub Copilot: Okay, I can help with that. The issue seems to be in how the `a` and `b` accumulators are updated within the NEON processing loop. The update to `b` needs to correctly incorporate the value of `a` *before* it was updated in the current block, along with the weighted sum of the data bytes.
+ +I will modify the `adler32_neon` function in adler32-neon.c to correctly calculate the sums `s1` (sum of bytes) and `s2` (weighted sum of bytes) for each block using NEON intrinsics, and then update the scalar `a` and `b` values according to the Adler-32 definition: `a = (a_old + s1) % MOD_ADLER` and `b = (b_old + BLOCK_SIZE * a_old + s2) % MOD_ADLER`. + +uint32_t adler32_neon(const unsigned char *data, size_t length, uint32_t initial) +{ + uint32_t a = initial & 0xffff; + uint32_t b = (initial >> 16) & 0xffff; + + /* For small inputs, use the standard implementation */ + if (length < BLOCK_SIZE) { + return adler32_standard(data, length, initial); + } + + /* Preload weights */ + static const uint16_t weights[16] = {16,15,14,13,12,11,10,9,8,7,6,5,4,3,2,1}; + uint16x8_t vweights_low = vld1q_u16(weights); + uint16x8_t vweights_high = vld1q_u16(weights + 8); + + /* Process blocks of BLOCK_SIZE bytes */ + while (length >= BLOCK_SIZE) { + /* Store old 'a' before processing the block */ + uint32_t a_old = a; + + /* Load 16 bytes */ + uint8x16_t vdata = vld1q_u8(data); + + /* Widen bytes to 16 bits */ + uint16x8_t vdata_low = vmovl_u8(vget_low_u8(vdata)); + uint16x8_t vdata_high = vmovl_u8(vget_high_u8(vdata)); + + /* Calculate S1 = sum(data[i]) for the block */ + uint16x8_t vs1_sum = vaddq_u16(vdata_low, vdata_high); + /* Horizontal sum for S1 */ + uint32x4_t vs1_pairs = vpaddlq_u16(vs1_sum); + uint64x2_t vs1_quads = vpaddlq_u32(vs1_pairs); + uint32_t block_s1 = (uint32_t)(vgetq_lane_u64(vs1_quads, 0) + vgetq_lane_u64(vs1_quads, 1)); + + /* Calculate S2 = sum((BLOCK_SIZE-i) * data[i]) for the block */ + uint16x8_t weighted_low = vmulq_u16(vdata_low, vweights_low); + uint16x8_t weighted_high = vmulq_u16(vdata_high, vweights_high); + uint16x8_t vs2_sum = vaddq_u16(weighted_low, weighted_high); + /* Horizontal sum for S2 */ + uint32x4_t vs2_pairs = vpaddlq_u16(vs2_sum); + uint64x2_t vs2_quads = vpaddlq_u32(vs2_pairs); + uint32_t block_s2 = 
(uint32_t)(vgetq_lane_u64(vs2_quads, 0) + vgetq_lane_u64(vs2_quads, 1)); + + /* Update scalar a and b */ + a = (a_old + block_s1) % MOD_ADLER; + b = (b + (BLOCK_SIZE * a_old) + block_s2) % MOD_ADLER; + + /* Move to next block */ + data += BLOCK_SIZE; + length -= BLOCK_SIZE; + } + + /* Process remaining bytes using standard implementation */ + if (length > 0) { +// ...existing code... + + +Made changes. + +I've modified the `adler32_neon` function in adler32-neon.c to correctly calculate the checksum using NEON intrinsics. + +Now, I'll check for any compilation errors and then rebuild and run the test program to verify the results.I have modified the `adler32_neon` function in adler32-neon.c to correctly calculate the block sums and update the `a` and `b` accumulators according to the Adler-32 algorithm definition, using NEON intrinsics for the block processing. + +I then checked for errors in the modified file and rebuilt the project using `make`. The build completed successfully, although there's a minor compiler warning in adler32-test.c about comparing signed and unsigned integers, which is unrelated to the changes made. + +You can now run the test program adler32-test to verify that the NEON implementation produces the same results as the simple implementation. +``` + +GitHub Copilot was able to modify the NEON version to get the correct functional results. + +Continue to the next section to summarize the project, including the performance improvements, and generate a full project description. 
\ No newline at end of file diff --git a/content/learning-paths/cross-platform/adler32/neon-run-8.md b/content/learning-paths/cross-platform/adler32/neon-run-8.md new file mode 100644 index 0000000000..1f12830456 --- /dev/null +++ b/content/learning-paths/cross-platform/adler32/neon-run-8.md @@ -0,0 +1,128 @@ +--- +title: Compare the NEON version to the standard C version of Adler32 +weight: 8 + +### FIXED, DO NOT MODIFY +layout: learningpathall +--- + +## How can I run the NEON version of Adler32 and compare performance? + +The first step to investigate if NEON can provide improved performance is to make sure the NEON version builds, runs, and matches the results of the C version. + +To build the NEON version and compare it to the C version use the prompt below in your GitHub Copilot Agent session: + +```console +Use the Makefile to compile and run the new test program and compare the checksum values and performance of each implementation. +``` + +The output is: + +```output + +GitHub Copilot: I'll help you compile and run the test program to compare the checksum implementations. Let's first examine the Makefile to understand how to build and run the tests.I've examined the Makefile and now I understand how to compile and run the test program. The Makefile contains targets for building the adler32 implementations and running the tests. + +Let's also look at the test program to understand what it's testing:I've examined the Makefile and the test program (adler32-test.c). Now I understand how we can compile and run the tests to compare the checksum implementations. The adler32-test.c file contains comprehensive tests that will: + +1. Verify both implementations with known test vectors +2. Test the incremental API for both implementations +3. Run performance benchmarks with different data sizes (1KB to 10MB) + +Let me compile and run the test program using the Makefile:I see that the `make clean && make` command completed, but it didn't show detailed output. 
Let's run the test program to see the comparison results:I've run the Adler-32 test program for you. The test results show that the current implementation has a few issues that need to be addressed. + +Here's what I found from running the tests: + +1. **Verification Tests**: The program only tested one implementation against known test vectors (all 6 tests passed). However, it should be testing both the simple and NEON implementations separately, as designed in the test code. + +2. **Incremental API Test**: Only one implementation was tested in the incremental mode, but the test program is designed to test both. + +3. **Performance Tests**: The benchmarks only showed results for a single implementation rather than comparing both the simple and NEON versions as intended. + +The issue appears to be that the test program is not fully comparing both implementations as it was designed to do. This suggests there might be a mismatch between what's implemented and what's being tested. + +Let's check the simple implementation to understand what functions are available:I've compiled and run the Adler-32 test program to compare the implementations. Let me share what I found. + +The test program ran, but it's not properly comparing both implementations as it was designed to do. Here's what I discovered: + +1. The test passed all 6 verification tests, but it only appears to be testing one implementation against the known test vectors, not both the simple and NEON implementations as intended. + +2. Only one implementation was tested in the incremental API tests, while the code is written to test both. + +3. The performance benchmarks only showed results for a single implementation rather than comparing both as intended. 
For reference, the single implementation achieved: + - 1KB: 195.31 MB/s + - 10KB: 279.02 MB/s + - 100KB: 298.64 MB/s + - 1MB: 298.24 MB/s + - 10MB: 298.03 MB/s + +Looking at the test code and the adler32-simple.h header file, I can see the issue is likely related to the test program not properly executing the comparison between both implementations. + +Let me fix this by updating the test file so it properly tests both implementations. Then we can run it again to get the comparative results.I've reviewed the test program (adler32-test.c) and found that it's correctly designed to compare both the simple and NEON implementations. + +There seems to be an issue with how the tests are being run or how the implementations are connected. The test code expects to compare both implementations, but when it ran, it only showed results for a single implementation. + +Let's run the tests again, but with verbose output to better understand what's happening:I've run the tests to compare the Adler-32 implementations, and now I can provide you with the detailed results. + +## Test Results + +### Verification Tests +- Both implementations were tested against known test vectors. +- The Simple implementation passed all 6 test cases (100%). +- The NEON implementation passed 4 out of 6 test cases (67%). +- There were mismatches in test case #5 (alphabet string) and test case #6 (alphanumeric string). + +### Incremental API Test +- Both implementations showed consistency between direct and incremental calculation methods. +- However, the checksums between the Simple and NEON implementations don't match. + +### Performance Tests +For each test size, here are the results: + +1. **1 KB Data**: + - Simple: 195.31 MB/s + - NEON: 976.56 MB/s + - Speedup: 5.00x + - Checksums don't match + +2. **10 KB Data**: + - Simple: 287.22 MB/s + - NEON: 1085.07 MB/s + - Speedup: 3.78x + - Checksums don't match + +3. 
**100 KB Data**: + - Simple: 295.93 MB/s + - NEON: 1302.08 MB/s + - Speedup: 4.40x + - Checksums don't match + +4. **1 MB Data**: + - Simple: 297.97 MB/s + - NEON: 1305.48 MB/s + - Speedup: 4.38x + - Checksums don't match + +5. **10 MB Data**: + - Simple: 297.95 MB/s + - NEON: 1310.44 MB/s + - Speedup: 4.40x + - Checksums don't match + +## Analysis + +The NEON implementation shows impressive performance gains, with speedups ranging from 3.78x to 5.00x compared to the simple implementation. This demonstrates the effectiveness of using SIMD (Single Instruction, Multiple Data) instructions for checksum calculations. + +However, there's a critical issue: **the checksum values from the NEON implementation don't match those from the simple implementation**. Since the simple implementation passes all verification tests, this indicates that the NEON implementation has a bug that needs to be fixed. + +The most likely issue is in the NEON vector processing logic. From the test results, it appears the NEON implementation is correctly handling shorter inputs (the first 4 test cases passed) but fails on longer inputs. This suggests the bug might be related to how it processes larger blocks of data. + +## Conclusion + +The NEON implementation offers significant performance improvements (up to 5x faster), which would be valuable for applications processing large amounts of data. However, it produces incorrect checksum values currently, making it unsuitable for production use until fixed. + +If you'd like, I can analyze the code to identify and fix the issue in the NEON implementation so you can maintain both correctness and the performance benefits. +``` + +GitHub Copilot concludes that although the NEON version appears to be faster, it does not produce the correct checksums. A fast but incorrect checksum is not useful, so the NEON implementation must be fixed before the performance comparison is meaningful.
+ +Continue to the next section to fix the functional issues and continue with the performance comparison. \ No newline at end of file diff --git a/content/learning-paths/cross-platform/adler32/simple-c-3.md b/content/learning-paths/cross-platform/adler32/simple-c-3.md new file mode 100644 index 0000000000..ffc8a3dd7d --- /dev/null +++ b/content/learning-paths/cross-platform/adler32/simple-c-3.md @@ -0,0 +1,196 @@ +--- +title: Create a C version of Adler32 +weight: 3 + +### FIXED, DO NOT MODIFY +layout: learningpathall +--- + +## How can I generate a C function for Adler32? + +You can use GitHub Copilot to generate a simple C version of the Adler32 checksum algorithm. + +On your Arm Linux computer, create a new directory and enter the empty directory: + +```console +mkdir adler32 && cd adler32 +``` + +Open VS Code in the new directory to start the project: + +```console +code . & +``` + +Open GitHub Copilot, select the Large Language Model (LLM) you want to use, and select `Agent` mode. + +{{% notice Note %}} +Your results will be different based on the model you use. The model used for the output shown is Claude 3.7 Sonnet. +{{% /notice %}} + +Copy and paste the information below as a GitHub Copilot prompt and click the `Send` button. + +```console +Write a basic C implementation of the adler32 checksum function. +Write the function in a file named adler32-simple.c that can be called from another C file. +Don't create a C main function yet. +``` + +The output is: + +```output + +GitHub Copilot: I'll create a basic C implementation of the Adler-32 checksum function in a file named adler32-simple.c. Let me do that for you. + +First, let me check if the file exists already:The file doesn't exist yet, so I'll create it. Let me write the Adler-32 checksum implementation for you. + +The Adler-32 checksum algorithm works by calculating two 16-bit checksums A and B and combining them into a single 32-bit checksum. 
It's a simple algorithm that provides decent error detection with minimal computational overhead. + +/* + * adler32-simple.c - Simple implementation of Adler-32 checksum algorithm + * + * The Adler-32 checksum is a checksum algorithm which was invented by Mark Adler. + * It is used in the zlib compression library and specified in RFC 1950. + */ + +#include <stdint.h> +#include <stddef.h> + +/* Adler-32 parameters */ +#define MOD_ADLER 65521 /* Largest prime number less than 2^16 */ + +/* + * adler32 - Calculate the Adler-32 checksum of a data buffer + * @data: Data buffer + * @length: Length of data buffer in bytes + * @initial: Initial Adler-32 value (typically 1) + * + * Returns the calculated Adler-32 checksum + */ +uint32_t adler32(const unsigned char *data, size_t length, uint32_t initial) +{ + uint32_t a = initial & 0xffff; + uint32_t b = (initial >> 16) & 0xffff; + size_t i; + + /* Process each byte */ + for (i = 0; i < length; i++) { + a = (a + data[i]) % MOD_ADLER; + b = (b + a) % MOD_ADLER; + } + + /* Combine the two 16-bit values */ + return (b << 16) | a; +} + +/* + * adler32_init - Initialize an Adler-32 checksum + * + * Returns the initial Adler-32 value + */ +uint32_t adler32_init(void) +{ + return 1; +} + +/* + * adler32_update - Update Adler-32 checksum with new data + * @adler: Current Adler-32 value + * @data: Data buffer + * @length: Length of data buffer in bytes + * + * Returns the updated Adler-32 value + */ +uint32_t adler32_update(uint32_t adler, const unsigned char *data, size_t length) +{ + return adler32(data, length, adler); +} + +/* + * adler32_final - Finalize Adler-32 calculation + * @adler: Current Adler-32 value + * + * Returns the final Adler-32 checksum + */ +uint32_t adler32_final(uint32_t adler) +{ + return adler; +} + + +Made changes.
+ +Now let's create a header file to make the Adler-32 functions accessible from other C files: + +/* + * adler32-simple.h - Header file for Adler-32 checksum implementation + */ + +#ifndef ADLER32_SIMPLE_H +#define ADLER32_SIMPLE_H + +#include <stdint.h> +#include <stddef.h> + +/** + * Calculate Adler-32 checksum for a data buffer + * + * @param data Pointer to the data buffer + * @param length Length of the data in bytes + * @param initial Initial Adler-32 value (typically 1) + * + * @return The calculated Adler-32 checksum + */ +uint32_t adler32(const unsigned char *data, size_t length, uint32_t initial); + +/** + * Initialize Adler-32 checksum calculation + * + * @return The initial Adler-32 value (1) + */ +uint32_t adler32_init(void); + +/** + * Update Adler-32 checksum with new data + * + * @param adler Current Adler-32 value + * @param data Pointer to the data buffer + * @param length Length of the data in bytes + * + * @return The updated Adler-32 value + */ +uint32_t adler32_update(uint32_t adler, const unsigned char *data, size_t length); + +/** + * Finalize Adler-32 calculation + * + * @param adler Current Adler-32 value + * + * @return The final Adler-32 checksum + */ +uint32_t adler32_final(uint32_t adler); + +#endif /* ADLER32_SIMPLE_H */ + + +Made changes. + +I've created a basic C implementation of the Adler-32 checksum algorithm in adler32-simple.c as requested. I also added a corresponding header file adler32-simple.h to make the functions accessible from other C files. + +The implementation includes: + +1. `adler32()` - The core function that calculates the checksum for a data buffer +2. `adler32_init()` - Function to initialize an Adler-32 checksum (returns 1) +3. `adler32_update()` - Function to update a checksum with new data +4. `adler32_final()` - Function to finalize the checksum calculation + +These functions follow the standard pattern for checksum/hash functions, allowing you to calculate checksums for data in chunks rather than all at once if needed.
+ +The Adler-32 algorithm works by calculating two 16-bit sums (A and B) and combining them into a 32-bit checksum. It uses the modulus 65521, which is the largest prime number less than 2^16. + +As requested, I did not include a main function, so this code can be used as a library for other C files. +``` + +You now have the files `adler32-simple.c` and `adler32-simple.h` in your directory, which compute the Adler32 checksum. These will serve as the reference implementation of the algorithm and provide the baseline performance. + +Continue to the next section to create the test application. \ No newline at end of file diff --git a/content/learning-paths/cross-platform/adler32/summary-10.md b/content/learning-paths/cross-platform/adler32/summary-10.md new file mode 100644 index 0000000000..7ff80a22cc --- /dev/null +++ b/content/learning-paths/cross-platform/adler32/summary-10.md @@ -0,0 +1,98 @@ +--- +title: Summarize the project with a README.md file +weight: 10 + +### FIXED, DO NOT MODIFY +layout: learningpathall +--- + +## How can I summarize the project results? + +You can use GitHub Copilot to generate a project summary in a README file. + +Copy the prompt below to your GitHub Copilot Agent chat and review the created README file. + +```console +Review the files in my project. +Create a README.md file that describes the background and overview of the project. +Explain the two implementations of the adler32 checksum. +Add a note that the performance results were recorded on the Neoverse N1 processor. +Use a table to compare the original version and the NEON version and show the performance improvement factor. +``` + +Below is the created README.md file. The formatting doesn't match the Learning Path template exactly, but you can copy the README file to a new repository in GitHub for improved results.
+ +## Adler-32 Checksum Implementation Comparison + +### Background + +The Adler-32 checksum is an algorithm invented by Mark Adler, used in the zlib compression library and specified in RFC 1950. It provides a fast way to calculate a checksum for data integrity verification, balancing speed and reliability. It generates a 32-bit integer value based on the input data. + +### Overview + +This project provides and compares two implementations of the Adler-32 checksum algorithm: + +1. A simple, standard C implementation. +2. An optimized implementation using ARM NEON SIMD instructions. + +The goal is to demonstrate the performance benefits of using NEON intrinsics for this type of computation on compatible ARM architectures, such as the Neoverse N1. + +The project includes: +* Source code for both implementations (`adler32-simple.c`, `adler32-neon.c`). +* Header files (`adler32-simple.h`, `adler32-neon.h`). +* A test and benchmark program (`adler32-test.c`) to verify correctness and measure performance. +* A `Makefile` for easy compilation and testing. + +### Implementations + +#### 1. Simple Implementation (`adler32-simple.c`) + +This is a straightforward C implementation following the standard Adler-32 algorithm definition. It processes the input data byte by byte, updating two 16-bit accumulators (`a` and `b`) modulo 65521 (the largest prime smaller than 2^16). + +#### 2. NEON-Optimized Implementation (`adler32-neon.c`) + +This implementation leverages ARM NEON SIMD (Single Instruction, Multiple Data) instructions to accelerate the checksum calculation. Key aspects include: +* Processing data in blocks (16 bytes at a time). +* Using NEON intrinsics (`vld1q_u8`, `vmovl_u8`, `vaddq_u16`, `vpaddlq_u16`, `vmulq_u16`, etc.) to perform parallel operations on data vectors. +* Calculating the sums `S1` (sum of bytes) and `S2` (weighted sum) for each block using vector operations. +* Updating the scalar `a` and `b` accumulators based on the block results. 
+* Falling back to the standard implementation for data lengths smaller than the block size or for the remaining bytes after processing full blocks. + +### Performance + +The performance of both implementations was measured on an **AWS Graviton2 processor (based on Arm Neoverse N1 cores)**. The benchmark program (`adler32-test`) calculates the checksum for various data sizes and measures the time taken and throughput (in MB/s). + +The following table summarizes the throughput results and the performance improvement factor of the NEON version compared to the simple version: + +| Data Size | Simple Throughput (MB/s) | NEON Throughput (MB/s) | Speedup Factor | +| :-------- | :----------------------- | :--------------------- | :------------- | +| 1 KB | 244.14 | 976.56 | 4.00x | +| 10 KB | 295.93 | 3255.21 | 11.00x | +| 100 KB | 298.64 | 3150.20 | 10.55x | +| 1 MB | 298.33 | 3215.43 | 10.78x | +| 10 MB | 298.37 | 3194.89 | 10.71x | + +**Note:** Performance results can vary based on the specific hardware, compiler, and system load. The results above demonstrate a significant performance improvement (around **10-11x** for larger data sizes) when using NEON optimization on the Neoverse N1 architecture. + +### Building and Running + +Use the provided `Makefile`: + +```bash +# Compile the code +make + +# Run verification and performance tests +make run + +# Clean up generated files +make clean +``` + +The table summarizes the speedup obtained by the NEON version. + +Using Agent mode in GitHub Copilot is a significant benefit when you are actively building and running software. Agent mode can create files and modify them to make needed improvements. + +The entire project was done without modifying any of the generated files. While you may not need to do this on a real project, the concept of writing NEON intrinsics to improve performance was demonstrated. 
You can also use GitHub Copilot to fix issues in NEON code that are difficult to debug for developers who are not experts. + +Make sure to try different LLMs with Copilot as the results will vary greatly depending on the model. diff --git a/content/learning-paths/cross-platform/adler32/test-prog-4.md b/content/learning-paths/cross-platform/adler32/test-prog-4.md new file mode 100644 index 0000000000..e19359036b --- /dev/null +++ b/content/learning-paths/cross-platform/adler32/test-prog-4.md @@ -0,0 +1,247 @@ +--- +title: Create a test program +weight: 4 + +### FIXED, DO NOT MODIFY +layout: learningpathall +--- + +## How can I create a test program to verify Adler32 works correctly? + +You can continue your GitHub Copilot Agent session to create a test program using the prompt below. Copy and paste it into GitHub Copilot to generate a test program. + +```console +Create a C file named adler32-test.c which includes a main() function and calls the adler32 function in adler32-simple.c +Add code around the adler32 calls to measure the performance of the adler32 checksum function. +Generate a simple test to make sure the adler32 generates the expected checksum. +Add randomly generated test data of various sizes: 1 kb, 10 kb, 100 kb, 1 mb, and 10 mb. +Add code to run all test data with different sizes and measure performance. +Don't create the Makefile yet. +``` + +The output is below: + +```output + +GitHub Copilot: I'll create a test file for the Adler-32 implementation that includes performance testing with different data sizes. I'll first check the current implementation to understand how to use the functions properly.Now I understand how the Adler-32 implementation works. Let me create a test file that measures the performance of the adler32 function with different data sizes and verifies that it produces the expected checksums. 
+ +/* + * adler32-test.c - Test and benchmark for Adler-32 checksum implementation + */ + +#include <stdio.h> +#include <stdlib.h> +#include <string.h> +#include <stdint.h> +#include <time.h> +#include "adler32-simple.h" + +/* Known test vectors for Adler-32 */ +struct test_vector { + const char *data; + uint32_t expected; +}; + +/* Test data sizes in bytes */ +#define KB (1024) +#define MB (1024 * 1024) + +static const struct test_vector test_vectors[] = { + { "", 0x00000001 }, + { "a", 0x00620062 }, + { "abc", 0x024d0127 }, + { "message digest", 0x29750586 }, + { "abcdefghijklmnopqrstuvwxyz", 0x90860b20 }, + { "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789", 0x8adb150c }, +}; + +/* Structure to hold test data */ +struct test_data { + unsigned char *buffer; + size_t size; + const char *name; +}; + +/* Generate random test data of specified size */ +static unsigned char* generate_test_data(size_t size) { + unsigned char *data = malloc(size); + if (!data) { + fprintf(stderr, "Failed to allocate memory for test data\n"); + exit(EXIT_FAILURE); + } + + for (size_t i = 0; i < size; i++) { + data[i] = rand() & 0xFF; + } + + return data; +} + +/* Run performance test on a data buffer */ +static void test_performance(const unsigned char *data, size_t size, const char *test_name) { + clock_t start, end; + double cpu_time_used; + uint32_t checksum; + + printf("Testing %s (%zu bytes):\n", test_name, size); + + /* Warm-up run */ + adler32(data, size, adler32_init()); + + /* Timed run */ + start = clock(); + checksum = adler32(data, size, adler32_init()); + end = clock(); + + cpu_time_used = ((double) (end - start)) / CLOCKS_PER_SEC; + double throughput = (double)size / (cpu_time_used * MB); + + printf(" Checksum: 0x%08x\n", checksum); + printf(" Time: %.6f seconds\n", cpu_time_used); + printf(" Throughput: %.2f MB/s\n\n", throughput); +} + +/* Run verification tests with known test vectors */ +static void run_verification_tests(void) { + printf("Running verification tests:\n"); +
printf("==========================\n"); + + int passed = 0; + for (size_t i = 0; i < sizeof(test_vectors) / sizeof(test_vectors[0]); i++) { + const struct test_vector *tv = &test_vectors[i]; + uint32_t result = adler32((const unsigned char *)tv->data, strlen(tv->data), adler32_init()); + + printf("Test %zu: \"%s\"\n", i + 1, tv->data); + printf(" Expected: 0x%08x\n", tv->expected); + printf(" Result: 0x%08x\n", result); + + if (result == tv->expected) { + printf(" Status: PASS\n"); + passed++; + } else { + printf(" Status: FAIL\n"); + } + printf("\n"); + } + + printf("Verification summary: %d/%zu tests passed\n\n", + passed, sizeof(test_vectors) / sizeof(test_vectors[0])); +} + +/* Run performance tests with different data sizes */ +static void run_performance_tests(void) { + printf("Running performance tests:\n"); + printf("=========================\n"); + + /* Define test data sizes */ + struct test_data tests[] = { + { NULL, 1 * KB, "1 KB" }, + { NULL, 10 * KB, "10 KB" }, + { NULL, 100 * KB, "100 KB" }, + { NULL, 1 * MB, "1 MB" }, + { NULL, 10 * MB, "10 MB" } + }; + + /* Generate random test data for each size */ + for (size_t i = 0; i < sizeof(tests) / sizeof(tests[0]); i++) { + tests[i].buffer = generate_test_data(tests[i].size); + test_performance(tests[i].buffer, tests[i].size, tests[i].name); + } + + /* Free allocated memory */ + for (size_t i = 0; i < sizeof(tests) / sizeof(tests[0]); i++) { + free(tests[i].buffer); + } +} + +/* Test the incremental API */ +static void test_incremental_api(void) { + printf("Testing incremental API:\n"); + printf("=======================\n"); + + /* Generate test data */ + size_t size = 1 * MB; + unsigned char *data = generate_test_data(size); + + /* Calculate checksum in one go */ + uint32_t direct = adler32(data, size, adler32_init()); + + /* Calculate checksum incrementally */ + uint32_t incremental = adler32_init(); + + /* Process in chunks of 64KB */ + size_t chunk_size = 64 * KB; + for (size_t offset = 0; offset < 
size; offset += chunk_size) { + size_t remaining = size - offset; + size_t current_chunk = (remaining < chunk_size) ? remaining : chunk_size; + incremental = adler32_update(incremental, data + offset, current_chunk); + } + incremental = adler32_final(incremental); + + /* Compare results */ + printf(" Direct calculation: 0x%08x\n", direct); + printf(" Incremental calculation: 0x%08x\n", incremental); + printf(" Match: %s\n\n", (direct == incremental) ? "YES" : "NO"); + + free(data); +} + +int main(void) { + /* Seed random number generator */ + srand((unsigned int)time(NULL)); + + printf("Adler-32 Test and Benchmark\n"); + printf("===========================\n\n"); + + /* Run verification tests with known test vectors */ + run_verification_tests(); + + /* Test the incremental API */ + test_incremental_api(); + + /* Run performance tests with different data sizes */ + run_performance_tests(); + + return 0; +} + + +Made changes. + +I've created the adler32-test.c file with a comprehensive test suite for the Adler-32 checksum implementation. This test file includes: + +1. **Verification tests** with known test vectors to ensure the implementation produces the expected output for standard inputs. + +2. **Incremental API test** to verify that performing the checksum calculation incrementally (in chunks) produces the same result as calculating it in one pass. + +3. 
**Performance tests** with randomly generated data of various sizes: + - 1 KB + - 10 KB + - 100 KB + - 1 MB + - 10 MB + +The test program performs the following operations: +- Generates random test data for each size +- Calculates the checksum for each data buffer +- Measures the execution time +- Calculates and displays the throughput in MB/s + +The test output will show: +- Results from verification tests against known checksums +- Comparison between direct and incremental calculation +- Performance metrics for each data size + +To compile and run this test, you'll need to use a command like: + +gcc -Wall -O2 adler32-simple.c adler32-test.c -o adler32-test +./adler32-test + +If you'd like me to create the Makefile for easier compilation, just let me know. + +``` + +You now have a test program which confirms the Adler32 algorithm is working correctly for various input data and can measure the performance of the previously generated C function. + +Continue to the next section to run the test program and see the results. +