Vector Addition Kernel

This CUDA program demonstrates parallel vector addition on the GPU using 4D vectors. It includes performance profiling with NVIDIA Nsight Systems, providing insights into execution time, memory transfer overhead, and the benefits of GPU parallelization.

Overview

The program performs parallel addition of two arrays of 4D vectors:

  • Creates two input vector arrays on the host (CPU)
  • Transfers data to the GPU device memory
  • Launches a CUDA kernel to perform parallel addition
  • Transfers results back to host memory
  • Displays the first five resulting vectors

Vector Addition Operation

Each thread computes one element of the result:

result[i] = vector_a[i] + vector_b[i]

For 4D vectors, this means adding corresponding x, y, z, and w components.
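The flow described above can be sketched as a minimal CUDA program. This is an illustrative reconstruction, not the repository's exact `kernel.cu`; the kernel name, data size, and initialization values are assumptions:

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// One thread per element: component-wise addition of two float4 vectors.
__global__ void vectorAdd(const float4 *a, const float4 *b, float4 *out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        out[i] = make_float4(a[i].x + b[i].x,
                             a[i].y + b[i].y,
                             a[i].z + b[i].z,
                             a[i].w + b[i].w);
    }
}

int main() {
    const int n = 1 << 20;                      // 1M vectors (illustrative size)
    const size_t bytes = n * sizeof(float4);

    // Host allocations and initialization
    float4 *hA = (float4 *)malloc(bytes);
    float4 *hB = (float4 *)malloc(bytes);
    float4 *hC = (float4 *)malloc(bytes);
    for (int i = 0; i < n; ++i) {
        hA[i] = make_float4(i, i, i, i);
        hB[i] = make_float4(2.0f * i, 2.0f * i, 2.0f * i, 2.0f * i);
    }

    // Device allocations and host-to-device transfers
    float4 *dA, *dB, *dC;
    cudaMalloc(&dA, bytes);
    cudaMalloc(&dB, bytes);
    cudaMalloc(&dC, bytes);
    cudaMemcpy(dA, hA, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(dB, hB, bytes, cudaMemcpyHostToDevice);

    // Launch with enough blocks to cover all n elements
    const int threads = 256;
    const int blocks = (n + threads - 1) / threads;
    vectorAdd<<<blocks, threads>>>(dA, dB, dC, n);

    // Device-to-host transfer, then print the first five results
    cudaMemcpy(hC, dC, bytes, cudaMemcpyDeviceToHost);
    for (int i = 0; i < 5; ++i)
        printf("result[%d] = (%.1f, %.1f, %.1f, %.1f)\n",
               i, hC[i].x, hC[i].y, hC[i].z, hC[i].w);

    cudaFree(dA); cudaFree(dB); cudaFree(dC);
    free(hA); free(hB); free(hC);
    return 0;
}
```

Using `float4` keeps each vector's four components contiguous, so every thread reads and writes its element with coalesced 16-byte accesses.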

Performance Analysis

The program demonstrates key performance characteristics of GPU computing:

Memory Transfer Overhead

As shown in the Nsight Systems profile, the most significant overhead comes from:

  • cudaMemcpy (Host to Device): Transferring input vectors to GPU
  • cudaMemcpy (Device to Host): Transferring results back to CPU

Computational Speedup

  • The kernel execution itself is brief relative to the memory transfers, because the additions run in parallel
  • Thousands of vector additions execute simultaneously across GPU cores
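The launch configuration is what exposes that parallelism: the grid is sized so one thread handles one vector. A typical sketch (block size of 256 is an assumption, not a requirement):

```cuda
// Map one thread to one vector; round the block count up so every
// element is covered even when n is not a multiple of the block size.
const int threads = 256;
const int blocks  = (n + threads - 1) / threads;  // ceiling division
vectorAdd<<<blocks, threads>>>(dA, dB, dC, n);
// e.g. n = 1,000,000 launches 3907 blocks of 256 threads
```

The in-kernel bounds check (`if (i < n)`) is what makes the rounded-up block count safe.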

Optimization Insights

To maximize GPU efficiency:

  1. Batch Operations: Perform multiple operations on the same data to amortize transfer costs
  2. Data Locality: Keep data on the GPU between operations when possible
  3. Async Transfers: Use streams for overlapping computation and transfers
  4. Large Datasets: GPU advantages increase with dataset size
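Insight 3 can be sketched with two CUDA streams that split the array into chunks, so the copy for one chunk overlaps the kernel for the other. This is a hypothetical refinement of the program above, not something the profiled version does; it also assumes the host buffers are pinned (allocated with `cudaMallocHost`), which is required for copies to be truly asynchronous:

```cuda
// Two streams, each processing half of the array. Transfers and kernels
// issued into different streams may overlap on the hardware.
cudaStream_t streams[2];
for (int s = 0; s < 2; ++s) cudaStreamCreate(&streams[s]);

const int chunk = n / 2;
for (int s = 0; s < 2; ++s) {
    const int off = s * chunk;
    cudaMemcpyAsync(dA + off, hA + off, chunk * sizeof(float4),
                    cudaMemcpyHostToDevice, streams[s]);
    cudaMemcpyAsync(dB + off, hB + off, chunk * sizeof(float4),
                    cudaMemcpyHostToDevice, streams[s]);
    vectorAdd<<<(chunk + 255) / 256, 256, 0, streams[s]>>>(
        dA + off, dB + off, dC + off, chunk);
    cudaMemcpyAsync(hC + off, dC + off, chunk * sizeof(float4),
                    cudaMemcpyDeviceToHost, streams[s]);
}
cudaDeviceSynchronize();
for (int s = 0; s < 2; ++s) cudaStreamDestroy(streams[s]);
```

For a single cheap operation like vector addition the overlap buys little, but the same pattern pays off when several kernels reuse data already resident on the device (insight 2).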

Nsight Systems Profiling

The included performance analysis visualization shows:

  • Timeline of CUDA operations (memory transfers and kernel execution)
  • Relative time spent in each operation
  • GPU utilization patterns

Nsight Systems Analysis

The profile clearly illustrates that memory transfer operations dominate the execution time, emphasizing the importance of minimizing data movement between CPU and GPU.

Usage

Compile and run the program:

nvcc kernel.cu -o vector_addition
./vector_addition

For performance profiling with Nsight Systems:

nsys profile --stats=true ./vector_addition

Key Takeaways

  1. Parallelism: Vector operations are embarrassingly parallel, making them an ideal fit for GPU acceleration
  2. Memory Matters: Always consider memory transfer overhead in GPU programming
  3. Scale: GPU benefits increase with problem size
  4. Profile: Use profiling tools to identify bottlenecks and optimization opportunities