This CUDA program demonstrates parallel vector addition on the GPU using 4D vectors. It includes performance profiling with NVIDIA Nsight Systems, providing insights into execution time, memory transfer overhead, and the benefits of GPU parallelization.
The program performs parallel addition of two arrays of 4D vectors:
- Creates two input vector arrays on the host (CPU)
- Transfers data to the GPU device memory
- Launches a CUDA kernel to perform parallel addition
- Transfers results back to host memory
- Displays the first five resulting vectors
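The steps above can be sketched as a minimal host program. This is an illustrative outline, not necessarily the exact code in kernel.cu: the identifiers (`vectorAdd4D`, `h_a`, `d_a`, etc.) are hypothetical, the 4D vectors are assumed to be stored as CUDA's built-in `float4` type, and error checking is omitted for brevity.

```cuda
#include <cuda_runtime.h>
#include <cstdio>
#include <cstdlib>

// Kernel defined elsewhere: adds two arrays of float4, one element per thread.
__global__ void vectorAdd4D(const float4 *a, const float4 *b, float4 *out, int n);

int main(void)
{
    const int n = 1 << 20;               // number of 4D vectors
    const size_t bytes = n * sizeof(float4);

    // 1. Create the two input arrays on the host.
    float4 *h_a = (float4 *)malloc(bytes);
    float4 *h_b = (float4 *)malloc(bytes);
    float4 *h_out = (float4 *)malloc(bytes);
    for (int i = 0; i < n; ++i) {
        h_a[i] = make_float4((float)i, (float)i, (float)i, (float)i);
        h_b[i] = make_float4(1.0f, 2.0f, 3.0f, 4.0f);
    }

    // 2. Transfer the inputs to device memory.
    float4 *d_a, *d_b, *d_out;
    cudaMalloc(&d_a, bytes);
    cudaMalloc(&d_b, bytes);
    cudaMalloc(&d_out, bytes);
    cudaMemcpy(d_a, h_a, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(d_b, h_b, bytes, cudaMemcpyHostToDevice);

    // 3. Launch the kernel: one thread per vector element.
    const int threads = 256;
    const int blocks = (n + threads - 1) / threads;
    vectorAdd4D<<<blocks, threads>>>(d_a, d_b, d_out, n);

    // 4. Transfer the results back to host memory.
    cudaMemcpy(h_out, d_out, bytes, cudaMemcpyDeviceToHost);

    // 5. Display the first five resulting vectors.
    for (int i = 0; i < 5; ++i)
        printf("(%g, %g, %g, %g)\n",
               h_out[i].x, h_out[i].y, h_out[i].z, h_out[i].w);

    cudaFree(d_a); cudaFree(d_b); cudaFree(d_out);
    free(h_a); free(h_b); free(h_out);
    return 0;
}
```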
Each thread computes one element of the result:
result[i] = vector_a[i] + vector_b[i]

For 4D vectors, this means adding the corresponding x, y, z, and w components.
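A minimal kernel sketch of this computation, assuming the vectors are stored as `float4` (the kernel name and parameter names are illustrative):

```cuda
// Each thread computes one element of the result: the global thread index
// selects which pair of 4D vectors to add, and all four components are summed.
__global__ void vectorAdd4D(const float4 *a, const float4 *b, float4 *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {  // guard: the last block may have more threads than elements
        out[i].x = a[i].x + b[i].x;
        out[i].y = a[i].y + b[i].y;
        out[i].z = a[i].z + b[i].z;
        out[i].w = a[i].w + b[i].w;
    }
}
```

The bounds check matters because the grid size is rounded up to a whole number of blocks, so some threads in the final block may fall past the end of the array.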
The program demonstrates key performance characteristics of GPU computing:
As shown in the Nsight Systems profile, the most significant overhead comes from memory transfers:
- cudaMemcpy (Host to Device): transferring the input vectors to the GPU
- cudaMemcpy (Device to Host): transferring the results back to the CPU
By contrast, the kernel execution itself is extremely fast, because thousands of vector additions occur simultaneously across the GPU cores.
To maximize GPU efficiency:
- Batch Operations: Perform multiple operations on the same data to amortize transfer costs
- Data Locality: Keep data on the GPU between operations when possible
- Async Transfers: Use streams for overlapping computation and transfers
- Large Datasets: GPU advantages increase with dataset size
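The async-transfer tip can be sketched as follows. This is a hypothetical fragment, not part of the program above: it assumes the host buffers were allocated with `cudaMallocHost` (page-locked memory, required for copies to be truly asynchronous) and that `n` is even.

```cuda
// Split the work into two chunks on two streams so that the host-to-device
// copy of one chunk overlaps the kernel execution of the other chunk.
cudaStream_t streams[2];
cudaStreamCreate(&streams[0]);
cudaStreamCreate(&streams[1]);

const int chunk = n / 2;
for (int s = 0; s < 2; ++s) {
    const int offset = s * chunk;
    const size_t chunkBytes = chunk * sizeof(float4);

    // Enqueue copies and the kernel on this chunk's stream; they run in
    // order within the stream but may overlap work on the other stream.
    cudaMemcpyAsync(d_a + offset, h_a + offset, chunkBytes,
                    cudaMemcpyHostToDevice, streams[s]);
    cudaMemcpyAsync(d_b + offset, h_b + offset, chunkBytes,
                    cudaMemcpyHostToDevice, streams[s]);
    vectorAdd4D<<<(chunk + 255) / 256, 256, 0, streams[s]>>>(
        d_a + offset, d_b + offset, d_out + offset, chunk);
    cudaMemcpyAsync(h_out + offset, d_out + offset, chunkBytes,
                    cudaMemcpyDeviceToHost, streams[s]);
}
cudaDeviceSynchronize();  // wait for both streams to finish

cudaStreamDestroy(streams[0]);
cudaStreamDestroy(streams[1]);
```

In an Nsight Systems timeline, this pattern shows up as transfer and kernel rows overlapping instead of running strictly one after another.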
The included performance analysis visualization shows:
- Timeline of CUDA operations (memory transfers and kernel execution)
- Relative time spent in each operation
- GPU utilization patterns
The profile clearly illustrates that memory transfer operations dominate the execution time, emphasizing the importance of minimizing data movement between CPU and GPU.
Compile and run the program:
nvcc kernel.cu -o vector_addition
./vector_addition

For performance profiling with Nsight Systems:

nsys profile --stats=true ./vector_addition

- Parallelism: Vector operations are embarrassingly parallel - perfect for GPU acceleration
- Memory Matters: Always consider memory transfer overhead in GPU programming
- Scale: GPU benefits increase with problem size
- Profile: Use profiling tools to identify bottlenecks and optimization opportunities