This repository contains multiple CUDA implementations of Bitonic Sort with various optimizations, along with a Radix Sort implementation for performance comparison. The project demonstrates GPU-accelerated sorting techniques, highlights optimization trade-offs, and analyzes performance using CUDA events and the NVIDIA Nsight profiler.
- Introduction
- Setup and Requirements
- Compilation and Usage
- Directory Structure
- Algorithm Versions
- Performance Metrics
- Debugging and Verification
- Makefile Targets
- Cleanup
This project implements four optimized versions of Bitonic Sort and a Radix Sort on NVIDIA GPUs using CUDA:
Algorithm | Optimizations Applied |
---|---|
Bitonic Sort V0 | Baseline implementation using global memory |
Bitonic Sort V1 | Kernel fusion for intra-block communication |
Bitonic Sort V2 | Shared memory for efficient intra-block operations |
Bitonic Sort V3 | Reduced thread spawn, warp shuffling, pinned memory |
Radix Sort | Uses CUDA CUB library for comparison |
✅ Hardware:
- NVIDIA GPU with Compute Capability ≥ 3.5
✅ Software:
- CUDA Toolkit ≥ 11.0
- NVCC Compiler (part of CUDA Toolkit)
- Make build system
✅ Tested On:
- NVIDIA T4 (Compute 7.5)
- NVIDIA Tesla P100 (Compute 6.0)
- NVIDIA Ampere A100 (Compute 8.0)
Use the provided Makefile to build all implementations:
make all
To build specific versions:
make bitonic v=[0|1|2|3] # Build a specific Bitonic Sort version
make radix # Build Radix Sort
✅ Executables are stored in the bin/
directory:
bitonic_v0
,bitonic_v1
,bitonic_v2
,bitonic_v3
(Bitonic Sort versions)radix_sort
(Radix Sort)
Run with q
specifying log₂(number of elements):
make run v=[0|1|2|3|radix] q=<value>
Example (Sort 2²⁰ elements using Bitonic Sort V2):
make run v=2 q=20
📦 Project Root
┣ 📂 src/ # Source code for Bitonic and Radix Sort implementations
┣ 📂 bin/ # Compiled executables
┣ 📂 results/ # Performance metrics and verification results
┣ 📂 docs/ # Documentation and report files
┣ 📜 Makefile # Build system
- Uses global memory for all operations.
- Simple one-thread-per-element approach.
- Serves as a performance baseline for optimizations.
- Fuses multiple sorting steps into a single kernel launch.
- Minimizes global synchronization overhead.
- Memory access pattern is still global memory dependent.
- Performance Improvement: Achieves a 30% speedup over V0.
- Utilizes fast shared memory for intra-block operations.
- Reduces global memory transactions, improving performance.
- Performance Improvement: Achieves a 20% speedup over V1.
- Minimized thread spawn in main sorting kernel.
- Optimized warp shuffle operations to reduce global synchronization.
- Pinned Memory Usage for efficient host-device transfers.
- Performance Improvement: Achieves a 20% speedup over V2.
- Uses CUB library for Radix Sort.
- Included for performance comparison with Bitonic Sort.
Performance is measured using CUDA events (cudaEventElapsedTime
). Each test runs multiple times to ensure accurate measurement.
🔍 Profiling Tools Used:
- NVIDIA Nsight Profiler for kernel analysis
- CUDA event timers for execution time measurement
📈 Performance Graph (A100 GPU):
To enable/disable debugging and verification options, modify the following #define
statements in the beginning of each source file:
#define DEBUG // Enables detailed output for debugging
#define VERIFY_SEQUENTIAL // Enables comparison with sequential CPU sort (for verification)
Command | Description |
---|---|
make all |
Builds all Bitonic Sort versions and Radix Sort |
make bitonic v=[0|1|2|3] |
Builds a specific Bitonic Sort version |
make radix |
Builds Radix Sort |
make run v=[0|1|2|3|radix] q=<value> |
Runs the specified sorting algorithm with 2^q elements |
make clean |
Removes all compiled binaries and object files |
make help |
Displays help message |
To remove all compiled executables:
make clean