This project implements a blockchain mining simulation using CUDA to accelerate two critical components: Merkle Root computation and Proof of Work nonce finding. The implementation demonstrates how GPU parallelization can significantly improve the performance of blockchain operations compared to traditional CPU-based approaches.
The project consists of two main components:
- CPU Miner (
cpu_miner/): Reference implementation showing the serial CPU approach - GPU Miner (
gpu_miner/): Optimized CUDA implementation with parallelized Merkle Root computation and nonce finding
- Nodes/Miners: Participate in the consensus process without trusting each other
- Mathematical Problem: Miners solve computationally complex problems to gain trust
- Dynamic Difficulty: Problem complexity increases with more participants
- Previous Block Hash: Links to the previous block in the chain
- Merkle Root: Hash of all transactions in the block using a Merkle Tree
- Nonce: 32-bit integer found through Proof of Work to meet difficulty requirements
The Merkle Root is computed using two CUDA kernels:
hash_kernel: Computes SHA-256 hashes for each initial transactionmerge_kernel: Concatenates and hashes pairs of hashes to build the tree
Tree Construction Process:
Level 0 (initial): [H0] [H1] [H2] [H3] [H4] [H5] [H6] [H7]
↓↓ ↓↓ ↓↓ ↓↓
Level 1: [H01] [H23] [H45] [H67]
↓↓ ↓↓
Level 2: [H0123] [H4567]
↓↓
Level 3 (root): [H01234567]
The implementation uses batch processing for nonce discovery:
- Batch Division: INT32_MAX is divided into batches processed by different threads
- Parallel Search: Multiple threads search different nonce ranges simultaneously
- Early Termination: Stops searching once a valid nonce is found
- Atomic Operations: Uses
atomicMinto track the smallest valid nonce found
Memory Optimization Evolution:
- Started with unified memory (
cudaMallocManaged) - Experimented with shared memory for constant data
- Final implementation uses
cudaMallocfor optimal performance
First line contains (in order):
number_of_transactions: Total transactions in fileprev_block_hash: Hash of the previous blockdifficulty_zeros: Number of leading zeros requiredmax_nonce: Maximum nonce values to checkmax_transactions_in_a_block: Maximum transactions per blocktransaction_size: Size of each transaction in bytes
Subsequent lines contain the actual transactions.
For each block:
BLOCK_ID,NONCE,BLOCK_HASH,TIME_FOR_MERKLE_ROOT_COMPUTATION,TIME_FOR_NONCE_COMPUTATION,TIME_SUM
Final line shows total times:
TOTAL_TIME_FOR_MERKLE_ROOT_COMPUTATION,TOTAL_TIME_FOR_NONCE_COMPUTATION,TOTAL_TIME_SUM
- Transaction Grouping: Transactions are grouped into blocks based on
max_transactions_in_a_block - Merkle Root Calculation: Compute Merkle Root for each block's transactions
- Block Formation: Concatenate previous block hash and Merkle Root
- Nonce Search: Find valid nonce that meets difficulty requirements
- Validation: Block is invalidated if no valid nonce is found within
max_nonce
This project requires:
- CUDA Toolkit (tested with 12.2.2)
- NVIDIA GPU with compute capability 3.5+
- Slurm workload manager
# Compile
make
# Compile with specific compiler
make COMPILER=gcc build
# Compile with specific compiler and locally
make COMPILER=gcc build LOCAL=y
# Run specific test
make run TEST=<TEST_NAME> # Example: make run TEST=test4
# Clean build artifacts
make clean| Test | Component | CPU Time (s) | GPU Time (s) | Speedup |
|---|---|---|---|---|
| Test 1 | Merkle Root | 0.00043 | 0.00484 | 0.09x |
| Test 1 | Nonce Finding | 12.86962 | 0.04931 | 261x |
| Test 1 | Total | 12.87005 | 0.05415 | 238x |
| Test 2 | Merkle Root | 0.00055 | 0.00790 | 0.07x |
| Test 2 | Nonce Finding | 12.04284 | 0.04940 | 244x |
| Test 2 | Total | 12.04339 | 0.05730 | 210x |
| Test 3 | Merkle Root | - | 0.02439 | - |
| Test 3 | Nonce Finding | - | 32.15315 | - |
| Test 3 | Total | - | 32.17754 | |
| Test 4 | Merkle Root | 0.83722 | 0.04266 | 19.6x |
| Test 4 | Nonce Finding | 2.08098 | 0.01596 | 130x |
| Test 4 | Total | 2.91820 | 0.05862 | 49.8x |
Test 3 results are not available for CPU due to excessive computation time.
The CUDA implementation demonstrates significant speedup over the CPU version:
- Merkle Root: Parallel tree construction vs sequential hashing
- Nonce Finding: Massive parallel search vs iterative trial-and-error
- Parallel Tree Construction: Multiple levels of Merkle tree built concurrently
- Batch Nonce Processing: Divides search space across GPU threads
- Memory Management: Optimized data transfer between host and device
- Early Termination: Stops unnecessary computation once solution is found





