CUDA Merkel Root Tree & Proof of Work

Introduction

This project implements a blockchain mining simulation using CUDA to accelerate two critical components: Merkle Root computation and Proof of Work nonce finding. The implementation demonstrates how GPU parallelization can significantly improve the performance of blockchain operations compared to traditional CPU-based approaches.

Project Structure

The project consists of two main components:

CPU Miner (cpu_miner/): Reference implementation showing the serial CPU approach
GPU Miner (gpu_miner/): Optimized CUDA implementation with parallelized Merkle Root computation and nonce finding

Algorithm Overview

Blockchain Consensus Mechanism

Nodes/Miners: Participate in the consensus process without trusting each other
Mathematical Problem: Miners solve computationally complex problems to gain trust
Dynamic Difficulty: Problem complexity increases with more participants

Block Structure

Previous Block Hash: Links to the previous block in the chain
Merkle Root: Hash of all transactions in the block using a Merkle Tree
Nonce: 32-bit integer found through Proof of Work to meet difficulty requirements

CUDA Implementation Details

Merkle Root Computation

The Merkle Root is computed using two CUDA kernels:

hash_kernel: Computes SHA-256 hashes for each initial transaction
merge_kernel: Concatenates and hashes pairs of hashes to build the tree

Tree Construction Process:

Level 0 (initial):   [H0] [H1] [H2] [H3] [H4] [H5] [H6] [H7]
                      ↓↓   ↓↓   ↓↓   ↓↓
Level 1:            [H01] [H23] [H45] [H67]
                     ↓↓    ↓↓
Level 2:            [H0123] [H4567]
                      ↓↓
Level 3 (root):     [H01234567]

Nonce Finding (Proof of Work)

The implementation uses batch processing for nonce discovery:

Batch Division: INT32_MAX is divided into batches processed by different threads
Parallel Search: Multiple threads search different nonce ranges simultaneously
Early Termination: Stops searching once a valid nonce is found
Atomic Operations: Uses atomicMin to track the smallest valid nonce found

Memory Optimization Evolution:

Started with unified memory (cudaMallocManaged)
Experimented with shared memory for constant data
Final implementation uses cudaMalloc for optimal performance

Input/Output Format

Input File (`<TEST_NAME>.in`)

First line contains (in order):

number_of_transactions: Total transactions in file
prev_block_hash: Hash of the previous block
difficulty_zeros: Number of leading zeros required
max_nonce: Maximum nonce values to check
max_transactions_in_a_block: Maximum transactions per block
transaction_size: Size of each transaction in bytes

Subsequent lines contain the actual transactions.

Output File (`<TEST_NAME>.out`)

For each block:

BLOCK_ID,NONCE,BLOCK_HASH,TIME_FOR_MERKLE_ROOT_COMPUTATION,TIME_FOR_NONCE_COMPUTATION,TIME_SUM

Final line shows total times:

TOTAL_TIME_FOR_MERKLE_ROOT_COMPUTATION,TOTAL_TIME_FOR_NONCE_COMPUTATION,TOTAL_TIME_SUM

Processing Pipeline

Transaction Grouping: Transactions are grouped into blocks based on max_transactions_in_a_block
Merkle Root Calculation: Compute Merkle Root for each block's transactions
Block Formation: Concatenate previous block hash and Merkle Root
Nonce Search: Find valid nonce that meets difficulty requirements
Validation: Block is invalidated if no valid nonce is found within max_nonce

Build and Execution

Environment Setup

This project requires:

CUDA Toolkit (tested with 12.2.2)
NVIDIA GPU with compute capability 3.5+
Slurm workload manager

# Compile
make    

# Compile with specific compiler
make COMPILER=gcc build

# Compile with specific compiler and locally
make COMPILER=gcc build LOCAL=y

# Run specific test
make run TEST=<TEST_NAME>  # Example: make run TEST=test4

# Clean build artifacts
make clean

Performance Comparison

CPU vs GPU Execution

Test	Component	CPU Time (s)	GPU Time (s)	Speedup
Test 1	Merkle Root	0.00043	0.00484	0.09x
Test 1	Nonce Finding	12.86962	0.04931	261x
Test 1	Total	12.87005	0.05415	238x
Test 2	Merkle Root	0.00055	0.00790	0.07x
Test 2	Nonce Finding	12.04284	0.04940	244x
Test 2	Total	12.04339	0.05730	210x
Test 3	Merkle Root	-	0.02439	-
Test 3	Nonce Finding	-	32.15315	-
Test 3	Total	-	32.17754
Test 4	Merkle Root	0.83722	0.04266	19.6x
Test 4	Nonce Finding	2.08098	0.01596	130x
Test 4	Total	2.91820	0.05862	49.8x

Test 3 results are not available for CPU due to excessive computation time.

The CUDA implementation demonstrates significant speedup over the CPU version:

Merkle Root: Parallel tree construction vs sequential hashing
Nonce Finding: Massive parallel search vs iterative trial-and-error

Key Optimizations

Parallel Tree Construction: Multiple levels of Merkle tree built concurrently
Batch Nonce Processing: Divides search space across GPU threads
Memory Management: Optimized data transfer between host and device
Early Termination: Stops unnecessary computation once solution is found

Name		Name	Last commit message	Last commit date
Latest commit History 2 Commits
README_Assignment		README_Assignment
README_imgs		README_imgs
common		common
cpu_miner		cpu_miner
gpu_miner		gpu_miner
include		include
tests		tests
Makefile		Makefile
README.md		README.md

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Repository files navigation

CUDA Merkel Root Tree & Proof of Work

Introduction

Project Structure

Algorithm Overview

Blockchain Consensus Mechanism

Block Structure

CUDA Implementation Details

Merkle Root Computation

Nonce Finding (Proof of Work)

Input/Output Format

Input File (`<TEST_NAME>.in`)

Output File (`<TEST_NAME>.out`)

Processing Pipeline

Build and Execution

Environment Setup

Performance Comparison

CPU vs GPU Execution

Key Optimizations

Resources

About

Uh oh!

Releases

Packages

Languages

robertpaulp/blockchain

Folders and files

Latest commit

History

Repository files navigation

CUDA Merkel Root Tree & Proof of Work

Introduction

Project Structure

Algorithm Overview

Blockchain Consensus Mechanism

Block Structure

CUDA Implementation Details

Merkle Root Computation

Nonce Finding (Proof of Work)

Input/Output Format

Input File (<TEST_NAME>.in)

Output File (<TEST_NAME>.out)

Processing Pipeline

Build and Execution

Environment Setup

Performance Comparison

CPU vs GPU Execution

Key Optimizations

Resources

About

Topics

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Languages

Input File (`<TEST_NAME>.in`)

Output File (`<TEST_NAME>.out`)

Packages