# NVIDIA Blackwell GPU Architecture Support Implementation Plan

## Overview

This document outlines the implementation plan for adding comprehensive NVIDIA Blackwell GPU architecture support to llama.cpp. The plan is structured in phases to ensure systematic development, testing, and validation of Blackwell-specific optimizations.

## Current State Analysis

- **Compute Capability**: Currently supports up to Ada Lovelace (8.9)
- **Blackwell Support**: [PR #13360](https://github.com/ggml-org/llama.cpp/pull/13360) adds CUDA 12.8 + sm120 build support
- **Missing Features**: Thread Block Clusters, L2 cache management, HBM3/HBM3e optimizations
- **Flash Attention**: Multiple kernel variants but no Blackwell-specific optimizations
- **Compatibility**: Basic functionality works via backward compatibility, but performance is sub-optimal

## Architecture Constants Update

**Critical Finding**: The Blackwell GPUs targeted here (RTX 50 series) report compute capability **12.0** (sm_120), not 10.0 as initially assumed; the datacenter B100/B200 parts report 10.0 (sm_100).

## Flash Attention Analysis

llama.cpp implements multiple Flash Attention kernel variants:
- **MMA-based kernels** (`fattn-mma-f16.cuh`): Modern implementation for Turing+
- **Vector kernels** (`fattn-vec-f32/f16.cuh`): For smaller batches/specific dimensions
- **WMMA kernels** (`fattn-wmma-f16.cu`): Legacy implementation for Volta
- **Tile kernels** (`fattn-tile-f16/f32.cu`): For architectures without tensor cores

Selection logic in `ggml_cuda_flash_attn_ext()` considers compute capability, batch size, head dimensions, and data types.

## Phase 1: Foundation and Architecture Detection ⚡ **ACCELERATED**

### 1.1 Add Blackwell Constants and Detection **✅ FOUNDATION COMPLETE**

**Status**: Foundation provided by [PR #13360](https://github.com/ggml-org/llama.cpp/pull/13360)
- ✅ CUDA 12.8 toolkit support
- ✅ sm120 compilation target
- ✅ Build system integration

**Files to modify:**
- `ggml/src/ggml-cuda/common.cuh`
- `ggml/src/ggml-cuda/ggml-cuda.cu`

**Updated Implementation:**
```cpp
// Add to common.cuh - corrected for the actual Blackwell compute capability
#define GGML_CUDA_CC_BLACKWELL        1200 // RTX 50 series (12.0)
#define GGML_CUDA_CC_BLACKWELL_FUTURE 1300 // Future Blackwell variants

#define GGML_CUDA_CC_IS_BLACKWELL(cc)      ((cc) >= GGML_CUDA_CC_BLACKWELL && (cc) < GGML_CUDA_CC_BLACKWELL_FUTURE)
#define GGML_CUDA_CC_SUPPORTS_CLUSTERS(cc) ((cc) >= GGML_CUDA_CC_BLACKWELL)
```

**Timeline:** ~~Week 1-2~~ **COMPLETE** ✅

### 1.2 Enhanced Device Information Structure

**Files to modify:**
- `ggml/src/ggml-cuda/ggml-cuda.cu` (cuda_device_info)

**New fields:**
- `max_clusters_per_multiprocessor`
- `max_blocks_per_cluster`
- `l2_cache_size`
- `hbm_bandwidth`
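
A minimal sketch of how these fields could be queried at init time; the struct and helper below are illustrative placeholders rather than the actual `cuda_device_info` layout, and the bandwidth figure is a rough estimate derived from `memoryClockRate`/`memoryBusWidth`:

```cpp
#include <cuda_runtime.h>

// Hypothetical container for the new fields (the real ones would be added to
// cuda_device_info in ggml-cuda.cu).
struct blackwell_device_caps {
    int    cc;                 // 100*major + 10*minor, e.g. 1200 for 12.0
    int    cluster_launch;     // non-zero if thread block cluster launch is supported
    size_t l2_cache_size;      // bytes
    double hbm_bandwidth_gbs;  // rough peak bandwidth estimate, GB/s
};

static blackwell_device_caps query_blackwell_device_caps(int device) {
    blackwell_device_caps caps = {};
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, device);

    caps.cc            = 100*prop.major + 10*prop.minor;
    caps.l2_cache_size = prop.l2CacheSize;
    cudaDeviceGetAttribute(&caps.cluster_launch, cudaDevAttrClusterLaunch, device);

    // DDR-style estimate: 2 transfers/clock * bus width; memoryClockRate is in kHz.
    caps.hbm_bandwidth_gbs = 2.0 * prop.memoryClockRate * (prop.memoryBusWidth / 8) / 1.0e6;
    return caps;
}
```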

**Updated Timeline:** Week 1 ⚡ (accelerated due to build foundation)

### 1.3 Blackwell Feature Detection **NEW**

**Files to create:**
- `ggml/src/ggml-cuda/blackwell-detect.cu`

**Implementation:**
```cpp
bool ggml_cuda_supports_blackwell_features(int device_id) {
    cudaDeviceProp prop;
    if (cudaGetDeviceProperties(&prop, device_id) != cudaSuccess) {
        return false;
    }

    // Verify compute capability 12.0+
    const int cc = 100*prop.major + 10*prop.minor;
    if (!GGML_CUDA_CC_IS_BLACKWELL(cc)) {
        return false;
    }

    // Verify cluster support (cudaDevAttrClusterLaunch is non-zero if cluster launch is available)
    int cluster_launch = 0;
    cudaDeviceGetAttribute(&cluster_launch, cudaDevAttrClusterLaunch, device_id);

    return cluster_launch != 0;
}
```

**Timeline:** Week 1-2

## Phase 2: Thread Block Clusters Foundation

### 2.1 Cluster Detection and Support Infrastructure

**Files to create:**
- `ggml/src/ggml-cuda/clusters.cuh`
- `ggml/src/ggml-cuda/clusters.cu`

**Key functions:**
- `ggml_cuda_cluster_occupancy()`
- `ggml_cuda_launch_kernel_clusters()`
- `ggml_cuda_cluster_sync_init()`
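
A sketch of what `ggml_cuda_launch_kernel_clusters()` could look like, built on the CUDA 12 extensible launch API (`cudaLaunchKernelEx` with `cudaLaunchAttributeClusterDimension`); the wrapper signature is a placeholder:

```cpp
#include <cuda_runtime.h>

// Hypothetical wrapper: launch `kernel` with a 1-D cluster of `cluster_size` blocks.
// gridDim.x must be a multiple of the cluster size.
template <typename... Args>
static cudaError_t ggml_cuda_launch_kernel_clusters(
        void (*kernel)(Args...), dim3 grid, dim3 block, size_t smem_bytes,
        cudaStream_t stream, int cluster_size, Args... args) {
    cudaLaunchConfig_t cfg = {};
    cfg.gridDim          = grid;
    cfg.blockDim         = block;
    cfg.dynamicSmemBytes = smem_bytes;
    cfg.stream           = stream;

    cudaLaunchAttribute attr = {};
    attr.id               = cudaLaunchAttributeClusterDimension;
    attr.val.clusterDim.x = cluster_size;
    attr.val.clusterDim.y = 1;
    attr.val.clusterDim.z = 1;

    cfg.attrs    = &attr;
    cfg.numAttrs = 1;

    return cudaLaunchKernelEx(&cfg, kernel, args...);
}
```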

**Updated Timeline:** Week 2-3 ⚡ (accelerated)

### 2.2 L2 Cache Management

**Files to modify:**
- `ggml/src/ggml-cuda/ggml-cuda.cu`

**Implementation:**
- L2 cache persistence API wrappers
- Cache allocation strategy for KV cache data
- Stream-based cache management
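
A sketch of the persistence wrapper built on the standard access-policy-window API; the function name and the policy of pinning the whole KV buffer are placeholders:

```cpp
#include <algorithm>
#include <cuda_runtime.h>

// Hypothetical helper: mark a KV-cache buffer as "persisting" in L2 for a stream.
static void ggml_cuda_l2_persist_kv(cudaStream_t stream, int device, void * kv_data, size_t kv_bytes) {
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, device);

    // Reserve part of L2 for persisting accesses (capped by the device maximum).
    const size_t carveout = std::min<size_t>(kv_bytes, prop.persistingL2CacheMaxSize);
    cudaDeviceSetLimit(cudaLimitPersistingL2CacheSize, carveout);

    // Hint that accesses to this window should stay resident in L2.
    cudaStreamAttrValue attr = {};
    attr.accessPolicyWindow.base_ptr  = kv_data;
    attr.accessPolicyWindow.num_bytes = std::min<size_t>(kv_bytes, prop.accessPolicyMaxWindowSize);
    attr.accessPolicyWindow.hitRatio  = 1.0f;
    attr.accessPolicyWindow.hitProp   = cudaAccessPropertyPersisting;
    attr.accessPolicyWindow.missProp  = cudaAccessPropertyStreaming;
    cudaStreamSetAttribute(stream, cudaStreamAttributeAccessPolicyWindow, &attr);
}
```

A matching teardown (for example `cudaCtxResetPersistingL2Cache()`) would be needed when the KV buffer is freed or resized.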

**Updated Timeline:** Week 3-4 ⚡ (accelerated)

## Phase 3: Flash Attention Blackwell Optimizations

### 3.1 MMA Kernel Enhancements for Blackwell

**Files to modify:**
- `ggml/src/ggml-cuda/fattn-mma-f16.cuh`
- `ggml/src/ggml-cuda/fattn-common.cuh`

**Key optimizations:**

#### 3.1.1 Enhanced Shared Memory Usage
```cpp
// Leverage 228 KB shared memory per SM vs 164 KB on Ada Lovelace
template<int DKQ, int DV>
struct fattn_blackwell_config : fattn_mma_f16_config<DKQ, DV> {
    static constexpr int cc_target           = GGML_CUDA_CC_BLACKWELL; // 1200
    static constexpr int smpb_blackwell      = 228 * 1024;             // 228 KB
    static constexpr int enhanced_batch_size = smpb_blackwell / (DKQ * sizeof(half));

    // Increase tile sizes for better cache utilization
    static constexpr int nbatch_fa_blackwell = std::min(enhanced_batch_size, 128);
};
```

#### 3.1.2 Thread Block Cluster Integration
```cpp
template<int DKQ, int DV, int ncols1, int ncols2, int cluster_size>
__global__ void __cluster_dims__(cluster_size, 1, 1)
flash_attn_ext_f16_clustered(/* parameters */) {
    namespace cg = cooperative_groups; // requires <cooperative_groups.h>
    cg::cluster_group cluster = cg::this_cluster();

    // Shared memory of this block; other blocks in the cluster can map it via
    // distributed shared memory
    extern __shared__ half2 cluster_shared_memory[];

    // Cluster-wide synchronization
    cluster.sync();

    // Enhanced memory access patterns
    // ...
}
```

#### 3.1.3 L2 Cache-Aware KV Access
```cpp
// Optimize KV cache access patterns for the larger Blackwell L2 (126 MB vs 40 MB)
// Sketch: prefetch the KV tile this block is about to consume into L2 via PTX
template<typename T>
__device__ void prefetch_kv_to_l2(const T * kv_data, size_t n_bytes) {
    const char * p = reinterpret_cast<const char *>(kv_data);
    // one 128-byte cache line per iteration, strided across the thread block
    for (size_t i = (size_t) threadIdx.x * 128; i < n_bytes; i += (size_t) blockDim.x * 128) {
        asm volatile("prefetch.global.L2 [%0];" :: "l"(p + i));
    }
}
```

**Updated Timeline:** Week 4-7 ⚡ (accelerated)

### 3.2 Kernel Selection Logic Updates

**Files to modify:**
- `ggml/src/ggml-cuda/fattn.cu`

**Enhanced selection for Blackwell:**
```cpp
void ggml_cuda_flash_attn_ext(ggml_backend_cuda_context & ctx, ggml_tensor * dst) {
    const int cc = ggml_cuda_info().devices[ggml_cuda_get_device()].cc;

    if (GGML_CUDA_CC_IS_BLACKWELL(cc)) {
        // Prefer cluster-based kernels for larger problems
        // (can_use_clusters / problem_size_threshold_met are placeholders for the
        //  actual heuristics, e.g. batch size and KV length thresholds)
        if (can_use_clusters && problem_size_threshold_met) {
            ggml_cuda_flash_attn_ext_mma_f16_clusters(ctx, dst);
            return;
        }

        // Use enhanced MMA kernels with Blackwell optimizations
        ggml_cuda_flash_attn_ext_mma_f16_blackwell(ctx, dst);
        return;
    }

    // ... existing fallback logic ...
}
```

**Updated Timeline:** Week 6-7 ⚡ (accelerated)

### 3.3 Advanced Memory Access Optimizations

#### 3.3.1 HBM3/HBM3e Bandwidth Optimization
- Implement wider memory transactions (512-bit vs 256-bit)
- Optimize memory coalescing patterns for higher bandwidth
- Implement memory prefetching strategies
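
As a concrete example of the coalescing work, a sketch of 128-bit vectorized copies (the widest per-thread access CUDA exposes); the helper name is illustrative:

```cpp
#include <cuda_fp16.h>

// Copy a half2 tile with 128-bit (int4) loads/stores so each warp issues wide,
// fully coalesced memory transactions. n_half2 is assumed to be a multiple of 4.
static __device__ void copy_tile_vectorized(half2 * __restrict__ dst,
                                            const half2 * __restrict__ src,
                                            const int n_half2) {
    const int4 * src4 = reinterpret_cast<const int4 *>(src);
    int4       * dst4 = reinterpret_cast<int4 *>(dst);

    const int n4 = n_half2 / 4; // 4 half2 values (16 bytes) per int4 access
    for (int i = threadIdx.x; i < n4; i += blockDim.x) {
        dst4[i] = src4[i]; // one 128-bit load + one 128-bit store per iteration
    }
}
```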

#### 3.3.2 Async Copy Enhancements
```cpp
// Enhanced async copy for Blackwell (sketch)
// Stages data with cuda::memcpy_async on a block-scoped barrier (<cuda/barrier>);
// cluster-wide coordination happens separately via cluster.sync()
template<int cluster_size>
__device__ void async_copy_cluster_aware(
        void * dst, const void * src, size_t bytes,
        cuda::barrier<cuda::thread_scope_block> & barrier) {
    // each block copies its slice of the tile asynchronously
    cuda::memcpy_async(dst, src, bytes, barrier);
    barrier.arrive_and_wait();
}
```

**Updated Timeline:** Week 8-9 ⚡ (accelerated)

## Phase 4: Advanced Blackwell Features

### 4.1 Distributed Shared Memory Implementation

**Files to create:**
- `ggml/src/ggml-cuda/distributed-shared-memory.cuh`

**Key features:**
- Cross-block shared memory access
- Cluster-wide data sharing for attention heads
- Optimized memory layout for distributed access
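
A sketch of the core primitive using the cooperative-groups cluster API (`this_cluster()` / `map_shared_rank()`); the kernel, cluster size, and buffer layout are illustrative:

```cpp
#include <cooperative_groups.h>
#include <cuda_fp16.h>

namespace cg = cooperative_groups;

// Each block stages one slice of the K tile in its own shared memory; blocks in the
// same cluster then read each other's slices through distributed shared memory.
__global__ void __cluster_dims__(4, 1, 1) attn_dsmem_sketch(const half2 * K, int slice_elems) {
    extern __shared__ half2 k_slice[];

    cg::cluster_group cluster = cg::this_cluster();
    const unsigned int rank = cluster.block_rank();

    // stage this block's slice
    for (int i = threadIdx.x; i < slice_elems; i += blockDim.x) {
        k_slice[i] = K[rank * slice_elems + i];
    }
    cluster.sync(); // all slices resident before any remote reads

    // read a neighbouring block's slice without a round trip through global memory
    const unsigned int peer = (rank + 1) % cluster.num_blocks();
    const half2 * peer_slice = cluster.map_shared_rank(k_slice, peer);
    // ... consume peer_slice ...
    (void) peer_slice;

    cluster.sync(); // keep remote shared memory alive until all blocks are done
}
```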

**Updated Timeline:** Week 10-11 ⚡ (accelerated)

### 4.2 Advanced Occupancy Management

**Files to modify:**
- `ggml/src/ggml-cuda/fattn-common.cuh`

**Implementation:**
- `cudaOccupancyMaxActiveClusters` integration
- Dynamic cluster size selection
- Load balancing across SMs
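
A sketch of dynamic cluster-size selection driven by `cudaOccupancyMaxActiveClusters`; the candidate sizes and fallback policy are placeholders:

```cpp
#include <cuda_runtime.h>

// Hypothetical helper: pick the largest cluster size that still yields at least one
// active cluster for the given launch configuration.
template <typename... Args>
static int ggml_cuda_pick_cluster_size(void (*kernel)(Args...), dim3 grid, dim3 block, size_t smem_bytes) {
    cudaLaunchConfig_t cfg = {};
    cfg.gridDim          = grid;
    cfg.blockDim         = block;
    cfg.dynamicSmemBytes = smem_bytes;

    cudaLaunchAttribute attr = {};
    attr.id      = cudaLaunchAttributeClusterDimension;
    cfg.attrs    = &attr;
    cfg.numAttrs = 1;

    for (int cluster_size : {8, 4, 2}) {
        attr.val.clusterDim.x = cluster_size;
        attr.val.clusterDim.y = 1;
        attr.val.clusterDim.z = 1;

        int num_clusters = 0;
        if (cudaOccupancyMaxActiveClusters(&num_clusters, kernel, &cfg) == cudaSuccess &&
            num_clusters > 0) {
            return cluster_size;
        }
    }
    return 1; // no clustering
}
```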

**Updated Timeline:** Week 11-12 ⚡ (accelerated)

### 4.3 Multi-Head Attention Cluster Optimization

**New kernel variants:**
- Cluster-aware multi-head processing
- Cross-head data sharing via distributed shared memory
- Optimized attention head grouping strategies

**Updated Timeline:** Week 12-13 ⚡ (accelerated)

## Phase 5: General CUDA Kernel Optimizations

### 5.1 Matrix Operations Enhancement

**Files to modify:**
- `ggml/src/ggml-cuda/gemm.cu`
- `ggml/src/ggml-cuda/mul-mat.cu`

**Optimizations:**
- Leverage 255 registers per thread with improved scheduling
- Enhanced warp-level primitives for Blackwell
- L2 cache persistence for weight matrices
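
A sketch of how the register budget could be steered with `__launch_bounds__`: capping threads per block and requesting one resident block per SM lets the compiler allocate up to the architectural 255-register per-thread limit. The kernel below is a placeholder:

```cpp
#include <cuda_fp16.h>

// Placeholder GEMM-style kernel: at most 128 threads/block and a minimum of one
// block per SM gives the register allocator the full per-thread budget.
template<int TILE_M, int TILE_N>
__global__ void __launch_bounds__(128, 1)
mul_mat_blackwell_sketch(const half2 * __restrict__ A,
                         const half2 * __restrict__ B,
                         float       * __restrict__ C,
                         int M, int N, int K) {
    // ... register-heavy tiling / tensor-core inner loop would go here ...
}
```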

**Updated Timeline:** Week 14-15 ⚡ (accelerated)

### 5.2 Attention-Adjacent Operations

**Files to modify:**
- `ggml/src/ggml-cuda/rope.cu` (Rotary Position Embedding)
- `ggml/src/ggml-cuda/norm.cu` (Layer Normalization)

**Optimizations:**
- Thread block cluster integration where beneficial
- Enhanced shared memory usage
- Optimized memory access patterns

**Updated Timeline:** Week 15-16 ⚡ (accelerated)

## Phase 6: Performance Validation and Optimization

### 6.1 Benchmarking Infrastructure

**Files to create:**
- `tools/blackwell-bench/`

**Scope:**
- Comprehensive benchmarking suite
- Performance regression detection
- A/B testing framework

**Updated Timeline:** Week 17-18 ⚡ (accelerated)

### 6.2 Performance Tuning

**Focus areas:**
- Kernel parameter auto-tuning
- Dynamic optimization based on problem size
- Memory allocation strategy optimization
- Cache management tuning

**Updated Timeline:** Week 18-20 ⚡ (accelerated)

### 6.3 Integration Testing

**Test scenarios:**
- Various model architectures (Llama, Mistral, etc.)
- Different sequence lengths and batch sizes
- Mixed precision scenarios
- Multi-GPU configurations

**Updated Timeline:** Week 20-21 ⚡ (accelerated)

## Phase 7: Documentation and Integration

### 7.1 Documentation Updates

**Files to create/modify:**
- `docs/backend/BLACKWELL.md`
- Update existing CUDA documentation
- Code documentation and examples

**Updated Timeline:** Week 22 ⚡ (accelerated)

### 7.2 Build System Integration **⚡ FOUNDATION COMPLETE**

**Status**: Core build system complete via [PR #13360](https://github.com/ggml-org/llama.cpp/pull/13360)

**Remaining tasks:**
- ✅ CUDA version detection (complete)
- ✅ Blackwell-specific compilation flags (complete)
- 🔄 Optional feature toggles for Blackwell optimizations

**Updated Timeline:** Week 22 ⚡ (accelerated)

## Updated Success Metrics

### Performance Targets
- **Flash Attention**: 20-40% improvement over Ada Lovelace
- **Overall Inference**: 15-30% improvement in tokens/second
- **Memory Efficiency**: Better utilization of 126 MB L2 cache
- **Scalability**: Improved performance on larger context lengths

### Validation Criteria
- All existing tests pass
- No performance regression on older architectures
- Blackwell-specific optimizations activate correctly for compute capability 12.0+
- Proper fallback behavior on non-Blackwell hardware

## Updated Risk Mitigation

### Technical Risks - REDUCED ⚡
- ✅ **Build Infrastructure**: Resolved by PR #13360
- ✅ **Compute Capability Detection**: Corrected to 12.0
- 🔄 **Hardware Availability**: Still limited but build foundation ready
- 🔄 **API Changes**: Version detection in place
- 🔄 **Complexity**: Incremental implementation continues

### Timeline Risks - MITIGATED ⚡
- ✅ **Foundation Delays**: Eliminated by PR #13360
- 🔄 **Scope Creep**: Strict phase gating maintained
- 🔄 **Dependencies**: CUDA 12.8 foundation complete

## Updated Timeline Summary

**Original Timeline**: 24 weeks
**Accelerated Timeline**: 22 weeks ⚡ (2-week acceleration)

**Key Accelerations**:
- Phase 1: Complete → Immediate start on Phase 2
- Phase 2-7: 1-2 week acceleration per phase
- Build system risks eliminated

## Immediate Next Steps (Week 1)

1. **Implement Phase 1.2**: Enhanced device information structure
2. **Begin Phase 1.3**: Blackwell feature detection
3. **Start Phase 2.1**: Cluster infrastructure development
4. **Update all compute capability constants**: 1000 → 1200

## Conclusion

[PR #13360](https://github.com/ggml-org/llama.cpp/pull/13360) provides crucial foundation acceleration for our Blackwell implementation. The corrected compute capability (12.0) and completed build infrastructure allow us to begin advanced optimizations immediately.

**Key Benefits**:
- ⚡ **2-week timeline acceleration**
- ✅ **Build foundation complete**
- 🎯 **Accurate architecture targeting** (cc 12.0)
- 🚀 **Immediate development start** capability

The plan now reflects actual Blackwell specifications and leverages the completed foundation to achieve aggressive performance improvements while maintaining our systematic, phased approach.