# NVIDIA Blackwell GPU Architecture Support Implementation Plan

## Overview

This document outlines the implementation plan for adding comprehensive NVIDIA Blackwell GPU architecture support to llama.cpp. The plan is structured in phases to ensure systematic development, testing, and validation of Blackwell-specific optimizations.

## Current State Analysis

- **Compute Capability**: Currently supports up to Ada Lovelace (8.9)
- **Blackwell Support**: [PR #13360](https://github.com/ggml-org/llama.cpp/pull/13360) adds CUDA 12.8 + sm120 build support
- **Missing Features**: Thread Block Clusters, L2 cache management, HBM3/HBM3e optimizations
- **Flash Attention**: Multiple kernel variants but no Blackwell-specific optimizations
- **Compatibility**: Basic functionality works via backward compatibility, but performance is sub-optimal

## Architecture Constants Update

**Critical Finding**: The Blackwell GPUs targeted here (RTX 50 series) report compute capability **12.0** (sm_120), not 10.0 as initially assumed; the datacenter B100/B200 parts report 10.0 (sm_100).

## Flash Attention Analysis

llama.cpp implements multiple Flash Attention kernel variants:
- **MMA-based kernels** (`fattn-mma-f16.cuh`): Modern implementation for Turing+
- **Vector kernels** (`fattn-vec-f32/f16.cuh`): For smaller batches/specific dimensions
- **WMMA kernels** (`fattn-wmma-f16.cu`): Legacy implementation for Volta
- **Tile kernels** (`fattn-tile-f16/f32.cu`): For architectures without tensor cores

Selection logic in `ggml_cuda_flash_attn_ext()` considers compute capability, batch size, head dimensions, and data types.

## Phase 1: Foundation and Architecture Detection ⚡ **ACCELERATED**

### 1.1 Add Blackwell Constants and Detection **✅ FOUNDATION COMPLETE**

**Status**: Foundation provided by [PR #13360](https://github.com/ggml-org/llama.cpp/pull/13360)
- ✅ CUDA 12.8 toolkit support
- ✅ sm120 compilation target
- ✅ Build system integration

**Files to modify:**
- `ggml/src/ggml-cuda/common.cuh`
- `ggml/src/ggml-cuda/ggml-cuda.cu`

**Updated Implementation:**
```cpp
// Add to common.cuh - corrected for the actual Blackwell compute capability
#define GGML_CUDA_CC_BLACKWELL        1200 // RTX 50 series (12.0)
#define GGML_CUDA_CC_BLACKWELL_FUTURE 1300 // Future Blackwell variants

#define GGML_CUDA_CC_IS_BLACKWELL(cc)      ((cc) >= GGML_CUDA_CC_BLACKWELL && (cc) < GGML_CUDA_CC_BLACKWELL_FUTURE)
#define GGML_CUDA_CC_SUPPORTS_CLUSTERS(cc) ((cc) >= GGML_CUDA_CC_BLACKWELL)
```

**Timeline:** ~~Week 1-2~~ **COMPLETE** ✅

### 1.2 Enhanced Device Information Structure

**Files to modify:**
- `ggml/src/ggml-cuda/ggml-cuda.cu` (cuda_device_info)

**New fields:**
- `max_clusters_per_multiprocessor`
- `max_blocks_per_cluster`
- `l2_cache_size`
- `hbm_bandwidth`
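
A minimal sketch of how these fields could be queried at init time; the struct and helper below are illustrative placeholders rather than the actual `cuda_device_info` layout, and the bandwidth figure is a rough estimate derived from `memoryClockRate`/`memoryBusWidth`:

```cpp
#include <cuda_runtime.h>

// Hypothetical container for the new fields (the real ones would be added to
// cuda_device_info in ggml-cuda.cu).
struct blackwell_device_caps {
    int    cc;                 // 100*major + 10*minor, e.g. 1200 for 12.0
    int    cluster_launch;     // non-zero if thread block cluster launch is supported
    size_t l2_cache_size;      // bytes
    double hbm_bandwidth_gbs;  // rough peak bandwidth estimate, GB/s
};

static blackwell_device_caps query_blackwell_device_caps(int device) {
    blackwell_device_caps caps = {};
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, device);

    caps.cc            = 100*prop.major + 10*prop.minor;
    caps.l2_cache_size = prop.l2CacheSize;
    cudaDeviceGetAttribute(&caps.cluster_launch, cudaDevAttrClusterLaunch, device);

    // DDR-style estimate: 2 transfers/clock * bus width; memoryClockRate is in kHz.
    caps.hbm_bandwidth_gbs = 2.0 * prop.memoryClockRate * (prop.memoryBusWidth / 8) / 1.0e6;
    return caps;
}
```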

**Updated Timeline:** Week 1 ⚡ (accelerated due to build foundation)

### 1.3 Blackwell Feature Detection **NEW**

**Files to create:**
- `ggml/src/ggml-cuda/blackwell-detect.cu`

**Implementation:**
```cpp
bool ggml_cuda_supports_blackwell_features(int device_id) {
    cudaDeviceProp prop;
    if (cudaGetDeviceProperties(&prop, device_id) != cudaSuccess) {
        return false;
    }

    // Verify compute capability 12.0+
    const int cc = 100*prop.major + 10*prop.minor;
    if (!GGML_CUDA_CC_IS_BLACKWELL(cc)) {
        return false;
    }

    // Verify cluster support (cudaDevAttrClusterLaunch is non-zero if cluster launch is available)
    int cluster_launch = 0;
    cudaDeviceGetAttribute(&cluster_launch, cudaDevAttrClusterLaunch, device_id);

    return cluster_launch != 0;
}
```

**Timeline:** Week 1-2

## Phase 2: Thread Block Clusters Foundation

### 2.1 Cluster Detection and Support Infrastructure

**Files to create:**
- `ggml/src/ggml-cuda/clusters.cuh`
- `ggml/src/ggml-cuda/clusters.cu`

**Key functions:**
- `ggml_cuda_cluster_occupancy()`
- `ggml_cuda_launch_kernel_clusters()`
- `ggml_cuda_cluster_sync_init()`
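
A sketch of what `ggml_cuda_launch_kernel_clusters()` could look like, built on the CUDA 12 extensible launch API (`cudaLaunchKernelEx` with `cudaLaunchAttributeClusterDimension`); the wrapper signature is a placeholder:

```cpp
#include <cuda_runtime.h>

// Hypothetical wrapper: launch `kernel` with a 1-D cluster of `cluster_size` blocks.
// gridDim.x must be a multiple of the cluster size.
template <typename... Args>
static cudaError_t ggml_cuda_launch_kernel_clusters(
        void (*kernel)(Args...), dim3 grid, dim3 block, size_t smem_bytes,
        cudaStream_t stream, int cluster_size, Args... args) {
    cudaLaunchConfig_t cfg = {};
    cfg.gridDim          = grid;
    cfg.blockDim         = block;
    cfg.dynamicSmemBytes = smem_bytes;
    cfg.stream           = stream;

    cudaLaunchAttribute attr = {};
    attr.id               = cudaLaunchAttributeClusterDimension;
    attr.val.clusterDim.x = cluster_size;
    attr.val.clusterDim.y = 1;
    attr.val.clusterDim.z = 1;

    cfg.attrs    = &attr;
    cfg.numAttrs = 1;

    return cudaLaunchKernelEx(&cfg, kernel, args...);
}
```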

**Updated Timeline:** Week 2-3 ⚡ (accelerated)

### 2.2 L2 Cache Management

**Files to modify:**
- `ggml/src/ggml-cuda/ggml-cuda.cu`

**Implementation:**
- L2 cache persistence API wrappers
- Cache allocation strategy for KV cache data
- Stream-based cache management
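
A sketch of the persistence wrapper built on the standard access-policy-window API; the function name and the policy of pinning the whole KV buffer are placeholders:

```cpp
#include <algorithm>
#include <cuda_runtime.h>

// Hypothetical helper: mark a KV-cache buffer as "persisting" in L2 for a stream.
static void ggml_cuda_l2_persist_kv(cudaStream_t stream, int device, void * kv_data, size_t kv_bytes) {
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, device);

    // Reserve part of L2 for persisting accesses (capped by the device maximum).
    const size_t carveout = std::min<size_t>(kv_bytes, prop.persistingL2CacheMaxSize);
    cudaDeviceSetLimit(cudaLimitPersistingL2CacheSize, carveout);

    // Hint that accesses to this window should stay resident in L2.
    cudaStreamAttrValue attr = {};
    attr.accessPolicyWindow.base_ptr  = kv_data;
    attr.accessPolicyWindow.num_bytes = std::min<size_t>(kv_bytes, prop.accessPolicyMaxWindowSize);
    attr.accessPolicyWindow.hitRatio  = 1.0f;
    attr.accessPolicyWindow.hitProp   = cudaAccessPropertyPersisting;
    attr.accessPolicyWindow.missProp  = cudaAccessPropertyStreaming;
    cudaStreamSetAttribute(stream, cudaStreamAttributeAccessPolicyWindow, &attr);
}
```

A matching teardown (for example `cudaCtxResetPersistingL2Cache()`) would be needed when the KV buffer is freed or resized.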

**Updated Timeline:** Week 3-4 ⚡ (accelerated)

## Phase 3: Flash Attention Blackwell Optimizations

### 3.1 MMA Kernel Enhancements for Blackwell

**Files to modify:**
- `ggml/src/ggml-cuda/fattn-mma-f16.cuh`
- `ggml/src/ggml-cuda/fattn-common.cuh`

**Key optimizations:**

#### 3.1.1 Enhanced Shared Memory Usage
```cpp
// Leverage 228 KB shared memory per SM vs 164 KB on Ada Lovelace
template<int DKQ, int DV>
struct fattn_blackwell_config : fattn_mma_f16_config<DKQ, DV> {
    static constexpr int cc_target           = GGML_CUDA_CC_BLACKWELL; // 1200
    static constexpr int smpb_blackwell      = 228 * 1024;             // 228 KB
    static constexpr int enhanced_batch_size = smpb_blackwell / (DKQ * sizeof(half));

    // Increase tile sizes for better cache utilization
    static constexpr int nbatch_fa_blackwell = std::min(enhanced_batch_size, 128);
};
```

#### 3.1.2 Thread Block Cluster Integration
```cpp
template<int DKQ, int DV, int ncols1, int ncols2, int cluster_size>
__global__ void __cluster_dims__(cluster_size, 1, 1)
flash_attn_ext_f16_clustered(/* parameters */) {
    namespace cg = cooperative_groups; // requires <cooperative_groups.h>
    cg::cluster_group cluster = cg::this_cluster();

    // Shared memory of this block; other blocks in the cluster can map it via
    // distributed shared memory
    extern __shared__ half2 cluster_shared_memory[];

    // Cluster-wide synchronization
    cluster.sync();

    // Enhanced memory access patterns
    // ...
}
```

#### 3.1.3 L2 Cache-Aware KV Access
```cpp
// Optimize KV cache access patterns for the larger Blackwell L2 (126 MB vs 40 MB)
// Sketch: prefetch the KV tile this block is about to consume into L2 via PTX
template<typename T>
__device__ void prefetch_kv_to_l2(const T * kv_data, size_t n_bytes) {
    const char * p = reinterpret_cast<const char *>(kv_data);
    // one 128-byte cache line per iteration, strided across the thread block
    for (size_t i = (size_t) threadIdx.x * 128; i < n_bytes; i += (size_t) blockDim.x * 128) {
        asm volatile("prefetch.global.L2 [%0];" :: "l"(p + i));
    }
}
```

**Updated Timeline:** Week 4-7 ⚡ (accelerated)

### 3.2 Kernel Selection Logic Updates

**Files to modify:**
- `ggml/src/ggml-cuda/fattn.cu`

**Enhanced selection for Blackwell:**
```cpp
void ggml_cuda_flash_attn_ext(ggml_backend_cuda_context & ctx, ggml_tensor * dst) {
    const int cc = ggml_cuda_info().devices[ggml_cuda_get_device()].cc;

    if (GGML_CUDA_CC_IS_BLACKWELL(cc)) {
        // Prefer cluster-based kernels for larger problems
        // (can_use_clusters / problem_size_threshold_met are placeholders for the
        //  actual heuristics, e.g. batch size and KV length thresholds)
        if (can_use_clusters && problem_size_threshold_met) {
            ggml_cuda_flash_attn_ext_mma_f16_clusters(ctx, dst);
            return;
        }

        // Use enhanced MMA kernels with Blackwell optimizations
        ggml_cuda_flash_attn_ext_mma_f16_blackwell(ctx, dst);
        return;
    }

    // ... existing fallback logic ...
}
```

**Updated Timeline:** Week 6-7 ⚡ (accelerated)

### 3.3 Advanced Memory Access Optimizations

#### 3.3.1 HBM3/HBM3e Bandwidth Optimization
- Implement wider memory transactions (512-bit vs 256-bit)
- Optimize memory coalescing patterns for higher bandwidth
- Implement memory prefetching strategies
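
As a concrete example of the coalescing work, a sketch of 128-bit vectorized copies (the widest per-thread access CUDA exposes); the helper name is illustrative:

```cpp
#include <cuda_fp16.h>

// Copy a half2 tile with 128-bit (int4) loads/stores so each warp issues wide,
// fully coalesced memory transactions. n_half2 is assumed to be a multiple of 4.
static __device__ void copy_tile_vectorized(half2 * __restrict__ dst,
                                            const half2 * __restrict__ src,
                                            const int n_half2) {
    const int4 * src4 = reinterpret_cast<const int4 *>(src);
    int4       * dst4 = reinterpret_cast<int4 *>(dst);

    const int n4 = n_half2 / 4; // 4 half2 values (16 bytes) per int4 access
    for (int i = threadIdx.x; i < n4; i += blockDim.x) {
        dst4[i] = src4[i]; // one 128-bit load + one 128-bit store per iteration
    }
}
```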

#### 3.3.2 Async Copy Enhancements
```cpp
// Enhanced async copy for Blackwell (sketch)
// Stages data with cuda::memcpy_async on a block-scoped barrier (<cuda/barrier>);
// cluster-wide coordination happens separately via cluster.sync()
template<int cluster_size>
__device__ void async_copy_cluster_aware(
        void * dst, const void * src, size_t bytes,
        cuda::barrier<cuda::thread_scope_block> & barrier) {
    // each block copies its slice of the tile asynchronously
    cuda::memcpy_async(dst, src, bytes, barrier);
    barrier.arrive_and_wait();
}
```

**Updated Timeline:** Week 8-9 ⚡ (accelerated)

## Phase 4: Advanced Blackwell Features

### 4.1 Distributed Shared Memory Implementation

**Files to create:**
- `ggml/src/ggml-cuda/distributed-shared-memory.cuh`

**Key features:**
- Cross-block shared memory access
- Cluster-wide data sharing for attention heads
- Optimized memory layout for distributed access
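
A sketch of the core primitive using the cooperative-groups cluster API (`this_cluster()` / `map_shared_rank()`); the kernel, cluster size, and buffer layout are illustrative:

```cpp
#include <cooperative_groups.h>
#include <cuda_fp16.h>

namespace cg = cooperative_groups;

// Each block stages one slice of the K tile in its own shared memory; blocks in the
// same cluster then read each other's slices through distributed shared memory.
__global__ void __cluster_dims__(4, 1, 1) attn_dsmem_sketch(const half2 * K, int slice_elems) {
    extern __shared__ half2 k_slice[];

    cg::cluster_group cluster = cg::this_cluster();
    const unsigned int rank = cluster.block_rank();

    // stage this block's slice
    for (int i = threadIdx.x; i < slice_elems; i += blockDim.x) {
        k_slice[i] = K[rank * slice_elems + i];
    }
    cluster.sync(); // all slices resident before any remote reads

    // read a neighbouring block's slice without a round trip through global memory
    const unsigned int peer = (rank + 1) % cluster.num_blocks();
    const half2 * peer_slice = cluster.map_shared_rank(k_slice, peer);
    // ... consume peer_slice ...
    (void) peer_slice;

    cluster.sync(); // keep remote shared memory alive until all blocks are done
}
```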

**Updated Timeline:** Week 10-11 ⚡ (accelerated)

### 4.2 Advanced Occupancy Management

**Files to modify:**
- `ggml/src/ggml-cuda/fattn-common.cuh`

**Implementation:**
- `cudaOccupancyMaxActiveClusters` integration
- Dynamic cluster size selection
- Load balancing across SMs
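
A sketch of dynamic cluster-size selection driven by `cudaOccupancyMaxActiveClusters`; the candidate sizes and fallback policy are placeholders:

```cpp
#include <cuda_runtime.h>

// Hypothetical helper: pick the largest cluster size that still yields at least one
// active cluster for the given launch configuration.
template <typename... Args>
static int ggml_cuda_pick_cluster_size(void (*kernel)(Args...), dim3 grid, dim3 block, size_t smem_bytes) {
    cudaLaunchConfig_t cfg = {};
    cfg.gridDim          = grid;
    cfg.blockDim         = block;
    cfg.dynamicSmemBytes = smem_bytes;

    cudaLaunchAttribute attr = {};
    attr.id      = cudaLaunchAttributeClusterDimension;
    cfg.attrs    = &attr;
    cfg.numAttrs = 1;

    for (int cluster_size : {8, 4, 2}) {
        attr.val.clusterDim.x = cluster_size;
        attr.val.clusterDim.y = 1;
        attr.val.clusterDim.z = 1;

        int num_clusters = 0;
        if (cudaOccupancyMaxActiveClusters(&num_clusters, kernel, &cfg) == cudaSuccess &&
            num_clusters > 0) {
            return cluster_size;
        }
    }
    return 1; // no clustering
}
```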

**Updated Timeline:** Week 11-12 ⚡ (accelerated)

### 4.3 Multi-Head Attention Cluster Optimization

**New kernel variants:**
- Cluster-aware multi-head processing
- Cross-head data sharing via distributed shared memory
- Optimized attention head grouping strategies

**Updated Timeline:** Week 12-13 ⚡ (accelerated)

## Phase 5: General CUDA Kernel Optimizations

### 5.1 Matrix Operations Enhancement

**Files to modify:**
- `ggml/src/ggml-cuda/gemm.cu`
- `ggml/src/ggml-cuda/mul-mat.cu`

**Optimizations:**
- Leverage 255 registers per thread with improved scheduling
- Enhanced warp-level primitives for Blackwell
- L2 cache persistence for weight matrices
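
A sketch of how the register budget could be steered with `__launch_bounds__`: capping threads per block and requesting one resident block per SM lets the compiler allocate up to the architectural 255-register per-thread limit. The kernel below is a placeholder:

```cpp
#include <cuda_fp16.h>

// Placeholder GEMM-style kernel: at most 128 threads/block and a minimum of one
// block per SM gives the register allocator the full per-thread budget.
template<int TILE_M, int TILE_N>
__global__ void __launch_bounds__(128, 1)
mul_mat_blackwell_sketch(const half2 * __restrict__ A,
                         const half2 * __restrict__ B,
                         float       * __restrict__ C,
                         int M, int N, int K) {
    // ... register-heavy tiling / tensor-core inner loop would go here ...
}
```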

**Updated Timeline:** Week 14-15 ⚡ (accelerated)

### 5.2 Attention-Adjacent Operations

**Files to modify:**
- `ggml/src/ggml-cuda/rope.cu` (Rotary Position Embedding)
- `ggml/src/ggml-cuda/norm.cu` (Layer Normalization)

**Optimizations:**
- Thread block cluster integration where beneficial
- Enhanced shared memory usage
- Optimized memory access patterns

**Updated Timeline:** Week 15-16 ⚡ (accelerated)

## Phase 6: Performance Validation and Optimization

### 6.1 Benchmarking Infrastructure

**Files to create:**
- `tools/blackwell-bench/`

**Scope:**
- Comprehensive benchmarking suite
- Performance regression detection
- A/B testing framework

**Updated Timeline:** Week 17-18 ⚡ (accelerated)

### 6.2 Performance Tuning

**Focus areas:**
- Kernel parameter auto-tuning
- Dynamic optimization based on problem size
- Memory allocation strategy optimization
- Cache management tuning

**Updated Timeline:** Week 18-20 ⚡ (accelerated)

### 6.3 Integration Testing

**Test scenarios:**
- Various model architectures (Llama, Mistral, etc.)
- Different sequence lengths and batch sizes
- Mixed precision scenarios
- Multi-GPU configurations

**Updated Timeline:** Week 20-21 ⚡ (accelerated)

## Phase 7: Documentation and Integration

### 7.1 Documentation Updates

**Files to create/modify:**
- `docs/backend/BLACKWELL.md`
- Update existing CUDA documentation
- Code documentation and examples

**Updated Timeline:** Week 22 ⚡ (accelerated)

### 7.2 Build System Integration **⚡ FOUNDATION COMPLETE**

**Status**: Core build system complete via [PR #13360](https://github.com/ggml-org/llama.cpp/pull/13360)

**Remaining tasks:**
- ✅ CUDA version detection (complete)
- ✅ Blackwell-specific compilation flags (complete)
- 🔄 Optional feature toggles for Blackwell optimizations

**Updated Timeline:** Week 22 ⚡ (accelerated)

## Updated Success Metrics

### Performance Targets
- **Flash Attention**: 20-40% improvement over Ada Lovelace
- **Overall Inference**: 15-30% improvement in tokens/second
- **Memory Efficiency**: Better utilization of 126 MB L2 cache
- **Scalability**: Improved performance on larger context lengths

### Validation Criteria
- All existing tests pass
- No performance regression on older architectures
- Blackwell-specific optimizations activate correctly for compute capability 12.0+
- Proper fallback behavior on non-Blackwell hardware

## Updated Risk Mitigation

### Technical Risks - REDUCED ⚡
- ✅ **Build Infrastructure**: Resolved by PR #13360
- ✅ **Compute Capability Detection**: Corrected to 12.0
- 🔄 **Hardware Availability**: Still limited but build foundation ready
- 🔄 **API Changes**: Version detection in place
- 🔄 **Complexity**: Incremental implementation continues

### Timeline Risks - MITIGATED ⚡
- ✅ **Foundation Delays**: Eliminated by PR #13360
- 🔄 **Scope Creep**: Strict phase gating maintained
- 🔄 **Dependencies**: CUDA 12.8 foundation complete

## Updated Timeline Summary

**Original Timeline**: 24 weeks
**Accelerated Timeline**: 22 weeks ⚡ (2-week acceleration)

**Key Accelerations**:
- Phase 1: Complete → Immediate start on Phase 2
- Phase 2-7: 1-2 week acceleration per phase
- Build system risks eliminated

## Immediate Next Steps (Week 1)

1. **Implement Phase 1.2**: Enhanced device information structure
2. **Begin Phase 1.3**: Blackwell feature detection
3. **Start Phase 2.1**: Cluster infrastructure development
4. **Update all compute capability constants**: 1000 → 1200

## Conclusion

[PR #13360](https://github.com/ggml-org/llama.cpp/pull/13360) provides crucial foundation acceleration for our Blackwell implementation. The corrected compute capability (12.0) and completed build infrastructure allow us to begin advanced optimizations immediately.

**Key Benefits**:
- ⚡ **2-week timeline acceleration**
- ✅ **Build foundation complete**
- 🎯 **Accurate architecture targeting** (cc 12.0)
- 🚀 **Immediate development start** capability

The plan now reflects actual Blackwell specifications and leverages the completed foundation to achieve aggressive performance improvements while maintaining our systematic, phased approach.