v0.8.0
Summary
CubeCL 0.8.0 introduces major enhancements to quantization and matrix operations, a near-complete flash attention implementation, and a comprehensive matmul refactoring built on a new views and layouts system. This release also brings a new MLIR-based CPU backend with LLVM, improved memory management with multi-stream support, and persistent storage capabilities.
What's New
Features
- Flash Attention: Full implementation with masking support, partitions, row-wise reductions, and multi-plane operations (see the online-softmax sketch after this list) (@louisfd, #845, #962, #902, #920, #907)
- MLIR CPU Backend: Initial implementation providing CPU runtime support for non-Linux systems (@marcantoinem, #698, #790)
- Advanced Quantization: Block-scaled MMA, global quantization for matmul, quantized views, and support for FP4/FP2 formats (see the block-quantization sketch after this list) (@wingertge, @nathanielsimard, #815, #960, #954, #836, #809)
- Persistent Memory: Added persistent storage capabilities for artifacts (@nathanielsimard, #947)
- Multi-Stream Support: Implemented multi-stream processing for WGPU and CUDA (@nathanielsimard, #914, #896)
- Tensor Memory Arrays (TMA): Added TMA views for optimized memory access (@wingertge, #943)
- Pinned Memory: Support for pinned memory allocations (@nathanielsimard, #885)
- Manual MMA Operations: Added manually managed MMA operations with custom tile support (@wingertge, #935, #810)
- Stacked and Tensor Layouts: New layout system for matmul and advanced tensor operations (see the strided-layout sketch after this list) (@wingertge, #855, #835, #839)
- Saturating Arithmetic: Added saturating add/sub operations (see the example after this list) (@wingertge, #898)
- Shuffle Operations: Added support for basic shuffle operations (@huy209vn, #968)
- Additional Ops: Trunc, IsNan, IsInf, and powi for CUDA/HIP (@mooori, @laggui, @wingertge, #956, #937, #857)
- Partition Scheduler: New scheduling system for shared memory reads in matmul (@louisfd, #837)
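For readers new to flash attention, the sketch below shows the online-softmax recurrence it is built on, for a single row in plain Rust rather than CubeCL's kernel DSL. The function and variable names are illustrative only and do not correspond to the actual kernel code.

```rust
/// Minimal single-row sketch of online softmax: attention scores arrive block
/// by block, and the running maximum, normalizer, and weighted-value
/// accumulator are rescaled on the fly so no full score row is ever stored.
fn online_softmax_row(score_blocks: &[Vec<f32>], value_blocks: &[Vec<f32>]) -> f32 {
    let mut running_max = f32::NEG_INFINITY;
    let mut normalizer = 0.0f32; // sum of exp(score - running_max)
    let mut accumulator = 0.0f32; // weighted sum of values

    for (scores, values) in score_blocks.iter().zip(value_blocks) {
        let block_max = scores.iter().cloned().fold(f32::NEG_INFINITY, f32::max);
        let new_max = running_max.max(block_max);

        // Rescale the previous state to the new maximum before adding the block.
        let correction = (running_max - new_max).exp();
        normalizer *= correction;
        accumulator *= correction;

        for (&s, &v) in scores.iter().zip(values) {
            let w = (s - new_max).exp();
            normalizer += w;
            accumulator += w * v;
        }
        running_max = new_max;
    }
    accumulator / normalizer
}
```

The real kernels apply the same rescaling per tile, which is where the row-wise reductions and multi-plane operations listed above come in.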
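Block-scaled quantization shares one scale per block of values, chosen from the block's maximum magnitude. The plain-Rust sketch below illustrates that scheme with i8 as the storage type; the actual kernels target packed FP4/FP2 on-device, and the helper names here are hypothetical.

```rust
/// Illustrative per-block quantization: every value in a block is scaled by
/// the same factor so the block's largest magnitude maps to the integer range.
fn quantize_block(block: &[f32]) -> (Vec<i8>, f32) {
    let max_abs = block.iter().fold(0.0f32, |m, v| m.max(v.abs()));
    let scale = if max_abs == 0.0 { 1.0 } else { max_abs / 127.0 };
    let quantized = block
        .iter()
        .map(|v| (*v / scale).round().clamp(-127.0, 127.0) as i8)
        .collect();
    (quantized, scale)
}

/// Recovers approximate values by multiplying back with the block scale.
fn dequantize_block(quantized: &[i8], scale: f32) -> Vec<f32> {
    quantized.iter().map(|&q| q as f32 * scale).collect()
}
```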
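The new views and layouts system revolves around mapping logical coordinates to linear memory offsets. A simplified row-major strided layout in plain Rust conveys the core idea; the struct and method names below are illustrative and not CubeCL's actual API.

```rust
/// Simplified strided layout: a shape plus per-dimension strides.
struct StridedLayout {
    shape: Vec<usize>,
    strides: Vec<usize>,
}

impl StridedLayout {
    /// Contiguous row-major strides: the last dimension varies fastest.
    fn row_major(shape: &[usize]) -> Self {
        let mut strides = vec![1usize; shape.len()];
        for i in (0..shape.len().saturating_sub(1)).rev() {
            strides[i] = strides[i + 1] * shape[i + 1];
        }
        Self { shape: shape.to_vec(), strides }
    }

    /// Maps an n-d coordinate to a linear offset into the backing buffer.
    fn offset(&self, coords: &[usize]) -> usize {
        debug_assert_eq!(coords.len(), self.shape.len());
        coords.iter().zip(&self.strides).map(|(c, s)| c * s).sum()
    }
}
```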
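Saturating add/sub clamp results at the numeric type's bounds instead of wrapping on overflow, the same semantics as Rust's standard saturating_* integer methods:

```rust
fn main() {
    // Saturating semantics: results clamp to the type's bounds instead of wrapping.
    assert_eq!(250u8.saturating_add(10), u8::MAX); // 260 clamps to 255
    assert_eq!((-120i8).saturating_sub(20), i8::MIN); // -140 clamps to -128
}
```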
Performance Improvements
- Optimized Line Sizes: Unrolled line sizes for matmul, convolution, reduce, and attention operations (@wingertge, #918)
- Memory Management: Refactored memory management API and static memory pool (@wingertge, @nathanielsimard, #800, #787)
- Device Locking: Improved device management and CUDA device change optimization (@nathanielsimard, #959, #864)
- Reusable Shared Memory: Enhanced shared memory management (@wingertge, #931)
Breaking Changes
- CUDA 12.8 Default: Bumped default CUDA version to 12.8 with new feature implementations (@wingertge, #820)
- Item Rework: Refactored item handling system (@wingertge, #844)
Refactoring
- Matmul Restructuring: Extensive refactoring of matmul components including inputs, tile operations, generics, and stage memory configuration (@wingertge, @louisfd, #949, #886, #819, #795, #794)
- Launch System: Refactored launch mechanism (@wingertge, #944)
- Stage and Global Writers: Improved writer architecture (@wingertge, #924)
- Runtime Features: Split and reorganized runtime traits (@wingertge, #883, #868)
- Convolution: Refactored convolution implementation (@wingertge, #822)
Bug Fixes
- Quantized Matmul: Fixed quant matmul line sizes and packed matmul issues (@wingertge, #978, #967)
- Tensor Operations: Corrected tensor shapes in reduce operations and fixed reverse sequence mutation (@TsaoLun, @wingertge, #976, #957)
- Metal Backend: Fixed plane operations on Metal (@louisfd, #964)
- WGPU Improvements: Fixed async readback, multi-stream support, and out-of-bounds writes (@ArthurBrussee, @nathanielsimard, #925, #912, #961)
- Broadcasting: Fixed broadcasting issues in compare ops and binary operations (@wingertge, #916, #895)
- WGSL Fixes: Corrected scalar declarations, vec-to-scalar casts, and boolean logic (@wingertge, @Cielbird, #818, #808, #840)
- Profiling: Resolved profiling deadlock (@nathanielsimard, #963)
- Type Conversions: Fixed packed FP4 casting and comparison vectorization (@wingertge, @laggui, #890, #858)
Infrastructure
- WGPU 26: Upgraded to wgpu version 26 (@janhohenheim, #850)
- Vulkan/rspirv Fork: Forked and integrated Vulkan/rspirv (@wingertge, #880)
- SPIRV Dump: Auto-enable spirv-dump when output path is set during build (@wingertge, #928)
- Deterministic Hashing: Made hash generation deterministic (@wingertge, #948)
- No-std Support: Added no-std compatibility for cubecl-quant (@laggui, #911, #812)
- Streaming Logger: Added streaming logger and configuration (@nathanielsimard, #917)
- Build Improvements: Enhanced CUDA version selection with build scripts (@wingertge, #856)
Documentation
- Book Updates: Various improvements to documentation (@louisfd, #977)
- Getting Started: Fixed GpuTensor examples (@ChosunOne, #852)
Platform Support
- CPU on All OSes: Enabled cubecl-cpu on all operating systems (@syl20bnr, #897)
- WebGPU/WASM: Fixed WebGPU and WASM support (@ArthurBrussee, #824, #908)
- HIP Updates: Updated HIP backend with wmma compiler refactoring (@nathanielsimard, #975, #789)