v0.8.0
Summary
CubeCL 0.8.0 introduces major enhancements to quantization and matrix operations, a near-complete flash attention implementation, and a comprehensive matmul refactoring built on a new views and layouts system. This release also brings a new MLIR-based CPU backend with LLVM, improved memory management with multi-stream support, and persistent storage capabilities.
What's New
Features
- Flash Attention: Full implementation with masking support, partitions, row-wise reductions, and multi-plane operations (see the online-softmax sketch after this list) (@louisfd, #845, #962, #902, #920, #907)
- MLIR CPU Backend: Initial implementation providing CPU runtime support for non-Linux systems (@marcantoinem, #698, #790)
- Advanced Quantization: Block-scaled MMA, global quantization for matmul, quantized views, and support for FP4/FP2 formats (see the block-quantization sketch after this list) (@wingertge, @nathanielsimard, #815, #960, #954, #836, #809)
- Persistent Memory: Added persistent storage capabilities for artifacts (@nathanielsimard, #947)
- Multi-Stream Support: Implemented multi-stream processing for WGPU and CUDA (@nathanielsimard, #914, #896)
- Tensor Memory Arrays (TMA): Added TMA views for optimized memory access (@wingertge, #943)
- Pinned Memory: Support for pinned memory allocations (@nathanielsimard, #885)
- Manual MMA Operations: Added manually managed MMA operations with custom tile support (@wingertge, #935, #810)
- Stacked and Tensor Layouts: New layout system for matmul and advanced tensor operations (see the strided-layout sketch after this list) (@wingertge, #855, #835, #839)
- Saturating Arithmetic: Added saturating add/sub operations (see the example after this list) (@wingertge, #898)
- Shuffle Operations: Added support for basic shuffle operations (@huy209vn, #968)
- Additional Ops: Trunc, IsNan, IsInf, and powi for CUDA/HIP (@mooori, @laggui, @wingertge, #956, #937, #857)
- Partition Scheduler: New scheduling system for shared memory reads in matmul (@louisfd, #837)
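For readers new to flash attention, the sketch below shows the online-softmax recurrence it is built on, for a single row in plain Rust rather than CubeCL's kernel DSL. The function and variable names are illustrative only and do not correspond to the actual kernel code.

```rust
/// Minimal single-row sketch of online softmax: attention scores arrive block
/// by block, and the running maximum, normalizer, and weighted-value
/// accumulator are rescaled on the fly so no full score row is ever stored.
fn online_softmax_row(score_blocks: &[Vec<f32>], value_blocks: &[Vec<f32>]) -> f32 {
    let mut running_max = f32::NEG_INFINITY;
    let mut normalizer = 0.0f32; // sum of exp(score - running_max)
    let mut accumulator = 0.0f32; // weighted sum of values

    for (scores, values) in score_blocks.iter().zip(value_blocks) {
        let block_max = scores.iter().cloned().fold(f32::NEG_INFINITY, f32::max);
        let new_max = running_max.max(block_max);

        // Rescale the previous state to the new maximum before adding the block.
        let correction = (running_max - new_max).exp();
        normalizer *= correction;
        accumulator *= correction;

        for (&s, &v) in scores.iter().zip(values) {
            let w = (s - new_max).exp();
            normalizer += w;
            accumulator += w * v;
        }
        running_max = new_max;
    }
    accumulator / normalizer
}
```

The real kernels apply the same rescaling per tile, which is where the row-wise reductions and multi-plane operations listed above come in.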
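Block-scaled quantization shares one scale per block of values, chosen from the block's maximum magnitude. The plain-Rust sketch below illustrates that scheme with i8 as the storage type; the actual kernels target packed FP4/FP2 on-device, and the helper names here are hypothetical.

```rust
/// Illustrative per-block quantization: every value in a block is scaled by
/// the same factor so the block's largest magnitude maps to the integer range.
fn quantize_block(block: &[f32]) -> (Vec<i8>, f32) {
    let max_abs = block.iter().fold(0.0f32, |m, v| m.max(v.abs()));
    let scale = if max_abs == 0.0 { 1.0 } else { max_abs / 127.0 };
    let quantized = block
        .iter()
        .map(|v| (*v / scale).round().clamp(-127.0, 127.0) as i8)
        .collect();
    (quantized, scale)
}

/// Recovers approximate values by multiplying back with the block scale.
fn dequantize_block(quantized: &[i8], scale: f32) -> Vec<f32> {
    quantized.iter().map(|&q| q as f32 * scale).collect()
}
```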
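The new views and layouts system revolves around mapping logical coordinates to linear memory offsets. A simplified row-major strided layout in plain Rust conveys the core idea; the struct and method names below are illustrative and not CubeCL's actual API.

```rust
/// Simplified strided layout: a shape plus per-dimension strides.
struct StridedLayout {
    shape: Vec<usize>,
    strides: Vec<usize>,
}

impl StridedLayout {
    /// Contiguous row-major strides: the last dimension varies fastest.
    fn row_major(shape: &[usize]) -> Self {
        let mut strides = vec![1usize; shape.len()];
        for i in (0..shape.len().saturating_sub(1)).rev() {
            strides[i] = strides[i + 1] * shape[i + 1];
        }
        Self { shape: shape.to_vec(), strides }
    }

    /// Maps an n-d coordinate to a linear offset into the backing buffer.
    fn offset(&self, coords: &[usize]) -> usize {
        debug_assert_eq!(coords.len(), self.shape.len());
        coords.iter().zip(&self.strides).map(|(c, s)| c * s).sum()
    }
}
```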
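Saturating add/sub clamp results at the numeric type's bounds instead of wrapping on overflow, the same semantics as Rust's standard saturating_* integer methods:

```rust
fn main() {
    // Saturating semantics: results clamp to the type's bounds instead of wrapping.
    assert_eq!(250u8.saturating_add(10), u8::MAX); // 260 clamps to 255
    assert_eq!((-120i8).saturating_sub(20), i8::MIN); // -140 clamps to -128
}
```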
Performance Improvements
- Optimized Line Sizes: Unrolled line sizes for matmul, convolution, reduce, and attention operations (@wingertge, #918)
- Memory Management: Refactored memory management API and static memory pool (@wingertge, @nathanielsimard, #800, #787)
- Device Locking: Improved device management and CUDA device change optimization (@nathanielsimard, #959, #864)
- Reusable Shared Memory: Enhanced shared memory management (@wingertge, #931)
Breaking Changes
- CUDA 12.8 Default: Bumped default CUDA version to 12.8 with new feature implementations (@wingertge, #820)
- Item Rework: Refactored item handling system (@wingertge, #844)
Refactoring
- Matmul Restructuring: Extensive refactoring of matmul components including inputs, tile operations, generics, and stage memory configuration (@wingertge, @louisfd, #949, #886, #819, #795, #794)
- Launch System: Refactored launch mechanism (@wingertge, #944)
- Stage and Global Writers: Improved writer architecture (@wingertge, #924)
- Runtime Features: Split and reorganized runtime traits (@wingertge, #883, #868)
- Convolution: Refactored convolution implementation (@wingertge, #822)
Bug Fixes
- Quantized Matmul: Fixed quant matmul line sizes and packed matmul issues (@wingertge, #978, #967)
- Tensor Operations: Corrected tensor shapes in reduce operations and fixed reverse sequence mutation (@TsaoLun, @wingertge, #976, #957)
- Metal Backend: Fixed plane operations on Metal (@louisfd, #964)
- WGPU Improvements: Fixed async readback, multi-stream support, and out-of-bounds writes (@ArthurBrussee, @nathanielsimard, #925, #912, #961)
- Broadcasting: Fixed broadcasting issues in compare ops and binary operations (@wingertge, #916, #895)
- WGSL Fixes: Corrected scalar declarations, vec-to-scalar casts, and boolean logic (@wingertge, @Cielbird, #818, #808, #840)
- Profiling: Resolved profiling deadlock (@nathanielsimard, #963)
- Type Conversions: Fixed packed FP4 casting and comparison vectorization (@wingertge, @laggui, #890, #858)
Infrastructure
- WGPU 26: Upgraded to wgpu version 26 (@janhohenheim, #850)
- Vulkan/rspirv Fork: Forked and integrated Vulkan/rspirv (@wingertge, #880)
- SPIRV Dump: Auto-enable spirv-dump when output path is set during build (@wingertge, #928)
- Deterministic Hashing: Made hash generation deterministic (@wingertge, #948)
- No-std Support: Added no-std compatibility for cubecl-quant (@laggui, #911, #812)
- Streaming Logger: Added streaming logger and configuration (@nathanielsimard, #917)
- Build Improvements: Enhanced CUDA version selection with build scripts (@wingertge, #856)
Documentation
- Book Updates: Various improvements to documentation (@louisfd, #977)
- Getting Started: Fixed GpuTensor examples (@ChosunOne, #852)
Platform Support
- CPU on All OSes: Enabled cubecl-cpu on all operating systems (@syl20bnr, #897)
- WebGPU/WASM: Fixed WebGPU and WASM support (@ArthurBrussee, #824, #908)
- HIP Updates: Updated HIP backend with wmma compiler refactoring (@nathanielsimard, #975, #789)