Releases: NVIDIA/cutlass

CUTLASS 2.10.0

16 Sep 02:42
fc9ebc6

  • CUTLASS Python now supports GEMM, Convolution, and Grouped GEMM for different data types as well as different epilogue flavors.
  • Optimizations for CUTLASS's Grouped GEMM kernel, which can now move some scheduling to the host side when applicable.
  • Optimizations for GEMM+Softmax.
  • Grouped GEMM for Multihead Attention: a general MHA that does not require equal sequence lengths in every GEMM.
  • GEMM + Layernorm fusion for Ampere, which can fuse the layernorm into the GEMMs before and after it.
  • GEMM epilogue permutation fusion, which can permute the GEMM output before storing it.
  • Grouped convolution targeting implicit GEMM: the first group convolution implementation in CUTLASS. It is an Analytical implementation, not an Optimized one.
  • Depthwise separable convolution: the first depthwise convolution in CUTLASS, also Analytical for now.
  • Standalone Layernorm and Pooling kernels.
  • Back-to-back GEMM enhancements.
  • Updates and bug fixes from the community (thanks!)
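
The GEMM+Softmax fusion above computes a matrix product followed by a row-wise softmax in one kernel, avoiding a round trip of the intermediate product through global memory. As a minimal reference for the math such a fused kernel produces (pure Python, not the CUTLASS API; function names here are illustrative):

```python
# Reference for the result a fused GEMM+Softmax kernel computes:
# D = softmax(A @ B) applied row-wise, written naively and unfused.
import math

def gemm(a, b):
    """Plain matrix multiply: (m x k) @ (k x n) -> (m x n)."""
    m, k, n = len(a), len(b), len(b[0])
    return [[sum(a[i][p] * b[p][j] for p in range(k)) for j in range(n)]
            for i in range(m)]

def softmax_rows(c):
    """Numerically stable row-wise softmax."""
    out = []
    for row in c:
        mx = max(row)                       # subtract the row max for stability
        e = [math.exp(x - mx) for x in row]
        s = sum(e)
        out.append([x / s for x in e])
    return out

def gemm_softmax(a, b):
    # A fused kernel produces this result without writing C = A @ B
    # to global memory; this reference computes the same thing in two steps.
    return softmax_rows(gemm(a, b))
```

Each output row sums to 1, regardless of the scale of the GEMM result.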

CUTLASS 2.9.1

29 Jun 01:15
e45e773

Bug fixes, performance tuning, and enhancements to documentation.

CUTLASS 2.9.0

27 Apr 16:31
319a389

CUTLASS 2.8

06 Dec 19:22
5fe09c2

CUTLASS 2.7

20 Sep 18:10
2e07c4c

CUTLASS 2.6.1

03 Sep 17:27
6c2f8f2

  • Arbitrary padding and striding for CUTLASS Strided DGRAD Convolution operator (Analytic Iterators)
  • Tuning for GEMMs fused with partial reductions
  • Corrections and bug fixes reported by the CUTLASS community
    • Thank you for filing these issues!

CUTLASS 2.6.0

03 Sep 16:52
a01feb9

  • Optimal performance when compiled with the CUDA 11.4 Toolkit
  • Fused operators with GEMM and Convolution
  • 64b tensor strides and leading dimensions support for GEMMs
  • Affine rank=2 matrix layouts
  • Batched GEMV preview implementation
  • New strided Dgrad implementation
    • Accelerates the previous implementation by cutting redundant math by 4x
    • Supported via new Dy and w analytic iterators and the existing cutlass::conv::device::ImplicitGemmConvolution interface
  • Quaternion-valued GEMM and Convolution in single- and double-precision (targeting CUDA Cores)
  • Many improvements to the epilogue.
    • Provides an option to not fully unroll the epilogue, reducing code size and improving performance when using complicated elementwise operations
    • Performance improvement for FP16 tensor core kernels
    • Bug fixes
  • Enhanced Clang support: the combination of Clang 13 and CUDA 11.4 can build and run kernels on Pascal and Ampere architectures.
  • Updated minimum CUDA Toolkit requirement to 10.2
  • Corrections and bug fixes reported by the CUTLASS community
    • Thank you for filing these issues!
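
An affine rank=2 layout maps a (row, column) coordinate to a linear memory offset through two independent strides, so row-major, column-major, and general strided views are all special cases of one layout. A small sketch of the idea (illustrative class name, not the CUTLASS API):

```python
# Sketch of an affine rank-2 layout:
#   offset(i, j) = i * stride_row + j * stride_col
# Row-major and column-major layouts are just particular stride choices.
class AffineRank2:
    def __init__(self, stride_row, stride_col):
        self.stride_row = stride_row
        self.stride_col = stride_col

    def offset(self, i, j):
        """Linear offset of element (i, j) in memory."""
        return i * self.stride_row + j * self.stride_col

# For a 3x4 matrix: row-major uses strides (4, 1),
# column-major (with 3 rows) uses strides (1, 3).
row_major = AffineRank2(4, 1)
col_major = AffineRank2(1, 3)
```

Because both strides are free parameters, the same kernel code can address transposed or interleaved tensors without separate layout specializations.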

CUTLASS 2.5.0

03 Mar 19:20
0f10563

CUTLASS 2.5 is a minor release contributing the following:

  • Tensor reductions
    • m-to-n reductions of tensors with affine layout
    • Specializations for reductions including contiguous dimension
    • Specializations for reductions excluding contiguous dimension
    • Custom reduction functors such as cutlass::logical_and
    • Large tensor support, up to 2^63 elements (however, each dimension is limited to an extent of 2^31)
  • Optimizations for 3-D convolution
  • Fused Convolution+Convolution example
  • Corrections and bug fixes reported by the CUTLASS community
    • Thank you for filing these issues!
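
The tensor reductions above accept a custom binary functor (such as cutlass::logical_and) in place of the usual sum. A minimal reference for what an m-to-n reduction along one dimension computes, in pure Python (the function and names are illustrative, not the CUTLASS API):

```python
# Reduce a rank-2 tensor (list of lists) along one dimension with a
# custom binary functor, analogous to supplying cutlass::logical_and
# as the reduction operator.
from functools import reduce

def reduce_dim(tensor, dim, op, identity):
    """dim=1 reduces along the contiguous (column) dimension, one result
    per row; dim=0 reduces across rows, one result per column."""
    if dim == 1:
        return [reduce(op, row, identity) for row in tensor]
    cols = len(tensor[0])
    return [reduce(op, (row[j] for row in tensor), identity) for j in range(cols)]

def logical_and(a, b):
    # Custom reduction functor: the result is True only if every element
    # along the reduced dimension is truthy.
    return bool(a) and bool(b)
```

The identity element must match the functor (True for logical_and, 0 for addition), mirroring how a reduction's accumulator is initialized.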

CUTLASS 2.4.0

03 Dec 16:03

  • Implicit GEMM convolution kernels supporting CUDA and Tensor Cores on NVIDIA GPUs
    • Operators: forward (Fprop), backward data gradient (Dgrad), and backward weight gradient (Wgrad) convolution
    • Data types: FP32, complex, Tensor Float 32 (TF32), BFloat16 (BF16), Float16, Int4, Int8, Int32
    • Spatial dimensions: 1-D, 2-D, and 3-D
    • Layouts: NHWC, NCxHWx
  • Implicit GEMM convolution components:
    • Global memory iterators supporting Fprop, Dgrad, and Wgrad
    • MmaMultistage for implicit GEMM convolution for NVIDIA Ampere architecture
    • MmaPipeline for implicit GEMM convolution for NVIDIA Volta and Turing architectures
    • Documentation describing Implicit GEMM Convolution algorithm and implementation
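
Implicit GEMM computes a convolution as a matrix multiply without ever materializing the im2col matrix in memory. As a reference for the underlying mapping, the sketch below materializes im2col explicitly for a 1-D, single-channel, stride-1, unpadded convolution and checks it against a direct convolution (pure Python, not CUTLASS code; names are illustrative):

```python
# Reference for how Fprop convolution maps onto a GEMM. The "implicit GEMM"
# algorithm computes the same product on the fly, without building the
# im2col matrix in memory.

def im2col_1d(x, r):
    """Each of the P = len(x) - r + 1 output positions becomes one GEMM row
    containing the r input taps that contribute to it."""
    p = len(x) - r + 1
    return [[x[i + s] for s in range(r)] for i in range(p)]

def conv1d_as_gemm(x, w):
    """y = im2col(x) @ w : a (P x R) matrix times an R-vector."""
    a = im2col_1d(x, len(w))
    return [sum(row[s] * w[s] for s in range(len(w))) for row in a]

def conv1d_direct(x, w):
    """Direct sliding-window convolution (cross-correlation form)."""
    r = len(w)
    return [sum(x[i + s] * w[s] for s in range(r))
            for i in range(len(x) - r + 1)]
```

In 2-D with channels, the same idea turns Fprop into a GEMM of shape (N·P·Q) x (R·S·C) times (R·S·C) x K; the global memory iterators listed above generate those im2col rows on the fly.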

CUTLASS 2.3

25 Sep 18:27
c2b80ad
