Releases: NVIDIA/cutlass

CUTLASS 4.3.5

09 Jan 06:08
4faf1a1

CuTe DSL

  • Bug fixes and improvements
    • Fixed the unexpected CPU overhead issue introduced by 4.3.4
    • Updated copyright to 2026.

CUTLASS C++

  • Update copyright to 2026.
  • Use the CUDA Runtime API to query the driver version rather than the Driver API.

CUTLASS 4.3.4

24 Dec 05:49
1810164

CuTe DSL

  • Bug fixes and improvements

    • Fixed a frame refcount issue with CUDA Graph
    • Enhanced the tvm-ffi AoT path to unload modules earlier
    • Fixed an ordering issue in make_smem_layout_a in utils/hopper_helpers.py

CUTLASS C++

  • Work around a driver bug related to TMA descriptors that occasionally causes errors on Blackwell when the tensor's backing memory allocation is smaller than 128 KB and the tensor is not dense and non-overlapping.
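The "dense, non-overlapping" condition above is the usual compact-layout property of a tensor's shape and strides. As a rough illustration only (plain Python, not a CUTLASS API; the function name is made up), a hypothetical check might look like:

```python
def is_dense_non_overlapping(shape, strides):
    """Hypothetical illustration: True when (shape, strides) describes a
    dense, non-overlapping tensor, i.e. a permutation of a packed layout."""
    # Sort modes by stride; in a packed layout each stride equals the
    # product of the extents of all smaller-stride modes.
    expected = 1
    for stride, extent in sorted(zip(strides, shape)):
        if extent == 1:
            continue  # size-1 modes impose no constraint
        if stride != expected:
            return False  # padded (stride too large) or overlapping (too small)
        expected *= extent
    return True
```

For example, a row-major (4, 8) tensor with strides (8, 1) is dense, while strides (16, 1) leave padding between rows and would fall under the workaround.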

CUTLASS 4.3.3

12 Dec 05:12
d55f6be

CuTe DSL

  • New features

    • Supported namedtuple and kwargs for JIT function arguments in tvm-ffi
    • Supported variadic tuples for JIT function arguments in tvm-ffi
  • Bug fixes and improvements

    • Fixed an issue with JIT function arguments that have union type annotations in tvm-ffi
    • Clearer error message for the cudaErrorInsufficientDriver runtime error

CUTLASS 4.3.2

05 Dec 18:51
5c149f5

CuTe DSL

  • New features

    • New environment variable CUTE_DSL_CACHE_DIR to specify the directory for dumping caches
  • Bug fixes and improvements

    • Fixed an issue in the CUDA JitExecutor when unloading kernels
    • Fixed an issue with allocating the maximum shared memory when some shared memory is statically allocated
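A minimal sketch of using the CUTE_DSL_CACHE_DIR variable introduced above. The directory name is arbitrary, and the assumption that the variable should be set before the DSL is imported is illustrative, not documented behavior:

```python
import os
import tempfile

# Assumption: set the variable before importing the DSL so the first
# compilation already writes its cache to the chosen directory.
cache_dir = os.path.join(tempfile.gettempdir(), "cute_dsl_cache")
os.makedirs(cache_dir, exist_ok=True)
os.environ["CUTE_DSL_CACHE_DIR"] = cache_dir

# import cutlass.cute as cute  # DSL import goes after the env var is set
```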

CUTLASS 4.3.1

02 Dec 03:22

CuTe DSL

  • New features
    • Added Blackwell SM103 support
    • Merged multiple dependent DSOs in the wheel into a single DSO
  • Bug fixes and improvements
    • Fixed a device reset issue with tvm-ffi
    • Fixed exporting compiled functions with tvm-ffi

CUTLASS C++

  • Support the blockscaled variant of ragged contiguous grouped GEMM with the new simplified MoE API in example 92.
    • The new example works for all microscaling types.

CUTLASS 4.3.0

24 Nov 22:24
e67e63c

CuTe DSL

CUTLASS C++

  • Further enhance Blackwell SM100 Attention kernels in example 77.
    • Add softmax skip correction.
    • Fix a shared memory allocation bug: the kernel must explicitly opt in to maximum dynamic shared memory once it exceeds 48 KB.
    • Fix a hang caused by an early-returning warp.
  • Add command-line argument support for batch, no_verif, cluster_shape, and cluster_shape_fallback in example 89.
  • Add Ragged Contiguous Grouped gemm kernel in example 92.
    • This kernel uses a 3D TMA load for the weights matrix and the tensormap update method to load activations.
  • Add 256x128 tile size support for Hopper SM90 deepgemm in example 67.
    • Performance is optimized to align with the DeepSeek implementation.
  • Simplify the API for MoE GEMMs.
    • Instead of requiring users to call several CuTe utilities to set up strides, the moe_stride_utils API is introduced to help set up strides in the kernel.
    • Instead of requiring users to set vectors like problem_shapes_device and problem_shapes_hosts, a new problem shape struct, MoEProblemShape, is introduced which takes max_m, max_n, max_k, and a counts vector as input and deduces problem shapes internally whenever required.
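As a rough sketch of the deduction described above (plain Python, not the C++ API; the function name, the choice of which dimension varies, and the use of max_m as a per-group bound are all assumptions for illustration):

```python
def deduce_group_shapes(max_m, max_n, max_k, counts):
    # Hypothetical sketch of what a MoEProblemShape-style struct deduces:
    # only one problem dimension (here M, the per-expert token count)
    # varies across groups/experts, while N and K stay fixed.
    # Assumption: max_m is an upper bound on any group's M extent.
    assert all(0 <= m <= max_m for m in counts)
    return [(m, max_n, max_k) for m in counts]
```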
  • Enable GEMM_K = 0 in grouped gemm.
  • Optimize grouped GEMM kernels by enabling async TMA descriptor updates.
  • Support Blackwell SM100 convolution stream-K kernel.
  • Add Blackwell SM100 sparse gemm compressor unit tests.
    • Unit tests: compressor_fp16.
    • Add sub-byte and runtime data type support in the compressor unit test testbed.
  • Add profiler support for:
    • Blackwell SM100 and SM120 blockscaled sparse kernels.
    • New MoE grouped gemm API.
    • Blackwell SM100 cpasync kernel.
  • Fix some kernel issues:
    • Fix a race check issue in Blackwell SM103 kernels by adding the missing elect-one for prefetch barrier initialization.
    • Allow users to directly specify the number of stages for Hopper SM90 mixed input GEMM.
    • Remove warnings caused by the CUDA vector type alignment setting in CUDA 13.
    • Remove the problematic cutlass::int8_t and replace it with int8_t.
    • Fix a few bugs in distributed gemm API and examples.
    • Fix handling negative zero in sparse compressor.
    • Add missing wait_on_dependent_grids for PDL use case.
  • Fix some profiler issues:
    • Add some missing reference kernels.
    • Support VoidC reference kernels.
    • Add calculation of scale factors A and B in the bytes_with_problem_shape function of the block scaled profiler.
    • Fix an issue when the epilogue tile N is not divisible by the default subtile N.
  • Various improvements and fixes from the community and CUTLASS team. Thanks to everyone who submitted PRs!
  • Optimal code generation with CUDA toolkit version 13.0U1.

CUTLASS 4.2.1

24 Sep 05:23
f3fde58

CuTe DSL

  • Bug fixes and improvements
    • Fixed an issue when running DSL code with cuda-python 13.0
    • Fixed an issue when running inductor with DSL code
    • Fixed an issue with unexpected logging when running DSL code in FlashInfer
    • Fixed the issue reported in #2647
    • Fixed an issue with conditionally defined variables outside of dynamic control flow

CUTLASS C++

  • Bypass EVT for nosmem blockwise kernels on Blackwell.
  • Rename cutlass/python/cutlass directory to cutlass/python/cutlass_cppgen.

CUTLASS 4.2.0

18 Sep 03:32

CuTe DSL

  • More Python versions are now supported for both x86-64 and aarch64, including
    • Python 3.10, 3.11, 3.12, and 3.13
  • Added a new example and an updated notebook for getting started with CuTe DSL
  • API updates
  • Bug fixes and improvements
    • Fixed cute.print_tensor for coordinate tensor
    • Fixed cute.print for tuple of layouts
    • Fixed frozen objects not being properly updated after full assignment in dynamic control flow
    • Fixed compilation failures when assigning a tuple/list element in dynamic control flow
    • Improved error message when CUDA context is not initialized
    • Improved docstring of congruent and weakly_congruent

CUTLASS C++

  • Support for Blackwell SM103 kernels for B300 GPUs.
  • Set of examples that demonstrate the usage of the 3.x API for targeting the Blackwell SM103 architecture.
  • Set of unit tests that demonstrate the usage of Blackwell SM103 blockscaled GEMM
  • Support for Blackwell SM121 kernels for DGX Spark GPUs.
    • Shares most of its code with the Blackwell SM120 kernels.
  • Add support for heuristics-based kernel filtering and autotuning using nvidia-matmul-heuristics to find the best kernels for a given scenario.
  • Further enhance Blackwell SM100 Attention kernels in example 77.
    • Add fused reduction kernel support for cutlass MLA.
    • Add softmax skip correction.
    • Support for GQA in FMHA backward kernel.
    • Fix an issue where get_unmasked_trip_count may return a negative value.
    • Fix an issue where mbarriers are initialized with a zero arrival count.
    • Fix a corner case issue where the sequence length of q is not a multiple of tile_q.
    • Remove tma padding for forward kernel inputs.
  • Add Blackwell SM100 kernels for MoEs (focusing on low-latency inference performance): example 92. It uses TMA (for weights) and CPASYNC (for tokens) to load the input matrices and allows only one problem dimension to vary across groups/experts, unlike general grouped GEMMs. Note: further API simplifications and kernel improvements are upcoming. Any feedback on the API is welcome.
  • Further enhance blockwise and groupwise GEMMs on Hopper and Blackwell
    • On Blackwell SM120, a blockwise gemm kernel is added: example 87.
    • On Hopper, add K major scale factor support for SM90 blockwise kernels.
    • On Hopper, relax the restriction that the k dimension of the problem size has to be the multiple of the k dimension of the tile size.
    • On Hopper, grouped version supports the case when k = 0.
  • Support for Blackwell SM100 fp4 gemv kernels.
  • Support for Blackwell SM100 legacy mixed input GEMM kernels.
  • Support for Blackwell SM100 cpasync kernel.
  • Support Blackwell SM120 mixed input blockscaled grouped GEMM.
  • Instantiating more Blackwell kernels in profiler.
    • Blackwell SM100 and SM103 kernels support CUTLASS_LIBRARY_INSTANTIATION_LEVEL to instantiate all possible combinations.
    • To use this feature, CUTLASS_LIBRARY_KERNELS must be non-empty. Profiler will combine CUTLASS_LIBRARY_KERNELS and CUTLASS_LIBRARY_INSTANTIATION_LEVEL to instantiate specific kernels.
    • For details, see the profiler documentation.
  • Fix some profiler issues:
    • Modify default cluster callback values to be non-zero to avoid profiler failures when these values are not set on the command line.
    • Fix some no output and timeout issues.
    • Fix Pingpong Blockwise Hopper library generation.
  • From CUDA 13.0, the Blackwell SM101 for Thor GPUs is renamed to SM110.
    • For CUDA toolkit version < 13.0, SM101 is still used for Thor GPUs.
    • For CUDA toolkit version >= 13.0, SM110 is used for Thor GPUs and SM101 is no longer valid.
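The rename rule above can be summarized as a tiny helper (illustrative only; the function name is made up):

```python
def thor_sm_name(cuda_major, cuda_minor=0):
    # Per the notes: CUDA toolkit < 13.0 names Thor GPUs SM101;
    # from 13.0 onward they are SM110 and SM101 is no longer valid.
    return "SM110" if (cuda_major, cuda_minor) >= (13, 0) else "SM101"
```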
  • Rename legacy Python API package from cutlass to cutlass_cppgen and add Blackwell EVT support to legacy Python interface.
    • Restructuring the C++ Blackwell SM100 Collective Epilogue Builder to work with the Python interface's EpilogueDescriptors.
    • Added Blackwell SM100 EVT Emitter on the Python side and routed most emission through Hopper SM90 Emitter.
    • Added some support for running SM100 kernels via the Python interface.
  • CuTe changes:
    • Fix an inaccurate GridDim calculation in the CuTe tutorial.
    • Add movmatrix support.
    • Fix smallest MMA-N allowed for Blackwell fp8 and fp16 gemm kernels.
    • Support fp16 accumulators for SM89 fp8 MMA.
    • Shorten nullspace implementation.
    • Isolate and comment on cosize hacks.
    • Important documentation correction: E<0,1> == 1@0@1.
  • Fix some kernel issues:
    • Fix the Hopper SM90 grouped GEMM kernel to use only commit group and wait group instead of also waiting on mbarriers.
    • Fix a tiny bug when K is large for Blackwell SM103 fp4 grouped GEMM kernel.
  • Add the following unit tests:
  • Various improvements and fixes from the community and CUTLASS team. Thanks to everyone who submitted PRs!
  • Optimal code generation with CUDA toolkit version 13.0U1.

CUTLASS 4.1.0

28 Jul 03:57
e51efbf

CuTe DSL

CUTLASS C++

  • Further enhance Blackwell SM100 Attention kernels in example 77.
    • Add variable sequence length support for FMHA Backward kernel.
    • Add varlen test support to Backward runner.
    • Support empty batch sequences.
  • Replace subbyte_iterator with cute::recast_ptr when constructing logical iterators/arrays.
  • CuTe changes:
    • Rewrite ArithTuple and ScaledBasis for robustness and clarity.
    • Remove buggy and kludgy get_layoutA|B|C_MN and friends from Atoms/TiledX.
    • Factor out print_latex and friends and rewrite.
    • Factor out print_svg and friends and rewrite.
  • Support Blackwell SM100 SIMT packed fp32x2 kernels.
  • Support residual add for implicit gemm kernels.
  • Various fixes for CUTLASS C++ Python interface's EVT tracer:
    • Add a verifier for SM90 to report invalid input.
    • When adding an edge to the graph, if the edge already exists, add an identity compute node to avoid having multiple parallel edges.
    • Register operations of tanh, sigmoid, exp, gelu to the Python AST frontend.
    • Replace the NotImplementedError by packing all nodes into a single topological visitor node as a fallback.
  • Fix profiler bugs in exhaustive perf search.
    • Fix incorrect cluster shape output issue when doing exhaustive search.
    • Fix a bug in profiler grouped GEMM for setting tile scheduler swizzles, cluster shapes, and raster orders.
  • Fix some profiler issues.
    • Complete the reference for Blackwell blockwise gemm kernels.
    • Fix incorrect regex logic for L1 test.

CUTLASS 4.0.0

27 Jun 14:17
b995f93

CuTe DSL

CuTe DSL is a Python DSL centered around CuTe's abstractions.

CUTLASS C++