Releases: NVIDIA/cutlass
CUTLASS 4.3.5
CUTLASS 4.3.4
CuTe DSL
- New features
- Added PDL support along with an example: Kernel launch with Programmatic Dependent Launch
- Bug fixes and improvements
- Fixed a frame refcount issue with CUDA graphs
- Enhanced the tvm-ffi AoT path to allow earlier module unload
- Fixed an ordering issue in `make_smem_layout_a` in utils/hopper_helpers.py
CUTLASS C++
- Work around a driver TMA descriptor bug that occasionally causes errors on Blackwell when the tensor's backing memory allocation is smaller than 128KB and the tensor is not a dense, non-overlapping tensor.
CUTLASS 4.3.3
CuTe DSL
- New features
- Supported namedtuple and kwargs for JIT function arguments in tvm-ffi
- Supported variadic tuples for JIT function arguments in tvm-ffi
- Bug fixes and improvements
- Fixed an issue with JIT function arguments that have union type annotations in tvm-ffi
- Clearer error message for the runtime error `cudaErrorInsufficientDriver`
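The namedtuple and kwargs argument shapes now accepted for tvm-ffi JIT functions can be illustrated in plain Python. The sketch below only shows the calling convention; the actual JIT decorator and tvm-ffi plumbing are omitted, and all names here (`GemmShape`, `launch`) are hypothetical:

```python
from collections import namedtuple

# Hypothetical parameter bundle passed to a JIT-compiled function.
GemmShape = namedtuple("GemmShape", ["m", "n", "k"])

def launch(shape: GemmShape, *, alpha: float = 1.0, beta: float = 0.0):
    # A JIT function can now accept a namedtuple positionally and the
    # remaining scalars as keyword arguments.
    return shape.m * shape.n * shape.k, alpha, beta

result = launch(GemmShape(m=128, n=256, k=64), alpha=2.0)
```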
CUTLASS 4.3.2
CuTe DSL
- New features
- New env var `CUTE_DSL_CACHE_DIR` to specify the path for dumping caches
- Bug fixes and improvements
- Fixed an issue in the CUDA `JitExecutor` when unloading kernels
- Fixed an issue with allocating maximum shared memory when some shared memory is statically allocated
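Since `CUTE_DSL_CACHE_DIR` is an environment variable, it can be set from the shell or from Python before the DSL is imported. A minimal sketch (the directory path is just an example):

```python
import os

# Redirect the CuTe DSL cache dump to a custom directory (example path).
# Set this before importing the DSL so the cache location takes effect.
os.environ["CUTE_DSL_CACHE_DIR"] = "/tmp/cute_dsl_cache"
```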
CUTLASS 4.3.1
CuTe DSL
- New features
- Added Blackwell SM103 support
- Multiple dependent DSOs in the wheel have been merged into one single DSO
- Bug fixing and improvements
- Fixed a device reset issue with tvm-ffi
- Fixed exporting compiled functions via tvm-ffi
CUTLASS C++
- Support blockscaled variant of ragged contiguous grouped gemm with the new simplified MoE API in example 92.
- The new example works for all microscaling types.
CUTLASS 4.3.0
CuTe DSL
- New features:
- Supported Apache TVM-FFI for further reduced host runtime overhead for JIT functions and better interoperability with PyTorch and other ML frameworks
- Added fake tensors and streams to decouple JIT function compilation from the `from_dlpack` flow; users no longer need a real tensor when compiling a JIT function
- Added FastDivmodDivisor with Python operator overloads, new APIs, Cute dialect integration, and optimized static tile scheduler performance for faster index mapping.
- Added L2 cache eviction priority for TMA-related ops, enabling fine-grained L2 cache control
- Debuggability improvements:
  - Supported source location tracking for DSL APIs, allowing tools like Nsight to correlate profiling metrics with Python source code
  - Supported dumping PTX and CUBIN code: Hello World Example
- More examples and notebooks to get started with CuTe DSL:
  - Improved performance of elementwise example:
    - Generalized code to handle a list of input tensors
    - Generalized TV layout computation to handle different data types
  - Improved Blackwell SM100 persistent dense GEMM with static scheduling:
    - Demonstrates usage of the new Pipeline APIs `PipelineProducer` and `PipelineConsumer` to simplify code without explicit pipeline state management (existing APIs are still maintained)
    - Separated epilogue code for non-TMA and TMA implementations
  - Tutorial for Blackwell GEMM: Basic Blackwell SM100 GEMM
    - Baseline Blackwell GEMM achieves 84% SOL performance with MNK 8K
    - More examples are coming to demo optimizations: Baseline + X
  - Tutorial for Async Pipeline API
  - Reworked elementwise add notebook with more details and a thorough explanation of TV layout:
    - Updated implementation to handle general data types and multiple inputs
    - Updated explanation of TV layout in simpler language
    - Added visualization of TV layout with 3rd-party utils
  - Benchmark and autotune demonstration
- More examples of authoring peak-performance kernels:
- Blackwell SM100 mixed-input GEMM
- Blackwell SM100 persistent blockwise dense GEMM
- Blackwell SM100 persistent blockwise contiguous grouped dense GEMM
- Blackwell SM100 persistent blockwise masked grouped dense GEMM
- Blackwell SM100 fmha bwd
- Blackwell SM100 mla
- Hopper SM90 persistent dense GEMM with static scheduling
- Blackwell GeForce batched dense GEMM
- Ampere HSTU Attention
- API updates:
- Please refer to DSL API changelog for details
- Bug fixes and improvements
- Add mma_tiler_n=64 and mma_tiler_n=192 support in Blackwell SM100 persistent dense blockscaled GEMM with static scheduling.
- Fixed `TensorSSA.reduce` to support a static value as the initial value
- Updated docstrings for the following APIs to be more concise and easier to understand: `make_layout_tv`, `is_static`, `PipelineAsync`, `SmemAllocator`
- Fixed documentation for `pipeline`, `utils` and `cute.math`
- Added overlapping accumulator optimization for the block tile N = 256 case for better epilogue latency hiding in Blackwell SM100 persistent dense blockscaled GEMM with static scheduling
- Fixed TensorSSA.getitem indexing to match CuTe's indexing convention
- Fixed an issue with cutlass.max and cutlass.min
- Fixed an issue with mark_compact_shape_dynamic
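The `FastDivmodDivisor` mentioned above builds on the classic divide-by-invariant-integer trick: replace a runtime division by a fixed divisor with one multiply and one shift using a precomputed magic number. A minimal plain-Python sketch of the underlying technique (illustrative only; this is not the CuTe DSL API):

```python
class FastDivmod:
    """Sketch of the fast-divmod technique (Granlund-Montgomery style).

    For a fixed divisor d, precompute shift s with 2**s >= d and the magic
    multiplier m = ceil(2**(32 + s) / d); then for 0 <= n < 2**32,
    floor(n / d) == (n * m) >> (32 + s).
    """

    def __init__(self, divisor: int):
        assert 0 < divisor < 2**32
        self.divisor = divisor
        # Smallest shift s such that 2**s >= divisor
        self.shift = (divisor - 1).bit_length()
        # Magic multiplier: ceil(2**(32 + shift) / divisor)
        self.multiplier = -((-(1 << (32 + self.shift))) // divisor)

    def divmod(self, n: int) -> tuple[int, int]:
        # Valid for 0 <= n < 2**32: one multiply + shift replaces a division
        q = (n * self.multiplier) >> (32 + self.shift)
        return q, n - q * self.divisor
```

On GPUs this saves an expensive integer division per index computation; per the notes above, the DSL's divisor object layers Python operator overloads and Cute dialect integration on top of this idea.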
CUTLASS C++
- Further enhance Blackwell SM100 Attention kernels in example 77.
- Add softmax skip correction.
- Fix a shared memory allocation bug: maximum dynamic shared memory must be opted into explicitly once it exceeds 48KB.
- Fix a hang caused by a warp returning early.
- Add support through command-line argument lists for `batch`, `no_verif`, `cluster_shape` and `cluster_shape_fallback` in example 89.
- Add Ragged Contiguous Grouped GEMM kernel in example 92.
- This kernel uses a TMA 3D load for the weights matrix and the tensormap update method to load activations.
- Add 256x128 tile size support for Hopper SM90 deepgemm in example 67.
- Performance is optimized to align with Deepseek implementation.
- Simplification of the API for MoE GEMMs.
  - Instead of requiring users to call several CuTe utilities to set up strides, the `moe_stride_utils` API is introduced to help set up strides in the kernel.
  - Instead of requiring users to set vectors like `problem_shapes_device` and `problem_shapes_host`, a new problem shape struct called `MoEProblemShape` is introduced, which takes max_m, max_n, max_k and a counts vector as input and deduces problem shapes internally whenever required.
- Enable GEMM_K = 0 in grouped gemm.
- Optimize grouped GEMM kernels by enabling async TMA descriptor update.
- Support Blackwell SM100 convolution stream-K kernel.
- Unit tests: fprop_streamK, dgrad_streamK, wgrad_streamK.
- Add Blackwell SM100 sparse gemm compressor unit tests.
- Unit tests: compressor_fp16.
- Add sub-bytes and runtime data type support in compressor unit test testbed.
- Add profiler support for:
- Blackwell SM100 and SM120 blockscaled sparse kernels.
- New MoE grouped gemm API.
- Blackwell SM100 cpasync kernel.
- Fix some kernel issues:
- Fix a race check issue of Blackwell SM103 kernels by adding missing elect one for prefetch barrier initialization.
- Allow user to directly specify the number of stages for Hopper sm90 mixed input gemm.
- Remove warnings caused by cuda vector type alignment setting in CUDA 13.
- Remove problematic `cutlass::int8_t` and replace it with `int8_t`.
- Fix a few bugs in the distributed GEMM API and examples.
- Fix handling negative zero in sparse compressor.
- Add missing `wait_on_dependent_grids` for the PDL use case.
- Fix some profiler issues:
- Add some missing reference kernels.
- Support VoidC reference kernels.
- Add calculation of scale factors A and B in function `bytes_with_problem_shape` of the block scaled profiler.
- Fix an issue when the epilogue tile N is not divisible by the default subtile N.
- Various improvements and fixes from the community and CUTLASS team. Thanks to everyone who submitted PRs!
- Optimal code generation with CUDA toolkit version 13.0U1.
CUTLASS 4.2.1
CuTe DSL
- Bug fixes and improvements
- Fixed an issue when running DSL codes with cuda-python 13.0
- Fixed an issue when running inductor with DSL codes
- Fixed an issue with unexpected logging when running DSL codes in FlashInfer
- Fixed the issue reported in #2647
- Fixed an issue with conditional definition of variables outside of dynamic control flow
CUTLASS C++
- Bypass EVT for nosmem blockwise kernels on Blackwell.
- Rename cutlass/python/cutlass directory to cutlass/python/cutlass_cppgen.
CUTLASS 4.2.0
CuTe DSL
- More Python versions are now supported for both x86-64 and aarch64, including
- Python 3.10, 3.11, 3.12, and 3.13
- Added new example and updated notebook to get started with CuTe DSL
- Call kernels with dlpack bypassed
- Updates on TensorSSA demonstration
- Added a section introducing broadcasting
- API updates
- Please refer to DSL API changelog for details
- Bug fixes and improvements
- Fixed `cute.print_tensor` for coordinate tensors
- Fixed `cute.print` for tuples of layouts
- Fixed frozen objects not being properly updated after full assignment in dynamic control flow
- Fixed a compilation failure when assigning tuple/list elements in dynamic control flow
- Improved error message when CUDA context is not initialized
- Improved docstrings of `congruent` and `weakly_congruent`
CUTLASS C++
- Support for Blackwell SM103 kernels for B300 GPUs.
- Collective mainloop codes: Blockscaled datatypes with support for dense GEMM mainloop
- New GEMM and epilogue dispatch policies for collectives, kernel layers, and builders.
- Kernel codes: Blockscaled datatypes with support for dense GEMM kernel.
- Set of examples that demonstrate the usage of the 3.x API for targeting Blackwell SM103 architecture:
- Set of unit tests that demonstrate the usage of Blackwell SM103 blockscaled GEMM
- Unit test files with prefix `sm103_` under GEMM device unit tests.
- Support for Blackwell SM121 kernels for DGX Spark GPUs.
- Share the major codes with Blackwell SM120 kernels.
- Add support for heuristics-based kernel filtering and autotuning using `nvidia-matmul-heuristics` to find the best kernels for a given scenario.
  - For details, please refer to the heuristics doc.
- Further enhance Blackwell SM100 Attention kernels in example 77.
- Add fused reduction kernel support for cutlass MLA.
- Add softmax skip correction.
- Support for GQA in FMHA backward kernel.
- Fix an issue where `get_unmasked_trip_count` may return a negative value.
- Fix an issue where mbarriers are initialized with a zero arrival count.
- Fix a corner case issue where the sequence length of q is not a multiple of tile_q.
- Remove tma padding for forward kernel inputs.
- Add Blackwell SM100 kernels for MoEs (focusing on Low-Latency inference performance): example 92. It uses TMA (for weights) and CPASYNC (for tokens) to load input matrices and allow only one problem dimension to vary across groups/experts, unlike general Grouped GEMMs. Note: further API simplifications and kernel improvements are upcoming. Any feedback on API is welcome.
- Further enhance blockwise and groupwise GEMMs on Hopper and Blackwell
- On Blackwell SM120, a blockwise gemm kernel is added: example 87.
- On Hopper, add K major scale factor support for SM90 blockwise kernels.
- On Hopper, relax the restriction that the k dimension of the problem size has to be the multiple of the k dimension of the tile size.
- On Hopper, grouped version supports the case when k = 0.
- Support for Blackwell SM100 fp4 gemv kernels.
- Kernel codes: Gemv kernel.
- Example codes: example 91
- Support for Blackwell SM100 legacy mixed input GEMM kernels.
- Collective mainloop codes: Mixed input mainloop.
- Kernel codes: Mixed input kernel.
- Example codes: example 86.
- Support for Blackwell SM100 cpasync kernel.
- Collective mainloop codes: cpasync mainloop.
- Kernel codes: cpasync kernel.
- Support Blackwell SM120 mixed input blockscaled grouped GEMM.
- Instantiating more Blackwell kernels in profiler.
  - Blackwell SM100 and SM103 kernels support `CUTLASS_LIBRARY_INSTANTIATION_LEVEL` to instantiate all possible combinations.
  - To use this feature, `CUTLASS_LIBRARY_KERNELS` must be non-empty. The profiler will combine `CUTLASS_LIBRARY_KERNELS` and `CUTLASS_LIBRARY_INSTANTIATION_LEVEL` to instantiate specific kernels.
  - For details, please check the Profiler Doc.
- Fix some profiler issues:
- Modify default cluster fallback values to be non-zero to avoid profiler failures when these values are not set on the command line.
- Fix some no output and timeout issues.
- Fix Pingpong Blockwise Hopper library generation.
- From CUDA 13.0, the Blackwell SM101 for Thor GPUs is renamed to SM110.
- For CUDA toolkit version < 13.0, SM101 is still used for Thor GPUs.
- For CUDA toolkit version >= 13.0, SM110 is used for Thor GPUs and SM101 is no longer valid.
- Rename legacy Python API package from `cutlass` to `cutlass_cppgen` and add Blackwell EVT support to the legacy Python interface.
  - Restructured the C++ Blackwell SM100 Collective Epilogue Builder to work with the Python interface's `EpilogueDescriptors`.
  - Added Blackwell SM100 EVT Emitter on the Python side and routed most emission through the Hopper SM90 Emitter.
  - Added some support for running SM100 kernels via the Python interface.
- CuTe changes:
- Fix inaccurate GridDim calculation in the CuTe tutorial.
- Add movmatrix support.
- Fix smallest MMA-N allowed for Blackwell fp8 and fp16 gemm kernels.
- Support fp16 accumulator for SM89 fp8 MMA.
- Shorten `nullspace` implementation.
- Isolate and comment on `cosize` hacks.
- Important documentation correction: `E<0,1> == 1@0@1`.
- Fix some kernel issues:
- Fix Hopper SM90 group gemm kernel to only use the commit group and wait group instead of also waiting on mbarriers.
- Fix a tiny bug when K is large for Blackwell SM103 fp4 grouped GEMM kernel.
- Add the following unit tests:
- Various improvements and fixes from the community and CUTLASS team. Thanks to everyone who submitted PRs!
- Optimal code generation with CUDA toolkit version 13.0U1.
CUTLASS 4.1.0
CuTe DSL
- Add aarch64 support; you can now pip install `nvidia-cutlass-dsl` on GB200 systems!
- More examples demonstrating how to use CuTe DSL to write peak-performance kernels
- API updates
- Please refer to FUNCTIONALITY.md for details
CUTLASS C++
- Further enhance Blackwell SM100 Attention kernels in example 77.
- Add variable sequence length support for FMHA Backward kernel.
- Add varlen test support to Backward runner.
- Codes support empty batch sequences.
- Replace `subbyte_iterator` with `cute::recast_ptr` when constructing logical iterators/arrays.
- CuTe changes:
- Rewrite ArithTuple and ScaledBasis for robustness and clarity.
- Remove buggy and kludgy `get_layoutA|B|C_MN` and friends from Atoms/TiledX.
- Factor out `print_latex` and friends and rewrite.
- Factor out `print_svg` and friends and rewrite.
- Support Blackwell SM100 SIMT packed fp32x2 kernels.
- Support residual add for implicit gemm kernels.
- Various fixes for CUTLASS C++ Python interface's EVT tracer:
- Add a verifier for SM90 to report invalid input.
- When adding an edge to the graph, if the edge already exists, add an identity compute node to avoid having multiple parallel edges.
- Register operations of tanh, sigmoid, exp, gelu to the python ast frontend.
- Replace the NotImplemented Error by packing all nodes into a single topological visitor node as a fallback.
- Fix profiler bugs in exhaustive perf search.
- Fix incorrect cluster shape output issue when doing exhaustive search.
- Fix a bug in profiler grouped GEMM for setting tile scheduler swizzles, cluster shapes, and raster orders.
- Fix some profiler issues.
- Complete the reference for Blackwell blockwise gemm kernels.
- Fix incorrect regex logic for L1 test.
CUTLASS 4.0.0
CuTe DSL
CuTe DSL is a Python DSL centered around CuTe's abstractions
- Enables authoring kernels in Python to reach peak performance on NVIDIA GPUs
- Core DSL implementation files
- DSL quick start
- DSL Overview
- Educational notebooks for getting started with CuTe DSL
CUTLASS C++
- Support Family Specific Architecture Features, which were introduced in CUDA 12.9
- Further improved Blockwise and Groupwise GEMMs on Hopper and Blackwell
- Enhance Blackwell SM100 Attention kernels in example 77
- Add Blackwell SM100 implicit GEMM conv fprop/dgrad/wgrad unit tests
- New Hopper SM90 FMHA example, similar in design to the existing Blackwell FMHA
- CuTe enhancements: CuTe C++ reduce op
- Other functional and performance enhancements