Releases: NVIDIA/cutlass
CUTLASS 4.3.5
CUTLASS 4.3.4
CuTe DSL
- New features
- Added PDL support along with an example: Kernel launch with Programmatic Dependent Launch
- Bug fixes and improvements
- Fixed a frame refcount issue with CUDA graphs
- Enhanced the tvm-ffi AoT path to allow earlier module unload
- Fixed an ordering issue in `make_smem_layout_a` in utils/hopper_helpers.py
CUTLASS C++
- Work around a driver TMA descriptor bug that occasionally causes errors on Blackwell when the tensor's backing memory allocation is smaller than 128KB and the tensor is not a dense, non-overlapping tensor.
CUTLASS 4.3.3
CuTe DSL
- New features
- Supported namedtuple and kwargs for JIT function arguments in tvm-ffi
- Supported variadic tuples for JIT function arguments in tvm-ffi
- Bug fixes and improvements
- Fixed an issue with JIT function arguments that have union type annotations in tvm-ffi
- Clearer error message for the runtime error `cudaErrorInsufficientDriver`
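The namedtuple and kwargs argument shapes now accepted for tvm-ffi JIT functions can be illustrated in plain Python. The sketch below only shows the calling convention; the actual JIT decorator and tvm-ffi plumbing are omitted, and all names here (`GemmShape`, `launch`) are hypothetical:

```python
from collections import namedtuple

# Hypothetical parameter bundle passed to a JIT-compiled function.
GemmShape = namedtuple("GemmShape", ["m", "n", "k"])

def launch(shape: GemmShape, *, alpha: float = 1.0, beta: float = 0.0):
    # A JIT function can now accept a namedtuple positionally and the
    # remaining scalars as keyword arguments.
    return shape.m * shape.n * shape.k, alpha, beta

result = launch(GemmShape(m=128, n=256, k=64), alpha=2.0)
```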
CUTLASS 4.3.2
CuTe DSL
- New features
- New env var `CUTE_DSL_CACHE_DIR` to specify the path for dumping caches
- Bug fixes and improvements
- Fixed an issue in the CUDA `JitExecutor` when unloading kernels
- Fixed an issue with allocating maximum shared memory when some shared memory is statically allocated
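Since `CUTE_DSL_CACHE_DIR` is an environment variable, it can be set from the shell or from Python before the DSL is imported. A minimal sketch (the directory path is just an example):

```python
import os

# Redirect the CuTe DSL cache dump to a custom directory (example path).
# Set this before importing the DSL so the cache location takes effect.
os.environ["CUTE_DSL_CACHE_DIR"] = "/tmp/cute_dsl_cache"
```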
CUTLASS 4.3.1
CuTe DSL
- New features
- Added Blackwell SM103 support
- Multiple dependent DSOs in the wheel have been merged into one single DSO
- Bug fixing and improvements
- Fixed a device reset issue with tvm-ffi
- Fixed exporting compiled functions via tvm-ffi
CUTLASS C++
- Support blockscaled variant of ragged contiguous grouped gemm with the new simplified MoE API in example 92.
- The new example works for all microscaling types.
CUTLASS 4.3.0
CuTe DSL
- New features:
- Supported Apache TVM-FFI for further reduced host runtime overhead for JIT functions and better interoperability with PyTorch and other ML frameworks
- Added fake tensors and streams to decouple JIT function compilation from the `from_dlpack` flow; users no longer need a real tensor when compiling a JIT function
- Added FastDivmodDivisor with Python operator overloads, new APIs, Cute dialect integration, and optimized static tile scheduler performance for faster index mapping.
- Added L2 cache eviction priority for TMA-related ops, enabling fine-grained L2 cache control
- Debuggability improvements:
  - Supported source location tracking for DSL APIs, allowing tools like Nsight to correlate profiling metrics with Python source code
  - Supported dumping PTX and CUBIN code: Hello World Example
- More examples and notebooks to get started with CuTe DSL:
  - Improved performance of elementwise example:
    - Generalized code to handle a list of input tensors
    - Generalized TV layout computation to handle different data types
  - Improved Blackwell SM100 persistent dense GEMM with static scheduling:
    - Demonstrates usage of the new Pipeline APIs `PipelineProducer` and `PipelineConsumer` to simplify code without explicit pipeline state management (existing APIs are still maintained)
    - Separated epilogue code for non-TMA and TMA implementations
  - Tutorial for Blackwell GEMM: Basic Blackwell SM100 GEMM
    - Baseline Blackwell GEMM achieves 84% SOL performance with MNK 8K
    - More examples are coming to demo optimizations: Baseline + X
  - Tutorial for Async Pipeline API
  - Reworked elementwise add notebook with more details and a thorough explanation of TV layout:
    - Updated implementation to handle general data types and multiple inputs
    - Updated explanation of TV layout in simpler language
    - Added visualization of TV layout with 3rd-party utils
  - Benchmark and autotune demonstration
- More examples of authoring peak-performance kernels:
- Blackwell SM100 mixed-input GEMM
- Blackwell SM100 persistent blockwise dense GEMM
- Blackwell SM100 persistent blockwise contiguous grouped dense GEMM
- Blackwell SM100 persistent blockwise masked grouped dense GEMM
- Blackwell SM100 fmha bwd
- Blackwell SM100 mla
- Hopper SM90 persistent dense GEMM with static scheduling
- Blackwell GeForce batched dense GEMM
- Ampere HSTU Attention
- API updates:
- Please refer to DSL API changelog for details
- Bug fixes and improvements
- Add mma_tiler_n=64 and mma_tiler_n=192 support in Blackwell SM100 persistent dense blockscaled GEMM with static scheduling.
- Fixed `TensorSSA.reduce` to support a static value as the initial value
- Updated docstrings for the following APIs to be more concise and easier to understand: `make_layout_tv`, `is_static`, `PipelineAsync`, `SmemAllocator`
- Fixed documentation for `pipeline`, `utils` and `cute.math`
- Added overlapping accumulator optimization for the block tile N = 256 case for better epilogue latency hiding in Blackwell SM100 persistent dense blockscaled GEMM with static scheduling
- Fixed TensorSSA.getitem indexing to match CuTe's indexing convention
- Fixed an issue with cutlass.max and cutlass.min
- Fixed an issue with mark_compact_shape_dynamic
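The `FastDivmodDivisor` mentioned above builds on the classic divide-by-invariant-integer trick: replace a runtime division by a fixed divisor with one multiply and one shift using a precomputed magic number. A minimal plain-Python sketch of the underlying technique (illustrative only; this is not the CuTe DSL API):

```python
class FastDivmod:
    """Sketch of the fast-divmod technique (Granlund-Montgomery style).

    For a fixed divisor d, precompute shift s with 2**s >= d and the magic
    multiplier m = ceil(2**(32 + s) / d); then for 0 <= n < 2**32,
    floor(n / d) == (n * m) >> (32 + s).
    """

    def __init__(self, divisor: int):
        assert 0 < divisor < 2**32
        self.divisor = divisor
        # Smallest shift s such that 2**s >= divisor
        self.shift = (divisor - 1).bit_length()
        # Magic multiplier: ceil(2**(32 + shift) / divisor)
        self.multiplier = -((-(1 << (32 + self.shift))) // divisor)

    def divmod(self, n: int) -> tuple[int, int]:
        # Valid for 0 <= n < 2**32: one multiply + shift replaces a division
        q = (n * self.multiplier) >> (32 + self.shift)
        return q, n - q * self.divisor
```

On GPUs this saves an expensive integer division per index computation; per the notes above, the DSL's divisor object layers Python operator overloads and Cute dialect integration on top of this idea.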
CUTLASS C++
- Further enhance Blackwell SM100 Attention kernels in example 77.
- Add softmax skip correction.
- Fix a shared memory allocation bug: maximum dynamic shared memory must be opted into explicitly once it exceeds 48KB.
- Fix a hang caused by a warp returning early.
- Add support through command-line argument lists for `batch`, `no_verif`, `cluster_shape` and `cluster_shape_fallback` in example 89.
- Add Ragged Contiguous Grouped GEMM kernel in example 92.
- This kernel uses a TMA 3D load for the weights matrix and the tensormap update method to load activations.
- Add 256x128 tile size support for Hopper SM90 deepgemm in example 67.
- Performance is optimized to align with Deepseek implementation.
- Simplification of the API for MoE GEMMs.
  - Instead of requiring users to call several CuTe utilities to set up strides, the `moe_stride_utils` API is introduced to help set up strides in the kernel.
  - Instead of requiring users to set vectors like `problem_shapes_device` and `problem_shapes_host`, a new problem shape struct called `MoEProblemShape` is introduced, which takes max_m, max_n, max_k and a counts vector as input and deduces problem shapes internally whenever required.
- Enable GEMM_K = 0 in grouped gemm.
- Optimize grouped GEMM kernels by enabling async TMA descriptor update.
- Support Blackwell SM100 convolution stream-K kernel.
- Unit tests: fprop_streamK, dgrad_streamK, wgrad_streamK.
- Add Blackwell SM100 sparse gemm compressor unit tests.
- Unit tests: compressor_fp16.
- Add sub-bytes and runtime data type support in compressor unit test testbed.
- Add profiler support for:
- Blackwell SM100 and SM120 blockscaled sparse kernels.
- New MoE grouped gemm API.
- Blackwell SM100 cpasync kernel.
- Fix some kernel issues:
- Fix a race check issue of Blackwell SM103 kernels by adding missing elect one for prefetch barrier initialization.
- Allow user to directly specify the number of stages for Hopper sm90 mixed input gemm.
- Remove warnings caused by cuda vector type alignment setting in CUDA 13.
- Remove problematic `cutlass::int8_t` and replace it with `int8_t`.
- Fix a few bugs in the distributed GEMM API and examples.
- Fix handling negative zero in sparse compressor.
- Add missing `wait_on_dependent_grids` for the PDL use case.
- Fix some profiler issues:
- Add some missing reference kernels.
- Support VoidC reference kernels.
- Add calculation of scale factors A and B in function `bytes_with_problem_shape` of the block scaled profiler.
- Fix an issue when the epilogue tile N is not divisible by the default subtile N.
- Various improvements and fixes from the community and CUTLASS team. Thanks to everyone who submitted PRs!
- Optimal code generation with CUDA toolkit version 13.0U1.
CUTLASS 4.2.1
CuTe DSL
- Bug fixes and improvements
- Fixed an issue when running DSL codes with cuda-python 13.0
- Fixed an issue when running inductor with DSL codes
- Fixed an issue with unexpected logging when running DSL codes in FlashInfer
- Fixed the issue reported in #2647
- Fixed an issue with conditional definition of variables outside of dynamic control flow
CUTLASS C++
- Bypass EVT for nosmem blockwise kernels on Blackwell.
- Rename cutlass/python/cutlass directory to cutlass/python/cutlass_cppgen.
CUTLASS 4.2.0
CuTe DSL
- More Python versions are now supported for both x86-64 and aarch64, including
- Python 3.10, 3.11, 3.12, and 3.13
- Added new example and updated notebook to get started with CuTe DSL
- Call kernels with dlpack bypassed
- Updates on TensorSSA demonstration
- Added a section introducing broadcasting
- API updates
- Please refer to DSL API changelog for details
- Bug fixes and improvements
- Fixed `cute.print_tensor` for coordinate tensors
- Fixed `cute.print` for tuples of layouts
- Fixed frozen objects not being properly updated after full assignment in dynamic control flow
- Fixed a compilation failure when assigning tuple/list elements in dynamic control flow
- Improved error message when CUDA context is not initialized
- Improved docstrings of `congruent` and `weakly_congruent`
CUTLASS C++
- Support for Blackwell SM103 kernels for B300 GPUs.
- Collective mainloop codes: Blockscaled datatypes with support for dense GEMM mainloop
- New GEMM and epilogue dispatch policies for collectives, kernel layers, and builders.
- Kernel codes: Blockscaled datatypes with support for dense GEMM kernel.
- Set of examples that demonstrate the usage of the 3.x API for targeting Blackwell SM103 architecture:
- Set of unit tests that demonstrate the usage of Blackwell SM103 blockscaled GEMM
- Unit test files with prefix `sm103_` under GEMM device unit tests.
- Support for Blackwell SM121 kernels for DGX Spark GPUs.
- Share the major codes with Blackwell SM120 kernels.
- Add support for heuristics-based kernel filtering and autotuning using `nvidia-matmul-heuristics` to find the best kernels for a given scenario.
  - For details, please refer to the heuristics doc.
- Further enhance Blackwell SM100 Attention kernels in example 77.
- Add fused reduction kernel support for cutlass MLA.
- Add softmax skip correction.
- Support for GQA in FMHA backward kernel.
- Fix an issue where `get_unmasked_trip_count` may return a negative value.
- Fix an issue where mbarriers are initialized with a zero arrival count.
- Fix a corner case issue where the sequence length of q is not a multiple of tile_q.
- Remove tma padding for forward kernel inputs.
- Add Blackwell SM100 kernels for MoEs (focusing on Low-Latency inference performance): example 92. It uses TMA (for weights) and CPASYNC (for tokens) to load input matrices and allow only one problem dimension to vary across groups/experts, unlike general Grouped GEMMs. Note: further API simplifications and kernel improvements are upcoming. Any feedback on API is welcome.
- Further enhance blockwise and groupwise GEMMs on Hopper and Blackwell
- On Blackwell SM120, a blockwise gemm kernel is added: example 87.
- On Hopper, add K major scale factor support for SM90 blockwise kernels.
- On Hopper, relax the restriction that the k dimension of the problem size has to be the multiple of the k dimension of the tile size.
- On Hopper, grouped version supports the case when k = 0.
- Support for Blackwell SM100 fp4 gemv kernels.
- Kernel codes: Gemv kernel.
- Example codes: example 91
- Support for Blackwell SM100 legacy mixed input GEMM kernels.
- Collective mainloop codes: Mixed input mainloop.
- Kernel codes: Mixed input kernel.
- Example codes: example 86.
- Support for Blackwell SM100 cpasync kernel.
- Collective mainloop codes: cpasync mainloop.
- Kernel codes: cpasync kernel.
- Support Blackwell SM120 mixed input blockscaled grouped GEMM.
- Instantiating more Blackwell kernels in profiler.
  - Blackwell SM100 and SM103 kernels support `CUTLASS_LIBRARY_INSTANTIATION_LEVEL` to instantiate all possible combinations.
  - To use this feature, `CUTLASS_LIBRARY_KERNELS` must be non-empty. The profiler will combine `CUTLASS_LIBRARY_KERNELS` and `CUTLASS_LIBRARY_INSTANTIATION_LEVEL` to instantiate specific kernels.
  - For details, please check the Profiler Doc.
- Fix some profiler issues:
- Modify default cluster fallback values to be non-zero to avoid profiler failures when these values are not set on the command line.
- Fix some no output and timeout issues.
- Fix Pingpong Blockwise Hopper library generation.
- From CUDA 13.0, the Blackwell SM101 for Thor GPUs is renamed to SM110.
- For CUDA toolkit version < 13.0, SM101 is still used for Thor GPUs.
- For CUDA toolkit version >= 13.0, SM110 is used for Thor GPUs and SM101 is no longer valid.
- Rename legacy Python API package from `cutlass` to `cutlass_cppgen` and add Blackwell EVT support to the legacy Python interface.
  - Restructured the C++ Blackwell SM100 Collective Epilogue Builder to work with the Python interface's `EpilogueDescriptors`.
  - Added Blackwell SM100 EVT Emitter on the Python side and routed most emission through the Hopper SM90 Emitter.
  - Added some support for running SM100 kernels via the Python interface.
- CuTe changes:
- Fix inaccurate GridDim calculation in the CuTe tutorial.
- Add movmatrix support.
- Fix smallest MMA-N allowed for Blackwell fp8 and fp16 gemm kernels.
- Support fp16 accumulator for SM89 fp8 MMA.
- Shorten `nullspace` implementation.
- Isolate and comment on `cosize` hacks.
- Important documentation correction: `E<0,1> == 1@0@1`.
- Fix some kernel issues:
- Fix Hopper SM90 group gemm kernel to only use the commit group and wait group instead of also waiting on mbarriers.
- Fix a tiny bug when K is large for Blackwell SM103 fp4 grouped GEMM kernel.
- Add the following unit tests:
- Various improvements and fixes from the community and CUTLASS team. Thanks to everyone who submitted PRs!
- Optimal code generation with CUDA toolkit version 13.0U1.
CUTLASS 4.1.0
CuTe DSL
- Add aarch64 support; you can now pip install `nvidia-cutlass-dsl` on GB200 systems!
- More examples demonstrating how to use CuTe DSL to write peak-performance kernels
- API updates
- Please refer to FUNCTIONALITY.md for details
CUTLASS C++
- Further enhance Blackwell SM100 Attention kernels in example 77.
- Add variable sequence length support for FMHA Backward kernel.
- Add varlen test support to Backward runner.
- Codes support empty batch sequences.
- Replace `subbyte_iterator` with `cute::recast_ptr` when constructing logical iterators/arrays.
- CuTe changes:
- Rewrite ArithTuple and ScaledBasis for robustness and clarity.
- Remove buggy and kludgy `get_layoutA|B|C_MN` and friends from Atoms/TiledX.
- Factor out `print_latex` and friends and rewrite.
- Factor out `print_svg` and friends and rewrite.
- Support Blackwell SM100 SIMT packed fp32x2 kernels.
- Support residual add for implicit gemm kernels.
- Various fixes for CUTLASS C++ Python interface's EVT tracer:
- Add a verifier for SM90 to report invalid input.
- When adding an edge to the graph, if the edge already exists, add an identity compute node to avoid having multiple parallel edges.
- Register operations of tanh, sigmoid, exp, gelu to the python ast frontend.
- Replace the NotImplemented Error by packing all nodes into a single topological visitor node as a fallback.
- Fix profiler bugs in exhaustive perf search.
- Fix incorrect cluster shape output issue when doing exhaustive search.
- Fix a bug in profiler grouped GEMM for setting tile scheduler swizzles, cluster shapes, and raster orders.
- Fix some profiler issues.
- Complete the reference for Blackwell blockwise gemm kernels.
- Fix incorrect regex logic for L1 test.
CUTLASS 4.0.0
CuTe DSL
CuTe DSL is a Python DSL centered around CuTe's abstractions
- Enables authoring kernels in Python to reach peak performance on NVIDIA GPUs
- Core DSL implementation files
- DSL quick start
- DSL Overview
- Educational notebooks for getting started with CuTe DSL
CUTLASS C++
- Support Family Specific Architecture Features, which were introduced in CUDA 12.9
- Further improved Blockwise and Groupwise GEMMs on Hopper and Blackwell
- Enhance Blackwell SM100 Attention kernels in example 77
- Add Blackwell SM100 implicit GEMM conv fprop/dgrad/wgrad unit tests
- New Hopper SM90 FMHA example, similar in design to the existing Blackwell FMHA
- CuTe enhancements: CuTe C++ reduce op
- Other functional and performance enhancements