- Fixed an ordering issue in `make_smem_layout_a` in `utils/hopper_helpers.py`
### CUTLASS C++
- * Work around a driver bug which will cause occasionally errors when executing kernels.
+ * Work around a driver TMA descriptor related bug which will cause occasional errors on Blackwell when the tensor's backing memory allocation is less than 128KB and it is not a dense non-overlapping tensor.
README.md (3 additions, 2 deletions):

@@ -1,9 +1,9 @@
# Overview
- # CUTLASS 4.4.0
+ # CUTLASS 4.4.1
- _CUTLASS 4.4.0 - Feb 2026_
+ _CUTLASS 4.4.1 - Feb 2026_
CUTLASS is a collection of abstractions for implementing high-performance matrix-matrix multiplication (GEMM)
and related computations at all levels and scales within CUDA. It incorporates strategies for
@@ -84,6 +84,7 @@ To get started quickly - please refer :
- Fixed `cute.printf` with f-string
- Fixed an indexing issue of scalar tensor
- Fixed small K reference check error for cta_tile_n = 256 case with overlapping accumulator optimization in [Blackwell SM100 persistent dense blockscaled GEMM with static scheduling](https://github.com/NVIDIA/cutlass/tree/main/examples/python/CuTeDSL/blackwell/dense_blockscaled_gemm_persistent.py).
+ - Fixed a segfault issue with tvm-ffi on aarch64
* API changes
- Deprecate `get_num_tmem_alloc_cols` from `blackwell_helpers.py`. Use the one from `tmem_allocator.py` instead.
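Deprecations like the one above commonly follow a standard Python pattern: the old name is kept as a thin alias that emits a `DeprecationWarning` and forwards to the new location. The sketch below illustrates that generic pattern only; the function body and the alias name are illustrative stand-ins, not CUTLASS's actual implementation.

```python
import warnings

# Illustrative stand-in for the helper now living in tmem_allocator.py
# (the real CUTLASS function's signature and policy may differ).
def get_num_tmem_alloc_cols(requested: int) -> int:
    # Round the request up to a power of two, a typical allocator policy.
    cols = 1
    while cols < requested:
        cols *= 2
    return cols

# Deprecated alias kept for backward compatibility: warn, then forward.
def deprecated_get_num_tmem_alloc_cols(requested: int) -> int:
    warnings.warn(
        "get_num_tmem_alloc_cols from blackwell_helpers.py is deprecated; "
        "use the one from tmem_allocator.py instead.",
        DeprecationWarning,
        stacklevel=2,
    )
    return get_num_tmem_alloc_cols(requested)
```

Callers that migrate to the new location see no warning; callers still on the old name get a one-line nudge without breaking.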