Commit 15ed377
TinySemVer
Release: v0.8.0 [skip ci]
### Minor
- Add: Warp-Group Binary MMA (d6daf3a)
- Add: Larger `m64n256k8` WGMMA variant (3e3530e)
- Add: Warp-Group Async kernels (6cc7e34)
- Add: `f64` MMA PTX variant (ae450e5)
- Add: CuTe draft (fdea727)
- Add: CUTLASS placeholders (b1ab93d)
- Add: Hopper `sm90a` PTX kernels (4bcf74a)
### Patch
- Improve: `CUresult` error handling (d74d430)
- Improve: Logging CUDA errors (953a696)
- Fix: Synchronize TCs (494ba52)
- Improve: Impossible `%tid` condition against NVCC (8a9c9c5)
- Make: Temporarily block CUTLASS (df1b39c)
- Improve: Cleaner PTX code (71dea0c)
- Improve: Avoid NVCC-specific features (3d65c7f)
- Fix: Re-creating a CUDA stream (e831650)
- Make: Compile in parallel by default (8e671c6)
- Make: Separate host-only code (f751fbf)
- Docs: Counter-intuitive PTX facts (822fa2f)
- Docs: H200 vs MI 300X vs GB200 specs (cc36bcd)
- Make: CUTLASS dependency (f272c40)
- Fix: Synchronize cuBLAS for profiling (4077f26)
- Docs: Blackwell tensor cores (ec35b35)
- Fix: Missing `_Float16` in NVCC, use `half` (71cadca)
- Improve: Same size range for GEMM (d914fce)
- Fix: Different output size for `cublasGemmEx` (304c880)

1 parent 21cf516, commit 15ed377
2 files changed: +2 -2 lines changed (one line replaced at line 11 of the first file, one line replaced at line 1 of the second).