Releases: xlite-dev/LeetCUDA
v2.6 Refactor 7/N
What's Changed
- [HGEMM] Update NVIDIA L20/4090 Perf plots by @DefTruth in https://github.com/DefTruth/CUDA-Learn-Notes/pull/126
- [Blog] Illustrated DeepSpeed-Ulysses & Megatron-LM TP/SP by @DefTruth in https://github.com/DefTruth/CUDA-Learn-Notes/pull/127
- [README] Add contents lists by @DefTruth in https://github.com/DefTruth/CUDA-Learn-Notes/pull/128
- [README] Update README by @DefTruth in https://github.com/DefTruth/CUDA-Learn-Notes/pull/129
- [README] Update README.md by @DefTruth in https://github.com/DefTruth/CUDA-Learn-Notes/pull/130
- Bump up to v2.6 by @DefTruth in https://github.com/DefTruth/CUDA-Learn-Notes/pull/131
Full Changelog: DefTruth/CUDA-Learn-Notes@v2.5...v2.6
v2.5
What's Changed
- [HGEMM] Update HGEMM README.md by @DefTruth in https://github.com/DefTruth/CUDA-Learn-Notes/pull/120
- [HGEMM] Add plot tflops function by @DefTruth in https://github.com/DefTruth/CUDA-Learn-Notes/pull/121
- [HGEMM] Add NVIDIA RTX 3090 Laptop perf plot by @DefTruth in https://github.com/DefTruth/CUDA-Learn-Notes/pull/122
- [PERF] Update HGEMM benchmark scripts by @DefTruth in https://github.com/DefTruth/CUDA-Learn-Notes/pull/123
- [HGEMM] Add HGEMM L20/4090 benchmark figures by @DefTruth in https://github.com/DefTruth/CUDA-Learn-Notes/pull/124
- Bump up to v2.5 by @DefTruth in https://github.com/DefTruth/CUDA-Learn-Notes/pull/125
Full Changelog: DefTruth/CUDA-Learn-Notes@v2.4.18...v2.5
v2.4.18
What's Changed
- Update README.md by @DefTruth in https://github.com/DefTruth/CUDA-Learn-Notes/pull/115
- [HGEMM] Update HGEMM Supported Matrix by @DefTruth in https://github.com/DefTruth/CUDA-Learn-Notes/pull/116
- [HGEMM] Update HGEMM/SGEMM Supported Matrix by @DefTruth in https://github.com/DefTruth/CUDA-Learn-Notes/pull/117
- [README] Update HGEMM/SGEMM Supported Matrix by @DefTruth in https://github.com/DefTruth/CUDA-Learn-Notes/pull/118
- [HGEMM] Add NVIDIA RTX 4090 benchmark by @DefTruth in https://github.com/DefTruth/CUDA-Learn-Notes/pull/119
Full Changelog: DefTruth/CUDA-Learn-Notes@v2.4.17...v2.4.18
v2.4.17
What's Changed
- [NMS] Add nms f32 cuda kernel. by @bear-zd in https://github.com/DefTruth/CUDA-Learn-Notes/pull/102
- [HGEMM] Add some note to collective store by @DefTruth in https://github.com/DefTruth/CUDA-Learn-Notes/pull/103
- [HGEMM] Add HGEMM MMA Col Major Kernel by @DefTruth in https://github.com/DefTruth/CUDA-Learn-Notes/pull/104
- [HGEMM] Update HGEMM benchmark scripts by @DefTruth in https://github.com/DefTruth/CUDA-Learn-Notes/pull/105
- [HGEMM] Add Warp Swizzle as template param by @DefTruth in https://github.com/DefTruth/CUDA-Learn-Notes/pull/106
- [HGEMM] add -Xptxas -v compile flag by @DefTruth in https://github.com/DefTruth/CUDA-Learn-Notes/pull/107
- [HGEMM] Try reduce registers usage by @DefTruth in https://github.com/DefTruth/CUDA-Learn-Notes/pull/108
- [HGEMM] Update HGEMM MMA/WMMA Usage by @DefTruth in https://github.com/DefTruth/CUDA-Learn-Notes/pull/109
- [HGEMM][Docs] Add HGEMM Supported Matrix by @DefTruth in https://github.com/DefTruth/CUDA-Learn-Notes/pull/110
- [HGEMM] Add M=N=K option for benchmark by @DefTruth in https://github.com/DefTruth/CUDA-Learn-Notes/pull/111
- [HGEMM] Update HGEMM/SGEMM Supported Matrix by @DefTruth in https://github.com/DefTruth/CUDA-Learn-Notes/pull/112
- [README] Update HGEMM/SGEMM Supported matrix by @DefTruth in https://github.com/DefTruth/CUDA-Learn-Notes/pull/113
- [Docs] Update HGEMM/SGEMM Supported Matrix by @DefTruth in https://github.com/DefTruth/CUDA-Learn-Notes/pull/114
Full Changelog: DefTruth/CUDA-Learn-Notes@v2.4.16...v2.4.17
HGEMM Warp Swizzle/Reg Buffers
What's Changed
- [HGEMM] HGEMM MMA with Reg Double Buffers by @DefTruth in https://github.com/DefTruth/CUDA-Learn-Notes/pull/99
- [HGEMM] ldmatrix.x4.trans with reg double buffers by @DefTruth in https://github.com/DefTruth/CUDA-Learn-Notes/pull/100
- [HGEMM] collective store via warp shfl & reg reuse by @DefTruth in https://github.com/DefTruth/CUDA-Learn-Notes/pull/101
Full Changelog: DefTruth/CUDA-Learn-Notes@v2.4.15...v2.4.16
HGEMM Up to 115 TFLOPS:L20
What's Changed
- [HGEMM] Add MMA 16816 swizzle, Up to 115 TFLOPS by @DefTruth in https://github.com/DefTruth/CUDA-Learn-Notes/pull/98
Full Changelog: DefTruth/CUDA-Learn-Notes@v2.4.13...v2.4.15
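For readers unsure how a figure such as "115 TFLOPS" is usually derived, here is a minimal sketch of the standard GEMM throughput calculation. It assumes the common convention of 2*M*N*K floating-point operations per matrix multiply. The function name `gemm_tflops` and its parameters are hypothetical and are not taken from the repository's benchmark scripts.

```python
def gemm_tflops(m: int, n: int, k: int, seconds: float) -> float:
    """Achieved TFLOPS for one (M x K) @ (K x N) GEMM that ran in `seconds`.

    Hypothetical helper for illustration; not the repository's own code.
    """
    if seconds <= 0:
        raise ValueError("seconds must be positive")
    # Each of the M*N outputs needs K multiplies and K adds: 2*M*N*K FLOPs.
    flops = 2.0 * m * n * k
    return flops / seconds / 1e12


if __name__ == "__main__":
    # Made-up timing, used only to show the call; not a measured result.
    print(f"{gemm_tflops(4096, 4096, 4096, 1.25e-3):.1f} TFLOPS")
```

In practice `seconds` would be the average kernel time over many timed iterations after a warm-up run, since a single launch is too noisy to report.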
HGEMM Up to 113 TFLOPS:L20
What's Changed
- [Mat][Trans] Add f32/f32x4 row/col first kernel by @bear-zd in https://github.com/DefTruth/CUDA-Learn-Notes/pull/89
- [Docs][Contribute] Add How to contribute Notes by @DefTruth in https://github.com/DefTruth/CUDA-Learn-Notes/pull/90
- [HGEMM] optimize SMEM padding, up to 113 TFLOPS by @DefTruth in https://github.com/DefTruth/CUDA-Learn-Notes/pull/92
- [Mat][Trans] Add f32x4_shared/bcf row/col first kernel. by @bear-zd in https://github.com/DefTruth/CUDA-Learn-Notes/pull/91
- [Docs] rename mat_transpose -> mat-transpose by @DefTruth in https://github.com/DefTruth/CUDA-Learn-Notes/pull/93
- [HGEMM] Add GeForce RTX 3080 Laptop benchmark by @DefTruth in https://github.com/DefTruth/CUDA-Learn-Notes/pull/94
- [HGEMM] update HGEMM benchmark option by @DefTruth in https://github.com/DefTruth/CUDA-Learn-Notes/pull/95
- [HGEMM] Refactor HGEMM WMMA 161616 kernels by @DefTruth in https://github.com/DefTruth/CUDA-Learn-Notes/pull/96
- [HGEMM] Update HGEMM WMMA Benchmark by @DefTruth in https://github.com/DefTruth/CUDA-Learn-Notes/pull/97
Full Changelog: DefTruth/CUDA-Learn-Notes@v2.4.12...v2.4.13
v2.4.12 SGEMM TF32 Swizzle
What's Changed
- [SGEMM] SGEMM TF32 Thread Block Swizzle by @DefTruth in https://github.com/DefTruth/CUDA-Learn-Notes/pull/84
- [HGEMM] mma4x4_warp4x4_stages with swizzle by @DefTruth in https://github.com/DefTruth/CUDA-Learn-Notes/pull/86
- [SWISH] support Swish F32/F16 kernel by @wangzijian1010 in https://github.com/DefTruth/CUDA-Learn-Notes/pull/85
- [SGEMM] Update SGEMM TF32 Benchmark by @DefTruth in https://github.com/DefTruth/CUDA-Learn-Notes/pull/87
New Contributors
- @wangzijian1010 made their first contribution in https://github.com/DefTruth/CUDA-Learn-Notes/pull/85
Full Changelog: DefTruth/CUDA-Learn-Notes@v2.4.11...v2.4.12
v2.4.11 HGEMM Block Swizzle
What's Changed
- [Docs] Update README.md by @DefTruth in https://github.com/DefTruth/CUDA-Learn-Notes/pull/81
- [HGEMM] HGEMM WMMA Thread Block Swizzle by @DefTruth in https://github.com/DefTruth/CUDA-Learn-Notes/pull/82
- [HGEMM] make thread block swizzle stride as N/4 by @DefTruth in https://github.com/DefTruth/CUDA-Learn-Notes/pull/83
Full Changelog: DefTruth/CUDA-Learn-Notes@v2.4.10...v2.4.11
v2.4.10 SGEMM TF32 Stage 2/3
What's Changed
- [HGEMM] HGEMM WMMA Stage mma4x2+warp4x4 by @DefTruth in https://github.com/DefTruth/CUDA-Learn-Notes/pull/76
- [SGEMM] Add SGEMM WMMA TF32 Stage2/3 by @DefTruth in https://github.com/DefTruth/CUDA-Learn-Notes/pull/77
- [SGEMM] Add cuBLAS SGEMM F32/TF32 baseline by @DefTruth in https://github.com/DefTruth/CUDA-Learn-Notes/pull/78
- [SGEMM] Add Kernel cudaFuncSetAttribute hint by @DefTruth in https://github.com/DefTruth/CUDA-Learn-Notes/pull/79
- [RoPE] Add minimal RoPE f32/f32x4 pack impl by @bear-zd in https://github.com/DefTruth/CUDA-Learn-Notes/pull/80
Full Changelog: DefTruth/CUDA-Learn-Notes@v2.4.9...v2.4.10