Releases: xlite-dev/LeetCUDA
v2.4.9 HGEMM WMMA Stage
What's Changed
- [HGEMM] Add HGEMM WMMA Double Buffers by @DefTruth in https://github.com/DefTruth/CUDA-Learn-Notes/pull/69
- [Embedding] Add embedding kernel f32/x4/x4_pack, f16/x8/x8_pack by @bear-zd in https://github.com/DefTruth/CUDA-Learn-Notes/pull/68
- [HGEMM] Add HGEMM mma4x2, warp2x4x2 kernel by @DefTruth in https://github.com/DefTruth/CUDA-Learn-Notes/pull/70
- [HGEMM] HGEMM WMMA with Reg double buffers by @DefTruth in https://github.com/DefTruth/CUDA-Learn-Notes/pull/71
- [HGEMM] Add HGEMM WMMA Stage 3/4 Kernel by @DefTruth in https://github.com/DefTruth/CUDA-Learn-Notes/pull/74
- [Softmax] Add online softmax f32x4 pack kernel by @bear-zd in https://github.com/DefTruth/CUDA-Learn-Notes/pull/73
- [HGEMM][Bugfix] fix HGEMM Stage cp.async error by @DefTruth in https://github.com/DefTruth/CUDA-Learn-Notes/pull/75
Full Changelog: DefTruth/CUDA-Learn-Notes@v2.4.8...v2.4.9
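The WMMA kernels above build HGEMM out of fixed-size tensor-core tile operations (16x16x16 for fp16 on recent architectures). As a rough CPU sketch of that structure, not the actual `nvcuda::wmma` API, each inner step is "accumulator tile += A tile * B tile", and the kernel is a grid of such tiles; function names here are illustrative only:

```c
#include <stddef.h>

#define TILE 16  /* wmma fp16 fragments are 16x16x16 on Ampere-class GPUs */

/* One mma_sync-like step: C_tile += A_tile * B_tile, all TILE x TILE.
 * On the GPU this is a single warp-wide tensor-core operation; here it
 * is spelled out so the tiling logic is visible. */
static void tile_mma(float C[TILE][TILE],
                     const float A[TILE][TILE],
                     const float B[TILE][TILE]) {
    for (int i = 0; i < TILE; ++i)
        for (int k = 0; k < TILE; ++k)
            for (int j = 0; j < TILE; ++j)
                C[i][j] += A[i][k] * B[k][j];
}

/* GEMM as a grid of tile MMAs; M, N, K must be multiples of TILE. */
void gemm_tiled(int M, int N, int K,
                const float *A, const float *B, float *C) {
    for (int bi = 0; bi < M; bi += TILE)
        for (int bj = 0; bj < N; bj += TILE) {
            float acc[TILE][TILE] = {{0}};
            for (int bk = 0; bk < K; bk += TILE) {
                float a[TILE][TILE], b[TILE][TILE];
                /* load_matrix_sync analogue: stage one tile of A and B */
                for (int i = 0; i < TILE; ++i)
                    for (int j = 0; j < TILE; ++j) {
                        a[i][j] = A[(size_t)(bi + i) * K + (bk + j)];
                        b[i][j] = B[(size_t)(bk + i) * N + (bj + j)];
                    }
                tile_mma(acc, a, b);
            }
            /* store_matrix_sync analogue: write the accumulator tile */
            for (int i = 0; i < TILE; ++i)
                for (int j = 0; j < TILE; ++j)
                    C[(size_t)(bi + i) * N + (bj + j)] = acc[i][j];
        }
}
```

The "double buffers" and "reg double buffers" variants in the PRs above keep two copies of these staged tiles so the next tile loads while the current one is multiplied.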
v2.4.8 HGEMM WMMA Part-1
What's Changed
- [GELU] Add f32/x4, f16/x2/x8/x8pack kernel. by @bear-zd in https://github.com/DefTruth/CUDA-Learn-Notes/pull/66
- [HGEMM] HGEMM Tensor Cores Support Part-1 by @DefTruth in https://github.com/DefTruth/CUDA-Learn-Notes/pull/67
Full Changelog: DefTruth/CUDA-Learn-Notes@v2.4.7...v2.4.8
v2.4.7 SGEMM Copy Async
What's Changed
- [SGEMM][Async] Add naive copy async SGEMM by @DefTruth in https://github.com/DefTruth/CUDA-Learn-Notes/pull/64
- [SGEMM][Async] Add K16 + Copy Async Kernel by @DefTruth in https://github.com/DefTruth/CUDA-Learn-Notes/pull/65
Full Changelog: DefTruth/CUDA-Learn-Notes@v2.4.6...v2.4.7
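The cp.async kernels above overlap global-to-shared copies with compute via double buffering: while tile t is consumed, tile t+1 is staged into the other buffer. A minimal CPU sketch of the same pipeline shape (the tile width `BK = 16` and the function name are illustrative; on the GPU the `memcpy` calls would be `cp.async` into shared memory followed by a commit/wait):

```c
#include <string.h>

#define BK 16  /* K-tile width, chosen here only for illustration */

/* Dot product with the K loop split into BK-wide tiles and double
 * buffered: while the "cur" buffer is consumed, the next tile is staged
 * into the "nxt" buffer. Requires K to be a multiple of BK. */
float dot_dbuf(const float *a, const float *b, int K) {
    float buf[2][2][BK];          /* [buffer][operand a/b][element] */
    int cur = 0;
    memcpy(buf[cur][0], a, BK * sizeof(float));  /* prologue: tile 0 */
    memcpy(buf[cur][1], b, BK * sizeof(float));
    float acc = 0.0f;
    for (int k0 = 0; k0 < K; k0 += BK) {
        int nxt = cur ^ 1;
        if (k0 + BK < K) {        /* stage next tile (cp.async on GPU) */
            memcpy(buf[nxt][0], a + k0 + BK, BK * sizeof(float));
            memcpy(buf[nxt][1], b + k0 + BK, BK * sizeof(float));
        }
        for (int k = 0; k < BK; ++k)     /* compute on current tile */
            acc += buf[cur][0][k] * buf[cur][1][k];
        cur = nxt;                /* swap buffers (commit + wait on GPU) */
    }
    return acc;
}
```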
v2.4.6 HGEMM Copy Async
What's Changed
- [Softmax] Add online softmax according to Nvidia Paper by @bear-zd in https://github.com/DefTruth/CUDA-Learn-Notes/pull/60
- [HGEMM][Async] support K16/32 pack+cp.async+dbuf by @DefTruth in https://github.com/DefTruth/CUDA-Learn-Notes/pull/62
- [Softmax][Bugfix] fixed softmax compile error by @DefTruth in https://github.com/DefTruth/CUDA-Learn-Notes/pull/63
New Contributors
- @bear-zd made their first contribution in https://github.com/DefTruth/CUDA-Learn-Notes/pull/60
Full Changelog: DefTruth/CUDA-Learn-Notes@v2.4.5...v2.4.6
v2.4.5 HGEMM Double Buffers
What's Changed
- [FlashAttention] Refactor FlashAttention PyTorch bindings by @DefTruth in https://github.com/DefTruth/CUDA-Learn-Notes/pull/55
- [SGEMM] test bank conflicts free with smem offset by @DefTruth in https://github.com/DefTruth/CUDA-Learn-Notes/pull/56
- [HGEMM] HGEMM kernel with double buffers by @DefTruth in https://github.com/DefTruth/CUDA-Learn-Notes/pull/57
- [Docs] Add docs for HGEMM/SGEMM double buffers by @DefTruth in https://github.com/DefTruth/CUDA-Learn-Notes/pull/58
- [HGEMM] Add PyTorch HGEMM profile by @DefTruth in https://github.com/DefTruth/CUDA-Learn-Notes/pull/59
Full Changelog: DefTruth/CUDA-Learn-Notes@v2.4.4...v2.4.5
v2.4.4 Pack HGEMM
What's Changed
- [SGEMM] Add naive sgemm kernel by @DefTruth in https://github.com/DefTruth/CUDA-Learn-Notes/pull/51
- [SGEMM] bank conflicts free & double buffers by @DefTruth in https://github.com/DefTruth/CUDA-Learn-Notes/pull/52
- [Misc][Benchmark] optimize benchmarks by @DefTruth in https://github.com/DefTruth/CUDA-Learn-Notes/pull/53
- [HGEMM] Pack sliced_k f16x4/f16x8 HGEMM by @DefTruth in https://github.com/DefTruth/CUDA-Learn-Notes/pull/54
Full Changelog: DefTruth/CUDA-Learn-Notes@v2.4.3...v2.4.4
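The "pack" in these HGEMM kernels means issuing one 128-bit load (LDG.128) for 8 halves instead of 8 scalar loads. A CPU analogue of that access pattern, moving 16 bytes at a time through a `uint4`-like struct (types and names here are illustrative; halves are carried as raw `uint16_t` since C has no portable fp16):

```c
#include <stdint.h>
#include <string.h>

typedef struct { uint32_t x, y, z, w; } pack128;  /* 16-byte vector */

/* Copy halves 8 at a time: each loop trip is one wide load plus one
 * wide store, the CPU analogue of f16x8 pack LD/ST; scalar tail after. */
void copy_f16x8_pack(const uint16_t *src, uint16_t *dst, size_t n_half) {
    size_t i = 0;
    for (; i + 8 <= n_half; i += 8) {
        pack128 p;
        memcpy(&p, src + i, sizeof p);    /* one 128-bit load  */
        memcpy(dst + i, &p, sizeof p);    /* one 128-bit store */
    }
    for (; i < n_half; ++i)               /* remainder, one half at a time */
        dst[i] = src[i];
}
```

Fewer, wider transactions are what make the packed kernels faster for the same arithmetic.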
v2.4.3 Pack Softmax
What's Changed
- [LayerNorm][FP16] support fp16x8_pack_f32 kernel by @DefTruth in https://github.com/DefTruth/CUDA-Learn-Notes/pull/48
- [Softmax][FP16] Pack f16x8 softmax kernel by @DefTruth in https://github.com/DefTruth/CUDA-Learn-Notes/pull/49
Full Changelog: DefTruth/CUDA-Learn-Notes@v2.4.2...v2.4.3
v2.4.2 Pack RMSNorm
What's Changed
- [RMSNorm][FP16] Pack f16x8 rmsnorm by @DefTruth in https://github.com/DefTruth/CUDA-Learn-Notes/pull/47
Full Changelog: DefTruth/CUDA-Learn-Notes@v2.4.1...v2.4.2
v2.4.1 Pack LayerNorm
What's Changed
- [Nsight] Add nsys/ncu usage, ptx/sass by @DefTruth in https://github.com/DefTruth/CUDA-Learn-Notes/pull/44
- [DotProd][FP16] support f16x8_pack kernel by @DefTruth in https://github.com/DefTruth/CUDA-Learn-Notes/pull/45
- [LayerNorm][FP16] Add pack support for f16x8 LD/ST by @DefTruth in https://github.com/DefTruth/CUDA-Learn-Notes/pull/46
Full Changelog: DefTruth/CUDA-Learn-Notes@v2.4...v2.4.1
v2.4 Pack Reduce LDST
What's Changed
- [Reduce][Kernel] Pack f16/bf16x8 & fp8/i8x16 LD/ST by @DefTruth in https://github.com/DefTruth/CUDA-Learn-Notes/pull/43
Full Changelog: DefTruth/CUDA-Learn-Notes@v2.3.1...v2.4
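The packed reduce LD/ST in PR #43 consumes several elements per memory transaction (float4-style for fp32, x8 for half, x16 for fp8/int8). A CPU sketch of the float-x4 shape, four elements per loop trip with independent partial sums and a scalar tail (the GPU kernel pairs this with warp-shuffle tree reduction; the function name is illustrative):

```c
/* Vectorized-load reduction: each trip reads a "float4"-worth of data
 * into four independent accumulators, then the tail is handled scalar. */
float reduce_sum_x4(const float *x, int n) {
    float s0 = 0.0f, s1 = 0.0f, s2 = 0.0f, s3 = 0.0f;
    int i = 0;
    for (; i + 4 <= n; i += 4) {          /* one 128-bit chunk per trip */
        s0 += x[i];
        s1 += x[i + 1];
        s2 += x[i + 2];
        s3 += x[i + 3];
    }
    float s = s0 + s1 + s2 + s3;
    for (; i < n; ++i)                    /* leftover elements */
        s += x[i];
    return s;
}
```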