v2.4.9 HGEMM WMMA Stage
What's Changed
- [HGEMM] Add HGEMM WMMA Double Buffers by @DefTruth in https://github.com/DefTruth/CUDA-Learn-Notes/pull/69
- [Embedding] Add embedding kernel f32/x4/x4_pack, f16/x8/x8_pack by @bear-zd in https://github.com/DefTruth/CUDA-Learn-Notes/pull/68
- [HGEMM] Add HGEMM mma4x2, warp2x4x2 kernel by @DefTruth in https://github.com/DefTruth/CUDA-Learn-Notes/pull/70
- [HGEMM] HGEMM WMMA with Reg double buffers by @DefTruth in https://github.com/DefTruth/CUDA-Learn-Notes/pull/71
- [HGEMM] Add HGEMM WMMA Stage 3/4 Kernel by @DefTruth in https://github.com/DefTruth/CUDA-Learn-Notes/pull/74
- [Softmax] Add online softmax f32x4 pack kernel by @bear-zd in https://github.com/DefTruth/CUDA-Learn-Notes/pull/73
- [HGEMM][Bugfix] Fix HGEMM Stage cp.async error by @DefTruth in https://github.com/DefTruth/CUDA-Learn-Notes/pull/75
Full Changelog: DefTruth/CUDA-Learn-Notes@v2.4.8...v2.4.9