 
 <div id="contents"></div>
 
-📚 **Modern CUDA Learn Notes with PyTorch** for Beginners: It includes **Tensor/CUDA Cores, TF32/F16/BF16/F8**, [📖150+ CUDA Kernels🔥🔥](#cuda-kernel) with PyTorch bindings, [📖30+ LLM/VLM🔥](#my-blogs-part-1), [📖40+ CV/C++...🔥](#my-blogs-part-2), [📖50+ CUDA/CuTe...🔥](#other-blogs) Blogs and [📖HGEMM/SGEMM🔥🔥](#hgemm-sgemm) which has been fully optimized, check [📖HGEMM/SGEMM Supported Matrix👇](#hgemm-sgemm) for more details. Welcome to 🌟👆🏻star this repo to support me, many thanks ~ 🎉🎉
+📚 **Modern CUDA Learn Notes with PyTorch** for beginners: it covers **Tensor/CUDA Cores, TF32/F16/BF16/F8**, [📖150+ CUDA Kernels🔥🔥](#cuda-kernel) with PyTorch bindings, [📖30+ LLM/VLM🔥](#my-blogs-part-1), [📖40+ CV/C++...🔥](#my-blogs-part-2) and [📖50+ CUDA/CuTe...🔥](#other-blogs) blogs, plus the [📖toy-hgemm library🔥🔥](./hgemm), which can reach the performance of **cuBLAS**; see the [📖HGEMM Supported Matrix👇](#hgemm-sgemm) for details. Welcome to 🌟 star this repo to support me, many thanks! 🎉🎉
 
 <div id="hgemm-sgemm"></div>
 
 <img src='https://github.com/user-attachments/assets/05ef4f5e-d999-48ea-b58e-782cffb24e85' height="225px" width="403px">
 </div>
 
-Currently, on NVIDIA L20, RTX 4090 and RTX 3090 Laptop, compared with cuBLAS's default Tensor Cores math algorithm `CUBLAS_GEMM_DEFAULT_TENSOR_OP`, the `HGEMM (WMMA/MMA)` implemented in this repo (`blue`🔵) can achieve `95%~99%` of its (`orange`🟠) performance. Please check [hgemm benchmark](./hgemm) for more details.
+Currently, on NVIDIA L20, RTX 4090 and RTX 3090 Laptop, the `HGEMM (WMMA/MMA)` kernels implemented in this repo (`blue`🔵) achieve `95%~99%` of the performance of cuBLAS's default Tensor Cores math algorithm `CUBLAS_GEMM_DEFAULT_TENSOR_OP` (`orange`🟠). Please check the [toy-hgemm library🔥🔥](./hgemm) for more details.
 
 |CUDA Cores|Sliced K (Loop over K)|Tile Block|Tile Thread|
 |:---:|:---:|:---:|:---:|
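The `95%~99%` comparison above is in terms of achieved throughput. GEMM benchmarks like the one linked conventionally convert a measured runtime into TFLOPS as `2*M*N*K / time`, since an `(M,K) x (K,N)` GEMM performs one multiply and one add per inner-product term. A minimal sketch of that arithmetic (the function name is illustrative, not part of the repo):

```python
def gemm_tflops(M: int, N: int, K: int, ms: float) -> float:
    """Achieved TFLOPS for an (M,K) x (K,N) GEMM measured at `ms` milliseconds."""
    flops = 2.0 * M * N * K          # one FMA = 2 FLOPs per inner-product term
    return flops / (ms * 1e-3) / 1e12  # milliseconds -> seconds, FLOPS -> TFLOPS
```

For example, `gemm_tflops(4096, 4096, 4096, 1.0)` is about `137.4` TFLOPS, which is how a plot like the one above turns kernel timings into a throughput curve.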
@@ -202,26 +202,27 @@ Currently, on NVIDIA L20, RTX 4090 and RTX 3090 Laptop, compared with cuBLAS's d
 | ✔️ [sgemm_t_8x8_sliced_k16...async](./sgemm/sgemm_async.cu)|f32|f32|[link](./sgemm/)|⭐️⭐️⭐️|
 | ✔️ [sgemm_wmma_m16n16k8...stages*](./sgemm/sgemm_wmma_tf32_stage.cu)|tf32|f32|[link](./sgemm/)|⭐️⭐️⭐️|
 | ✔️ [sgemm_wmma_m16n16k8...swizzle*](./sgemm/sgemm_wmma_tf32_stage.cu)|tf32|f32|[link](./sgemm/)|⭐️⭐️⭐️|
-| ✔️ [hgemm_naive_f16](./hgemm/hgemm.cu)|f16|f16|[link](./hgemm/)|⭐️⭐️|
-| ✔️ [hgemm_sliced_k_f16](./hgemm/hgemm.cu)|f16|f16|[link](./hgemm/)|⭐️⭐️⭐️|
+| ✔️ [hgemm_naive_f16](./hgemm/naive/hgemm.cu)|f16|f16|[link](./hgemm/)|⭐️⭐️|
+| ✔️ [hgemm_sliced_k_f16](./hgemm/naive/hgemm.cu)|f16|f16|[link](./hgemm/)|⭐️⭐️⭐️|
 | ✔️ [hgemm_t_8x8_sliced_k_f16x4](./hgemm/hgemm.cu)|f16|f16|[link](./hgemm/)|⭐️⭐️⭐️|
-| ✔️ [hgemm_t_8x8_sliced_k_f16x4_pack](./hgemm/hgemm.cu)|f16|f16|[link](./hgemm/)|⭐️⭐️⭐️|
-| ✔️ [hgemm_t_8x8_sliced_k_f16x8_pack](./hgemm/hgemm.cu)|f16|f16|[link](./hgemm/)|⭐️⭐️⭐️|
-| ✔️ [hgemm_t_8x8_sliced_k...dbuf](./hgemm/hgemm.cu)|f16|f16|[link](./hgemm/)|⭐️⭐️⭐️|
-| ✔️ [hgemm_t_8/16x8...k16/32...dbuf](./hgemm/hgemm_async.cu)|f16|f16|[link](./hgemm/)|⭐️⭐️⭐️|
-| ✔️ [hgemm_t_8/16x8...k16/32...async](./hgemm/hgemm_async.cu)|f16|f16|[link](./hgemm/)|⭐️⭐️⭐️|
-| ✔️ [hgemm_wmma_m16n16k16...naive*](./hgemm/hgemm_wmma.cu)|f16|f16|[link](./hgemm/)|⭐️⭐️⭐️|
-| ✔️ [hgemm_wmma_m16n16k16...mma4x2*](./hgemm/hgemm_wmma.cu)|f16|f16|[link](./hgemm/)|⭐️⭐️⭐️|
-| ✔️ [hgemm_wmma_m16n16k16...mma4x4*](./hgemm/hgemm_wmma.cu)|f16|f16|[link](./hgemm/)|⭐️⭐️⭐️|
-| ✔️ [hgemm_wmma_m16n16k16...dbuf*](./hgemm/hgemm_wmma.cu)|f16|f16|[link](./hgemm/)|⭐️⭐️⭐️|
-| ✔️ [hgemm_wmma_m32n8k16....dbuf*](./hgemm/hgemm_wmma.cu)|f16|f16|[link](./hgemm/)|⭐️⭐️⭐️|
-| ✔️ [hgemm_wmma_m16n16k16...stages*](./hgemm/hgemm_wmma_stage.cu)|f16|f16|[link](./hgemm/)|⭐️⭐️⭐️|
-| ✔️ [hgemm_wmma_m16n16k16...swizzle*](./hgemm/hgemm_wmma_stage.cu)|f16|f16|[link](./hgemm/)|⭐️⭐️⭐️|
-| ✔️ [hgemm_mma_m16n8k16...naive*](./hgemm/hgemm_mma.cu)|f16|f16|[link](./hgemm/)|⭐️⭐️⭐️|
-| ✔️ [hgemm_mma_m16n8k16...mma2x4*](./hgemm/hgemm_mma.cu)|f16|f16|[link](./hgemm/)|⭐️⭐️⭐️|
-| ✔️ [hgemm_mma_m16n8k16...stages*](./hgemm/hgemm_mma_stage.cu)|f16|f16|[link](./hgemm/)|⭐️⭐️⭐️|
-| ✔️ [hgemm_mma_m16n8k16...swizzle*](./hgemm/hgemm_mma_stage.cu)|f16|f16|[link](./hgemm/)|⭐️⭐️⭐️|
-| ✔️ [hgemm_mma_stages{swizzle}...cute*](./hgemm/hgemm_mma_stage_tn_cute.cu)|f16|f16|[link](./hgemm/)|⭐️⭐️⭐️|
+| ✔️ [hgemm_t_8x8_sliced_k_f16x4_pack](./hgemm/naive/hgemm.cu)|f16|f16|[link](./hgemm/)|⭐️⭐️⭐️|
+| ✔️ [hgemm_t_8x8_sliced_k_f16x8_pack](./hgemm/naive/hgemm.cu)|f16|f16|[link](./hgemm/)|⭐️⭐️⭐️|
+| ✔️ [hgemm_t_8x8_sliced_k...dbuf](./hgemm/naive/hgemm.cu)|f16|f16|[link](./hgemm/)|⭐️⭐️⭐️|
+| ✔️ [hgemm_t_8/16x8...k16/32...dbuf](./hgemm/naive/hgemm_async.cu)|f16|f16|[link](./hgemm/)|⭐️⭐️⭐️|
+| ✔️ [hgemm_t_8/16x8...k16/32...async](./hgemm/naive/hgemm_async.cu)|f16|f16|[link](./hgemm/)|⭐️⭐️⭐️|
+| ✔️ [hgemm_wmma_m16n16k16...naive*](./hgemm/wmma/hgemm_wmma.cu)|f16|f16|[link](./hgemm/)|⭐️⭐️⭐️|
+| ✔️ [hgemm_wmma_m16n16k16...mma4x2*](./hgemm/wmma/hgemm_wmma.cu)|f16|f16|[link](./hgemm/)|⭐️⭐️⭐️|
+| ✔️ [hgemm_wmma_m16n16k16...mma4x4*](./hgemm/wmma/hgemm_wmma.cu)|f16|f16|[link](./hgemm/)|⭐️⭐️⭐️|
+| ✔️ [hgemm_wmma_m16n16k16...dbuf*](./hgemm/wmma/hgemm_wmma.cu)|f16|f16|[link](./hgemm/)|⭐️⭐️⭐️|
+| ✔️ [hgemm_wmma_m32n8k16....dbuf*](./hgemm/wmma/hgemm_wmma.cu)|f16|f16|[link](./hgemm/)|⭐️⭐️⭐️|
+| ✔️ [hgemm_wmma_m16n16k16...stages*](./hgemm/wmma/hgemm_wmma_stage.cu)|f16|f16|[link](./hgemm/)|⭐️⭐️⭐️|
+| ✔️ [hgemm_wmma_m16n16k16...swizzle*](./hgemm/wmma/hgemm_wmma_stage.cu)|f16|f16|[link](./hgemm/)|⭐️⭐️⭐️|
+| ✔️ [hgemm_mma_m16n8k16...naive*](./hgemm/mma/hgemm_mma.cu)|f16|f16|[link](./hgemm/)|⭐️⭐️⭐️|
+| ✔️ [hgemm_mma_m16n8k16...mma2x4*](./hgemm/mma/hgemm_mma.cu)|f16|f16|[link](./hgemm/)|⭐️⭐️⭐️|
+| ✔️ [hgemm_mma_m16n8k16...stages*](./hgemm/mma/hgemm_mma_stage.cu)|f16|f16|[link](./hgemm/)|⭐️⭐️⭐️|
+| ✔️ [hgemm_mma_m16n8k16...swizzle*](./hgemm/mma/hgemm_mma_stage.cu)|f16|f16|[link](./hgemm/)|⭐️⭐️⭐️|
+| ✔️ [hgemm_mma_stages{swizzle}...cute*](./hgemm/cutlass/hgemm_mma_stage_tn_cute.cu)|f16|f16|[link](./hgemm/)|⭐️⭐️⭐️|
+| ✔️ [hgemm_mma_cublas*](./hgemm/cublas/hgemm_cublas.cu)|f16|f16|[link](./hgemm/)|⭐️⭐️|
 | ✔️ [sgemv_k32_f32](./sgemv/sgemv.cu)|f32|f32|[link](./sgemv/)|⭐️⭐️⭐️|
 | ✔️ [sgemv_k128_f32x4](./sgemv/sgemv.cu)|f32|f32|[link](./sgemv/)|⭐️⭐️⭐️|
 | ✔️ [sgemv_k16_f32](./sgemv/sgemv.cu)|f32|f32|[link](./sgemv/)|⭐️⭐️⭐️|
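Several of the kernels in the table above (`hgemm_sliced_k_f16`, the `sliced_k...dbuf` variants) use the sliced-K scheme: the K dimension is processed in tiles, with one partial product accumulated per slice. The same loop structure can be illustrated on the host side; this NumPy sketch (the function name and default tile size are illustrative, not from the repo) accumulates `C += A[:, k:k+tile_k] @ B[k:k+tile_k, :]` per slice, which is what the CUDA kernels do per shared-memory tile:

```python
import numpy as np

def sliced_k_gemm(A: np.ndarray, B: np.ndarray, tile_k: int = 4) -> np.ndarray:
    """Compute A @ B by looping over K in slices of width tile_k."""
    M, K = A.shape
    K2, N = B.shape
    assert K == K2, "inner dimensions must match"
    C = np.zeros((M, N), dtype=A.dtype)
    for k0 in range(0, K, tile_k):
        # One K-slice: the partial product for columns/rows [k0, k0 + tile_k).
        C += A[:, k0:k0 + tile_k] @ B[k0:k0 + tile_k, :]
    return C
```

On the GPU, each slice's sub-tiles of `A` and `B` are staged through shared memory (double-buffered in the `dbuf` variants) while the accumulator `C` stays in registers; the host sketch only mirrors the accumulation order.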