Commit 697e06f

[README] Refactor README✔️ (#176)

* Update README.md
* Update README.md
* Update README.md
* Update README.md
1 parent b56a8c3, commit 697e06f

File tree: 1 file changed (+6, -4 lines)
README.md

Lines changed: 6 additions & 4 deletions
@@ -150,13 +150,15 @@ flash_attn_mma_stages_split_q_tiling_qk_kernel(half* Q, half* K, half* V, half*

 <div id="cuda-kernel"></div>

-**Workflow**: custom **CUDA** kernel impl -> **PyTorch** Python bindings -> Run tests. 👉TIPS: `*` = Tensor Cores (WMMA, MMA, CuTe), otherwise, CUDA Cores; `/` = not supported; `✔️` = supported; `` = TODO. [📚 Easy](#cuda-kernel-easy-medium) and [📚 Medium](#cuda-kernel-easy-medium) includes element-wise, mat_trans, warp/block reduce, online-softmax, nms, layer-norm, rms-norm, dot-prod etc. [📚 Hard](#cuda-kernel-hard) and [📚 Hard++](#cuda-kernel-hard) mainly focus on `sgemv, sgemm, hgemv, hgemm and flash-attention`.
+The kernels listed here will guide you through a step-by-step progression, ranging from easy to very challenging topics. The **Workflow** will look like: custom **CUDA** kernel impl -> **PyTorch** Python bindings -> Run tests. 👉TIPS: `*` = Tensor Cores (WMMA, MMA, CuTe), otherwise, CUDA Cores; `/` = not supported; `✔️` = supported; `` = TODO. Contents:

 - [📚 Easy ⭐️](#cuda-kernel-easy-medium)
 - [📚 Medium ⭐️⭐️](#cuda-kernel-easy-medium)
 - [📚 Hard ⭐️⭐️⭐️](#cuda-kernel-hard)
 - [📚 Hard++ ⭐⭐⭐️⭐️⭐️](#cuda-kernel-hard)

+[📚 Easy](#cuda-kernel-easy-medium) and [📚 Medium](#cuda-kernel-easy-medium) sections cover fundamental operations such as element-wise, mat_trans, warp/block reduce, online-softmax, nms, layer-norm, rms-norm, dot-prod etc. [📚 Hard](#cuda-kernel-hard) and [📚 Hard++](#cuda-kernel-hard) sections delve deeper into advanced topics, primarily focusing on operations like `sgemv, sgemm, hgemv, hgemm and flash-attention`. These sections also provide numerous kernels implemented using Tensor Cores with pure MMA PTX instructions.
+
 ### 📚 Easy ⭐️ & Medium ⭐️⭐️ ([©️back👆🏻](#cuda-kernel))
 <div id="cuda-kernel-easy-medium"></div>

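The workflow named in the new paragraph (custom CUDA kernel impl -> PyTorch Python bindings -> run tests) can be made concrete with a minimal sketch. Everything below is hypothetical and not taken from the repository: the file name `elementwise_add.cu`, the kernel, and the binding names are placeholders that only show the general shape of such a kernel-plus-binding file.

```cuda
// elementwise_add.cu -- hypothetical names; a minimal sketch of the
// "CUDA kernel impl -> PyTorch Python bindings -> run tests" workflow.
#include <torch/extension.h>
#include <ATen/cuda/CUDAContext.h>

// Step 1: plain CUDA-Core kernel, one thread per element: c[i] = a[i] + b[i].
__global__ void elementwise_add_f32_kernel(const float* a, const float* b,
                                           float* c, int n) {
  int idx = blockIdx.x * blockDim.x + threadIdx.x;
  if (idx < n) c[idx] = a[idx] + b[idx];
}

// Step 2: PyTorch-facing wrapper that validates inputs and launches the kernel
// on the stream PyTorch is currently using.
torch::Tensor elementwise_add_f32(torch::Tensor a, torch::Tensor b) {
  TORCH_CHECK(a.is_cuda() && b.is_cuda(), "inputs must be CUDA tensors");
  TORCH_CHECK(a.scalar_type() == torch::kFloat32, "only float32 is handled here");
  TORCH_CHECK(a.is_contiguous() && b.is_contiguous(), "inputs must be contiguous");
  TORCH_CHECK(a.numel() == b.numel(), "inputs must have the same number of elements");
  auto c = torch::empty_like(a);
  const int n = static_cast<int>(a.numel());
  const int threads = 256;
  const int blocks = (n + threads - 1) / threads;
  elementwise_add_f32_kernel<<<blocks, threads, 0, at::cuda::getCurrentCUDAStream()>>>(
      a.data_ptr<float>(), b.data_ptr<float>(), c.data_ptr<float>(), n);
  return c;
}

// Python binding so the kernel is callable as a torch extension module.
PYBIND11_MODULE(TORCH_EXTENSION_NAME, m) {
  m.def("elementwise_add_f32", &elementwise_add_f32, "elementwise add (CUDA)");
}
```

The "run tests" step is then typically a small Python script that JIT-compiles such a file with `torch.utils.cpp_extension.load(...)` and asserts the result matches the corresponding native `torch` op (here simply `a + b`).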
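The same paragraph notes that the Hard/Hard++ kernels also rely on Tensor Cores driven by raw MMA PTX. For orientation only, here is a hedged, minimal sketch (not copied from the repo) of how one half-precision m16n8k16 MMA instruction is commonly issued from inline PTX; it assumes an sm_80+ GPU and that the caller has already packed each lane's operand fragments into 32-bit registers.

```cuda
#include <cstdint>

// One warp-wide Tensor Core MMA: D(16x8) = A(16x16) * B(16x8) + C(16x8), all fp16.
// Per lane, the hardware-defined fragments are packed two halfs per 32-bit register:
// 4 registers of A, 2 of B, 2 of C, 2 of D.
__device__ __forceinline__ void hmma_m16n8k16_f16(uint32_t* RD, const uint32_t* RA,
                                                  const uint32_t* RB, const uint32_t* RC) {
  asm volatile(
      "mma.sync.aligned.m16n8k16.row.col.f16.f16.f16.f16 "
      "{%0, %1}, {%2, %3, %4, %5}, {%6, %7}, {%8, %9};\n"
      : "=r"(RD[0]), "=r"(RD[1])
      : "r"(RA[0]), "r"(RA[1]), "r"(RA[2]), "r"(RA[3]),
        "r"(RB[0]), "r"(RB[1]),
        "r"(RC[0]), "r"(RC[1]));
}
```

Full hgemm/flash-attention kernels layer shared-memory tiling, `ldmatrix` loads, and multi-stage pipelining on top of this single instruction, which is the kind of machinery the Hard/Hard++ sections build up to.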
@@ -467,7 +469,7 @@ flash_attn_mma_stages_split_q_tiling_qk_kernel(half* Q, half* K, half* V, half*

 <div id="my-blogs-part-1"></div>

-### 📖 大模型|多模态|Diffusion|推理优化 (本人作者) ([©️back👆🏻](#contents))
+### 📚 大模型|多模态|Diffusion|推理优化 (本人作者) ([©️back👆🏻](#contents))

 |📖 类型-标题|📖 作者|
 |:---|:---|
@@ -496,7 +498,7 @@ flash_attn_mma_stages_split_q_tiling_qk_kernel(half* Q, half* K, half* V, half*
 |[[LLM推理优化][LLM Infra整理]📖PagedAttention论文新鲜出炉](https://zhuanlan.zhihu.com/p/617015570)|@DefTruth|


-### 📖 CV推理部署|C++|算法|技术随笔 (本人作者) ([©️back👆🏻](#contents))
+### 📚 CV推理部署|C++|算法|技术随笔 (本人作者) ([©️back👆🏻](#contents))

 <div id="my-blogs-part-2"></div>

@@ -548,7 +550,7 @@ flash_attn_mma_stages_split_q_tiling_qk_kernel(half* Q, half* K, half* V, half*
 | [[技术随笔][ML]📖200页:《统计学习方法:李航》笔记 -从原理到实现](https://zhuanlan.zhihu.com/p/461520847)|@DefTruth|


-### 📖 CUTLASS|CuTe|NCCL|CUDA|文章推荐 (其他作者) ([©️back👆🏻](#contents))
+### 📚 CUTLASS|CuTe|NCCL|CUDA|文章推荐 (其他作者) ([©️back👆🏻](#contents))

 <div id="other-blogs"></div>
