</p>
</div>

## ©️Citations🎉🎉

```BibTeX
@misc{LeetCUDA2025,
  title={LeetCUDA: A Modern CUDA Learn Notes with PyTorch for Beginners},
  url={https://github.com/xlite-dev/LeetCUDA.git},
  note={Open-source software available at https://github.com/xlite-dev/LeetCUDA.git},
  author={DefTruth and Many Others},
  year={2025}
}
```

## 📖 News 🔥🔥
<div id="news"></div>

<img src='https://github.com/user-attachments/assets/9472e970-c083-4b31-9252-3eeecc761078' height="170px" width="270px">
</div>

## 📖 Contents
<div id="contents"></div>
<!---

- [📚 Hard++ ⭐⭐⭐️⭐️⭐️](#cuda-kernel-hard-plus)
- [📚 Triton ⭐⭐⭐️](#triton-kernel)
- [📚 CUTLASS ⭐⭐⭐️](#cutlass-kernel)
- [📖 100+ LLM/CUDA Blogs 🔥](#my-blogs-part-1)
- [📖 How to Contribute 👀👇](#contribute)

💡NOTE: [📚Split Q + Fully QKV Fine-grained Tiling](#mma-tiling-qkv) has been refactored into 🤖[ffpa-attn](https://github.com/xlite-dev/ffpa-attn).

## 📖 200+ CUDA Kernels 🔥🔥 (Easy -> Hard++) ([©️back👆🏻](#contents))

<div id="cuda-kernel"></div>

💡NOTE: 🤖[ffpa-attn](https://github.com/xlite-dev/ffpa-attn): 📚FFPA - Yet another Faster Flash Prefill Attention with O(1)🎉SRAM complexity for headdim > 256, **1.8x~3x**🎉faster than SDPA EA: [📈L20 ~1.9x↑🎉](https://github.com/xlite-dev/ffpa-attn?tab=readme-ov-file#L1-bench-l20), [📈A30 ~1.8x↑🎉](https://github.com/xlite-dev/ffpa-attn?tab=readme-ov-file#L1-bench-a30), [📈3080 ~2.9x↑🎉](https://github.com/xlite-dev/ffpa-attn?tab=readme-ov-file#L1-bench-3080), [📈4090 ~2.1x↑🎉](https://github.com/xlite-dev/ffpa-attn?tab=readme-ov-file#L1-bench-4090).

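Here, "SDPA EA" refers to PyTorch's `torch.nn.functional.scaled_dot_product_attention` running its memory-efficient attention backend. A minimal baseline sketch is shown below; the tensor shapes, the headdim=512 choice, and the backend pinning are illustrative assumptions for a headdim > 256 workload, not ffpa-attn's benchmark configuration or its API.

```python
# "SDPA EA" baseline sketch: PyTorch scaled_dot_product_attention pinned to the
# memory-efficient backend, on a headdim > 256 workload. Shapes are assumed values.
import torch
import torch.nn.functional as F
from torch.nn.attention import sdpa_kernel, SDPBackend  # available in recent PyTorch (>= 2.3)

B, H, N, D = 1, 8, 4096, 512  # batch, heads, seq_len, headdim (> 256); illustrative only
q = torch.randn(B, H, N, D, device="cuda", dtype=torch.half)
k = torch.randn(B, H, N, D, device="cuda", dtype=torch.half)
v = torch.randn(B, H, N, D, device="cuda", dtype=torch.half)

# Pin SDPA to the efficient-attention ("EA") backend that the speedup numbers above compare against.
with sdpa_kernel(SDPBackend.EFFICIENT_ATTENTION):
    out = F.scaled_dot_product_attention(q, k, v, is_causal=False)

print(out.shape)  # torch.Size([1, 8, 4096, 512])
```

ffpa-attn replaces this call with its own fine-grained QKV tiling kernels; see its README for the actual API and benchmark setup.
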
### 📚 Triton Kernel (OpenAI Triton) ⭐️⭐️⭐️ ([©️back👆🏻](#cuda-kernel))

<div id="triton-kernel"></div>

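For readers new to Triton, a minimal kernel sketch may help before diving into the kernels listed in this section. The elementwise-add kernel below is an assumed toy example (mirroring Triton's introductory tutorial), not one of this repo's kernels.

```python
# Minimal Triton kernel sketch: elementwise vector add (assumed toy example).
import torch
import triton
import triton.language as tl

@triton.jit
def add_kernel(x_ptr, y_ptr, out_ptr, n_elements, BLOCK_SIZE: tl.constexpr):
    pid = tl.program_id(axis=0)                       # each program handles one BLOCK_SIZE chunk
    offsets = pid * BLOCK_SIZE + tl.arange(0, BLOCK_SIZE)
    mask = offsets < n_elements                       # guard the tail of the tensor
    x = tl.load(x_ptr + offsets, mask=mask)
    y = tl.load(y_ptr + offsets, mask=mask)
    tl.store(out_ptr + offsets, x + y, mask=mask)

def add(x: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
    out = torch.empty_like(x)
    n = out.numel()
    grid = (triton.cdiv(n, 1024),)                    # 1D launch grid
    add_kernel[grid](x, y, out, n, BLOCK_SIZE=1024)
    return out

x = torch.randn(10_000, device="cuda")
y = torch.randn(10_000, device="cuda")
assert torch.allclose(add(x, y), x + y)
```

The same pattern (compute per-program offsets, mask the tail, `tl.load`/`tl.store`) underlies the more advanced Triton kernels referenced here.
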