WingEdge777/vitamin-cuda


Vitamin-CUDA

A hands-on CUDA dev learning path from novice to expert.

One Kernel a Day, Keeps High Latency Away. 🚀

Welcome to your daily dose of CUDA programming! Vitamin-CUDA is a curated collection of hands-on CUDA practices, designed to take you from Hello World to High Performance. Whether you are a beginner looking to understand the grid-stride loop or an enthusiast diving into warp-level primitives, there's a kernel here for you.
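The grid-stride loop mentioned above is the canonical starting point. A minimal sketch (an illustration, not a kernel taken from this repo):

```cuda
// Grid-stride elementwise add: each thread strides over the array by the
// total number of launched threads, so any grid size covers any n.
__global__ void elementwise_add(const float* a, const float* b, float* c, int n) {
    for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n;
         i += gridDim.x * blockDim.x) {
        c[i] = a[i] + b[i];
    }
}
```

Because the loop decouples grid size from problem size, the same launch configuration works for any input length.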

💻 Let's get started, and happy coding! ⌨️

News

  • [2026.03.10] sgemm_tf32: a TF32 Tensor Core kernel outperforming cuBLAS (cp.async + double smem buffers + swizzle + ldmatrix + mma) 🚀 (stay tuned!)
  • [2026.02.27] sgemm: a SIMT kernel outperforming cuBLAS (smem + swizzle + double buffering + coalesced r/w) 🚀

Contents 📖

Prerequisites 🛠️

  • NVIDIA GPU (Compute Capability 6.0+)
  • CUDA Toolkit 11.0+
  • C++ Compiler (GCC/Clang/MSVC)
  • CMake 3.18+ (optional, but recommended)
  • PyTorch (for the extension examples, Python bindings, and performance comparisons)

For a quick start, I recommend the NVIDIA PyTorch NGC Docker images; see https://catalog.ngc.nvidia.com/orgs/nvidia/containers/pytorch

Kernels (100+)

All kernels were tested on an RTX 5060 GPU (unless otherwise specified) and benchmarked against PyTorch 2.9.

Easy (🌟~🌟🌟)

  • elementwise: elementwise add
    • elementwise_add fp32/fp16 versions
    • elementwise_add_fp16x2 (fp16 vectorized)
    • elementwise_add_fp16x8 (fp16 vectorized)
    • elementwise_add_fp16x8 (fp16 vectorized, packed r/w)
    • pytorch op bindings && diff check
  • sigmoid
    • sigmoid fp32/fp16 versions
    • sigmoid_fp16x2 (fp16 vectorized)
    • sigmoid_fp16x8 (fp16 vectorized)
    • sigmoid_fp16x8 (fp16 vectorized, packed r/w)
    • pytorch op bindings && diff check
  • swish
    • swish fp32/fp16 versions
    • swish_fp16x2 (fp16 vectorized)
    • swish_fp16x8 (fp16 vectorized)
    • swish_fp16x8 (fp16 vectorized, packed r/w)
    • pytorch op bindings && diff check
  • relu
    • relu fp32/fp16 versions
    • relu_fp16x2 (fp16 vectorized)
    • relu_fp16x8 (fp16 vectorized)
    • relu_fp16x8 (fp16 vectorized, packed r/w)
    • pytorch op bindings && diff check
  • relu6
    • relu6 fp32/fp16 versions
    • relu6_fp16x2 (fp16 vectorized)
    • relu6_fp16x8 (fp16 vectorized)
    • relu6_fp16x8 (fp16 vectorized, packed r/w)
    • pytorch op bindings && diff check
  • elu
    • elu fp32/fp16 versions
    • elu_fp16x2 (fp16 vectorized)
    • elu_fp16x8 (fp16 vectorized)
    • elu_fp16x8 (fp16 vectorized, packed r/w; nearly 2× speedup from half2)
    • pytorch op bindings && diff check
  • gelu
    • gelu fp32/fp16 versions
    • gelu_fp16x2 (fp16 vectorized)
    • gelu_fp16x8 (fp16 vectorized)
    • gelu_fp16x8 (fp16 vectorized, packed r/w)
    • pytorch op bindings && diff check
  • hardswish
    • hardswish fp32/fp16 versions
    • hardswish_fp16x2 (fp16 vectorized)
    • hardswish_fp16x8 (fp16 vectorized)
    • hardswish_fp16x8 (fp16 vectorized, packed r/w)
    • pytorch op bindings && diff check
  • embedding
    • embedding fp32/fp16 versions
    • embedding_fp32x4 (fp32 vectorized)
    • embedding_fp32x4 (fp32 vectorized, packed r/w)
    • embedding_fp16x2 (fp16 vectorized)
    • embedding_fp16x8 (fp16 vectorized)
    • embedding_fp16x8 (fp16 vectorized, packed r/w)
    • pytorch op bindings && diff check
  • rope
    • pytorch naive rope
    • pytorch rope with cos/sin table
    • rope fp32 version (an order of magnitude faster than the naive PyTorch implementation)
    • rope fp32x4 version (fp32 vectorized; tens of times faster at larger sizes)
    • pytorch op bindings && diff check
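The "fp16x8, packed r/w" variants that recur above share one trick: move eight halfs per thread in a single 128-bit transaction and compute with half2 math. A minimal sketch of the pattern, using relu as the example (an illustration, not the repo's exact kernel; `__hmax2` requires compute capability 8.0+, and tail handling for n not divisible by 8 is omitted for brevity):

```cuda
#include <cuda_fp16.h>

// Packed fp16x8 relu: one float4 load/store moves 8 halfs (128 bits),
// and the math runs on 4 half2 pairs.
__global__ void relu_fp16x8_packed(const __half* x, __half* y, int n) {
    int idx = 8 * (blockIdx.x * blockDim.x + threadIdx.x);
    if (idx + 7 < n) {
        float4 pack = *reinterpret_cast<const float4*>(x + idx);  // packed read
        __half2* h2 = reinterpret_cast<__half2*>(&pack);
        const __half2 zero = __float2half2_rn(0.0f);
        #pragma unroll
        for (int i = 0; i < 4; ++i)
            h2[i] = __hmax2(h2[i], zero);                          // half2 relu
        *reinterpret_cast<float4*>(y + idx) = pack;                // packed write
    }
}
```

The packed load is legal because `idx` advances in units of 8 halfs (16 bytes), which keeps the float4 accesses aligned.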

Medium (🌟🌟~🌟🌟🌟)

  • reduce: based on warp shuffle add
    • reduce_sum fp32/fp16 versions
    • reduce_sum_fp16x2 (fp16 vectorized)
    • reduce_sum_fp16x8_packed (fp16 vectorized, packed r/w)
    • reduce_sum int8 version
    • reduce_sum_i8x16_packed (int8 vectorized, packed r/w)
    • reduce_sum_i8x16_packed (int8 vectorized, packed r/w, dp4a; tens of times faster than the naive torch implementation)
    • reduce_sum_i8x64_packed (int8 vectorized, packed r/w, dp4a)
    • pytorch op bindings && diff check
  • dot_product
    • dot_product fp32/fp16 versions
    • dot_product_fp32x4 (fp32 vectorized)
    • dot_product_fp16x2 (fp16 vectorized)
    • dot_product_fp16x8 (fp16 vectorized, packed r/w)
    • pytorch op bindings && diff check
  • softmax
    • safe online softmax fp32/fp16 versions
    • safe online softmax fp32x4 version (fp32 vectorized)
    • safe online softmax fp16x8 version (fp16 vectorized, packed r/w)
    • pytorch op bindings && diff check
  • rmsnorm
    • naive torch rmsnorm
    • rmsnorm fp32/fp16 versions
    • rmsnorm fp32x4 version (fp32 vectorized)
    • rmsnorm_fp32x4_smem
    • rmsnorm fp16x8 version (fp16 vectorized, packed r/w)
    • rmsnorm_fp16x8_smem version (fp16 vectorized, packed r/w)
    • pytorch op bindings && diff check
  • transpose
    • transpose_coalesced_read (coalesced reads, from the input's perspective)
    • transpose_coalesced_write (coalesced writes, from the output's perspective)
    • transpose_smem (shared-memory tiles, blocked r/w)
    • transpose_smem_bcf (bank-conflict-free shared memory)
    • transpose_smem_packed_bcf (bank-conflict-free shared memory, float4 vectorized r/w)
    • transpose_smem_swizzled_packed (bank-conflict-free shared memory via swizzling, float4 vectorized r/w)
    • pytorch op bindings && diff check
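The reduce, dot_product, softmax, and rmsnorm kernels above all lean on the same warp-shuffle building block: a butterfly reduction that sums 32 lanes without touching shared memory. A minimal sketch of the idea (not the repo's code; `out` is assumed to be zero-initialized before launch):

```cuda
// XOR-butterfly warp reduction: each step halves the distance between
// exchanged lanes, and after 5 steps every lane holds the full warp sum.
__device__ float warp_reduce_sum(float v) {
    #pragma unroll
    for (int offset = 16; offset > 0; offset >>= 1)
        v += __shfl_xor_sync(0xffffffff, v, offset);
    return v;
}

__global__ void reduce_sum(const float* x, float* out, int n) {
    float sum = 0.0f;
    // Grid-stride accumulation into a per-thread partial sum.
    for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n;
         i += gridDim.x * blockDim.x)
        sum += x[i];
    sum = warp_reduce_sum(sum);
    if ((threadIdx.x & 31) == 0)   // one atomic per warp, not per thread
        atomicAdd(out, sum);
}
```

Reducing within registers first means only one atomicAdd per warp reaches global memory, which is what makes these kernels competitive.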

Hard (🌟🌟🌟~🌟🌟🌟🌟)

  • sgemv
    • gemv fp32 version
    • gemv fp32x4 (vectorized loads)
    • pytorch op bindings && diff check
  • sgemm
    • sgemm_cublas fp32 version
    • sgemm_tiling (vectorized r/w + block tiling with shared memory)
    • sgemm_at_tiling (vectorized r/w + A transposed into smem with 4-way write conflicts, inner-loop float4 loads)
    • sgemm_at_bcf_swizzling (vectorized r/w + transposed A + swizzling, bank-conflict free)
    • sgemm_at_bcf_swizzling_rw (vectorized r/w + transposed A + swizzling + coalesced C write-back transactions)
    • sgemm_at_bcf_swizzling_dbf_rw (vectorized r/w + transposed A + swizzling + coalesced C write-back + double-buffered pipeline; beats cuBLAS!)
    • pytorch op bindings && diff check
  • sgemm_tf32
    • sgemm_cublas tf32 version
    • sgemm_tf32_bt (vectorized A/B loads, B transposed into smem, ldmatrix + mma)
    • sgemm_tf32_bt_swizzle (vectorized A/B loads, B transposed into smem, ldmatrix + mma, conflict-free As)
    • sgemm_tf32_bt_swizzle_dbf (vectorized A/B loads, B transposed into smem, ldmatrix + mma, conflict-free As, grid swizzling; 97~102% of cuBLAS performance)
    • sgemm_tf32_swizzle_bcf (cp.async A/B loads, warp-shuffle register transpose of B, conflict-free As/Bs, grid swizzling)
    • sgemm_tf32_swizzle_bcf_dbf (cp.async A/B loads, warp-shuffle register transpose of B, conflict-free As/Bs, grid swizzling, double buffering; beats cuBLAS)
    • pytorch op bindings && diff check
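All of the sgemm variants above grow out of one baseline: block tiling through shared memory. A minimal sketch of that starting point (an illustration only; square row-major matrices with N a multiple of TILE, launched with a TILE×TILE thread block; the repo's kernels layer transposition, swizzling, register tiling, and double buffering on top of this):

```cuda
#define TILE 32

// C = A * B via shared-memory block tiling: each block computes one
// TILE x TILE tile of C, staging TILE-wide slabs of A and B in smem so
// each global element is loaded once per tile instead of once per MAC.
__global__ void sgemm_tiled(const float* A, const float* B, float* C, int N) {
    __shared__ float As[TILE][TILE];
    __shared__ float Bs[TILE][TILE];
    int row = blockIdx.y * TILE + threadIdx.y;
    int col = blockIdx.x * TILE + threadIdx.x;
    float acc = 0.0f;
    for (int k0 = 0; k0 < N; k0 += TILE) {
        As[threadIdx.y][threadIdx.x] = A[row * N + k0 + threadIdx.x];
        Bs[threadIdx.y][threadIdx.x] = B[(k0 + threadIdx.y) * N + col];
        __syncthreads();                          // tiles fully staged
        for (int k = 0; k < TILE; ++k)
            acc += As[threadIdx.y][k] * Bs[k][threadIdx.x];
        __syncthreads();                          // done reading this tile
    }
    C[row * N + col] = acc;
}
```

Each refinement in the list attacks a bottleneck this baseline still has: smem bank conflicts (swizzling), load latency (double buffering, cp.async), and write-back efficiency (coalesced C transactions).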

Samples

Reference
