---
Installation
```bash
git clone https://github.com/VachanVY/gpu-programming.git
cd gpu-programming
uv sync
# or, if you want uv to install exactly what's in uv.lock (no resolver changes):
uv sync --locked
```
---
Why Triton?
- HBM (High Bandwidth Memory) is the main GPU memory (DRAM)
- Calculations happen on the GPU chip, which has only a small amount of fast on-chip memory (SRAM)
- So the job of a custom kernel is to reduce data movement between HBM and the GPU chip
- Fusing many operations into one kernel cuts out intermediate round trips between HBM and the GPU chip (see the sketch below)
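To make the fusion point concrete, here is a minimal sketch (illustrative, not code from this repo) of what fusion saves at the PyTorch level:

```python
import torch

x = torch.randn(1_000_000, device="cuda")

# Unfused: two separate kernel launches. The intermediate `tmp` makes a
# full round trip through HBM (written by the add, read back by the relu).
tmp = x + 1.0
y = torch.relu(tmp)

# A fused kernel (e.g. one written in Triton, sketched later in this README)
# loads x from HBM once, does both ops in registers/SRAM, and stores y once.
```

For a memory-bound elementwise chain like this, fusing roughly halves the HBM traffic (one read and one write instead of two of each), which is exactly where the speedup comes from.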
---
FLOPS = FLoating point OPerations per Second
- Writing custom kernels doesn’t magically increase your GPU’s peak FLOPS or its memory; the chip is fixed. What it does is remove bottlenecks so you get closer to the hardware’s peak
- If you tile/cache properly (as cuBLAS and well-tuned Triton kernels do), you reuse values in shared memory/registers => drastically fewer memory loads (see the back-of-the-envelope sketch below)
- That’s why hand-written kernels can approach peak FLOPS
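As a back-of-the-envelope illustration of the reuse argument (my numbers, not from the repo): an N×N matmul does 2N³ FLOPs but only touches 3N² values, so a kernel that reuses loaded tiles can reach a very high arithmetic intensity, while a naive kernel that re-reads operands for every output element cannot:

```python
# Data reuse in an N x N fp32 matmul, C = A @ B (back-of-the-envelope sketch).
N = 4096
flops = 2 * N**3              # one multiply + one add per inner-product term
ideal_bytes = 3 * N * N * 4   # read A and B, write C -- each element touched once

# Naive kernel: each of the N^2 outputs re-reads a row of A and a column of B.
naive_bytes = (2 * N) * N * N * 4

print(f"ideal arithmetic intensity: {flops / ideal_bytes:.0f} FLOPs/byte")  # ~683
print(f"naive arithmetic intensity: {flops / naive_bytes:.2f} FLOPs/byte")  # 0.25
# Tiling with B x B blocks in shared memory reuses each loaded value ~B times,
# moving a kernel from the naive figure toward the ideal one.
```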
---
General Structure of a Triton Program
- Define `pid` (program id)
- Using `pid` and `tl.arange` of `block_size`, get the range/indices for `tl.load` to fetch that part of the input tensor from the input pointer
- Now that you have the loaded tensor, perform operations on it
- Store the output tensor using `tl.store` at the output pointer (the sketch below puts these steps together)

Mapping to CUDA:
- `threadIdx.x` in CUDA ≈ entries of `tl.arange` in Triton
- `blockIdx.x` in CUDA ≈ `pid` in Triton
- Think of `tl.arange` as “all the thread IDs in this block at once, in a vector”
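Putting the steps above together, here is a minimal vector-add kernel along the lines of the official Triton tutorial (a sketch; names like `add_kernel` are illustrative, not from this repo):

```python
import torch
import triton
import triton.language as tl

@triton.jit
def add_kernel(x_ptr, y_ptr, out_ptr, n_elements, BLOCK_SIZE: tl.constexpr):
    pid = tl.program_id(axis=0)                            # which block am I? (≈ blockIdx.x)
    offsets = pid * BLOCK_SIZE + tl.arange(0, BLOCK_SIZE)  # ≈ all threadIdx.x values at once
    mask = offsets < n_elements                            # guard the ragged tail
    x = tl.load(x_ptr + offsets, mask=mask)                # HBM -> registers
    y = tl.load(y_ptr + offsets, mask=mask)
    tl.store(out_ptr + offsets, x + y, mask=mask)          # registers -> HBM

def add(x: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
    out = torch.empty_like(x)
    n = x.numel()
    grid = (triton.cdiv(n, 1024),)   # one program (block) per 1024 elements
    add_kernel[grid](x, y, out, n, BLOCK_SIZE=1024)
    return out
```

The `mask` is what makes the last block safe when `n_elements` isn’t a multiple of `BLOCK_SIZE`; without it, `tl.load`/`tl.store` would touch memory past the end of the tensors.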