---
Installation
```bash
git clone https://github.com/VachanVY/gpu-programming.git
cd gpu-programming
uv sync
# or, if you want uv to install exactly what's in uv.lock (no resolver changes):
uv sync --locked
```
---
Why Triton?
- HBM (High Bandwidth Memory) is the main GPU memory (DRAM)
- Calculations happen on the GPU chip, which has only a small amount of fast on-chip memory (SRAM)
- So the job of a custom kernel is to reduce data movement between HBM and the GPU chip
- Fusing many operations into one kernel cuts out intermediate round trips between HBM and the GPU chip (see the sketch below)
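To make the fusion point concrete, here is a minimal sketch (illustrative, not code from this repo) of what fusion saves at the PyTorch level:

```python
import torch

x = torch.randn(1_000_000, device="cuda")

# Unfused: two separate kernel launches. The intermediate `tmp` makes a
# full round trip through HBM (written by the add, read back by the relu).
tmp = x + 1.0
y = torch.relu(tmp)

# A fused kernel (e.g. one written in Triton, sketched later in this README)
# loads x from HBM once, does both ops in registers/SRAM, and stores y once.
```

For a memory-bound elementwise chain like this, fusing roughly halves the HBM traffic (one read and one write instead of two of each), which is exactly where the speedup comes from.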
---
FLOPS = FLoating point OPerations per Second
- Writing custom kernels doesn’t magically increase your GPU’s peak FLOPS or its memory; the chip is fixed. What it does is remove bottlenecks so you get closer to the hardware’s peak
- If you tile/cache properly (as cuBLAS and well-tuned Triton kernels do), you reuse values in shared memory/registers => drastically fewer memory loads (see the back-of-the-envelope sketch below)
- That’s why hand-written kernels can approach peak FLOPS
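As a back-of-the-envelope illustration of the reuse argument (my numbers, not from the repo): an N×N matmul does 2N³ FLOPs but only touches 3N² values, so a kernel that reuses loaded tiles can reach a very high arithmetic intensity, while a naive kernel that re-reads operands for every output element cannot:

```python
# Data reuse in an N x N fp32 matmul, C = A @ B (back-of-the-envelope sketch).
N = 4096
flops = 2 * N**3              # one multiply + one add per inner-product term
ideal_bytes = 3 * N * N * 4   # read A and B, write C -- each element touched once

# Naive kernel: each of the N^2 outputs re-reads a row of A and a column of B.
naive_bytes = (2 * N) * N * N * 4

print(f"ideal arithmetic intensity: {flops / ideal_bytes:.0f} FLOPs/byte")  # ~683
print(f"naive arithmetic intensity: {flops / naive_bytes:.2f} FLOPs/byte")  # 0.25
# Tiling with B x B blocks in shared memory reuses each loaded value ~B times,
# moving a kernel from the naive figure toward the ideal one.
```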
---
General Structure of a Triton Program
- Define `pid` (program id)
- Using `pid` and `tl.arange` of `block_size`, get the range/indices for `tl.load` to fetch that part of the input tensor from the input pointer
- Now that you have the loaded tensor, perform operations on it
- Store the output tensor using `tl.store` at the output pointer (the sketch below puts these steps together)

Mapping to CUDA:
- `threadIdx.x` in CUDA ≈ entries of `tl.arange` in Triton
- `blockIdx.x` in CUDA ≈ `pid` in Triton
- Think of `tl.arange` as “all the thread IDs in this block at once, in a vector”
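Putting the steps above together, here is a minimal vector-add kernel along the lines of the official Triton tutorial (a sketch; names like `add_kernel` are illustrative, not from this repo):

```python
import torch
import triton
import triton.language as tl

@triton.jit
def add_kernel(x_ptr, y_ptr, out_ptr, n_elements, BLOCK_SIZE: tl.constexpr):
    pid = tl.program_id(axis=0)                            # which block am I? (≈ blockIdx.x)
    offsets = pid * BLOCK_SIZE + tl.arange(0, BLOCK_SIZE)  # ≈ all threadIdx.x values at once
    mask = offsets < n_elements                            # guard the ragged tail
    x = tl.load(x_ptr + offsets, mask=mask)                # HBM -> registers
    y = tl.load(y_ptr + offsets, mask=mask)
    tl.store(out_ptr + offsets, x + y, mask=mask)          # registers -> HBM

def add(x: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
    out = torch.empty_like(x)
    n = x.numel()
    grid = (triton.cdiv(n, 1024),)   # one program (block) per 1024 elements
    add_kernel[grid](x, y, out, n, BLOCK_SIZE=1024)
    return out
```

The `mask` is what makes the last block safe when `n_elements` isn’t a multiple of `BLOCK_SIZE`; without it, `tl.load`/`tl.store` would touch memory past the end of the tensors.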