CUDA Programming

  • CUDA beginner things
    • threadIdx.x in CUDA ↔ entries of tl.arange in Triton.
      blockIdx.x in CUDA ↔ pid in Triton.

      Think of tl.arange as “all the thread IDs in this block at once, in a vector”.
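
    A minimal sketch of this correspondence (the kernel name, BLOCK_SIZE=256, and the launch below are illustrative, not taken from this repo): each program writes the global index of the elements it owns.

    import torch
    import triton
    import triton.language as tl

    @triton.jit
    def index_demo_kernel(out_ptr, n_elements, BLOCK_SIZE: tl.constexpr):
        pid = tl.program_id(axis=0)                            # plays the role of blockIdx.x
        offsets = pid * BLOCK_SIZE + tl.arange(0, BLOCK_SIZE)  # tl.arange ~ all threadIdx.x values at once
        mask = offsets < n_elements                            # guard the last, partially full block
        tl.store(out_ptr + offsets, offsets, mask=mask)        # each element receives its global index

    out = torch.empty(1000, device="cuda", dtype=torch.int32)
    index_demo_kernel[(triton.cdiv(1000, 256),)](out, 1000, BLOCK_SIZE=256)
    assert torch.equal(out, torch.arange(1000, device="cuda", dtype=torch.int32))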

Triton Programming

  • Installation
    git clone https://github.com/VachanVY/gpu-programming.git
    cd gpu-programming
    
    uv sync
    # or
    uv sync --locked # to install exactly what’s in uv.lock (no resolver changes)
  • Why Triton?
    • HBM is the main GPU memory (DRAM)
    • Calculations happen on the GPU chip, which has some on-chip memory (SRAM), but not a lot of it
    • So the job of a kernel is to reduce data movement between HBM and the GPU chip
    • Fuse many operations into one kernel, so that data moves between HBM and the GPU chip fewer times (see the sketch after this list)
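
    As an illustration of fusion, here is a hedged sketch (not a kernel from this repo; the name fused_add_relu_kernel and BLOCK_SIZE=1024 are assumptions): computing relu(x + y) in a single kernel reads x and y from HBM once and writes the result once, instead of materializing x + y in HBM and reading it back for the relu.

    import torch
    import triton
    import triton.language as tl

    @triton.jit
    def fused_add_relu_kernel(x_ptr, y_ptr, out_ptr, n_elements, BLOCK_SIZE: tl.constexpr):
        pid = tl.program_id(axis=0)
        offsets = pid * BLOCK_SIZE + tl.arange(0, BLOCK_SIZE)
        mask = offsets < n_elements
        x = tl.load(x_ptr + offsets, mask=mask)    # one read of x from HBM
        y = tl.load(y_ptr + offsets, mask=mask)    # one read of y from HBM
        z = tl.maximum(x + y, 0.0)                 # add + relu happen on-chip (registers/SRAM)
        tl.store(out_ptr + offsets, z, mask=mask)  # one write of the result to HBM

    def fused_add_relu(x: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
        out = torch.empty_like(x)
        n = x.numel()
        fused_add_relu_kernel[(triton.cdiv(n, 1024),)](x, y, out, n, BLOCK_SIZE=1024)
        return out
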
  • FLOPS = FLoating point OPerations per Second
    • Writing custom kernels doesn’t magically increase your GPU’s FLOPs or shrink its memory. The chip is fixed. What custom kernels do is remove bottlenecks so you get closer to the hardware’s peak.
    • If you tile/cache properly (as cuBLAS and well-written Triton kernels do), you reuse values in shared memory/registers, which means drastically fewer memory loads.
    • That’s why hand-written kernels can approach peak FLOPs (see the rough estimate after this list).
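
    A rough back-of-the-envelope sketch of this point (the byte counts below are simplifying assumptions, not measurements): arithmetic intensity is FLOPs per byte moved between HBM and the chip, and reuse through tiling raises it by orders of magnitude.

    # Illustrative arithmetic-intensity estimates (assumed, simplified byte counts)
    def arithmetic_intensity(flops: float, bytes_moved: float) -> float:
        """FLOPs performed per byte moved between HBM and the chip."""
        return flops / bytes_moved

    n = 4096
    # float32 vector add: n FLOPs; 3*n*4 bytes moved (read x, read y, write out)
    vec_add = arithmetic_intensity(n, 3 * n * 4)                  # ~0.08 FLOP/byte -> memory-bound
    # float32 n x n matmul: 2*n^3 FLOPs; with good tiling roughly 3*n^2*4 bytes moved
    tiled_matmul = arithmetic_intensity(2 * n**3, 3 * n**2 * 4)   # ~680 FLOP/byte -> can approach peak FLOPs
    print(f"vector add  : {vec_add:.2f} FLOP/byte")
    print(f"tiled matmul: {tiled_matmul:.0f} FLOP/byte")
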
  • General Structure of a Triton Program

    • Define pid (the program id)
    • Using pid and tl.arange over the block size, compute the offsets/indices and tl.load that part of the input tensor through the input pointer
    • Now that you have the loaded block, perform the computation on it
    • Store the result through the output pointer using tl.store (a minimal sketch follows the note below)
threadIdx.x in CUDA ↔ entries of tl.arange in Triton.
blockIdx.x in CUDA ↔ pid in Triton.

Think of tl.arange as “all the thread IDs in this block at once, in a vector”.
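
A minimal sketch of that structure (the kernel name, the squaring operation, and BLOCK_SIZE=1024 are illustrative choices, not taken from this repo):

    import torch
    import triton
    import triton.language as tl

    @triton.jit
    def square_kernel(x_ptr, out_ptr, n_elements, BLOCK_SIZE: tl.constexpr):
        pid = tl.program_id(axis=0)                             # 1. program id
        offsets = pid * BLOCK_SIZE + tl.arange(0, BLOCK_SIZE)   # 2. indices for this block
        mask = offsets < n_elements                             #    guard the tail block
        x = tl.load(x_ptr + offsets, mask=mask)                 #    load through the input pointer
        y = x * x                                               # 3. compute on the loaded block
        tl.store(out_ptr + offsets, y, mask=mask)               # 4. store through the output pointer

    x = torch.randn(10_000, device="cuda")
    out = torch.empty_like(x)
    square_kernel[(triton.cdiv(x.numel(), 1024),)](x, out, x.numel(), BLOCK_SIZE=1024)
    assert torch.allclose(out, x * x)
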
  • Why blocks?
  • Memory Hierarchy

Notes

Trash/Notes

  • Problem 7/G: Long Sum (G_sum_dim1.py)
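
    A hedged sketch of one way to sum a 2D tensor along dim 1 (one program per row, looping over the row in chunks); the actual G_sum_dim1.py in this repo may be organized differently.

    import torch
    import triton
    import triton.language as tl

    @triton.jit
    def sum_dim1_kernel(x_ptr, out_ptr, n_cols, stride_row, BLOCK_SIZE: tl.constexpr):
        row = tl.program_id(axis=0)                     # each program handles one row
        col_offsets = tl.arange(0, BLOCK_SIZE)
        acc = tl.zeros((BLOCK_SIZE,), dtype=tl.float32)
        for start in range(0, n_cols, BLOCK_SIZE):      # walk the (possibly long) row in chunks
            cols = start + col_offsets
            acc += tl.load(x_ptr + row * stride_row + cols, mask=cols < n_cols, other=0.0)
        tl.store(out_ptr + row, tl.sum(acc, axis=0))    # reduce the per-lane partial sums

    x = torch.randn(128, 10_000, device="cuda")
    out = torch.empty(128, device="cuda")
    sum_dim1_kernel[(x.shape[0],)](x, out, x.shape[1], x.stride(0), BLOCK_SIZE=1024)
    assert torch.allclose(out, x.sum(dim=1), atol=1e-2)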

  • yt video
  • Matmul 11
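
    A hedged sketch of a tiled matmul in Triton (block sizes and names are illustrative, and this is not necessarily the kernel in this repo): each program computes one BLOCK_M x BLOCK_N tile of C, looping over K in BLOCK_K chunks so the loaded A/B tiles are reused on-chip instead of being re-read from HBM.

    import torch
    import triton
    import triton.language as tl

    @triton.jit
    def matmul_kernel(a_ptr, b_ptr, c_ptr, M, N, K,
                      stride_am, stride_ak, stride_bk, stride_bn, stride_cm, stride_cn,
                      BLOCK_M: tl.constexpr, BLOCK_N: tl.constexpr, BLOCK_K: tl.constexpr):
        pid_m = tl.program_id(axis=0)
        pid_n = tl.program_id(axis=1)
        rm = pid_m * BLOCK_M + tl.arange(0, BLOCK_M)    # rows of this C tile
        rn = pid_n * BLOCK_N + tl.arange(0, BLOCK_N)    # cols of this C tile
        rk = tl.arange(0, BLOCK_K)
        acc = tl.zeros((BLOCK_M, BLOCK_N), dtype=tl.float32)
        for k in range(0, K, BLOCK_K):
            ks = k + rk                                 # current K-chunk
            a = tl.load(a_ptr + rm[:, None] * stride_am + ks[None, :] * stride_ak,
                        mask=(rm[:, None] < M) & (ks[None, :] < K), other=0.0)
            b = tl.load(b_ptr + ks[:, None] * stride_bk + rn[None, :] * stride_bn,
                        mask=(ks[:, None] < K) & (rn[None, :] < N), other=0.0)
            acc += tl.dot(a, b)                         # accumulate tile-by-tile
        c_mask = (rm[:, None] < M) & (rn[None, :] < N)
        tl.store(c_ptr + rm[:, None] * stride_cm + rn[None, :] * stride_cn, acc, mask=c_mask)

    a = torch.randn(512, 256, device="cuda")
    b = torch.randn(256, 384, device="cuda")
    c = torch.empty(512, 384, device="cuda")
    grid = (triton.cdiv(512, 64), triton.cdiv(384, 64))
    matmul_kernel[grid](a, b, c, 512, 384, 256,
                        a.stride(0), a.stride(1), b.stride(0), b.stride(1), c.stride(0), c.stride(1),
                        BLOCK_M=64, BLOCK_N=64, BLOCK_K=32)
    assert torch.allclose(c, a @ b, atol=1e-1)  # loose tolerance: tl.dot may use TF32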
