[DEPRECATED] Moved to ROCm/rocm-systems repo
Updated Mar 18, 2026. Language: Python.
Online CUDA Occupancy Calculator
(Spring 2017) Assignment 2: GPU Executor
Runs a single CUDA/OpenCL kernel, taking its source from a file and arguments from the command-line
GPU Drano: static analysis for GPU programs.
Prototype for a SPIR-V assembler and disassembler. It provides a composable Java interface for generating SPIR-V code at runtime.
A self-hosted low-level functional-style programming language 🌀
High-performance GPU-accelerated C# scripting for Rhino Grasshopper, powered by ILGPU
🍭 Sweet GPU compute kernels in CUDA, wrapped via CuPy
Medical AI diagnostics system implementing real compiled Mojo GPU kernels with MAX Graph integration
Runtime correctness checker for custom CUDA kernels. Attach a single decorator to periodically verify outputs against a reference implementation, with outlier-biased sampling and zero training graph impact.
A lightweight utility for monitoring and analyzing Triton kernel compilation cache behavior.
16-step CUDA optimization of FlashAttention-2 achieving 99.2% of official performance on A100 (Ampere architecture)
LLM primitives rebuilt in Triton — FlashAttention 2.52×, fused AdamW 3.45×, Bias+GELU 14.65× faster than PyTorch
Benchmarking hand-written CUDA C, Numba, and Triton self-attention kernels against PyTorch's SDPA - how fast can you go depending on the tool?
Triton optimizations run on an AMD GPU