13 changes: 12 additions & 1 deletion _blog.yml
@@ -5071,4 +5071,15 @@
- open-source
- community
- agents
  - rl

- local: custom-cuda-kernels-agent-skills
date: Feb 13, 2026
tags:
- open-source
- cuda
- kernels
- agent-skills
- community
- agents

Binary file added assets/custom-cuda-kernels/meme.png
> **Contributor:** This is a very bold thing to do for a thumbnail 😆

300 changes: 300 additions & 0 deletions custom-cuda-kernels-agent-skills.md
> **Member:** `custom-cuda-kernels.md` ?

> **Collaborator (author):** I was SEO maxing, lol.

@@ -0,0 +1,300 @@
---
title: "Custom Kernels for All from Codex and Claude"
thumbnail: /blog/assets/custom-cuda-kernels/meme.png
authors:
- user: burtenshaw
- user: sayakpaul
- user: ariG23498
- user: evalstate
---

<!-- TODO: PLEASE ADD YOURSELF TO THE AUTHORS LIST -->
> **Contributor:** Let's remove this line?

> **Collaborator (author):** I like to leave this in until merging so that colleagues feel comfortable taking their (deserved) authorship.


# Custom Kernels for All from Codex and Claude

![oprah custom cuda kernels](assets/custom-cuda-kernels/meme.png)
> **Contributor:** Do we want this in the blog post as well?

> **Collaborator (author):** Yeah, I feel like it sets the tone and it's kinda lost as a pure thumbnail.


tl;dr: We built an agent skill that teaches coding agents how to write production CUDA kernels. Then we pointed Claude and Codex at two real targets: a **diffusers** pipeline and a **transformers** model. The agents produced working kernels for both, with correct PyTorch bindings and benchmarks, end to end.

Writing CUDA kernels is hard. Writing CUDA kernels that correctly integrate with `transformers` and `diffusers` is harder. There are architecture-specific memory access patterns, vectorization strategies, warp shuffle reductions, and a dozen integration pitfalls that trip up even experienced developers. It is exactly the kind of specialized, high-stakes problem where agent skills shine.

We gave coding agents the domain knowledge they need, like which GPU architecture to target, how to structure a kernel-builder project, when to use shared memory versus registers, and how to write PyTorch bindings. The agents did the rest. If you have used the [LLM training skill](https://huggingface.co/blog/hf-skills-training) or read [We Got Claude to Teach Open Models](https://huggingface.co/blog/upskill), the pattern will feel familiar: package domain expertise into a skill, point the agent at a problem, and let it work.

## Why a skill for kernels?

The [Kernel Hub](https://huggingface.co/blog/hello-hf-kernels) solved the distribution problem for custom hardware kernels. You can load pre-compiled kernels from the Hub with a single `get_kernel` call. No builds, no flags. However, someone still needs to **write the kernels**. That is the gap this skill fills.

CUDA kernel development has a brutal surface area:

- Hardware: each GPU generation needs its own optimization strategy. H100, A100, and T4 have different compute capabilities, shared memory sizes, and bandwidth profiles.
- Libraries: `diffusers` and `transformers` have different module hierarchies, normalization conventions, and integration patterns. Custom kernels also need to be registered as PyTorch ops so `torch.compile` can recognize them (a minimal registration sketch follows this list).
- Distribution: kernels can depend on specific CUDA, PyTorch, and Python versions, creating a massive environment matrix.
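
For example, the registration problem from the second bullet can be handled from Python roughly like this. This is a minimal sketch assuming PyTorch 2.4+; in the generated projects the equivalent registration happens in C++ via `torch_binding.cpp`, and the namespace, op name, and pure-PyTorch body below are illustrative stand-ins:

```py
import torch

# A minimal sketch of registering a custom op so torch.compile can trace it
# without graph breaks (illustrative names; assumes PyTorch >= 2.4).

@torch.library.custom_op("my_kernels::rmsnorm", mutates_args=())
def rmsnorm(x: torch.Tensor, weight: torch.Tensor, eps: float) -> torch.Tensor:
    # A real kernel project would call into the compiled CUDA extension here;
    # this body is a pure-PyTorch stand-in.
    variance = x.float().pow(2).mean(-1, keepdim=True)
    return (x.float() * torch.rsqrt(variance + eps)).to(x.dtype) * weight

@rmsnorm.register_fake
def _(x, weight, eps):
    # Shape/dtype propagation so the compiler can trace without launching CUDA code.
    return torch.empty_like(x)
```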

This is domain knowledge that gets lost in documentation tabs and Stack Overflow answers. An agent skill packages it into context that loads on demand.

First, let's show how to use the skill right away. Then we'll dive into how we benchmarked the kernels.

## Installing the skill

The skill ships with the `kernels` library. Install it into your coding agent with a single command:

```shell
pip install kernels
kernels skills add cuda-kernels --claude
```

This drops the skill into `.claude/skills/cuda-kernels/` where Claude Code and Cursor pick it up automatically. For other agents:

```shell
# Codex
kernels skills add cuda-kernels --codex

# OpenCode
kernels skills add cuda-kernels --opencode

# Custom destination
kernels skills add cuda-kernels --dest ./my-agent/skills/

# Install globally (available across all projects)
kernels skills add cuda-kernels --global

# Overwrite an existing installation
kernels skills add cuda-kernels --claude --force
```

Once installed, prompt your agent:

```
Build a vectorized RMSNorm kernel for H100 targeting the Qwen3-8B model in transformers.
```

Or, you can go for something more open-ended:

```
Build an optimized attention kernel for H100 targeting the Qwen3-8B model in transformers. Benchmark it against the PyTorch baseline and validate improvements in end-to-end performance.
```

The agent can read the skill, select the right architecture parameters, generate the CUDA source, write the PyTorch bindings, set up `build.toml`, and create a benchmark script.

If you're working on more complex kernels or architecture-specific optimizations that aren't covered in the skill, it still supplies the fundamental building blocks and patterns to get you started. We are also open to contributions to the [skill itself](https://github.com/huggingface/kernels/tree/main/.docs/skills).

## What is in the skill

The skill is roughly **550 tokens** of structured guidance plus reference scripts, GPU optimization guides, troubleshooting docs, and complete working examples. Agentic coding tools like Codex and Claude can read this and produce a working kernel project.

It covers:

- Architecture-aware optimization for NVIDIA H100, A100, and T4 GPUs (compute capabilities, memory bandwidth, shared memory sizes, block sizing)
- Integration patterns for both `diffusers` and `transformers`, including the pitfalls specific to each library
- Kernel templates with vectorized memory access patterns for BF16, FP16, and FP32
- Benchmarking workflows for both isolated kernel micro-benchmarks and end-to-end pipeline comparisons
- HuggingFace Kernel Hub integration via `get_kernel` for loading community kernels

```
.claude/skills/cuda-kernels/
├── SKILL.md # Main instructions (~550 tokens)
├── scripts/
│ ├── benchmark_example.py # End-to-end benchmark template
│ ├── benchmark_rmsnorm.py # Isolated kernel micro-benchmark
│ ├── ltx_kernel_injection_example.py # Diffusers integration pattern
│ ├── transformers_injection_example.py # Transformers integration pattern
│ └── huggingface_kernels_example.py # Kernel Hub integration
└── references/
├── diffusers-integration.md # Diffusers guide with pitfalls
├── transformers-integration.md # Transformers guide
├── huggingface-kernels-integration.md
├── h100-optimization-guide.md
├── a100-optimization-guide.md
├── t4-optimization-guide.md
├── kernel-templates.md
└── troubleshooting.md
```

When an agent loads this, it gets everything it needs to go from "write me an RMSNorm kernel" to a buildable, benchmarkable project. It greps and globs the skill to find the relevant files and directories, so it's important to structure the skill so that the right reference is easy to find.

The agent is instructed to generate kernels that conform to the templates in `references/kernel-templates.md` and produce a complete kernel project:

```
examples/your_model/
├── kernel_src/
│ └── rmsnorm.cu # Vectorized CUDA kernel
├── torch-ext/
│ ├── your_kernels/__init__.py
│ └── torch_binding.cpp # PyTorch C++ bindings
├── benchmark_rmsnorm.py # Micro-benchmark script
├── build.toml # kernel-builder config
├── setup.py # pip install -e .
└── pyproject.toml
```

We tested this on two real targets.

## Benchmarking the kernels: Diffusers (LTX-Video on H100)

The agent built RMSNorm, RoPE 3D, GEGLU, and AdaLN kernels for [LTX-Video](https://huggingface.co/Lightricks/LTX-Video), a video generation pipeline from `diffusers`. The full example is at `examples/ltx_video/`. We optimized the RMSNorm kernel for H100. Both benchmarks were run on an H100 80GB HBM3 in BFloat16 precision.

If you want to check out the generated kernel, go to [this example](https://github.com/burtenshaw/kernel-skill/tree/main/examples/ltx_video).

### Isolated RMSNorm benchmark

First, we compare the isolated RMSNorm kernel performance against the PyTorch baseline. This is the main speedup in the optimized pipeline.

![isolated rmsnorm benchmark ltx-video](https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/kernels-skill-benchmark/rmsnorm_ltx_video.png)

<details>
<summary>Table</summary>

| Shape | Custom (ms) | PyTorch (ms) | Speedup |
| :---- | :---: | :---: | :---: |
| [1x1024x2048] | 0.039 | 0.064 | **1.64x** |
| [2x1024x2048] | 0.040 | 0.073 | **1.82x** |
| [4x1024x2048] | 0.052 | 0.093 | **1.78x** |
| [1x4096x2048] | 0.052 | 0.093 | **1.79x** |
| [2x4096x3072] | 0.102 | 0.209 | **2.04x** |
| [1x8192x2048] | 0.083 | 0.150 | **1.81x** |
| [4x4096x3072] | 0.173 | 0.393 | **2.26x** |

**Average speedup: 1.88x**, with bandwidth efficiency at 34.7% of the H100's theoretical peak (3,350 GB/s).

</details>
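
These isolated numbers come from the project's micro-benchmark script; a minimal sketch of that kind of measurement (CUDA-event timing, with `custom_rmsnorm` standing in for the compiled kernel's entry point, and shapes chosen for illustration) looks like this:

```py
import torch

# A minimal sketch of an isolated RMSNorm micro-benchmark; `custom_rmsnorm` is a
# stand-in for the compiled kernel's entry point.

def rmsnorm_reference(x: torch.Tensor, weight: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """PyTorch baseline used for the comparison."""
    variance = x.float().pow(2).mean(-1, keepdim=True)
    return (x.float() * torch.rsqrt(variance + eps)).to(x.dtype) * weight

def time_ms(fn, *args, warmup: int = 10, iters: int = 100) -> float:
    """Average GPU time per call in milliseconds, measured with CUDA events."""
    for _ in range(warmup):
        fn(*args)
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    torch.cuda.synchronize()
    start.record()
    for _ in range(iters):
        fn(*args)
    end.record()
    torch.cuda.synchronize()
    return start.elapsed_time(end) / iters

x = torch.randn(4, 4096, 3072, device="cuda", dtype=torch.bfloat16)
weight = torch.ones(3072, device="cuda", dtype=torch.bfloat16)
baseline_ms = time_ms(rmsnorm_reference, x, weight)
# custom_ms = time_ms(custom_rmsnorm, x, weight)  # hypothetical compiled kernel
print(f"PyTorch baseline: {baseline_ms:.3f} ms")
```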

### End-to-end video generation (49 frames, 30 steps, H100 80GB)

Next, we compare the end-to-end video generation performance of the optimized kernels against the baseline (no compile) and the `torch.compile` baseline.

![e2e benchmark ltx-video](https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/kernels-skill-benchmark/e2e_ltx_video.png)

<details>
<summary>Table</summary>

| Configuration | Time (s) | it/s | Speedup |
| :---- | :---: | :---: | :---: |
| Baseline (no compile) | 2.87 | 12.58 | 1.00x |
| **Generated Optimized Kernels** | 2.70 | 13.52 | **1.06x** |
| Baseline + torch.compile | 2.14 | 19.05 | 1.34x |
| Optimized + torch.compile | 2.01 | 18.45 | 1.43x |

</details>

RMSNorm accounts for ~5% of total compute in LTX-Video. The remaining time is spent in attention, linear projections, and VAE decode. The 6% end-to-end speedup from a single kernel type is consistent with that profile.
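
For reference, the three configurations above are set up roughly like this. This is a sketch, not the exact benchmark script: the prompt is illustrative, and `swap_rmsnorm` refers to an injection helper like the one sketched in the Qwen3 section below.

```py
import torch
from diffusers import DiffusionPipeline

# A sketch of the three end-to-end configurations (illustrative prompt; not the
# exact benchmark script the agent generated).
pipe = DiffusionPipeline.from_pretrained(
    "Lightricks/LTX-Video", torch_dtype=torch.bfloat16
).to("cuda")

# "Generated Optimized Kernels": inject custom RMSNorm modules into the transformer,
# e.g. with a module-swap helper like the one sketched in the Qwen3 section below.
# swap_rmsnorm(pipe.transformer)

# "+ torch.compile": additionally compile the denoising transformer.
# pipe.transformer = torch.compile(pipe.transformer)

video = pipe(
    prompt="a drone shot over a coastal town at sunset",
    num_frames=49,
    num_inference_steps=30,
).frames[0]
```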

## Benchmarking the kernels: Transformers (Qwen3-8B on H100)

The agent built an RMSNorm kernel for [Qwen3-8B](https://huggingface.co/Qwen/Qwen3-8B), a large language model from `transformers` with 65 RMSNorm modules across 32 layers. The full example is at `examples/qwen3_8b/`. We optimized the RMSNorm kernel for H100. Benchmarks were run on an H100 80GB HBM3 in BFloat16 precision.

If you want to explore the kernel, check it out [here](https://github.com/burtenshaw/kernel-skill/tree/main/examples/qwen3_8b).
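
The integration itself is essentially a module swap: every RMSNorm in the model gets replaced by a drop-in module backed by the custom kernel. Below is a minimal sketch of that pattern; the class name, attribute lookups, and the pure-PyTorch forward are assumptions for illustration, and the skill's `scripts/transformers_injection_example.py` covers the real flow.

```py
import torch
from torch import nn
from transformers import AutoModelForCausalLM

# A minimal sketch of the injection pattern (illustrative names; see the skill's
# scripts/transformers_injection_example.py for the full flow).

class CustomRMSNorm(nn.Module):
    """Drop-in replacement that would call the custom CUDA kernel in forward()."""

    def __init__(self, weight: torch.Tensor, eps: float):
        super().__init__()
        self.weight = nn.Parameter(weight.detach().clone())
        self.eps = eps

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Stand-in reference math; a real integration calls the compiled kernel,
        # e.g. `return custom_rmsnorm(x, self.weight, self.eps)`.
        variance = x.float().pow(2).mean(-1, keepdim=True)
        return (x.float() * torch.rsqrt(variance + self.eps)).to(x.dtype) * self.weight

def swap_rmsnorm(model: nn.Module) -> int:
    """Replace every *RMSNorm child module with CustomRMSNorm; returns the count."""
    swapped = 0
    for module in list(model.modules()):
        for name, child in list(module.named_children()):
            if child.__class__.__name__.endswith("RMSNorm"):
                eps = getattr(child, "variance_epsilon", 1e-6)
                setattr(module, name, CustomRMSNorm(child.weight, eps))
                swapped += 1
    return swapped

model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen3-8B", torch_dtype=torch.bfloat16).to("cuda")
print(f"Swapped {swap_rmsnorm(model)} RMSNorm modules")  # expect 65 for Qwen3-8B
```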

### Isolated RMSNorm benchmark

Once again, we compare the isolated RMSNorm kernel performance against the PyTorch baseline.

![isolated rmsnorm benchmark qwen3-8b](https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/kernels-skill-benchmark/rmsnorm_qwen3.png)

**Average speedup: 1.94x**, with bandwidth efficiency at 22.3% of the H100's theoretical peak (3,350 GB/s).

<details>
<summary>Table</summary>

| Shape | Custom (ms) | PyTorch (ms) | Speedup |
| :---- | :---: | :---: | :---: |
| [1x128x4096] | 0.040 | 0.062 | **1.58x** |
| [1x512x4096] | 0.038 | 0.064 | **1.69x** |
| [1x1024x4096] | 0.037 | 0.071 | **1.90x** |
| [1x2048x4096] | 0.045 | 0.091 | **2.03x** |
| [1x4096x4096] | 0.071 | 0.150 | **2.12x** |
| [4x512x4096] | 0.056 | 0.093 | **1.67x** |
| [8x256x4096] | 0.045 | 0.092 | **2.06x** |
| [1x8192x4096] | 0.109 | 0.269 | **2.47x** |

</details>

Speedup scales with sequence length: 1.58x at 128 tokens, 2.47x at 8192 tokens. For long-context inference, the custom kernel roughly halves RMSNorm latency.

## Publishing your kernel to the Hub

The agent gives you a working kernel. The [Kernel Hub](https://huggingface.co/kernels-community) lets you share it so anyone can load it without compilation. Here is the full path from agent output to published kernel.

### 1. Verify the project structure

The agent produces a project that already follows the [kernel-builder](https://huggingface.co/docs/kernels/en/builder/writing-kernels) layout:

```
your_kernel/
├── build.toml # Build configuration
├── kernel_src/
│ └── rmsnorm.cu # CUDA kernel source
└── torch-ext/
├── torch_binding.cpp # Registers Torch ops
└── your_kernels/
└── __init__.py # Python API wrapping _ops
```

The `build.toml` tells `kernel-builder` what to build. The agent generates this for you, including the correct `cuda-capabilities` for your target GPU:

```
[general]
name = "your_kernels"
backends = ["cuda"]

[torch]
src = ["torch-ext/torch_binding.cpp"]

[kernel.rmsnorm]
backend = "cuda"
src = ["kernel_src/rmsnorm.cu"]
depends = ["torch"]
cuda-capabilities = ["9.0"] # H100
```

### 2. Build all variants with Nix

Kernel Hub kernels must support all recent PyTorch and CUDA configurations. The kernel-builder Nix flake handles this automatically. Copy the [example `flake.nix`](https://github.com/huggingface/kernels/blob/main/builder/examples/relu/flake.nix) into your project and run:

```shell
nix flake update
nix run .#build-and-copy -L
```

This builds the kernel for every required PyTorch/CUDA variant and places the results in `build/`. For faster builds, enable the HuggingFace Nix cache:

```shell
nix run nixpkgs#cachix -- use huggingface
```

### 3. Create a Hub repo and push

Create a model repo on the Hub and upload the built kernel:

```shell
huggingface-cli repo create your-org/your-kernel --type model
huggingface-cli upload your-org/your-kernel ./build
```

### 4. Others load it in one line

Once published, anyone can use your kernel with zero compilation:

```py
from kernels import get_kernel

rmsnorm = get_kernel("your-org/your-kernel")
```

`get_kernel` detects the user's Python, PyTorch, and CUDA versions and downloads the matching pre-compiled binary. No builds, no flags, typically ready in seconds.
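
After that, the loaded module exposes whatever ops the kernel registered in `torch_binding.cpp`. A hypothetical call, where the repo id, op name, and shapes are placeholders rather than a published kernel:

```py
import torch
from kernels import get_kernel

# Hypothetical usage of a published RMSNorm kernel; the repo id and op name
# are placeholders for whatever your kernel actually registers.
rmsnorm_kernel = get_kernel("your-org/your-kernel")

x = torch.randn(1, 2048, 4096, device="cuda", dtype=torch.bfloat16)
weight = torch.ones(4096, device="cuda", dtype=torch.bfloat16)
out = rmsnorm_kernel.rms_norm(x, weight, 1e-6)
```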

The skill and the Hub are complementary. The skill handles development. The Hub handles distribution. Build a kernel with the skill, validate it with the benchmark scripts, publish it to the Hub, and it becomes a one-liner for everyone else.

## Conclusion

We built an agent skill that teaches coding agents how to write production CUDA kernels, pointed Claude and Codex at two real targets, a **diffusers** pipeline and a **transformers** model, and got working kernels for both, with correct PyTorch bindings and benchmarks, end to end. The generated kernels deliver speedups both in isolated micro-benchmarks and in end-to-end runs.
> **Member:** 🔥


## Resources

- [CUDA Kernels Skill](https://github.com/huggingface/kernels/tree/main/.docs/skills)
- [LTX-Video Example](https://github.com/burtenshaw/kernel-skill/tree/main/examples/ltx_video)
- [Qwen3-8B Example](https://github.com/burtenshaw/kernel-skill/tree/main/examples/qwen3_8b)
- [HuggingFace Kernel Hub Blog](https://huggingface.co/blog/hello-hf-kernels)
- [Kernels Hub in TRL](https://huggingface.co/docs/trl/en/kernels_hub)
- [We Got Claude to Fine-Tune an Open Source LLM](https://huggingface.co/blog/hf-skills-training)
- [We Got Claude to Teach Open Models](https://huggingface.co/blog/upskill)
- [HuggingFace Kernels Community](https://huggingface.co/kernels-community)