CUDA-Agent is the first known RL-trained model to surpass advanced models such as Claude Opus-4.6 and Gemini 3 Pro on high-performance CUDA kernel generation. It achieves state-of-the-art results on KernelBench, consistently outperforming the torch.compile baseline across difficulty levels, with especially strong gains on the hardest cases. To support the LLM-based CUDA generation community, we have released our training data, an expert-designed SKILL.md, and our agent environment.
We released the training dataset CUDA-Agent-Ops-6K:
- Dataset URL: BytedTsinghua-SIA/CUDA-Agent-Ops-6K
- Scale: 6,000 training samples
- Construction pipeline:
  - Collect reference operators from `torch` and `transformers`
  - Use an LLM to compose multiple operators into fused tasks
  - Apply rule-based filtering to keep executable, deterministic, and non-trivial samples
- Filtering criteria:
  - Must execute correctly in both eager mode and `torch.compile`
  - Remove stochastic operators and degenerate outputs
  - Control the runtime range and remove samples highly similar to KernelBench tests to reduce contamination risk
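The filtering criteria above can be sketched as a small rule-based check. This is a hypothetical illustration, not the released pipeline: the names `keep_sample`, `run_eager`, `run_compiled`, and the runtime thresholds are all stand-ins, and the element-wise comparison stands in for `torch.allclose`.

```python
import math

# Illustrative thresholds for the "control runtime range" rule (assumed values).
MIN_RUNTIME_MS, MAX_RUNTIME_MS = 0.05, 50.0

def outputs_match(a, b, tol=1e-4):
    """Element-wise comparison standing in for torch.allclose."""
    return len(a) == len(b) and all(
        math.isclose(x, y, abs_tol=tol) for x, y in zip(a, b)
    )

def keep_sample(run_eager, run_compiled, runtime_ms):
    # 1) Must execute correctly in both eager mode and torch.compile.
    try:
        eager_out = run_eager()
        compiled_out = run_compiled()
    except Exception:
        return False
    if not outputs_match(eager_out, compiled_out):
        return False
    # 2) Reject stochastic operators: two eager runs must agree exactly.
    if not outputs_match(eager_out, run_eager(), tol=0.0):
        return False
    # 3) Reject degenerate outputs (all zeros, NaN, or inf).
    if all(x == 0 for x in eager_out) or any(not math.isfinite(x) for x in eager_out):
        return False
    # 4) Keep runtime within the controlled range.
    return MIN_RUNTIME_MS <= runtime_ms <= MAX_RUNTIME_MS

# A deterministic sample passes; a non-deterministic one is dropped.
det = lambda: [1.0, 2.0, 3.0]
_calls = [0]
def sto():
    _calls[0] += 1
    return [float(_calls[0])]   # changes on every call

print(keep_sample(det, det, runtime_ms=1.0))   # True
print(keep_sample(sto, sto, runtime_ms=1.0))   # False
```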
`agent_workdir` is a standardized example agent workspace for the full loop:
implement CUDA kernels -> compile -> verify correctness -> profile performance -> iterate.
Key files in this directory:
- `SKILL.md`: workflow constraints and optimization rules for agent execution
- `model.py`: original PyTorch baseline model
- `model_new.py`: optimized model using the custom CUDA extension
- `binding.cpp` / `binding_registry.h`: shared Python binding registration infrastructure
- `kernels/`: custom CUDA/C++ kernels and their bindings
- `utils/compile.py` + `utils/compile.sh`: extension build scripts
- `utils/verification.py`: correctness validation script
- `utils/profiling.py`: performance comparison against baseline and `torch.compile`
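The correctness check that `utils/verification.py` performs can be sketched as: run the baseline model and the optimized model on the same inputs and compare outputs within a tolerance. The class names, the `forward` interface, and the fused multiply-add workload below are illustrative stand-ins, not the released code.

```python
class BaselineModel:
    """Stand-in for model.py: a fused multiply-add in pure Python."""
    def forward(self, xs, scale=2.0, bias=1.0):
        return [scale * x + bias for x in xs]

class OptimizedModel:
    """Stand-in for model_new.py: same math, as a custom kernel would compute it."""
    def forward(self, xs, scale=2.0, bias=1.0):
        return [x * scale + bias for x in xs]

def verify(baseline, optimized, inputs, atol=1e-5):
    """Compare optimized outputs against the baseline reference within atol."""
    ref = baseline.forward(inputs)
    out = optimized.forward(inputs)
    assert len(ref) == len(out), "shape mismatch"
    worst = max(abs(r - o) for r, o in zip(ref, out))
    ok = worst <= atol
    print(f"max abs error = {worst:.2e} -> {'PASS' if ok else 'FAIL'}")
    return ok

verify(BaselineModel(), OptimizedModel(), [0.5 * i for i in range(8)])  # PASS
```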
Common commands (run inside `agent_workdir`):

```bash
bash utils/compile.sh           # build the CUDA extension
python3 -m utils.verification   # validate correctness against the baseline
python3 -m utils.profiling      # compare performance against baseline and torch.compile
```
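The implement -> compile -> verify -> profile -> iterate loop can be driven programmatically with these commands. The sketch below is hypothetical: `optimize_loop`, `max_iters`, and the `revise_kernels` callback are illustrative, standing in for the agent editing `kernels/` between attempts.

```python
import subprocess

# The three entry points from "Common commands" above.
COMMANDS = {
    "compile": ["bash", "utils/compile.sh"],
    "verify": ["python3", "-m", "utils.verification"],
    "profile": ["python3", "-m", "utils.profiling"],
}

def run_step(name, cwd="agent_workdir"):
    """Run one step and return (succeeded, combined log)."""
    proc = subprocess.run(COMMANDS[name], cwd=cwd, capture_output=True, text=True)
    return proc.returncode == 0, proc.stdout + proc.stderr

def optimize_loop(revise_kernels, max_iters=5, cwd="agent_workdir"):
    """Iterate compile -> verify, feeding failure logs back, then profile once green."""
    for attempt in range(1, max_iters + 1):
        for step in ("compile", "verify"):
            ok, log = run_step(step, cwd)
            if not ok:
                revise_kernels(step, log)   # agent edits kernels/ using the log
                break
        else:
            # Compile and correctness both passed; measure the speedup.
            ok, log = run_step("profile", cwd)
            return attempt, log
    return None, "gave up after max_iters attempts"
```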

