flashinfer-bench-starter-kit

Create high-performance GPU kernels for state-of-the-art LLM architectures on NVIDIA Blackwell GPUs with humans and/or AI agents.

[Sponsor logos: NVIDIA · Modal · MLSys · FlashInfer · FlashInfer-Bench]


FlashInfer-Bench is our official framework to evaluate your AI-generated kernels.

Updates

  • 2026.02.05: The full dataset of definitions and workloads is released on HuggingFace

Competition Tracks

The competition features three tracks, each targeting a critical LLM operation:

Track             Description
fused_moe         Fused Mixture-of-Experts kernel for efficient expert routing and computation
sparse_attention  Sparse attention mechanisms for long-context inference
gated_delta_net   Gated delta network operations for efficient state updates

Fork this template once per track you want to compete in (separate repos for each track).

Getting Started

1. Fork This Template

Click "Use this template" or fork this repository to create your solution repo.

2. Install Dependencies

conda create -n fi-bench python=3.12
conda activate fi-bench
pip install flashinfer-bench modal

3. Download the TraceSet

We provide kernel definitions and workloads in FlashInfer-Trace format. Clone the competition dataset from HuggingFace:

git lfs install
git clone https://huggingface.co/datasets/flashinfer-ai/mlsys26-contest

Point the FIB_DATASET_PATH environment variable at the cloned dataset:

export FIB_DATASET_PATH=/path/to/flashinfer-trace

4. Configure Your Solution

Edit config.toml to set your track and team info:

[solution]
name = "my-team-solution-v1"      # Solution name
definition = "fused_moe"          # Track: fused_moe | sparse_attention | gated_delta_net
author = "team-name"              # Team/author name

[build]
language = "triton"               # triton | cuda
entry_point = "kernel"            # Kernel function name
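To catch typos early, you can sanity-check the file with Python's built-in TOML parser. A minimal sketch, assuming Python 3.11+ (the fi-bench environment above uses 3.12) and only the keys shown in the example config:

import tomllib

# Parse config.toml and verify the fields the benchmark expects.
with open("config.toml", "rb") as f:
    cfg = tomllib.load(f)

assert cfg["solution"]["definition"] in {"fused_moe", "sparse_attention", "gated_delta_net"}
assert cfg["build"]["language"] in {"triton", "cuda"}
print(f"OK: {cfg['solution']['name']} -> {cfg['solution']['definition']}")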

5. Implement Your Kernel

For Triton: Edit solution/triton/kernel.py with your implementation.

For CUDA: Edit solution/cuda/kernel.cu and solution/cuda/binding.py with your implementation.
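For the Triton path, the file pairs a @triton.jit kernel with a Python entry point whose name matches entry_point in config.toml. A rough sketch only; the elementwise body and argument list below are illustrative placeholders, not the actual track signature, which comes from the kernel definition in the traceset:

import triton
import triton.language as tl

@triton.jit
def _scale_kernel(x_ptr, out_ptr, n, BLOCK: tl.constexpr):
    # Toy elementwise body; your real kernel implements the track's definition.
    offs = tl.program_id(0) * BLOCK + tl.arange(0, BLOCK)
    mask = offs < n
    tl.store(out_ptr + offs, tl.load(x_ptr + offs, mask=mask) * 2.0, mask=mask)

def kernel(x, out):
    # Entry point: the name must match entry_point = "kernel" in config.toml.
    n = x.numel()
    _scale_kernel[(triton.cdiv(n, 1024),)](x, out, n, BLOCK=1024)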

Development Workflow

Pack Your Solution

Generate solution.json from your source files:

python scripts/pack_solution.py

Run Local Benchmarks

Test your solution on your local GPU:

python scripts/run_local.py

Requires a local CUDA-capable GPU and the FIB_DATASET_PATH environment variable.
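A quick pre-flight check before benchmarking (a sketch; assumes PyTorch is available in your environment):

import os
import torch

# Verify the two prerequisites: a CUDA device and the traceset path.
assert torch.cuda.is_available(), "no CUDA-capable GPU visible"
path = os.environ.get("FIB_DATASET_PATH")
assert path and os.path.isdir(path), "FIB_DATASET_PATH is unset or invalid"
print(f"GPU: {torch.cuda.get_device_name(0)}, traceset: {path}")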

Run Cloud Benchmarks (Modal)

Test your solution on NVIDIA B200 GPUs via Modal:

One-time setup:

modal setup
modal volume create flashinfer-trace
modal volume put flashinfer-trace /path/to/flashinfer-trace

Run benchmark:

modal run scripts/run_modal.py

Submission

To submit your solution for evaluation:

  1. Ensure your implementation is complete and tested
  2. Run python scripts/pack_solution.py to generate solution.json
  3. Commit and push your changes
  4. Tag your commit for evaluation (e.g., git tag submission-v1) and push the tag (git push origin submission-v1)

Project Structure

flashinfer-bench-starter-kit/
├── README.md                    # This file
├── config.toml                  # Track configuration (edit this)
├── solution/                    # Solution source files
│   ├── triton/                  # Triton implementation
│   │   └── kernel.py           # Your Triton kernel
│   └── cuda/                    # CUDA implementation
│       ├── kernel.cu           # Your CUDA kernel
│       └── binding.py          # TVM FFI bindings
├── scripts/                     # Utility scripts
│   ├── run_local.py            # Local benchmark runner
│   ├── run_modal.py            # Modal cloud benchmark runner
│   └── pack_solution.py        # Pack source files into solution.json
└── images/                      # Sponsor logos

Additional Resources

Solution Handling API

from flashinfer_bench import BuildSpec
from flashinfer_bench.agents import pack_solution_from_files, extract_solution_to_files

# Pack source files into a Solution object
spec = BuildSpec(
    language="triton",  # or "cuda"
    target_hardware=["cuda"],
    entry_point="my_kernel",
)
solution = pack_solution_from_files(
    path="./my_solution_dir",
    spec=spec,
    name="my_solution_v1",
    definition="fused_moe",
    author="your_name",
)

# Extract a Solution to files in a working directory
extract_solution_to_files(solution, "./output_dir")

Running Sanitizers

from flashinfer_bench.agents import flashinfer_bench_run_sanitizer

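# `solution` is a Solution object (e.g. from pack_solution_from_files above);
# `workload` is a workload loaded from the traceset.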
output = flashinfer_bench_run_sanitizer(
    solution=solution,
    workload=workload,
    sanitizer_types=["memcheck", "racecheck", "synccheck", "initcheck"],
    timeout=300,
)
print(output)

NCU Profiling

from flashinfer_bench.agents import flashinfer_bench_run_ncu

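# `set` selects the Nsight Compute metric set and `page` the report page,
# mirroring ncu's --set and --page command-line options.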
output = flashinfer_bench_run_ncu(
    solution=solution,
    workload=workload,
    set="detailed",
    page="details",
    timeout=120,
)
print(output)

List Available Tools

from flashinfer_bench.agents import get_all_tool_schemas

schemas = get_all_tool_schemas()
# Returns list of OpenAI-compatible function schemas
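For example, to list the exposed tools (assuming each schema follows the usual OpenAI layout of {"type": "function", "function": {...}}):

for schema in schemas:
    # Print each tool's name from the OpenAI-style function schema.
    print(schema["function"]["name"])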

Notes

Kernel Signature Requirements

When implementing kernels using Destination Passing Style (DPS), ensure you specify the kernel signature type in your BuildSpec and adjust the build configuration accordingly.

Important: Avoid using variadic input arguments in your kernel signatures, as they will fail the builder validation check.
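As a minimal illustration (the argument names here are hypothetical, not the track signature), a DPS entry point writes into a caller-allocated output instead of allocating and returning one:

import torch

# DPS: the caller allocates `out`; the kernel fills it and returns nothing.
def kernel(x: torch.Tensor, out: torch.Tensor) -> None:
    out.copy_(x * 2.0)

# Rejected by builder validation: variadic inputs such as `def kernel(*tensors)`.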

CUDA Kernel Bindings

For CUDA kernel implementations, we recommend using TVM FFI for Python bindings. The flashinfer_bench.agents module provides TVM FFI agent instruction prompts to assist with development.

About

FlashInfer Bench @ MLSys 2026: Building AI agents to write high-performance GPU kernels
