```
challenges/<difficulty>/<number>_<name>/
├── challenge.html        # Problem description
├── challenge.py          # Reference impl, test cases, metadata
└── starter/              # One starter file per framework
    ├── starter.cu
    ├── starter.cute.py
    ├── starter.jax.py
    ├── starter.mojo
    ├── starter.pytorch.py
    └── starter.triton.py
```
- Naming: `<number>_<challenge_name>` — sequential integer, lowercase with underscores
- Linting & contribution process: see CONTRIBUTING.md
| Level | Parameters | Concepts | Examples |
|---|---|---|---|
| Easy | 1-2 in + output | Single concept, basic parallelization | Vector add, transpose, element-wise ops |
| Medium | 2-4 in/out | Memory hierarchies, reductions, tiling | Tiled matmul, 2D convolution |
| Hard | Multiple with complex relationships | Warp ops, cooperative groups, heavy perf | GPU sorting, graph algorithms |
Must inherit from `ChallengeBase` and follow Black formatting (line length 100).
Reference files to read for patterns:
- Base class: `challenges/core/challenge_base.py`
- Simple example: `challenges/easy/1_vector_add/challenge.py`
- Matrix example: `challenges/easy/3_matrix_transpose/challenge.py`
- Medium example: `challenges/medium/22_gemm/challenge.py`
```python
super().__init__(
    name="Challenge Display Name",
    atol=1e-05,  # Absolute tolerance (float32 default)
    rtol=1e-05,  # Relative tolerance (float32 default)
    num_gpus=1,
    access_tier="free",  # "free" or "premium"
)
```
- Same parameters as the user's `solve` function
- Must include assertions on shape, dtype, and device (`cuda`)
- Use PyTorch operations (not Python loops) for performance
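A minimal sketch of a `reference_impl` that follows these rules, assuming a hypothetical vector-add challenge with parameters `a`, `b`, `output`, `n` (shown as a free function for brevity; in `challenge.py` it is a method on the Challenge class):

```python
import torch

# Hypothetical vector-add reference implementation.
def reference_impl(a: torch.Tensor, b: torch.Tensor, output: torch.Tensor, n: int):
    # Required assertions on shape, dtype, and device
    assert a.shape == b.shape == output.shape == (n,)
    assert a.dtype == b.dtype == output.dtype == torch.float32
    assert a.is_cuda and b.is_cuda and output.is_cuda
    # PyTorch operation, not a Python loop
    torch.add(a, b, out=output)
```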
Maps parameter names to `(ctype, direction)` tuples.
| ctype | Use for |
|---|---|
| `ctypes.POINTER(ctypes.c_float)` | Tensor data |
| `ctypes.c_size_t` | Sizes/dimensions |
| `ctypes.c_int` | Integer parameters |
| Direction | Meaning |
|---|---|
| `"in"` | Read-only input |
| `"out"` | Write-only output |
| `"inout"` | Read and write |
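Combining the two tables, a `get_solve_signature` for a hypothetical vector-add challenge (parameter names `a`, `b`, `output`, `n` are assumptions) might look like:

```python
import ctypes

# Hypothetical signature: two input tensors, one output tensor, one size.
def get_solve_signature(self):
    return {
        "a": (ctypes.POINTER(ctypes.c_float), "in"),
        "b": (ctypes.POINTER(ctypes.c_float), "in"),
        "output": (ctypes.POINTER(ctypes.c_float), "out"),
        "n": (ctypes.c_size_t, "in"),
    }
```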
One small, human-readable test case for display. Use literal tensor values.
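For the same hypothetical vector-add challenge, an example test with literal values a reader can verify by eye might be:

```python
import torch

# Hypothetical example test: small literal tensors (1+10=11, 2+20=22, ...).
def generate_example_test(self):
    return {
        "a": torch.tensor([1.0, 2.0, 3.0, 4.0]),
        "b": torch.tensor([10.0, 20.0, 30.0, 40.0]),
        "n": 4,
    }
```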
7-10 test cases with this coverage:
| Category | Sizes | Count |
|---|---|---|
| Edge cases | 1, 2, 3, 4 | 2-3 |
| Power-of-2 | 16, 32, 64, 128, 256, 512, 1024 | 2-3 |
| Non-power-of-2 | 30, 100, 255 | 2-3 |
| Realistic | 1K-10K | 1-2 |
Must also include: zero inputs, negative numbers, mixed values.
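One way to cover this table, again assuming a hypothetical vector-add challenge (exact sizes and parameter names are illustrative, not prescribed):

```python
import torch

# Hypothetical functional tests: sizes follow the coverage table,
# plus explicit zero-input and negative-value cases (9 cases total).
def generate_functional_test(self):
    cases = []
    for n in [1, 3, 64, 1024, 100, 255, 5000]:  # edge, power-of-2, non-power-of-2, realistic
        cases.append({"a": torch.randn(n), "b": torch.randn(n), "n": n})
    cases.append({"a": torch.zeros(16), "b": torch.zeros(16), "n": 16})          # zero inputs
    cases.append({"a": -torch.ones(16), "b": torch.full((16,), -2.0), "n": 16})  # negatives
    return cases
```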
One large test case. Size must fit 5x within 16GB (Tesla T4 VRAM).
| Operation type | Size |
|---|---|
| 1D | 10M-100M elements |
| 2D | 4K×4K to 8K×8K |
| Complex | 1M-10M |
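A quick back-of-the-envelope check for the 5x-in-16GB rule, assuming float32 data and a hypothetical 1D challenge with three 100M-element buffers:

```python
# Each float32 element is 4 bytes; a, b, output are 3 buffers of n elements.
n = 100_000_000
num_buffers = 3
bytes_needed = n * num_buffers * 4  # 1.2 GB total footprint
vram = 16 * 1024**3                 # Tesla T4: 16 GB
assert bytes_needed * 5 < vram      # fits 5x, so this size is acceptable
```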
HTML fragment with four required sections:
- Problem description — 2-3 sentences: what the function does, data types, constraints
- Implementation requirements — signature unchanged, no external libs, output location
- Examples — 1-3 examples with Input/Output. The first example must match `generate_example_test()`. Format depends on data shape:
  - 1D data (vectors, sequences): use `<pre>` blocks
  - 2D/3D data (matrices, grids): use LaTeX `\begin{bmatrix}` inside `<p>` blocks
  - Be consistent within a single challenge
- Constraints — size bounds, data types, value ranges, and performance test size
Formatting rules:
- `<code>` for variables/functions; `<pre>` for 1D examples, LaTeX `\begin{bmatrix}` for matrices
- `≤`, `≥`, `×` for math symbols
- LaTeX underscores: inside `\text{}`, use plain `_` (not `\_`). The backslash-escaped form renders literally as `\_` in MathJax/KaTeX.
- Performance test size bullet: must include a bullet documenting the exact parameters used in `generate_performance_test()`, formatted as `<li>Performance is measured with <code>param</code> = value</li>`
- Use commas for numbers ≥ 1,000 (e.g., `25,000,000`)
- Multiple parameters: `<code>M</code> = 8,192, <code>N</code> = 6,144, <code>K</code> = 4,096`
Reference: `challenges/easy/2_matrix_multiplication/challenge.html`
Must compile/run without errors but not solve the problem. No comments except the parameter description comment (e.g., `// A, B, C are device pointers`).
Rules:
- Easy problems: provide a kernel scaffold with grid/block setup
- Medium/Hard problems: empty `solve` function only
- Match the exact style of existing starters in each framework
Reference files (read these for exact format):
- CUDA: `challenges/easy/1_vector_add/starter/starter.cu`
- PyTorch: `challenges/easy/1_vector_add/starter/starter.pytorch.py`
- Triton: `challenges/easy/1_vector_add/starter/starter.triton.py`
- JAX: `challenges/easy/1_vector_add/starter/starter.jax.py`
- CuTe: `challenges/easy/1_vector_add/starter/starter.cute.py`
- Mojo: `challenges/easy/1_vector_add/starter/starter.mojo`
Each starter file must have exactly one comment describing the parameters, placed directly before the solve function. Use these exact templates:
| Framework | Comment template |
|---|---|
| CUDA | `// <params> are device pointers` |
| Mojo | `# <params> are device pointers` |
| PyTorch, Triton, CuTe | `# <params> are tensors on the GPU` |
| JAX | `# <params> are tensors on GPU` (plus `# return output tensor directly` inside the body) |
Rules:
- Easy challenges: include the parenthetical `(i.e. pointers to memory on the GPU)` for CUDA/Mojo (matches the vector_add reference)
- Medium/Hard challenges: omit the parenthetical — just `are device pointers`
- No other comments anywhere in the starter file
- List only input/output tensor parameter names, not size parameters
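Putting these rules together, a Medium/Hard PyTorch starter would reduce to something like the following sketch (parameter names `a`, `b`, `output`, `n` are hypothetical; match the exact style of the existing starters rather than this):

```python
import torch

# a, b, output are tensors on the GPU
def solve(a: torch.Tensor, b: torch.Tensor, output: torch.Tensor, n: int):
    pass
```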
- Create directory: `mkdir -p challenges/<difficulty>/<number>_<name>/starter`
- Write `challenge.py` — inherit `ChallengeBase`, implement all 6 methods
- Write `challenge.html` — all 4 sections
- Write starter code for all 6 frameworks
- Lint: `pre-commit run --all-files`
Use `scripts/run_challenge.py` to submit solutions against the live platform when creating or reviewing challenges. The script reads `challenge.py` from the challenge directory and sends it along with the solution.

`python scripts/run_challenge.py path/to/challenge_dir --language cuda --action run`

Rules:
- GPU: always use `--gpu "NVIDIA TESLA T4"` (the default). Do not use any other GPU.
- Submission limit: you may only run this script 5 times per session. Use submissions carefully — verify your challenge locally (imports, assertions, lint) before submitting.
- Workflow: write a CUDA solution in `solution/solution.cu`, run the script with `--action run` to validate, and only use `--action submit` when confident. Do not commit the solution file to the PR.
Verify every item before submitting. This is the single source of truth — workflow prompts reference this section.
- Starts with `<p>` (problem description) — never `<h1>`
- Has `<h2>` sections for: Implementation Requirements, Example(s), Constraints (not `<h1>` or `<h3>`)
- First example matches `generate_example_test()` values
- Examples use `<pre>` for 1D data, LaTeX `\begin{bmatrix}` for matrices — consistent, never mixed
- Constraints includes a `Performance is measured with <code>param</code> = value` bullet matching `generate_performance_test()`
- `class Challenge` inherits `ChallengeBase`
- `__init__` calls `super().__init__()` with name, atol, rtol, num_gpus, access_tier
- `reference_impl` has assertions on shape, dtype, and device
- All 6 methods present: `__init__`, `reference_impl`, `get_solve_signature`, `generate_example_test`, `generate_functional_test`, `generate_performance_test`
- `generate_functional_test` returns 7-10 cases: edge cases (1-4 elements), powers-of-2, non-powers-of-2, realistic sizes, zeros, negatives
- `generate_performance_test` fits 5x in 16GB VRAM (Tesla T4)
- All 6 files present: `.cu`, `.pytorch.py`, `.triton.py`, `.jax.py`, `.cute.py`, `.mojo`
- Exactly 1 parameter description comment per file, no other comments
- CUDA/Mojo use "device pointers"; easy challenges include `(i.e. pointers to memory on the GPU)`, medium/hard omit it
- Python frameworks use "tensors on the GPU"; JAX also has `# return output tensor directly`
- Starters compile/run but do NOT produce correct output
- Directory follows `<number>_<name>` convention
- Linting passes: `pre-commit run --all-files`