Add challenge 74: GPT-2 Inference (Medium) by shxjames · Pull Request #194 · AlphaGPU/leetgpu-challenges

shxjames · 2026-02-25T02:12:11Z

Summary

Adds challenge 73: GPT-2 Transformer Block (Medium difficulty)

Implements a single GPT-2 124M decoder block: the user receives an input activation tensor and a packed weight buffer, and must compute the full pre-norm transformer block — LayerNorm, 12-head self-attention with combined QKV projection, residual connection, LayerNorm, two-layer FFN with GELU activation, and a second residual connection
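As a compact restatement of the sequence listed above, the pre-norm block can be sketched as the equations below. The symbols ($h$, $y$, $W_{attn}$, $b_{attn}$, $W_{fc}$, $b_{fc}$) are names chosen here for illustration, and the attention output projection is assumed from the standard GPT-2 layout rather than stated in this description. The GELU variant is not specified here, and a later commit in this PR notes that no causal mask is applied.

```latex
\begin{aligned}
h &= x + \mathrm{MultiHeadAttn}(\mathrm{LayerNorm}_1(x)) \\
y &= h + \mathrm{FeedForward}(\mathrm{LayerNorm}_2(h)) \\
\mathrm{head}_i &= \mathrm{softmax}\!\left(\frac{Q_i K_i^{\top}}{\sqrt{d_k}}\right) V_i,
  \qquad d_k = 768 / 12 = 64 \\
\mathrm{MultiHeadAttn}(z) &= \mathrm{Concat}(\mathrm{head}_1, \dots, \mathrm{head}_{12})\, W_{attn} + b_{attn} \\
\mathrm{FeedForward}(z) &= \mathrm{GELU}(z W_{fc} + b_{fc})\, W_{proj} + b_{proj}
\end{aligned}
```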

This is the first challenge that composes multiple ML primitives (LayerNorm, matmul, softmax, GELU) into a complete architectural building block, as opposed to implementing them in isolation

What this teaches

End-to-end kernel composition: fusing or scheduling multiple dependent operations (LayerNorm → matmul → softmax → matmul → GELU → matmul) with intermediate memory management
Packed weight buffer indexing: all 12 parameter tensors (~7M floats / ~28MB) are packed into a single contiguous buffer at documented offsets, mimicking how real inference engines store model weights
Multi-head attention mechanics: splitting projections into heads, per-head scaled dot-product attention, and concatenating results back — the core operation of transformer inference
Memory vs compute tradeoffs: the attention score matrix grows as O(seq_len²) per head, while FFN activations grow as O(seq_len), so each sub-layer calls for a different parallelization strategy
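To make the packed-buffer bullet concrete, here is a minimal sketch of how offsets into a single contiguous weight buffer could be derived. The parameter names, the packing order, and the row-major (in_features, out_features) layout are assumptions made for illustration; the challenge's documented offsets are the authority.

```python
# Hypothetical packing order and names; the challenge's documented
# offsets may differ. Shapes follow GPT-2 124M (d_model=768, ffn_dim=3072).
D_MODEL, FFN_DIM = 768, 3072

PARAM_SHAPES = [
    ("ln1_weight", (D_MODEL,)),
    ("ln1_bias", (D_MODEL,)),
    ("W_qkv", (D_MODEL, 3 * D_MODEL)),
    ("b_qkv", (3 * D_MODEL,)),
    ("W_attn", (D_MODEL, D_MODEL)),
    ("b_attn", (D_MODEL,)),
    ("ln2_weight", (D_MODEL,)),
    ("ln2_bias", (D_MODEL,)),
    ("W_fc", (D_MODEL, FFN_DIM)),
    ("b_fc", (FFN_DIM,)),
    ("W_proj", (FFN_DIM, D_MODEL)),
    ("b_proj", (D_MODEL,)),
]


def numel(shape):
    """Number of elements in a tensor of the given shape."""
    count = 1
    for dim in shape:
        count *= dim
    return count


def build_offsets(param_shapes):
    """Return ({name: start offset in floats}, total float count)."""
    offsets, cursor = {}, 0
    for name, shape in param_shapes:
        offsets[name] = cursor
        cursor += numel(shape)
    return offsets, cursor


def flat_index(offsets, name, i, j, n_cols):
    """Flat buffer index of element [i][j] of a row-major 2D parameter."""
    return offsets[name] + i * n_cols + j


OFFSETS, TOTAL_FLOATS = build_offsets(PARAM_SHAPES)
print(TOTAL_FLOATS)            # total floats across all 12 tensors
print(TOTAL_FLOATS * 4 / 1e6)  # size in MB at 4 bytes per float32
print(flat_index(OFFSETS, "W_qkv", 2, 5, 3 * D_MODEL))
```

Under these assumptions the total comes to 7,087,872 floats, or about 28.35 MB in float32, which matches the ~7M floats / ~28MB figure quoted above.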

Test plan

10 functional tests covering: single token, seq_len 2/3/4 edge cases, 16 and 64 power-of-2, 30 and 100 non-power-of-2, 128 and 256 realistic sizes
Performance test: seq_len=1,024 with GPT-2 124M dimensions (d_model=768, n_heads=12, ffn_dim=3,072)
All dimensions fixed to GPT-2 124M architecture; only seq_len varies
Tolerance atol/rtol=1e-3 to accommodate floating-point accumulation across composed operations
All 6 framework starters present (CUDA, PyTorch, Triton, JAX, CuTe, Mojo)
Checklist in CLAUDE.md verified
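For readers unfamiliar with combined absolute and relative tolerances, the sketch below shows what atol/rtol=1e-3 permits. It assumes the checker applies the common elementwise criterion |actual - expected| <= atol + rtol * |expected| (the rule used by `torch.allclose`); whether the platform uses exactly this rule is an assumption.

```python
ATOL = 1e-3
RTOL = 1e-3


def within_tolerance(actual, expected, atol=ATOL, rtol=RTOL):
    """Assumed criterion: |actual - expected| <= atol + rtol * |expected|."""
    return abs(actual - expected) <= atol + rtol * abs(expected)


def all_close(actual_values, expected_values, atol=ATOL, rtol=RTOL):
    """True when both sequences have equal length and every pair passes."""
    return len(actual_values) == len(expected_values) and all(
        within_tolerance(a, e, atol, rtol)
        for a, e in zip(actual_values, expected_values)
    )


# Near 1.0 the allowed error is about 2e-3; near 100.0 it is about 1e-1.
print(within_tolerance(1.0015, 1.0))
print(within_tolerance(100.05, 100.0))
# Near zero only the absolute term matters, so a 2e-3 miss fails.
print(within_tolerance(0.0, 0.002))
```

The relative term dominates for large activations and the absolute term dominates near zero, which is why the same pair of values is quoted for both.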

shxjames · 2026-02-25T02:20:32Z

@claude rename challenge to 74 and fix linting issues

Challenge 73 is already taken by All-Pairs Shortest Paths on main. Rename medium/73_gpt2_block → medium/74_gpt2_block to avoid conflict. Also add an explicit zero-input test case (x=zeros) to generate_functional_test() to satisfy CLAUDE.md requirement that functional tests include zero inputs. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

claude · 2026-02-25T02:30:22Z

Review: PR #194 — GPT-2 Transformer Block

Issues Fixed

1. Challenge number conflict (Critical)

Challenge 73 is already taken on main by challenges/hard/73_all_pairs_shortest_paths. Renamed 73_gpt2_block → 74_gpt2_block to avoid the conflict.

2. Missing zero-input test case

CLAUDE.md requires generate_functional_test() to include zero inputs. Added an explicit test with x = zeros(4, 768) (replacing a redundant seq_len=3 edge case to stay within the 10-test maximum).

Both fixes have been committed and pushed to the PR branch.

Validation

Wrote a CUDA reference solution and ran it through the platform (--action run). All functional and performance tests passed (atol=rtol=1e-3, seq_len=1,024 perf test).

Checklist Review

challenge.html ✅

Starts with <p> ✓
<h2> sections for Implementation Requirements, Example, Weight Layout, Constraints ✓
Example describes seq_len=4 matching generate_example_test() (random weights make literal display impractical — acceptable for this challenge type) ✓
Performance bullet: seq_len = 1,024 matches generate_performance_test() ✓

challenge.py ✅

Inherits ChallengeBase ✓
super().__init__() with all required params ✓
reference_impl has shape, dtype, and device assertions ✓
All 6 required methods present ✓
generate_functional_test() has 10 cases covering edge (1, 2, 4), zero-input (4), powers-of-2 (16, 64), non-powers-of-2 (30, 100), realistic (128, 256) ✓
generate_performance_test() at seq_len=1,024 fits well within 16GB ✓
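As a rough back-of-the-envelope check of the 16GB claim, the sketch below estimates float32 memory for a naive, unfused implementation that materializes every intermediate at once. The list of buffers counted is an assumption (a real solution may allocate more temporaries, or fewer if it fuses steps); the dimensions are the fixed GPT-2 124M values from the PR description.

```python
D_MODEL, N_HEADS, FFN_DIM = 768, 12, 3072
BYTES_PER_FLOAT = 4  # float32


def weight_floats():
    """Float count of the 12 parameter tensors of one GPT-2 124M block."""
    layer_norms = 2 * (2 * D_MODEL)  # weight and bias for two LayerNorms
    attention = D_MODEL * 3 * D_MODEL + 3 * D_MODEL + D_MODEL * D_MODEL + D_MODEL
    ffn = D_MODEL * FFN_DIM + FFN_DIM + FFN_DIM * D_MODEL + D_MODEL
    return layer_norms + attention + ffn


def estimate_bytes(seq_len):
    """Bytes per buffer for an assumed unfused implementation."""
    floats = {
        "weights": weight_floats(),
        "x_and_output": 2 * seq_len * D_MODEL,
        "qkv": seq_len * 3 * D_MODEL,
        "attention_scores": N_HEADS * seq_len * seq_len,  # O(seq_len^2)
        "ffn_hidden": seq_len * FFN_DIM,                  # O(seq_len)
    }
    return {name: count * BYTES_PER_FLOAT for name, count in floats.items()}


usage = estimate_bytes(1024)
total_bytes = sum(usage.values())
for name, size in usage.items():
    print(f"{name:>17}: {size / 1e6:7.1f} MB")
print(f"{'total':>17}: {total_bytes / 1e6:7.1f} MB")
```

Under these assumptions the total at seq_len=1,024 is roughly 107 MB, with the attention scores the largest single buffer at about 50 MB. Doubling seq_len quadruples the score buffer but only doubles the FFN hidden buffer.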

Starter files ✅

All 6 present — CUDA, PyTorch, Triton, JAX, CuTe, Mojo ✓

CUDA/Mojo: parameter comment says the arguments are device pointers, without parenthetical (medium) ✓
Python frameworks: parameter comment says the arguments are tensors on the GPU ✓
JAX: # return output tensor directly inside body ✓
All have exactly 1 parameter description comment, no other comments ✓
All are empty (do not solve the problem) ✓

General

Directory now correctly uses 74_gpt2_block ✓
Linting passes (pre-commit run --all-files) ✓

…ng, add weight indexing example

- Replace MHA/FFN acronyms with MultiHeadAttn/FeedForward in equations and steps
- Fix LaTeX \_ rendering issue inside \text{} (plain _ works in MathJax/KaTeX)
- Clarify no causal mask is applied in attention
- Fix ambiguous attn_out variable naming across steps (use A, P, F)
- Add concrete weight buffer indexing example (W_qkv[i][j])
- Document LaTeX underscore rule in CLAUDE.md

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

kunal-mansukhani · 2026-02-28T08:50:03Z

challenges/medium/74_gpt2_block/challenge.py

+            atol=1e-03,
+            rtol=1e-03,


@claude Are these acceptable tolerance levels for this challenge?

kunal-mansukhani · 2026-02-28T08:52:10Z

challenges/medium/74_gpt2_block/challenge.py

+        proj = fc @ W_proj + b_proj
+
+        # residual connection 2
+        output.copy_(hidden + proj)


Nit: To avoid a memcpy, you can do torch.add(hidden, proj, out=output)

…mment

- Add required <h2>Example</h2> section to challenge.html (was missing, checklist requires Implementation Requirements, Example(s), Constraints)
- Fix starter.jax.py comment: "on the GPU" → "on GPU" to match CLAUDE.md JAX template format

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

- Add missing Examples section to challenge.html
- Add torch.manual_seed(0) to make generate_example_test() deterministic
- Fix starter.jax.py comment: "on the GPU" -> "on GPU" (matches template)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

The checklist requires <h2> sections for Implementation Requirements, Example(s), and Constraints. The Example section was missing. Since D=768 is a fixed architecture dimension, exact tensor values cannot be shown, so the example describes input/output shapes for seq_len=4 (matching generate_example_test()). Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

The previous commit added an Example section before Weight Layout, but one already existed after Weight Layout. This removes the newly-added duplicate, leaving only the Example section before Constraints. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

shxjames added 2 commits February 23, 2026 23:58

gpt2 transformer block challenge

a8418c7

simplify solve signature + update html description to include nice di…

634e424

…agrams

shxjames requested review from ishaan-arya and kunal-mansukhani as code owners February 25, 2026 02:12

shxjames changed the title ~~Inference challenge with gpt2~~ Add challenge 74: GPT-2 Inference Challenge Feb 25, 2026

shxjames changed the title ~~Add challenge 74: GPT-2 Inference Challenge~~ Add challenge 74: GPT-2 Inference (Medium) Feb 25, 2026

shxjames and others added 2 commits February 24, 2026 23:17

challenge.html update

5b10d47

kunal-mansukhani previously approved these changes Feb 28, 2026

kunal-mansukhani reviewed Feb 28, 2026

claude bot dismissed kunal-mansukhani’s stale review via 5d2e668 February 28, 2026 09:01

github-actions bot and others added 5 commits February 28, 2026 09:21

Fix checklist issues in challenge 74 GPT-2 Transformer Block

d3cc108

- Add missing <h2>Example</h2> section to challenge.html
- Fix device assertion to verify CUDA (assert x.device.type == "cuda")

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

Remove duplicate Example sections from challenge.html

50ad843

Previous bot commits had already added Example sections; consolidate to a single <h2>Example</h2> section after Weight Layout. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

kunal-mansukhani approved these changes Mar 2, 2026

kunal-mansukhani merged commit 1471299 into main Mar 2, 2026
6 of 7 checks passed

kunal-mansukhani merged 11 commits into main from inference-challenge
