Add challenge 74: GPT-2 Inference (Medium)#194

Merged
kunal-mansukhani merged 11 commits into main from inference-challenge
Mar 2, 2026

@shxjames
Contributor

@shxjames shxjames commented Feb 25, 2026

Summary

Adds challenge 73: GPT-2 Transformer Block (Medium difficulty)

Implements a single GPT-2 124M decoder block: the user receives an input activation tensor and a packed weight buffer, and must compute the full pre-norm transformer block — LayerNorm, 12-head self-attention with combined QKV projection, residual connection, LayerNorm, two-layer FFN with GELU activation, and a second residual connection

This is the first challenge that composes multiple ML primitives (LayerNorm, matmul, softmax, GELU) into a complete architectural building block, as opposed to implementing them in isolation
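The data flow described above can be sketched in NumPy as follows. This is an illustrative sketch, not the PR's PyTorch reference implementation: the parameter dictionary keys, the `random_params` helper, and the tanh GELU variant are assumptions made here, and no causal mask is applied (consistent with the commit note further down). The names `hidden`, `fc`, `proj`, `W_proj`, and `b_proj` follow the snippet quoted in the review.

```python
import numpy as np


def layer_norm(x, gamma, beta, eps=1e-5):
    # Normalize each row over the feature axis (biased variance, as in LayerNorm).
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps) * gamma + beta


def gelu_tanh(x):
    # ASSUMPTION: tanh approximation of GELU; the challenge spec may use another variant.
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))


def softmax(x):
    # Numerically stable softmax over the last axis.
    shifted = x - x.max(axis=-1, keepdims=True)
    e = np.exp(shifted)
    return e / e.sum(axis=-1, keepdims=True)


def gpt2_block(x, p, n_heads):
    seq_len, d_model = x.shape
    d_head = d_model // n_heads

    def split_heads(t):
        # (seq_len, d_model) -> (n_heads, seq_len, d_head)
        return t.reshape(seq_len, n_heads, d_head).transpose(1, 0, 2)

    # Sub-layer 1: LayerNorm -> combined QKV projection -> attention -> residual.
    h = layer_norm(x, p["ln1_g"], p["ln1_b"])
    qkv = h @ p["W_qkv"] + p["b_qkv"]                    # (seq_len, 3 * d_model)
    q, k, v = (split_heads(t) for t in np.split(qkv, 3, axis=-1))
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(d_head)  # (n_heads, seq_len, seq_len)
    attn = softmax(scores) @ v                           # no causal mask
    attn = attn.transpose(1, 0, 2).reshape(seq_len, d_model)
    hidden = x + attn @ p["W_attn"] + p["b_attn"]

    # Sub-layer 2: LayerNorm -> two-layer FFN with GELU -> residual.
    h2 = layer_norm(hidden, p["ln2_g"], p["ln2_b"])
    fc = gelu_tanh(h2 @ p["W_fc"] + p["b_fc"])
    proj = fc @ p["W_proj"] + p["b_proj"]
    return hidden + proj


def random_params(rng, d_model, ffn_dim):
    # Hypothetical helper: small random weights for a smoke test.
    def w(*shape):
        return rng.standard_normal(shape) * 0.02

    return {
        "ln1_g": np.ones(d_model), "ln1_b": np.zeros(d_model),
        "W_qkv": w(d_model, 3 * d_model), "b_qkv": w(3 * d_model),
        "W_attn": w(d_model, d_model), "b_attn": w(d_model),
        "ln2_g": np.ones(d_model), "ln2_b": np.zeros(d_model),
        "W_fc": w(d_model, ffn_dim), "b_fc": w(ffn_dim),
        "W_proj": w(ffn_dim, d_model), "b_proj": w(d_model),
    }


rng = np.random.default_rng(0)
params = random_params(rng, d_model=768, ffn_dim=3072)
out = gpt2_block(rng.standard_normal((4, 768)), params, n_heads=12)
print(out.shape)
```

The sketch materializes every intermediate, which is the simplest correct schedule; a GPU solution would likely fuse some of these steps.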

What this teaches

  • End-to-end kernel composition: fusing or scheduling multiple dependent operations (LayerNorm → matmul → softmax → matmul → GELU → matmul) with intermediate memory management
  • Packed weight buffer indexing: all 12 parameter tensors (~7M floats / ~28MB) are packed into a single contiguous buffer at documented offsets, mimicking how real inference engines store model weights
  • Multi-head attention mechanics: splitting projections into heads, per-head scaled dot-product attention, and concatenating results back — the core operation of transformer inference
  • Memory vs compute tradeoffs: the attention score matrix grows as O(seq_len²), while the FFN activations grow as O(seq_len), so each sub-layer calls for a different parallelization strategy
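To make the packed-buffer bullet concrete, here is a minimal sketch of how offsets into a single flat buffer can be derived. The tensor names and the packing order are assumptions for illustration only; the authoritative order is whatever the challenge's Weight Layout section documents. Shapes use an (in, out) orientation, consistent with the `fc @ W_proj` expression quoted in the review, and row-major storage is assumed.

```python
D_MODEL = 768
FFN_DIM = 3072

# ASSUMED packing order and names; check the challenge's Weight Layout section.
TENSORS = [
    ("ln1_weight", (D_MODEL,)),
    ("ln1_bias", (D_MODEL,)),
    ("W_qkv", (D_MODEL, 3 * D_MODEL)),
    ("b_qkv", (3 * D_MODEL,)),
    ("W_attn", (D_MODEL, D_MODEL)),
    ("b_attn", (D_MODEL,)),
    ("ln2_weight", (D_MODEL,)),
    ("ln2_bias", (D_MODEL,)),
    ("W_fc", (D_MODEL, FFN_DIM)),
    ("b_fc", (FFN_DIM,)),
    ("W_proj", (FFN_DIM, D_MODEL)),
    ("b_proj", (D_MODEL,)),
]


def build_offsets(tensors):
    """Return ({name: (start, shape)}, total_floats) for a contiguous packing."""
    offsets = {}
    cursor = 0
    for name, shape in tensors:
        size = 1
        for dim in shape:
            size *= dim
        offsets[name] = (cursor, shape)
        cursor += size
    return offsets, cursor


def flat_index(offsets, name, i, j):
    """Flat index of element [i][j] of a 2-D tensor stored row-major."""
    start, (_rows, cols) = offsets[name]
    return start + i * cols + j


OFFSETS, TOTAL_FLOATS = build_offsets(TENSORS)
print(TOTAL_FLOATS)                # 7087872 floats
print(TOTAL_FLOATS * 4 / 1e6)      # about 28.35 MB at 4 bytes per float
print(flat_index(OFFSETS, "W_qkv", 1, 2))
```

The total matches the "~7M floats / ~28MB" figure in the bullet regardless of packing order; only the per-tensor start offsets depend on the assumed order.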

Test plan

  • 10 functional tests covering: single token, seq_len 2/3/4 edge cases, 16 and 64 power-of-2, 30 and 100 non-power-of-2, 128 and 256 realistic sizes
  • Performance test: seq_len=1,024 with GPT-2 124M dimensions (d_model=768, n_heads=12, ffn_dim=3,072)
  • All dimensions fixed to GPT-2 124M architecture; only seq_len varies
  • Tolerance atol/rtol=1e-3 to accommodate floating-point accumulation across composed operations
  • All 6 framework starters present (CUDA, PyTorch, Triton, JAX, CuTe, Mojo)
  • Checklist in CLAUDE.md verified
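For readers weighing the 1e-3 tolerance, the sketch below shows the elementwise criterion that a combined atol/rtol check usually implies, namely |actual − expected| ≤ atol + rtol · |expected| (the rule used by torch.allclose). Whether the platform's checker applies exactly this rule is an assumption; `within_tolerance` is a hypothetical helper written for illustration.

```python
def within_tolerance(actual, expected, atol=1e-3, rtol=1e-3):
    """True if every element satisfies |a - e| <= atol + rtol * |e|."""
    return all(
        abs(a - e) <= atol + rtol * abs(e)
        for a, e in zip(actual, expected)
    )


# Near 1.0 the allowed error is about 2e-3; near 100.0 it is about 0.1.
print(within_tolerance([1.0015], [1.0]))    # True
print(within_tolerance([1.003], [1.0]))     # False
print(within_tolerance([100.09], [100.0]))  # True
```

Because the relative term scales with the magnitude of the expected value, large activations get proportionally more slack, which is what absorbs accumulation error across the chained matmuls.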

@shxjames shxjames changed the title Inference challenge with gpt2 Add challenge 74: GPT-2 Inference Challenge Feb 25, 2026
@shxjames shxjames changed the title Add challenge 74: GPT-2 Inference Challenge Add challenge 74: GPT-2 Inference (Medium) Feb 25, 2026
@shxjames
Contributor Author

@claude rename challenge to 74 and fix linting issues

Challenge 73 is already taken by All-Pairs Shortest Paths on main.
Rename medium/73_gpt2_block → medium/74_gpt2_block to avoid conflict.

Also add an explicit zero-input test case (x=zeros) to
generate_functional_test() to satisfy CLAUDE.md requirement that
functional tests include zero inputs.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@claude
Contributor

claude bot commented Feb 25, 2026

Review: PR #194 — GPT-2 Transformer Block

Issues Fixed

1. Challenge number conflict (Critical)

Challenge 73 is already taken on main by challenges/hard/73_all_pairs_shortest_paths. Renamed 73_gpt2_block → 74_gpt2_block to avoid the conflict.

2. Missing zero-input test case

CLAUDE.md requires generate_functional_test() to include zero inputs. Added an explicit test with x = zeros(4, 768) (replacing a redundant seq_len=3 edge case to stay within the 10-test maximum).

Both fixes have been committed and pushed to the PR branch.

Validation

Wrote a CUDA reference solution and ran it through the platform (--action run). All functional and performance tests passed (atol=rtol=1e-3, seq_len=1,024 perf test).

Checklist Review

challenge.html ✅

  • Starts with <p>
  • <h2> sections for Implementation Requirements, Example, Weight Layout, Constraints ✓
  • Example describes seq_len=4 matching generate_example_test() (random weights make literal display impractical — acceptable for this challenge type) ✓
  • Performance bullet: seq_len = 1,024 matches generate_performance_test()

challenge.py ✅

  • Inherits ChallengeBase
  • super().__init__() with all required params ✓
  • reference_impl has shape, dtype, and device assertions ✓
  • All 6 required methods present ✓
  • generate_functional_test() has 10 cases covering edge (1, 2, 4), zero-input (4), powers-of-2 (16, 64), non-powers-of-2 (30, 100), realistic (128, 256) ✓
  • generate_performance_test() at seq_len=1,024 fits well within 16GB ✓
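The "fits well within 16GB" claim can be sanity-checked with a rough estimate. The sketch below assumes float32 storage and that every intermediate is fully materialized at once, which is a pessimistic upper bound; the buffer names are hypothetical.

```python
BYTES_PER_FLOAT = 4  # ASSUMPTION: float32 throughout


def perf_test_footprint(seq_len=1024, d_model=768, n_heads=12, ffn_dim=3072):
    """Approximate bytes per buffer if all intermediates are materialized."""
    weight_floats = (
        4 * d_model                              # two LayerNorms (weight + bias)
        + d_model * 3 * d_model + 3 * d_model    # combined QKV projection
        + d_model * d_model + d_model            # attention output projection
        + d_model * ffn_dim + ffn_dim            # FFN layer 1
        + ffn_dim * d_model + d_model            # FFN layer 2
    )
    floats = {
        "weights": weight_floats,
        "input": seq_len * d_model,
        "output": seq_len * d_model,
        "qkv": seq_len * 3 * d_model,
        "attn_scores": n_heads * seq_len * seq_len,
        "ffn_hidden": seq_len * ffn_dim,
    }
    return {name: n * BYTES_PER_FLOAT for name, n in floats.items()}


footprint = perf_test_footprint()
for name, nbytes in footprint.items():
    print(f"{name:12s} {nbytes / 2**20:8.1f} MiB")
print(f"{'total':12s} {sum(footprint.values()) / 2**20:8.1f} MiB")
```

Under these assumptions the attention scores (48 MiB) are the largest single buffer, and the total is roughly 102 MiB, several orders of magnitude below 16GB.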

Starter files ✅

All 6 present — CUDA, PyTorch, Triton, JAX, CuTe, Mojo ✓

  • CUDA/Mojo: parameter comment says the arguments are device pointers, without a parenthetical (medium) ✓
  • Python frameworks: parameter comment says the arguments are tensors on the GPU
  • JAX: # return output tensor directly inside body ✓
  • All have exactly 1 parameter description comment, no other comments ✓
  • All are empty (do not solve the problem) ✓

General

  • Directory now correctly uses 74_gpt2_block
  • Linting passes (pre-commit run --all-files) ✓

shxjames and others added 2 commits February 24, 2026 23:17
…ng, add weight indexing example

- Replace MHA/FFN acronyms with MultiHeadAttn/FeedForward in equations and steps
- Fix LaTeX \_ rendering issue inside \text{} (plain _ works in MathJax/KaTeX)
- Clarify no causal mask is applied in attention
- Fix ambiguous attn_out variable naming across steps (use A, P, F)
- Add concrete weight buffer indexing example (W_qkv[i][j])
- Document LaTeX underscore rule in CLAUDE.md

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Comment on lines +35 to +36
atol=1e-03,
rtol=1e-03,
Contributor

@claude Are these acceptable tolerance levels for this challenge?

proj = fc @ W_proj + b_proj

# residual connection 2
output.copy_(hidden + proj)
Contributor

Nit: To avoid a memcpy, you can do torch.add(hidden, proj, out=output)
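
A small sketch of the suggestion above: `hidden + proj` allocates a temporary tensor that `copy_` then copies into `output`, whereas `torch.add(..., out=output)` writes the sum straight into the existing buffer. The tensors here are CPU stand-ins with assumed shapes so the snippet runs anywhere; `residual_into` is a hypothetical wrapper, not code from the PR.

```python
import torch


def residual_into(hidden, proj, output):
    # Write hidden + proj directly into the preallocated output buffer.
    torch.add(hidden, proj, out=output)
    return output


hidden = torch.randn(4, 768)
proj = torch.randn(4, 768)
output = torch.empty(4, 768)

ptr_before = output.data_ptr()
residual_into(hidden, proj, output)

# Same storage as before, and the same values as the two-step version.
print(output.data_ptr() == ptr_before)
print(torch.equal(output, hidden + proj))
```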

…mment

- Add required <h2>Example</h2> section to challenge.html (was missing,
  checklist requires Implementation Requirements, Example(s), Constraints)
- Fix starter.jax.py comment: "on the GPU" → "on GPU" to match CLAUDE.md
  JAX template format

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
github-actions bot and others added 5 commits February 28, 2026 09:21
- Add missing Examples section to challenge.html
- Add torch.manual_seed(0) to make generate_example_test() deterministic
- Fix starter.jax.py comment: "on the GPU" -> "on GPU" (matches template)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- Add missing <h2>Example</h2> section to challenge.html
- Fix device assertion to verify CUDA (assert x.device.type == "cuda")

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Previous bot commits had already added Example sections; consolidate
to a single <h2>Example</h2> section after Weight Layout.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
The checklist requires <h2> sections for Implementation Requirements,
Example(s), and Constraints. The Example section was missing. Since D=768
is a fixed architecture dimension, exact tensor values cannot be shown, so
the example describes input/output shapes for seq_len=4 (matching
generate_example_test()).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
The previous commit added an Example section before Weight Layout, but one
already existed after Weight Layout. This removes the newly-added duplicate,
leaving only the Example section before Constraints.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@kunal-mansukhani kunal-mansukhani merged commit 1471299 into main Mar 2, 2026
6 of 7 checks passed