All changes we make to the assignment code or PDF will be documented in this file.
- code: typos regarding FlashAttention2 in the adapters
- code: `.python-version` constraint for Python 3.12
- handout: details for leaderboard submission
- handout: replace the PyTorch profiler with Nsight Systems for DDP
- handout: clarify which attention implementations to benchmark
- handout: `memory_profiling(b)` should sweep over context length, not model size
- handout: rename keyword argument in FA2 Triton starter code from `tile_shape` to `block_shape`
- handout: minor typo in `flash_backward` problem statement
- handout: launch all-gather example using `uv run`
- handout: clarify interface for flash autograd function
- handout: clarify submission for attention benchmarking
- handout: fix small notation issues in the flash algorithm
- handout: fix math error and explanation for flash backward savings
- handout: change parameters used for attention benchmarking to be sensible
- handout: fix memory benchmarking that was incompatible with modern PyTorch
- code: include tests for logsumexp in the flash forward implementation
- handout: clarify interface for flash autograd function
- code: test `causal=True` for forward as well as backward
- handout/code: add FlashAttention2
- handout: add proper profiling with Nsight Systems
- handout: add communication accounting
- code: tests for additional content
- handout: greatly improve demo example for Triton
- handout: remove busywork for communication
- handout: clarify that `ddp_bucketed_benchmarking` doesn't require the full grid of runs.
- code: remove try-finally blocks in DDP tests.
- handout: remove outdated mention of a problem that doesn't exist on the assignment
- handout: fix Slurm environment variables in examples.
- handout: clarify assumptions in `ddp_bucketed_benchmarking` (b).
- code: remove `humanfriendly` from requirements.txt, add `matplotlib`
- handout: modify problem `distributed_communication_multi_node` to specify that multi-node measurements should be 2x1, 2x2, and 2x3.
- handout: clarify that `torch.cuda.synchronize()` is necessary for timing collective communication ops, even when they are called with `async_op=False`.
- handout: fixed cut-off text in problem `memory_profiling` (a)
- handout: fixed mismatch between the Slurm config and the description text in section 3.2
- code: fix `ToyModelWithTiedWeights` to actually tie weights.
- handout: fix typo in bucketed DDP test command, should be `pytest tests/test_ddp.py`
- handout: fix deliverable of `ddp_overlap_individual_parameters_benchmarking` (a) to not ask for communication time, only end-to-end step time.
- handout: clarify analysis in `optimizer_state_sharding_accounting` (a).
- handout: added a short question about variability on problem `benchmarking_script`
- handout: fixed typo in problem `triton_rmsnorm_forward`. The adapters should return the classes, not the `.apply` attribute.
- code: added `-e` flag to `./cs336-systems/'[test]'`
- handout: clarified recommendation about the `timeit` module
- handout: clarified question about kernel with highest CUDA total
Initial release.
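Several entries above touch the logsumexp returned by the flash forward pass. As background, a minimal numerically stable logsumexp in plain Python (a scalar sketch, not the assignment's tensor implementation):

```python
import math


def logsumexp(xs):
    # Subtract the max before exponentiating so exp() cannot overflow;
    # FlashAttention applies the same stabilization tile by tile while
    # carrying the running max and running sum across tiles.
    m = max(xs)
    return m + math.log(sum(math.exp(x - m) for x in xs))


# Inputs that would overflow a naive sum(exp(x)) are handled fine:
print(logsumexp([1000.0, 1000.0]))  # 1000.0 + log(2)
```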
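One fix above makes `ToyModelWithTiedWeights` actually tie its weights. For reference, weight tying means two modules share a single `Parameter` object, so gradients from both uses accumulate into one tensor. A hypothetical sketch (not the repo's actual class):

```python
import torch
import torch.nn as nn


class TiedToy(nn.Module):
    # Illustrative module, not the assignment's ToyModelWithTiedWeights:
    # two linear layers sharing one weight Parameter.
    def __init__(self, d: int):
        super().__init__()
        self.fc1 = nn.Linear(d, d, bias=False)
        self.fc2 = nn.Linear(d, d, bias=False)
        # Assign the same Parameter object (not a copy of its values);
        # updating one layer's weight now updates both.
        self.fc2.weight = self.fc1.weight

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.fc2(torch.relu(self.fc1(x)))
```

The common bug is copying values (e.g. `fc2.weight.data = fc1.weight.data.clone()`), which leaves two independent parameters that merely start out equal.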