Changelog

All changes we make to the assignment code or PDF will be documented in this file.

[1.0.5] - 2025-07-03

Fixed

  • code: typos regarding FlashAttention2 in the adapters

Added

  • code: .python-version constraint pinning Python 3.12 (see the example below)
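
For reference, the constraint file itself is minimal; under the usual uv/pyenv convention, a .python-version at the repository root contains just the version string:

```
3.12
```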

[1.0.4] - 2025-04-27

Fixed

  • handout: details for leaderboard submission
  • handout: replace the PyTorch profiler with Nsight Systems for DDP

Added

  • handout: clarify which attention implementations to benchmark

[1.0.3] - 2025-04-24

Fixed

  • handout: memory_profiling (b) should sweep over context length, not model size
  • handout: rename the keyword argument in the FA2 Triton starter code from tile_shape to block_shape
  • handout: minor typo in flash_backward problem statement
  • handout: launch the all-gather example using uv run (a minimal sketch follows this list)
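
A minimal sketch of what such an all-gather demo looks like, assuming a single-node run on the gloo backend; the file name all_gather_demo.py and the world size are illustrative, not the assignment's actual script:

```python
# Hypothetical file all_gather_demo.py; launch it inside the project
# environment with:  uv run python all_gather_demo.py
import os

import torch
import torch.distributed as dist
import torch.multiprocessing as mp

def worker(rank: int, world_size: int) -> None:
    os.environ["MASTER_ADDR"] = "localhost"
    os.environ["MASTER_PORT"] = "29500"
    dist.init_process_group("gloo", rank=rank, world_size=world_size)
    # Each rank contributes a tensor holding its own rank id.
    tensor = torch.tensor([float(rank)])
    gathered = [torch.zeros(1) for _ in range(world_size)]
    dist.all_gather(gathered, tensor)
    print(f"rank {rank} gathered {[t.item() for t in gathered]}")
    dist.destroy_process_group()

if __name__ == "__main__":
    world_size = 4  # illustrative
    mp.spawn(worker, args=(world_size,), nprocs=world_size, join=True)
```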

[1.0.2] - 2025-04-22

Added

  • handout: clarify interface for flash autograd function (a generic interface sketch follows this list)
  • handout: clarify submission for attention benchmarking
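
For background, PyTorch custom autograd functions share a fixed interface; the sketch below shows its shape with placeholder bodies. The flash-specific arguments and saved tensors are illustrative, not the assignment's exact signature:

```python
import torch

class FlashAttention(torch.autograd.Function):
    """Sketch of the torch.autograd.Function interface; bodies are placeholders."""

    @staticmethod
    def forward(ctx, Q, K, V, is_causal=False):
        # ... compute the output O and the per-row logsumexp L tile by tile ...
        O = torch.empty_like(Q)        # placeholder
        L = Q.new_zeros(Q.shape[:-1])  # placeholder
        ctx.save_for_backward(Q, K, V, O, L)
        ctx.is_causal = is_causal
        return O

    @staticmethod
    def backward(ctx, dO):
        Q, K, V, O, L = ctx.saved_tensors
        # ... recompute attention and form the input gradients ...
        dQ, dK, dV = torch.zeros_like(Q), torch.zeros_like(K), torch.zeros_like(V)
        return dQ, dK, dV, None  # one gradient per forward input; None for is_causal

# Callers invoke it through .apply rather than instantiating it:
# out = FlashAttention.apply(Q, K, V, True)
```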

Fixed

  • handout: fix small notation issues in the flash algorithm
  • handout: fix math error and explanation for flash backward savings
  • handout: change parameters used for attention benchmarking to be sensible

Removed

  • handout: some memory benchmarking that is incompatible with modern PyTorch

[1.0.1] - 2025-04-17

Added

  • code: include tests for logsumexp in the flash forward implementation
  • handout: clarify interface for flash autograd function
  • code: test causal=True for forward as well as backward

[1.0.0] - 2025-04-16

Added

  • handout/code: add FlashAttention2
  • handout: add proper profiling with Nsight Systems (an NVTX annotation sketch follows this list)
  • handout: add communication accounting
  • code: tests for additional content
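
As a reminder of the workflow, NVTX ranges mark code regions that then show up on the Nsight Systems timeline; a hedged sketch, with illustrative report and script names:

```python
# Capture a timeline with, e.g.:
#   nsys profile -t cuda,nvtx -o report python train_step.py
# (the report and script names are illustrative)
import torch

def train_step(model, batch, optimizer):
    with torch.cuda.nvtx.range("forward"):
        loss = model(batch).mean()
    with torch.cuda.nvtx.range("backward"):
        loss.backward()
    with torch.cuda.nvtx.range("optimizer"):
        optimizer.step()
        optimizer.zero_grad(set_to_none=True)
```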

Changed

  • handout: greatly improve demo example for Triton
  • handout: remove busywork for communication

Fixed

  • handout: clarify that ddp_bucketed_benchmarking doesn't require the full grid of runs

[0.0.4] - 2024-04-23

Changed

  • code: remove try-finally blocks in DDP tests

Fixed

  • handout: remove outdated mention of a problem that doesn't exist in the assignment
  • handout: fix Slurm environment variables in examples
  • handout: clarify assumptions in ddp_bucketed_benchmarking (b)

[0.0.3] - 2024-04-21

Changed

  • code: remove humanfriendly from requirements.txt, add matplotlib
  • handout: modify problem distributed_communication_multi_node to specify that multi-node measurements should be 2x1, 2x2, and 2x3
  • handout: clarify that torch.cuda.synchronize() is necessary for timing collective communication ops, even when they are called with async_op=False (see the timing sketch after this list)
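
To make the synchronization point concrete, a minimal timing sketch assuming an already-initialized process group; the warm-up and iteration counts are illustrative:

```python
import time

import torch
import torch.distributed as dist

def time_all_reduce(tensor: torch.Tensor, iters: int = 10) -> float:
    """Average seconds per all-reduce; assumes an initialized process group."""
    for _ in range(3):            # warm-up so setup costs don't skew timing
        dist.all_reduce(tensor)
    torch.cuda.synchronize()      # wait for warm-up kernels to drain
    start = time.perf_counter()
    for _ in range(iters):
        # Even with async_op=False this returns once the op is enqueued,
        # not once the GPU has finished it.
        dist.all_reduce(tensor)
    torch.cuda.synchronize()      # block until the collectives actually complete
    return (time.perf_counter() - start) / iters
```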

Fixed

  • handout: fix cut-off text in problem memory_profiling (a)
  • handout: fix mismatch between Slurm config and description text in section 3.2
  • code: fix ToyModelWithTiedWeights to actually tie weights (see the sketch after this list)
  • handout: fix typo in the bucketed DDP test command; it should be pytest tests/test_ddp.py
  • handout: fix deliverable of ddp_overlap_individual_parameters_benchmarking (a) to not ask for communication time, only end-to-end step time
  • handout: clarify analysis in optimizer_state_sharding_accounting (a)
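
For reference, tying weights means sharing one Parameter object between layers rather than copying values; a hedged sketch, where the class below is illustrative and not the assignment's actual ToyModelWithTiedWeights:

```python
import torch
import torch.nn as nn

class TiedToyModel(nn.Module):
    """Illustrative only; the assignment's ToyModelWithTiedWeights may differ."""

    def __init__(self, dim: int = 16):
        super().__init__()
        self.fc1 = nn.Linear(dim, dim, bias=False)
        self.fc2 = nn.Linear(dim, dim, bias=False)
        # Share the Parameter itself; gradients from both layers
        # accumulate into the same tensor.
        self.fc2.weight = self.fc1.weight

    def forward(self, x):
        return self.fc2(torch.relu(self.fc1(x)))
```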

[0.0.1] - 2024-04-17

Added

  • handout: add a short question about variability to problem benchmarking_script

Fixed

  • handout: fix typo in problem triton_rmsnorm_forward; the adapters should return the classes, not the .apply attribute (see the sketch after this list)
  • code: add the -e flag to the install command for ./cs336-systems/'[test]'
  • handout: clarify recommendation about the timeit module
  • handout: clarify question about the kernel with the highest CUDA total
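
To illustrate the class-versus-.apply distinction, a sketch in which every name is hypothetical rather than the assignment's actual adapter signature:

```python
import torch

class RMSNormTritonFunc(torch.autograd.Function):
    """Hypothetical autograd Function wrapping a Triton RMSNorm kernel."""

    @staticmethod
    def forward(ctx, x, weight):
        raise NotImplementedError  # would launch the Triton forward kernel

    @staticmethod
    def backward(ctx, grad_out):
        raise NotImplementedError  # would launch the Triton backward kernel

def get_rmsnorm_autograd_function_triton():
    return RMSNormTritonFunc          # correct: return the Function class
    # return RMSNormTritonFunc.apply  # wrong: returns the .apply attribute
```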

[0.0.0] - 2024-04-16

Initial release.