All changes we make to the assignment code or PDF will be documented in this file.
- handout: Fix bug in RoPE formulation
- code: Fix typing for adapters
- code: Relock dependencies
- code: Minor reformatting
- code: Add submission script, fix typos
- code: Use `uv_build` for the package build system
- handout: Fix RoPE indexing
- code: Fix test for truncated LM inputs
- code: Ruff reformatting and linting
- code: Simplify snapshot testing
- code: Make everything compatible with `ty` typing
- handout: add guidance on parallelizing pretokenization and provide starter code for chunking
- handout: add guidance on removing special tokens before pretokenization (you should split on them!)
- handout: fix command for compiling model on MPS backend
- code: Test for removing special tokens when training BPE
- handout: Fixed RoPE off-by-one error
- code: Fix for Intel Macs, support Python 3.11 and PyTorch 2.2.2 when on macOS x86_64
- code: Missing tests for Linear and Embedding
- handout: Fix RMSNorm interface
- handout: Add hints in RMSNorm and SwiGLU specifications to improve numerical stability
- handout: Clarify the BPE stylized example uses naive pretokenization by splitting on whitespace; your implementations should still use the provided regex
- handout: Fix docstring for RMSNorm
- code: Add some tolerance to RoPE tests
- code: Make sure loading happens on CPU for tests
- code: Fix docstring typing for readability of Tensor annotations in VSCode (pls fix, @Microsoft)
- code: uv environments
- code: Tensor type annotations
- code: Snapshot testing, use --update-snapshots to update (soln only)
- code/handout: SwiGLU
- code/handout: RoPE
- handout: einops
- code/handout: new ablations
- handout: low-resource tips
- handout: Linear and Embedding from scratch
- handout: complete reformatting, restructuring, rewording
- handout: redefine attention
- handout: GeLU -> SiLU
- handout: parameter initialization
- handout: remove dropout
- handout: redraw figures
- handout: new experiments
- handout: BPE explanations, systems, explain weird behaviors in solutions
- code: fix `test_get_batch` to handle "AssertionError: Torch not compiled with CUDA enabled".
- handout: clarify that gradient clipping norm is calculated over all the parameters.
- code: fix gradient clipping test comparing wrong tensors
- code: test skipping parameters with no gradient and properly computing norm with multiple parameters
- handout: edit expected TinyStories run time to 30-40 minutes.
- handout: add more details about how to use `np.memmap` or the `mmap_mode` flag to `np.load`.
- code: fix `get_tokenizer()` docstring.
- handout: specify that problem `main_experiment` should use the same settings as TinyStories.
- code: replace mentions of layernorm with RMSNorm.
- handout: clarify example of preferring lexicographically greater merges to specify that we want tuple comparison.
- handout: fix expected number of training tokens for TinyStories, should be 327,680,000.
- code: fix typo in `run_get_lr_cosine_schedule` return docstring.
- code: fix typo in `test_tokenizer.py`
- code: skip `Tokenizer` memory-related tests on non-Linux systems, since support for RLIMIT_AS is inconsistent.
- code: reduce increase atol on end-to-end Transformer forward pass tests.
- code: remove dropout in model-related tests to improve determinism across platforms.
- code: add `attn_pdrop` to `run_multihead_self_attention` adapter.
- code: clarify `{q,k,v}_proj` dimension orders in the adapters.
- code: increase atol on cross-entropy tests
- code: remove unnecessary warning in `test_get_lr_cosine_schedule`
- handout: fix signature of `Tokenizer.__init__` to include `self`.
- handout: mention that `Tokenizer.from_files` should be a class method.
- handout: clarify list of model hyperparameters listed in `adamwAccounting`.
- handout: clarify that `adamwAccounting` (b) considers a GPT-2 XL-shaped model (with our architecture), not necessarily the literal GPT-2 XL model.
- handout: moved softmax problem to where softmax is first mentioned (Scaled Dot-Product Attention, Section 3.4.3)
- handout: removed redundant initialization (t = 0) in AdamW pseudocode
- handout: added resources needed for BPE training
- handout: edit `adamWAccounting`, part (d) to define MFU and mention that the backward pass is typically assumed to have twice the FLOPS of the forward pass.
- handout: provide a hint about desired behavior when a user passes in input IDs to `Tokenizer.decode` that correspond to invalid UTF-8 bytes.
- handout: added some more information about submitting to the leaderboard.
- code: add a note to README.md that pull requests and issues are welcome and encouraged.
- handout: edit motivation for pre-tokenization to include a note about desired behavior with tokens that differ only in punctuation.
- handout: remove total number of points after each section.
- handout: mention that large language models (e.g., LLaMA and GPT-3) often use AdamW betas of (0.9, 0.95) (in contrast to the PyTorch defaults of (0.9, 0.999)).
- handout: explicitly mention the deliverable in the `adamw` problem.
- code: rename `test_serialization::test_checkpoint` to `test_serialization::test_checkpointing` to match the handout.
- code: slightly relax the time limit in `test_train_bpe_speed`.
- code: fix an issue in the `train_bpe` tests where the expected merges and vocab did not properly reflect tiebreaking with the lexicographically greatest pair.
  - This occurred because our reference implementation (which checks against HF) follows the GPT-2 tokenizer in remapping bytes that aren't human-readable to printable unicode strings. To match the HF code, we were erroneously tiebreaking on this remapped unicode representation instead of the original bytes.
- handout: fix the expected number of non-embedding parameters for model with recommended TinyStories hyperparameters (section 7.2).
- handout: replace `<|endofsequence|>` with `<|endoftext|>` in the `decoding` problem.
- code: fix the setup command (`pip install -e .'[test]'`) to improve zsh compatibility.
- handout: fix various trivial typos and formatting errors.
Initial release.
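
Several entries above concern BPE merge tie-breaking: ties go to the lexicographically greatest pair, compared as a tuple of the original bytes. The snippet below is a minimal sketch of that rule, not the reference implementation; `pick_merge`, `pick_merge_joined`, and the toy counts are hypothetical names and data chosen only for illustration.

```python
from collections import Counter


def pick_merge(pair_counts):
    """Most frequent pair; ties broken by tuple comparison on the raw bytes."""
    # Python compares (count, pair) element-wise: count first, then the pair
    # itself as a tuple (first token, then second token).
    return max(pair_counts, key=lambda pair: (pair_counts[pair], pair))


def pick_merge_joined(pair_counts):
    """Contrast case (assumed to be NOT the intended rule): compare joined bytes."""
    return max(pair_counts, key=lambda pair: (pair_counts[pair], pair[0] + pair[1]))


# Hypothetical counts: four pairs tied at 5, one less frequent pair.
counts = Counter({
    (b"A", b"B"): 5,
    (b"A", b"C"): 5,
    (b"B", b"ZZ"): 5,
    (b"BA", b"A"): 5,
    (b"x", b"y"): 3,
})

print(pick_merge(counts))         # tuple comparison
print(pick_merge_joined(counts))  # joined-bytes comparison picks a different pair
```

With these counts, tuple comparison selects `(b"BA", b"A")` because `b"BA"` sorts after `b"B"`, while comparing the concatenated bytes would select `(b"B", b"ZZ")`. The two rules can disagree, which is why the comparison has to be spelled out.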