✅ Status: Assignment completed with all implementations and experiments finished.
📊 Validation Loss: 1.33 on TinyStories | 3.33508 on OpenWebText (Leaderboard)
This repository contains my complete implementation for CS336 Assignment 1, including a transformer-based language model built from scratch, BPE tokenizer, training infrastructure, and extensive experiments on model training and optimization.
- Assignment PDF: cs336_spring2025_assignment1_basics.pdf
- My Detailed Writeup: writup.pdf (attempts to answer all questions in the assignment)
- Experiment Changelog: docs/CHANGELOG.md
- Others: docs/ (Quick reference extracted from assignment PDF)
This assignment implements a complete language modeling pipeline with the following components:
- **BPE Tokenizer** (`cs336_basics/bpe.py`)
  - Byte-Pair Encoding training with configurable vocabulary size
  - Efficient encoding/decoding with regex-based pre-tokenization
- **Transformer Architecture** (`cs336_basics/model.py`)
  - Custom implementations: `Linear`, `Embedding`, `RMSNorm` (see the sketch after this list)
  - Rotary Position Embeddings (RoPE)
  - SwiGLU activation function
  - Multi-head causal self-attention with KV-cache support
  - Pre-norm decoder-only transformer blocks
- **Training Infrastructure**
  - Custom optimizers: SGD, AdamW (`cs336_basics/optimizer.py`)
  - Muon optimizer integration (`train_muon.py`)
  - Cosine decay learning rate scheduling with warmup
  - Gradient clipping and cross-entropy loss (`cs336_basics/nn_utils.py`)
  - Multi-backend logging: WandB, TensorBoard, SwanLab (`cs336_basics/logger.py`)
- **Text Generation** (`cs336_basics/generate.py`)
  - Autoregressive generation with top-p (nucleus) sampling
  - KV-cache optimization for efficient inference
  - Generation testing and benchmarking (`check_gen.py`)
- **Comprehensive Experiments** (`scripts/`)
  - Learning rate tuning on TinyStories and OpenWebText
  - Batch size optimization studies
  - Ablation studies: RoPE, RMSNorm, SwiGLU, Pre-norm
  - Cross-dataset training comparisons (main experiment)
  - Leaderboard submission experiments
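For a flavor of the custom components, below is a minimal PyTorch sketch of RMSNorm and a SwiGLU feed-forward block using their standard formulations. Class names and details are illustrative only and may differ from the actual code in `cs336_basics/model.py`.

```python
# Minimal sketches of two custom components (standard formulations;
# the actual implementations in cs336_basics/model.py may differ in detail).
import torch
import torch.nn as nn


class RMSNorm(nn.Module):
    """RMSNorm: rescale by the root-mean-square of the features, then apply a learned gain."""

    def __init__(self, d_model: int, eps: float = 1e-5):
        super().__init__()
        self.eps = eps
        self.weight = nn.Parameter(torch.ones(d_model))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        rms = torch.sqrt(x.pow(2).mean(dim=-1, keepdim=True) + self.eps)
        return (x / rms) * self.weight


class SwiGLU(nn.Module):
    """SwiGLU feed-forward block: SiLU(x W1) gated elementwise by (x W3), projected by W2."""

    def __init__(self, d_model: int, d_ff: int):
        super().__init__()
        self.w1 = nn.Linear(d_model, d_ff, bias=False)
        self.w3 = nn.Linear(d_model, d_ff, bias=False)
        self.w2 = nn.Linear(d_ff, d_model, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.w2(torch.nn.functional.silu(self.w1(x)) * self.w3(x))
```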
```
cs336_basics/
├── bpe.py           # BPE tokenizer implementation
├── model.py         # Transformer components (Linear, Embedding, RMSNorm, SwiGLU, RoPE, Attention, TransformerLM)
├── nn_utils.py      # Loss functions and utilities (cross_entropy, gradient_clipping)
├── optimizer.py     # SGD, AdamW, Muon optimizers
├── data.py          # Data loading utilities
├── generate.py      # Text generation with KV-cache
├── checkpoint.py    # Model checkpointing
├── logger.py        # Multi-backend logging (WandB, TensorBoard, SwanLab)
└── config.py        # Configuration dataclasses

train.py             # Main training script with Hydra config
train_muon.py        # Training with Muon optimizer
check_gen.py         # Generation testing and benchmarking
conf/                # Hydra configuration files
scripts/             # Experiment scripts
data_utils/          # Data downloading and tokenization scripts
docs/                # Implementation notes (extracted from assignment PDF)
```
We use `uv` for environment management. Install it via:

```bash
pip install uv
# or
brew install uv
```

Run any Python file with automatic environment setup:

```bash
uv run <python_file_path>
```

Run all unit tests:

```bash
uv run pytest
```

Run specific test categories:

```bash
uv run pytest -k test_linear
uv run pytest -k test_bpe
uv run pytest -k test_transformer
```

Download TinyStories and OpenWebText datasets:
```bash
bash data_utils/download_dataset.sh
```

Data structure:

```
data/
├── TinyStoriesV2-GPT4-train.txt (2.1G)
├── TinyStoriesV2-GPT4-valid.txt (21M)
├── owt_train.txt (11G)
└── owt_valid.txt (277M)
```

Tokenize datasets (see `data_utils/` for scripts):
```
data/
├── tinystories/       # Tokenized TinyStories (vocab_size=10000)
├── openwebtext/       # Tokenized OpenWebText (vocab_size=1000)
├── openwebtext-32k/   # Tokenized OpenWebText (vocab_size=32000)
└── ...
```

> [!NOTE]
> **Tokenizer Implementation**
>
> While this assignment implements a custom BPE tokenizer from scratch (`cs336_basics/bpe.py`) that passes all unit tests (`uv run pytest`), the actual dataset tokenization for the training experiments uses HuggingFace's tokenizer library for efficiency and reliability on the large datasets (TinyStories and OpenWebText). Tokenizers are saved in the `hf_tokenizer/` directory.
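As a rough illustration of this tokenization step, the sketch below uses a saved HuggingFace tokenizer to encode a raw text file into a flat array of uint16 token IDs. The file names and the uint16 `.bin` output format are assumptions made for illustration; the actual scripts live in `data_utils/` and may differ.

```python
# Hypothetical sketch: tokenize a raw text file into a flat uint16 token array.
# Paths, chunking, and output format are assumptions; see data_utils/ for the real scripts.
import numpy as np
from tokenizers import Tokenizer

tokenizer = Tokenizer.from_file("hf_tokenizer/openwebtext-32k/tokenizer.json")

ids = []
with open("data/owt_valid.txt", encoding="utf-8") as f:
    for line in f:                      # stream line by line to bound memory use
        ids.extend(tokenizer.encode(line).ids)

# vocab_size=32000 fits in uint16, so a flat binary file stays compact and memmap-friendly
np.array(ids, dtype=np.uint16).tofile("data/openwebtext-32k/valid.bin")
```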
Train on TinyStories with default config:

```bash
uv run train.py
```

Train on OpenWebText:

```bash
uv run train.py \
    model.vocab_size=32000 \
    data.path=data/openwebtext-32k \
    data.tokenizer_path=hf_tokenizer/openwebtext-32k/tokenizer.json \
    training.batch_size=128 \
    optimizer.max_lr=1e-2
```

Train with the Muon optimizer:

```bash
uv run train_muon.py \
    model.vocab_size=32000 \
    data.path=data/openwebtext-32k \
    data.tokenizer_path=hf_tokenizer/openwebtext-32k/tokenizer.json
```

Training is configured via Hydra configs in `conf/`:

- `train_config.yaml` - Main training configuration
- `model/` - Model architecture configs
- `data/` - Dataset configs
- `optimizer/` - Optimizer configs
- `logger/` - Logging backend configs
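For reference, a Hydra entry point typically looks like the minimal sketch below. This is an illustrative assumption of how `train.py` wires up `conf/train_config.yaml`, not a copy of the actual script.

```python
# Minimal Hydra entry-point sketch (illustrative; the real train.py may differ).
import hydra
from omegaconf import DictConfig, OmegaConf


@hydra.main(version_base=None, config_path="conf", config_name="train_config")
def main(cfg: DictConfig) -> None:
    # cfg.model.*, cfg.data.*, cfg.optimizer.*, etc. are composed from the YAML
    # files in conf/ and can be overridden from the command line.
    print(OmegaConf.to_yaml(cfg))


if __name__ == "__main__":
    main()
```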
Override any config via command line:

```bash
uv run train.py model.d_model=512 optimizer.max_lr=3e-4 training.batch_size=64
```

All experiments are tracked using Weights & Biases, with comprehensive logging of:
- Training/validation losses
- Learning rates and gradient norms
- Entropy and perplexity metrics
- Wallclock time and relative (process) time
| Experiment | W&B Report | Description |
|---|---|---|
| Learning Rate | Link | Tune learning rate on TinyStories and OpenWebText datasets |
| Batch Size | Link | Impact of batch size on training performance (TinyStories) |
| Ablation Studies | Link | Component analysis: SwiGLU, RoPE, RMSNorm, Pre-norm (TinyStories) |
| Main | Link | Loss comparison between TinyStories and OpenWebText training |
| Muon | Link | Using Muon for better training performance (OpenWebText) |
| Leaderboard | Link | Final model training and leaderboard submission |
- ✅ Validation Loss: Achieved 1.33 on TinyStories (<1.45, meets requirement)
- 🎯 Optimal Learning Rate: 0.01, found via a comprehensive hyperparameter search (see the schedule sketch after this list)
- 🚀 Optimal Batch Size: 128, which reaches the best validation performance in 8.75 minutes (10,000 iterations)
- 📊 Component Impact: Ablation studies show importance of RoPE, RMSNorm, SwiGLU, and Pre-norm
- 🏆 Leaderboard: Validation loss of 3.33508 on OpenWebText
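The learning rate above is tuned for the cosine-decay-with-warmup schedule listed among the training components. Below is a minimal sketch of the standard formulation; the parameter names (`max_lr`, `min_lr`, `warmup_iters`, `cosine_iters`) are illustrative and may not match the actual config keys.

```python
# Cosine decay with linear warmup (standard formulation; values and names are illustrative).
import math


def lr_at_step(t: int, max_lr: float, min_lr: float, warmup_iters: int, cosine_iters: int) -> float:
    """Learning rate at iteration t: linear warmup, then cosine decay, then hold at min_lr."""
    if t < warmup_iters:                       # linear warmup from 0 to max_lr
        return max_lr * t / warmup_iters
    if t > cosine_iters:                       # after the cosine phase, hold at min_lr
        return min_lr
    progress = (t - warmup_iters) / (cosine_iters - warmup_iters)
    return min_lr + 0.5 * (1 + math.cos(math.pi * progress)) * (max_lr - min_lr)
```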
If you want to reproduce these results, please ensure you have set up the data and environment as described above.

Experiment scripts are located in `scripts/`:

```bash
# TinyStories experiments
bash scripts/tinystories_learning_rate.sh   # Learning rate tuning
bash scripts/tinystories_batch_size.sh      # Batch size experiments
bash scripts/tinystories_ablation.sh        # Ablation studies

# OpenWebText experiments
bash scripts/openwebtext_learning_rate.sh   # Learning rate tuning
bash scripts/openwebtext_muon.sh            # Muon optimizer training

# Sync logs to WandB
bash scripts/wandb_sync.sh
```

Test generation quality and performance:

```bash
uv run check_gen.py
```

This script:
- Loads trained checkpoints
- Generates text samples with different prompts
- Measures generation speed (tokens/sec) and memory usage
- Compares performance with/without KV-cache optimization
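To make the top-p (nucleus) sampling step concrete, here is a minimal PyTorch sketch of the standard technique; it is illustrative only and not the exact code in `cs336_basics/generate.py`.

```python
# Minimal top-p (nucleus) sampling sketch (illustrative; generate.py may differ).
import torch


def sample_top_p(logits: torch.Tensor, p: float = 0.9, temperature: float = 1.0) -> int:
    """Sample one token id from the smallest set of tokens whose probability mass exceeds p."""
    probs = torch.softmax(logits / temperature, dim=-1)
    sorted_probs, sorted_ids = torch.sort(probs, descending=True)
    cumulative = torch.cumsum(sorted_probs, dim=-1)
    # Keep a token if the probability mass *before* it is still below p;
    # this always keeps the most likely token.
    keep = (cumulative - sorted_probs) < p
    kept_probs = sorted_probs * keep
    kept_probs = kept_probs / kept_probs.sum()          # renormalize over the nucleus
    next_idx = torch.multinomial(kept_probs, num_samples=1)
    return int(sorted_ids[next_idx].item())
```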
For detailed experimental results, analysis, and a comprehensive writeup, see:

- Main Writeup Repository: donglinkang2021/cs336-assignment1-writeup
  - Complete LaTeX report with all experiments
  - Plotting scripts for visualization (`code/plot_*.py`)
  - Experiment results data (`exps/`)
The writeup is organized as follows:
- Section 1: Overview
- Section 2: BPE Tokenizer Implementation
- Section 3: Transformer Architecture
- Section 4: Language Model Training Objectives
- Section 5: Training Loop Implementation
- Section 6: Text Generation Methods
- Section 7: Comprehensive Experimental Results
- Appendix: Additional implementation details and code snippets
Create submission package:

```bash
bash make_submission.sh
```

This generates `cs336-spring2025-assignment-1-submission.zip` with all code and test results.
- Assignment Repository: stanford-cs336/assignment1-basics
- Assignment Handout: cs336_spring2025_assignment1_basics.pdf
- Writeup Repository: donglinkang2021/cs336-assignment1-writeup
- Experiment Changelog: CHANGELOG.md
Thanks to the CS336 teaching staff for this comprehensive assignment and leaderboard! Special thanks for providing the infrastructure and test suite that made this learning experience possible.
If you find this repository useful, please cite it as:
```bibtex
@misc{dong2025cs336a1,
  author       = {Dong, Linkang},
  title        = {CS336 Assignment 1 Basics},
  year         = {2025},
  howpublished = {\url{https://github.com/donglinkang2021/cs336-assignment1-basics}},
  note         = {GitHub repository},
}
```