Accept this assignment: GitHub Classroom Link
Due: February 13, 2026 at 11:59 PM EST
Click the link above to create your private repository for this assignment. Complete your work in Google Colab, then push your notebook to the repository before the deadline.
Timeline: 1 Week
In this capstone-style assignment, you will build a GPT (Generative Pre-trained Transformer) model from the ground up and train it to generate text. This assignment goes beyond using pre-trained models—you will implement the core components of a transformer architecture, understand how autoregressive language modeling works, and gain deep insights into what makes modern large language models tick.
This 1-week assignment is designed to be ambitious yet achievable; you are encouraged to use modern GenAI tools (ChatGPT, Claude, GitHub Copilot) to assist with implementation, debugging, and optimization.
GPT models are decoder-only transformers that have revolutionized natural language processing. By building one yourself, you'll understand the intricate details of self-attention mechanisms, positional encodings, layer normalization, and the training dynamics that enable these models to generate coherent text.
This is an ambitious assignment that will challenge you to think deeply about architecture design, optimization, and evaluation. However, with modern tools and GenAI assistance, it's entirely achievable—and incredibly rewarding.
By the end of this assignment, you will be able to:
- Implement transformer architecture components including multi-head self-attention, feed-forward networks, layer normalization, and residual connections
- Understand autoregressive language modeling and how GPT models generate text one token at a time
- Implement causal (masked) self-attention to ensure the model can only attend to previous tokens
- Design and implement positional encodings to give the model a sense of token position
- Build a complete training pipeline including data loading, batching, loss computation, and optimization
- Apply different text generation strategies including greedy decoding, temperature sampling, top-k, and nucleus (top-p) sampling
- Analyze what the model has learned by visualizing attention patterns and token embeddings
- Compare your model's performance with established models like GPT-2
- Understand the trade-offs between model size, training time, and generation quality
The transformer architecture, introduced in the seminal paper "Attention is All You Need" (Vaswani et al., 2017), revolutionized sequence modeling by replacing recurrent neural networks with self-attention mechanisms. GPT uses only the decoder portion of the transformer, removing the cross-attention layers used in encoder-decoder models.
Self-Attention: The core mechanism that allows each token to attend to other tokens in the sequence; in GPT, causal masking restricts this to the current and preceding tokens. The attention mechanism computes:
Attention(Q, K, V) = softmax(QK^T / sqrt(d_k))V
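For concreteness, here is a minimal PyTorch sketch of this computation (the function name and tensor shapes are illustrative, not a required interface):

```python
import math
import torch.nn.functional as F

def scaled_dot_product_attention(q, k, v):
    """Compute softmax(QK^T / sqrt(d_k)) V for batched inputs (a sketch)."""
    # q, k, v: (batch, heads, seq_len, d_k)
    d_k = q.size(-1)
    scores = q @ k.transpose(-2, -1) / math.sqrt(d_k)  # (batch, heads, seq_len, seq_len)
    weights = F.softmax(scores, dim=-1)                # each row is a distribution over keys
    return weights @ v                                 # weighted sum of value vectors
```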
Multi-Head Attention: Instead of a single attention mechanism, GPT uses multiple attention heads in parallel, each learning different types of relationships between tokens.
Causal Masking: Unlike BERT (which uses bidirectional attention), GPT applies a causal mask during attention computation to ensure that when predicting token i, the model can only attend to tokens at positions j < i. This makes the model autoregressive.
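A common implementation trick, sketched below as a variant of the function above, is to build a lower-triangular boolean mask and set the scores for future positions to negative infinity before the softmax, so those positions receive zero attention weight:

```python
import math
import torch
import torch.nn.functional as F

def causal_attention(q, k, v):
    """Scaled dot-product attention with a causal mask (a sketch)."""
    # q, k, v: (batch, heads, seq_len, d_k)
    seq_len, d_k = q.size(-2), q.size(-1)
    scores = q @ k.transpose(-2, -1) / math.sqrt(d_k)
    # True on and below the diagonal: each position may attend only to itself and earlier positions.
    allowed = torch.tril(torch.ones(seq_len, seq_len, dtype=torch.bool, device=q.device))
    scores = scores.masked_fill(~allowed, float("-inf"))  # block attention to future tokens
    return F.softmax(scores, dim=-1) @ v
```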
Position Encodings: Since the attention mechanism is position-agnostic, we must explicitly encode position information. This can be done with sinusoidal encodings (as in the original paper) or learned embeddings (as in GPT).
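As an illustration of the learned-embedding option (a sketch; `block_size` here denotes the maximum context length), positions are simply indexed into an embedding table and added to the token embeddings:

```python
import torch
import torch.nn as nn

class LearnedPositionalEmbedding(nn.Module):
    """GPT-style learned position embeddings (a sketch)."""
    def __init__(self, block_size, d_model):
        super().__init__()
        self.pos_emb = nn.Embedding(block_size, d_model)  # one learned vector per position

    def forward(self, token_embeddings):
        # token_embeddings: (batch, seq_len, d_model)
        seq_len = token_embeddings.size(1)
        positions = torch.arange(seq_len, device=token_embeddings.device)
        return token_embeddings + self.pos_emb(positions)  # broadcasts over the batch dimension
```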
Feed-Forward Networks: After attention, each position is processed by a position-wise feed-forward network (two linear transformations with a non-linearity in between).
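A sketch of this sub-layer, with the conventional 4x expansion and a GELU activation:

```python
import torch.nn as nn

class FeedForward(nn.Module):
    """Position-wise feed-forward network: expand, apply a non-linearity, project back (a sketch)."""
    def __init__(self, d_model, dropout=0.1):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(d_model, 4 * d_model),  # expand to 4x the model dimension
            nn.GELU(),
            nn.Linear(4 * d_model, d_model),  # project back down
            nn.Dropout(dropout),
        )

    def forward(self, x):
        return self.net(x)
```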
Layer Normalization and Residual Connections: These help with training stability and gradient flow through deep networks.
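Putting the pieces together, the sketch below shows one pre-norm transformer block. It uses PyTorch's built-in `nn.MultiheadAttention` purely for brevity; the assignment asks you to implement the attention mechanism yourself, and the class name and defaults here are illustrative.

```python
import torch
import torch.nn as nn

class TransformerBlock(nn.Module):
    """Pre-norm transformer block: LayerNorm -> sublayer -> residual add (a sketch)."""
    def __init__(self, d_model, n_heads, dropout=0.1):
        super().__init__()
        self.ln1 = nn.LayerNorm(d_model)
        self.attn = nn.MultiheadAttention(d_model, n_heads, dropout=dropout, batch_first=True)
        self.ln2 = nn.LayerNorm(d_model)
        self.ffn = nn.Sequential(
            nn.Linear(d_model, 4 * d_model), nn.GELU(),
            nn.Linear(4 * d_model, d_model), nn.Dropout(dropout),
        )

    def forward(self, x):
        # Boolean causal mask for nn.MultiheadAttention: True marks positions that must NOT be attended to.
        seq_len = x.size(1)
        mask = torch.triu(torch.ones(seq_len, seq_len, dtype=torch.bool, device=x.device), diagonal=1)
        h = self.ln1(x)
        attn_out, _ = self.attn(h, h, h, attn_mask=mask, need_weights=False)
        x = x + attn_out               # residual connection around attention
        x = x + self.ffn(self.ln2(x))  # residual connection around the feed-forward network
        return x
```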
- GPT (Radford et al., 2018): Demonstrated that pre-training + fine-tuning works well for NLP
- GPT-2 (Radford et al., 2019): Showed that larger models with more data can perform well zero-shot
- GPT-3 (Brown et al., 2020): Scaled to 175B parameters and exhibited impressive few-shot learning
- GPT-4 and beyond: Continued scaling with improved architectures and training techniques
You will implement, train, and analyze a GPT-style model. The assignment is divided into several interconnected tasks:
Build the core components of the GPT architecture:
a) Multi-Head Self-Attention with Causal Masking
- Implement the scaled dot-product attention mechanism
- Add causal masking to prevent attention to future tokens
- Support multiple attention heads
- Include dropout for regularization
b) Position-wise Feed-Forward Network
- Implement a two-layer MLP with GELU or ReLU activation
- Typically expands to 4x the model dimension and back
c) Positional Encoding
- Implement either learned positional embeddings (like GPT) or sinusoidal encodings (like the original Transformer)
- Justify your choice
d) Transformer Block
- Combine attention and feed-forward layers with residual connections and layer normalization
- Consider whether to use pre-norm or post-norm architecture
e) GPT Model
- Stack multiple transformer blocks
- Add token embedding layer
- Add final linear layer to project to vocabulary size
- Implement the forward pass with causal language modeling objective (a minimal skeleton is sketched just after this list)
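For reference, a minimal GPT skeleton might look like the sketch below. It assumes a `TransformerBlock` like the one sketched in the background section, and all hyperparameter names and defaults are illustrative rather than required.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GPT(nn.Module):
    """Minimal GPT skeleton (a sketch): embeddings -> stacked blocks -> final norm -> LM head."""
    def __init__(self, vocab_size, block_size, d_model=256, n_layers=6, n_heads=8, dropout=0.1):
        super().__init__()
        self.block_size = block_size
        self.tok_emb = nn.Embedding(vocab_size, d_model)   # token embedding layer
        self.pos_emb = nn.Embedding(block_size, d_model)   # learned positional embeddings
        self.drop = nn.Dropout(dropout)
        self.blocks = nn.ModuleList(
            [TransformerBlock(d_model, n_heads, dropout) for _ in range(n_layers)]
        )
        self.ln_f = nn.LayerNorm(d_model)                  # final layer norm (pre-norm convention)
        self.lm_head = nn.Linear(d_model, vocab_size, bias=False)

    def forward(self, idx, targets=None):
        # idx: (batch, seq_len) token ids; targets: same shape, shifted one position to the left
        seq_len = idx.size(1)
        pos = torch.arange(seq_len, device=idx.device)
        x = self.drop(self.tok_emb(idx) + self.pos_emb(pos))
        for block in self.blocks:
            x = block(x)
        logits = self.lm_head(self.ln_f(x))                # (batch, seq_len, vocab_size)
        loss = None
        if targets is not None:
            # Cross-entropy at every position: predict token t+1 from tokens <= t.
            loss = F.cross_entropy(logits.view(-1, logits.size(-1)), targets.view(-1))
        return logits, loss
```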
a) Data Preparation
- Choose a dataset (see suggestions below)
- Implement tokenization (you can use a pre-built tokenizer like GPT-2's BPE)
- Create data loaders with proper batching
- Handle variable-length sequences appropriately
b) Training Loop (a minimal sketch appears after this list)
- Implement the cross-entropy loss for language modeling
- Use AdamW optimizer with learning rate scheduling (consider cosine decay with warmup)
- Implement gradient clipping to prevent instability
- Track training and validation loss
- Save model checkpoints
c) Hyperparameter Selection
- Justify your choices for:
- Model size (number of layers, hidden dimension, number of heads)
- Batch size and sequence length
- Learning rate and schedule
- Dropout rates
- Training steps/epochs
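The sketch below illustrates the training-loop pieces listed above under assumed names: `model` follows the `(logits, loss)` interface from the earlier GPT sketch, and `train_data` is a 1-D tensor of token ids you would prepare yourself. Treat it as a starting point, not a reference implementation.

```python
import math
import torch

def get_batch(data, batch_size, block_size, device):
    """Sample random contiguous chunks from a 1-D tensor of token ids."""
    ix = torch.randint(len(data) - block_size - 1, (batch_size,))
    x = torch.stack([data[i:i + block_size] for i in ix])
    y = torch.stack([data[i + 1:i + 1 + block_size] for i in ix])  # targets are inputs shifted by one
    return x.to(device), y.to(device)

def lr_at_step(step, max_lr=3e-4, warmup=200, total_steps=5000, min_lr=3e-5):
    """Linear warmup followed by cosine decay (illustrative schedule)."""
    if step < warmup:
        return max_lr * (step + 1) / warmup
    progress = (step - warmup) / max(1, total_steps - warmup)
    return min_lr + 0.5 * (max_lr - min_lr) * (1 + math.cos(math.pi * progress))

def train(model, train_data, device, steps=5000, batch_size=16, block_size=256):
    optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4, weight_decay=0.1)
    model.train()
    for step in range(steps):
        for group in optimizer.param_groups:
            group["lr"] = lr_at_step(step, total_steps=steps)
        x, y = get_batch(train_data, batch_size, block_size, device)
        _, loss = model(x, y)                      # model returns (logits, loss) as in the sketch above
        optimizer.zero_grad(set_to_none=True)
        loss.backward()
        torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)  # gradient clipping for stability
        optimizer.step()
        if step % 100 == 0:
            print(f"step {step}: loss {loss.item():.3f}")
```

Validation evaluation and checkpoint saving would slot into the same loop (for example, every few hundred steps).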
Implement and compare different text generation methods:
a) Greedy Decoding: Always pick the most likely next token
b) Temperature Sampling: Sample from the probability distribution with adjustable temperature (higher = more random)
c) Top-k Sampling: Sample from only the k most likely tokens
d) Nucleus (Top-p) Sampling: Sample from the smallest set of tokens whose cumulative probability exceeds p
Generate samples with each method and analyze the quality, diversity, and coherence of the outputs. What works best for your model and dataset?
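One way to implement all four strategies in a single helper is sketched below; the function operates on the logits for the final position of your model's output, and its name and defaults are illustrative. With `temperature=0` it reduces to greedy decoding, and `top_k` and `top_p` can be used separately or together.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def sample_next_token(logits, temperature=1.0, top_k=None, top_p=None):
    """Pick the next token id from a 1-D (vocab_size,) logits vector (a sketch)."""
    if temperature == 0:
        return int(torch.argmax(logits))           # greedy decoding
    logits = logits / temperature                  # temperature scaling (higher = more random)
    if top_k is not None:
        kth = torch.topk(logits, top_k).values[-1]
        logits[logits < kth] = float("-inf")       # keep only the k most likely tokens
    if top_p is not None:
        sorted_logits, sorted_idx = torch.sort(logits, descending=True)
        cumprobs = torch.cumsum(F.softmax(sorted_logits, dim=-1), dim=-1)
        cutoff = cumprobs > top_p                  # tokens past the nucleus
        cutoff[1:] = cutoff[:-1].clone()           # shift right so the boundary token is kept
        cutoff[0] = False                          # always keep the single most likely token
        logits[sorted_idx[cutoff]] = float("-inf")
    probs = F.softmax(logits, dim=-1)
    return int(torch.multinomial(probs, num_samples=1))
```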
a) Attention Pattern Visualization
- Extract and visualize attention weights from different layers and heads
- Analyze what patterns different heads learn (e.g., do some attend to previous token, to same word type, to syntactic patterns?)
- Create visualizations showing attention patterns for interesting examples (a plotting sketch appears after this list)
b) Token Embedding Analysis
- Extract the learned token embeddings
- Visualize them in 2D/3D space using UMAP or t-SNE
- Identify clusters and analyze which tokens are embedded close together
- Does the model learn meaningful semantic relationships?
c) Probing Tasks (Optional)
- Design simple probing classifiers to test what linguistic information is encoded in the representations
- Examples: POS tagging, syntactic dependencies, semantic relationships
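A plotting sketch for both analyses is shown below. The inputs `attn`, `tokens`, and `embeddings` are assumptions: an attention-weight matrix for one layer/head, the corresponding token strings, and the token-embedding matrix as a NumPy array, all extracted from your own model.

```python
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

def plot_attention(attn, tokens):
    """Heatmap of one head's attention weights; rows = query positions, columns = key positions."""
    fig, ax = plt.subplots(figsize=(6, 6))
    ax.imshow(attn, cmap="viridis")
    ax.set_xticks(range(len(tokens)))
    ax.set_xticklabels(tokens, rotation=90)
    ax.set_yticks(range(len(tokens)))
    ax.set_yticklabels(tokens)
    ax.set_title("Attention weights (one layer, one head)")
    fig.tight_layout()
    plt.show()

def plot_embeddings(embeddings, labels, n_points=500):
    """2-D t-SNE projection of a subset of the learned token embeddings."""
    coords = TSNE(n_components=2, init="pca", perplexity=30).fit_transform(embeddings[:n_points])
    plt.figure(figsize=(8, 8))
    plt.scatter(coords[:, 0], coords[:, 1], s=5)
    for (x, y), label in zip(coords, labels[:n_points]):
        plt.annotate(label, (x, y), fontsize=6)   # label a subset of tokens to keep the plot readable
    plt.title("t-SNE of learned token embeddings")
    plt.show()
```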
a) Load a Pre-trained Model
- Load a comparable pre-trained model (e.g., GPT-2 small, DistilGPT-2)
- Ensure fair comparison (similar size, similar data domain if possible)
b) Quantitative Comparison
- Compare perplexity on a held-out test set (a perplexity sketch appears after this list)
- Compare generation quality using metrics like:
- Perplexity
- Self-BLEU (for diversity)
- Human evaluation (optional but recommended)
c) Qualitative Analysis
- Generate text from the same prompts with both models
- Analyze differences in coherence, creativity, factual accuracy, and fluency
- What are the strengths and weaknesses of your model?
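For the pre-trained baseline, perplexity can be estimated roughly as sketched below; this chunks the test text and averages per-chunk loss, which is a coarse approximation, and `test_text` is an assumed string of held-out data. Evaluating your own model works the same way once you can compute its mean cross-entropy on the same text.

```python
import math
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

@torch.no_grad()
def gpt2_perplexity(test_text, model_name="gpt2", block_size=512,
                    device="cuda" if torch.cuda.is_available() else "cpu"):
    """Rough perplexity of a pre-trained causal LM on a held-out string (a sketch)."""
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForCausalLM.from_pretrained(model_name).to(device).eval()
    ids = tokenizer(test_text, return_tensors="pt").input_ids.to(device)
    losses = []
    for start in range(0, ids.size(1) - 1, block_size):
        chunk = ids[:, start:start + block_size + 1]
        if chunk.size(1) < 2:
            break
        # Passing labels=chunk makes the model compute the shifted cross-entropy loss internally.
        out = model(chunk, labels=chunk)
        losses.append(out.loss.item())
    return math.exp(sum(losses) / len(losses))    # perplexity = exp(mean cross-entropy)
```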
Choose one of the following datasets or propose your own:
a) Code Generation
- Python code from GitHub (filtered for quality)
- Allows you to build a code-generation model
- Can evaluate on code completion tasks
b) Stories/Creative Writing
- WritingPrompts dataset from Reddit
- Children's stories
- Fan fiction or short stories
c) Domain-Specific Text
- Scientific papers (arXiv abstracts)
- News articles
- Wikipedia articles
- Poetry or song lyrics
d) Dialogue
- Movie scripts
- Reddit conversations
- Chat logs (appropriately anonymized)
e) Shakespeare or Classic Literature
- Smaller dataset but rich language
- Good for initial testing and rapid iteration
Choose a dataset that interests you and that's appropriate for the model size you can feasibly train. Remember: you don't need billions of tokens to build something impressive!
You have flexibility in how you approach this assignment:
Implement everything from first principles using PyTorch or JAX:
- Full control over every component
- Deepest learning experience
- More time-intensive but most rewarding
Andrej Karpathy's nanoGPT provides a clean, minimal implementation:
- Start with the codebase and understand every line
- Modify and extend it for your needs
- Add analysis and visualization components
- Good balance of learning and efficiency
Use the HuggingFace library but implement key components yourself:
- Use their data loading and training utilities
- Implement your own attention mechanism or transformer block
- Focus more on experiments and analysis
- Faster to get results but less deep implementation experience
Whichever option you choose, you must demonstrate deep understanding of how the model works. Simply calling library functions without explanation is insufficient.
Your model should have at minimum:
- 4-12 transformer layers (start small, scale up if resources allow)
- 4-8 attention heads
- Hidden dimension of 256-768 (depending on your computational resources)
- Vocabulary size appropriate for your tokenizer (typically 10K-50K)
- Context window of at least 128 tokens (256-512 preferred)
- Train for enough steps to see meaningful convergence (exact number depends on dataset and model size)
- Monitor both training and validation loss
- Implement early stopping or checkpointing based on validation performance
- Track key metrics: loss, perplexity, tokens/second
- Should be trainable on Google Colab with a GPU; use gradient accumulation if needed for larger effective batch sizes (a sketch appears after this list)
- Well-organized, modular code with clear function/class names
- Docstrings for all major functions and classes
- Type hints where appropriate
- Efficient implementation (vectorized operations, no unnecessary loops)
- Memory-efficient (handle large datasets, consider gradient checkpointing if needed)
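A sketch of gradient accumulation, reusing the assumed `model`, `optimizer`, `get_batch`, `train_data`, and `device` names from the training-loop sketch earlier: gradients from several small batches are accumulated before a single optimizer step, so the effective batch size becomes `batch_size * accum_steps` even on a single Colab GPU.

```python
import torch

accum_steps = 8                                   # effective batch size = 4 * 8 = 32 sequences
optimizer.zero_grad(set_to_none=True)
for micro_step in range(accum_steps):
    x, y = get_batch(train_data, batch_size=4, block_size=256, device=device)
    _, loss = model(x, y)
    (loss / accum_steps).backward()               # scale so the accumulated gradient matches a large-batch average
torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)
optimizer.step()
```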
Submit a Google Colaboratory notebook that includes:
- Complete, working implementation of the GPT model
- All transformer components clearly implemented and explained
- Training pipeline with proper data loading, optimization, and checkpointing
- Text generation functions with multiple sampling strategies
- Code should run without errors on a fresh Colab instance
- Training curves showing loss over time
- Validation metrics demonstrating the model learns
- Hyperparameter choices with justifications
- Discussion of training dynamics (convergence, stability, any issues encountered)
- Final model checkpoint or clear instructions to reproduce training
- Multiple examples of generated text using different sampling strategies
- Analysis of generation quality, coherence, and diversity
- Comparison between different sampling methods
- Discussion of failure modes and interesting behaviors
- Attention pattern visualizations with interpretation
- Token embedding visualizations with analysis
- Evidence of what linguistic patterns the model learned
- Insights into what different layers and heads specialize in
- Quantitative comparison with a pre-trained model (perplexity, other metrics)
- Qualitative comparison of generated text
- Honest assessment of your model's strengths and limitations
- Discussion of what would improve performance
- Clear explanations of your approach and design decisions
- Discussion of challenges faced and how you overcame them
- Reflections on what you learned
- All visualizations should have captions and interpretations
Your assignment will be evaluated on:
- Correctness of Implementation (30%)
  - Model architecture correctly implements GPT design
  - Attention mechanisms properly use causal masking
  - Training loop correctly computes loss and updates parameters
  - No critical bugs or errors
- Quality of Training and Results (25%)
  - Model successfully trains and converges
  - Reasonable hyperparameter choices
  - Generated text demonstrates learning
  - Proper evaluation on validation set
- Depth of Analysis (25%)
  - Thoughtful examination of attention patterns
  - Meaningful visualization and interpretation
  - Insightful comparison with baseline model
  - Understanding of model behavior and limitations
- Code Quality and Documentation (10%)
  - Clean, well-organized code
  - Comprehensive markdown explanations
  - Clear documentation of design choices
  - Reproducible results
- Creativity and Insight (10%)
  - Interesting dataset choice or experiments
  - Novel visualizations or analyses
  - Thoughtful discussion of results
  - Extensions beyond basic requirements
If you finish the core assignment and want to push further, consider these extensions:
- Different positional encodings: Try RoPE (Rotary Position Embeddings) or ALiBi
- Alternative architectures: Implement sparse attention, sliding window attention, or mixture of experts
- Model scaling: Try different model sizes and analyze scaling laws
- Architecture improvements: Add techniques like Flash Attention, grouped query attention, or RMSNorm
- Curriculum learning: Start with shorter sequences and gradually increase length
- Mixed precision training: Use fp16 or bfloat16 for faster training
- Distributed training: Train across multiple GPUs if available
- Data filtering: Implement quality filtering for your dataset
- Fine-tuning: Fine-tune your model on a downstream task
- RLHF (Reinforcement Learning from Human Feedback): Implement a simple version with a reward model
- Instruction tuning: Add instruction-following capabilities
- Benchmark evaluation: Test on standard NLP benchmarks
- Conditional generation: Add control codes for style, topic, or format
- Retrieval augmented generation: Combine with a retrieval system
- Multimodal: Add image inputs (very ambitious!)
- Attention intervention: Modify attention patterns and observe effects
- Causal tracing: Identify which components are responsible for specific predictions
- Feature visualization: What patterns activate specific neurons?
- "Attention is All You Need" (Vaswani et al., 2017) - The transformer paper
- "Improving Language Understanding by Generative Pre-Training" (Radford et al., 2018) - Original GPT
- "Language Models are Unsupervised Multitask Learners" (Radford et al., 2019) - GPT-2
- "Language Models are Few-Shot Learners" (Brown et al., 2020) - GPT-3
- Andrej Karpathy's "Let's build GPT: from scratch, in code, spelled out"
  - YouTube Video
  - Excellent step-by-step implementation guide
- HuggingFace Course: Transformer Models
  - Course Link
  - Comprehensive coverage of transformers and applications
- nanoGPT by Andrej Karpathy
  - GitHub
  - Clean, minimal GPT implementation in PyTorch
- minGPT by Andrej Karpathy
  - GitHub
  - Another minimal GPT implementation with good documentation
- HuggingFace Transformers
- The Illustrated Transformer by Jay Alammar
  - Blog Post
  - Excellent visual explanations
- The Illustrated GPT-2 by Jay Alammar
  - Blog Post
  - Visual guide to GPT-2 architecture
- "BERT: Pre-training of Deep Bidirectional Transformers" (Devlin et al., 2018)
  - Good contrast with GPT's unidirectional approach
  - Paper
- "The Pile: An 800GB Dataset of Diverse Text" (Gao et al., 2020)
  - Understanding pre-training datasets
  - Paper
- "Scaling Laws for Neural Language Models" (Kaplan et al., 2020)
  - Understanding how model size affects performance
  - Paper
- "Training Compute-Optimal Large Language Models" (Hoffmann et al., 2022) - Chinchilla
  - Optimal model size vs. training tokens trade-offs
  - Paper
- PyTorch: pytorch.org
- HuggingFace Transformers: huggingface.co/transformers
- Tokenizers: huggingface.co/docs/tokenizers
- Weights & Biases: For experiment tracking wandb.ai
- BERTViz: For visualizing attention github.com/jessevig/bertviz
- Use GenAI aggressively: Claude, ChatGPT, and GitHub Copilot are your friends. Use them to:
- Implement transformer components you haven't built before
- Debug errors and understand PyTorch behavior
- Write boilerplate code and data processing pipelines
- Explain unfamiliar concepts or papers
- Start small: Begin with a tiny model (2-4 layers, small hidden size) to debug your implementation quickly
- Validate components: Test each component individually before assembling the full model
- Monitor carefully: Watch for NaN losses, exploding gradients, or other training instabilities
- Use gradient clipping: This prevents exploding gradients in early training
- Overfit a small batch first: Ensure your model can memorize a tiny amount of data before scaling up (a sanity-check sketch appears after this list)
- Compare with references: If results seem off, compare your implementation with nanoGPT or other references
- Save often: Checkpointing is critical—you don't want to lose hours of training
- Document everything: Future you (and the grader) will thank you for clear explanations
- Have fun: This is a challenging but incredibly rewarding assignment!
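A sketch of that sanity check, assuming your model follows the `(logits, loss)` interface from the earlier sketches and you have one small fixed batch `x`, `y`: the loss should fall close to zero within a few hundred steps, and if it does not, suspect a bug in the masking, shapes, or loss computation.

```python
import torch

def overfit_one_batch(model, x, y, steps=300, lr=1e-3):
    """Repeatedly fit a single fixed batch; a correct model should memorize it quickly."""
    optimizer = torch.optim.AdamW(model.parameters(), lr=lr)
    model.train()
    for step in range(steps):
        _, loss = model(x, y)
        optimizer.zero_grad(set_to_none=True)
        loss.backward()
        optimizer.step()
        if step % 50 == 0:
            print(f"step {step}: loss {loss.item():.4f}")
```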
This assignment is submitted via GitHub Classroom. Follow these steps:
1. Accept the assignment: Click the assignment link provided in Canvas or by your instructor
   - Repository: github.com/ContextLab/gpt-llm-course
   - This creates your own private repository for the assignment
2. Clone your repository:
   git clone https://github.com/ContextLab/gpt-llm-course-YOUR_USERNAME.git
3. Complete your work:
   - Work in Google Colab, Jupyter, or your preferred environment
   - Save your notebook to the repository
4. Commit and push your changes:
   git add .
   git commit -m "Complete GPT assignment"
   git push
5. Verify submission: Check that your latest commit appears in your GitHub repository before the deadline
Deadline: February 13, 2026 at 11:59 PM EST
Submit a single Google Colaboratory notebook that:
- Runs without errors on a clean Colab instance with GPU runtime
- Automatically downloads/installs any required dependencies
- Can load your trained model checkpoint (upload to Google Drive or HuggingFace Hub)
- Contains comprehensive markdown cells explaining every step
- Includes all code for implementation, training, generation, and analysis
- Shows all visualizations and results inline
- Demonstrates clear understanding of transformer architecture and training dynamics
Note on Training: If your model takes too long to train in the notebook, you can train it separately and load the checkpoint in your notebook. However, your notebook should include all training code and explain your training process thoroughly.
You may:
- Use GenAI tools (Claude, ChatGPT, Copilot) to help with implementation and understanding
- Reference implementations like nanoGPT as learning resources
- Collaborate with classmates on conceptual understanding
- Ask questions in office hours or on discussion forums
You must:
- Write and understand all code you submit
- Properly cite any code you adapt from other sources
- Do your own analysis and write your own explanations
- Ensure your trained model is your own work
Do not:
- Copy entire implementations without understanding them
- Submit someone else's analysis or visualizations as your own
- Plagiarize explanations from other sources
This assignment is about learning. Use all available tools to learn deeply, but ensure the final submission represents your own understanding and effort.
Good luck! Building a GPT model is a rite of passage for anyone serious about understanding modern AI. Enjoy the journey, and don't hesitate to reach out if you get stuck.