Reproducing the GPT-2 (124M) Transformer model from scratch in PyTorch. This project is built for learning, experimentation, and extending transformer models, using Andrej Karpathy's Zero-To-Hero series as a learning reference.
- Full GPT-style (decoder-only) Transformer architecture: multi-head attention, layer norm, residual connections, and position embeddings (see the attention sketch after this list).
- BPE tokenizer implementation (see the merge-step sketch below).
- Training pipeline with configs, logging, and checkpointing (see the training-loop sketch below).
- Experiments on small datasets (Shakespeare, WikiText) and scaling toward OpenWebText.
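For orientation, here is a minimal sketch of the causal multi-head self-attention at the heart of a GPT-style block. The module and hyperparameter names (`CausalSelfAttention`, `n_embd`, `n_head`, `block_size`) are illustrative defaults matching the GPT-2 124M configuration, not necessarily the names used in this codebase.

```python
import math
import torch
import torch.nn as nn
import torch.nn.functional as F

class CausalSelfAttention(nn.Module):
    """Multi-head self-attention with a causal mask (GPT-2 style)."""

    def __init__(self, n_embd: int = 768, n_head: int = 12, block_size: int = 1024):
        super().__init__()
        assert n_embd % n_head == 0
        self.n_head = n_head
        self.n_embd = n_embd
        # One projection produces queries, keys, and values for all heads at once.
        self.c_attn = nn.Linear(n_embd, 3 * n_embd)
        self.c_proj = nn.Linear(n_embd, n_embd)
        # Lower-triangular mask: each position attends only to itself and earlier positions.
        self.register_buffer(
            "mask",
            torch.tril(torch.ones(block_size, block_size)).view(1, 1, block_size, block_size),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        B, T, C = x.size()  # batch, sequence length, embedding dim
        q, k, v = self.c_attn(x).split(self.n_embd, dim=2)
        # Reshape to (B, n_head, T, head_dim) so attention runs per head.
        q = q.view(B, T, self.n_head, C // self.n_head).transpose(1, 2)
        k = k.view(B, T, self.n_head, C // self.n_head).transpose(1, 2)
        v = v.view(B, T, self.n_head, C // self.n_head).transpose(1, 2)
        # Scaled dot-product attention with the causal mask applied.
        att = (q @ k.transpose(-2, -1)) / math.sqrt(k.size(-1))
        att = att.masked_fill(self.mask[:, :, :T, :T] == 0, float("-inf"))
        att = F.softmax(att, dim=-1)
        y = att @ v
        # Merge heads back together and project to the residual stream.
        y = y.transpose(1, 2).contiguous().view(B, T, C)
        return self.c_proj(y)
```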
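A BPE tokenizer is built around one core operation: repeatedly count adjacent token-id pairs and merge the most frequent pair into a new id. The sketch below shows that step on a toy string; the function names are illustrative and not the project's actual API.

```python
from collections import Counter

def get_pair_counts(ids: list[int]) -> Counter:
    """Count how often each adjacent pair of token ids occurs."""
    return Counter(zip(ids, ids[1:]))

def merge_pair(ids: list[int], pair: tuple[int, int], new_id: int) -> list[int]:
    """Replace every occurrence of `pair` in `ids` with the new token id."""
    out, i = [], 0
    while i < len(ids):
        if i < len(ids) - 1 and (ids[i], ids[i + 1]) == pair:
            out.append(new_id)
            i += 2
        else:
            out.append(ids[i])
            i += 1
    return out

# Toy training loop: start from raw UTF-8 bytes and learn a few merges.
text = "low lower lowest"
ids = list(text.encode("utf-8"))
merges = {}
for new_id in range(256, 261):  # learn 5 merges for illustration
    counts = get_pair_counts(ids)
    if not counts:
        break
    pair = counts.most_common(1)[0][0]
    ids = merge_pair(ids, pair, new_id)
    merges[pair] = new_id
print(merges)  # learned (pair -> new id) merge rules
```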
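The training pipeline follows a standard loop of forward pass, backward pass, optimizer step, periodic logging, and checkpointing. Below is a minimal sketch under a few assumptions: the model returns `(logits, loss)` when targets are passed, and the checkpoint path and logging interval are placeholders rather than the project's real config values.

```python
import torch

def train(model, loader, optimizer, device="cuda",
          ckpt_path="checkpoints/gpt2_124m.pt", log_every=100):
    """Minimal training loop with periodic logging and checkpointing."""
    model.to(device)
    model.train()
    for step, (x, y) in enumerate(loader):
        x, y = x.to(device), y.to(device)
        _, loss = model(x, y)  # assumption: model returns (logits, loss) given targets
        optimizer.zero_grad(set_to_none=True)
        loss.backward()
        # Gradient clipping, a common stabilizer in GPT-2-style training recipes.
        torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)
        optimizer.step()
        if step % log_every == 0:
            print(f"step {step}: loss {loss.item():.4f}")
            torch.save(
                {"model": model.state_dict(),
                 "optimizer": optimizer.state_dict(),
                 "step": step},
                ckpt_path,
            )
```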