A clean, from-scratch implementation of a decoder-only Transformer (GPT-style) language model in PyTorch, trained on Tiny Shakespeare at the character level.
This project focuses on understanding and implementing the core architecture of GPT, including masked self-attention, multi-head attention, residual connections, layer normalization, MLP blocks, weight tying, and autoregressive generation.
It is designed as a learning-oriented yet well-structured implementation, inspired by:
- Attention Is All You Need (Vaswani et al., 2017)
- GPT / GPT-2 architecture (Radford et al.)
- Andrej Karpathy’s nanoGPT and lecture series
This implementation builds a decoder-only Transformer with:
- Token embeddings
- Learned positional embeddings
- Stacked Transformer blocks
- Multi-head masked self-attention
- Feedforward MLP with GELU
- Residual connections
- Pre-layer normalization
- Dropout regularization
- Weight tying between input embeddings and output projection
- Autoregressive generation with temperature and top-k sampling
The model is configured through a small dataclass:

```python
from dataclasses import dataclass

@dataclass
class GPTConfig:
    vocab_size: int
    block_size: int
    n_layer: int = 6
    n_head: int = 6
    n_embd: int = 384
    dropout: float = 0.2
    bias: bool = True
```

These defaults correspond to a ~10M-parameter GPT model when trained on Tiny Shakespeare.
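Putting these pieces together, the top-level wiring of the model might look like the sketch below. The class name `MiniGPT` and its attributes are illustrative (not necessarily those used in `models/gpt.py`), and `Block` refers to the Transformer block described further down:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MiniGPT(nn.Module):
    """Illustrative decoder-only Transformer skeleton (names are hypothetical)."""

    def __init__(self, config: GPTConfig):
        super().__init__()
        self.config = config
        self.tok_emb = nn.Embedding(config.vocab_size, config.n_embd)   # token embeddings
        self.pos_emb = nn.Embedding(config.block_size, config.n_embd)   # learned positional embeddings
        self.drop = nn.Dropout(config.dropout)
        self.blocks = nn.ModuleList([Block(config) for _ in range(config.n_layer)])
        self.ln_f = nn.LayerNorm(config.n_embd)
        self.lm_head = nn.Linear(config.n_embd, config.vocab_size, bias=False)
        self.lm_head.weight = self.tok_emb.weight                       # weight tying with the input embeddings

    def forward(self, idx, targets=None):
        B, T = idx.shape
        pos = torch.arange(T, device=idx.device)
        x = self.drop(self.tok_emb(idx) + self.pos_emb(pos))            # (B, T, n_embd)
        for block in self.blocks:
            x = block(x)
        x = self.ln_f(x)
        logits = self.lm_head(x)                                        # (B, T, vocab_size)
        loss = None
        if targets is not None:
            loss = F.cross_entropy(logits.view(-1, logits.size(-1)), targets.view(-1))
        return logits, loss
```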
Character-level modeling on Tiny Shakespeare.
Dataset stats:
- Total characters (tokens): ~1.1M
- Vocabulary size: 65 unique characters
- Train split: 90%
- Validation split: 10%
Since this is character-level:
1 character = 1 token. No BPE or subword tokenization is used.
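The character-level tokenizer and the 90/10 split can be written in a few lines. This sketch assumes the raw text lives at `data/shakespeare_char/input.txt` (see the project layout below):

```python
import torch

with open("data/shakespeare_char/input.txt", "r", encoding="utf-8") as f:
    text = f.read()

chars = sorted(set(text))                      # 65 unique characters for Tiny Shakespeare
stoi = {ch: i for i, ch in enumerate(chars)}   # char -> integer id
itos = {i: ch for ch, i in stoi.items()}

encode = lambda s: [stoi[c] for c in s]        # string -> list of token ids
decode = lambda ids: "".join(itos[i] for i in ids)

data = torch.tensor(encode(text), dtype=torch.long)
n = int(0.9 * len(data))                       # 90% train / 10% validation
train_data, val_data = data[:n], data[n:]
```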
Each block contains:
- Multi-Head Masked Self-Attention
- Feedforward MLP (4× expansion)
- Residual connections
- Pre-LayerNorm
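In pre-LayerNorm form, each sub-layer is normalized before it is applied, and its output is added back to the residual stream. A minimal sketch, using the attention and MLP modules sketched in the following sections (class names are illustrative):

```python
import torch.nn as nn

class Block(nn.Module):
    """Pre-LayerNorm Transformer block: x = x + attn(ln(x)); x = x + mlp(ln(x))."""

    def __init__(self, config):
        super().__init__()
        self.ln1 = nn.LayerNorm(config.n_embd)
        self.attn = MaskedMultiHeadAttention(config)  # sketched below
        self.ln2 = nn.LayerNorm(config.n_embd)
        self.mlp = MLP(config)                        # sketched below

    def forward(self, x):
        x = x + self.attn(self.ln1(x))  # residual connection around attention
        x = x + self.mlp(self.ln2(x))   # residual connection around the MLP
        return x
```

The attention sub-layer itself consists of: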
- Linear projections for Q, K, V
- Causal mask via lower-triangular matrix
- Scaled dot-product attention
- Softmax + dropout
- Concatenation across heads
- Output projection
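A minimal sketch of this attention module, assuming three separate full-width Linear projections for Q, K, and V (class and attribute names are illustrative; the repository's actual module may differ, e.g. by fusing the projections as noted under future work):

```python
import math
import torch
import torch.nn as nn
import torch.nn.functional as F

class MaskedMultiHeadAttention(nn.Module):
    """Illustrative multi-head masked self-attention."""

    def __init__(self, config):
        super().__init__()
        assert config.n_embd % config.n_head == 0
        self.n_head = config.n_head
        self.head_dim = config.n_embd // config.n_head
        self.query = nn.Linear(config.n_embd, config.n_embd, bias=config.bias)
        self.key = nn.Linear(config.n_embd, config.n_embd, bias=config.bias)
        self.value = nn.Linear(config.n_embd, config.n_embd, bias=config.bias)
        self.proj = nn.Linear(config.n_embd, config.n_embd, bias=config.bias)
        self.attn_drop = nn.Dropout(config.dropout)
        self.resid_drop = nn.Dropout(config.dropout)
        # causal mask: lower-triangular matrix of ones
        self.register_buffer(
            "mask", torch.tril(torch.ones(config.block_size, config.block_size))
        )

    def forward(self, x):
        B, T, C = x.shape
        # split the embedding into heads: (B, n_head, T, head_dim)
        q = self.query(x).view(B, T, self.n_head, self.head_dim).transpose(1, 2)
        k = self.key(x).view(B, T, self.n_head, self.head_dim).transpose(1, 2)
        v = self.value(x).view(B, T, self.n_head, self.head_dim).transpose(1, 2)
        # scaled dot-product attention with causal masking
        att = (q @ k.transpose(-2, -1)) / math.sqrt(self.head_dim)
        att = att.masked_fill(self.mask[:T, :T] == 0, float("-inf"))
        att = self.attn_drop(F.softmax(att, dim=-1))
        y = att @ v                                       # (B, n_head, T, head_dim)
        # concatenate heads and project back to n_embd
        y = y.transpose(1, 2).contiguous().view(B, T, C)
        return self.resid_drop(self.proj(y))
```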
The feedforward MLP consists of:
- Linear: n_embd → 4 * n_embd
- GELU activation
- Linear: 4 * n_embd → n_embd
- Dropout
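As a sketch (hypothetical class name), the feedforward block is simply:

```python
import torch.nn as nn

class MLP(nn.Module):
    """Position-wise feedforward block with 4x expansion (illustrative)."""

    def __init__(self, config):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(config.n_embd, 4 * config.n_embd, bias=config.bias),
            nn.GELU(),
            nn.Linear(4 * config.n_embd, config.n_embd, bias=config.bias),
            nn.Dropout(config.dropout),
        )

    def forward(self, x):
        return self.net(x)
```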
- Decoder-only Transformer
- Multi-head masked self-attention
- Scaled dot-product attention
- Residual connections
- Pre-LayerNorm
- GELU activation
- Dropout
- Weight tying
- Custom GPT-style weight initialization (Normal(0, 0.02))
- Cross-entropy loss
- Batch sampling
- Train/validation split
- AdamW optimizer
- Loss estimation loop
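A typical batching and evaluation setup for this configuration might look like the following sketch. The function names and the `batch_size`/`eval_iters` values are illustrative (the actual `train.py` may differ), and `train_data`/`val_data` are the tensors from the data-preparation sketch above:

```python
import torch

batch_size, block_size = 64, 256
device = "cuda" if torch.cuda.is_available() else "cpu"

def get_batch(split):
    """Sample random contiguous chunks of length block_size and their next-char targets."""
    data = train_data if split == "train" else val_data
    ix = torch.randint(len(data) - block_size, (batch_size,))
    x = torch.stack([data[i : i + block_size] for i in ix])
    y = torch.stack([data[i + 1 : i + block_size + 1] for i in ix])
    return x.to(device), y.to(device)

@torch.no_grad()
def estimate_loss(model, eval_iters=200):
    """Average cross-entropy over a fixed number of batches for train and val."""
    model.eval()
    out = {}
    for split in ("train", "val"):
        losses = torch.zeros(eval_iters)
        for k in range(eval_iters):
            xb, yb = get_batch(split)
            _, loss = model(xb, yb)
            losses[k] = loss.item()
        out[split] = losses.mean().item()
    model.train()
    return out
```

The optimizer is then simply `torch.optim.AdamW(model.parameters(), lr=3e-4)` (the learning rate shown is illustrative).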
- Autoregressive decoding
- Context cropping to block_size
- Temperature scaling
- Top-k sampling
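Generation repeatedly crops the running context to `block_size`, scales the last-position logits by the temperature, optionally keeps only the top-k logits, and samples the next token. A hedged sketch (the function signature is illustrative):

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def generate(model, idx, max_new_tokens, block_size, temperature=1.0, top_k=None):
    """Autoregressively extend the (B, T) index tensor idx by max_new_tokens tokens."""
    for _ in range(max_new_tokens):
        idx_cond = idx[:, -block_size:]                        # crop context to block_size
        logits, _ = model(idx_cond)
        logits = logits[:, -1, :] / temperature                # last position, temperature scaling
        if top_k is not None:
            v, _ = torch.topk(logits, min(top_k, logits.size(-1)))
            logits[logits < v[:, [-1]]] = float("-inf")        # mask everything outside top-k
        probs = F.softmax(logits, dim=-1)
        idx_next = torch.multinomial(probs, num_samples=1)     # sample next token
        idx = torch.cat((idx, idx_next), dim=1)
    return idx
```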
```
.
├── data
│   └── shakespeare_char
│       └── input.txt
├── notebooks
│   ├── lm1.ipynb
│   └── lm2.ipynb
├── src
│   └── gpt_trainer
│       ├── data
│       │   └── prepare.py
│       ├── models
│       │   ├── bigram.py
│       │   ├── gpt.py
│       │   └── __init__.py
│       └── train.py
├── pyproject.toml
├── requirements.txt
└── README.md
```
- `models/gpt.py` → Full Transformer architecture
- `train.py` → Training loop and evaluation
- `data/prepare.py` → Dataset processing
- `bigram.py` → Minimal baseline language model
Given:
- n_layer = 6
- n_head = 6
- n_embd = 384
- vocab_size = 65
- block_size = 256
Approximate parameter count:
- Transformer blocks: ~10.6M
- Token embeddings: ~25K
- Positional embeddings: ~98K
- LM head: tied with embedding
Total ≈ 10.7M parameters
Most parameters come from:
- MLP layers
- Attention projection matrices
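A quick back-of-the-envelope check of the numbers above (biases and LayerNorm parameters are omitted, so the result slightly undercounts):

```python
n_embd, n_layer = 384, 6
vocab_size, block_size = 65, 256

attn = 4 * n_embd * n_embd            # Q, K, V and output projections
mlp = 2 * (n_embd * 4 * n_embd)       # two linear layers with 4x expansion
per_block = attn + mlp                # ≈ 1.77M
blocks = n_layer * per_block          # ≈ 10.6M
tok_emb = vocab_size * n_embd         # ≈ 25K (shared with the LM head via weight tying)
pos_emb = block_size * n_embd         # ≈ 98K
total = blocks + tok_emb + pos_emb
print(f"{total:,}")                   # 10,740,096 ≈ 10.7M
```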
This project reinforces:
- Autoregressive language modeling
- Causal masking
- Multi-head attention mechanics
- Parameter sharing (weight tying)
- Transformer scaling intuition
- Initialization strategy for deep networks
- Optimizer behavior (AdamW)
- Batch construction for language modeling
- Loss estimation for train/val
- Replace per-head Linear projections with a single fused QKV projection
- Implement an efficient `CausalSelfAttention` module (see the sketch below)
- Add Flash Attention support (PyTorch ≥ 2.0)
- Add optional bias-free LayerNorm
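For reference, the fused projection and Flash Attention path could look like the sketch below, built on `torch.nn.functional.scaled_dot_product_attention` (available in PyTorch ≥ 2.0); class and attribute names are illustrative:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CausalSelfAttention(nn.Module):
    """Fused-QKV causal attention using PyTorch's scaled_dot_product_attention."""

    def __init__(self, config):
        super().__init__()
        assert config.n_embd % config.n_head == 0
        self.n_head = config.n_head
        self.c_attn = nn.Linear(config.n_embd, 3 * config.n_embd, bias=config.bias)  # fused Q, K, V
        self.c_proj = nn.Linear(config.n_embd, config.n_embd, bias=config.bias)
        self.dropout = config.dropout

    def forward(self, x):
        B, T, C = x.shape
        q, k, v = self.c_attn(x).split(C, dim=2)
        q = q.view(B, T, self.n_head, C // self.n_head).transpose(1, 2)
        k = k.view(B, T, self.n_head, C // self.n_head).transpose(1, 2)
        v = v.view(B, T, self.n_head, C // self.n_head).transpose(1, 2)
        # is_causal=True applies the lower-triangular mask; Flash kernels are used when available
        y = F.scaled_dot_product_attention(
            q, k, v, dropout_p=self.dropout if self.training else 0.0, is_causal=True
        )
        y = y.transpose(1, 2).contiguous().view(B, T, C)
        return self.c_proj(y)
```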
- Implement a BPE tokenizer
- Integrate `tiktoken`
- Compare character-level vs. subword modeling
- Learning rate scheduler
- Gradient clipping
- Mixed precision training
- Fused AdamW
- Checkpoint saving and loading
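As an illustration of how some of the training-loop items above could be wired in with standard PyTorch APIs (a sketch only; hyperparameters and names such as `max_iters` are placeholders, and `model`/`get_batch` come from the earlier sketches):

```python
import torch

max_iters = 5000                                    # placeholder iteration budget
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4, weight_decay=0.1)
scaler = torch.cuda.amp.GradScaler()                # loss scaling for fp16 mixed precision

for step in range(max_iters):
    xb, yb = get_batch("train")
    with torch.autocast(device_type="cuda", dtype=torch.float16):
        _, loss = model(xb, yb)                     # forward pass in mixed precision
    optimizer.zero_grad(set_to_none=True)
    scaler.scale(loss).backward()
    scaler.unscale_(optimizer)                      # unscale gradients before clipping
    torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)   # gradient clipping
    scaler.step(optimizer)
    scaler.update()
```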
- GPT-2 small configuration
- Parameter count benchmarking
- Throughput benchmarking
- Separate ModelConfig and TrainingConfig
- CLI interface for training
- Structured logging
- Model export
This implementation builds the conceptual foundation of GPT:
- Transformer decoder stack
- Causal masked attention
- Deep residual architecture
- Autoregressive token prediction
It prioritizes architectural understanding over production optimization.
- Vaswani et al., Attention Is All You Need (2017)
- Karpathy, nanoGPT
- Karpathy, Neural Networks: Zero to Hero
- PyTorch Documentation