A character-level language model implementing a multi-layer perceptron (MLP) neural network with embeddings for generating human-like names. This project advances beyond bigram models to demonstrate deep learning fundamentals including word embeddings, hidden layers, and mini-batch gradient descent optimization.
Makemore Part 2 builds upon the foundational bigram model by implementing a more sophisticated neural network architecture. Using a multi-layer perceptron with character embeddings, the model learns to predict the next character in a sequence by considering a context window of 3 preceding characters. This approach captures richer patterns and dependencies in the data compared to simple bigram statistics.
- Character Embeddings: 2D and 10D learned representations mapping characters to continuous vector spaces
- Multi-Layer Architecture: Hidden layer with 100-200 neurons using tanh activation
- Context Window: Block size of 3 characters for richer context modeling
- Mini-Batch Training: Efficient gradient descent with batch size of 32
- Train/Dev/Test Split: Proper 80/10/10 dataset partitioning for model validation
- Learning Rate Scheduling: Experimentation with learning rate optimization
- Embedding Visualization: 2D scatter plots showing learned character relationships
- Name Generation: Sampling from the trained model to create new plausible names
```
makemore2/
├── makemore2.ipynb    # Main Jupyter notebook with MLP implementation
├── names.txt          # Dataset of 32,033 names
└── README.md          # This file
```
The model is trained on names.txt, containing 32,033 names with varying lengths. The dataset is properly split into:
- Training set: 80% (182,424 examples) - for parameter optimization
- Development/Validation set: 10% (22,836 examples) - for hyperparameter tuning
- Test set: 10% (22,886 examples) - for final model evaluation
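Each name of length L contributes L+1 (context, target) examples — one per character plus the terminating `.` token — so the three split sizes above account for every example in the dataset. A quick sanity check using only the numbers quoted here:

```python
# Example counts quoted for the three splits above
train, dev, test = 182_424, 22_836, 22_886
total = train + dev + test
print(total)  # 228146 examples in all

# With 32,033 names, that averages about 7.1 examples (characters + '.') per name
print(round(total / 32_033, 1))
```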
```python
# Build vocabulary of characters and mappings to/from integers
chars = sorted(list(set(''.join(words))))
stoi = {s: i + 1 for i, s in enumerate(chars)}
stoi['.'] = 0  # Special start/end token
itos = {i: s for s, i in stoi.items()}

block_size = 3              # Context length: how many characters are used to predict the next one
context = [0] * block_size  # Initialize with special tokens

# Build the dataset of (context, target) pairs
```

Input Layer:
- Character embeddings: 27 characters × 10-dimensional vectors
- Context of 3 characters → 30-dimensional input
Hidden Layer:
- 200 neurons with tanh activation
- Weight matrix W1: (30, 200)
- Bias vector b1: (200,)
Output Layer:
- 27 output neurons (one per character)
- Weight matrix W2: (200, 27)
- Bias vector b2: (27,)
- Softmax activation for probability distribution
Total Parameters: 11,897 learnable parameters
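The total can be verified by summing the element count of each tensor listed above (a quick sanity check; the shapes follow directly from the architecture description):

```python
# Parameter shapes for the 10-D embedding / 200-neuron configuration
shapes = {
    'C':  (27, 10),    # character embedding table
    'W1': (30, 200),   # hidden-layer weights (3 chars x 10 dims = 30 inputs)
    'b1': (200,),      # hidden-layer bias
    'W2': (200, 27),   # output-layer weights
    'b2': (27,),       # output-layer bias
}

def numel(shape):
    n = 1
    for d in shape:
        n *= d
    return n

total = sum(numel(s) for s in shapes.values())
print(total)  # 11897
```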
```python
emb = C[X]                                  # Look up embeddings
h = torch.tanh(emb.view(-1, 30) @ W1 + b1)  # Hidden layer
logits = h @ W2 + b2                        # Output logits
loss = F.cross_entropy(logits, Y)           # Cross-entropy loss
```

- Optimizer: Mini-batch stochastic gradient descent
- Batch Size: 32 examples per iteration
- Learning Rate: 0.1 (after experimentation)
- Iterations: 200,000 training steps
- Loss Function: Cross-entropy (negative log-likelihood)
- Training Loss: 2.13 (after 200k iterations)
- Validation Loss: 2.22 (slight overfitting observed)
The validation loss remains close to the training loss throughout training, indicating good generalization with only minimal overfitting.
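Since cross-entropy is the average negative log-probability assigned to the correct next character, these loss values have a direct interpretation — `exp(-loss)` is the geometric-mean probability the model gives the right answer. A quick back-of-the-envelope check using only the standard library:

```python
import math

def mean_prob(loss):
    # exp(-cross_entropy) = geometric mean probability of the correct character
    return math.exp(-loss)

uniform_loss = math.log(27)  # baseline: guessing uniformly over 27 characters
print(f"uniform baseline loss: {uniform_loss:.2f}")            # ~3.30
print(f"train loss 2.13 -> mean prob {mean_prob(2.13):.3f}")   # ~0.119
print(f"dev   loss 2.22 -> mean prob {mean_prob(2.22):.3f}")   # ~0.109
```

So the trained model assigns the correct next character roughly three times the probability a uniform guesser would (0.119 vs 1/27 ≈ 0.037).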
```
carmahzati
hari
kemili
taty
halaysee
mahiel
amerahti
aqui
ner
keah
maiiv
kaleigph
bmania
kengan
shove
alian
qui
jero
dearisi
jaceelinsa
```
- Word Embeddings: Learning continuous vector representations of discrete characters
- Multi-Layer Perceptrons: Building deeper neural networks with hidden layers
- Activation Functions: Understanding tanh non-linearity for hidden representations
- Mini-Batch Training: Efficient gradient computation using batches of examples
- Proper Data Splitting: Train/dev/test methodology for model validation
- Learning Rate Tuning: Finding optimal learning rate through experimentation
- Gradient Descent: Iterative parameter optimization
- Backpropagation: Automatic differentiation with PyTorch
- Loss Monitoring: Tracking training progress and detecting overfitting
- Random Seed Control: Reproducibility in neural network initialization
- Tensor Indexing: Efficient embedding lookups with `C[X]`
- View Reshaping: Flattening context windows with `.view(-1, 30)`
- Cross-Entropy Loss: Using `F.cross_entropy()` for classification
- Softmax Sampling: Generating from probability distributions
- Generator Objects: Controlled randomness for reproducibility
The notebook includes 2D visualization of learned character embeddings, revealing interesting patterns:
- Similar characters (vowels, consonants) cluster together in embedding space
- Frequently co-occurring characters are positioned closer together
- The special start/end token '.' occupies a distinct region
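As a sketch of how such clustering can be inspected numerically rather than visually, nearest neighbours in the 2-D embedding plane can be computed directly. The coordinates below are made up for illustration — the real values would come from the trained embedding table (e.g. `C[stoi[ch]]` for each character):

```python
import math

# Hypothetical 2-D embedding coordinates (illustrative only, not trained values)
emb2d = {
    'a': (0.9, 1.1), 'e': (1.0, 1.0), 'i': (1.1, 0.9), 'o': (0.8, 1.3),
    'q': (-1.2, -0.8), 'x': (-1.0, -1.0), '.': (0.0, -2.0),
}

def nearest(ch):
    # Find the character with the smallest Euclidean distance to ch
    point = emb2d[ch]
    others = (c for c in emb2d if c != ch)
    return min(others, key=lambda c: math.dist(point, emb2d[c]))

print(nearest('a'))  # 'e' -- the vowels sit close together in this toy layout
```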
This project teaches:
- Implementing multi-layer neural networks with embedding layers
- Understanding word/character embeddings and their learned representations
- Proper dataset splitting for training, validation, and testing
- Mini-batch training for efficient gradient descent
- Learning rate experimentation and hyperparameter tuning
- Visualizing learned representations to understand model behavior
- Debugging overfitting through train/dev loss comparison
- Python 3.7+
- PyTorch
- NumPy
- Matplotlib
- Jupyter Notebook
```bash
# Clone the repository
git clone https://github.com/Jaloch-glitch/makemore2.git
cd makemore2

# Install dependencies
pip install torch numpy matplotlib jupyter

# Start Jupyter Notebook
jupyter notebook makemore2.ipynb
```

Run all cells to:
- Load and split the dataset (80/10/10)
- Build character vocabulary and context windows
- Initialize neural network parameters
- Train the MLP model with mini-batch gradient descent
- Evaluate on validation set
- Visualize learned character embeddings
- Generate new names from the trained model
```python
def build_dataset(words):
    block_size = 3
    X, Y = [], []
    for w in words:
        context = [0] * block_size
        for ch in w + '.':
            ix = stoi[ch]
            X.append(context)
            Y.append(ix)
            context = context[1:] + [ix]  # Slide the window forward
    return torch.tensor(X), torch.tensor(Y)

# Split dataset: 80% train, 10% dev, 10% test
random.shuffle(words)
n1 = int(0.8 * len(words))
n2 = int(0.9 * len(words))
Xtr, Ytr = build_dataset(words[:n1])
Xdev, Ydev = build_dataset(words[n1:n2])
Xte, Yte = build_dataset(words[n2:])
```

```python
for i in range(200000):
    # Mini-batch construction
    ix = torch.randint(0, Xtr.shape[0], (32,))

    # Forward pass
    emb = C[Xtr[ix]]
    h = torch.tanh(emb.view(-1, 30) @ W1 + b1)
    logits = h @ W2 + b2
    loss = F.cross_entropy(logits, Ytr[ix])

    # Backward pass
    for p in parameters:
        p.grad = None
    loss.backward()

    # Update parameters
    lr = 0.1
    for p in parameters:
        p.data += -lr * p.grad
```

```python
g = torch.Generator().manual_seed(2147483647 + 10)
for _ in range(20):
    out = []
    context = [0] * block_size
    while True:
        emb = C[torch.tensor([context])]
        h = torch.tanh(emb.view(1, -1) @ W1 + b1)
        logits = h @ W2 + b2
        probs = F.softmax(logits, dim=1)
        ix = torch.multinomial(probs, num_samples=1, generator=g).item()
        context = context[1:] + [ix]
        out.append(ix)
        if ix == 0:
            break
    print(''.join(itos[i] for i in out))
```

| Aspect | Part 1 (Bigram) | Part 2 (MLP) |
|---|---|---|
| Context | 1 character | 3 characters |
| Architecture | Single linear layer | Multi-layer with embeddings |
| Parameters | 729 (27×27) | 11,897 |
| Loss | ~2.49 | ~2.13 (train) / 2.22 (dev) |
| Embeddings | None | 10D learned embeddings |
| Training | Full batch | Mini-batch (32) |
| Validation | None | Proper train/dev/test split |
The notebook includes:
- Training loss curves: Monitoring convergence over 200k iterations
- Learning rate experiments: Testing different learning rates (0.001 to 1.0)
- 2D embedding scatter plots: Visualizing character relationships
- Character distributions: Analyzing patterns in the dataset
- Increase context window (block_size > 3) for longer-range dependencies
- Experiment with deeper architectures (multiple hidden layers)
- Implement dropout for better regularization
- Add batch normalization for faster training
- Try different embedding dimensions (50D, 100D)
- Implement learning rate scheduling/decay
- Add temperature sampling for controlled generation
- Explore RNN/LSTM architectures for variable-length contexts
- Implement beam search for better generation quality
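Of these, temperature sampling is a particularly small change to the sampling loop: logits are divided by a temperature before the softmax, sharpening the distribution at low temperatures and flattening it at high ones. A minimal pure-Python sketch (the logit values are made up for illustration):

```python
import math

def softmax_with_temperature(logits, temperature=1.0):
    # Lower temperature sharpens the distribution; higher temperature flattens it
    scaled = [l / temperature for l in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    z = sum(exps)
    return [e / z for e in exps]

logits = [2.0, 1.0, 0.1]  # hypothetical output logits for three characters
for t in (0.5, 1.0, 2.0):
    probs = softmax_with_temperature(logits, t)
    print(t, [round(p, 3) for p in probs])
```

As the temperature approaches 0 this converges to always picking the highest-logit character; as it grows large, sampling approaches uniform, producing more varied but less plausible names.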
This project is ideal for:
- Understanding character embeddings and word2vec concepts
- Learning multi-layer neural network architecture
- Practicing proper machine learning methodology (train/dev/test)
- Gaining intuition for hyperparameter tuning
- Visualizing learned representations
- Building deeper PyTorch models from scratch
- Debugging and monitoring neural network training
This implementation is inspired by Andrej Karpathy's "makemore" tutorial series, which teaches neural networks through practical implementation of character-level language models. Andrej Karpathy is a renowned AI researcher, former Director of AI at Tesla, and founding member of OpenAI, known for his exceptional teaching of deep learning fundamentals.
Resources:
- Andrej Karpathy's YouTube: Neural Networks: Zero to Hero
- Original makemore repository
- Neural Network tutorials and educational content
Felix Onyango
- GitHub: @Jaloch-glitch
- Location: Kenya, East Africa
This project is open source and available for educational purposes.
- Andrej Karpathy: This implementation follows Andrej Karpathy's excellent "makemore" tutorial series (Part 2), which teaches multi-layer perceptrons and character embeddings from first principles
- Dataset of names from public sources
- Built as part of deep learning fundamentals education
- Thanks to the PyTorch team for an excellent deep learning framework
Note: This is an educational project focusing on understanding neural network fundamentals including embeddings, multi-layer architectures, and proper training methodology. For production-grade language models, consider transformer-based architectures like GPT, which use attention mechanisms and much larger scale.