A character-level language model implementing a multi-layer perceptron (MLP) neural network with embeddings for generating human-like names. This project advances beyond bigram models to demonstrate deep learning fundamentals including word embeddings, hidden layers, and mini-batch gradient descent optimization.
Makemore Part 2 builds upon the foundational bigram model by implementing a more sophisticated neural network architecture. Using a multi-layer perceptron with character embeddings, the model learns to predict the next character in a sequence by considering a context window of 3 preceding characters. This approach captures richer patterns and dependencies in the data compared to simple bigram statistics.
- Character Embeddings: 2D and 10D learned representations mapping characters to continuous vector spaces
- Multi-Layer Architecture: Hidden layer with 100-200 neurons using tanh activation
- Context Window: Block size of 3 characters for richer context modeling
- Mini-Batch Training: Efficient gradient descent with batch size of 32
- Train/Dev/Test Split: Proper 80/10/10 dataset partitioning for model validation
- Learning Rate Scheduling: Experimentation with learning rate optimization
- Embedding Visualization: 2D scatter plots showing learned character relationships
- Name Generation: Sampling from the trained model to create new plausible names
```
makemore2/
├── makemore2.ipynb    # Main Jupyter notebook with MLP implementation
├── names.txt          # Dataset of 32,033 names
└── README.md          # This file
```
The model is trained on names.txt, containing 32,033 names with varying lengths. The dataset is properly split into:
- Training set: 80% (182,424 examples) - for parameter optimization
- Development/Validation set: 10% (22,836 examples) - for hyperparameter tuning
- Test set: 10% (22,886 examples) - for final model evaluation
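Each name of length L contributes L+1 (context, target) examples — one per character plus the terminating `.` token — so the three split sizes above account for every example in the dataset. A quick sanity check using only the numbers quoted here:

```python
# Example counts quoted for the three splits above
train, dev, test = 182_424, 22_836, 22_886
total = train + dev + test
print(total)  # 228146 examples in all

# With 32,033 names, that averages about 7.1 examples (characters + '.') per name
print(round(total / 32_033, 1))
```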
```python
# Build vocabulary of characters and mappings to/from integers
chars = sorted(list(set(''.join(words))))
stoi = {s: i + 1 for i, s in enumerate(chars)}
stoi['.'] = 0  # Special start/end token
itos = {i: s for s, i in stoi.items()}

block_size = 3              # Context length: how many characters are used to predict the next one
context = [0] * block_size  # Initialize with special tokens

# Build the dataset of (context, target) pairs
```

Input Layer:
- Character embeddings: 27 characters × 10-dimensional vectors
- Context of 3 characters → 30-dimensional input
Hidden Layer:
- 200 neurons with tanh activation
- Weight matrix W1: (30, 200)
- Bias vector b1: (200,)
Output Layer:
- 27 output neurons (one per character)
- Weight matrix W2: (200, 27)
- Bias vector b2: (27,)
- Softmax activation for probability distribution
Total Parameters: 11,897 learnable parameters
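The total can be verified by summing the element count of each tensor listed above (a quick sanity check; the shapes follow directly from the architecture description):

```python
# Parameter shapes for the 10-D embedding / 200-neuron configuration
shapes = {
    'C':  (27, 10),    # character embedding table
    'W1': (30, 200),   # hidden-layer weights (3 chars x 10 dims = 30 inputs)
    'b1': (200,),      # hidden-layer bias
    'W2': (200, 27),   # output-layer weights
    'b2': (27,),       # output-layer bias
}

def numel(shape):
    n = 1
    for d in shape:
        n *= d
    return n

total = sum(numel(s) for s in shapes.values())
print(total)  # 11897
```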
```python
emb = C[X]                                  # Look up embeddings
h = torch.tanh(emb.view(-1, 30) @ W1 + b1)  # Hidden layer
logits = h @ W2 + b2                        # Output logits
loss = F.cross_entropy(logits, Y)           # Cross-entropy loss
```

- Optimizer: Mini-batch stochastic gradient descent
- Batch Size: 32 examples per iteration
- Learning Rate: 0.1 (after experimentation)
- Iterations: 200,000 training steps
- Loss Function: Cross-entropy (negative log-likelihood)
- Training Loss: 2.13 (after 200k iterations)
- Validation Loss: 2.22 (slight overfitting observed)
The validation loss remains close to the training loss throughout training, indicating good generalization with only minimal overfitting.
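Since cross-entropy is the average negative log-probability assigned to the correct next character, these loss values have a direct interpretation — `exp(-loss)` is the geometric-mean probability the model gives the right answer. A quick back-of-the-envelope check using only the standard library:

```python
import math

def mean_prob(loss):
    # exp(-cross_entropy) = geometric mean probability of the correct character
    return math.exp(-loss)

uniform_loss = math.log(27)  # baseline: guessing uniformly over 27 characters
print(f"uniform baseline loss: {uniform_loss:.2f}")            # ~3.30
print(f"train loss 2.13 -> mean prob {mean_prob(2.13):.3f}")   # ~0.119
print(f"dev   loss 2.22 -> mean prob {mean_prob(2.22):.3f}")   # ~0.109
```

So the trained model assigns the correct next character roughly three times the probability a uniform guesser would (0.119 vs 1/27 ≈ 0.037).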
```
carmahzati
hari
kemili
taty
halaysee
mahiel
amerahti
aqui
ner
keah
maiiv
kaleigph
bmania
kengan
shove
alian
qui
jero
dearisi
jaceelinsa
```
- Word Embeddings: Learning continuous vector representations of discrete characters
- Multi-Layer Perceptrons: Building deeper neural networks with hidden layers
- Activation Functions: Understanding tanh non-linearity for hidden representations
- Mini-Batch Training: Efficient gradient computation using batches of examples
- Proper Data Splitting: Train/dev/test methodology for model validation
- Learning Rate Tuning: Finding optimal learning rate through experimentation
- Gradient Descent: Iterative parameter optimization
- Backpropagation: Automatic differentiation with PyTorch
- Loss Monitoring: Tracking training progress and detecting overfitting
- Random Seed Control: Reproducibility in neural network initialization
- Tensor Indexing: Efficient embedding lookups with `C[X]`
- View Reshaping: Flattening context windows with `.view(-1, 30)`
- Cross-Entropy Loss: Using `F.cross_entropy()` for classification
- Softmax Sampling: Generating from probability distributions
- Generator Objects: Controlled randomness for reproducibility
The notebook includes 2D visualization of learned character embeddings, revealing interesting patterns:
- Similar characters (vowels, consonants) cluster together in embedding space
- Frequently co-occurring characters are positioned closer together
- The special start/end token '.' occupies a distinct region
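As a sketch of how such clustering can be inspected numerically rather than visually, nearest neighbours in the 2-D embedding plane can be computed directly. The coordinates below are made up for illustration — the real values would come from the trained embedding table (e.g. `C[stoi[ch]]` for each character):

```python
import math

# Hypothetical 2-D embedding coordinates (illustrative only, not trained values)
emb2d = {
    'a': (0.9, 1.1), 'e': (1.0, 1.0), 'i': (1.1, 0.9), 'o': (0.8, 1.3),
    'q': (-1.2, -0.8), 'x': (-1.0, -1.0), '.': (0.0, -2.0),
}

def nearest(ch):
    # Find the character with the smallest Euclidean distance to ch
    point = emb2d[ch]
    others = (c for c in emb2d if c != ch)
    return min(others, key=lambda c: math.dist(point, emb2d[c]))

print(nearest('a'))  # 'e' -- the vowels sit close together in this toy layout
```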
This project teaches:
- Implementing multi-layer neural networks with embedding layers
- Understanding word/character embeddings and their learned representations
- Proper dataset splitting for training, validation, and testing
- Mini-batch training for efficient gradient descent
- Learning rate experimentation and hyperparameter tuning
- Visualizing learned representations to understand model behavior
- Debugging overfitting through train/dev loss comparison
- Python 3.7+
- PyTorch
- NumPy
- Matplotlib
- Jupyter Notebook
```bash
# Clone the repository
git clone https://github.com/Jaloch-glitch/makemore2.git
cd makemore2

# Install dependencies
pip install torch numpy matplotlib jupyter

# Start Jupyter Notebook
jupyter notebook makemore2.ipynb
```

Run all cells to:
- Load and split the dataset (80/10/10)
- Build character vocabulary and context windows
- Initialize neural network parameters
- Train the MLP model with mini-batch gradient descent
- Evaluate on validation set
- Visualize learned character embeddings
- Generate new names from the trained model
```python
def build_dataset(words):
    block_size = 3
    X, Y = [], []
    for w in words:
        context = [0] * block_size
        for ch in w + '.':
            ix = stoi[ch]
            X.append(context)
            Y.append(ix)
            context = context[1:] + [ix]  # Slide the window forward
    return torch.tensor(X), torch.tensor(Y)

# Split dataset: 80% train, 10% dev, 10% test
random.shuffle(words)
n1 = int(0.8 * len(words))
n2 = int(0.9 * len(words))
Xtr, Ytr = build_dataset(words[:n1])
Xdev, Ydev = build_dataset(words[n1:n2])
Xte, Yte = build_dataset(words[n2:])
```

```python
for i in range(200000):
    # Mini-batch construction
    ix = torch.randint(0, Xtr.shape[0], (32,))

    # Forward pass
    emb = C[Xtr[ix]]
    h = torch.tanh(emb.view(-1, 30) @ W1 + b1)
    logits = h @ W2 + b2
    loss = F.cross_entropy(logits, Ytr[ix])

    # Backward pass
    for p in parameters:
        p.grad = None
    loss.backward()

    # Update parameters
    lr = 0.1
    for p in parameters:
        p.data += -lr * p.grad
```

```python
g = torch.Generator().manual_seed(2147483647 + 10)
for _ in range(20):
    out = []
    context = [0] * block_size
    while True:
        emb = C[torch.tensor([context])]
        h = torch.tanh(emb.view(1, -1) @ W1 + b1)
        logits = h @ W2 + b2
        probs = F.softmax(logits, dim=1)
        ix = torch.multinomial(probs, num_samples=1, generator=g).item()
        context = context[1:] + [ix]
        out.append(ix)
        if ix == 0:
            break
    print(''.join(itos[i] for i in out))
```

| Aspect | Part 1 (Bigram) | Part 2 (MLP) |
|---|---|---|
| Context | 1 character | 3 characters |
| Architecture | Single linear layer | Multi-layer with embeddings |
| Parameters | 729 (27×27) | 11,897 |
| Loss | ~2.49 | ~2.13 (train) / 2.22 (dev) |
| Embeddings | None | 10D learned embeddings |
| Training | Full batch | Mini-batch (32) |
| Validation | None | Proper train/dev/test split |
The notebook includes:
- Training loss curves: Monitoring convergence over 200k iterations
- Learning rate experiments: Testing different learning rates (0.001 to 1.0)
- 2D embedding scatter plots: Visualizing character relationships
- Character distributions: Analyzing patterns in the dataset
- Increase context window (block_size > 3) for longer-range dependencies
- Experiment with deeper architectures (multiple hidden layers)
- Implement dropout for better regularization
- Add batch normalization for faster training
- Try different embedding dimensions (50D, 100D)
- Implement learning rate scheduling/decay
- Add temperature sampling for controlled generation
- Explore RNN/LSTM architectures for variable-length contexts
- Implement beam search for better generation quality
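Of these, temperature sampling is a particularly small change to the sampling loop: logits are divided by a temperature before the softmax, sharpening the distribution at low temperatures and flattening it at high ones. A minimal pure-Python sketch (the logit values are made up for illustration):

```python
import math

def softmax_with_temperature(logits, temperature=1.0):
    # Lower temperature sharpens the distribution; higher temperature flattens it
    scaled = [l / temperature for l in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    z = sum(exps)
    return [e / z for e in exps]

logits = [2.0, 1.0, 0.1]  # hypothetical output logits for three characters
for t in (0.5, 1.0, 2.0):
    probs = softmax_with_temperature(logits, t)
    print(t, [round(p, 3) for p in probs])
```

As the temperature approaches 0 this converges to always picking the highest-logit character; as it grows large, sampling approaches uniform, producing more varied but less plausible names.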
This project is ideal for:
- Understanding character embeddings and word2vec concepts
- Learning multi-layer neural network architecture
- Practicing proper machine learning methodology (train/dev/test)
- Gaining intuition for hyperparameter tuning
- Visualizing learned representations
- Building deeper PyTorch models from scratch
- Debugging and monitoring neural network training
This implementation is inspired by Andrej Karpathy's "makemore" tutorial series, which teaches neural networks through practical implementation of character-level language models. Andrej Karpathy is a renowned AI researcher, former Director of AI at Tesla, and founding member of OpenAI, known for his exceptional teaching of deep learning fundamentals.
Resources:
- Andrej Karpathy's YouTube: Neural Networks: Zero to Hero
- Original makemore repository
- Neural Network tutorials and educational content
Felix Onyango
- GitHub: @Jaloch-glitch
- Location: Kenya, East Africa
This project is open source and available for educational purposes.
- Andrej Karpathy: This implementation follows Andrej Karpathy's excellent "makemore" tutorial series (Part 2), which teaches multi-layer perceptrons and character embeddings from first principles
- Dataset of names from public sources
- Built as part of deep learning fundamentals education
- Thanks to the PyTorch team for an excellent deep learning framework
Note: This is an educational project focusing on understanding neural network fundamentals including embeddings, multi-layer architectures, and proper training methodology. For production-grade language models, consider transformer-based architectures like GPT, which use attention mechanisms and much larger scale.