Makemore Part 2 - Multi-Layer Perceptron (MLP) Character-Level Language Model

A character-level language model implementing a multi-layer perceptron (MLP) neural network with embeddings for generating human-like names. This project advances beyond bigram models to demonstrate deep learning fundamentals including word embeddings, hidden layers, and mini-batch gradient descent optimization.

Overview

Makemore Part 2 builds upon the foundational bigram model by implementing a more sophisticated neural network architecture. Using a multi-layer perceptron with character embeddings, the model learns to predict the next character in a sequence by considering a context window of 3 preceding characters. This approach captures richer patterns and dependencies in the data compared to simple bigram statistics.

Features

  • Character Embeddings: 2D and 10D learned representations mapping characters to continuous vector spaces
  • Multi-Layer Architecture: Hidden layer of 200 neurons with tanh activation (100 neurons also explored during experimentation)
  • Context Window: Block size of 3 characters for richer context modeling
  • Mini-Batch Training: Efficient gradient descent with batch size of 32
  • Train/Dev/Test Split: Proper 80/10/10 dataset partitioning for model validation
  • Learning Rate Scheduling: Experimentation with learning rate optimization
  • Embedding Visualization: 2D scatter plots showing learned character relationships
  • Name Generation: Sampling from the trained model to create new plausible names

Project Structure

makemore2/
├── makemore2.ipynb      # Main Jupyter notebook with MLP implementation
├── names.txt            # Dataset of 32,033 names
└── README.md           # This file

Dataset

The model is trained on names.txt, containing 32,033 names with varying lengths. The dataset is properly split into:

  • Training set: 80% (182,424 examples) - for parameter optimization
  • Development/Validation set: 10% (22,836 examples) - for hyperparameter tuning
  • Test set: 10% (22,886 examples) - for final model evaluation

Technical Implementation

1. Character Vocabulary & Encoding

# Build vocabulary of characters and mappings to/from integers
chars = sorted(list(set(''.join(words))))
stoi = {s: i+1 for i, s in enumerate(chars)}
stoi['.'] = 0  # Special start/end token
itos = {i: s for s, i in stoi.items()}
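A quick demonstration of these mappings on a toy word list (hypothetical; the notebook builds the vocabulary from the full names.txt):

```python
# Toy demonstration of the character mappings (tiny hypothetical word list)
words = ["emma", "olivia", "ava"]
chars = sorted(set(''.join(words)))
stoi = {s: i + 1 for i, s in enumerate(chars)}
stoi['.'] = 0  # special start/end token
itos = {i: s for s, i in stoi.items()}

print(chars)      # the distinct characters, sorted
print(stoi['a'])  # 1
print(itos[0])    # .
```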

2. Context Window Construction

block_size = 3  # Context length: how many characters we use to predict the next one
context = [0] * block_size  # Initialize with special tokens
# Build dataset of (context, target) pairs
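A worked example of the sliding window in pure Python, using a hypothetical four-character vocabulary (the notebook uses all 27 characters):

```python
# Worked example: (context, target) pairs for the word "emma"
stoi = {'.': 0, 'a': 1, 'e': 2, 'm': 3}  # hypothetical tiny vocabulary
block_size = 3
pairs = []
context = [0] * block_size  # start with '...' padding
for ch in 'emma' + '.':
    ix = stoi[ch]
    pairs.append((tuple(context), ix))
    context = context[1:] + [ix]  # slide the window

for ctx, tgt in pairs:
    print(ctx, '->', tgt)
# (0, 0, 0) -> 2 means "... predicts e", and so on down to the final '.' target
```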

3. Neural Network Architecture

Input Layer:

  • Character embeddings: 27 characters × 10-dimensional vectors
  • Context of 3 characters → 30-dimensional input

Hidden Layer:

  • 200 neurons with tanh activation
  • Weight matrix W1: (30, 200)
  • Bias vector b1: (200,)

Output Layer:

  • 27 output neurons (one per character)
  • Weight matrix W2: (200, 27)
  • Bias vector b2: (27,)
  • Softmax activation for probability distribution

Total Parameters: 11,897 learnable parameters
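A minimal initialization sketch consistent with these shapes, which also verifies the parameter count (variable names follow the makemore convention; the exact seed is an assumption):

```python
import torch

g = torch.Generator().manual_seed(2147483647)
C  = torch.randn((27, 10), generator=g)   # embedding table: 27 chars x 10 dims
W1 = torch.randn((30, 200), generator=g)  # hidden-layer weights
b1 = torch.randn(200, generator=g)        # hidden-layer bias
W2 = torch.randn((200, 27), generator=g)  # output-layer weights
b2 = torch.randn(27, generator=g)         # output-layer bias
parameters = [C, W1, b1, W2, b2]
for p in parameters:
    p.requires_grad = True

print(sum(p.nelement() for p in parameters))  # 11897
```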

4. Forward Pass

emb = C[X]                                    # Lookup embeddings
h = torch.tanh(emb.view(-1, 30) @ W1 + b1)   # Hidden layer
logits = h @ W2 + b2                          # Output logits
loss = F.cross_entropy(logits, Y)            # Cross-entropy loss
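F.cross_entropy fuses softmax and negative log-likelihood in one numerically stable call. A quick sanity check with toy logits (not values from the model):

```python
import torch
import torch.nn.functional as F

logits = torch.tensor([[2.0, 0.5, -1.0]])  # toy logits for a 3-class example
y = torch.tensor([0])
loss_fused = F.cross_entropy(logits, y)

# Manual equivalent: softmax then negative log-likelihood
# (F.cross_entropy is preferred; it is numerically safer for large logits)
probs = logits.exp() / logits.exp().sum(dim=1, keepdim=True)
loss_manual = -probs[0, y[0]].log()

print(torch.allclose(loss_fused, loss_manual))  # True
```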

5. Training Process

  • Optimizer: Mini-batch stochastic gradient descent
  • Batch Size: 32 examples per iteration
  • Learning Rate: 0.1 (after experimentation)
  • Iterations: 200,000 training steps
  • Loss Function: Cross-entropy (negative log-likelihood)
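One common refinement of the fixed learning rate is a step decay late in training. A hypothetical schedule (the notebook settles on lr = 0.1; the exact decay point below is an assumption, not a claim about this notebook):

```python
# Hypothetical step-decay schedule: 0.1 for the first half of training,
# then 0.01 for the remainder
def lr_at(step, total_steps=200_000):
    return 0.1 if step < total_steps // 2 else 0.01

print(lr_at(0), lr_at(150_000))  # 0.1 0.01
```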

Model Performance

Loss Metrics

  • Training Loss: 2.13 (after 200k iterations)
  • Validation Loss: 2.22

The small gap between the training loss (2.13) and the validation loss (2.22) indicates the model generalizes well, with only slight overfitting.

Sample Generated Names

carmahzati
hari
kemili
taty
halaysee
mahiel
amerahti
aqui
ner
keah
maiiv
kaleigph
bmania
kengan
shove
alian
qui
jero
dearisi
jaceelinsa

Key Concepts Demonstrated

Deep Learning Fundamentals

  • Word Embeddings: Learning continuous vector representations of discrete characters
  • Multi-Layer Perceptrons: Building deeper neural networks with hidden layers
  • Activation Functions: Understanding tanh non-linearity for hidden representations
  • Mini-Batch Training: Efficient gradient computation using batches of examples
  • Proper Data Splitting: Train/dev/test methodology for model validation

Neural Network Training

  • Learning Rate Tuning: Finding optimal learning rate through experimentation
  • Gradient Descent: Iterative parameter optimization
  • Backpropagation: Automatic differentiation with PyTorch
  • Loss Monitoring: Tracking training progress and detecting overfitting
  • Random Seed Control: Reproducibility in neural network initialization

PyTorch Advanced Techniques

  • Tensor Indexing: Efficient embedding lookups with C[X]
  • View Reshaping: Flattening context windows with .view(-1, 30)
  • Cross-Entropy Loss: Using F.cross_entropy() for classification
  • Softmax Sampling: Generating from probability distributions
  • Generator Objects: Controlled randomness for reproducibility

Embedding Visualization

The notebook includes 2D visualization of learned character embeddings, revealing interesting patterns:

  • Similar characters (vowels, consonants) cluster together in embedding space
  • Frequently co-occurring characters are positioned closer together
  • The special start/end token '.' occupies a distinct region
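A scatter plot like the notebook's can be sketched as follows; here the trained 2D embedding table is replaced by a random stand-in, so only the plotting mechanics are illustrated:

```python
import torch
import matplotlib
matplotlib.use('Agg')  # headless backend so this runs without a display
import matplotlib.pyplot as plt

itos = dict(enumerate('.abcdefghijklmnopqrstuvwxyz'))
C = torch.randn((27, 2))  # stand-in for the trained 2D embedding table

plt.figure(figsize=(8, 8))
plt.scatter(C[:, 0], C[:, 1], s=200)
for i in range(C.shape[0]):
    plt.text(C[i, 0].item(), C[i, 1].item(), itos[i],
             ha='center', va='center', color='white')
plt.grid(True)
plt.savefig('embeddings.png')
```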

Learning Outcomes

This project teaches:

  1. Implementing multi-layer neural networks with embedding layers
  2. Understanding word/character embeddings and their learned representations
  3. Proper dataset splitting for training, validation, and testing
  4. Mini-batch training for efficient gradient descent
  5. Learning rate experimentation and hyperparameter tuning
  6. Visualizing learned representations to understand model behavior
  7. Debugging overfitting through train/dev loss comparison

Requirements

  • Python 3.7+
  • PyTorch
  • NumPy
  • Matplotlib
  • Jupyter Notebook

Installation

# Clone the repository
git clone https://github.com/Jaloch-glitch/makemore2.git
cd makemore2

# Install dependencies
pip install torch numpy matplotlib jupyter

Usage

# Start Jupyter Notebook
jupyter notebook makemore2.ipynb

Run all cells to:

  1. Load and split the dataset (80/10/10)
  2. Build character vocabulary and context windows
  3. Initialize neural network parameters
  4. Train the MLP model with mini-batch gradient descent
  5. Evaluate on validation set
  6. Visualize learned character embeddings
  7. Generate new names from the trained model

Code Highlights

Dataset Construction with Context Windows

def build_dataset(words):
    block_size = 3
    X, Y = [], []
    for w in words:
        context = [0] * block_size
        for ch in w + '.':
            ix = stoi[ch]
            X.append(context)
            Y.append(ix)
            context = context[1:] + [ix]  # Sliding window
    return torch.tensor(X), torch.tensor(Y)

# Split dataset: 80% train, 10% dev, 10% test
random.shuffle(words)
n1 = int(0.8 * len(words))
n2 = int(0.9 * len(words))

Xtr, Ytr = build_dataset(words[:n1])
Xdev, Ydev = build_dataset(words[n1:n2])
Xte, Yte = build_dataset(words[n2:])

Mini-Batch Training Loop

for i in range(200000):
    # Mini-batch construction
    ix = torch.randint(0, Xtr.shape[0], (32,))

    # Forward pass
    emb = C[Xtr[ix]]
    h = torch.tanh(emb.view(-1, 30) @ W1 + b1)
    logits = h @ W2 + b2
    loss = F.cross_entropy(logits, Ytr[ix])

    # Backward pass
    for p in parameters:
        p.grad = None
    loss.backward()

    # Update parameters
    lr = 0.1
    for p in parameters:
        p.data += -lr * p.grad
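After training, each split can be evaluated with a full-batch forward pass under torch.no_grad(). The sketch below is self-contained, so the trained parameters and the dev split are replaced by random stand-ins with the README's shapes; in the notebook the trained C, W1, b1, W2, b2 and the real Xdev, Ydev are used instead:

```python
import torch
import torch.nn.functional as F

g = torch.Generator().manual_seed(42)
# Random stand-ins for the trained parameters and a dev split
C  = torch.randn((27, 10), generator=g)
W1 = torch.randn((30, 200), generator=g); b1 = torch.randn(200, generator=g)
W2 = torch.randn((200, 27), generator=g); b2 = torch.randn(27, generator=g)
Xdev = torch.randint(0, 27, (100, 3), generator=g)
Ydev = torch.randint(0, 27, (100,), generator=g)

@torch.no_grad()  # no gradients needed for evaluation
def split_loss(X, Y):
    emb = C[X]                                  # (N, 3, 10)
    h = torch.tanh(emb.view(-1, 30) @ W1 + b1)  # (N, 200)
    logits = h @ W2 + b2                        # (N, 27)
    return F.cross_entropy(logits, Y).item()

print(split_loss(Xdev, Ydev))
```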

Name Generation

g = torch.Generator().manual_seed(2147483647 + 10)

for _ in range(20):
    out = []
    context = [0] * block_size

    while True:
        emb = C[torch.tensor([context])]
        h = torch.tanh(emb.view(1, -1) @ W1 + b1)
        logits = h @ W2 + b2
        probs = F.softmax(logits, dim=1)
        ix = torch.multinomial(probs, num_samples=1, generator=g).item()
        context = context[1:] + [ix]
        out.append(ix)
        if ix == 0:
            break

    print(''.join(itos[i] for i in out))

Comparison with Part 1 (Bigram Model)

| Aspect       | Part 1 (Bigram)     | Part 2 (MLP)                 |
|--------------|---------------------|------------------------------|
| Context      | 1 character         | 3 characters                 |
| Architecture | Single linear layer | Multi-layer with embeddings  |
| Parameters   | 729 (27×27)         | 11,897                       |
| Loss         | ~2.49               | ~2.13 (train) / 2.22 (dev)   |
| Embeddings   | None                | 10D learned embeddings       |
| Training     | Full batch          | Mini-batch (32)              |
| Validation   | None                | Proper train/dev/test split  |

Visualization

The notebook includes:

  • Training loss curves: Monitoring convergence over 200k iterations
  • Learning rate experiments: Testing different learning rates (0.001 to 1.0)
  • 2D embedding scatter plots: Visualizing character relationships
  • Character distributions: Analyzing patterns in the dataset
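The learning-rate experiments sweep candidate rates on a log scale. A sketch of that sweep (this mirrors the common makemore approach; treat the exact code as an assumption about the notebook):

```python
import torch

# Sweep candidate learning rates on a log scale from 0.001 to 1.0
lre = torch.linspace(-3, 0, 1000)  # exponents
lrs = 10 ** lre                    # candidate learning rates
print(lrs[0].item(), lrs[-1].item())
```

During the sweep, each training step i uses lrs[i] and records the resulting loss; plotting loss against lre reveals the exponent region where training is stable, which is how the final lr = 0.1 was chosen.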

Future Enhancements

  • Increase context window (block_size > 3) for longer-range dependencies
  • Experiment with deeper architectures (multiple hidden layers)
  • Implement dropout for better regularization
  • Add batch normalization for faster training
  • Try different embedding dimensions (50D, 100D)
  • Implement learning rate scheduling/decay
  • Add temperature sampling for controlled generation
  • Explore RNN/LSTM architectures for variable-length contexts
  • Implement beam search for better generation quality

Educational Value

This project is ideal for:

  • Understanding character embeddings and word2vec concepts
  • Learning multi-layer neural network architecture
  • Practicing proper machine learning methodology (train/dev/test)
  • Gaining intuition for hyperparameter tuning
  • Visualizing learned representations
  • Building deeper PyTorch models from scratch
  • Debugging and monitoring neural network training

References

This implementation is inspired by Andrej Karpathy's "makemore" tutorial series, which teaches neural networks through practical implementation of character-level language models. Andrej Karpathy is a renowned AI researcher, former Director of AI at Tesla, and founding member of OpenAI, known for his exceptional teaching of deep learning fundamentals.


Author

Felix Onyango

License

This project is open source and available for educational purposes.

Acknowledgments

  • Andrej Karpathy: This implementation follows Part 2 of his "makemore" tutorial series
  • Dataset of names from public sources
  • Built as part of deep learning fundamentals education
  • Thanks to the PyTorch team for an excellent deep learning framework

Note: This is an educational project focusing on understanding neural network fundamentals including embeddings, multi-layer architectures, and proper training methodology. For production-grade language models, consider transformer-based architectures like GPT, which use attention mechanisms and much larger scale.
