An implementation of automatic differentiation and neural network architectures from scratch, demonstrating how these fundamental deep learning components work under the hood.
This project implements two core components of modern deep learning:
A complete reverse-mode autodiff system supporting:
- Computation graph construction with lazy evaluation
- Backpropagation through arbitrary graphs
- Matrix operations: matmul, solve, logdet
- Numerically stable functions: logsumexp (for softmax)
```python
from autograd import Var, grad, matmul, inner, solve

# Define a computation
def f(A, x, y):
    return inner(solve(A, x), matmul(A, y))

# Get the gradient function automatically
grad_f = grad(f)
grads = grad_f(A, x, y)  # Returns [∂f/∂A, ∂f/∂x, ∂f/∂y]
```

Eight architectures, from simple to complex:
| Architecture | Type | Key Feature |
|---|---|---|
| Perceptron | Linear | Baseline classifier |
| Shallow MLP | MLP | Single hidden layer |
| Deep MLP | MLP | Vanishing gradient demo |
| Deep MLP + ReLU | MLP | ReLU vs tanh comparison |
| CNN | ConvNet | Basic convolutions |
| CNN + Dropout | ConvNet | Regularization |
| VGG-style | Deep CNN | Stacked 3x3 convs |
| ResNet | Residual | Skip connections |
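The key idea behind the ResNet row is the skip connection: the block computes a residual on top of an identity path, so gradients can flow straight through. A minimal NumPy sketch (two illustrative weight matrices standing in for the project's conv layers):

```python
import numpy as np

def relu(x):
    return np.maximum(0.0, x)

def residual_block(x, W1, W2):
    """y = x + W2·relu(W1·x): the identity path carries the signal
    (and the gradient) even when the residual branch is near zero."""
    return x + W2 @ relu(W1 @ x)

# With small weights the block is close to the identity function,
# which is why very deep residual stacks remain trainable.
rng = np.random.default_rng(0)
x = rng.standard_normal(8)
W1 = 0.01 * rng.standard_normal((8, 8))
W2 = 0.01 * rng.standard_normal((8, 8))
y = residual_block(x, W1, W2)
```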
Custom optimizer implementations:
- SGD: Vanilla stochastic gradient descent
- SGD + Momentum: Accelerated convergence
- Adam: Adaptive learning rates
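The three update rules can be sketched in a few lines (hyperparameter names like `lr`, `beta` are illustrative, not necessarily the project's API):

```python
import numpy as np

def sgd_step(w, g, lr=0.1):
    return w - lr * g

def momentum_step(w, g, v, lr=0.01, beta=0.9):
    v = beta * v + g              # velocity accumulates past gradients
    return w - lr * v, v

def adam_step(w, g, m, v, t, lr=0.01, b1=0.9, b2=0.999, eps=1e-8):
    m = b1 * m + (1 - b1) * g     # first moment (mean of gradients)
    v = b2 * v + (1 - b2) * g * g # second moment (uncentered variance)
    m_hat = m / (1 - b1 ** t)     # bias correction for zero init
    v_hat = v / (1 - b2 ** t)
    return w - lr * m_hat / (np.sqrt(v_hat) + eps), m, v
```

For example, minimizing f(w) = w² (gradient 2w) with any of the three drives w toward 0; Adam's per-coordinate normalization is what makes it insensitive to gradient scale.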
Training all 8 architectures with all 3 optimizers on CIFAR-10 (grayscale) yields the following observations:
- Depth matters, but activation matters more: Deep MLP with tanh suffers from vanishing gradients; ReLU enables deeper networks
- Residual connections help: ResNet trains more stably than VGG despite similar depth
- Adam is robust: Works well across architectures with minimal tuning
- Dropout effect varies: Helps most when model is prone to overfitting
```bash
git clone https://github.com/yourusername/deep-learning-from-scratch.git
cd deep-learning-from-scratch
pip install -r requirements.txt
```

Run the autodiff demo:

```bash
cd experiments
python autograd_demo.py
```

Run the full training experiments:

```bash
cd experiments
python train_networks.py
```

This will:
- Download CIFAR-10 automatically
- Train all architectures with all optimizers
- Generate comparison plots in `results/figures/`
```
deep-learning-from-scratch/
├── autograd/
│   ├── __init__.py
│   └── engine.py            # Autodiff implementation
├── neural_networks/
│   ├── __init__.py
│   ├── architectures.py     # 8 network architectures
│   ├── optimizers.py        # SGD, Momentum, Adam
│   ├── data.py              # CIFAR-10 loading
│   └── training.py          # Training utilities
├── experiments/
│   ├── autograd_demo.py     # Autodiff demonstrations
│   └── train_networks.py    # Full training experiments
├── results/
│   └── figures/             # Generated plots
├── requirements.txt
└── README.md
```
The autodiff engine implements reverse-mode automatic differentiation:
- Forward pass: Build computation graph, compute values
- Backward pass: Traverse in reverse topological order, accumulate gradients
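These two passes can be sketched with a minimal scalar engine (an illustration of the idea, not the project's actual `engine.py`):

```python
class Var:
    """Scalar node in a computation graph (illustrative sketch)."""
    def __init__(self, value, parents=()):
        self.value = value
        self.parents = parents   # (parent_var, local_gradient) pairs
        self.grad = 0.0

    def __add__(self, other):
        return Var(self.value + other.value, [(self, 1.0), (other, 1.0)])

    def __mul__(self, other):
        return Var(self.value * other.value,
                   [(self, other.value), (other, self.value)])

    def backward(self):
        # Build a topological order, then sweep it in reverse,
        # accumulating gradients via the chain rule.
        order, seen = [], set()
        def visit(v):
            if id(v) not in seen:
                seen.add(id(v))
                for p, _ in v.parents:
                    visit(p)
                order.append(v)
        visit(self)
        self.grad = 1.0
        for v in reversed(order):
            for p, local in v.parents:
                p.grad += v.grad * local

x, y = Var(2.0), Var(3.0)
z = x * y + x        # z = xy + x, so dz/dx = y + 1, dz/dy = x
z.backward()
```

After `backward()`, `x.grad == 4.0` and `y.grad == 2.0`, matching the hand-derived partials.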
Key operations and their gradients:
| Operation | Forward | Backward (VJP) |
|---|---|---|
| `add(x, y)` | x + y | (u, u) |
| `mul(x, y)` | x · y | (u·y, u·x) |
| `matmul(X, Y)` | X·Y | (u·Yᵀ, Xᵀ·u) |
| `solve(A, b)` | x = A⁻¹b | (−A⁻ᵀu·xᵀ, A⁻ᵀu) |
| `logdet(A)` | log\|A\| | u·A⁻ᵀ |
| `logsumexp(x)` | log Σᵢ eˣᵢ | u·softmax(x) |
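The logsumexp row is easy to verify numerically: a stable implementation shifts by the max before exponentiating, and a finite-difference check confirms that its gradient is exactly softmax. A sketch (standalone NumPy, not the project's autograd code):

```python
import numpy as np

def logsumexp(x):
    # Subtracting the max avoids overflow in exp for large inputs.
    m = np.max(x)
    return m + np.log(np.sum(np.exp(x - m)))

def softmax(x):
    e = np.exp(x - np.max(x))
    return e / e.sum()

# Finite-difference check: ∂logsumexp/∂x_i ≈ softmax(x)_i
x = np.array([1.0, 2.0, 3.0])
eps = 1e-6
fd = np.array([(logsumexp(x + eps * np.eye(3)[i]) - logsumexp(x)) / eps
               for i in range(3)])
```

A naive `np.log(np.sum(np.exp(x)))` would overflow for inputs around 1000; the max-shifted version stays finite.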
All networks trained with:
- Loss: Cross-entropy
- Data: CIFAR-10 (grayscale, 32×32)
- Epochs: 20
- Batch size: 256
- Initialization: Xavier/Glorot for CNNs
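Xavier/Glorot initialization chooses the weight scale so that activation variance is roughly preserved across layers. A sketch of the uniform variant (the project may use the normal variant instead):

```python
import numpy as np

def xavier_uniform(fan_in, fan_out, rng=None):
    """Glorot & Bengio (2010): sample from U(-limit, limit) with
    limit = sqrt(6 / (fan_in + fan_out))."""
    if rng is None:
        rng = np.random.default_rng(0)
    limit = np.sqrt(6.0 / (fan_in + fan_out))
    return rng.uniform(-limit, limit, size=(fan_out, fan_in))

W = xavier_uniform(1024, 256)
```

The resulting variance is limit²/3 = 2/(fan_in + fan_out), which balances the forward and backward signal scales.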
Automatic Differentiation:
- Baydin et al., "Automatic Differentiation in Machine Learning: a Survey" (2018)
- Griewank & Walther, "Evaluating Derivatives" (2008)
Neural Network Architectures:
- LeCun et al., "Gradient-based learning applied to document recognition" (1998)
- Simonyan & Zisserman, "Very Deep Convolutional Networks for Large-Scale Image Recognition" (VGG, 2014)
- He et al., "Deep Residual Learning for Image Recognition" (ResNet, 2015)
Optimization:
- Robbins & Monro, "A Stochastic Approximation Method" (1951) - SGD
- Polyak, "Some methods of speeding up convergence" (1964) - Momentum
- Kingma & Ba, "Adam: A Method for Stochastic Optimization" (2014)
Weight Initialization:
- Glorot & Bengio, "Understanding the difficulty of training deep feedforward neural networks" (2010)
Building this project demonstrates understanding of:
- Calculus & Linear Algebra: Deriving gradients for matrix operations
- Graph Algorithms: Topological sorting for backpropagation
- Numerical Computing: Stable implementations (logsumexp)
- Deep Learning Foundations: How frameworks like PyTorch work internally
- Optimization Theory: Momentum, adaptive learning rates
- CNN Architectures: Convolutions, pooling, residual connections
MIT License - feel free to use for learning and projects!
