79 changes: 79 additions & 0 deletions content/course/submissions/scratch-1/jdvakil.mdx
@@ -0,0 +1,79 @@
---
title: "Scratch-1: jdvakil"
student: "jdvakil"
date: "2026-02-03"
---

# Scratch-1: The Transformer Backbone

## Training Loss

Trained the transformer backbone for 10 epochs on 10k discretized trajectories. Loss dropped fast in epoch 1, then improved slowly from there.
![Backbone Training Loss](../../../../src/assignments/scratch-1/training_loss.png)
Final loss was **~1.41** with perplexity **~4.09**. The curve looks stable with no weird spikes or divergence.
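As a sanity check on the reported numbers, perplexity is just the exponential of the mean cross-entropy loss (in nats); a quick sketch, not code from the submission:

```python
import math

def perplexity(mean_ce_loss: float) -> float:
    """Perplexity is the exponential of the mean cross-entropy loss (in nats)."""
    return math.exp(mean_ce_loss)

print(perplexity(1.41))  # ~4.10, matching the reported value
```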

## Ablation Studies: RoPE vs Sinusoidal

Ran a comparison between RoPE and standard sinusoidal positional embeddings.
![Ablation Comparison](../../../../src/assignments/scratch-1/rope_vs_sinusoidal_ablation.png)

- **RoPE**: Hit a loss of **1.98** (perplexity 7.25) in just 3 epochs
- **Sinusoidal**: Stuck around **~4.40** and basically didn't learn

RoPE works better here because it encodes relative positions directly in the attention computation rather than adding absolute position information to the embeddings. The sinusoidal result surprised me; I expected it to at least converge somewhat, but it just sat there.
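For reference, a minimal RoPE application in the rotate-half (GPT-NeoX-style) formulation; names and shapes are illustrative, not taken from `backbone.py`:

```python
import torch

def rope(x: torch.Tensor, base: float = 10000.0) -> torch.Tensor:
    """Apply rotary position embeddings to x of shape (seq, dim), dim even."""
    seq, dim = x.shape
    half = dim // 2
    freqs = base ** (-torch.arange(half, dtype=torch.float32) / half)  # (half,)
    angles = torch.arange(seq, dtype=torch.float32)[:, None] * freqs   # (seq, half)
    cos, sin = angles.cos(), angles.sin()
    x1, x2 = x[:, :half], x[:, half:]
    # Rotate each (x1, x2) pair by a position-dependent angle.
    return torch.cat([x1 * cos - x2 * sin, x1 * sin + x2 * cos], dim=-1)
```

Because Q and K are rotated by angles proportional to their positions, the Q·K dot product depends only on the positional offset, which is exactly the relative-position property credited above.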

## Inference Benchmark

Tested generation speed with and without KV caching.
![Benchmark Speed](../../../../src/assignments/scratch-1/kv_cache_vs_native_benchmark.png)

- **With Cache**: 229.5 tokens/sec
- **No Cache**: 209.3 tokens/sec
- **Speedup**: ~1.10x

Not a huge difference for these short sequences, since the cache only avoids recomputing keys and values for a short prefix, but it would matter far more for longer generation.
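The idea behind the speedup, sketched in a toy single-head form (illustrative, not the benchmarked implementation): with a cache, each decode step appends the new key/value and attends with only the new token's query, instead of recomputing attention over the whole prefix.

```python
import torch

def decode_step(q_new, k_new, v_new, cache):
    """One cached decode step. q_new/k_new/v_new: (1, d); cache holds 'k','v' of shape (t, d)."""
    cache["k"] = torch.cat([cache["k"], k_new], dim=0)  # append instead of recomputing
    cache["v"] = torch.cat([cache["v"], v_new], dim=0)
    scale = cache["k"].shape[-1] ** 0.5
    attn = torch.softmax(q_new @ cache["k"].T / scale, dim=-1)
    return attn @ cache["v"]
```

For the last token this matches full causal attention over the sequence, but each step costs O(t) instead of O(t^2).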

## Attention Visualization

Plotted attention maps from Layer 0 to see what the model learned.
![Attention Maps](../../../../src/assignments/scratch-1/attention_maps.png)
You can see the lower-triangular pattern from the causal mask. The heads are clearly attending to previous tokens, which is what we want for next-token prediction.
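The lower-triangular structure comes directly from how the mask is built (a sketch; the actual plotting code lives in the submission):

```python
import torch

T = 6
scores = torch.randn(T, T)
mask = torch.tril(torch.ones(T, T, dtype=torch.bool))  # True on and below the diagonal
attn = torch.softmax(scores.masked_fill(~mask, float("-inf")), dim=-1)
print(attn.triu(1).abs().max().item())  # 0.0: no weight above the diagonal
```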

## The Audit: Removing the Causal Mask

Removed `torch.tril` to see what happens when the model can peek at future tokens.
Training loss dropped to ~3.1 (vs. starting at 3.7) far faster than normal. But this is fake progress: at inference time there are no future tokens to peek at, so the model is useless. It learned to copy instead of predict.

### Why the Model "Cheats"

Without the causal mask, the token at position $t$ can see position $t+1$. The target for position $t$ is exactly the value at $t+1$, so the model just copies it; no actual learning of the dynamics happens.
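A tiny demonstration of the copy shortcut (purely illustrative; one-hot vectors stand in for learned value embeddings): a shifted-identity attention pattern reproduces the next-token targets exactly, which is the zero-effort solution an unmasked model can converge to.

```python
import torch

T, V = 5, 10
tokens = torch.randint(0, V, (T,))
values = torch.nn.functional.one_hot(tokens, V).float()  # (T, V) stand-in value vectors
# An unmasked model can learn a shifted-identity attention pattern:
attn = torch.zeros(T, T)
attn[torch.arange(T - 1), torch.arange(1, T)] = 1.0      # position t attends to t+1
out = attn @ values
# The first T-1 outputs reproduce the targets exactly: zero-loss "cheating".
assert torch.equal(out[:-1].argmax(-1), tokens[1:])
```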

## Code Highlights

Implemented RoPE with KV-caching support in `backbone.py`. Also wrote ablations for RoPE vs sinusoidal and KV cache benchmarking.
Collect data:

```bash
python generate_data.py --num_trajectories 10000 --seq_length 50 --output data/trajectories.pkl
```

Train:

```bash
python backbone.py
```

Ablations:

```bash
python ablations.py
```

Extra packages:

```bash
pip install pillow six seaborn
```

## Challenges and Solutions

**Problem**: Loss was flat at ~4.6 even though the model code was correct.

Spent way too long debugging the model before thinking to check the data. Looked at `generate_data.py` and found the issue: the signal amplitude was 0.01 and the noise amplitude was 0.05, so the SNR was terrible. Bumped the signal to 0.1 and dropped the noise to 0.001; loss immediately started decreasing and converged to ~1.4.
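For context, that fix moved the amplitude SNR from roughly 0.2 (noise dominates) to roughly 100. A back-of-the-envelope sketch, assuming `generate_data.py` adds noise to a clean signal; `snr` is a hypothetical helper, not a function in that script:

```python
def snr(signal_amp: float, noise_amp: float) -> float:
    """Amplitude signal-to-noise ratio (hypothetical helper, not from generate_data.py)."""
    return signal_amp / noise_amp

print(snr(0.01, 0.05))   # ~0.2: noise swamps the signal, loss stays flat
print(snr(0.1, 0.001))   # ~100: signal dominates, loss can decrease
```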
65 changes: 65 additions & 0 deletions grading_reports/GRADING_REPORT.md
@@ -0,0 +1,65 @@
![Chris-Bot](~/chris_robot.png)
### 🤖 Chris's Grading Assistant - Feedback Report

**Student:** @Jdvakil
**PR:** #49
**Branch:** `scratch-1-jdvakil`

Hi! I've reviewed your submission. Here's what I found:

---

## 📊 Component Feedback

### ✅ Causal Self-Attention

✅ Perfect! Your causal mask correctly prevents future token leakage.

✅ Test passed.

### ✅ RMSNorm

✅ RMSNorm implemented correctly with proper normalization and learnable scale.

✅ Test passed.

### ✅ Training Loop

✅ Excellent! Your model trains successfully and loss converges.

### ✅ RoPE Embeddings

✅ RoPE correctly applied to Q and K tensors.

### ✅ Model Architecture

✅ Model forward pass works end-to-end with correct output shapes.

✅ Model has the expected number of trainable parameters.

### ✅ Code Quality

Your code imports and runs cleanly. Nice! ✨

---

## 📝 Documentation & Analysis

✅ Report submitted! I found:
- `content/course/submissions/scratch-1/jdvakil.mdx`
- `README.md`

Your instructor will review the quality of your analysis.

---

## 🎯 Mastery Features Detected

I noticed you implemented:
- RoPE vs Sinusoidal ablation study

Great work going beyond the requirements! Your instructor will verify implementation quality.

---

> *Grading is automated but reviewed by an instructor. If you have questions, reach out on Slack!*
54 changes: 54 additions & 0 deletions pyproject.toml
@@ -0,0 +1,54 @@
[project]
name = "vla-foundations"
version = "0.1.0"
description = "VLA Foundations Course - Private Instructor Repository"
readme = "README.md"
requires-python = ">=3.10,<3.14"
dependencies = [
    "torch>=2.0.0",
    "torchvision",
    "numpy>=1.24.0",
    "pytest>=7.0.0",
    "pytest-html>=4.0.0",
    "matplotlib>=3.5.0",
]

[[tool.uv.index]]
name = "pytorch-cpu"
url = "https://download.pytorch.org/whl/cpu"
explicit = true

[[tool.uv.index]]
name = "pytorch-cu118"
url = "https://download.pytorch.org/whl/cu118"
explicit = true

[tool.uv.sources]
torch = [
    { index = "pytorch-cpu", marker = "sys_platform == 'darwin'" },
    { index = "pytorch-cu118", marker = "sys_platform == 'linux'" },
]
torchvision = [
    { index = "pytorch-cpu", marker = "sys_platform == 'darwin'" },
    { index = "pytorch-cu118", marker = "sys_platform == 'linux'" },
]

[tool.hatch.build.targets.wheel]
packages = []

[tool.pytest.ini_options]
markers = [
    "internal: internal grading tests (never public)",
    "rigor: rigorous grading tests",
    "gradient: gradient flow tests",
    "fidelity: output comparison tests",
    "training: training convergence tests",
    "mastery: optional mastery-level features (DINOv2, KV-cache, etc.)",
]
testpaths = ["tests"]
python_files = ["test_*.py"]
python_classes = ["Test*"]
python_functions = ["test_*"]

[dependency-groups]
dev = []