79 changes: 79 additions & 0 deletions content/course/submissions/scratch-1/jdvakil.mdx
@@ -0,0 +1,79 @@
---
title: "Scratch-1: jdvakil"
student: "jdvakil"
date: "2026-02-03"
---

# Scratch-1: The Transformer Backbone

## Training Loss

Trained the transformer backbone for 10 epochs on 10k discretized trajectories. Loss dropped fast in epoch 1, then improved slowly from there.
![Backbone Training Loss](../../../../src/assignments/scratch-1/training_loss.png)
Final loss was **~1.41** with perplexity **~4.09**. The curve looks stable with no weird spikes or divergence.
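As a sanity check on the reported numbers, perplexity is just the exponential of the mean cross-entropy loss (in nats); a quick sketch, not code from the submission:

```python
import math

def perplexity(mean_ce_loss: float) -> float:
    """Perplexity is the exponential of the mean cross-entropy loss (in nats)."""
    return math.exp(mean_ce_loss)

print(perplexity(1.41))  # ~4.10, matching the reported value
```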

## Ablation Studies: RoPE vs Sinusoidal

Ran a comparison between RoPE and standard sinusoidal positional embeddings.
![Ablation Comparison](../../../../src/assignments/scratch-1/rope_vs_sinusoidal_ablation.png)

- **RoPE**: Hit a loss of **1.98** (perplexity 7.25) in just 3 epochs
- **Sinusoidal**: Stuck around **~4.40** and basically didn't learn

RoPE works better here because it encodes relative positions directly in the attention computation rather than adding absolute position information to the embeddings. The sinusoidal result surprised me; I expected it to at least converge somewhat, but it just sat there.
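For reference, a minimal RoPE application in the rotate-half (GPT-NeoX-style) formulation; names and shapes are illustrative, not taken from `backbone.py`:

```python
import torch

def rope(x: torch.Tensor, base: float = 10000.0) -> torch.Tensor:
    """Apply rotary position embeddings to x of shape (seq, dim), dim even."""
    seq, dim = x.shape
    half = dim // 2
    freqs = base ** (-torch.arange(half, dtype=torch.float32) / half)  # (half,)
    angles = torch.arange(seq, dtype=torch.float32)[:, None] * freqs   # (seq, half)
    cos, sin = angles.cos(), angles.sin()
    x1, x2 = x[:, :half], x[:, half:]
    # Rotate each (x1, x2) pair by a position-dependent angle.
    return torch.cat([x1 * cos - x2 * sin, x1 * sin + x2 * cos], dim=-1)
```

Because Q and K are rotated by angles proportional to their positions, the Q·K dot product depends only on the positional offset, which is exactly the relative-position property credited above.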

## Inference Benchmark

Tested generation speed with and without KV caching.
![Benchmark Speed](../../../../src/assignments/scratch-1/kv_cache_vs_native_benchmark.png)

- **With Cache**: 229.5 tokens/sec
- **No Cache**: 209.3 tokens/sec
- **Speedup**: ~1.10x

Not a huge difference for these short sequences, since the cache only avoids recomputing keys and values for a short prefix, but it would matter far more for longer generation.
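The idea behind the speedup, sketched in a toy single-head form (illustrative, not the benchmarked implementation): with a cache, each decode step appends the new key/value and attends with only the new token's query, instead of recomputing attention over the whole prefix.

```python
import torch

def decode_step(q_new, k_new, v_new, cache):
    """One cached decode step. q_new/k_new/v_new: (1, d); cache holds 'k','v' of shape (t, d)."""
    cache["k"] = torch.cat([cache["k"], k_new], dim=0)  # append instead of recomputing
    cache["v"] = torch.cat([cache["v"], v_new], dim=0)
    scale = cache["k"].shape[-1] ** 0.5
    attn = torch.softmax(q_new @ cache["k"].T / scale, dim=-1)
    return attn @ cache["v"]
```

For the last token this matches full causal attention over the sequence, but each step costs O(t) instead of O(t^2).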

## Attention Visualization

Plotted attention maps from Layer 0 to see what the model learned.
![Attention Maps](../../../../src/assignments/scratch-1/attention_maps.png)
You can see the lower-triangular pattern from the causal mask. The heads are clearly attending to previous tokens, which is what we want for next-token prediction.
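The lower-triangular structure comes directly from how the mask is built (a sketch; the actual plotting code lives in the submission):

```python
import torch

T = 6
scores = torch.randn(T, T)
mask = torch.tril(torch.ones(T, T, dtype=torch.bool))  # True on and below the diagonal
attn = torch.softmax(scores.masked_fill(~mask, float("-inf")), dim=-1)
print(attn.triu(1).abs().max().item())  # 0.0: no weight above the diagonal
```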

## The Audit: Removing the Causal Mask

Removed `torch.tril` to see what happens when the model can peek at future tokens.
Training loss dropped to ~3.1 (vs. starting at 3.7) far faster than normal. But this is fake progress: at inference time there are no future tokens to peek at, so the model is useless. It learned to copy instead of predict.

### Why the Model "Cheats"

Without the causal mask, the token at position $t$ can see position $t+1$. The target for position $t$ is exactly the value at $t+1$, so the model just copies it; no actual learning of the dynamics happens.
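A tiny demonstration of the copy shortcut (purely illustrative; one-hot vectors stand in for learned value embeddings): a shifted-identity attention pattern reproduces the next-token targets exactly, which is the zero-effort solution an unmasked model can converge to.

```python
import torch

T, V = 5, 10
tokens = torch.randint(0, V, (T,))
values = torch.nn.functional.one_hot(tokens, V).float()  # (T, V) stand-in value vectors
# An unmasked model can learn a shifted-identity attention pattern:
attn = torch.zeros(T, T)
attn[torch.arange(T - 1), torch.arange(1, T)] = 1.0      # position t attends to t+1
out = attn @ values
# The first T-1 outputs reproduce the targets exactly: zero-loss "cheating".
assert torch.equal(out[:-1].argmax(-1), tokens[1:])
```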

## Code Highlights

Implemented RoPE with KV-caching support in `backbone.py`. Also wrote ablations for RoPE vs sinusoidal and KV cache benchmarking.
Collect data:

```bash
python generate_data.py --num_trajectories 10000 --seq_length 50 --output data/trajectories.pkl
```

Train:

```bash
python backbone.py
```

Ablations:

```bash
python ablations.py
```

Extra packages:

```bash
pip install pillow six seaborn
```

## Challenges and Solutions

**Problem**: Loss was flat at ~4.6 even though the model code was correct.

Spent way too long debugging the model before thinking to check the data. Looked at `generate_data.py` and found the issue: the signal amplitude was 0.01 and the noise amplitude was 0.05, so the SNR was terrible. Bumped the signal to 0.1 and dropped the noise to 0.001; loss immediately started decreasing and converged to ~1.4.
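For context, that fix moved the amplitude SNR from roughly 0.2 (noise dominates) to roughly 100. A back-of-the-envelope sketch, assuming `generate_data.py` adds noise to a clean signal; `snr` is a hypothetical helper, not a function in that script:

```python
def snr(signal_amp: float, noise_amp: float) -> float:
    """Amplitude signal-to-noise ratio (hypothetical helper, not from generate_data.py)."""
    return signal_amp / noise_amp

print(snr(0.01, 0.05))   # ~0.2: noise swamps the signal, loss stays flat
print(snr(0.1, 0.001))   # ~100: signal dominates, loss can decrease
```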
65 changes: 65 additions & 0 deletions grading_reports/GRADING_REPORT.md
@@ -0,0 +1,65 @@
![Chris-Bot](~/chris_robot.png)
### 🤖 Chris's Grading Assistant - Feedback Report

**Student:** @Jdvakil
**PR:** #49
**Branch:** `scratch-1-jdvakil`

Hi! I've reviewed your submission. Here's what I found:

---

## 📊 Component Feedback

### ✅ Causal Self-Attention

✅ Perfect! Your causal mask correctly prevents future token leakage.

✅ Test passed.

### ✅ RMSNorm

✅ RMSNorm implemented correctly with proper normalization and learnable scale.

✅ Test passed.

### ✅ Training Loop

✅ Excellent! Your model trains successfully and loss converges.

### ✅ RoPE Embeddings

✅ RoPE correctly applied to Q and K tensors.

### ✅ Model Architecture

✅ Model forward pass works end-to-end with correct output shapes.

✅ Model has the expected number of trainable parameters.

### ✅ Code Quality

Your code imports and runs cleanly. Nice! ✨

---

## 📝 Documentation & Analysis

✅ Report submitted! I found:
- `content/course/submissions/scratch-1/jdvakil.mdx`
- `README.md`

Your instructor will review the quality of your analysis.

---

## 🎯 Mastery Features Detected

I noticed you implemented:
- RoPE vs Sinusoidal ablation study

Great work going beyond the requirements! Your instructor will verify implementation quality.

---

> *Grading is automated but reviewed by an instructor. If you have questions, reach out on Slack!*
54 changes: 54 additions & 0 deletions pyproject.toml
@@ -0,0 +1,54 @@
[project]
name = "vla-foundations"
version = "0.1.0"
description = "VLA Foundations Course - Private Instructor Repository"
readme = "README.md"
requires-python = ">=3.10,<3.14"
dependencies = [
    "torch>=2.0.0",
    "torchvision",
    "numpy>=1.24.0",
    "pytest>=7.0.0",
    "pytest-html>=4.0.0",
    "matplotlib>=3.5.0",
]

[[tool.uv.index]]
name = "pytorch-cpu"
url = "https://download.pytorch.org/whl/cpu"
explicit = true

[[tool.uv.index]]
name = "pytorch-cu118"
url = "https://download.pytorch.org/whl/cu118"
explicit = true

[tool.uv.sources]
torch = [
    { index = "pytorch-cpu", marker = "sys_platform == 'darwin'" },
    { index = "pytorch-cu118", marker = "sys_platform == 'linux'" },
]
torchvision = [
    { index = "pytorch-cpu", marker = "sys_platform == 'darwin'" },
    { index = "pytorch-cu118", marker = "sys_platform == 'linux'" },
]

[tool.hatch.build.targets.wheel]
packages = []

[tool.pytest.ini_options]
markers = [
    "internal: internal grading tests (never public)",
    "rigor: rigorous grading tests",
    "gradient: gradient flow tests",
    "fidelity: output comparison tests",
    "training: training convergence tests",
    "mastery: optional mastery-level features (DINOv2, KV-cache, etc.)",
]
testpaths = ["tests"]
python_files = ["test_*.py"]
python_classes = ["Test*"]
python_functions = ["test_*"]

[dependency-groups]
dev = []