53 changes: 53 additions & 0 deletions content/course/submissions/scratch-1/heyang-huang.mdx
@@ -0,0 +1,53 @@
---
title: "Scratch-1 Submission: Heyang Huang"
student: "Heyang Huang"
date: "2026-02-01"
---

# Scratch-1: The Transformer Backbone

This submission implements a decoder-only Transformer from scratch for next-token prediction on synthetic robot trajectories. The model includes multi-head causal self-attention, Rotary Positional Embeddings (RoPE), RMSNorm, and an autoregressive training loop with gradient clipping.
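
As a rough illustration of how these pieces fit together, a pre-norm decoder block of the kind described above might be wired like this (dimensions, module names, and the use of `nn.RMSNorm` are illustrative rather than the submission's exact code; the RoPE rotation of Q/K is omitted here and sketched under Code Highlights):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DecoderBlock(nn.Module):
    """Pre-norm decoder block: norm -> causal self-attention -> residual,
    then norm -> MLP -> residual. Sizes and names are illustrative."""

    def __init__(self, d_model: int = 256, n_heads: int = 4, d_ff: int = 1024):
        super().__init__()
        self.n_heads = n_heads
        self.attn_norm = nn.RMSNorm(d_model)  # nn.RMSNorm needs PyTorch >= 2.4; a hand-rolled variant is sketched under Code Highlights
        self.qkv = nn.Linear(d_model, 3 * d_model, bias=False)
        self.proj = nn.Linear(d_model, d_model, bias=False)
        self.mlp_norm = nn.RMSNorm(d_model)
        self.mlp = nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, t, d = x.shape
        q, k, v = self.qkv(self.attn_norm(x)).chunk(3, dim=-1)
        # reshape to (batch, heads, seq_len, head_dim)
        q, k, v = (z.view(b, t, self.n_heads, -1).transpose(1, 2) for z in (q, k, v))
        # is_causal=True applies the lower-triangular mask inside the attention kernel
        a = F.scaled_dot_product_attention(q, k, v, is_causal=True)
        x = x + self.proj(a.transpose(1, 2).reshape(b, t, d))  # attention sub-layer + residual
        x = x + self.mlp(self.mlp_norm(x))                     # feed-forward sub-layer + residual
        return x
```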

---

## Loss Curve

![Training Loss](./images/loss_curve.png)

The training loss decreases rapidly from an initial value above 3.0 and converges smoothly to approximately **1.9–2.0** after several thousand optimization steps. This behavior matches the expected range for the provided synthetic trajectory dataset and indicates that the model successfully learns structured action patterns rather than memorizing noise.
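
For reference, one step of the autoregressive training loop described above can be written roughly as follows (the optimizer, learning rate, and variable names are illustrative; `model` is assumed to map token ids of shape `(batch, seq_len)` to logits of shape `(batch, seq_len, vocab)`):

```python
import torch
import torch.nn.functional as F

def train_step(model, optimizer, tokens):
    """One next-token prediction step on a batch of trajectory tokens of shape (batch, seq_len)."""
    inputs, targets = tokens[:, :-1], tokens[:, 1:]        # shift by one position
    logits = model(inputs)                                 # (batch, seq_len - 1, vocab)
    loss = F.cross_entropy(logits.reshape(-1, logits.size(-1)), targets.reshape(-1))
    optimizer.zero_grad()
    loss.backward()
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)  # gradient clipping, as in the report
    optimizer.step()
    return loss.item()

# typical usage (AdamW is an assumption, not necessarily the submission's optimizer):
# optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4)
```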

## Attention Visualization

![Attention Maps](./images/attn_map_layer0_head0.png)

The attention map from the first Transformer layer exhibits a clear **lower-triangular structure**, confirming that the causal mask is correctly enforced. Attention mass is concentrated near the diagonal, indicating that early layers primarily attend to recent tokens and encode local temporal dependencies. No attention leakage to future positions is observed, validating the correctness of the causal self-attention implementation.
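
One simple way to produce such a map and verify the absence of leakage is to capture the post-softmax weights for a single layer and head and check that everything strictly above the diagonal is zero (how the weights are exposed is an assumption here; the submission's code may return them differently):

```python
import torch
import matplotlib.pyplot as plt

def check_and_plot_attention(attn: torch.Tensor, path: str = "attn_map_layer0_head0.png"):
    """attn: (seq_len, seq_len) post-softmax attention weights for one layer and head."""
    upper = torch.triu(attn, diagonal=1)  # strictly-upper triangle = attention to future positions
    assert torch.allclose(upper, torch.zeros_like(upper), atol=1e-6), "causal mask leaked"
    plt.imshow(attn.detach().cpu(), cmap="viridis")
    plt.xlabel("key position")
    plt.ylabel("query position")
    plt.savefig(path)
```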

## The Audit: Removing the Causal Mask

When the causal mask is removed during training, the model’s training loss drops significantly faster and reaches an artificially low value. While this may appear to improve optimization, the resulting model fails to respect the autoregressive constraint required for next-token prediction.

Specifically, without the causal mask, each token is allowed to attend to future tokens in the sequence, including the ground-truth target token itself. This introduces information leakage during training and invalidates the intended learning objective.

### Why the Model "Cheats"

Without the causal mask, the attention mechanism can directly access future tokens, effectively collapsing the prediction task into a near-identity mapping. Instead of learning to model the conditional distribution $P(s_t \mid s_{<t})$, the model implicitly learns $P(s_t \mid s_{\le T})$, which includes the answer.

This results in deceptively low training loss but produces a model that performs poorly at inference time, where future tokens are not available. The causal mask is therefore essential to enforce the correct autoregressive structure and prevent this form of “cheating.”
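
The difference amounts to a single masking step before the softmax. A minimal, self-contained illustration (shapes and values are arbitrary):

```python
import torch
import torch.nn.functional as F

q = k = v = torch.randn(1, 8, 16)             # (batch, seq_len = 8, head_dim = 16)
scores = q @ k.transpose(-2, -1) / 16 ** 0.5  # (1, 8, 8) raw attention scores

# With the causal mask: position t can only attend to positions <= t.
mask = torch.tril(torch.ones(8, 8, dtype=torch.bool))
causal_weights = F.softmax(scores.masked_fill(~mask, float("-inf")), dim=-1)

# Without the mask: every position also attends to future tokens, so the
# representation used to predict s_{t+1} can already contain s_{t+1}.
leaky_weights = F.softmax(scores, dim=-1)

causal_out, leaky_out = causal_weights @ v, leaky_weights @ v
```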


## Code Highlights

- Implemented **Causal Self-Attention** with explicit lower-triangular masking applied before softmax.
- Used **RMSNorm** instead of LayerNorm for improved numerical stability and efficiency (a minimal sketch of RMSNorm and the RoPE rotation follows this list).
- Integrated **Rotary Positional Embeddings (RoPE)** to encode relative positional information without absolute embeddings.
- Applied **gradient clipping (max norm = 1.0)** to stabilize training.
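
The RMSNorm and RoPE pieces listed above can be sketched as follows. This is a minimal reference implementation of the general techniques, not the submission's exact code, and the interleaved-pair RoPE formulation shown is one of several common variants:

```python
import torch
import torch.nn as nn

class RMSNorm(nn.Module):
    """Root-mean-square normalization with a learnable scale (no mean subtraction, no bias)."""

    def __init__(self, dim: int, eps: float = 1e-6):
        super().__init__()
        self.eps = eps
        self.weight = nn.Parameter(torch.ones(dim))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        rms = x.pow(2).mean(dim=-1, keepdim=True).add(self.eps).rsqrt()
        return x * rms * self.weight

def apply_rope(x: torch.Tensor, base: float = 10000.0) -> torch.Tensor:
    """Rotate channel pairs of Q or K by position-dependent angles (interleaved RoPE variant)."""
    b, h, t, d = x.shape                                  # head_dim d assumed even
    inv_freq = 1.0 / (base ** (torch.arange(0, d, 2, device=x.device).float() / d))
    angles = torch.arange(t, device=x.device)[:, None] * inv_freq[None, :]  # (t, d/2)
    cos, sin = angles.cos(), angles.sin()
    x1, x2 = x[..., 0::2], x[..., 1::2]
    rotated = torch.stack((x1 * cos - x2 * sin, x1 * sin + x2 * cos), dim=-1)
    return rotated.flatten(-2)                            # back to (b, h, t, d)
```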

## Challenges and Solutions

A primary challenge was aligning the action tokenization scheme with the assumptions of the learning objective. Early experiments with weakly structured action tokens resulted in poor convergence. Synchronizing the data generation process with the intended “direction + magnitude” structure produced a more learnable sequence distribution and led to stable convergence within the expected loss range.
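
To make the "direction + magnitude" structure concrete, a toy tokenization of 2-D displacement actions might look like the following (the bin counts and vocabulary layout here are hypothetical, not the submission's actual scheme):

```python
import math

DIRECTIONS = 8   # compass-style direction bins (hypothetical)
MAGNITUDES = 4   # coarse step-size bins (hypothetical)

def tokenize_action(dx: float, dy: float, max_step: float = 1.0) -> int:
    """Map a 2-D displacement to a single 'direction + magnitude' token id."""
    angle = math.atan2(dy, dx) % (2 * math.pi)
    direction_bin = int(angle / (2 * math.pi) * DIRECTIONS) % DIRECTIONS
    magnitude = min(math.hypot(dx, dy), max_step)
    magnitude_bin = min(int(magnitude / max_step * MAGNITUDES), MAGNITUDES - 1)
    return direction_bin * MAGNITUDES + magnitude_bin     # token id in [0, DIRECTIONS * MAGNITUDES)
```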

Another challenge was debugging attention masking behavior. Visualizing attention maps proved essential for verifying that the causal constraint was correctly enforced.
328 changes: 328 additions & 0 deletions content/textbook/audits/heyangmel.mdx

Large diffs are not rendered by default.

60 changes: 60 additions & 0 deletions grading_reports/GRADING_REPORT.md
@@ -0,0 +1,60 @@
![Chris-Bot](~/chris_robot.png)
### 🤖 Chris's Grading Assistant - Feedback Report

**Student:** @Hhy903
**PR:** #41
**Branch:** `scratch-1-heyang`

Hi! I've reviewed your submission. Here's what I found:

---

## 📊 Component Feedback

### ✅ Causal Self-Attention

✅ Perfect! Your causal mask correctly prevents future token leakage.

✅ Test passed.

### ✅ RMSNorm

✅ RMSNorm implemented correctly with proper normalization and learnable scale.

✅ Test passed.

### ✅ Training Loop

✅ Excellent! Your model trains successfully and loss converges.

### ✅ RoPE Embeddings

✅ RoPE correctly applied to Q and K tensors.

### ✅ Model Architecture

✅ Model forward pass works end-to-end with correct output shapes.

✅ Model has the expected number of trainable parameters.

### ❌ Code Quality

✅ Code imports successfully.

✅ Test passed.

❌ Test failed.

---

## 📝 Documentation & Analysis

✅ Report submitted! I found:
- `content/course/submissions/scratch-1/heyang-huang.mdx`
- `README.md`

Your instructor will review the quality of your analysis.

---

> *Grading is automated but reviewed by an instructor. If you have questions, reach out on Slack!*
54 changes: 54 additions & 0 deletions pyproject.toml
@@ -0,0 +1,54 @@
[project]
name = "vla-foundations"
version = "0.1.0"
description = "VLA Foundations Course - Private Instructor Repository"
readme = "README.md"
requires-python = ">=3.10,<3.14"
dependencies = [
    "torch>=2.0.0",
    "torchvision",
    "numpy>=1.24.0",
    "pytest>=7.0.0",
    "pytest-html>=4.0.0",
    "matplotlib>=3.5.0",
]

[[tool.uv.index]]
name = "pytorch-cpu"
url = "https://download.pytorch.org/whl/cpu"
explicit = true

[[tool.uv.index]]
name = "pytorch-cu118"
url = "https://download.pytorch.org/whl/cu118"
explicit = true

[tool.uv.sources]
torch = [
    { index = "pytorch-cpu", marker = "sys_platform == 'darwin'" },
    { index = "pytorch-cu118", marker = "sys_platform == 'linux'" }
]
torchvision = [
    { index = "pytorch-cpu", marker = "sys_platform == 'darwin'" },
    { index = "pytorch-cu118", marker = "sys_platform == 'linux'" }
]

[tool.hatch.build.targets.wheel]
packages = []

[tool.pytest.ini_options]
markers = [
    "internal: internal grading tests (never public)",
    "rigor: rigorous grading tests",
    "gradient: gradient flow tests",
    "fidelity: output comparison tests",
    "training: training convergence tests",
    "mastery: optional mastery-level features (DINOv2, KV-cache, etc.)",
]
testpaths = ["tests"]
python_files = ["test_*.py"]
python_classes = ["Test*"]
python_functions = ["test_*"]

[dependency-groups]
dev = []