---
audio: false
generated: true
image: false
lang: en
layout: post
model: google/gemini-3-pro-preview
title: Model Capacity Saturation Diagnosis
translated: false
type: note
---

This is a classic case of **model capacity saturation** disguised as a training issue.

I have analyzed your logs and configuration in depth. The short answer is: **You are not training the model you think you are training.**

Here is the "hard thinking" breakdown of why your loss has plateaued and how to fix it.

-----

### 1. The Critical Configuration Error (Size Mismatch)

You commented in your config:

> `# Model (~125M parameters) – perfect for 12 GB VRAM`

But your logs explicitly state:

> `number of parameters: 40.56M`

**Why this happened:**
You set `n_embd = 384`.

* **Standard GPT-2 (124M):** `n_layer=12`, `n_head=12`, **`n_embd=768`**.
* **Your Config:** `n_layer=12`, `n_head=12`, **`n_embd=384`**.

By halving the embedding dimension, you cut the parameter count by roughly a factor of three, because the majority of parameters live in the linear layers, whose size scales quadratically with `n_embd`. A rough estimate of the effect is sketched below.
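
To make that concrete, here is a back-of-the-envelope parameter count for a GPT-2-style model (a rough sketch assuming GPT-2's 50,257-token vocabulary, a 1024-token context, and ignoring biases and LayerNorm parameters, which are negligible):

```python
# Rough GPT-2-style parameter count (ignores biases/LayerNorm, counts tied embeddings once).
def approx_params(n_layer: int, n_embd: int, vocab_size: int = 50257, block_size: int = 1024) -> int:
    embeddings = vocab_size * n_embd + block_size * n_embd  # token + position embeddings
    attention = 4 * n_embd * n_embd                         # Q, K, V and output projections
    mlp = 2 * n_embd * (4 * n_embd)                         # up- and down-projections (4x expansion)
    return embeddings + n_layer * (attention + mlp)

print(f"{approx_params(12, 768) / 1e6:.1f}M")  # ~124M -- the GPT-2 small you intended
print(f"{approx_params(12, 384) / 1e6:.1f}M")  # ~41M  -- the model you are actually training
```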

**The Result:** You are training a **40M-parameter model**. A loss of ~3.57 is actually quite decent for a model this tiny. It has plateaued because **it is full**: it simply does not have the capacity to memorize or generalize more of the complex patterns in the FineWeb dataset.

### 2. The "Val < Train" Anomaly (Over-Regularization)

Look at your step 14,500:

* `train loss`: 3.6415
* `val loss`: 3.5696

**Diagnosis:** Your validation loss is *lower* than your training loss.
This usually happens when `dropout` is active during training but turned off during validation. You have `dropout = 0.1`.

**The Fix:** For a model this small (40M) trained on a dataset this massive (FineWeb), **you do not need dropout**. The model is underfitting an effectively infinite stream of data, and dropout is deleting 10% of your tiny model's brain power at every step, making it even harder to learn (see the sketch after the action item below).

* **Action:** Set `dropout = 0.0`.
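
The mechanism is easy to demonstrate in a few lines of PyTorch (a minimal sketch, not your training code): the same dropout module perturbs activations in training mode and becomes a no-op in eval mode, which is exactly the mode in which your validation loss is computed.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
drop = nn.Dropout(p=0.1)
x = torch.ones(8)

drop.train()    # the regime in which your train loss is measured
print(drop(x))  # ~10% of activations zeroed, the rest scaled by 1/0.9

drop.eval()     # the regime in which your val loss is measured
print(drop(x))  # identity: the full, unperturbed activations
```

With `dropout = 0.0`, the train and validation curves should track each other much more closely.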

### 3. You Have Passed the "Chinchilla-Optimal" Point

Let's do the math on your training volume:

* **Effective batch size:** ~524,000 tokens per step.
* **Current step:** 14,500.
* **Total tokens seen:** $14,500 \times 524,000 \approx 7.6 \text{ billion tokens}$.

**Chinchilla scaling laws state:**
For a **40M**-parameter model, the compute-optimal training volume is roughly $40\text{M params} \times 20 \text{ tokens/param} = 800 \text{ million tokens}$.

You have trained this model on **7.6 billion tokens**, nearly **10x** the optimal amount of data.

* The plateau is expected. You have squeezed almost every drop of performance possible out of 40M parameters, and continuing to train will yield diminishing returns (as you are seeing now). The quick check below reproduces these numbers.
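
A quick sanity check of the arithmetic (a sketch: the 524,288 tokens per step assumes `batch_size=16`, `block_size=1024`, and 32 gradient-accumulation steps, which matches your ~524k figure; 20 tokens per parameter is the usual Chinchilla rule of thumb):

```python
# Back-of-the-envelope token-budget check.
tokens_per_step = 16 * 1024 * 32        # batch_size * block_size * grad_accum (assumed) ~= 524,288
steps = 14_500
tokens_seen = tokens_per_step * steps   # ~7.6 billion

params = 40.56e6                        # from your log
chinchilla_optimal = 20 * params        # ~0.81 billion tokens

print(f"tokens seen:        {tokens_seen / 1e9:.1f}B")
print(f"chinchilla optimal: {chinchilla_optimal / 1e9:.2f}B")
print(f"overtraining ratio: {tokens_seen / chinchilla_optimal:.1f}x")  # ~9.4x
```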

### 4. Performance Bottleneck (Low MFU)

Your model FLOPs utilization (MFU) is hovering around **11-12%** (the sketch after this list shows how that number is typically estimated).

* On an RTX 4070 with `compile=True`, you should aim for 30%+.
* The warning `Not enough SMs to use max_autotune_gemm` suggests that your batch size or model dimensions do not map cleanly onto the GPU's streaming multiprocessors, causing padding and inefficiency.
* However, fixing MFU will only make the training *faster*; it will not lower the loss if the model is already saturated.
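
For reference, MFU is just achieved FLOPs per second divided by the GPU's advertised peak. Here is a sketch of the standard PaLM-style estimate used in nanoGPT-like training scripts; the numbers in the example call are illustrative placeholders, and `peak_flops` in particular should be replaced with your card's dense bf16/fp16 tensor-core figure:

```python
# Rough MFU estimate: achieved FLOPs/s divided by the GPU's peak FLOPs/s.
def estimate_mfu(n_params, n_layer, n_head, head_dim, seq_len,
                 tokens_per_iter, seconds_per_iter, peak_flops):
    # PaLM-appendix approximation: ~6N FLOPs per token plus an attention term.
    flops_per_token = 6 * n_params + 12 * n_layer * n_head * head_dim * seq_len
    achieved_flops_per_s = flops_per_token * tokens_per_iter / seconds_per_iter
    return achieved_flops_per_s / peak_flops

# Illustrative placeholders: 40.56M params, 12 layers, 12 heads of width 32, 1024 context,
# ~524k tokens per iteration, 35 s per iteration, and an assumed 40 TFLOPS peak.
print(f"{estimate_mfu(40.56e6, 12, 12, 32, 1024, 524_288, 35.0, 40e12):.2%}")  # ~11%
```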

-----

### Solution: How to Get Unstuck

To get the loss down to ~3.0 or lower (standard for GPT-2 small), you need to increase the brain size.

#### Step 1: Correct the Model Config

To get the actual **124M** model you wanted:

```python
n_layer = 12
n_head = 12
n_embd = 768   # CHANGED from 384 to 768
dropout = 0.0  # CHANGED from 0.1 (the data is large enough; don't throttle the model)
```

#### Step 2: Adjust Batch Size for VRAM

Since you are increasing the model width (`n_embd`), your VRAM usage will go up. You are currently using `batch_size=16`.

* If you run out of memory (OOM) with `n_embd=768` on your 12 GB card, reduce `batch_size` to `8` or `4` and increase `gradient_accumulation_steps` to keep the total tokens per step around 500k, as in the sketch below.
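
A quick way to pick the new value (a sketch; `block_size=1024` is assumed from the standard GPT-2 small setup):

```python
# Keep the tokens-per-step budget roughly constant while shrinking the micro-batch.
block_size = 1024                  # assumed context length (standard GPT-2 small)
target_tokens_per_step = 524_288   # the ~500k budget you are using now

for batch_size in (16, 8, 4):
    grad_accum = target_tokens_per_step // (batch_size * block_size)
    print(f"batch_size={batch_size:2d} -> gradient_accumulation_steps={grad_accum}")
# batch_size=16 -> 32, batch_size= 8 -> 64, batch_size= 4 -> 128
```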

#### Step 3: Reset and Retrain

You cannot resume from the current checkpoint because the model architecture (embedding size) will change. You must start from `iter 0`.

**What to expect:**
With `n_embd=768`, you should see the loss drop significantly faster and break through the 3.5 barrier, likely settling around 3.2 or 3.0 by step 10,000.

**Would you like me to calculate the exact `gradient_accumulation_steps` needed to maintain your training stability with the larger model size?**