Commit d318b2e

docs(notes): add capacity saturation diagnosis

---
audio: false
generated: true
image: false
lang: en
layout: post
model: google/gemini-3-pro-preview
title: Model Capacity Saturation Diagnosis
translated: false
type: note
---

This is a classic case of **model capacity saturation** disguised as a training issue.

I have analyzed your logs and configuration carefully. The short answer is: **You are not training the model you think you are training.**

Here is the "hard thinking" breakdown of why your loss has plateaued and how to fix it.

-----

### 1\. The Critical Configuration Error (Size Mismatch)

You commented in your config:

> `# Model (~125M parameters) – perfect for 12 GB VRAM`

But your logs explicitly state:

> `number of parameters: 40.56M`

**Why this happened:**
You set `n_embd = 384`.

* **Standard GPT-2 (124M):** `n_layer=12`, `n_head=12`, **`n_embd=768`**.
* **Your Config:** `n_layer=12`, `n_head=12`, **`n_embd=384`**.

By halving the embedding dimension, you cut the parameter count by roughly a factor of three, because most of the parameters sit in the linear layers, which scale quadratically with `n_embd`.

**The Result:** You are training a **40M parameter model**. A loss of \~3.57 is actually quite decent for a model this tiny. It has plateaued because **it is full**: it simply does not have the capacity to memorize or generalize the more complex patterns in the FineWeb dataset.
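
If you want to sanity-check the arithmetic, here is a quick back-of-the-envelope count (my own sketch, not your training script). It ignores biases, LayerNorms, and position embeddings, and assumes the standard GPT-2 vocabulary of 50,257 with a weight-tied output head:

```python
# Approximate parameter count for a GPT-2-style decoder.
def approx_gpt2_params(n_layer: int, n_embd: int, vocab_size: int = 50_257) -> int:
    token_embedding = vocab_size * n_embd   # shared with the LM head (weight tying)
    per_block = 12 * n_embd * n_embd        # 4*d^2 for attention + 8*d^2 for the MLP
    return token_embedding + n_layer * per_block

print(f"{approx_gpt2_params(12, 768) / 1e6:.1f}M")  # ~123.5M -> the model you wanted
print(f"{approx_gpt2_params(12, 384) / 1e6:.1f}M")  # ~40.5M  -> the model you got
```

Halving `n_embd` shrinks the transformer blocks by 4x but the embedding table only by 2x, which is why the total lands at roughly a 3x reduction rather than 4x.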

### 2\. The "Val \< Train" Anomaly (Over-Regularization)

Look at your Step 14500:

* `train loss`: 3.6415
* `val loss`: 3.5696

**Diagnosis:** Your validation loss is *lower* than your training loss.
This usually happens when `dropout` is active during training but turned off during validation. You have `dropout = 0.1`.

**The Fix:** For a model this small (40M) trained on a dataset this massive (FineWeb), **you do not need dropout**. The model is underfitting an effectively infinite stream of data, and dropout is zeroing out 10% of your tiny model's activations at every training step, making it even harder to learn.

* **Action:** Set `dropout = 0.0`.
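
To see the mechanism concretely: evaluation code typically calls `model.eval()` (nanoGPT-style scripts do this inside their loss-estimation loop), which disables dropout, so the validation forward pass uses the full network while the training pass only uses 90% of it. A minimal standalone PyTorch sketch of the same effect, not taken from your codebase:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Toy model with the same dropout rate as your config.
model = nn.Sequential(nn.Linear(32, 128), nn.ReLU(), nn.Dropout(p=0.1), nn.Linear(128, 1))
x, y = torch.randn(256, 32), torch.randn(256, 1)
loss_fn = nn.MSELoss()

model.train()                  # dropout active, as during your training steps
train_loss = loss_fn(model(x), y)

model.eval()                   # dropout disabled, as during val-loss estimation
with torch.no_grad():
    val_loss = loss_fn(model(x), y)

# On the very same batch, the eval-mode loss is typically slightly lower --
# the same "val < train" pattern you see in your log.
print(train_loss.item(), val_loss.item())
```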

### 3\. You have passed the "Chinchilla Optimal" Point

Let's do the math on your training volume:

* **Effective batch size:** \~524,000 tokens per step.
* **Current Step:** 14,500.
* **Total Tokens Seen:** $14{,}500 \times 524{,}000 \approx 7.6 \text{ Billion tokens}$.

**Chinchilla Scaling Laws state:**
For a **40M** parameter model, the compute-optimal training volume is roughly 20 tokens per parameter, i.e. $40\text{M} \times 20 = 800 \text{ Million tokens}$.

You have trained this model on **7.6 Billion tokens**. You have fed it nearly **10x** the optimal amount of data.

* The plateau is expected. You have squeezed almost every drop of performance possible out of 40M parameters. Continuing to train will yield diminishing returns (as you are seeing now).
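
For reference, the same arithmetic in a few lines. The 20-tokens-per-parameter ratio is the usual Chinchilla rule of thumb, and 524,288 (e.g. 16 × 1024 × 32) is an assumed exact value for your \~524k tokens per step:

```python
params = 40.56e6                   # from your log
tokens_per_step = 524_288          # assumed exact value of ~524k
steps = 14_500

tokens_seen = tokens_per_step * steps      # ~7.6e9
optimal_tokens = 20 * params               # ~8.1e8 (Chinchilla rule of thumb)

print(f"tokens seen:     {tokens_seen / 1e9:.1f}B")
print(f"compute-optimal: {optimal_tokens / 1e9:.2f}B")
print(f"overshoot:       {tokens_seen / optimal_tokens:.1f}x")
# Flipped around: ~7.6B tokens would be compute-optimal for roughly
# tokens_seen / 20 ≈ 380M parameters -- another argument for a bigger model.
```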

### 4\. Performance Bottleneck (MFU is low)

Your Model FLOPs Utilization (MFU) is hovering around **11-12%**.

* On an RTX 4070 with `compile=True`, you should aim for 30%+.
* The warning `Not enough SMs to use max_autotune_gemm` comes from `torch.compile`: the 4070 has fewer streaming multiprocessors than Inductor requires for its max-autotune GEMM kernels, so it falls back to standard ones. The narrow `n_embd=384` matrices also leave the GPU underutilized.
* However, fixing MFU will only make the training *faster*; it will not lower the loss if the model is already saturated.
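
For context, MFU is just achieved FLOPs per second divided by the hardware's peak FLOPs per second, with the achieved FLOPs estimated from the model shape (PaLM-style accounting, which is what nanoGPT's `estimate_mfu` uses). The sketch below is mine, not your script; the block size, step time, and peak-throughput figure are placeholder assumptions. Also note that nanoGPT's stock `estimate_mfu` divides by an A100's 312 TFLOPS bf16 peak, so if your script kept that default, the printed 11-12% understates what the 4070 is really achieving:

```python
def estimate_mfu(n_params, n_layer, n_head, n_embd, block_size,
                 tokens_per_step, step_time_s, peak_flops):
    head_dim = n_embd // n_head
    # ~6 FLOPs per parameter per token (fwd + bwd), plus the attention term.
    flops_per_token = 6 * n_params + 12 * n_layer * n_head * head_dim * block_size
    achieved_flops_per_s = flops_per_token * tokens_per_step / step_time_s
    return achieved_flops_per_s / peak_flops

mfu = estimate_mfu(
    n_params=40.56e6, n_layer=12, n_head=12, n_embd=384,
    block_size=1024,          # assumed
    tokens_per_step=524_288,  # ~524k from your log
    step_time_s=4.5,          # hypothetical; use your actual time per optimizer step
    peak_flops=312e12,        # nanoGPT's hard-coded A100 bf16 peak
)
print(f"{mfu:.1%}")           # ~11% with these placeholder numbers
```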

-----

### Solution: How to get Unstuck

To get the loss down to \~3.0 or lower (standard for GPT-2 small), you need to increase the brain size.

#### Step 1: Correct the Model Config

To get the actual **124M** model you wanted:

```python
n_layer = 12
n_head = 12
n_embd = 768   # CHANGED from 384 to 768
dropout = 0.0  # CHANGED from 0.1 (Data is large enough, don't throttle the model)
```

#### Step 2: Adjust Batch Size for VRAM

Since you are increasing the model width (`n_embd`), your VRAM usage will go up. You are currently using `batch_size=16`.

* If you run out of memory (OOM) with `n_embd=768` on your 12GB card, reduce `batch_size` to `8` or `4` and increase `gradient_accumulation_steps` to keep the total tokens per step around 500k (the sketch below shows the arithmetic).
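
A quick way to pick the new value; `block_size = 1024` and the 524,288-token target are assumptions based on typical defaults, so substitute your actual numbers:

```python
# Keep tokens-per-step constant as the micro-batch shrinks:
# tokens_per_step = batch_size * block_size * gradient_accumulation_steps
block_size = 1024                  # assumed context length
target_tokens_per_step = 524_288   # e.g. 16 * 1024 * 32, your current setup

for batch_size in (16, 8, 4):
    grad_accum = target_tokens_per_step // (batch_size * block_size)
    print(f"batch_size={batch_size:2d} -> gradient_accumulation_steps={grad_accum}")
# batch_size=16 -> 32, batch_size=8 -> 64, batch_size=4 -> 128
```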

#### Step 3: Reset and Retrain

You cannot resume from the current checkpoint because the model architecture (embedding size) will change. You must start from `iter 0`.
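
If your script follows the nanoGPT convention, that means flipping the init switch rather than resuming; the output directory name here is a hypothetical suggestion so the old checkpoint is not overwritten:

```python
init_from = 'scratch'       # not 'resume': a 384-dim checkpoint cannot be loaded into a 768-dim model
out_dir = 'out-gpt2-124m'   # hypothetical fresh output dir to keep the old run around
```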

**What to expect:**
With `n_embd=768`, you should see the loss drop significantly faster and break through the 3.5 barrier, likely settling around 3.0 to 3.2 by step 10,000.

**Would you like me to calculate the exact `gradient_accumulation_steps` needed to keep training stable with the larger model?**
