
fix: correct CSV extraction in scaling_laws.sh #580

Open
geopti wants to merge 2 commits into karpathy:master from geopti:fix/scaling-laws-csv-extraction

Conversation

geopti commented on Feb 28, 2026

Bug fixes

Two bugs in runs/scaling_laws.sh cause the results CSV to be silently wrong.

Bug 1 — parameter columns always empty

base_train.py prints parameter counts with 24-character key padding:

wte                     : 33,554,432
value_embeds            : 150,994,944
transformer_matrices    : 167,772,160
...

The grep patterns used "wte:", "value_embeds:", etc., which never match because the padding spaces sit between the key and the colon. The affected parameter columns in the CSV are therefore silently empty.

Fix: use grep -P "wte\s+:" to handle the padding.
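A minimal sketch of the difference, using a temp file to stand in for the training log (the log lines are copied from the excerpt above; everything else is for illustration only):

```shell
# Reproduce the padded key format printed by base_train.py
# (`{key:24s}: {value:,}`), written to a temp file for illustration.
LOG=$(mktemp)
printf 'wte                     : 33,554,432\n' > "$LOG"

# Old pattern: expects the colon to follow the key directly, so it never matches.
grep -q "wte:" "$LOG" || echo "old pattern: no match"

# Fixed pattern: \s+ absorbs the padding; strip spaces/commas to get a plain number.
WTE=$(grep -P 'wte\s+:' "$LOG" | awk -F: '{gsub(/[ ,]/, "", $2); print $2}')
echo "wte params: $WTE"

rm -f "$LOG"
```

Note `-P` (PCRE) is needed for `\s`; with plain GNU grep, `grep -E "wte +:"` would work equally well here.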

Bug 2 — tokens_trained always wrong

tokens_trained was computed as NUM_ITERS * 524288, hardcoding the default batch size. But base_train.py auto-computes the optimal batch size based on the FLOPs budget and model size — it may differ from 524288, especially for small models at small FLOPs budgets.

Fix: extract tokens_trained directly from the log line "Total number of training tokens: X".
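A sketch of that extraction, assuming the log line has the plain-integer form quoted above (the temp file and the token count are hypothetical placeholders):

```shell
# Hypothetical training log containing the line base_train.py prints;
# the actual token count varies per run.
LOG=$(mktemp)
echo "Total number of training tokens: 1048576" > "$LOG"

# Read the actual value from the log instead of recomputing
# NUM_ITERS * 524288 with a hardcoded batch size.
TOKENS_TRAINED=$(grep "Total number of training tokens:" "$LOG" | awk '{print $NF}')
echo "tokens_trained: $TOKENS_TRAINED"

rm -f "$LOG"
```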

Two bugs caused all parameter columns and tokens_trained to be silently
empty/wrong in the results CSV:

1. Parameter grep patterns did not account for the padded key format.
   base_train.py prints parameters as `{key:24s}: {value:,}`, e.g.
   `wte                     : 33,554,432`, so patterns like `grep "wte:"`
   never matched. Fixed by using `grep -P "wte\s+:"` to handle the spaces.

2. tokens_trained was hardcoded as `NUM_ITERS * 524288`, but the batch
   size is auto-computed by base_train.py and may differ from 524288
   depending on the FLOPs budget and model size. Fixed by extracting the
   actual value from the log line "Total number of training tokens: X".
svlandeg added the potential_bug (Needs investigation/confirmation whether or not it's a bug) and scripts (Edits in the bash scripts) labels on Feb 28, 2026
…g batch size

Same bug as scaling_laws.sh: TOKENS_TRAINED was computed as NUM_ITERS * 524288,
hardcoding the default total batch size. When base_train auto-computes a different
batch size, the value is wrong. Fix by reading "Total number of training tokens:"
directly from the training log.
