Description
Setting `--target-param-data-ratio 0` causes a crash during the auto-computation of the optimal batch size. Since this ratio feeds into the denominator of the Chinchilla scaling math, a zero value results in a `ZeroDivisionError` instead of a graceful exit or a helpful error message.
Steps to reproduce:

```
python -m scripts.base_train --target-param-data-ratio 0
```
Logs

```
(nanochat) ss@ss-Predator-PH315-53:~/nanochat$ python -m scripts.base_train --depth 1 --aspect-ratio 1 --target-param-data-ratio 0
█████ █████
░░███ ░░███
████████ ██████ ████████ ██████ ██████ ░███████ ██████ ███████
░░███░░███ ░░░░░███ ░░███░░███ ███░░███ ███░░███ ░███░░███ ░░░░░███░░░███░
░███ ░███ ███████ ░███ ░███ ░███ ░███░███ ░░░ ░███ ░███ ███████ ░███
░███ ░███ ███░░███ ░███ ░███ ░███ ░███░███ ███ ░███ ░███ ███░░███ ░███ ███
████ █████░░████████ ████ █████░░██████ ░░██████ ████ █████░░███████ ░░█████
░░░░ ░░░░░ ░░░░░░░░ ░░░░ ░░░░░ ░░░░░░ ░░░░░░ ░░░░ ░░░░░ ░░░░░░░░ ░░░░░
Autodetected device type: cuda
2026-02-28 13:00:51,784 - nanochat.common - INFO - Distributed world size: 1
2026-02-28 13:00:51,785 - nanochat.common - WARNING - Peak flops undefined for: NVIDIA GeForce RTX 3060 Laptop GPU, MFU will show as 0%
GPU: NVIDIA GeForce RTX 3060 Laptop GPU | Peak FLOPS (BF16): inf
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
WARNING: Flash Attention 3 not available, using PyTorch SDPA fallback
WARNING: Training will be less efficient without FA3
WARNING: SDPA has no support for sliding window attention (window_pattern='SSSL'). Your GPU utilization will be terrible.
WARNING: Recommend using --window-pattern L for full context attention without alternating sliding window patterns.
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
/home/ss/nanochat/nanochat/tokenizer.py:405: FutureWarning: You are using `torch.load` with `weights_only=False` (the current default value), which uses the default pickle module implicitly. It is possible to construct malicious pickle data which will execute arbitrary code during unpickling (See https://github.com/pytorch/pytorch/blob/main/SECURITY.md#untrusted-models for more details). In a future release, the default value for `weights_only` will be flipped to `True`. This limits the functions that could be executed during unpickling. Arbitrary objects will no longer be allowed to be loaded via this mode unless they are explicitly allowlisted by the user via `torch.serialization.add_safe_globals`. We recommend you start setting `weights_only=True` for any use case where you don't have full control of the loaded file. Please open an issue on GitHub for any issues related to this experimental feature.
  token_bytes = torch.load(f, map_location=device)
Vocab size: 32,768
Model config:
{
  "sequence_len": 2048,
  "vocab_size": 32768,
  "n_layer": 1,
  "n_head": 1,
  "n_kv_head": 1,
  "n_embd": 128,
  "window_pattern": "SSSL"
}
Parameter counts:
wte                  : 4,194,304
value_embeds         : 4,194,304
lm_head              : 4,194,304
transformer_matrices : 196,640
scalars              : 2
total                : 12,779,554
Estimated FLOPs per token: 2.949139e+07
Traceback (most recent call last):
  File "/usr/lib/python3.10/runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/usr/lib/python3.10/runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "/home/ss/nanochat/scripts/base_train.py", line 277, in <module>
    batch_size_ratio = target_tokens / D_REF
ZeroDivisionError: float division by zero
```
Expected behavior:
The script should either ignore the zero (falling back to a default) or fail fast with a clear error message explaining that the ratio must be positive or -1.
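A minimal sketch of the second option, assuming the flag is parsed with argparse (the validator name and the exact wiring into `base_train.py` are hypothetical, not the project's actual code):

```python
import argparse

def positive_or_auto_ratio(value: str) -> float:
    """Hypothetical argparse type for --target-param-data-ratio:
    accept -1 (meaning "auto") or any strictly positive ratio."""
    ratio = float(value)
    if ratio != -1 and ratio <= 0:
        raise argparse.ArgumentTypeError(
            f"--target-param-data-ratio must be positive or -1 (auto), got {ratio}"
        )
    return ratio

parser = argparse.ArgumentParser()
parser.add_argument("--target-param-data-ratio", type=positive_or_auto_ratio)
```

With a check like this, `--target-param-data-ratio 0` would exit with a usage error before the Chinchilla math ever divides by it, instead of surfacing a raw `ZeroDivisionError`.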