Please read SETUP_INSTRUCTIONS for a detailed guide.
Goal: Fastest Time to Train 8M Tokens
| # | Date | Train Loss | Val Loss | Time | Tokens Used | User | Notes |
|---|---|---|---|---|---|---|---|
| 1 | 2025-12-21 | 4.7487 | 4.8466 | 1m 44s 79ms | 8,011,776 | Vuk Rosić | Hyperparameter search: batch size doubled from 4 to 8, n_layers reduced from 32 to 22 to fit into memory, Muon LR 0.015 → 0.024, AdamW LR 0.001 → 0.006 |
| 2 | 2025-12-22 | 4.7479 | 4.8467 | 1m 29s 209ms | 8,011,776 | Vuk Rosić | Squared ReLU instead of SwiGLU; one fewer linear layer in the feed-forward block |
| 3 | 2025-12-22 | 4.7286 | 4.8363 | 1m 28s 664ms | 8,011,776 | ToheedAkhtar01 (GitHub) | Polar Muon: replaces Muon's Newton-Schulz iteration with a fixed-coefficient iterative scheme for faster, numerically stable orthogonalization |
| 4 | 2025-12-23 | 4.7333 | 4.8366 | 1m 27s 856ms | 8,011,776 | GitHub | Fused AdamW |
| 5 | 2025-12-23 | 4.7409 | 4.8403 | 1m 26s 178ms | 8,011,776 | bigwolfeman | Cast model to bf16: `model = model.to(device, dtype=torch.bfloat16)`. Note: optimizers may require higher precision for longer runs |
| 5 (new eval) | 2025-12-24 | 4.7408 | 4.8387 | 1m 59s 44ms | 8,011,776 | - | Added evaluations during training to plot the loss curve; training setup unchanged from #5 |
Record Repeatability / Noise:
- Run 1: 1m 27s 856ms, Train Loss: 4.7333, Val Loss: 4.8366
- Run 2: 1m 28s 275ms, Train Loss: 4.7397, Val Loss: 4.8373
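The Squared-ReLU change in entry #2 can be sketched as below. This is a minimal illustration, not the repository's actual module: the class and dimension names are hypothetical, and the point is that Squared ReLU needs only an up- and a down-projection, whereas SwiGLU needs a third (gate) projection — the "one fewer linear layer" in the notes.

```python
import torch
import torch.nn as nn

class SquaredReLUFFN(nn.Module):
    """Feed-forward block using Squared ReLU: relu(x) ** 2.

    SwiGLU computes down(silu(gate(x)) * up(x)) and so needs three
    linear layers; this variant needs only two.
    """
    def __init__(self, d_model: int, d_ff: int):
        super().__init__()
        self.up = nn.Linear(d_model, d_ff)
        self.down = nn.Linear(d_ff, d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Squared ReLU: cheap, smooth-enough nonlinearity.
        return self.down(torch.relu(self.up(x)) ** 2)
```

Dropping the gate projection saves both parameters and a matmul per token, which is where the wall-clock win in entry #2 plausibly comes from.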
Goal: Fastest Time to Train 20M Tokens
| # | Date | Train Loss | Val Loss | Time | Tokens Used | User | Notes |
|---|---|---|---|---|---|---|---|
| 1 | 2025-12-22 | 4.2004 | 4.2021 | 4m 8s 168ms | 20,004,864 | Vuk Rosić | Hyperparameter search: batch size doubled from 4 to 8, n_layers reduced from 32 to 22 to fit into memory, Muon LR 0.015 → 0.024, AdamW LR 0.001 → 0.006 |
| 2 | 2025-12-22 | 4.2118 | 4.2087 | 3m 32s 156ms | 20,004,864 | Vuk Rosić | Squared ReLU instead of SwiGLU; one fewer linear layer in the feed-forward block |
| 3 | 2025-12-22 | 4.1952 | 4.2056 | 3m 29s 308ms | 20,004,864 | ToheedAkhtar01 (GitHub) | Polar Muon: replaces Muon's Newton-Schulz iteration with a fixed-coefficient iterative scheme for faster, numerically stable orthogonalization |
| 4 | 2025-12-23 | 4.2049 | 4.2075 | 3m 28s 591ms | 20,004,864 | GitHub | Fused AdamW |
| 5 | 2025-12-23 | 4.1701 | 4.1791 | 3m 19s 165ms | 20,004,864 | bigwolfeman | Cast model to bf16: `model = model.to(device, dtype=torch.bfloat16)`. Note: optimizers may require higher precision for longer runs |
| 5 (new eval) | 2025-12-24 | 4.1631 | 4.1756 | 3m 50s 276ms | 20,004,864 | - | Added evaluations during training to plot the loss curve; training setup unchanged from #5 |
Record Repeatability / Noise:
- Run 1: 3m 28s 591ms, Train Loss: 4.2049, Val Loss: 4.2075
- Run 2: 3m 28s 871ms, Train Loss: 4.2049, Val Loss: 4.2075
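For context on the Polar Muon entry: the baseline it replaces is Muon's Newton-Schulz iteration, which approximates the orthogonal polar factor of a gradient matrix. A sketch of that baseline is below, using the quintic coefficients published with Muon; Polar Muon's replacement coefficients live in the contributor's change and are not reproduced here.

```python
import torch

def newton_schulz_orthogonalize(G: torch.Tensor, steps: int = 5) -> torch.Tensor:
    """Approximate the orthogonal polar factor of G (Muon's baseline).

    Each step applies the quintic a*X + (b*A + c*A@A) @ X with A = X X^T,
    driving the singular values of X toward 1 without an explicit SVD.
    """
    a, b, c = 3.4445, -4.7750, 2.0315  # Muon's published quintic coefficients
    X = G / (G.norm() + 1e-7)          # scale so singular values are <= 1
    transposed = X.shape[0] > X.shape[1]
    if transposed:                     # keep X wide so A = X X^T is small
        X = X.T
    for _ in range(steps):
        A = X @ X.T
        X = a * X + (b * A + c * (A @ A)) @ X
    return X.T if transposed else X
```

The iteration uses only matmuls, so it is GPU-friendly; a fixed-coefficient scheme like Polar Muon's changes the per-step polynomial, not this overall matmul-only structure.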
Goal: Fastest Time to Train 100M Tokens
| # | Date | Train Loss | Val Loss | Time | Tokens Used | User | Notes |
|---|---|---|---|---|---|---|---|
| 1 | 2025-12-22 | 3.7212 | 3.7492 | 20m 27s 988ms | 100,007,936 | Vuk Rosić | Hyperparameter search: batch size doubled from 4 to 8, n_layers reduced from 32 to 22 to fit into memory, Muon LR 0.015 → 0.024, AdamW LR 0.001 → 0.006 |
| 2 | 2025-12-22 | 3.7370 | 3.7526 | 17m 27s 59ms | 100,007,936 | Vuk Rosić | Squared ReLU instead of SwiGLU; one fewer linear layer in the feed-forward block |
| 3 | 2025-12-22 | 3.7439 | 3.7609 | 17m 8s 637ms | 100,007,936 | ToheedAkhtar01 (Polar Muon); GitHub (Fused AdamW) | Fused AdamW; Polar Muon: replaces Muon's Newton-Schulz iteration with a fixed-coefficient iterative scheme for faster, numerically stable orthogonalization |
| 4 | 2025-12-23 | 3.6700 | 3.7094 | 16m 17s 221ms | 100,007,936 | bigwolfeman | Cast model to bf16: `model = model.to(device, dtype=torch.bfloat16)`. Note: optimizers may require higher precision for longer runs |
| 4 (new eval) | 2025-12-24 | 3.6568 | 3.7108 | 16m 44s 139ms | 100,007,936 | - | Added evaluations during training to plot the loss curve; training setup unchanged from #4 |
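The bf16 cast in entry #4 is a one-line change; the sketch below shows it on a toy stand-in model (the `nn.Sequential` and its sizes are illustrative, not the real architecture). The key detail is that inputs must be cast to the same dtype as the parameters.

```python
import torch
import torch.nn as nn

# Toy model standing in for the real transformer (hypothetical sizes).
model = nn.Sequential(nn.Linear(16, 16), nn.ReLU(), nn.Linear(16, 1))
device = "cuda" if torch.cuda.is_available() else "cpu"

# The leaderboard's one-liner: cast every parameter and buffer to bfloat16.
model = model.to(device, dtype=torch.bfloat16)

# Inputs must match the parameter dtype, or matmuls will raise a dtype error.
x = torch.randn(4, 16, device=device, dtype=torch.bfloat16)
y = model(x)
assert y.dtype == torch.bfloat16
```

As the table note warns, bf16 weights accumulate rounding error over long runs; a common mitigation is keeping optimizer state (or a master copy of the weights) in fp32.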
Goal: Best Model @ 1B Tokens (GPT-1)
| # | Date | Train Loss | Val Loss | Time | Tokens Used | User | Notes |
|---|---|---|---|---|---|---|---|
| 1 | 2025-12-23 | 3.4747 | 3.3580 | 2h 51m 31s | 1,000,007,680 | Vuk Rosić, ToheedAkhtar01, GitHub #67, GitHub #56 | n_layers 32→22, optimized LRs (Muon 0.024, AdamW 0.006), Squared ReLU, Fused AdamW, Polar Muon |
| 2 | 2025-12-23 | 3.4946 | 3.3583 | 2h 42m 49s | 1,000,007,680 | bigwolfeman | Cast model to bf16: `model = model.to(device, dtype=torch.bfloat16)`. Note: at 1B tokens, higher precision in the optimizer may work better |
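The "Fused AdamW" change folded into entry #1 is a stock-PyTorch switch rather than custom code. A minimal sketch, with a dummy parameter in place of the model and the `fused` flag gated on CUDA availability, since the fused path targets GPU tensors:

```python
import torch

# Dummy parameter standing in for the model weights (illustrative).
param = torch.nn.Parameter(torch.randn(8, 8))
before = param.detach().clone()

# fused=True runs the whole AdamW update in a single kernel on CUDA,
# cutting per-step optimizer overhead; lr matches the leaderboard's 0.006.
opt = torch.optim.AdamW([param], lr=0.006, fused=torch.cuda.is_available())

(param ** 2).sum().backward()
opt.step()
```

With `fused=False` (the CPU fallback here) the optimizer takes the default implementation, so the sketch runs anywhere while demonstrating the flag that produced the speedup.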