Opening High-Score (HS) Track
Rationale: The NanoGPT speedrun has been effective at optimizing training speed, but at the expense of code readability. For instance, without a thorough understanding of floating-point formats, one might struggle to understand how the following code works:

```python
acc_m_u32 = (acc_bf16_view_u16.to(torch.uint32) << 16) | mantissa.to(torch.uint32)
acc_m_u32.view(torch.float32).mul_(1 - eff_weight_decay)
acc_m_u32.view(torch.float32).add_(other=v, alpha=-eff_lr)
acc_bf16_view_u16.copy_((acc_m_u32 >> 16).to(torch.uint16))
mantissa.copy_(acc_m_u32.to(torch.uint16))
```

It is even less clear why this implementation is faster than a direct approach.
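For reference, here is a NumPy sketch of what the snippet appears to compute: the optimizer state is a full fp32 value stored as two uint16 halves (a bf16 "high" half plus the low mantissa bits that bf16 would discard), reassembled for the update and split again afterward. The function and variable names below are my own, not from the repository:

```python
import numpy as np

def split_state_update(acc_hi_u16, mantissa_u16, v, eff_lr, eff_weight_decay):
    # Reassemble the fp32 accumulator: high 16 bits come from the bf16 half,
    # low 16 bits from the separately stored mantissa tensor.
    bits = (acc_hi_u16.astype(np.uint32) << 16) | mantissa_u16.astype(np.uint32)
    acc = bits.view(np.float32)
    # The optimizer step itself, in plain fp32 arithmetic:
    # weight decay, then the update step scaled by the learning rate.
    acc = acc * np.float32(1 - eff_weight_decay) - np.float32(eff_lr) * v
    # Split the result back into its two uint16 halves.
    out_bits = acc.view(np.uint32)
    return (out_bits >> 16).astype(np.uint16), out_bits.astype(np.uint16)
```

The appeal of the split representation is that the parameter tensor stays bf16 (so the forward/backward pass is unchanged) while the update is still performed at full fp32 precision.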
I propose opening a High-Score (HS) track aimed at balancing legibility and efficiency. This is my draft:
- Models must be trained on a predefined number `x` of tokens (e.g., 2 billion). These tokens must appear in the same sequence during training. Early exiting, skipping data, or using any piece of data more than once is prohibited.
- The total number of active parameters must not exceed `y` million (`y`M).
- The total training time must not exceed `z` minutes. The value of `z` should be slightly higher than the typical runtime of a standard training run (without NanoGPT-specific optimizations). Runs that exceed the time limit without processing all tokens will be disqualified.
- Evaluate the model on `w` predefined downstream NLP benchmarks. The score will be calculated as the average accuracy across these benchmarks.
- (Optional) Penalize 0.001% of the score for every valid line of code. This can serve as a Kolmogorov complexity penalty term, encouraging concise and efficient implementations.
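With the optional penalty, the scoring rule could be read as follows. This is a sketch: the function name is hypothetical, and the multiplicative form of the penalty is my assumption, since the draft does not pin down whether the 0.001% compounds per line:

```python
def hs_score(benchmark_accuracies, lines_of_code):
    """Hypothetical HS-track score: average accuracy over the predefined
    benchmarks, reduced by 0.001% of the score per valid line of code."""
    avg = sum(benchmark_accuracies) / len(benchmark_accuracies)
    # One reading of the penalty: each line shaves off 0.001% multiplicatively.
    return avg * (1 - 0.00001) ** lines_of_code
```

At this rate the penalty is gentle (a 1,000-line submission loses about 1% of its score), so it discourages bloat without dominating benchmark accuracy.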