fix: add retry with exponential backoff to download_file_with_lock#555

Open
icenfly wants to merge 1 commit into karpathy:master from icenfly:master


icenfly commented Feb 22, 2026

Fixes #554

Impact

Prevents transient network failures (DNS resolution, connection timeout, etc.) from crashing multi-GPU training runs. Without this fix, a single network hiccup during the eval_bundle.zip download at step 2000 causes all 8 ranks to hang and abort via the NCCL watchdog, losing hours of training progress.

Change

Added retry logic with exponential backoff and explicit timeout to download_file_with_lock in nanochat/common.py, matching the existing pattern used by download_single_file in nanochat/dataset.py:
• max_attempts = 5 with exponential backoff (2^attempt seconds)
• timeout=30 on urlopen (same as dataset.py's requests.get(..., timeout=30))
• Partial file cleanup on failure
• Consistent error message format: Attempt {attempt}/{max_attempts} failed for {filename}: {e}

On a healthy network, the download succeeds on the first attempt and behavior is identical to the original. The existing FileLock mechanism for multi-rank coordination is unchanged.
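
For readers who want to see the shape of the retry loop described above, here is a minimal sketch. It is not the code from this PR: the function name `download_with_retry`, the `.tmp` staging file, the injectable `opener` and `sleep` parameters, and the 1-based attempt counter (giving delays of 2, 4, 8, 16 seconds) are all assumptions made for illustration. The FileLock handling is left out on purpose.

```python
import os
import time
import urllib.request


def download_with_retry(url, file_path, max_attempts=5, timeout=30,
                        opener=urllib.request.urlopen, sleep=time.sleep):
    """Hypothetical helper: fetch `url` into `file_path`, retrying on network errors.

    `opener` and `sleep` are injectable only so the loop can be exercised
    without a real network or real waiting (an assumption of this sketch).
    """
    filename = os.path.basename(file_path)
    tmp_path = file_path + ".tmp"  # assumed staging name, not taken from the PR
    for attempt in range(1, max_attempts + 1):
        try:
            # Explicit timeout so a stalled connection cannot block forever.
            with opener(url, timeout=timeout) as response:
                content = response.read()
            with open(tmp_path, "wb") as f:
                f.write(content)
            # Move into place only after the write has fully succeeded.
            os.replace(tmp_path, file_path)
            return file_path
        except OSError as e:
            # urllib.error.URLError and socket timeouts are OSError subclasses.
            print(f"Attempt {attempt}/{max_attempts} failed for {filename}: {e}")
            # Partial file cleanup, so a later attempt starts from scratch.
            if os.path.exists(tmp_path):
                os.remove(tmp_path)
            if attempt == max_attempts:
                raise
            # Exponential backoff: 2^attempt seconds before the next try.
            sleep(2 ** attempt)
```

With these assumed defaults, a download that fails on every attempt waits 2 + 4 + 8 + 16 = 30 seconds in total before re-raising the last error, on top of up to 30 seconds per attempt for the timeout itself.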


Development

Successfully merging this pull request may close these issues.

[BUG] download_file_with_lock crashes multi-GPU training on transient network failures (no retry/timeout)

2 participants