Fix model pull timeouts with retry, resume, and stall detection #1

Open
darrylmorley wants to merge 3 commits into main from fix/model-pull-retry-and-resume

@darrylmorley
Collaborator

Summary

  • Add the `--resume-download` flag to `huggingface-cli download` so interrupted downloads resume where they left off
  • Add retry logic (up to 3 attempts, with a 5-second delay between attempts) when the download process exits non-zero
  • Add a stall watchdog that kills the download process if it writes nothing to stderr for 60 seconds, then retries
  • Surface retry status to the UI via PullProgress events (e.g. "Retrying... (attempt 2/3)")
  • Clean up partial downloads only after all retries are exhausted
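
The retry behaviour listed above can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: `runWithRetry`, `RetryStatus`, and every parameter name are hypothetical, and the real `PullProgress` event type is not shown on this page. The sketch assumes each attempt reports a process exit code and that cleanup of partial files is passed in as a closure.

```swift
import Foundation

/// Hypothetical stand-in for the retry status surfaced to the UI.
struct RetryStatus: Equatable {
    let attempt: Int
    let maxAttempts: Int
    var message: String { "Retrying... (attempt \(attempt)/\(maxAttempts))" }
}

/// Calls `attempt` until it returns exit code 0 or `maxAttempts` is used up.
/// Returns true on success. `cleanup` runs only after the last attempt fails,
/// so partial files stay on disk for a resumed download in between.
func runWithRetry(
    maxAttempts: Int = 3,
    delay: TimeInterval = 5,
    sleep: (TimeInterval) -> Void = { Thread.sleep(forTimeInterval: $0) },
    onRetry: (RetryStatus) -> Void = { _ in },
    cleanup: () -> Void = {},
    attempt: (Int) -> Int32
) -> Bool {
    precondition(maxAttempts >= 1, "maxAttempts must be at least 1")
    for n in 1...maxAttempts {
        if n > 1 {
            // Report the retry first, then wait before trying again.
            onRetry(RetryStatus(attempt: n, maxAttempts: maxAttempts))
            sleep(delay)
        }
        if attempt(n) == 0 { return true }
    }
    cleanup()
    return false
}
```

In a real caller, `attempt` would launch the download command with `--resume-download` and return its exit status, and `onRetry` would forward the message as a progress event. Injecting `sleep` keeps the loop testable without waiting 5 seconds per retry.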

Test plan

  • All 60 existing tests pass
  • Manual test: run `ollmlx pull` for a large model on a slow or throttled connection
  • Manual test: interrupt the network mid-download and verify the download resumes on retry
  • Manual test: verify UI shows retry status messages

🤖 Generated with Claude Code

darrylmorley and others added 3 commits March 30, 2026 15:30
…rashes

Concurrent API requests could each independently trigger stop/start cycles,
killing processes mid-startup. Added ModelSwitchCoordinator actor to serialize
switches: same-model requests coalesce, different-model requests use last-writer-wins
with cancellation, and a reentrant-safe loop prevents duplicate switch tasks.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
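
As a rough illustration of the serialization described in this commit message, the sketch below coalesces same-model requests and lets a newer different-model request cancel the older one. It is not the PR's `ModelSwitchCoordinator`: the names `SwitchCoordinatorSketch`, `switchTo`, and `performSwitch` are hypothetical, and it chains each new task behind the previous one instead of using the loop the commit mentions. It also assumes the stop/start work passed in as `performSwitch` cooperates with task cancellation.

```swift
import Foundation

actor SwitchCoordinatorSketch {
    private struct InFlight {
        let id: Int
        let model: String
        let task: Task<Void, Error>
    }

    // Hypothetical hook that performs the actual stop/start cycle.
    private let performSwitch: @Sendable (String) async throws -> Void
    private var inFlight: InFlight?
    private var nextID = 0

    init(performSwitch: @escaping @Sendable (String) async throws -> Void) {
        self.performSwitch = performSwitch
    }

    func switchTo(_ model: String) async throws {
        // Same model already switching: wait on the existing task (coalesce).
        if let current = inFlight, current.model == model {
            try await current.task.value
            return
        }

        // Different model: the newest request wins, so cancel the older one.
        let previous = inFlight?.task
        previous?.cancel()

        nextID += 1
        let id = nextID
        let work = performSwitch
        let task = Task<Void, Error> {
            // Let the cancelled switch unwind before starting a new one,
            // so two stop/start cycles never overlap.
            _ = await previous?.result
            try Task.checkCancellation()
            try await work(model)
        }
        inFlight = InFlight(id: id, model: model, task: task)
        // Clear the slot only if no newer request has replaced it meanwhile.
        defer { if inFlight?.id == id { inFlight = nil } }
        try await task.value
    }
}
```

Callers whose request is superseded see a `CancellationError`, which an API layer could map to a retry or to a "model changed" response.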

…on macOS 26

huggingface-hub 1.8.0 installs the CLI as `hf` instead of `huggingface-cli`,
breaking install_mlx_lm.sh on clean machines. The script now checks for both
binaries and creates a symlink when only `hf` is present.

Also fixes an NSHostingView constraint crash on macOS 26 Tahoe by setting
explicit contentSize on the NSPopover, adding bounded frames to all views,
and moving the Pull Model sheet to a standalone NSWindow (sheets on popovers
deadlock and crash on Tahoe).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Downloads on slow connections would stall indefinitely with no retry.
Now uses --resume-download to continue interrupted downloads, retries
up to 3 times with 5s delay, and kills stalled processes after 60s
of no progress output. Retry status is surfaced via PullProgress events.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
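
The stall detection described in this commit could be sketched along the following lines. This is an illustration under assumptions rather than the PR's code: `StallWatchdog` and `runWithStallWatchdog` are hypothetical names, and the sketch treats any non-empty stderr read as progress and sends a terminate signal once the timeout elapses without one.

```swift
import Foundation

/// Calls `onStall` if `touch()` is not called again within `timeout` seconds.
/// `onStall` runs on the watchdog's private queue and must not call back
/// into the watchdog (that would deadlock on the serial queue).
final class StallWatchdog {
    private let queue = DispatchQueue(label: "stall-watchdog")
    private let timeout: TimeInterval
    private let onStall: () -> Void
    private var pending: DispatchWorkItem?
    private var fired = false

    var didFire: Bool { queue.sync { fired } }

    init(timeout: TimeInterval = 60, onStall: @escaping () -> Void) {
        self.timeout = timeout
        self.onStall = onStall
    }

    /// Arms or re-arms the timer; call on start and on every progress output.
    func touch() {
        queue.sync {
            pending?.cancel()
            let item = DispatchWorkItem { [weak self] in
                guard let self = self else { return }
                self.fired = true
                self.onStall()
            }
            pending = item
            queue.asyncAfter(deadline: .now() + timeout, execute: item)
        }
    }

    func stop() {
        queue.sync {
            pending?.cancel()
            pending = nil
        }
    }
}

/// Runs a child process and terminates it if stderr stays silent too long.
func runWithStallWatchdog(
    executable: URL, arguments: [String], timeout: TimeInterval = 60
) throws -> (status: Int32, stalled: Bool) {
    let process = Process()
    process.executableURL = executable
    process.arguments = arguments
    let stderrPipe = Pipe()
    process.standardError = stderrPipe

    let watchdog = StallWatchdog(timeout: timeout) {
        if process.isRunning { process.terminate() }
    }
    stderrPipe.fileHandleForReading.readabilityHandler = { handle in
        // Empty data means end of file, which is not progress.
        if !handle.availableData.isEmpty { watchdog.touch() }
    }

    try process.run()
    watchdog.touch()
    process.waitUntilExit()
    watchdog.stop()
    stderrPipe.fileHandleForReading.readabilityHandler = nil
    return (process.terminationStatus, watchdog.didFire)
}
```

A retry loop can treat `stalled == true` like any other failed attempt, since a terminated process normally reports a non-zero termination status.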