Add startup jitter to slurm job runners for thundering herd mitigation (#224)
* Add startup jitter to slurm job runners for thundering herd mitigation
When many Slurm allocations start simultaneously (e.g., 1000 nodes),
all torc-slurm-job-runner processes would contact the server at the
same instant, causing connection timeouts and SQLite lock contention.
Add a --startup-delay-seconds flag to torc-slurm-job-runner. Each
runner sleeps for a deterministic pseudo-random duration (hashed from
hostname, job ID, node ID, and task PID) before its first API call. The
delay window is computed automatically by schedule_slurm_nodes from
the total runner count (scaling from 0s for 1 runner up to 60s for
100+ runners), accounting for start_one_worker_per_node.
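
A minimal Python sketch of the idea, not the code in this commit. The
function names, the SHA-256 hash, the linear ramp between 1 and 100
runners, and the way start_one_worker_per_node changes the runner count
are all assumptions made for illustration; the commit message only
states the endpoints (0s for 1 runner, 60s for 100+ runners) and the
hash inputs.

```python
import hashlib


def total_runner_count(num_allocations, nodes_per_allocation,
                       start_one_worker_per_node):
    """Hypothetical helper: count the runners that will start.

    Assumes one runner per node when start_one_worker_per_node is set,
    otherwise one runner per allocation.
    """
    if start_one_worker_per_node:
        return num_allocations * nodes_per_allocation
    return num_allocations


def delay_window_seconds(total_runners, max_window=60.0, saturation=100):
    """Hypothetical window: 0s for 1 runner, max_window at saturation+.

    The linear ramp in between is an assumption.
    """
    if total_runners <= 1:
        return 0.0
    if total_runners >= saturation:
        return max_window
    return max_window * (total_runners - 1) / (saturation - 1)


def startup_delay_seconds(window_seconds, hostname, job_id, node_id,
                          task_pid):
    """Map the runner's identity to a repeatable delay in [0, window]."""
    if window_seconds <= 0:
        return 0.0
    key = f"{hostname}:{job_id}:{node_id}:{task_pid}".encode()
    digest = hashlib.sha256(key).digest()
    # Interpret the first 8 bytes as a fraction of the window.
    fraction = int.from_bytes(digest[:8], "big") / 2**64
    return fraction * window_seconds
```

Because the delay is derived from a hash rather than from a random
number generator, the same runner identity always yields the same
delay, while different runners spread across the window. A runner
would sleep for the returned number of seconds before its first API
call.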
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>