@ArjunJagdale
Contributor

Fixes #2532

Adds a retry loop in WorkerExecutor.heartbeat() to handle the case where the heartbeat is sent before the JobDocument has been persisted.

  • Retries up to 6 times
  • 0.5s delay between attempts
  • Only retries if the known error string "JobDocument matching query does not exist" is detected
  • Other exceptions still raise immediately

When a worker sends a heartbeat immediately after job assignment, it may race with job persistence, resulting in:
`JobDocument matching query does not exist`.

This patch adds retry logic to the `WorkerExecutor.heartbeat()` method in `executor.py`: if this specific error occurs, the heartbeat is retried a few times, with a delay between attempts. This keeps a transient "job not yet persisted" state from being treated as a real failure and shutting the worker down unnecessarily.

Retries: 6 attempts  
Delay: 0.5s between retries

Only the specific "JobDocument matching query does not exist" error is retried; other exceptions still raise immediately.
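
For reviewers, the behaviour described above can be sketched roughly as follows. This is a minimal standalone illustration, not the code in `executor.py`: `heartbeat_with_retry`, `send_heartbeat`, and the injectable `sleep` parameter are hypothetical names introduced here, and it assumes "6 attempts" means six calls in total, with the last failure re-raised.

```python
import time
from typing import Callable

# Error text quoted in this PR description; matched as a substring.
RETRYABLE_MESSAGE = "JobDocument matching query does not exist"
MAX_ATTEMPTS = 6      # total attempts (assumed interpretation)
DELAY_SECONDS = 0.5   # pause between attempts


def heartbeat_with_retry(
    send_heartbeat: Callable[[], None],  # hypothetical: performs one heartbeat
    max_attempts: int = MAX_ATTEMPTS,
    delay: float = DELAY_SECONDS,
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Call send_heartbeat, retrying only on the known transient error.

    Returns the attempt number (1-based) that succeeded.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            send_heartbeat()
            return attempt
        except Exception as err:
            # Any other error is not transient: raise immediately.
            if RETRYABLE_MESSAGE not in str(err):
                raise
            # Out of attempts: surface the original error.
            if attempt == max_attempts:
                raise
            sleep(delay)
    raise ValueError("max_attempts must be at least 1")
```

Passing `sleep` as a parameter is only there so the loop can be exercised in a unit test without waiting; with the defaults, the worst case adds about 2.5 seconds (five delays of 0.5s) before the error is re-raised.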
Development

Successfully merging this pull request may close these issues.

Fix worker crash when first heartbeat conflicts with job start
