Jobs end up in inconsistent state both completed and with a failed execution when completed during a graceful worker shutdown #568

@doctomarculescu

Description

Each time a Solid Queue supervisor receives a TERM signal, it initiates graceful termination by sending a TERM signal to every worker process it supervises. This triggers the worker's shutdown sequence, which runs in three stages:

  • before_shutdown
  • shutdown
  • after_shutdown

Any Registrable process (workers are Registrable processes) executes stop_heartbeat in the before_shutdown hook.
Only after that does the worker shut down its executor pool and wait up to shutdown_timeout for a graceful exit. This means workers stop heartbeating before they have finished shutting down. During graceful termination they still try to complete the claimed jobs in flight, but other supervisors see them as dead and prune them as dead processes, which fails the jobs claimed by the worker.
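To make the race concrete, here is a toy simulation with an integer clock. It is only a sketch of the sequence described above, not Solid Queue code: `simulate`, the `Job` struct, and all three constants are made up for illustration, and the numbers are not Solid Queue defaults.

```ruby
# Toy model of the race; every name and number here is illustrative.
ALIVE_THRESHOLD  = 5  # seconds without a heartbeat before a process is treated as dead
SHUTDOWN_TIMEOUT = 10 # seconds the worker waits for in-flight jobs after TERM
JOB_FINISHES_AT  = 8  # the in-flight job completes this many seconds after TERM

Job = Struct.new(:finished_at, :failed_at)

# heartbeat_until: last second (after TERM) at which the worker still heartbeats.
def simulate(heartbeat_until:)
  job = Job.new
  last_heartbeat = 0

  (1..SHUTDOWN_TIMEOUT).each do |now|
    last_heartbeat = now if now <= heartbeat_until

    # Another supervisor prunes processes with a stale heartbeat and
    # fails the jobs they still have claimed.
    stale = now - last_heartbeat > ALIVE_THRESHOLD
    job.failed_at ||= now if stale && job.finished_at.nil?

    # The worker itself keeps running and eventually completes the job.
    job.finished_at ||= now if now == JOB_FINISHES_AT
  end

  job
end

p simulate(heartbeat_until: 0)                # heartbeat stopped at TERM
p simulate(heartbeat_until: SHUTDOWN_TIMEOUT) # heartbeat kept during the wait
```

In this model, stopping the heartbeat at TERM leaves the job with both `finished_at` and `failed_at` set, while heartbeating throughout the graceful wait leaves `failed_at` empty.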

The visible consequence of this bug is that we end up with completed jobs that also have an entry in failed executions. These occurrences can easily be detected with the following query in a Rails console:

SolidQueue::Job.joins(:failed_execution)
               .where.not(finished_at: nil)
               .where.not(failed_execution: nil)

The fix seems very simple: move stop_heartbeat from the before_shutdown hook to after_shutdown. We validated the fix in our deployments by patching the Registrable module.
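The effect of moving the hook can be sketched with a minimal stand-in for a lifecycle-hook DSL. This is an assumption-laden illustration, not the actual patch: `ToyProcess`, `CurrentWorker`, `PatchedWorker`, and the `:drain_in_flight_jobs` marker are hypothetical, and only the names stop_heartbeat, before_shutdown, and after_shutdown come from the description above.

```ruby
# Minimal stand-in for a lifecycle-hook DSL; not Solid Queue's implementation.
class ToyProcess
  # Each subclass gets its own hook registry.
  def self.hooks
    @hooks ||= Hash.new { |hash, key| hash[key] = [] }
  end

  def self.before_shutdown(*names)
    hooks[:before_shutdown].concat(names)
  end

  def self.after_shutdown(*names)
    hooks[:after_shutdown].concat(names)
  end

  attr_reader :log

  def initialize
    @log = []
  end

  # Runs the three stages in order: before_shutdown, shutdown, after_shutdown.
  def stop
    run_hooks(:before_shutdown)
    shutdown
    run_hooks(:after_shutdown)
  end

  private

  def run_hooks(kind)
    self.class.hooks[kind].each { |name| send(name) }
  end

  # Stands in for shutting down the pool and waiting for in-flight jobs.
  def shutdown
    log << :drain_in_flight_jobs
  end

  def stop_heartbeat
    log << :stop_heartbeat
  end
end

# Behaviour as reported: the heartbeat stops before jobs are drained.
class CurrentWorker < ToyProcess
  before_shutdown :stop_heartbeat
end

# Proposed change: the heartbeat stops only after jobs are drained.
class PatchedWorker < ToyProcess
  after_shutdown :stop_heartbeat
end

current = CurrentWorker.new
current.stop
p current.log

patched = PatchedWorker.new
patched.stop
p patched.log
```

With the hook registered under after_shutdown, the toy worker keeps heartbeating for the whole graceful wait, so other supervisors have no reason to treat it as dead while it still holds claimed jobs.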

I can open a PR if this is acknowledged as the correct solution. I'm not entirely sure how to test it, because there are no tests for the heartbeat functionality; I'm thinking of adding a test to the process_lifecycle_test integration test.
