Skip to content

Conversation

@adk-swisstopo
Copy link
Contributor

As per #3385 this allows to distinguish between graceful and abnormal termination and gives a chance to workers to do something about it (e.g. to log a stack trace).

For backward compatibility, we keep the old 2 signals model if quick_shutdown_timeout is set to 0.

As per benoitc#3385 this allows to
distinguish between graceful and abnormal termination and gives a chance to
workers to do something about it (e.g. to log a stack trace).

For backward compatibility, we keep the old 2 signals model if
quick_shutdown_timeout is set to 0.
@pajod
Copy link
Contributor

pajod commented Apr 16, 2025

I like it, completely alleviates the concern that people might have depended on the previous behavior.


Note to reviewers: This patch carefully retains two questionable behaviors, which should probably be changed separately:

  • the undocumented (and annoying, in CI-tests) minimum shutdown delay of 0.1 seconds
  • the use of time.time() where time.monotonic() would be appropriate

However:

  • sending SIGINT or SIGQUIT (though handled the same) might need to be retained as well

@zffocussss
Copy link
Contributor

interesting try!

@adk-swisstopo adk-swisstopo requested a review from pajod October 6, 2025 07:20
@pajod
Copy link
Contributor

pajod commented Oct 17, 2025

You requested a review, but I have nothing substantial to add. It appears to work as intended, and whatever could be improved can safely be done so in separate merges.

e.g. I would recommend eventually changing:

  • remove the 0.1 minimum interval in sleep_until
  • discard the 0.1 minimum shutdown interval from the worker, such that graceful shutdown does not typically trigger the entirely unnecessary Worker (pid:1337) was sent SIGKILL! Perhaps out of memory? hint when really nothing went wrong besides cfg.quick_shutdown_timeout being exceeded by 0.008 and thus SIGCHLD not having been processed on arbiter yet.

My naive approach at doing both introduces more ways to create unexpected behavior when actually using very small values for the 3 timeouts, so maybe that is something easier to get right after applying something like #3148.

In the previous commit, I changed that signal from SIGQUIT to SIGINT without
apparent reason. Either should work as they are meant to be used the same way
for the same purpose. Still, there was no reason to change so let's revert.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants