Fix dispatcher crash on gevent LoopExit exceptions #2107

Merged
zmc merged 1 commit into ceph:main from deepssin:fix-gevent-loopexit-dispatcher-crash
Nov 18, 2025
@deepssin
Contributor

Handle gevent LoopExit exceptions gracefully to prevent dispatcher crashes. Add exception handling in the main loop and around the lock_machines() call, with a loop-exit counter (max 10) to prevent infinite restarts. Isolate child processes using start_new_session=True so that job supervisors continue running independently if the dispatcher encounters exceptions.
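
As a rough illustration of the loop-exit counter described above, here is a minimal sketch. It is not the code from this PR: `run_dispatch_loop` and `dispatch_once` are hypothetical names, and the choice to count LoopExit occurrences cumulatively (rather than resetting after a successful cycle) is an assumption. Only the limit of 10 comes from the description. The fallback class exists solely so the sketch runs where gevent is not installed.

```python
import logging

try:
    from gevent.exceptions import LoopExit
except ImportError:
    class LoopExit(Exception):
        """Stand-in used only when gevent is not installed."""

log = logging.getLogger(__name__)

MAX_LOOP_EXITS = 10  # the limit named in the PR description


def run_dispatch_loop(dispatch_once, max_loop_exits=MAX_LOOP_EXITS):
    """Call dispatch_once() until it returns a falsy value.

    dispatch_once is a hypothetical callable standing in for one
    dispatcher cycle. LoopExit is logged and absorbed until it has
    been seen max_loop_exits times, at which point it is re-raised.
    Returns the number of LoopExit exceptions absorbed.
    """
    loop_exits = 0
    while True:
        try:
            if not dispatch_once():
                return loop_exits
        except LoopExit:
            loop_exits += 1
            log.warning("Caught LoopExit (%d/%d)", loop_exits, max_loop_exits)
            if loop_exits >= max_loop_exits:
                log.error("Too many LoopExit exceptions; giving up")
                raise
```

Capping the count keeps a persistently broken event loop from turning into an endless retry cycle, while a handful of transient LoopExit events no longer take the dispatcher down.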

Signed-off-by: deepssin <deepssin@redhat.com>
@deepssin deepssin requested review from kshtsk and zmc November 11, 2025 13:44
@deepssin deepssin requested a review from a team as a code owner November 11, 2025 13:44
@deepssin deepssin requested review from VallariAg and removed request for a team November 11, 2025 13:44
@kshtsk
Contributor

kshtsk commented Nov 12, 2025

Do we have any reference ticket in tracker for this issue?

Member

@zmc zmc left a comment

This looks like a really well thought out changeset! Considering the dispatcher is difficult to fully validate with unit tests, can you describe briefly how you've tested this so far? Thanks!

@deepssin
Contributor Author

> Do we have any reference ticket in tracker for this issue?

@kshtsk There isn’t an existing tracker ticket for this issue; I discovered it while debugging repeated dispatcher crashes in my local OpenStack-based Teuthology environment. The crashes were consistently triggered by gevent.LoopExit surfacing from the main loop when the dispatch cycle completed unexpectedly.

@deepssin
Contributor Author

> This looks like a really well thought out changeset! Considering the dispatcher is difficult to fully validate with unit tests, can you describe briefly how you've tested this so far? Thanks!

@zmc For testing, I validated the changes in my OpenStack setup by:

  1. Running multiple dispatcher cycles with varied queue states (empty queue, repeated job failures).
  2. Reproducing the earlier crash scenario and confirming that the dispatcher no longer exits on LoopExit.
  3. Verifying that child processes continue independently and the supervisor remains stable with the added loop-exit counter logic.

Given that dispatcher behaviour is difficult to cover via unit tests, these manual integration tests helped ensure stability.
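
The child-process isolation checked in step 3 can be sketched roughly as follows, assuming a POSIX system. `spawn_detached` and `in_own_session` are hypothetical helpers written for illustration, not functions from teuthology. Only `start_new_session=True` is taken from the PR description. On POSIX, that flag makes the child call `setsid()` before exec, so it becomes the leader of its own session.

```python
import os
import subprocess
import sys


def spawn_detached(args):
    """Start a child in its own session (POSIX only).

    Because the child no longer shares the parent's session or
    process group, signals aimed at the parent's group do not
    reach it.
    """
    return subprocess.Popen(args, start_new_session=True)


def in_own_session(child):
    """Return True if the running child leads a session of its own."""
    sid = os.getsid(child.pid)
    return sid == child.pid and sid != os.getsid(0)


if __name__ == "__main__":
    # A short-lived Python child stands in for a job supervisor.
    child = spawn_detached(
        [sys.executable, "-c", "import time; time.sleep(1)"]
    )
    print(in_own_session(child))
    child.wait()
```

A check like this only confirms session separation; whether a supervisor actually survives a dispatcher failure still needs the kind of manual integration testing described above.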

@zmc
Member

zmc commented Nov 17, 2025

> > This looks like a really well thought out changeset! Considering the dispatcher is difficult to fully validate with unit tests, can you describe briefly how you've tested this so far? Thanks!
>
> @zmc For testing, I validated the changes in my OpenStack setup by:
>
> 1. Running multiple dispatcher cycles with varied queue states (empty queue, repeated job failures).
> 2. Reproducing the earlier crash scenario and confirming that the dispatcher no longer exits on LoopExit.
> 3. Verifying that child processes continue independently and the supervisor remains stable with the added loop-exit counter logic.
>
> Given that dispatcher behaviour is difficult to cover via unit tests, these manual integration tests helped ensure stability.

Excellent - thanks for testing so thoroughly!

@zmc zmc merged commit 960b9b2 into ceph:main Nov 18, 2025
15 of 17 checks passed
@zmc
Member

zmc commented Nov 18, 2025

Deployed just now on teuthology.front.sepia.ceph.com without issues!
