fix(daemon): abort shmem listener tasks on dataflow finish#1379
Open
suhr25 wants to merge 2 commits intodora-rs:mainfrom
Open
fix(daemon): abort shmem listener tasks on dataflow finish#1379suhr25 wants to merge 2 commits intodora-rs:mainfrom
suhr25 wants to merge 2 commits intodora-rs:mainfrom
Conversation
Each shmem node spawned 4 fire-and-forget tasks (one per shmem channel) with no abort handles, so they were never cleaned up when a dataflow finished. Worse, each task internally called spawn_blocking which blocked forever in pthread_cond_wait(INFINITE) if the node process crashed (SIGKILL/OOM) without signalling the disconnect event. With enough node crashes Tokio's blocking thread pool (default 512 threads) exhausted, deadlocking the daemon. Fix in three parts: - Wrap the 4 shmem listener_loop spawns in a single parent task via tokio::join! and return its abort_handle so RunningDataflow picks it up via the existing _listener_tasks mechanism. - Change ShmemServer::listen() to poll with a 50 ms timeout instead of blocking indefinitely, so each call returns Err on timeout. - In the spawn_blocking loop, retry listen() on Err and check rx.is_disconnected() between retries; when the parent task is aborted the flume Sender drops, is_disconnected() returns true, and the blocking thread exits cleanly within one poll interval. Signed-off-by: suhr25 <suhridmarwah07@gmail.com>
Contributor
Author
|
Hi @phil-opp, |
Contributor
Author
|
Hi @phil-opp, |
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
This PR fixes a critical resource leak in the shared-memory (shmem) listener path where crashed nodes could permanently exhaust Tokio’s blocking thread pool.
Previously, the shmem branch in
spawn_listener_loop()spawned four fire-and-forget listener tasks and returnedNone, meaning they were never tracked or cancelled:Each of these tasks internally used
spawn_blocking()and called:If a node crashed (e.g.
SIGKILL, OOM), the disconnect signal was never sent, causing:Fix
This PR introduces three key changes to make the shmem listener robust and cancellable:
1. Bounded polling instead of infinite blocking
Replaced infinite wait with a short polling interval:
This ensures the listener periodically wakes up and can react to cancellation or disconnection.
2. Retry loop + cooperative shutdown in blocking thread
The
spawn_blockingloop now:3. Single task + AbortHandle for all shmem listeners
Instead of 4 untracked spawns, all listeners are now grouped under a single parent task, and its
AbortHandleis returned:Returned as:
This integrates with existing
_listener_taskscleanup logic and ensures proper shutdown.Verification
Crash scenario tested: simulating node crashes (
kill -9)Tokio behavior
spawn_blockingthreads exit once channel is droppedFunctional validation
Expected outcome