fix(event-source): handle websocket stream termination and add watchdog #4428
Open
hitchhooker wants to merge 1 commit into informalsystems:master
the tokio::select! in websocket event source run_loop only matched on Some(_) for both branches, meaning if the subscription stream silently terminated (returned None without an error), neither branch would match and the loop could hang indefinitely. this caused packets to be missed when the websocket connection became stale without triggering an error.

this patch:
- explicitly handles None from batches.next() to detect stream termination
- explicitly handles None from rx_err.recv() to detect driver channel closure
- adds watchdog timeout using existing link::TIMEOUT (5 min) to detect stale connections that appear connected but are not receiving events
- adds new error variants (StreamTerminated, WatchdogTimeout, DriverChannelClosed)
- propagates errors before reconnection so supervisor clears pending packets
- re-exports TIMEOUT from link module for use in websocket event source

when any of these conditions are detected, an error is propagated to trigger packet clearing, then reconnection is initiated to restore event subscription.
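The two failure modes described above (a stream that ends, and a stream that stays open but goes silent) can be illustrated without tokio. This is a minimal sketch, not the code from this patch: it swaps the async `tokio::select!` loop for a blocking `std::sync::mpsc` channel, where `recv_timeout` distinguishes a closed channel from a quiet one. `SourceError`, `next_batch`, and the `u64` batch type are hypothetical names chosen for illustration; only the variant names are taken from the patch description.

```rust
use std::sync::mpsc::{self, RecvTimeoutError};
use std::time::Duration;

/// Hypothetical error type; the variant names mirror those listed in the patch.
#[derive(Debug, PartialEq)]
enum SourceError {
    /// The sending side went away, so no further batches can ever arrive.
    StreamTerminated,
    /// The channel is still open, but nothing arrived within the watchdog window.
    WatchdogTimeout,
}

/// Wait for the next batch, treating both "channel closed" and
/// "nothing arrived within `watchdog`" as errors instead of waiting forever.
fn next_batch(rx: &mpsc::Receiver<u64>, watchdog: Duration) -> Result<u64, SourceError> {
    match rx.recv_timeout(watchdog) {
        Ok(batch) => Ok(batch),
        // Analogous to the subscription stream yielding `None`.
        Err(RecvTimeoutError::Disconnected) => Err(SourceError::StreamTerminated),
        // Analogous to a connection that looks alive but delivers no events.
        Err(RecvTimeoutError::Timeout) => Err(SourceError::WatchdogTimeout),
    }
}
```

The point of the sketch is that every outcome of the wait maps to either a batch or an explicit error, so the caller always regains control and can decide to reconnect.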
Summary
fixes silent websocket connection failures that could cause packets to be missed indefinitely.
the `tokio::select!` in websocket event source `run_loop` only matched on `Some(_)` for both branches. if the subscription stream silently terminated (returned `None` without an error), neither branch would match and the loop could hang indefinitely, causing packets to be missed when the websocket connection became stale.
Changes
- explicitly handle `None` from `batches.next()` to detect stream termination
- explicitly handle `None` from `rx_err.recv()` to detect driver channel closure
- add watchdog timeout using existing `link::TIMEOUT` (5 min) to detect stale connections
- add new error variants (`StreamTerminated`, `WatchdogTimeout`, `DriverChannelClosed`)
- re-export `TIMEOUT` from link module for reuse

when any of these conditions are detected, an error is propagated to trigger packet clearing, then reconnection is initiated to restore event subscription.
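The ordering in the last sentence (report the error first, reconnect second) can be sketched as follows. This is an illustration under stated assumptions, not the relayer's actual API: `Failure`, `report_then_reconnect`, the `Sender<Failure>` standing in for the path to the supervisor, and the `reconnect` closure are all hypothetical.

```rust
use std::sync::mpsc::Sender;

/// Hypothetical failure type; the variant names mirror those listed above.
#[derive(Debug, Clone, PartialEq)]
enum Failure {
    StreamTerminated,
    WatchdogTimeout,
    DriverChannelClosed,
}

/// Surface the failure to the supervisor before attempting to reconnect,
/// so that pending packets can be cleared even if reconnection is slow.
fn report_then_reconnect<F: FnOnce()>(
    failure: Failure,
    to_supervisor: &Sender<Failure>,
    reconnect: F,
) {
    // Step 1: propagate the error. A send failure only means the supervisor
    // is gone, in which case there is nothing left to clear.
    let _ = to_supervisor.send(failure);
    // Step 2: only now re-establish the subscription.
    reconnect();
}
```

Passing the reconnect step in as a closure keeps the ordering visible in one place: by the time `reconnect` runs, the failure has already been queued for the supervisor.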
Related Issues
potentially related to stuck packet issues between chains (e.g. cosmos hub <-> penumbra) where websocket subscriptions silently fail without triggering reconnection or packet clearing.
Test Plan
- `cargo check -p ibc-relayer` passes
- `cargo test -p ibc-relayer --lib` passes (72 tests)