fix(event-source): handle websocket stream termination and add watchdog #4428

Open
hitchhooker wants to merge 1 commit into informalsystems:master from hitchhooker:fix/websocket-stream-termination-handling
@hitchhooker
Summary

fixes silent websocket connection failures that could cause packets to be missed indefinitely.

the tokio::select! in the websocket event source run_loop only matched on Some(_) in both branches. if the subscription stream silently terminated (returned None without an error), that branch's pattern would not match, so nothing handled the termination and the loop could hang indefinitely, causing packets to be missed when the websocket connection became stale.
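a minimal sketch of what "handling None explicitly" means, written in plain std Rust rather than against the real run_loop. `Polled`, `Step` and `classify` are hypothetical names for illustration only; they are not types from ibc-relayer, and the real code does this inside tokio::select! rather than in a standalone function.

```rust
/// Hypothetical stand-in for one poll of the subscription stream:
/// `None` means the stream ended, `Some(Err(_))` means it yielded an error.
type Polled = Option<Result<Vec<String>, String>>;

/// Hypothetical action the loop takes after one poll.
#[derive(Debug, PartialEq)]
enum Step {
    /// A batch of events arrived and should be processed.
    Process(Vec<String>),
    /// The subscription is unusable; report the reason and reconnect.
    Reconnect(String),
}

/// Map every possible poll result to an action. The `None` arm is the
/// case a `Some(_)`-only pattern leaves unhandled.
fn classify(polled: Polled) -> Step {
    match polled {
        Some(Ok(batch)) => Step::Process(batch),
        Some(Err(e)) => Step::Reconnect(format!("subscription error: {}", e)),
        None => Step::Reconnect("stream terminated".to_string()),
    }
}
```

because the match is exhaustive, the compiler rejects the function if the None arm is removed, which is the property a pattern-only select! branch does not give you.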

Changes

  • explicitly handle None from batches.next() to detect stream termination
  • explicitly handle None from rx_err.recv() to detect driver channel closure
  • add watchdog timeout using existing link::TIMEOUT (5 min) to detect stale connections
  • add new error variants (StreamTerminated, WatchdogTimeout, DriverChannelClosed)
  • propagate errors before reconnection so supervisor clears pending packets
  • re-export TIMEOUT from link module for reuse

when any of these conditions are detected, an error is propagated to trigger packet clearing, then reconnection is initiated to restore event subscription.
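the watchdog and termination cases can be illustrated with a std-only analogy. this sketch swaps in std::sync::mpsc and `recv_timeout` for the tokio stream and timer, so it is not the code in this patch. `SourceError` and `next_batch` are hypothetical; only the variant names StreamTerminated and WatchdogTimeout are taken from the change list above, and the timeout is passed in rather than read from link::TIMEOUT.

```rust
use std::sync::mpsc::{Receiver, RecvTimeoutError};
use std::time::Duration;

/// Hypothetical error type; variant names mirror the ones listed in the PR.
#[derive(Debug, PartialEq)]
enum SourceError {
    /// The sending side went away without reporting an error.
    StreamTerminated,
    /// Nothing arrived within the watchdog interval.
    WatchdogTimeout,
}

/// Wait for the next batch, but give up if the channel is closed or if
/// nothing arrives within `watchdog`. Both cases surface as errors so the
/// caller can clear pending packets and then reconnect.
fn next_batch(rx: &Receiver<u64>, watchdog: Duration) -> Result<u64, SourceError> {
    match rx.recv_timeout(watchdog) {
        Ok(batch) => Ok(batch),
        Err(RecvTimeoutError::Timeout) => Err(SourceError::WatchdogTimeout),
        Err(RecvTimeoutError::Disconnected) => Err(SourceError::StreamTerminated),
    }
}
```

a caller would loop on `next_batch` and treat either error as a signal to propagate the failure first and reconnect second, matching the ordering described above.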

Related Issues

potentially related to stuck packet issues between chains (e.g. cosmos hub <-> penumbra) where websocket subscriptions silently fail without triggering reconnection or packet clearing.

Test Plan

  • cargo check -p ibc-relayer passes
  • cargo test -p ibc-relayer --lib passes (72 tests)
  • manual testing on affected channels recommended

the tokio::select! in the websocket event source run_loop only matched on
Some(_) in both branches, meaning if the subscription stream silently
terminated (returned None without an error), that branch's pattern would
not match, nothing handled the termination, and the loop could hang
indefinitely. this caused packets to be missed when the websocket
connection became stale without triggering an error.

this patch:
- explicitly handles None from batches.next() to detect stream termination
- explicitly handles None from rx_err.recv() to detect driver channel closure
- adds watchdog timeout using existing link::TIMEOUT (5 min) to detect stale
  connections that appear connected but are not receiving events
- adds new error variants (StreamTerminated, WatchdogTimeout, DriverChannelClosed)
- propagates errors before reconnection so supervisor clears pending packets
- re-exports TIMEOUT from link module for use in websocket event source

when any of these conditions are detected, an error is propagated to trigger
packet clearing, then reconnection is initiated to restore event subscription.