@JimmyAustin
Contributor

Why

We've had persistent timeout errors in AI-Infra, and I suspect they're caused by not handling connection interruptions correctly.

What changed

  • Previously, WebSocket drops or send failures left _streams populated, so any in-flight RPC hung until the session fully shut down. Clients never saw an abort signal and could block indefinitely even though the transport was already defunct.
  • Added _abort_all_streams() in src/replit_river/session.py#L289 and called it from both client_session.serve() and server_session.serve() on ConnectionClosed, FailedSendingMessageException, or any other unexpected exception (src/replit_river/client_session.py#L95, src/replit_river/server_session.py#L82). This immediately closes every active channel and clears _streams, so callers are notified as soon as the socket dies and can retry or surface an error.
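
The pattern described above can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: `SketchSession`, `ConnectionLost`, `open_stream`, `call_rpc`, `broken_transport`, and the sentinel-on-a-queue approach are all assumptions made for the example. Only the names `_streams`, `_abort_all_streams()`, and `serve()` come from the description.

```python
import asyncio


class ConnectionLost(Exception):
    """Hypothetical stand-in for ConnectionClosed / FailedSendingMessageException."""


# Sentinel pushed to every stream when the transport dies (an assumption of this sketch).
_CLOSED = object()


class SketchSession:
    """Hypothetical session; only _streams, _abort_all_streams, and serve mirror the PR's names."""

    def __init__(self) -> None:
        # stream id -> queue that an in-flight RPC is waiting on
        self._streams: dict[str, asyncio.Queue] = {}

    def open_stream(self, stream_id: str) -> asyncio.Queue:
        queue: asyncio.Queue = asyncio.Queue()
        self._streams[stream_id] = queue
        return queue

    def _abort_all_streams(self) -> None:
        # Wake every waiter with the sentinel, then drop all stream bookkeeping.
        for queue in self._streams.values():
            queue.put_nowait(_CLOSED)
        self._streams.clear()

    async def serve(self, transport) -> None:
        try:
            await transport()
        except Exception:
            # Covers connection loss, send failures, and any other unexpected error.
            self._abort_all_streams()


async def call_rpc(queue: asyncio.Queue) -> str:
    # Without the abort, this await would block until the session shut down.
    item = await queue.get()
    return "aborted" if item is _CLOSED else "ok"


async def broken_transport() -> None:
    raise ConnectionLost("socket dropped")


async def demo() -> tuple[str, int]:
    session = SketchSession()
    queue = session.open_stream("rpc-1")
    caller = asyncio.create_task(call_rpc(queue))
    await session.serve(broken_transport)
    result = await asyncio.wait_for(caller, timeout=1.0)
    return result, len(session._streams)
```

Calling `asyncio.run(demo())` exercises the failure path: the waiting caller is released as soon as the transport fails, and the stream table is empty afterwards.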

Test plan

CI/CD. Also ran against an internal branch three times with no issues and no flakes.

@JimmyAustin JimmyAustin marked this pull request as ready for review October 29, 2025 03:05
@JimmyAustin JimmyAustin requested a review from a team as a code owner October 29, 2025 03:05
@JimmyAustin JimmyAustin requested review from jackyzha0 and removed request for a team October 29, 2025 03:05
Contributor

@ryantm ryantm left a comment

This appears to fix the regression test I made that was consistently hanging!

@JimmyAustin JimmyAustin merged commit c837786 into main Oct 29, 2025
5 checks passed
@JimmyAustin JimmyAustin deleted the 20251028-TestAbortingStreams branch October 29, 2025 18:39
4 participants