@jackyzha0 jackyzha0 commented Mar 18, 2025

Why

There is a deadlock in the server-side reconnect case:

  1. establishing the handshake acquires the session lock: https://github.com/replit/river-python/blob/main/src/replit_river/server_transport.py#L231
  2. if there is a session mismatch, we close the old session while still holding that lock: https://github.com/replit/river-python/blob/main/src/replit_river/server_transport.py#L272
  3. session.close calls _close_session_callback (https://github.com/replit/river-python/blob/main/src/replit_river/session.py#L562), which is bound to transport._delete_session
  4. _delete_session also tries to acquire the session lock (https://github.com/replit/river-python/blob/main/src/replit_river/transport.py#L44), which is still held from step 1, so it waits forever
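
The four steps above can be reproduced in miniature. This is only a sketch, assuming the session lock is a plain `asyncio.Lock` (which is not reentrant); the `Transport` and `Session` classes here are simplified stand-ins, not the real river-python classes, and `reconnect_deadlocks` is a hypothetical helper that uses a timeout to detect the hang.

```python
import asyncio
from typing import Awaitable, Callable, Dict


class Session:
    """Simplified stand-in for a session (hypothetical, not the real class)."""

    def __init__(
        self,
        session_id: str,
        close_session_callback: Callable[["Session"], Awaitable[None]],
    ) -> None:
        self.session_id = session_id
        self._close_session_callback = close_session_callback

    async def close(self) -> None:
        # Step 3: close() calls back into the transport.
        await self._close_session_callback(self)


class Transport:
    """Simplified stand-in for the server transport (hypothetical)."""

    def __init__(self) -> None:
        # Assumption: the session lock is an asyncio.Lock, which is not reentrant.
        self._session_lock = asyncio.Lock()
        self._sessions: Dict[str, Session] = {}

    async def _delete_session(self, session: Session) -> None:
        # Step 4: tries to take the lock that the handshake already holds.
        async with self._session_lock:
            self._sessions.pop(session.session_id, None)

    async def establish_handshake(self, old_session: Session) -> None:
        # Step 1: the handshake takes the session lock...
        async with self._session_lock:
            # Step 2: ...and closes the mismatched session while holding it.
            await old_session.close()


async def reconnect_deadlocks(timeout: float = 0.2) -> bool:
    """Return True if the handshake fails to finish within `timeout`."""
    transport = Transport()
    old = Session("s1", transport._delete_session)
    transport._sessions["s1"] = old
    try:
        await asyncio.wait_for(transport.establish_handshake(old), timeout)
    except asyncio.TimeoutError:
        return True
    return False
```

Running `asyncio.run(reconnect_deadlocks())` should report the hang, since the inner `async with` can never acquire a lock that the same task is already holding.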

What changed

Split up the critical sections in the server transport, similar to the client side:

  1. make _get_session acquire the lock itself; we don't need to hold the lock while creating the session, only once we call _set_session
  2. closing a session also acquires the lock on its own
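
A minimal sketch of what the split critical sections could look like, assuming an `asyncio.Lock` for the session lock. The `ServerTransport` and `Session` classes, the `handshake` method, the `instance_id` mismatch check, and `reconnect_demo` are all hypothetical simplifications for illustration, not the actual code in this PR.

```python
import asyncio
from typing import Awaitable, Callable, Dict, Optional


class Session:
    """Simplified stand-in for a session (hypothetical)."""

    def __init__(
        self,
        session_id: str,
        instance_id: str,
        close_session_callback: Callable[["Session"], Awaitable[None]],
    ) -> None:
        self.session_id = session_id
        self.instance_id = instance_id
        self.closed = False
        self._close_session_callback = close_session_callback

    async def close(self) -> None:
        self.closed = True
        await self._close_session_callback(self)


class ServerTransport:
    """Simplified stand-in for the server transport (hypothetical)."""

    def __init__(self) -> None:
        self._session_lock = asyncio.Lock()
        self._sessions: Dict[str, Session] = {}

    async def _get_session(self, session_id: str) -> Optional[Session]:
        # The lock is held only for the lookup.
        async with self._session_lock:
            return self._sessions.get(session_id)

    async def _delete_session(self, session: Session) -> None:
        # Closing a session acquires the lock on its own.
        async with self._session_lock:
            if self._sessions.get(session.session_id) is session:
                del self._sessions[session.session_id]

    async def handshake(self, session_id: str, instance_id: str) -> Session:
        old = await self._get_session(session_id)
        if old is not None and old.instance_id == instance_id:
            return old
        if old is not None:
            # No lock is held here, so the close callback can take it.
            await old.close()
        # Creating the session does not need the lock.
        new_session = Session(session_id, instance_id, self._delete_session)
        async with self._session_lock:
            self._sessions[session_id] = new_session
        return new_session


async def reconnect_demo() -> tuple:
    transport = ServerTransport()
    old = await asyncio.wait_for(transport.handshake("s1", "instance-a"), 1.0)
    new = await asyncio.wait_for(transport.handshake("s1", "instance-b"), 1.0)
    current = await transport._get_session("s1")
    return old.closed, current is new
```

Because no lock is held across `old.close()`, the reconnect in `reconnect_demo` completes instead of hanging. The trade-off is that the lookup and the final store are no longer one atomic step, so two concurrent handshakes for the same id can interleave.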

Test plan

added a test

Notes

I have a draft of a more in-depth fix on this branch. It uses a lock-ownership-transfer approach that should catch this class of bug more generically, but I ran into ownership problems lol

Since we are planning to migrate the chat service to Node anyway, so that we only have one River server implementation, we hopefully won't have to maintain this surface for much longer 🤞

@jackyzha0 jackyzha0 force-pushed the jackyzha0/patch-reconnect-deadlock branch from 5cd0d32 to 0d6f2c0 Compare March 18, 2025 04:47
@jackyzha0 jackyzha0 marked this pull request as ready for review March 18, 2025 04:58
@jackyzha0 jackyzha0 requested a review from a team as a code owner March 18, 2025 04:58
@jackyzha0 jackyzha0 requested review from blast-hardcheese and lhchavez and removed request for a team March 18, 2025 04:58
Comment on lines +157 to 158
async with self._session_lock:
    self._set_session(new_session)
Collaborator

for safety: one property the old code had that this one doesn't is protection against two concurrent calls to this function (with the same id?). With the new code, two such calls could each close the old session, so it gets closed twice, and one of the two new sessions would be leaked!

to prevent that, we can rename _set_session to _set_session_locked (and add a docstring saying that it needs the session_lock to be held), and have _set_session_locked detect this mid-air collision and ... do something with the preexisting session (close it???)

with that out of the way, this relaxing of the lock's critical sections seems to be safe, because a Session by itself is safe to create twice (as long as we close it, otherwise we leak tasks!!!)
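
One way the suggestion could look, as a rough sketch only. It assumes the session lock is an `asyncio.Lock` and picks "close the displaced session" as the collision policy, which is just one of the options floated above. `install_session`, `collision_demo`, and the simplified `Session` and `ServerTransport` classes are hypothetical names for illustration.

```python
import asyncio
from typing import Dict, Optional


class Session:
    """Simplified stand-in for a session (hypothetical)."""

    def __init__(self, session_id: str) -> None:
        self.session_id = session_id
        self.closed = False

    async def close(self) -> None:
        self.closed = True


class ServerTransport:
    """Simplified stand-in for the server transport (hypothetical)."""

    def __init__(self) -> None:
        self._session_lock = asyncio.Lock()
        self._sessions: Dict[str, Session] = {}

    def _set_session_locked(self, new_session: Session) -> Optional[Session]:
        """Store new_session. The caller must hold self._session_lock.

        Returns the session that was displaced, if any, so the caller can
        dispose of it instead of silently leaking it.
        """
        assert self._session_lock.locked(), "session lock must be held"
        displaced = self._sessions.get(new_session.session_id)
        self._sessions[new_session.session_id] = new_session
        return displaced if displaced is not new_session else None

    async def install_session(self, new_session: Session) -> None:
        async with self._session_lock:
            displaced = self._set_session_locked(new_session)
        if displaced is not None:
            # Close outside the lock, so a close callback that takes the
            # lock itself cannot deadlock.
            await displaced.close()


async def collision_demo() -> tuple:
    transport = ServerTransport()
    a, b = Session("s1"), Session("s1")
    # Two concurrent installs for the same id: a mid-air collision.
    await asyncio.gather(
        transport.install_session(a), transport.install_session(b)
    )
    survivor = transport._sessions["s1"]
    return survivor.closed, sorted([a.closed, b.closed])
```

In `collision_demo`, whichever session loses the race is closed rather than leaked, and the one left in the map stays open. Whether closing is the right policy for the real transport is an open question here.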

# If the session status is mismatched, we should close the old session
# and let the retry logic to create a new session.
await old_session.close()
await self._delete_session(old_session)
Collaborator

lol dupes

@blast-hardcheese
Contributor

@jackyzha0 I applied this delta on top of #147, available at jacky-deadlock-patch, so we can keep moving

@jackyzha0 jackyzha0 closed this Mar 19, 2025