Skip to content

Fix Redis to reconnect in Sentinel (Chris Staite)#2190

Merged
MarcusSorealheis merged 11 commits intomainfrom
redis_reconnect
Mar 6, 2026
Merged

Fix Redis to reconnect in Sentinel (Chris Staite)#2190
MarcusSorealheis merged 11 commits intomainfrom
redis_reconnect

Conversation

@MarcusSorealheis
Copy link
Collaborator

@MarcusSorealheis MarcusSorealheis commented Feb 26, 2026

Description

The switch from Fred to Redis-rs caused auto-reconnect in Sentinel to not occur when the primary fails over. Instead, we have to manually take care of detecting a ReadOnly error and performing the re-discovery of the primary and re-connecting.

Re-work the Redis store to catch the ReadOnly error in update_data and reconnect. Take a UUID for each connection to ensure we don't spam reconnect attempts from multiple tasks at the same time. Switch the pubsub over to using the ConnectionManager and manage the psubscribe there.

Type of change

Please delete options that aren't relevant.

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)

How Has This Been Tested?

  • bazel test //...
  • redis_store_tester with src/bin/docker-compose.store-tester.yaml and knocking out the primary node

Checklist

  • Updated documentation if needed
  • Tests added/amended
  • bazel test //... passes locally
  • PR is contained in a single commit, using git amend see some docs

This change is Reviewable

@MarcusSorealheis
Copy link
Collaborator Author

We clearly need to roll something back or fix some test that I broke. My apologies @chrisstaite

chrisstaite-menlo and others added 5 commits March 2, 2026 16:00
The switch from Fred to Redis-rs caused auto-reconnect in Sentinel to not occur when the primary fails over.  Instead, we have to manually take care of detecting a ReadOnly error and performing the re-discovery of the primary and re-connecting.

Re-work the Redis store to catch the ReadOnly error in update_data and reconnect.  Take a UUID for each connection to ensure we don't spam reconnect attempts from multiple tasks at the same time.  Switch the pubsub over to using the ConnectionManager and manage the psubscribe there.
@palfrey
Copy link
Member

palfrey commented Mar 2, 2026

@chrisstaite-menlo FYI, this starts from your work and then adds in some extra testing I've built that demos the Sentinel fallover case. My test calls into update which hasn't got the reconnect logic in there yet, but looking at that now.

@MarcusSorealheis
Copy link
Collaborator Author

@chrisstaite-menlo finally, this looks like it works. It's never worked fully until now. Next up, several hot path optimizations.

Copy link
Collaborator

@chrisstaite-menlo chrisstaite-menlo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Awesome jobs with the tests, thanks.

@chrisstaite-menlo made 2 comments.
Reviewable status: 0 of 1 LGTMs obtained, and 0 of 16 files reviewed.


nativelink-store/src/redis_store.rs line 266 at r11 (raw file):

        };
        for subscription in subscriptions {
            connection_manager.psubscribe(&subscription).await?;

One subscription failing can cause issues with half reconnected. Probably should have better error handling here.

Copy link
Collaborator

@chrisstaite-menlo chrisstaite-menlo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@chrisstaite-menlo reviewed 16 files and all commit messages.
Reviewable status: 0 of 1 LGTMs obtained, and all files reviewed.

@MarcusSorealheis
Copy link
Collaborator Author

Thank your for the review brother @chrisstaite-menlo

@MarcusSorealheis MarcusSorealheis merged commit 8783134 into main Mar 6, 2026
30 of 31 checks passed
@MarcusSorealheis MarcusSorealheis deleted the redis_reconnect branch March 6, 2026 08:39
@MarcusSorealheis
Copy link
Collaborator Author

I'm going to cut a new release

@MarcusSorealheis
Copy link
Collaborator Author

One subscription failing can cause issues with half reconnected. Probably should have better error handling here.

I'll try to fix this on the weekend because we have a Chrome customers that need the PR, though that shouldn't break anyone.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants