
Conversation

@zhijun42 (Contributor) commented Nov 6, 2025

Summary

This PR handles a network race condition that can cause a replica to read stale cluster state and then incorrectly promote itself to an empty primary within an existing shard.

Issues

When multiple replicas attempt to synchronize from the same primary concurrently, the first replica to issue PSYNC triggers an RDB child fork (RDB_CHILD_TYPE_SOCKET) for replication.

Any replica that connects shortly after the fork misses the join window: its PING command is accepted but not replied to until the primary's fork completes. As a result, those replicas remain blocked in REPL_STATE_RECEIVE_PING_REPLY on their main thread, unable to process any cluster or client traffic, effectively dead to the outside world.
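
To picture why the replica is unresponsive, here is a minimal, self-contained sketch. The names (syncReadPingReply, REPL_SYNCIO_TIMEOUT_SEC) are hypothetical; this is not the actual replication.c code, only an illustration of a synchronous read bounded by repl_syncio_timeout blocking the main thread.

```c
/* Minimal sketch (hypothetical names, not the actual replication.c code):
 * while in REPL_STATE_RECEIVE_PING_REPLY the main thread sits in a
 * synchronous read, so no cluster bus or client traffic is processed
 * until the reply arrives or repl_syncio_timeout (5s by default) expires. */
#include <stdio.h>
#include <unistd.h>

#define REPL_SYNCIO_TIMEOUT_SEC 5 /* default server.repl_syncio_timeout */

/* Stand-in for a blocking read on the primary link; here it simply sleeps
 * to simulate a primary that is busy with its RDB fork and never replies. */
static int syncReadPingReply(int timeout_sec) {
    sleep((unsigned int)timeout_sec);
    return -1; /* timed out */
}

int main(void) {
    /* The replica is "dead" to the outside world for this whole call. */
    if (syncReadPingReply(REPL_SYNCIO_TIMEOUT_SEC) == -1) {
        /* Only now does the replica resume its event loop and reconnect,
         * which is where the failover sequence below picks up. */
        printf("PING reply timed out, resuming event loop\n");
    }
    return 0;
}
```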

If a failover occurs during this period (e.g., the primary goes down and another replica is elected leader), the blocked replica will:

  1. Time out after a while (server.repl_syncio_timeout is 5s by default), resume its main thread, and reconnect to other nodes.
  2. Receive a fresh PONG reply from the new leader.
  3. Then receive delayed, stale PING messages that the new leader buffered before the failover (they didn't reach the replica earlier because the replica was "dead").
  4. Misinterpret that stale message as the current truth, conclude it is now a sub-replica, decide to follow the old primary again, and, after detecting that primary as FAILED, start and win a new election.

This results in two primaries in the same shard, with one being empty (0 slots) but still considered authoritative by all cluster servers.

For a concrete example, see the test case test_blocked_replica_stale_state_race and its comments in tests/unit/cluster/replica-migration.tcl.

Overall, there are two issues:

  • Replicas waiting for a replication reply block on the main thread and stop all activity. This is true even with rdb-key-save-delay = 0, because the underlying cause is replication enrollment timing, not an artificial delay.
  • A blocked replica can go through the events above and become an empty primary.

This PR handles the second issue. Note that this issue is flaky: if event [3] happens before [2] (inbound and outbound links are independent, so the ordering cannot be guaranteed), the replica won't become a sub-replica and the empty-primary problem won't occur.

I added guardrail logic for when such a race condition does happen, and I also wrote a new test case for it, but since the issue can't be reliably reproduced (it passes on my machine, but not on CI), the test case is disabled for now.

Fix

There could be different approaches to solving this problem. The approach taken in this PR is:

  • A replica shouldn't trigger a failover when its replication offset is 0.
  • When a replica's primary has failed and it receives a PING/PONG packet from a sender claiming to be the primary of the same shard, the replica should accept and follow that sender (see the sketch after this list).
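
To make the guardrails concrete, here is a minimal sketch. The helper names (replicaMayStartFailover, shouldFollowClaimedPrimary) are hypothetical; this is not the actual cluster_legacy.c diff, only an illustration of the two conditions described above.

```c
/* Hypothetical sketch of the two guardrails; names and call sites are
 * illustrative, not the actual cluster_legacy.c changes. */
#include <stdio.h>

/* Guardrail 1: a replica that has replicated nothing (offset 0) has no
 * data to serve, so promoting it would only create an empty primary.
 * It must not start a failover election. */
static int replicaMayStartFailover(long long repl_offset) {
    return repl_offset != 0;
}

/* Guardrail 2: if this replica's primary is flagged FAIL and an incoming
 * PING/PONG comes from a node claiming to be a primary in the same shard,
 * follow that sender instead of re-following the failed primary. */
static int shouldFollowClaimedPrimary(int my_primary_failed,
                                      int sender_is_primary,
                                      int sender_in_my_shard) {
    return my_primary_failed && sender_is_primary && sender_in_my_shard;
}

int main(void) {
    printf("empty replica may fail over: %d\n", replicaMayStartFailover(0));
    printf("follow claimed primary: %d\n",
           shouldFollowClaimedPrimary(1, 1, 1));
    return 0;
}
```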

The file tests/unit/cluster/replica-migration.tcl mostly tests the logic in clusterUpdateSlotsConfigWith, so I added my new test case there. Since the test file contained some duplicated code, I extracted it into helper functions to simplify the tests.

@codecov codecov bot commented Nov 6, 2025

Codecov Report

❌ Patch coverage is 93.93939% with 2 lines in your changes missing coverage. Please review.
✅ Project coverage is 72.24%. Comparing base (7043c0f) to head (0a0b948).
⚠️ Report is 21 commits behind head on unstable.

Files with missing lines Patch % Lines
src/cluster_legacy.c 93.75% 2 Missing ⚠️
Additional details and impacted files
@@             Coverage Diff              @@
##           unstable    #2811      +/-   ##
============================================
- Coverage     72.49%   72.24%   -0.26%     
============================================
  Files           128      128              
  Lines         71624    70282    -1342     
============================================
- Hits          51927    50777    -1150     
+ Misses        19697    19505     -192     
Files with missing lines Coverage Δ
src/replication.c 86.05% <ø> (+0.20%) ⬆️
src/socket.c 94.21% <100.00%> (-0.04%) ⬇️
src/cluster_legacy.c 87.43% <93.75%> (-0.16%) ⬇️

... and 102 files with indirect coverage changes

