Cluster: Fix sub-replica and prevent empty primary when replica is blocked at replication and has inbound/outbound link race #2811
Summary
This PR handles a network race condition that would cause a replica to read stale cluster state and then incorrectly promote itself to an empty primary within an existing shard.
Issues
When multiple replicas attempt to synchronize from the same primary concurrently, the first replica to issue `PSYNC` triggers an RDB child fork (`RDB_CHILD_TYPE_SOCKET`) for replication. Any replica connecting shortly after the fork misses the join window: its `PING` command is accepted but not replied to until the primary's fork completes. Such replicas remain blocked in `REPL_STATE_RECEIVE_PING_REPLY` on their main thread, unable to process any cluster or client traffic, effectively dead to the outside world. If a failover occurs during this period (e.g., the primary goes down and another replica is elected leader), the blocked replica will:

- miss the failover entirely and retain its stale view of the cluster state, and
- eventually time out (`server.repl_syncio_timeout` is 5s by default), resume its main thread, and reconnect with the other nodes.

This results in two primaries in the same shard, one of them empty (serving 0 slots) but still considered authoritative by all cluster nodes.
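To make the blocking concrete, here is a minimal sketch (not the actual valkey source; the function name and signature are invented for illustration) of a synchronous read with a deadline, the kind of call that pins the replica's main thread while it waits for the `+PONG` reply:

```c
/* Illustrative sketch only, not the real replication handshake code.
 * While the replica waits here for the +PONG reply, its single main
 * thread cannot service cluster-bus or client events, so it misses any
 * failover that happens in the meantime. */
#include <poll.h>
#include <unistd.h>

/* Returns bytes read, 0 on timeout, -1 on error. timeout_ms plays the
 * role of server.repl_syncio_timeout (5s by default). */
static ssize_t sync_read_reply(int fd, char *buf, size_t len, int timeout_ms) {
    struct pollfd pfd = {.fd = fd, .events = POLLIN};
    int ready = poll(&pfd, 1, timeout_ms); /* main thread blocks here */
    if (ready <= 0) return ready;          /* 0: timed out, -1: error  */
    return read(fd, buf, len);
}
```

Because the wait happens on the main thread rather than through the event loop, any cluster-bus messages that arrive in the meantime sit unprocessed until the call returns or times out.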
For a concrete example, read the test case `test_blocked_replica_stale_state_race` and its comment in the file `tests/unit/cluster/replica-migration.tcl`. Overall there are two issues:

1. The replica's main thread can block during the replication handshake, leaving it unable to react to anything happening in the cluster.
2. Once the replica resumes with stale cluster state, a race between its inbound and outbound cluster links can turn it into a sub-replica and leave an empty primary in the shard.
This PR handles the second issue. Note that this issue is flaky: if event [3] happens before event [2] (the inbound and outbound links are independent, so we cannot guarantee their ordering), the replica never becomes a sub-replica and the empty-primary problem does not occur.
I added guardrail logic for the case where this race does happen, along with a new test case exercising it. Since the issue cannot be reliably reproduced (it runs fine on my machine, but not on CI), the test case is disabled for now.
Fix
There could be different approaches to solving the problem. The way I do it is to add guardrail logic in `clusterUpdateSlotsConfigWith` so that a replica resuming with stale state reattaches to the shard's real primary instead of becoming a sub-replica or promoting itself to an empty primary; a rough sketch of the idea follows.
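The sketch below uses hypothetical structures and helper names (`clusterNode` is simplified and `topLevelPrimary`/`fixStalePrimary` are invented for illustration; the exact patch differs):

```c
/* Hypothetical sketch of the guardrail; types and names are stand-ins,
 * not the actual valkey cluster structures. */
#include <stddef.h>

typedef struct clusterNode {
    struct clusterNode *replicaof; /* NULL if the node is a primary */
    int numslots;                  /* slots this node claims to serve */
} clusterNode;

/* Follow a (possibly stale) replication chain to its top-level primary.
 * The hop limit guards against transient cycles in gossip state. */
static clusterNode *topLevelPrimary(clusterNode *node) {
    for (int hops = 0; node->replicaof != NULL && hops < 16; hops++)
        node = node->replicaof;
    return node;
}

/* Guardrail: if our recorded primary is itself a replica now, reattach
 * to the top of the chain (when it still owns slots) instead of becoming
 * a sub-replica, the state that later degenerates into an empty primary. */
static void fixStalePrimary(clusterNode **my_primary) {
    clusterNode *top = topLevelPrimary(*my_primary);
    if (top != *my_primary && top->numslots > 0) *my_primary = top;
}
```

The key design point in this sketch is that the replica never follows a chain that ends at a slotless node; in that case it keeps its current primary and waits for fresher gossip.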
The file `tests/unit/cluster/replica-migration.tcl` is mostly meant to test the logic in the function `clusterUpdateSlotsConfigWith`, so I add my new test case there. And since the test file had some duplicated code, I extracted it into helper functions to simplify the tests.