
Conversation

@zhijun42 (Contributor) commented Nov 6, 2025

Summary

This PR handles a network race condition that can cause a replica to read stale cluster state and then incorrectly promote itself to an empty primary within an existing shard.

Issues

When multiple replicas attempt to synchronize from the same primary concurrently, the first replica to issue PSYNC triggers an RDB child fork (RDB_CHILD_TYPE_SOCKET) for replication.

Any replica that connects shortly after the fork misses the join window: its PING command is accepted but not replied to until the primary's fork completes. As a result, those replicas remain blocked in REPL_STATE_RECEIVE_PING_REPLY on their main thread, unable to process any cluster or client traffic, effectively dead to the outside world.
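
To picture why the replica is unresponsive, here is a minimal, self-contained sketch. The names (syncReadPingReply, REPL_SYNCIO_TIMEOUT_SEC) are hypothetical; this is not the actual replication.c code, only an illustration of a synchronous read bounded by repl_syncio_timeout blocking the main thread.

```c
/* Minimal sketch (hypothetical names, not the actual replication.c code):
 * while in REPL_STATE_RECEIVE_PING_REPLY the main thread sits in a
 * synchronous read, so no cluster bus or client traffic is processed
 * until the reply arrives or repl_syncio_timeout (5s by default) expires. */
#include <stdio.h>
#include <unistd.h>

#define REPL_SYNCIO_TIMEOUT_SEC 5 /* default server.repl_syncio_timeout */

/* Stand-in for a blocking read on the primary link; here it simply sleeps
 * to simulate a primary that is busy with its RDB fork and never replies. */
static int syncReadPingReply(int timeout_sec) {
    sleep((unsigned int)timeout_sec);
    return -1; /* timed out */
}

int main(void) {
    /* The replica is "dead" to the outside world for this whole call. */
    if (syncReadPingReply(REPL_SYNCIO_TIMEOUT_SEC) == -1) {
        /* Only now does the replica resume its event loop and reconnect,
         * which is where the failover sequence below picks up. */
        printf("PING reply timed out, resuming event loop\n");
    }
    return 0;
}
```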

If a failover occurs during this period (e.g., the primary goes down and another replica is elected leader), the blocked replica will:

  1. Time out after a while (server.repl_syncio_timeout is 5s by default), resume its main thread, and reconnect to other nodes.
  2. Receive a fresh PONG reply from the new leader.
  3. Then receive delayed, stale PING messages that the new leader buffered before the failover (they didn't reach the replica earlier because the replica was "dead").
  4. Misinterpret that stale message as the current truth, conclude it is now a sub-replica, decide to follow the old primary again, and, after detecting that primary as FAILED, start and win a new election.

This results in two primaries in the same shard, with one being empty (0 slots) but still considered authoritative by all cluster servers.

For a concrete example, see the test case test_blocked_replica_stale_state_race and its comments in tests/unit/cluster/replica-migration.tcl.

Overall, there are two issues:

  • Replicas waiting for a replication reply block on the main thread and stop all activity. This is true even with rdb-key-save-delay = 0, because the underlying cause is replication enrollment timing, not an artificial delay.
  • A blocked replica can go through the events above and become an empty primary.

This PR handles the second issue. Note that this issue is flaky: if event [3] happens before [2] (inbound and outbound links are independent, so the ordering cannot be guaranteed), the replica won't become a sub-replica and the empty-primary problem won't occur.

I added guardrail logic for when such a race condition does happen, and I also wrote a new test case for it, but since the issue can't be reliably reproduced (it passes on my machine, but not on CI), the test case is disabled for now.

Fix

There could be different approaches to solving this problem. The approach taken in this PR is:

  • A replica shouldn't trigger a failover when its replication offset is 0.
  • When a replica's primary has failed and it receives a PING/PONG packet from a sender claiming to be the primary of the same shard, the replica should accept and follow that sender (see the sketch after this list).
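
To make the guardrails concrete, here is a minimal sketch. The helper names (replicaMayStartFailover, shouldFollowClaimedPrimary) are hypothetical; this is not the actual cluster_legacy.c diff, only an illustration of the two conditions described above.

```c
/* Hypothetical sketch of the two guardrails; names and call sites are
 * illustrative, not the actual cluster_legacy.c changes. */
#include <stdio.h>

/* Guardrail 1: a replica that has replicated nothing (offset 0) has no
 * data to serve, so promoting it would only create an empty primary.
 * It must not start a failover election. */
static int replicaMayStartFailover(long long repl_offset) {
    return repl_offset != 0;
}

/* Guardrail 2: if this replica's primary is flagged FAIL and an incoming
 * PING/PONG comes from a node claiming to be a primary in the same shard,
 * follow that sender instead of re-following the failed primary. */
static int shouldFollowClaimedPrimary(int my_primary_failed,
                                      int sender_is_primary,
                                      int sender_in_my_shard) {
    return my_primary_failed && sender_is_primary && sender_in_my_shard;
}

int main(void) {
    printf("empty replica may fail over: %d\n", replicaMayStartFailover(0));
    printf("follow claimed primary: %d\n",
           shouldFollowClaimedPrimary(1, 1, 1));
    return 0;
}
```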

The file tests/unit/cluster/replica-migration.tcl mostly tests the logic in clusterUpdateSlotsConfigWith, so I added my new test case there. Since the test file contained some duplicated code, I extracted it into helper functions to simplify the tests.

@codecov codecov bot commented Nov 6, 2025

Codecov Report

❌ Patch coverage is 93.93939% with 2 lines in your changes missing coverage. Please review.
✅ Project coverage is 72.24%. Comparing base (7043c0f) to head (0a0b948).
⚠️ Report is 21 commits behind head on unstable.

Files with missing lines Patch % Lines
src/cluster_legacy.c 93.75% 2 Missing ⚠️
Additional details and impacted files
@@             Coverage Diff              @@
##           unstable    #2811      +/-   ##
============================================
- Coverage     72.49%   72.24%   -0.26%     
============================================
  Files           128      128              
  Lines         71624    70282    -1342     
============================================
- Hits          51927    50777    -1150     
+ Misses        19697    19505     -192     
Files with missing lines Coverage Δ
src/replication.c 86.05% <ø> (+0.20%) ⬆️
src/socket.c 94.21% <100.00%> (-0.04%) ⬇️
src/cluster_legacy.c 87.43% <93.75%> (-0.16%) ⬇️

... and 102 files with indirect coverage changes

