Skip to content

Commit 88dcf0c

Browse files
committed
Greatly reduce race condition on .MakeMaster() calls at low-latency.
This was breaking integration tests but affects all zero-latency server situations. There's a good description in code, but the basics are a race between a broadcast (which goes first) and the post-SLAVEOF reconfig (which loses) creates a situation where the proper topology isn't seen until the next reconfig time span pass. We now prevent that broadcast (we're broadcasting to ourselves) from running on ourselves because we're about to run another one anyway. This doesn't 100% eliminate the race because there's a minute chance of a pub/sub landing between the 2 if blocks here in an external Interlock. But the chance is crazy small now, at least. Note: the logging of *why* a reconfig didn't run is now enhanced as well. We log what was trying to run and what was blocking it in the output for much easier debugging next time.
1 parent e11171d commit 88dcf0c

File tree

1 file changed

+14
-1
lines changed

1 file changed

+14
-1
lines changed

StackExchange.Redis/StackExchange/Redis/ConnectionMultiplexer.cs

Lines changed: 14 additions & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -369,6 +369,13 @@ internal void MakeMaster(ServerEndPoint server, ReplicationChangeOptions options
369369
server.QueueDirectFireAndForget(msg, ResultProcessor.DemandOK);
370370
}
371371

372+
// There's an inherent race here in zero-lantency environments (e.g. when Redis is on localhost) when a broadcast is specified
373+
// The broadast can get back from redis and trigger a reconfigure before we get a chance to get to ReconfigureAsync() below
374+
// This results in running an outdated reconfig and the .CompareExchange() (due to already running a reconfig) failing...making our needed reconfig a no-op.
375+
// If we don't block *that* run, then *our* run (at low latency) gets blocked. Then we're waiting on the
376+
// ConfigurationOptions.ConfigCheckSeconds interval to identify the current (created by this method call) topology correctly.
377+
var blockingReconfig = Interlocked.CompareExchange(ref activeConfigCause, "Block: Pending Master Reconfig", null) == null;
378+
372379
// try and broadcast this everywhere, to catch the maximum audience
373380
if ((options & ReplicationChangeOptions.Broadcast) != 0 && ConfigurationChangedChannel != null
374381
&& CommandMap.IsAvailable(RedisCommand.PUBLISH))
@@ -397,6 +404,12 @@ internal void MakeMaster(ServerEndPoint server, ReplicationChangeOptions options
397404

398405
// and reconfigure the muxer
399406
LogLocked(log, "Reconfiguring all endpoints...");
407+
// Yes, there is a tiny latency race possible between this code and the next call, but it's far more minute than before.
408+
// The effective gap between 0 and > 0 (likely off-box) latency is something that may never get hit here by anyone.
409+
if (blockingReconfig)
410+
{
411+
Interlocked.Exchange(ref activeConfigCause, null);
412+
}
400413
if (!ReconfigureAsync(false, true, log, srv.EndPoint, "make master").ObserveErrors().Wait(5000))
401414
{
402415
LogLocked(log, "Verifying the configuration was incomplete; please verify");
@@ -1216,7 +1229,7 @@ internal async Task<bool> ReconfigureAsync(bool first, bool reconfigureAll, Text
12161229

12171230
if (!ranThisCall)
12181231
{
1219-
LogLocked(log, "Reconfiguration was already in progress");
1232+
LogLocked(log, "Reconfiguration was already in progress due to: " + activeConfigCause + ", attempted to run for: " + cause);
12201233
return false;
12211234
}
12221235
Trace("Starting reconfiguration...");

0 commit comments

Comments
 (0)