Skip to content

Commit 54a633f

Browse files
authored
Feature: Add HeartbeatConsistencyChecks (#2656)
This is a new feature for allowing keepalive commands to be sent every heartbeat regardless of if they're needed for actual keepalives. I've gotten a ping about a network drop that went undetected for ~8 minutes because of usage only in strings. We see this use case not uncommonly, and because of that string type being the only thing used, we don't detect a protocol fault from subsequent string commands. The symptoms here are partial/unexpected string payloads but ultimately the wrong answers to the wrong query. Given we don't know at what layer this drop happens (this appears to be extremely rare, 1 in quadrillions as far as we know), the best we can currently do is react to it ASAP. There may be a better way to accomplish what this is after - discussing with Marc soon as we're both online but prepping this PR in case it's best path.
1 parent 6392871 commit 54a633f

File tree

9 files changed

+88
-12
lines changed

9 files changed

+88
-12
lines changed

docs/Configuration.md

Lines changed: 2 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -115,6 +115,8 @@ Additional code-only options:
115115
- The thread pool to use for scheduling work to and from the socket connected to Redis, one of...
116116
- `SocketManager.Shared`: Use a shared dedicated thread pool for _all_ multiplexers (defaults to 10 threads) - best balance for most scenarios.
117117
- `SocketManager.ThreadPool`: Use the build-in .NET thread pool for scheduling. This can perform better for very small numbers of cores or with large apps on large machines that need to use more than 10 threads (total, across all multiplexers) under load. **Important**: this option isn't the default because it's subject to thread pool growth/starvation and if for example synchronous calls are waiting on a redis command to come back to unblock other threads, stalls/hangs can result. Use with caution, especially if you have sync-over-async work in play.
118+
- HeartbeatConsistencyChecks - Default: `false`
119+
- Allows _always_ sending keepalive checks even if a connection isn't idle. This trades extra commands (per `HeartbeatInterval` - default 1 second) to check the network stream for consistency. If any data was lost, the result won't be as expected and the connection will be terminated ASAP. This is a check to react to any data loss at the network layer as soon as possible.
118120
- HeartbeatInterval - Default: `1000ms`
119121
- Allows running the heartbeat more often which importantly includes timeout evaluation for async commands. For example if you have a 50ms async command timeout, we're only actually checking it during the heartbeat (once per second by default), so it's possible 50-1050ms pass _before we notice it timed out_. If you want more fidelity in that check and to observe that a server failed faster, you can lower this to run the heartbeat more often to achieve that.
120122
- **Note: heartbeats are not free and that's why the default is 1 second. There is additional overhead to running this more often simply because it does some work each time it fires.**

docs/ReleaseNotes.md

Lines changed: 2 additions & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -9,6 +9,7 @@ Current package versions:
99
## Unreleased
1010

1111
- Fix [#2653](https://github.com/StackExchange/StackExchange.Redis/issues/2653): Client library metadata should validate contents ([#2654](https://github.com/StackExchange/StackExchange.Redis/pull/2654) by mgravell)
12+
- Add `HeartbeatConsistencyChecks` option (opt-in) to enabled per-heartbeat (defaults to once per second) checks to be sent to ensure no network stream corruption has occured ([#2656 by NickCraver](https://github.com/StackExchange/StackExchange.Redis/pull/2656))
1213

1314
## 2.7.20
1415

@@ -25,7 +26,7 @@ Current package versions:
2526

2627
- Fix [#2593](https://github.com/StackExchange/StackExchange.Redis/issues/2593): `EXPIRETIME` and `PEXPIRETIME` miscategorized as `PrimaryOnly` commands causing them to fail when issued against a read-only replica ([#2593 by slorello89](https://github.com/StackExchange/StackExchange.Redis/pull/2593))
2728
- Fix [#2591](https://github.com/StackExchange/StackExchange.Redis/issues/2591): Add `HELLO` to Sentinel connections so they can support RESP3 ([#2601 by NickCraver](https://github.com/StackExchange/StackExchange.Redis/pull/2601))
28-
- Fix [#2595](https://github.com/StackExchange/StackExchange.Redis/issues/2595): Add detection handling for dead sockets that the OS says are okay, seen especially in Linux environments (https://github.com/StackExchange/StackExchange.Redis/pull/2610)
29+
- Fix [#2595](https://github.com/StackExchange/StackExchange.Redis/issues/2595): Add detection handling for dead sockets that the OS says are okay, seen especially in Linux environments ([#2610 by NickCraver](https://github.com/StackExchange/StackExchange.Redis/pull/2610))
2930

3031
## 2.7.4
3132

src/StackExchange.Redis/Configuration/DefaultOptionsProvider.cs

Lines changed: 6 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -139,6 +139,12 @@ public static DefaultOptionsProvider GetProvider(EndPoint endpoint)
139139
/// <remarks>Be aware setting this very low incurs additional overhead of evaluating the above more often.</remarks>
140140
public virtual TimeSpan HeartbeatInterval => TimeSpan.FromSeconds(1);
141141

142+
/// <summary>
143+
/// Whether to enable ECHO checks on every heartbeat to ensure network stream consistency.
144+
/// This is a rare measure to react to any potential network traffic drops ASAP, terminating the connection.
145+
/// </summary>
146+
public virtual bool HeartbeatConsistencyChecks => false;
147+
142148
/// <summary>
143149
/// Should exceptions include identifiable details? (key names, additional .Data annotations)
144150
/// </summary>

src/StackExchange.Redis/ConfigurationOptions.cs

Lines changed: 11 additions & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -154,7 +154,7 @@ public static string TryNormalize(string value)
154154

155155
private DefaultOptionsProvider? defaultOptions;
156156

157-
private bool? allowAdmin, abortOnConnectFail, resolveDns, ssl, checkCertificateRevocation,
157+
private bool? allowAdmin, abortOnConnectFail, resolveDns, ssl, checkCertificateRevocation, heartbeatConsistencyChecks,
158158
includeDetailInExceptions, includePerformanceCountersInExceptions, setClientLibrary;
159159

160160
private string? tieBreaker, sslHost, configChannel, user, password;
@@ -402,6 +402,16 @@ public Version DefaultVersion
402402
/// </remarks>
403403
public EndPointCollection EndPoints { get; init; } = new EndPointCollection();
404404

405+
/// <summary>
406+
/// Whether to enable ECHO checks on every heartbeat to ensure network stream consistency.
407+
/// This is a rare measure to react to any potential network traffic drops ASAP, terminating the connection.
408+
/// </summary>
409+
public bool HeartbeatConsistencyChecks
410+
{
411+
get => heartbeatConsistencyChecks ?? Defaults.HeartbeatConsistencyChecks;
412+
set => heartbeatConsistencyChecks = value;
413+
}
414+
405415
/// <summary>
406416
/// Controls how often the connection heartbeats. A heartbeat includes:
407417
/// - Evaluating if any messages have timed out

src/StackExchange.Redis/PhysicalBridge.cs

Lines changed: 19 additions & 6 deletions
Original file line numberDiff line numberDiff line change
@@ -359,9 +359,13 @@ internal void IncrementOpCount()
359359
Interlocked.Increment(ref operationCount);
360360
}
361361

362-
internal void KeepAlive()
362+
/// <summary>
363+
/// Sends a keepalive message (ECHO or PING) to keep connections alive and check validity of response.
364+
/// </summary>
365+
/// <param name="forceRun">Whether to run even then the connection isn't idle.</param>
366+
internal void KeepAlive(bool forceRun = false)
363367
{
364-
if (!(physical?.IsIdle() ?? false)) return; // don't pile on if already doing something
368+
if (!forceRun && !(physical?.IsIdle() ?? false)) return; // don't pile on if already doing something
365369

366370
var commandMap = Multiplexer.CommandMap;
367371
Message? msg = null;
@@ -596,6 +600,15 @@ internal void OnHeartbeat(bool ifConnectedOnly)
596600
checkConfigSeconds = ServerEndPoint.ConfigCheckSeconds;
597601

598602
if (state == (int)State.ConnectedEstablished && ConnectionType == ConnectionType.Interactive
603+
&& tmp.BridgeCouldBeNull?.Multiplexer.RawConfig.HeartbeatConsistencyChecks == true)
604+
{
605+
// If HeartbeatConsistencyChecks are enabled, we're sending a PING (expecting PONG) or ECHO (expecting UniqueID back) every single
606+
// heartbeat as an opt-in measure to react to any network stream drop ASAP to terminate the connection as faulted.
607+
// If we don't get the expected response to that command, then the connection is terminated.
608+
// This is to prevent the case of things like 100% string command usage where a protocol error isn't otherwise encountered.
609+
KeepAlive(forceRun: true);
610+
}
611+
else if (state == (int)State.ConnectedEstablished && ConnectionType == ConnectionType.Interactive
599612
&& checkConfigSeconds > 0 && ServerEndPoint.LastInfoReplicationCheckSecondsAgo >= checkConfigSeconds
600613
&& ServerEndPoint.CheckInfoReplication())
601614
{
@@ -614,13 +627,13 @@ internal void OnHeartbeat(bool ifConnectedOnly)
614627
tmp.Dispose(); // Cleanup the existing connection/socket if any, otherwise it will wait reading indefinitely
615628
}
616629
}
617-
else if (writeEverySeconds <= 0 && tmp.IsIdle()
630+
else if (writeEverySeconds <= 0
631+
&& tmp.IsIdle()
618632
&& tmp.LastWriteSecondsAgo > 2
619633
&& tmp.GetSentAwaitingResponseCount() != 0)
620634
{
621-
// there's a chance this is a dead socket; sending data will shake that
622-
// up a bit, so if we have an empty unsent queue and a non-empty sent
623-
// queue, test the socket
635+
// There's a chance this is a dead socket; sending data will shake that up a bit,
636+
// so if we have an empty unsent queue and a non-empty sent queue, test the socket.
624637
KeepAlive();
625638
}
626639
else if (timedOutThisHeartbeat > 0

src/StackExchange.Redis/PublicAPI/PublicAPI.Shipped.txt

Lines changed: 3 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -232,6 +232,8 @@ StackExchange.Redis.ConfigurationOptions.DefaultVersion.get -> System.Version!
232232
StackExchange.Redis.ConfigurationOptions.DefaultVersion.set -> void
233233
StackExchange.Redis.ConfigurationOptions.EndPoints.get -> StackExchange.Redis.EndPointCollection!
234234
StackExchange.Redis.ConfigurationOptions.EndPoints.init -> void
235+
StackExchange.Redis.ConfigurationOptions.HeartbeatConsistencyChecks.get -> bool
236+
StackExchange.Redis.ConfigurationOptions.HeartbeatConsistencyChecks.set -> void
235237
StackExchange.Redis.ConfigurationOptions.HeartbeatInterval.get -> System.TimeSpan
236238
StackExchange.Redis.ConfigurationOptions.HeartbeatInterval.set -> void
237239
StackExchange.Redis.ConfigurationOptions.HighPrioritySocketThreads.get -> bool
@@ -1797,6 +1799,7 @@ virtual StackExchange.Redis.Configuration.DefaultOptionsProvider.DefaultVersion.
17971799
virtual StackExchange.Redis.Configuration.DefaultOptionsProvider.GetDefaultClientName() -> string!
17981800
virtual StackExchange.Redis.Configuration.DefaultOptionsProvider.GetDefaultSsl(StackExchange.Redis.EndPointCollection! endPoints) -> bool
17991801
virtual StackExchange.Redis.Configuration.DefaultOptionsProvider.GetSslHostFromEndpoints(StackExchange.Redis.EndPointCollection! endPoints) -> string?
1802+
virtual StackExchange.Redis.Configuration.DefaultOptionsProvider.HeartbeatConsistencyChecks.get -> bool
18001803
virtual StackExchange.Redis.Configuration.DefaultOptionsProvider.HeartbeatInterval.get -> System.TimeSpan
18011804
virtual StackExchange.Redis.Configuration.DefaultOptionsProvider.IncludeDetailInExceptions.get -> bool
18021805
virtual StackExchange.Redis.Configuration.DefaultOptionsProvider.IncludePerformanceCountersInExceptions.get -> bool

src/StackExchange.Redis/ServerEndPoint.cs

Lines changed: 3 additions & 3 deletions
Original file line numberDiff line numberDiff line change
@@ -567,7 +567,7 @@ internal string GetProfile()
567567

568568
internal string? GetStormLog(Message message) => GetBridge(message)?.GetStormLog();
569569

570-
internal Message GetTracerMessage(bool assertIdentity)
570+
internal Message GetTracerMessage(bool checkResponse)
571571
{
572572
// Different configurations block certain commands, as can ad-hoc local configurations, so
573573
// we'll do the best with what we have available.
@@ -576,7 +576,7 @@ internal Message GetTracerMessage(bool assertIdentity)
576576
var map = Multiplexer.CommandMap;
577577
Message msg;
578578
const CommandFlags flags = CommandFlags.NoRedirect | CommandFlags.FireAndForget;
579-
if (assertIdentity && map.IsAvailable(RedisCommand.ECHO))
579+
if (checkResponse && map.IsAvailable(RedisCommand.ECHO))
580580
{
581581
msg = Message.Create(-1, flags, RedisCommand.ECHO, (RedisValue)Multiplexer.UniqueId);
582582
}
@@ -588,7 +588,7 @@ internal Message GetTracerMessage(bool assertIdentity)
588588
{
589589
msg = Message.Create(-1, flags, RedisCommand.TIME);
590590
}
591-
else if (!assertIdentity && map.IsAvailable(RedisCommand.ECHO))
591+
else if (!checkResponse && map.IsAvailable(RedisCommand.ECHO))
592592
{
593593
// We'll use echo as a PING substitute if it is all we have (in preference to EXISTS)
594594
msg = Message.Create(-1, flags, RedisCommand.ECHO, (RedisValue)Multiplexer.UniqueId);

tests/StackExchange.Redis.Tests/ConnectCustomConfigTests.cs

Lines changed: 37 additions & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -1,4 +1,6 @@
1-
using Xunit;
1+
using System;
2+
using System.Threading.Tasks;
3+
using Xunit;
24
using Xunit.Abstractions;
35

46
namespace StackExchange.Redis.Tests;
@@ -89,4 +91,38 @@ public void TiebreakerIncorrectType()
8991
var ex = Assert.Throws<RedisServerException>(() => db.StringGet(tiebreakerKey));
9092
Assert.Contains("WRONGTYPE", ex.Message);
9193
}
94+
95+
[Theory]
96+
[InlineData(true, 5, 15)]
97+
[InlineData(false, 0, 0)]
98+
public async Task HeartbeatConsistencyCheckPingsAsync(bool enableConsistencyChecks, int minExpected, int maxExpected)
99+
{
100+
var options = new ConfigurationOptions()
101+
{
102+
HeartbeatConsistencyChecks = enableConsistencyChecks,
103+
HeartbeatInterval = TimeSpan.FromMilliseconds(100),
104+
};
105+
options.EndPoints.Add(TestConfig.Current.PrimaryServerAndPort);
106+
107+
using var conn = await ConnectionMultiplexer.ConnectAsync(options, Writer);
108+
109+
var db = conn.GetDatabase();
110+
db.Ping();
111+
Assert.True(db.IsConnected(default));
112+
113+
var preCount = conn.OperationCount;
114+
Log("OperationCount (pre-delay): " + preCount);
115+
116+
// Allow several heartbeats to happen, but don't need to be strict here
117+
// e.g. allow thread pool starvation flex with the test suite's load (just check for a few)
118+
await Task.Delay(TimeSpan.FromSeconds(1));
119+
120+
var postCount = conn.OperationCount;
121+
Log("OperationCount (post-delay): " + postCount);
122+
123+
var opCount = postCount - preCount;
124+
Log("OperationCount (diff): " + opCount);
125+
126+
Assert.True(minExpected <= opCount && opCount >= minExpected, $"Expected opcount ({opCount}) between {minExpected}-{maxExpected}");
127+
}
92128
}

tests/StackExchange.Redis.Tests/DefaultOptionsTests.cs

Lines changed: 5 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -32,6 +32,8 @@ public class TestOptionsProvider : DefaultOptionsProvider
3232
public override int ConnectRetry => 123;
3333
public override Version DefaultVersion => new Version(1, 2, 3, 4);
3434
protected override string GetDefaultClientName() => "TestPrefix-" + base.GetDefaultClientName();
35+
public override bool HeartbeatConsistencyChecks => true;
36+
public override TimeSpan HeartbeatInterval => TimeSpan.FromMilliseconds(500);
3537
public override bool IsMatch(EndPoint endpoint) => endpoint is DnsEndPoint dnsep && dnsep.Host.EndsWith(_domainSuffix);
3638
public override TimeSpan KeepAliveInterval => TimeSpan.FromSeconds(125);
3739
public override ILoggerFactory? LoggerFactory => NullLoggerFactory.Instance;
@@ -99,6 +101,9 @@ private static void AssertAllOverrides(ConfigurationOptions options)
99101
Assert.Equal(123, options.ConnectRetry);
100102
Assert.Equal(new Version(1, 2, 3, 4), options.DefaultVersion);
101103

104+
Assert.True(options.HeartbeatConsistencyChecks);
105+
Assert.Equal(TimeSpan.FromMilliseconds(500), options.HeartbeatInterval);
106+
102107
Assert.Equal(TimeSpan.FromSeconds(125), TimeSpan.FromSeconds(options.KeepAlive));
103108
Assert.Equal(NullLoggerFactory.Instance, options.LoggerFactory);
104109
Assert.Equal(Proxy.Twemproxy, options.Proxy);

0 commit comments

Comments
 (0)