Commit f8f796c

DOC-5665 start health check details
1 parent effa51a commit f8f796c

1 file changed

+126
-11
lines changed

content/develop/clients/jedis/failover.md

Lines changed: 126 additions & 11 deletions
Original file line number | Diff line number | Diff line change
@@ -21,37 +21,41 @@ the concepts and describes how to configure Jedis for failover and failback.
2121

2222
## Concepts
2323

24-
You may have [Active-Active databases]({{< relref "/operate/rs/databases/active-active" >}})
24+
You may have several [Active-Active databases]({{< relref "/operate/rs/databases/active-active" >}})
2525
or independent Redis servers that are all suitable to serve your app.
2626
Typically, you would prefer some database endpoints over others for a particular
2727
instance of your app (perhaps the ones that are closest geographically to the app server
2828
to reduce network latency). However, if the best endpoint is not available due
2929
to a failure, it is generally better to switch to another, suboptimal endpoint
3030
than to let the app fail completely.
3131

32-
*Failover* is the technique of actively checking for connection failures and
33-
automatically switching to another endpoint when a failure is detected.
32+
*Failover* is the technique of actively checking for connection failures or
33+
unacceptably slow connections and
34+
automatically switching to another endpoint when they occur. The
35+
diagram below shows this process:
3436

3537
{{< image filename="images/failover/failover-client-reconnect.svg" alt="Failover and client reconnection" >}}
3638

3739
The complementary technique of *failback* then involves checking the original
3840
endpoint periodically to see if it has recovered, and switching back to it
39-
when it is available again.
41+
when it is available again:
4042

4143
{{< image filename="images/failover/failover-client-failback.svg" alt="Failback: client switches back to original server" width="75%" >}}
4244

4345
### Detecting a failed connection
4446

4547
Jedis uses the [resilience4j](https://resilience4j.readme.io/docs/getting-started)
46-
to detect connection failures using a
48+
library to detect connection problems using a
4749
[circuit breaker design pattern](https://en.wikipedia.org/wiki/Circuit_breaker_design_pattern).
4850

49-
The circuit breaker is a software component that tracks recent connection
50-
attempts in sequence, recording which ones have succeeded and which have failed.
51-
(Note that many connection failures are transient, so before recording a failure,
52-
the first response should usually be just to retry the connection a few times.)
51+
The circuit breaker is a software component that tracks the sequence of recent
52+
Redis connection attempts and commands, recording which ones have succeeded and
53+
which have failed.
54+
(Note that many command failures are caused by transient errors such as timeouts,
55+
so before recording a failure, the first response should usually be just to retry
56+
the command a few times.)
5357

54-
The status of the connection attempts is kept in a "sliding window", which
58+
The status of the attempted command calls is kept in a "sliding window", which
5559
is simply a buffer where the least recent item is dropped as each new
5660
one is added.
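
The sliding window described above can be sketched in plain Java. This is only an illustration of the idea, not the resilience4j or Jedis implementation: the class `SlidingWindowSketch` and its methods are hypothetical names, and the constructor parameters are assumed to loosely mirror a window size, a minimum call count, and a failure percentage threshold.

```java
import java.util.ArrayDeque;
import java.util.Deque;

/** Illustrative count-based sliding window; not the resilience4j implementation. */
final class SlidingWindowSketch {
    private final int size;           // maximum number of outcomes kept
    private final int minCalls;       // outcomes needed before the rate is evaluated
    private final float thresholdPct; // failure percentage that trips the breaker
    private final Deque<Boolean> outcomes = new ArrayDeque<>(); // true = failed call
    private int failures = 0;

    SlidingWindowSketch(int size, int minCalls, float thresholdPct) {
        this.size = size;
        this.minCalls = minCalls;
        this.thresholdPct = thresholdPct;
    }

    /** Records one outcome, dropping the least recent one if the window is full. */
    void record(boolean failed) {
        if (outcomes.size() == size) {
            if (outcomes.removeFirst()) {
                failures--;
            }
        }
        outcomes.addLast(failed);
        if (failed) {
            failures++;
        }
    }

    /** Percentage of failed calls currently held in the window. */
    float failureRate() {
        return outcomes.isEmpty() ? 0f : 100f * failures / outcomes.size();
    }

    /** True once enough calls are recorded and the failure rate reaches the threshold. */
    boolean shouldTrip() {
        return outcomes.size() >= minCalls && failureRate() >= thresholdPct;
    }
}
```

Keeping a running failure count avoids rescanning the buffer on every call, which is why the sketch adjusts `failures` both when an outcome enters the window and when one leaves it.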
5761

@@ -74,12 +78,123 @@ Given that the original endpoint had some geographical or other advantage
7478
over the failover target, you will generally want to fail back to it as soon
7579
as it recovers. To detect when this happens, Jedis periodically
7680
runs a "health check" on the server. This can be as simple as
77-
sending a Redis [`ECHO`]({{< relref "/commands/echo" >}})) command and checking
81+
sending a Redis [`ECHO`]({{< relref "/commands/echo" >}}) command and checking
7882
that it gives a response.
7983

8084
You can also configure Jedis to run health checks on the current target
8185
server during periods of inactivity. This can help to detect when the
8286
server has failed and a failover is needed even when your app is not actively
8387
using it.
8488

89+
## Configure Jedis for failover
8590

91+
The example below shows a simple case with a list of two servers,
92+
`redis-east` and `redis-west`, where `redis-east` is the preferred
93+
target. If `redis-east` fails, Jedis should fail over to
94+
`redis-west`.
95+
96+
First, create some simple configuration for the client and
97+
[connection pool]({{< relref "/develop/clients/jedis/connect#connect-with-a-connection-pool" >}}),
98+
as you would for a standard connection.
99+
100+
```java
101+
JedisClientConfig config = DefaultJedisClientConfig.builder().user("<username>").password("<password>")
102+
.socketTimeoutMillis(5000).connectionTimeoutMillis(5000).build();
103+
104+
ConnectionPoolConfig poolConfig = new ConnectionPoolConfig();
105+
poolConfig.setMaxTotal(8);
106+
poolConfig.setMaxIdle(8);
107+
poolConfig.setMinIdle(0);
108+
poolConfig.setBlockWhenExhausted(true);
109+
poolConfig.setMaxWait(Duration.ofSeconds(1));
110+
poolConfig.setTestWhileIdle(true);
111+
poolConfig.setTimeBetweenEvictionRuns(Duration.ofSeconds(1));
112+
```
113+
114+
Supply the weighted list of endpoints as an array of `ClusterConfig`
115+
objects. Use the basic configuration objects created above and
116+
use the `weight` option to order the endpoints, with the highest
117+
weight being tried first.
118+
119+
```java
120+
MultiClusterClientConfig.ClusterConfig[] clusterConfigs = new MultiClusterClientConfig.ClusterConfig[2];
121+
122+
HostAndPort east = new HostAndPort("redis-east.example.com", 14000);
123+
clusterConfigs[0] = ClusterConfig.builder(east, config).connectionPoolConfig(poolConfig).weight(1.0f).build();
124+
125+
HostAndPort west = new HostAndPort("redis-west.example.com", 14000);
126+
clusterConfigs[1] = ClusterConfig.builder(west, config).connectionPoolConfig(poolConfig).weight(0.5f).build();
127+
```
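
To make the effect of the weights concrete, here is a minimal standalone sketch of weight-based selection. It assumes that the client prefers the highest-weighted endpoint that is currently healthy; `WeightedEndpointSketch`, `Endpoint`, and `pick` are hypothetical names used for illustration and are not part of the Jedis API.

```java
import java.util.Comparator;
import java.util.List;
import java.util.Optional;

final class WeightedEndpointSketch {
    /** Hypothetical stand-in for a configured endpoint; not a Jedis class. */
    static final class Endpoint {
        final String name;
        final float weight;
        final boolean healthy;

        Endpoint(String name, float weight, boolean healthy) {
            this.name = name;
            this.weight = weight;
            this.healthy = healthy;
        }
    }

    /** Returns the healthy endpoint with the highest weight, if there is one. */
    static Optional<Endpoint> pick(List<Endpoint> endpoints) {
        return endpoints.stream()
                .filter(e -> e.healthy)
                .max(Comparator.comparingDouble((Endpoint e) -> e.weight));
    }
}
```

With `redis-east` at weight `1.0f` and `redis-west` at `0.5f`, this sketch picks `redis-east` while it is healthy and falls back to `redis-west` otherwise.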
128+
129+
Pass the `clusterConfigs` array when you create the `MultiClusterClientConfig` builder.
130+
The builder lets you add several options to configure the
131+
[circuit breaker](#circuit-breaker-configuration) behavior
132+
and [retries](#retry-configuration) (these are explained in more detail below).
133+
134+
```java
135+
MultiClusterClientConfig.Builder builder = new MultiClusterClientConfig.Builder(clusterConfigs);
136+
137+
builder.circuitBreakerSlidingWindowSize(10); // Sliding window size in number of calls
138+
builder.circuitBreakerSlidingWindowMinCalls(1);
139+
builder.circuitBreakerFailureRateThreshold(50.0f); // percentage of failures to trigger circuit breaker
140+
141+
builder.failbackSupported(true); // Enable failback
142+
builder.failbackCheckInterval(1000); // Check the unhealthy cluster every second to see if it has recovered
143+
builder.gracePeriod(10000); // Keep cluster disabled for 10 seconds after it becomes unhealthy
144+
145+
// Optional: configure retry settings
146+
builder.retryMaxAttempts(3); // Maximum number of retry attempts (including the initial call)
147+
builder.retryWaitDuration(500); // Number of milliseconds to wait between retry attempts
148+
builder.retryWaitDurationExponentialBackoffMultiplier(2); // Exponential backoff factor multiplied against wait duration between retries
149+
150+
// Optional: configure fast failover
151+
builder.fastFailover(true); // Force closing connections to unhealthy cluster on failover
152+
builder.retryOnFailover(false); // Do not retry failed commands during failover
153+
```
154+
155+
Finally, build the `MultiClusterClientConfig` and use it to create a `MultiClusterPooledConnectionProvider`. You can now pass this to
156+
the standard `UnifiedJedis` constructor to establish the client connection
157+
(see [Basic connection]({{< relref "/develop/clients/jedis/connect#basic-connection" >}})
158+
for an example).
159+
160+
```java
161+
MultiClusterPooledConnectionProvider provider = new MultiClusterPooledConnectionProvider(builder.build());
162+
163+
UnifiedJedis jedis = new UnifiedJedis(provider);
164+
```
165+
166+
When you use the `UnifiedJedis` instance, Jedis will handle the connection
167+
management and failover transparently.
168+
169+
### Circuit breaker configuration
170+
171+
The `MultiClusterClientConfig` builder lets you pass several options to configure
172+
the circuit breaker:
173+
174+
| Builder method | Default value | Description |
175+
| --- | --- | --- |
176+
| `circuitBreakerSlidingWindowType()` | `COUNT_BASED` | Type of sliding window. `COUNT_BASED` uses a sliding window based on the number of calls, while `TIME_BASED` uses a sliding window based on time. |
177+
| `circuitBreakerSlidingWindowSize()` | `100` | Size of the sliding window in number of calls or time in seconds, depending on the sliding window type. |
178+
| `circuitBreakerSlidingWindowMinCalls()` | `10` | Minimum number of calls required (per sliding window period) before the circuit breaker will start calculating the error rate or slow call rate. |
179+
| `circuitBreakerFailureRateThreshold()` | `50.0f` | Percentage of failures to trigger the circuit breaker. |
180+
| `circuitBreakerSlowCallRateThreshold()` | `100.0f` | Percentage of slow calls to trigger the circuit breaker. |
181+
| `circuitBreakerSlowCallDurationThreshold()` | `60000` | Duration in milliseconds to consider a call as slow. |
182+
| `circuitBreakerIncludedExceptionList()` | See description | `List` of `Throwable` classes that should be considered as failures. By default, it includes just `JedisConnectionException`. |
183+
| `circuitBreakerIgnoreExceptionList()` | `null` | `List` of `Throwable` classes that should be ignored for failure rate calculation. |
184+
185+
### Retry configuration
186+
187+
The `MultiClusterClientConfig` builder has the following options to configure retries:
188+
189+
| Builder method | Default value | Description |
190+
| --- | --- | --- |
191+
| `retryMaxAttempts()` | `3` | Maximum number of retry attempts (including the initial call). |
192+
| `retryWaitDuration()` | `500` | Number of milliseconds to wait between retry attempts. |
193+
| `retryWaitDurationExponentialBackoffMultiplier()` | `2` | [Exponential backoff](https://en.wikipedia.org/wiki/Exponential_backoff) factor multiplied against wait duration between retries. For example, with a wait duration of 1 second and a multiplier of 2, the retries would occur after 1s, 2s, 4s, 8s, 16s, and so on. |
194+
| `retryIncludedExceptionList()` | See description | `List` of `Throwable` classes that should be considered as failures to be retried. By default, it includes just `JedisConnectionException`. |
195+
| `retryIgnoreExceptionList()` | `null` | `List` of `Throwable` classes that should be ignored for retry. |
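
The interaction between the wait duration and the backoff multiplier can be worked through with a short standalone sketch. It assumes that the first retry waits for the base duration and that each later wait is the previous one multiplied by the factor; `BackoffSketch` and `waits` are hypothetical names, and the real retry scheduling in Jedis may differ in detail.

```java
import java.util.ArrayList;
import java.util.List;

final class BackoffSketch {
    /**
     * Returns the wait in milliseconds before each retry.
     * maxAttempts includes the initial call, so there are maxAttempts - 1 waits.
     */
    static List<Long> waits(int maxAttempts, long waitMillis, double multiplier) {
        List<Long> result = new ArrayList<>();
        double wait = waitMillis;
        for (int attempt = 2; attempt <= maxAttempts; attempt++) {
            result.add(Math.round(wait));
            wait *= multiplier; // grow the wait for the next retry
        }
        return result;
    }
}
```

Under these assumptions, three attempts with a 500 ms wait and a multiplier of 2 give two waits of 500 ms and 1000 ms, so a command that keeps failing is abandoned after roughly 1.5 seconds of waiting plus the time taken by the calls themselves.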
196+
197+
### Health check configuration
198+
199+
The general strategy for health checks is to ask the Redis server for a
200+
response that it could only give if it is healthy.
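
As a rough illustration of this strategy, the sketch below treats a server as healthy only if it echoes a probe string back unchanged. The `echo` parameter is a hypothetical stand-in for whatever call sends the Redis `ECHO` command; `EchoHealthCheckSketch` and `isHealthy` are names invented for this example and are not part of the Jedis API.

```java
import java.util.function.UnaryOperator;

final class EchoHealthCheckSketch {
    /**
     * Returns true if the server echoes the probe back unchanged.
     * The echo argument is a hypothetical stand-in for a call that sends ECHO.
     */
    static boolean isHealthy(UnaryOperator<String> echo, String probe) {
        try {
            return probe.equals(echo.apply(probe));
        } catch (RuntimeException e) {
            // Any failure to get a reply (for example, a connection error) counts as unhealthy.
            return false;
        }
    }
}
```

Comparing the reply against the probe, rather than only checking that some reply arrived, guards against a response that comes back but is not what a healthy server would return.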
