Commit 57e0062

DOC-5849 added draft redis-py failover page
1 parent 1b92ce9 commit 57e0062

Lines changed: 381 additions & 0 deletions
@@ -0,0 +1,381 @@
---
categories:
- docs
- develop
- stack
- oss
- rs
- rc
- kubernetes
- clients
description: Improve reliability using the failover/failback features of redis-py.
linkTitle: Failover/failback
title: Failover and failback
weight: 65
bannerText: This feature is currently in preview and may be subject to change.
---

redis-py supports [failover and failback](https://en.wikipedia.org/wiki/Failover)
to improve the availability of connections to Redis databases. This page explains
the concepts and describes how to configure redis-py for failover and failback.

## Concepts

You may have several [Active-Active databases]({{< relref "/operate/rs/databases/active-active" >}})
or independent Redis servers that are all suitable to serve your app.
Typically, you would prefer to use some database endpoints over others for a particular
instance of your app (perhaps the ones that are geographically closest to the app server,
to reduce network latency). However, if the best endpoint is not available due
to a failure, it is generally better to switch to another, suboptimal endpoint
than to let the app fail completely.

*Failover* is the technique of actively checking for connection failures or
unacceptably slow connections and automatically switching to the best available endpoint
when they occur. This requires you to specify a list of endpoints to try, ordered by priority. The diagram below shows this process:

{{< image filename="images/failover/failover-client-reconnect.svg" alt="Failover and client reconnection" >}}

The complementary technique of *failback* involves periodically checking the health
of all endpoints that have failed. If any endpoints recover, the failback mechanism
automatically switches the connection to the one with the highest priority.
This can happen repeatedly until the optimal endpoint is available again.

{{< image filename="images/failover/failover-client-failback.svg" alt="Failback: client switches back to original server" width="75%" >}}

### Detecting connection problems

redis-py detects connection problems using a
[circuit breaker design pattern](https://en.wikipedia.org/wiki/Circuit_breaker_design_pattern).

The circuit breaker is a software component that tracks the sequence of recent
Redis connection attempts and commands, recording which ones have succeeded and
which have failed.
(Note that many command failures are caused by transient errors such as timeouts,
so before recording a failure, the first response should usually be just to retry
the command a few times.)

The status of the attempted command calls is kept in a "sliding window", which
is simply a buffer where the least recent item is dropped as each new
one is added. You can configure the failure threshold as a minimum number of failures and/or a failure ratio, both measured over a time window.

{{< image filename="images/failover/failover-sliding-window.svg" alt="Sliding window of recent connection attempts" >}}

When the failures recorded in the window reach the configured
threshold, the circuit breaker declares the server to be unhealthy and triggers
a failover.

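The sliding window described above can be sketched in a few lines of plain Python. This is only an illustration and not redis-py's internal implementation: the class name `SlidingWindowFailureDetector` and its parameters are hypothetical, and the sketch assumes that a failover is triggered only when both the failure count and the failure ratio reach their thresholds.

```python
import time
from collections import deque


class SlidingWindowFailureDetector:
    """Hypothetical time-based sliding window of command results."""

    def __init__(self, window_seconds=2.0, min_failures=3,
                 failure_rate=0.1, clock=time.monotonic):
        self.window_seconds = window_seconds
        self.min_failures = min_failures
        self.failure_rate = failure_rate
        self._clock = clock
        self._events = deque()  # (timestamp, succeeded) pairs, oldest first

    def record(self, succeeded):
        """Add the result of one command or connection attempt."""
        now = self._clock()
        self._events.append((now, succeeded))
        self._evict(now)

    def _evict(self, now):
        # Drop results that have aged out of the time window.
        while self._events and now - self._events[0][0] > self.window_seconds:
            self._events.popleft()

    def is_unhealthy(self):
        """Return True when the window holds enough failures to fail over."""
        self._evict(self._clock())
        total = len(self._events)
        if total == 0:
            return False
        failures = sum(1 for _, succeeded in self._events if not succeeded)
        # Assumption: both the count and the ratio must be reached.
        return (failures >= self.min_failures
                and failures / total >= self.failure_rate)
```

Passing the clock in as a parameter keeps the sketch easy to test with a fake time source.
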
### Selecting a failover target

Since you may have multiple Redis servers available to fail over to, redis-py
lets you configure a list of endpoints to try, ordered by priority or
"weight". When a failover is triggered, redis-py selects the highest-weighted
endpoint that is still healthy and uses it for the temporary connection.

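The selection rule can be illustrated with a short, self-contained sketch. The `Endpoint` class and `select_target` function below are hypothetical names used for illustration only and are not part of the redis-py API. The same rule also models failback: when a higher-weighted endpoint becomes healthy again, it is selected in preference to the current target.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass
class Endpoint:
    """Hypothetical stand-in for a configured database endpoint."""
    name: str
    weight: float
    healthy: bool = True


def select_target(endpoints: Sequence[Endpoint]) -> Optional[Endpoint]:
    """Return the healthy endpoint with the highest weight, or None."""
    candidates = [e for e in endpoints if e.healthy]
    return max(candidates, key=lambda e: e.weight, default=None)


east = Endpoint("redis-east", weight=1.0)
west = Endpoint("redis-west", weight=0.5)

east.healthy = False                   # The preferred endpoint fails...
target = select_target([east, west])   # ...so redis-west is selected.

east.healthy = True                    # After recovery, redis-east
target = select_target([east, west])   # is selected again (failback).
```
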
### Health checks

Given that the original endpoint had some geographical or other advantage
over the failover target, you will generally want to fail back to it as soon
as it recovers. In the meantime, another server might recover that is
still better than the current failover target, so it might be worth
failing back to that server even if it is not optimal.

redis-py periodically runs a "health check" on each server to see if it has recovered.
The health check can be as simple as
sending a Redis [`ECHO`]({{< relref "/commands/echo" >}}) command and ensuring
that it gives the expected response.

You can also configure redis-py to run health checks on the current target
server during periods of inactivity, even if no failover has occurred. This can
help to detect problems even if your app is not actively using the server.

## Failover configuration

The example below shows a simple case with a list of two servers,
`redis-east` and `redis-west`, where `redis-east` is the preferred
target. If `redis-east` fails, redis-py should fail over to
`redis-west`.

Supply the weighted endpoints using a list of `DatabaseConfig` objects.
Use the `weight` option to order the endpoints, with the highest
weight being tried first. Then, use the list to create a `MultiDbConfig` object,
which you can pass to the `MultiDBClient` constructor to create the client.
`MultiDBClient` implements the usual Redis commands using an internal
`Redis` instance, but also handles the connection management and failover transparently.

```py
from redis.multidb.client import MultiDBClient
from redis.multidb.config import MultiDbConfig, DatabaseConfig

# Endpoints with their weights: the highest weight is tried first.
db_configs = [
    DatabaseConfig(
        client_kwargs={"host": "redis-east.example.com", "port": 14000},
        weight=1.0
    ),
    DatabaseConfig(
        client_kwargs={"host": "redis-west.example.com", "port": 14000},
        weight=0.5
    ),
]

cfg = MultiDbConfig(databases_config=db_configs)
client = MultiDBClient(cfg)
```

### Endpoint configuration

The `DatabaseConfig` class provides several options to configure each endpoint, as
described in the table below. Supply the configurations for the whole set of
endpoints by passing a list of `DatabaseConfig` objects to the `MultiDbConfig`
constructor in the `databases_config` parameter.

| Option | Description |
| --- | --- |
| `client_kwargs` | Keyword parameters to pass to the internal `Redis` constructor for this endpoint. Use it to specify the host, port, username, password, and other connection parameters (see [Connect to the server]({{< relref "/develop/clients/redis-py/connect" >}}) for more information). |
| `from_url` | Redis URL to connect to this endpoint, as an alternative to passing the host and port in `client_kwargs`. |
| `from_pool` | A `ConnectionPool` to supply the endpoint connection (see [Connect with a connection pool]({{< relref "/develop/clients/redis-py/connect#connect-with-a-connection-pool" >}}) for more information). |
| `weight` | Priority of the endpoint, with higher values being tried first. Default is `1.0`. |
| `grace_period` | Duration in seconds to keep an unhealthy endpoint disabled before attempting a failback. Default is `60` seconds. |
| `health_check_url` | URL for health checks that use the database's REST API (see [`LagAwareHealthCheck`](#lag-aware-health-check) for more information). |

### Retry configuration

`MultiDbConfig` provides the `command_retry` option to configure retries for failed commands. This follows the usual approach to configuring retries used with a standard
`Redis` connection (see [Retries]({{< relref "/develop/clients/redis-py/produsage#retries" >}}) for more information).

```py
from redis.backoff import ExponentialWithJitterBackoff
from redis.retry import Retry

cfg = MultiDbConfig(
    # ...
    # Retry failed commands up to three times using exponential backoff
    # with jitter between attempts.
    command_retry=Retry(
        retries=3,
        backoff=ExponentialWithJitterBackoff(base=1, cap=10),
    ),
    # ...
)
```

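To give a feel for what exponential backoff with jitter does between retry attempts, the sketch below computes delays using one common formula: a random fraction of an exponentially growing interval, capped at a maximum. The function `backoff_delay` is hypothetical, and the exact formula used by `ExponentialWithJitterBackoff` may differ.

```python
import random


def backoff_delay(failures, base=1.0, cap=10.0, rng=random.random):
    """Return a delay in seconds after `failures` failed attempts.

    The upper bound doubles with each failure until it reaches `cap`.
    The random factor spreads out retries from many clients so that
    they do not all hit a recovering server at the same moment.
    """
    return min(cap, rng() * base * 2 ** failures)


# Print one possible sequence of delays for three retries.
for attempt in range(3):
    print(f"retry {attempt + 1}: wait {backoff_delay(attempt):.2f}s")
```
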
### Health check configuration

Each health check consists of one or more separate "probes", each of which is a simple
test (such as an [`ECHO`]({{< relref "/commands/echo" >}}) command) to determine if the database is available. The results of the separate probes are combined
using a configurable policy to determine if the database is healthy. `MultiDbConfig` provides the following options to configure the health check behavior:

| Option | Description |
| --- | --- |
| `health_check_interval` | Time interval between successive health checks (each of which may consist of multiple probes). Default is `5` seconds. |
| `health_check_probes` | Number of separate probes performed during each health check. Default is `3`. |
| `health_check_probes_delay` | Delay between probes during a health check. Default is `0.5` seconds. |
| `health_check_policy` | `HealthCheckPolicies` enum value to specify the policy for determining database health from the separate probes of a health check. The options are `HealthCheckPolicies.ALL` (all probes must succeed), `HealthCheckPolicies.ANY` (at least one probe must succeed), and `HealthCheckPolicies.MAJORITY` (more than half the probes must succeed). The default policy is `HealthCheckPolicies.MAJORITY`. |
| `health_check` | Custom list of `HealthCheck` objects to specify how to perform each probe during a health check. This defaults to just the simple [`EchoHealthCheck`](#echohealthcheck-default). |

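The way the three policies combine probe results can be sketched as follows. This is an illustration of the rules stated in the table, not redis-py's own code: the `ProbePolicy` enum and `is_healthy` function are hypothetical stand-ins for `HealthCheckPolicies` and the library's internal logic.

```python
from enum import Enum


class ProbePolicy(Enum):
    """Hypothetical stand-in for the HealthCheckPolicies enum."""
    ALL = "all"
    ANY = "any"
    MAJORITY = "majority"


def is_healthy(probe_results, policy):
    """Combine the boolean results of one health check's probes."""
    results = list(probe_results)
    if not results:
        return False
    passed = sum(1 for ok in results if ok)
    if policy is ProbePolicy.ALL:
        return passed == len(results)
    if policy is ProbePolicy.ANY:
        return passed >= 1
    # MAJORITY: more than half of the probes must succeed.
    return passed > len(results) / 2


# Three probes where the second one failed.
probes = [True, False, True]
print(is_healthy(probes, ProbePolicy.ALL))       # False
print(is_healthy(probes, ProbePolicy.ANY))       # True
print(is_healthy(probes, ProbePolicy.MAJORITY))  # True
```
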
### Circuit breaker configuration

`MultiDbConfig` gives you several options to configure the circuit breaker:

| Option | Description |
| --- | --- |
| `failures_detection_window` | Duration in seconds to keep failures and successes in the sliding window. Default is `2` seconds. |
| `min_num_failures` | Minimum number of failures that must occur to trigger a failover. Default is `1000`. |
| `failure_rate_threshold` | Fraction of failed commands required to trigger a failover. Default is `0.1` (10%). |

### General failover configuration

There are also a few other options you can pass to the `MultiDbConfig` constructor to control the failover behavior:

| Option | Description |
| --- | --- |
| `failover_attempts` | Number of attempts to fail over to a new endpoint before giving up. Default is `10`. |
| `failover_delay` | Time interval between successive failover attempts. Default is `12` seconds. |
| `auto_fallback_interval` | Time interval between automatic failback attempts. Default is `30` seconds. |

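As a rough illustration of how these options fit together, the fragment below passes circuit breaker and failover settings to the `MultiDbConfig` constructor. It uses only the option names listed in the tables above. The values are arbitrary examples rather than recommendations, and the hostnames are placeholders.

```python
from redis.multidb.config import MultiDbConfig, DatabaseConfig

db_configs = [
    DatabaseConfig(
        client_kwargs={"host": "redis-east.example.com", "port": 14000},
        weight=1.0,
    ),
    DatabaseConfig(
        client_kwargs={"host": "redis-west.example.com", "port": 14000},
        weight=0.5,
    ),
]

cfg = MultiDbConfig(
    databases_config=db_configs,
    # Circuit breaker: look at the last 5 seconds of results.
    failures_detection_window=5,
    min_num_failures=100,
    failure_rate_threshold=0.2,
    # Failover: make up to 5 attempts, 10 seconds apart.
    failover_attempts=5,
    failover_delay=10,
    # Failback: check for a better endpoint every 60 seconds.
    auto_fallback_interval=60,
)
```
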
## Health check strategies

There are several strategies available for health checks that you can configure using the
`health_check` option of the `MultiDbConfig` constructor. The sections below explain these strategies
in more detail.

### `EchoHealthCheck` (default)

The default strategy, `EchoHealthCheck`, periodically sends a Redis
[`ECHO`]({{< relref "/commands/echo" >}}) command
and checks that it gives the expected response. Any unexpected response
or exception indicates an unhealthy server. Although `EchoHealthCheck` is
very simple, it is a good basic approach for most Redis deployments.

### `LagAwareHealthCheck` (Redis Enterprise only) {#lag-aware-health-check}

`LagAwareHealthCheck` is designed specifically for
Redis Enterprise [Active-Active]({{< relref "/operate/rs/databases/active-active" >}})
deployments. It determines the health of the server by using the
[REST API]({{< relref "/operate/rs/references/rest-api" >}}) to check the
synchronization lag between a specific database and the others in the Active-Active
setup. If the lag is within a specified tolerance, the server is considered healthy.

`LagAwareHealthCheck` uses the `health_check_url` value for the endpoint
to connect to the database's REST API, so you must specify this in
the `DatabaseConfig` for each endpoint:

```py
from redis.multidb.config import DatabaseConfig

db_configs = [
    DatabaseConfig(
        client_kwargs={"host": "redis-east.example.com", "port": 14000},
        weight=1.0,
        health_check_url="https://health.redis-east.example.com"
    ),
    DatabaseConfig(
        client_kwargs={"host": "redis-west.example.com", "port": 14000},
        weight=0.5,
        health_check_url="https://health.redis-west.example.com"
    ),
]
```

You must also add a `LagAwareHealthCheck` instance to the `health_check` list in
the `MultiDbConfig` constructor:

```py
from redis.multidb.client import MultiDBClient
from redis.multidb.config import MultiDbConfig
from redis.multidb.healthcheck import LagAwareHealthCheck

cfg = MultiDbConfig(
    databases_config=db_configs,
    health_check=[LagAwareHealthCheck(
        rest_api_port=9443,
        lag_aware_tolerance=100,  # ms
        verify_tls=True,
        # auth_basic=("user", "pass"),
        # ca_file="/path/ca.pem",
        # client_cert_file="/path/cert.pem",
        # client_key_file="/path/key.pem",
    )],
    # ...
)

client = MultiDBClient(cfg)
```

The `LagAwareHealthCheck` constructor accepts the following options:

| Option | Description |
| --- | --- |
| `rest_api_port` | Port number for Redis Enterprise REST API (default is 9443). |
| `lag_aware_tolerance` | Tolerable synchronization lag between databases in milliseconds (default is 100ms). |
| `timeout` | REST API request timeout in seconds (default is 30 seconds). |
| `auth_basic` | Tuple of (username, password) for basic authentication. |
| `verify_tls` | Whether to verify TLS certificates (defaults to `True`). |
| `ca_file` | Path to CA certificate file for TLS verification. |
| `ca_path` | Path to CA certificates directory for TLS verification. |
| `ca_data` | CA certificate data as string or bytes. |
| `client_cert_file` | Path to client certificate file for mutual TLS. |
| `client_key_file` | Path to client private key file for mutual TLS. |
| `client_key_password` | Password for encrypted client private key. |

### Custom health check strategy

You can supply your own custom health check strategy by
deriving a new class from the `AbstractHealthCheck` class.
For example, you might use this to integrate with external monitoring tools or
to implement checks that are specific to your application. Add an
instance of your custom class to the `health_check` list in
the `MultiDbConfig` constructor, as with [`LagAwareHealthCheck`](#lag-aware-health-check).

The example below
shows a simple custom strategy that sends a Redis [`PING`]({{< relref "/commands/ping" >}})
command and checks for the expected `PONG` response.

```py
from redis.backoff import ExponentialWithJitterBackoff
from redis.multidb.client import MultiDBClient
from redis.multidb.config import MultiDbConfig
from redis.multidb.healthcheck import AbstractHealthCheck
from redis.retry import Retry
from redis.utils import dummy_fail


class PingHealthCheck(AbstractHealthCheck):
    def __init__(self, retry: Retry):
        super().__init__(retry=retry)

    def check_health(self, database) -> bool:
        # Retry the probe before reporting the database as unhealthy.
        return self._retry.call_with_retry(
            lambda: self._returns_pong(database),
            lambda _: dummy_fail()
        )

    def _returns_pong(self, database) -> bool:
        # redis-py's default response callback may convert the PONG
        # reply to True, so accept that as well as the raw reply.
        expected_message = [True, "PONG", b"PONG"]
        actual_message = database.client.execute_command("PING")
        return actual_message in expected_message


cfg = MultiDbConfig(
    # ...
    health_check=[PingHealthCheck(
        # Retry needs a backoff strategy as well as a retry count.
        retry=Retry(backoff=ExponentialWithJitterBackoff(base=1, cap=10), retries=3)
    )],
    # ...
)

client = MultiDBClient(cfg)
```

## Managing databases at runtime

Although you will typically configure all databases during the
initial connection, you can also modify the configuration at runtime.
You can add and remove database endpoints, update their weights,
and manually set the active database rather than waiting for the
failback mechanism:

```py
from redis.multidb.client import MultiDBClient
from redis.multidb.config import MultiDbConfig, DatabaseConfig
from redis.multidb.database import Database
from redis.multidb.circuit import PBCircuitBreakerAdapter
import pybreaker
from redis import Redis

cfg = MultiDbConfig(
    databases_config=[
        DatabaseConfig(
            client_kwargs={"host": "redis-east.example.com", "port": 14000},
            weight=1.0
        ),
        DatabaseConfig(
            client_kwargs={"host": "redis-west.example.com", "port": 14000},
            weight=0.5
        ),
    ]
)
client = MultiDBClient(cfg)

# Add a database programmatically.
other = Database(
    client=Redis.from_url("redis://redis-south.example.com/0"),
    circuit=PBCircuitBreakerAdapter(pybreaker.CircuitBreaker(reset_timeout=5.0)),
    weight=0.5,
    health_check_url=None,
)
client.add_database(other)

# Update the new database's weight.
client.update_database_weight(other, 0.9)

# Manually set it as the active database.
client.set_active_database(other)

# Remove the database from the failover set.
client.remove_database(other)
```

## Troubleshooting

This section lists some common problems and their solutions.

### Excessive or constant health check failures

If all health checks fail, you should first rule out authentication
problems with the Redis server and also make sure there are no persistent
network connectivity problems. If you are using
[`LagAwareHealthCheck`](#lag-aware-health-check), check that the `health_check_url`
is set correctly for each endpoint. You can also try increasing the timeout
for health checks and the interval between them. See
[Health check configuration](#health-check-configuration) and
[Endpoint configuration](#endpoint-configuration) for more information about these options.

### Slow failback after recovery

If failback is too slow after a server recovers, you can try
reducing the `health_check_interval` period and also reducing the `grace_period`
before failback is attempted (see [Health check configuration](#health-check-configuration)
and [Endpoint configuration](#endpoint-configuration) for more information about these options).
