Commit a2bf969

Merge pull request #2294 from redis/DOC-5849-python-failover
DOC-5849 added draft redis-py failover page
2 parents 80f43c6 + 24d7f7b commit a2bf969

1 file changed

+394
-0
lines changed

@@ -0,0 +1,394 @@
1+
---
2+
categories:
3+
- docs
4+
- develop
5+
- stack
6+
- oss
7+
- rs
8+
- rc
9+
- oss
10+
- kubernetes
11+
- clients
12+
description: Improve reliability using the failover/failback features of redis-py.
13+
linkTitle: Failover/failback
14+
title: Failover and failback
15+
weight: 65
16+
bannerText: This feature is currently in preview and may be subject to change.
17+
---
18+
19+
redis-py supports [failover and failback](https://en.wikipedia.org/wiki/Failover)
20+
to improve the availability of connections to Redis databases. This page explains
21+
the concepts and describes how to configure redis-py for failover and failback.
22+
23+
## Concepts
24+
25+
You may have several [Active-Active databases]({{< relref "/operate/rs/databases/active-active" >}})
26+
or independent Redis servers that are all suitable to serve your app.
27+
Typically, you would prefer to use some database endpoints over others for a particular
28+
instance of your app (perhaps the ones that are closest geographically to the app server
29+
to reduce network latency). However, if the best endpoint is not available due
30+
to a failure, it is generally better to switch to another, suboptimal endpoint
31+
than to let the app fail completely.
32+
33+
*Failover* is the technique of actively checking for connection failures or
34+
unacceptably slow connections and automatically switching to the best available endpoint
35+
when they occur. This requires you to specify a list of endpoints to try, ordered by priority. The diagram below shows this process:
36+
37+
{{< image filename="images/failover/failover-client-reconnect.svg" alt="Failover and client reconnection" >}}
38+
39+
The complementary technique of *failback* then involves periodically checking the health
40+
of all endpoints that have failed. If any endpoints recover, the failback mechanism
41+
automatically switches the connection to the one with the highest priority.
42+
This could potentially be repeated until the optimal endpoint is available again.
43+
44+
{{< image filename="images/failover/failover-client-failback.svg" alt="Failback: client switches back to original server" width="75%" >}}
45+
46+
### Detecting connection problems
47+
48+
redis-py detects connection problems using a
49+
[circuit breaker design pattern](https://en.wikipedia.org/wiki/Circuit_breaker_design_pattern).
50+
51+
The circuit breaker is a software component that tracks the sequence of recent
52+
Redis connection attempts and commands, recording which ones have succeeded and
53+
which have failed.
54+
(Note that many command failures are caused by transient errors such as timeouts,
55+
so before recording a failure, the first response should usually be just to retry
56+
the command a few times.)
57+
58+
The status of the attempted command calls is kept in a "sliding window", which
59+
is simply a buffer where the least recent item is dropped as each new
60+
one is added. You can configure the circuit breaker to trip on a minimum number of failures and/or a failure rate (specified as a fraction of the recorded commands), both measured over a time window.
61+
62+
{{< image filename="images/failover/failover-sliding-window.svg" alt="Sliding window of recent connection attempts" >}}
63+
64+
When the number of failures in the window exceeds a configured
65+
threshold, the circuit breaker declares the server to be unhealthy and triggers
66+
a failover.
67+
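As a rough illustration of the idea, the sketch below implements a time-based sliding window in plain Python. The class name `SlidingWindowDetector`, its parameters, and the small thresholds are hypothetical and chosen only for the example. It also assumes that both the failure count and the failure rate must be reached before the breaker trips, which may not match how redis-py combines them.

```python
from collections import deque


class SlidingWindowDetector:
    """Hypothetical time-based sliding window; not redis-py's own class."""

    def __init__(self, window_seconds=2.0, min_failures=3, failure_rate=0.5):
        self.window_seconds = window_seconds
        self.min_failures = min_failures
        self.failure_rate = failure_rate
        self._events = deque()  # (timestamp, succeeded) pairs, oldest first

    def record(self, timestamp, succeeded):
        """Add the outcome of one command and drop outcomes that are too old."""
        self._events.append((timestamp, succeeded))
        self._evict(timestamp)

    def _evict(self, now):
        # Remove results that have aged out of the window.
        while self._events and now - self._events[0][0] > self.window_seconds:
            self._events.popleft()

    def is_unhealthy(self, now):
        self._evict(now)
        total = len(self._events)
        failures = sum(1 for _, ok in self._events if not ok)
        # Assumption: both the count and the rate thresholds must be reached.
        return (
            total > 0
            and failures >= self.min_failures
            and failures / total >= self.failure_rate
        )


detector = SlidingWindowDetector()
for t, ok in [(0.0, True), (0.5, False), (1.0, False), (1.5, False)]:
    detector.record(t, ok)

tripped = detector.is_unhealthy(now=1.5)        # three of four commands failed
recovered = not detector.is_unhealthy(now=10.0)  # old failures have aged out
```

Passing timestamps in explicitly, rather than reading the clock inside the class, keeps the sketch easy to test.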
68+
### Selecting a failover target
69+
70+
Since you may have multiple Redis servers available to fail over to, redis-py
71+
lets you configure a list of endpoints to try, ordered by priority or
72+
"weight". When a failover is triggered, redis-py selects the highest-weighted
73+
endpoint that is still healthy and uses it for the temporary connection.
74+
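The selection rule can be sketched in a few lines of plain Python. The `Endpoint` class and `select_target` function here are hypothetical stand-ins used only to show the rule, not part of the redis-py API.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Endpoint:
    """Hypothetical stand-in for a configured database endpoint."""
    name: str
    weight: float
    healthy: bool = True


def select_target(endpoints: List[Endpoint]) -> Optional[Endpoint]:
    """Return the healthy endpoint with the highest weight, or None."""
    healthy = [e for e in endpoints if e.healthy]
    return max(healthy, key=lambda e: e.weight, default=None)


endpoints = [
    Endpoint("redis-east", weight=1.0, healthy=False),  # preferred, but down
    Endpoint("redis-west", weight=0.5),
    Endpoint("redis-south", weight=0.2),
]
target = select_target(endpoints)  # redis-west: the best endpoint still healthy
```

The same rule also covers failback: once `redis-east` is marked healthy again, it becomes the highest-weighted candidate.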
75+
### Health checks
76+
77+
Given that the original endpoint had some geographical or other advantage
78+
over the failover target, you will generally want to fail back to it as soon
79+
as it recovers. In the meantime, another server might recover that is
80+
still better than the current failover target, so it might be worth
81+
failing back to that server even if it is not optimal.
82+
83+
redis-py periodically runs a "health check" on each server to see if it has recovered.
84+
The health check can be as simple as
85+
sending a Redis [`PING`]({{< relref "/commands/ping" >}}) command and ensuring
86+
that it gives the expected response.
87+
88+
You can also configure redis-py to run health checks on the current target
89+
server during periods of inactivity, even if no failover has occurred. This can
90+
help to detect problems even if your app is not actively using the server.
91+
92+
## Failover configuration
93+
94+
The example below shows a simple case with a list of two servers,
95+
`redis-east` and `redis-west`, where `redis-east` is the preferred
96+
target. If `redis-east` fails, redis-py should fail over to
97+
`redis-west`.
98+
99+
Supply the weighted endpoints using a list of `DatabaseConfig` objects.
100+
Use the `weight` option to order the endpoints, with the highest
101+
weight being tried first. Then, use the list to create a `MultiDbConfig` object,
102+
which you can pass to the `MultiDBClient` constructor to create the client.
103+
`MultiDBClient` implements the usual Redis commands using an internal
104+
`redis.Redis` instance by default, and also handles connection management and failover transparently.
105+
106+
```py
from redis.multidb.client import MultiDBClient
from redis.multidb.config import MultiDbConfig, DatabaseConfig

# Endpoints are tried in descending order of weight.
db_configs = [
    DatabaseConfig(
        client_kwargs={"host": "redis-east.example.com", "port": 14000},
        weight=1.0,  # Preferred endpoint
    ),
    DatabaseConfig(
        client_kwargs={"host": "redis-west.example.com", "port": 14000},
        weight=0.5,  # Failover target
    ),
]

cfg = MultiDbConfig(databases_config=db_configs)
client = MultiDBClient(cfg)
```
124+
125+
### Endpoint configuration
126+
127+
The `DatabaseConfig` class provides several options to configure each endpoint, as
128+
described in the table below. Supply the configurations for the whole set of
129+
endpoints by passing a list of `DatabaseConfig` objects to the `MultiDbConfig`
130+
constructor in the `databases_config` parameter.
131+
132+
| Option | Description |
133+
| --- | --- |
134+
| `client_kwargs` | Keyword parameters to pass to the internal client constructor for this endpoint. Use it to specify the host, port, username, password, and other connection parameters (see [Connect to the server]({{< relref "/develop/clients/redis-py/connect" >}}) for more information). This is especially useful if you are using a custom client class (see [Client configuration](#client-configuration) below for more information). |
135+
| `from_url` | Redis URL to connect to this endpoint, as an alternative to passing the host and port in `client_kwargs`. |
136+
| `from_pool` | A `ConnectionPool` to supply the endpoint connection (see [Connect with a connection pool]({{< relref "/develop/clients/redis-py/connect#connect-with-a-connection-pool" >}}) for more information). |
137+
| `weight` | Priority of the endpoint, with higher values being tried first. Default is `1.0`. |
138+
| `grace_period` | Duration in seconds to keep an unhealthy endpoint disabled before attempting a failback. Default is `60` seconds. |
139+
| `health_check_url` | URL for health checks that use the database's REST API (see [`LagAwareHealthCheck`](#lag-aware-health-check) for more information). |
140+
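To make the effect of `grace_period` concrete, the sketch below shows one way the rule could be expressed. The function `eligible_for_failback` and its arguments are hypothetical and only illustrate the description in the table. They are not redis-py internals.

```python
def eligible_for_failback(failed_at, now, passes_health_check, grace_period=60.0):
    """Hypothetical helper: True if an endpoint may be considered for failback.

    failed_at and now are timestamps in seconds.
    """
    grace_elapsed = (now - failed_at) >= grace_period
    return grace_elapsed and passes_health_check


# Healthy again after 30 seconds, but still inside the 60 second grace period.
too_soon = eligible_for_failback(failed_at=0.0, now=30.0, passes_health_check=True)

# Healthy and past the grace period, so failback can be attempted.
ready = eligible_for_failback(failed_at=0.0, now=61.0, passes_health_check=True)
```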
141+
### Client configuration
142+
143+
`MultiDbConfig` provides the `client_class` option to specify the class of the internal client to use for each endpoint. The default is the basic `redis.Redis` client, but
144+
you could, for example, replace this with `redis.asyncio.client.Redis` for an asynchronous basic client, or with `redis.cluster.RedisCluster`/`redis.asyncio.cluster.RedisCluster` for a cluster client. Use the `client_kwargs` option of `DatabaseConfig` to supply any extra parameters required by the client class (see [Endpoint configuration](#endpoint-configuration) above for more information).
145+
146+
```py
import redis.asyncio
from redis.multidb.config import MultiDbConfig

cfg = MultiDbConfig(
    # Other options omitted for brevity.
    databases_config=db_configs,
    client_class=redis.asyncio.client.Redis,
)
```
153+
154+
### Retry configuration
155+
156+
`MultiDbConfig` provides the `command_retry` option to configure retries for failed commands. This follows the usual approach to configuring retries used with a standard
157+
`redis.Redis` connection (see [Retries]({{< relref "/develop/clients/redis-py/produsage#retries" >}}) for more information).
158+
159+
```py
from redis.backoff import ExponentialWithJitterBackoff
from redis.retry import Retry

cfg = MultiDbConfig(
    databases_config=db_configs,
    # Retry failed commands up to three times using exponential backoff
    # with jitter between attempts.
    command_retry=Retry(
        retries=3,
        backoff=ExponentialWithJitterBackoff(base=1, cap=10),
    ),
)
```
171+
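To give a feel for what exponential backoff with jitter does to the delay between attempts, here is a generic sketch of the "full jitter" variant in plain Python. The function `backoff_delay` is hypothetical, and the exact formula that `ExponentialWithJitterBackoff` uses may differ. Only the roles of `base` and `cap` are taken from the example above.

```python
import random


def backoff_delay(failures, base=1.0, cap=10.0, rng=random):
    """Delay in seconds before the next retry ("full jitter" variant)."""
    # The ceiling doubles with each failure until it reaches the cap.
    ceiling = min(cap, base * 2 ** failures)
    # Choosing a random point below the ceiling spreads out retries from
    # many clients so that they do not all hit the server at once.
    return rng.uniform(0, ceiling)


rng = random.Random(42)  # seeded only to make the example repeatable
delays = [backoff_delay(n, rng=rng) for n in range(5)]
```

With `base=1` and `cap=10`, the upper bound on the delay grows as 1, 2, 4, 8 and then stays at 10 seconds.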
172+
### Health check configuration
173+
174+
Each health check consists of one or more separate "probes", each of which is a simple
175+
test (such as a [`PING`]({{< relref "/commands/ping" >}}) command) to determine if the database is available. The results of the separate probes are combined
176+
using a configurable policy to determine if the database is healthy. `MultiDbConfig` provides the following options to configure the health check behavior:
177+
178+
| Option | Description |
179+
| --- | --- |
180+
| `health_check_interval` | Time interval between successive health checks (each of which may consist of multiple probes). Default is `5` seconds. |
181+
| `health_check_probes` | Number of separate probes performed during each health check. Default is `3`. |
182+
| `health_check_probes_delay` | Delay between probes during a health check. Default is `0.5` seconds. |
183+
| `health_check_policy` | `HealthCheckPolicies` enum value to specify the policy for determining database health from the separate probes of a health check. The options are `HealthCheckPolicies.ALL` (all probes must succeed), `HealthCheckPolicies.ANY` (at least one probe must succeed), and `HealthCheckPolicies.MAJORITY` (more than half the probes must succeed). The default policy is `HealthCheckPolicies.MAJORITY`. |
184+
| `health_check` | Custom list of `HealthCheck` objects to specify how to perform each probe during a health check. This defaults to just the simple [`PingHealthCheck`](#pinghealthcheck-default). |
185+
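The three policies can be summarized with a short sketch. The `Policy` enum and `is_healthy` function below are hypothetical and simply restate the descriptions of `ALL`, `ANY`, and `MAJORITY` from the table. They are not the redis-py implementation.

```python
from enum import Enum


class Policy(Enum):
    """Hypothetical mirror of the three documented policies."""
    ALL = "all"
    ANY = "any"
    MAJORITY = "majority"


def is_healthy(probe_results, policy):
    """Combine the boolean results of the probes in one health check."""
    successes = sum(probe_results)
    if policy is Policy.ALL:
        return successes == len(probe_results)
    if policy is Policy.ANY:
        return successes >= 1
    # MAJORITY: more than half of the probes must succeed.
    return successes > len(probe_results) / 2


probes = [True, False, True]  # three probes, one of which failed
results = {p.name: is_healthy(probes, p) for p in Policy}
```

For this set of probes, `ALL` reports the database as unhealthy, while `ANY` and `MAJORITY` report it as healthy.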
186+
### Circuit breaker configuration
187+
188+
`MultiDbConfig` gives you several options to configure the circuit breaker:
189+
190+
| Option | Description |
191+
| --- | --- |
192+
| `failures_detection_window` | Duration in seconds to keep failures and successes in the sliding window. Default is `2` seconds. |
193+
| `min_num_failures` | Minimum number of failures that must occur to trigger a failover. Default is `1000`. |
194+
| `failure_rate_threshold` | Fraction of failed commands required to trigger a failover. Default is `0.1` (10%). |
195+
196+
### General failover configuration
197+
198+
There are also a few other options you can pass to the `MultiDbConfig` constructor to control the failover behavior:
199+
200+
| Option | Description |
201+
| --- | --- |
202+
| `failover_attempts` | Number of attempts to fail over to a new endpoint before giving up. Default is `10`. |
203+
| `failover_delay` | Time interval between successive failover attempts. Default is `12` seconds. |
204+
| `auto_fallback_interval` | Time interval between automatic failback attempts. Default is `30` seconds. |
205+
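The interaction between `failover_attempts` and `failover_delay` can be pictured as a simple loop. The function `fail_over` below is a hypothetical sketch of that idea rather than the library's code, and the `ConnectionError` it raises is only a placeholder for whatever error redis-py reports when it gives up.

```python
import time


def fail_over(find_healthy_target, attempts=10, delay=12.0, sleep=time.sleep):
    """Hypothetical loop: look for a healthy target, pausing between tries."""
    for attempt in range(1, attempts + 1):
        target = find_healthy_target()
        if target is not None:
            return target, attempt
        if attempt < attempts:
            sleep(delay)  # wait before checking the endpoints again
    raise ConnectionError(f"no healthy endpoint after {attempts} attempts")


# Simulate two failed lookups followed by a success. The sleeps are
# recorded in a list instead of actually pausing the program.
outcomes = iter([None, None, "redis-west"])
slept = []
target, attempts_used = fail_over(lambda: next(outcomes), sleep=slept.append)
```

With the default values in the table, a client that never finds a healthy endpoint would spend roughly 9 × 12 = 108 seconds between the first and the final attempt in this sketch.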
206+
## Health check strategies
207+
208+
There are several strategies available for health checks that you can configure using the
209+
`health_check` option of `MultiDbConfig`. The sections below explain these strategies
210+
in more detail.
211+
212+
### `PingHealthCheck` (default)
213+
214+
The default strategy, `PingHealthCheck`, periodically sends a Redis
215+
[`PING`]({{< relref "/commands/ping" >}}) command
216+
and checks that it gives the expected response. Any unexpected response
217+
or exception indicates an unhealthy server. Although `PingHealthCheck` is
218+
very simple, it is a good basic approach for most Redis deployments.
219+
220+
### `LagAwareHealthCheck` (Redis Enterprise only) {#lag-aware-health-check}
221+
222+
`LagAwareHealthCheck` is designed specifically for
223+
Redis Enterprise [Active-Active]({{< relref "/operate/rs/databases/active-active" >}})
224+
deployments. It determines the health of the server by using the
225+
[REST API]({{< relref "/operate/rs/references/rest-api" >}}) to check the
226+
synchronization lag between a specific database and the others in the Active-Active
227+
setup. If the lag is within a specified tolerance, the server is considered healthy.
228+
229+
`LagAwareHealthCheck` uses the `health_check_url` value for the endpoint
230+
to connect to the database's REST API, so you must specify this in
231+
the `DatabaseConfig` for each endpoint:
232+
233+
```py
db_configs = [
    DatabaseConfig(
        client_kwargs={"host": "redis-east.example.com", "port": 14000},
        weight=1.0,
        health_check_url="https://health.redis-east.example.com",
    ),
    DatabaseConfig(
        client_kwargs={"host": "redis-west.example.com", "port": 14000},
        weight=0.5,
        health_check_url="https://health.redis-west.example.com",
    ),
]
```
247+
248+
You must also add a `LagAwareHealthCheck` instance to the `health_check` list in
249+
the `MultiDbConfig` constructor:
250+
251+
```py
from redis.multidb.healthcheck import LagAwareHealthCheck

cfg = MultiDbConfig(
    databases_config=db_configs,
    health_check=[
        LagAwareHealthCheck(
            rest_api_port=9443,
            lag_aware_tolerance=100,  # ms
            verify_tls=True,
            # auth_basic=("user", "pass"),
            # ca_file="/path/ca.pem",
            # client_cert_file="/path/cert.pem",
            # client_key_file="/path/key.pem",
        )
    ],
    # Other options omitted for brevity.
)

client = MultiDBClient(cfg)
```
268+
269+
The `LagAwareHealthCheck` constructor accepts the following options:
270+
271+
| Option | Description |
272+
| --- | --- |
273+
| `rest_api_port` | Port number for Redis Enterprise REST API (default is 9443). |
274+
| `lag_aware_tolerance` | Tolerable synchronization lag between databases in milliseconds (default is 100ms). |
275+
| `timeout` | REST API request timeout in seconds (default is 30 seconds). |
276+
| `auth_basic` | Tuple of (username, password) for basic authentication. |
277+
| `verify_tls` | Whether to verify TLS certificates (defaults to `True`). |
278+
| `ca_file` | Path to CA certificate file for TLS verification. |
279+
| `ca_path` | Path to CA certificates directory for TLS verification. |
280+
| `ca_data` | CA certificate data as string or bytes. |
281+
| `client_cert_file` | Path to client certificate file for mutual TLS. |
282+
| `client_key_file` | Path to client private key file for mutual TLS. |
283+
| `client_key_password` | Password for encrypted client private key. |
284+
285+
### Custom health check strategy
286+
287+
You can supply your own custom health check strategy by
288+
deriving a new class from the `AbstractHealthCheck` class.
289+
For example, you might use this to integrate with external monitoring tools or
290+
to implement checks that are specific to your application. Add an
291+
instance of your custom class to the `health_check` list in
292+
the `MultiDbConfig` constructor, as with [`LagAwareHealthCheck`](#lag-aware-health-check).
293+
294+
The example below
295+
shows a simple custom strategy that sends a Redis [`ECHO`]({{< relref "/commands/echo" >}})
296+
command and checks for the expected response.
297+
298+
```py
from redis.backoff import NoBackoff
from redis.multidb.healthcheck import AbstractHealthCheck
from redis.retry import Retry
from redis.utils import dummy_fail


class EchoHealthCheck(AbstractHealthCheck):
    def __init__(self, retry: Retry):
        super().__init__(retry=retry)

    def check_health(self, database) -> bool:
        # Retry the probe according to the supplied retry policy.
        return self._retry.call_with_retry(
            lambda: self._returns_echo(database),
            lambda _: dummy_fail(),
        )

    def _returns_echo(self, database) -> bool:
        # The reply is bytes unless the client decodes responses.
        expected_message = ["Yodel-Ay-Ee-Oooo!", b"Yodel-Ay-Ee-Oooo!"]
        actual_message = database.client.execute_command("ECHO", "Yodel-Ay-Ee-Oooo!")
        return actual_message in expected_message


cfg = MultiDbConfig(
    databases_config=db_configs,
    # Retry requires a backoff strategy; NoBackoff retries immediately.
    health_check=[EchoHealthCheck(retry=Retry(backoff=NoBackoff(), retries=3))],
)

client = MultiDBClient(cfg)
```
324+
325+
## Managing databases at runtime
326+
327+
Although you will typically configure all databases during the
328+
initial connection, you can also modify the configuration at runtime.
329+
You can add and remove database endpoints, update their weights,
330+
and manually set the active database rather than waiting for the
331+
failback mechanism:
332+
333+
```py
import pybreaker
from redis import Redis
from redis.multidb.circuit import PBCircuitBreakerAdapter
from redis.multidb.client import MultiDBClient
from redis.multidb.config import MultiDbConfig, DatabaseConfig
from redis.multidb.database import Database

cfg = MultiDbConfig(
    databases_config=[
        DatabaseConfig(
            client_kwargs={"host": "redis-east.example.com", "port": 14000},
            weight=1.0,
        ),
        DatabaseConfig(
            client_kwargs={"host": "redis-west.example.com", "port": 14000},
            weight=0.5,
        ),
    ]
)
client = MultiDBClient(cfg)

# Add a database programmatically.
other = Database(
    client=Redis.from_url("redis://redis-south.example.com/0"),
    circuit=PBCircuitBreakerAdapter(pybreaker.CircuitBreaker(reset_timeout=5.0)),
    weight=0.5,
    health_check_url=None,
)
client.add_database(other)

# Update the new database's weight.
client.update_database_weight(other, 0.9)

# Manually set it as the active database.
client.set_active_database(other)

# Remove the database from the failover set.
client.remove_database(other)
```
373+
374+
## Troubleshooting
375+
376+
This section lists some common problems and their solutions.
377+
378+
### Excessive or constant health check failures
379+
380+
If all health checks fail, you should first rule out authentication
381+
problems with the Redis server and also make sure there are no persistent
382+
network connectivity problems. If you are using
383+
[`LagAwareHealthCheck`](#lag-aware-health-check), check that the `health_check_url`
384+
is set correctly for each endpoint. You can also try increasing the timeout
385+
for health checks and the interval between them. See
386+
[Health check configuration](#health-check-configuration) and
387+
[Endpoint configuration](#endpoint-configuration) for more information about these options.
388+
389+
### Slow failback after recovery
390+
391+
If failback is too slow after a server recovers, you can try
392+
reducing the `health_check_interval` period and also reducing the `grace_period`
393+
before failback is attempted (see [Health check configuration](#health-check-configuration)
394+
and [Endpoint configuration](#endpoint-configuration) for more information about these options).
