|
| 1 | +--- |
| 2 | +categories: |
| 3 | +- docs |
| 4 | +- develop |
| 5 | +- stack |
| 6 | +- oss |
| 7 | +- rs |
| 8 | +- rc |
| 9 | +- oss |
| 10 | +- kubernetes |
| 11 | +- clients |
| 12 | +description: Improve reliability using the failover/failback features of redis-py. |
| 13 | +linkTitle: Failover/failback |
| 14 | +title: Failover and failback |
| 15 | +weight: 65 |
| 16 | +bannerText: This feature is currently in preview and may be subject to change. |
| 17 | +--- |
| 18 | + |
| 19 | +redis-py supports [failover and failback](https://en.wikipedia.org/wiki/Failover) |
| 20 | +to improve the availability of connections to Redis databases. This page explains |
| 21 | +the concepts and describes how to configure redis-py for failover and failback. |
| 22 | + |
| 23 | +## Concepts |
| 24 | + |
| 25 | +You may have several [Active-Active databases]({{< relref "/operate/rs/databases/active-active" >}}) |
| 26 | +or independent Redis servers that are all suitable to serve your app. |
| 27 | +Typically, you would prefer to use some database endpoints over others for a particular |
| 28 | +instance of your app (perhaps the ones that are closest geographically to the app server |
| 29 | +to reduce network latency). However, if the best endpoint is not available due |
| 30 | +to a failure, it is generally better to switch to another, suboptimal endpoint |
| 31 | +than to let the app fail completely. |
| 32 | + |
| 33 | +*Failover* is the technique of actively checking for connection failures or |
| 34 | +unacceptably slow connections and automatically switching to the best available endpoint |
| 35 | +when they occur. This requires you to specify a list of endpoints to try, ordered by priority. The diagram below shows this process: |
| 36 | + |
| 37 | +{{< image filename="images/failover/failover-client-reconnect.svg" alt="Failover and client reconnection" >}} |
| 38 | + |
| 39 | +The complementary technique of *failback* then involves periodically checking the health |
| 40 | +of all endpoints that have failed. If any endpoints recover, the failback mechanism |
| 41 | +automatically switches the connection to the one with the highest priority. |
| 42 | +This could potentially be repeated until the optimal endpoint is available again. |
| 43 | + |
| 44 | +{{< image filename="images/failover/failover-client-failback.svg" alt="Failback: client switches back to original server" width="75%" >}} |
| 45 | + |
| 46 | +### Detecting connection problems |
| 47 | + |
| 48 | +redis-py detects connection problems using a |
| 49 | +[circuit breaker design pattern](https://en.wikipedia.org/wiki/Circuit_breaker_design_pattern). |
| 50 | + |
| 51 | +The circuit breaker is a software component that tracks the sequence of recent |
| 52 | +Redis connection attempts and commands, recording which ones have succeeded and |
| 53 | +which have failed. |
| 54 | +(Note that many command failures are caused by transient errors such as timeouts, |
| 55 | +so before recording a failure, the first response should usually be just to retry |
| 56 | +the command a few times.) |
| 57 | + |
| 58 | +The status of the attempted command calls is kept in a "sliding window", which |
| 59 | +is simply a buffer where the least recent item is dropped as each new |
| 60 | +one is added. You can configure the circuit breaker to trigger on a minimum number of failures and/or a failure ratio (the fraction of commands in the window that failed), both measured over a time window. |
| 61 | + |
| 62 | +{{< image filename="images/failover/failover-sliding-window.svg" alt="Sliding window of recent connection attempts" >}} |
| 63 | + |
| 64 | +When the number of failures in the window exceeds a configured |
| 65 | +threshold, the circuit breaker declares the server to be unhealthy and triggers |
| 66 | +a failover. |
| 67 | + |
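The failure detection described above can be sketched in a few lines of plain Python. This is a minimal illustration of a time-based sliding window, not redis-py's actual implementation. The `SlidingWindowBreaker` class, its method names, and its default values are hypothetical. It also assumes that both the minimum failure count and the failure ratio must be reached before the endpoint is declared unhealthy.

```python
from collections import deque


class SlidingWindowBreaker:
    """Hypothetical sketch of a time-based sliding window failure detector."""

    def __init__(self, window_seconds=2.0, min_failures=3, failure_rate=0.5):
        self.window_seconds = window_seconds
        self.min_failures = min_failures
        self.failure_rate = failure_rate
        self._events = deque()  # (timestamp, succeeded) pairs, oldest first

    def record(self, succeeded, now):
        """Add one command result and drop results older than the window."""
        self._events.append((now, succeeded))
        while self._events and now - self._events[0][0] > self.window_seconds:
            self._events.popleft()

    def is_unhealthy(self):
        """Assumption: both thresholds must be reached to trigger a failover."""
        total = len(self._events)
        if total == 0:
            return False
        failures = sum(1 for _, ok in self._events if not ok)
        return (
            failures >= self.min_failures
            and failures / total >= self.failure_rate
        )


breaker = SlidingWindowBreaker()
# Timestamps are passed explicitly so the example is deterministic.
for t, ok in [(0.0, True), (0.5, False), (1.0, False), (1.5, False)]:
    breaker.record(ok, now=t)
tripped = breaker.is_unhealthy()
```

Passing the timestamp in explicitly, rather than reading the clock inside `record`, keeps the sketch easy to test. A real circuit breaker would normally use a monotonic clock.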
| 68 | +### Selecting a failover target |
| 69 | + |
| 70 | +Since you may have multiple Redis servers available to fail over to, redis-py |
| 71 | +lets you configure a list of endpoints to try, ordered by priority or |
| 72 | +"weight". When a failover is triggered, redis-py selects the highest-weighted |
| 73 | +endpoint that is still healthy and uses it for the temporary connection. |
| 74 | + |
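The selection rule can be illustrated with a short, self-contained sketch. The `Endpoint` dataclass and the `select_target` function are hypothetical names used only for this example and are not part of the redis-py API. The sketch assumes that each endpoint has a weight and a health flag.

```python
from dataclasses import dataclass


@dataclass
class Endpoint:
    """Hypothetical stand-in for a configured database endpoint."""
    name: str
    weight: float
    healthy: bool = True


def select_target(endpoints):
    """Return the healthy endpoint with the highest weight, or None."""
    candidates = [e for e in endpoints if e.healthy]
    return max(candidates, key=lambda e: e.weight, default=None)


endpoints = [
    Endpoint("redis-east", 1.0),
    Endpoint("redis-west", 0.5),
    Endpoint("redis-south", 0.2),
]

# Simulate a failure of the preferred endpoint.
endpoints[0].healthy = False
failover_target = select_target(endpoints)

# The preferred endpoint recovers, so it is selected again.
endpoints[0].healthy = True
failback_target = select_target(endpoints)
```

The same function serves both directions. Failover and failback differ only in which endpoints are marked healthy when the selection runs.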
| 75 | +### Health checks |
| 76 | + |
| 77 | +Given that the original endpoint had some geographical or other advantage |
| 78 | +over the failover target, you will generally want to fail back to it as soon |
| 79 | +as it recovers. In the meantime, another server might recover that is |
| 80 | +still better than the current failover target, so it might be worth |
| 81 | +failing back to that server even if it is not optimal. |
| 82 | + |
| 83 | +redis-py periodically runs a "health check" on each server to see if it has recovered. |
| 84 | +The health check can be as simple as |
| 85 | +sending a Redis [`PING`]({{< relref "/commands/ping" >}}) command and ensuring |
| 86 | +that it gives the expected response. |
| 87 | + |
| 88 | +You can also configure redis-py to run health checks on the current target |
| 89 | +server during periods of inactivity, even if no failover has occurred. This can |
| 90 | +help to detect problems even if your app is not actively using the server. |
| 91 | + |
| 92 | +## Failover configuration |
| 93 | + |
| 94 | +The example below shows a simple case with a list of two servers, |
| 95 | +`redis-east` and `redis-west`, where `redis-east` is the preferred |
| 96 | +target. If `redis-east` fails, redis-py should fail over to |
| 97 | +`redis-west`. |
| 98 | + |
| 99 | +Supply the weighted endpoints using a list of `DatabaseConfig` objects. |
| 100 | +Use the `weight` option to order the endpoints, with the highest |
| 101 | +weight being tried first. Then, use the list to create a `MultiDbConfig` object, |
| 102 | +which you can pass to the `MultiDBClient` constructor to create the client. |
| 103 | +`MultiDBClient` implements the usual Redis commands using an internal |
| 104 | +`redis.Redis` client instance, but also handles connection management and failover transparently. |
| 105 | + |
| 106 | +```py |
| 107 | +from redis.multidb.client import MultiDBClient |
| 108 | +from redis.multidb.config import MultiDbConfig, DatabaseConfig |
| 109 | + |
| 110 | +db_configs = [ |
| 111 | +    DatabaseConfig( |
| 112 | +        client_kwargs={"host": "redis-east.example.com", "port": "14000"}, |
| 113 | +        weight=1.0 |
| 114 | +    ), |
| 115 | +    DatabaseConfig( |
| 116 | +        client_kwargs={"host": "redis-west.example.com", "port": "14000"}, |
| 117 | +        weight=0.5 |
| 118 | +    ), |
| 119 | +] |
| 120 | + |
| 121 | +cfg = MultiDbConfig(databases_config=db_configs) |
| 122 | +client = MultiDBClient(cfg) |
| 123 | +``` |
| 124 | + |
| 125 | +### Endpoint configuration |
| 126 | + |
| 127 | +The `DatabaseConfig` class provides several options to configure each endpoint, as |
| 128 | +described in the table below. Supply the configurations for the whole set of |
| 129 | +endpoints by passing a list of `DatabaseConfig` objects to the `MultiDbConfig` |
| 130 | +constructor in the `databases_config` parameter. |
| 131 | + |
| 132 | +| Option | Description | |
| 133 | +| --- | --- | |
| 134 | +| `client_kwargs` | Keyword parameters to pass to the internal client constructor for this endpoint. Use it to specify the host, port, username, password, and other connection parameters (see [Connect to the server]({{< relref "/develop/clients/redis-py/connect" >}}) for more information). This is especially useful if you are using a custom client class (see [Client configuration](#client-configuration) below for more information). | |
| 135 | +| `from_url` | Redis URL to connect to this endpoint, as an alternative to passing the host and port in `client_kwargs`. | |
| 136 | +| `from_pool` | A `ConnectionPool` to supply the endpoint connection (see [Connect with a connection pool]({{< relref "/develop/clients/redis-py/connect#connect-with-a-connection-pool" >}}) for more information). | |
| 137 | +| `weight` | Priority of the endpoint, with higher values being tried first. Default is `1.0`. | |
| 138 | +| `grace_period` | Duration in seconds to keep an unhealthy endpoint disabled before attempting a failback. Default is `60` seconds. | |
| 139 | +| `health_check_url` | URL for health checks that use the database's REST API (see [`LagAwareHealthCheck`](#lag-aware-health-check) for more information). | |
| 140 | + |
| 141 | +### Client configuration |
| 142 | + |
| 143 | +`MultiDbConfig` provides the `client_class` option to specify the class of the internal client to use for each endpoint. The default is the basic `redis.Redis` client, but |
| 144 | +you could, for example, replace this with `redis.asyncio.client.Redis` for an asynchronous basic client, or with `redis.cluster.RedisCluster`/`redis.asyncio.cluster.RedisCluster` for a cluster client. Use the `client_kwargs` option of `DatabaseConfig` to supply any extra parameters required by the client class (see [Endpoint configuration](#endpoint-configuration) above for more information). |
| 145 | + |
| 146 | +```py |
| 147 | +cfg = MultiDbConfig( |
| 148 | +    ... |
| 149 | +    client_class=redis.asyncio.client.Redis, |
| 150 | +    ... |
| 151 | +) |
| 152 | +``` |
| 153 | + |
| 154 | +### Retry configuration |
| 155 | + |
| 156 | +`MultiDbConfig` provides the `command_retry` option to configure retries for failed commands. This follows the usual approach to configuring retries used with a standard |
| 157 | +`redis.Redis` connection (see [Retries]({{< relref "/develop/clients/redis-py/produsage#retries" >}}) for more information). |
| 158 | + |
| 159 | +```py |
| 160 | +cfg = MultiDbConfig( |
| 161 | +    ... |
| 162 | +    # Retry failed commands up to three times using exponential backoff |
| 163 | +    # with jitter between attempts. |
| 164 | +    command_retry=Retry( |
| 165 | +        retries=3, |
| 166 | +        backoff=ExponentialWithJitterBackoff(base=1, cap=10), |
| 167 | +    ), |
| 168 | +    ... |
| 169 | +) |
| 170 | +``` |
| 171 | + |
| 172 | +### Health check configuration |
| 173 | + |
| 174 | +Each health check consists of one or more separate "probes", each of which is a simple |
| 175 | +test (such as a [`PING`]({{< relref "/commands/ping" >}}) command) to determine if the database is available. The results of the separate probes are combined |
| 176 | +using a configurable policy to determine if the database is healthy. `MultiDbConfig` provides the following options to configure the health check behavior: |
| 177 | + |
| 178 | +| Option | Description | |
| 179 | +| --- | --- | |
| 180 | +| `health_check_interval` | Time interval between successive health checks (each of which may consist of multiple probes). Default is `5` seconds. | |
| 181 | +| `health_check_probes` | Number of separate probes performed during each health check. Default is `3`. | |
| 182 | +| `health_check_probes_delay` | Delay between probes during a health check. Default is `0.5` seconds. | |
| 183 | +| `health_check_policy` | `HealthCheckPolicies` enum value to specify the policy for determining database health from the separate probes of a health check. The options are `HealthCheckPolicies.ALL` (all probes must succeed), `HealthCheckPolicies.ANY` (at least one probe must succeed), and `HealthCheckPolicies.MAJORITY` (more than half the probes must succeed). The default policy is `HealthCheckPolicies.MAJORITY`. | |
| 184 | +| `health_check` | Custom list of `HealthCheck` objects to specify how to perform each probe during a health check. This defaults to just the simple [`PingHealthCheck`](#pinghealthcheck-default). | |
| 185 | + |
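The three policies in the table above can be expressed as a small pure-Python function. This is only a sketch of the rules as described: the `Policy` enum and `is_healthy` function are hypothetical names and do not reproduce redis-py's `HealthCheckPolicies` implementation.

```python
from enum import Enum


class Policy(Enum):
    """Hypothetical mirror of the three policies described in the table."""
    ALL = "all"
    ANY = "any"
    MAJORITY = "majority"


def is_healthy(probe_results, policy):
    """Combine the boolean results of the probes in one health check."""
    successes = sum(probe_results)
    if policy is Policy.ALL:
        return successes == len(probe_results)
    if policy is Policy.ANY:
        return successes >= 1
    # MAJORITY: more than half of the probes must succeed.
    return successes > len(probe_results) / 2


probes = [True, False, True]  # Three probes, one of which failed.
results = {policy.name: is_healthy(probes, policy) for policy in Policy}
```

With three probes and one failure, only the `ALL` policy reports the database as unhealthy. A majority policy therefore tolerates an isolated transient failure, but not repeated ones.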
| 186 | +### Circuit breaker configuration |
| 187 | + |
| 188 | +`MultiDbConfig` gives you several options to configure the circuit breaker: |
| 189 | + |
| 190 | +| Option | Description | |
| 191 | +| --- | --- | |
| 192 | +| `failures_detection_window` | Duration in seconds to keep failures and successes in the sliding window. Default is `2` seconds. | |
| 193 | +| `min_num_failures` | Minimum number of failures that must occur to trigger a failover. Default is `1000`. | |
| 194 | +| `failure_rate_threshold` | Fraction of failed commands required to trigger a failover. Default is `0.1` (10%). | |
| 195 | + |
| 196 | +### General failover configuration |
| 197 | + |
| 198 | +There are also a few other options you can pass to the `MultiDbConfig` constructor to control the failover behavior: |
| 199 | + |
| 200 | +| Option | Description | |
| 201 | +| --- | --- | |
| 202 | +| `failover_attempts` | Number of attempts to fail over to a new endpoint before giving up. Default is `10`. | |
| 203 | +| `failover_delay` | Time interval between successive failover attempts. Default is `12` seconds. | |
| 204 | +| `auto_fallback_interval` | Time interval between automatic failback attempts. Default is `30` seconds. | |
| 205 | + |
| 206 | +## Health check strategies |
| 207 | + |
| 208 | +There are several strategies available for health checks that you can configure using the |
| 209 | +`health_check` option of `MultiDbConfig`. The sections below explain these strategies |
| 210 | +in more detail. |
| 211 | + |
| 212 | +### `PingHealthCheck` (default) |
| 213 | + |
| 214 | +The default strategy, `PingHealthCheck`, periodically sends a Redis |
| 215 | +[`PING`]({{< relref "/commands/ping" >}}) command |
| 216 | +and checks that it gives the expected response. Any unexpected response |
| 217 | +or exception indicates an unhealthy server. Although `PingHealthCheck` is |
| 218 | +very simple, it is a good basic approach for most Redis deployments. |
| 219 | + |
| 220 | +### `LagAwareHealthCheck` (Redis Enterprise only) {#lag-aware-health-check} |
| 221 | + |
| 222 | +`LagAwareHealthCheck` is designed specifically for |
| 223 | +Redis Enterprise [Active-Active]({{< relref "/operate/rs/databases/active-active" >}}) |
| 224 | +deployments. It determines the health of the server by using the |
| 225 | +[REST API]({{< relref "/operate/rs/references/rest-api" >}}) to check the |
| 226 | +synchronization lag between a specific database and the others in the Active-Active |
| 227 | +setup. If the lag is within a specified tolerance, the server is considered healthy. |
| 228 | + |
| 229 | +`LagAwareHealthCheck` uses the `health_check_url` value for the endpoint |
| 230 | +to connect to the database's REST API, so you must specify this in |
| 231 | +the `DatabaseConfig` for each endpoint: |
| 232 | + |
| 233 | +```py |
| 234 | +db_configs = [ |
| 235 | +    DatabaseConfig( |
| 236 | +        client_kwargs={"host": "redis-east.example.com", "port": "14000"}, |
| 237 | +        weight=1.0, |
| 238 | +        health_check_url="https://health.redis-east.example.com" |
| 239 | +    ), |
| 240 | +    DatabaseConfig( |
| 241 | +        client_kwargs={"host": "redis-west.example.com", "port": "14000"}, |
| 242 | +        weight=0.5, |
| 243 | +        health_check_url="https://health.redis-west.example.com" |
| 244 | +    ), |
| 245 | +] |
| 246 | +``` |
| 247 | + |
| 248 | +You must also add a `LagAwareHealthCheck` instance to the `health_check` list in |
| 249 | +the `MultiDbConfig` constructor: |
| 250 | + |
| 251 | +```py |
| 252 | +cfg = MultiDbConfig( |
| 253 | +    databases_config=db_configs, |
| 254 | +    health_check=[LagAwareHealthCheck( |
| 255 | +        rest_api_port=9443, |
| 256 | +        lag_aware_tolerance=100,  # ms |
| 257 | +        verify_tls=True, |
| 258 | +        # auth_basic=("user", "pass"), |
| 259 | +        # ca_file="/path/ca.pem", |
| 260 | +        # client_cert_file="/path/cert.pem", |
| 261 | +        # client_key_file="/path/key.pem", |
| 262 | +    )], |
| 263 | +    ... |
| 264 | +) |
| 265 | + |
| 266 | +client = MultiDBClient(cfg) |
| 267 | +``` |
| 268 | + |
| 269 | +The `LagAwareHealthCheck` constructor accepts the following options: |
| 270 | + |
| 271 | +| Option | Description | |
| 272 | +| --- | --- | |
| 273 | +| `rest_api_port` | Port number for Redis Enterprise REST API (default is 9443). | |
| 274 | +| `lag_aware_tolerance` | Tolerable synchronization lag between databases in milliseconds (default is 100ms). | |
| 275 | +| `timeout` | REST API request timeout in seconds (default is 30 seconds). | |
| 276 | +| `auth_basic` | Tuple of (username, password) for basic authentication. | |
| 277 | +| `verify_tls` | Whether to verify TLS certificates (defaults to `True`). | |
| 278 | +| `ca_file` | Path to CA certificate file for TLS verification. | |
| 279 | +| `ca_path` | Path to CA certificates directory for TLS verification. | |
| 280 | +| `ca_data` | CA certificate data as string or bytes. | |
| 281 | +| `client_cert_file` | Path to client certificate file for mutual TLS. | |
| 282 | +| `client_key_file` | Path to client private key file for mutual TLS. | |
| 283 | +| `client_key_password` | Password for encrypted client private key. | |
| 284 | + |
| 285 | +### Custom health check strategy |
| 286 | + |
| 287 | +You can supply your own custom health check strategy by |
| 288 | +deriving a new class from the `AbstractHealthCheck` class. |
| 289 | +For example, you might use this to integrate with external monitoring tools or |
| 290 | +to implement checks that are specific to your application. Add an |
| 291 | +instance of your custom class to the `health_check` list in |
| 292 | +the `MultiDbConfig` constructor, as with [`LagAwareHealthCheck`](#lag-aware-health-check). |
| 293 | + |
| 294 | +The example below |
| 295 | +shows a simple custom strategy that sends a Redis [`ECHO`]({{< relref "/commands/echo" >}}) |
| 296 | +command and checks for the expected response. |
| 297 | + |
| 298 | +```py |
| 299 | +from redis.multidb.healthcheck import AbstractHealthCheck |
| 300 | +from redis.retry import Retry |
| 301 | +from redis.utils import dummy_fail |
| 302 | + |
| 303 | +class EchoHealthCheck(AbstractHealthCheck): |
| 304 | +    def __init__(self, retry: Retry): |
| 305 | +        super().__init__(retry=retry) |
| 306 | +    def check_health(self, database) -> bool: |
| 307 | +        return self._retry.call_with_retry( |
| 308 | +            lambda: self._returns_echo(database), |
| 309 | +            lambda _: dummy_fail() |
| 310 | +        ) |
| 311 | +    def _returns_echo(self, database) -> bool: |
| 312 | +        expected_message = ["Yodel-Ay-Ee-Oooo!", b"Yodel-Ay-Ee-Oooo!"] |
| 313 | +        actual_message = database.client.execute_command("ECHO", "Yodel-Ay-Ee-Oooo!") |
| 314 | +        return actual_message in expected_message |
| 315 | + |
| 316 | +cfg = MultiDbConfig( |
| 317 | +    ... |
| 318 | +    health_check=[EchoHealthCheck(retry=Retry(retries=3, backoff=ExponentialWithJitterBackoff(base=1, cap=10)))], |
| 319 | +    ... |
| 320 | +) |
| 321 | + |
| 322 | +client = MultiDBClient(cfg) |
| 323 | +``` |
| 324 | + |
| 325 | +## Managing databases at runtime |
| 326 | + |
| 327 | +Although you will typically configure all databases during the |
| 328 | +initial connection, you can also modify the configuration at runtime. |
| 329 | +You can add and remove database endpoints, update their weights, |
| 330 | +and manually set the active database rather than waiting for the |
| 331 | +failback mechanism: |
| 332 | + |
| 333 | +```py |
| 334 | +from redis.multidb.client import MultiDBClient |
| 335 | +from redis.multidb.config import MultiDbConfig, DatabaseConfig |
| 336 | +from redis.multidb.database import Database |
| 337 | +from redis.multidb.circuit import PBCircuitBreakerAdapter |
| 338 | +import pybreaker |
| 339 | +from redis import Redis |
| 340 | + |
| 341 | +cfg = MultiDbConfig( |
| 342 | +    databases_config=[ |
| 343 | +        DatabaseConfig( |
| 344 | +            client_kwargs={"host": "redis-east.example.com", "port": "14000"}, |
| 345 | +            weight=1.0 |
| 346 | +        ), |
| 347 | +        DatabaseConfig( |
| 348 | +            client_kwargs={"host": "redis-west.example.com", "port": "14000"}, |
| 349 | +            weight=0.5 |
| 350 | +        ), |
| 351 | +    ] |
| 352 | +) |
| 353 | +client = MultiDBClient(cfg) |
| 354 | + |
| 355 | +# Add a database programmatically. |
| 356 | +other = Database( |
| 357 | +    client=Redis.from_url("redis://redis-south.example.com/0"), |
| 358 | +    circuit=PBCircuitBreakerAdapter(pybreaker.CircuitBreaker(reset_timeout=5.0)), |
| 359 | +    weight=0.5, |
| 360 | +    health_check_url=None, |
| 361 | +) |
| 362 | +client.add_database(other) |
| 363 | + |
| 364 | +# Update the new database's weight. |
| 365 | +client.update_database_weight(other, 0.9) |
| 366 | + |
| 367 | +# Manually set it as the active database. |
| 368 | +client.set_active_database(other) |
| 369 | + |
| 370 | +# Remove the database from the failover set. |
| 371 | +client.remove_database(other) |
| 372 | +``` |
| 373 | + |
| 374 | +## Troubleshooting |
| 375 | + |
| 376 | +This section lists some common problems and their solutions. |
| 377 | + |
| 378 | +### Excessive or constant health check failures |
| 379 | + |
| 380 | +If all health checks fail, you should first rule out authentication |
| 381 | +problems with the Redis server and also make sure there are no persistent |
| 382 | +network connectivity problems. If you are using |
| 383 | +[`LagAwareHealthCheck`](#lag-aware-health-check), check that the `health_check_url` |
| 384 | +is set correctly for each endpoint. You can also try increasing the timeout |
| 385 | +for health checks and the interval between them. See |
| 386 | +[Health check configuration](#health-check-configuration) and |
| 387 | +[Endpoint configuration](#endpoint-configuration) for more information about these options. |
| 388 | + |
| 389 | +### Slow failback after recovery |
| 390 | + |
| 391 | +If failback is too slow after a server recovers, you can try |
| 392 | +reducing the `health_check_interval` period and also reducing the `grace_period` |
| 393 | +before failback is attempted (see [Health check configuration](#health-check-configuration) |
| 394 | +and [Endpoint configuration](#endpoint-configuration) for more information about these options). |