
Commit 9879716

Address PR feedback for Active-Active app failover documentation
- Make sharding monitoring requirements less prescriptive, offer database-level and per-shard approaches
- Convert asymmetric sharding section to a note for cleaner structure
- Move dataset monitoring warning to Failback criteria section for better context
- Fix Next steps section with appropriate links and remove broken monitoring link
- Remove inappropriate generic troubleshooting content to keep focus on Redis Enterprise specifics
1 parent 138afcb commit 9879716

1 file changed

content/operate/rs/databases/active-active/develop/app-failover-active-active.md

Lines changed: 40 additions & 48 deletions
@@ -70,6 +70,10 @@ Your application should monitor local replica failures and replication failures.
 
 The most reliable way to detect replication failures is using Redis pub/sub.
 
+{{< tip >}}
+**Why pub/sub works**: Pub/sub messages are delivered as replicated effects and are a more reliable indicator of a live replication link. In certain cases, dataset keys may appear to be modified even if the replication link fails. This happens because keys may receive updates through full-state replication (re-sync) or through online replication of effects. Unlike dataset changes, pub/sub doesn't make assumptions about your data structure.
+{{< /tip >}}
+
 ### How it works
 
 1. Subscribe to a dedicated health-check channel on each replica.
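
The hunk above describes the pub/sub health check at a high level. As a complement, here is a minimal sketch of the publish-and-listen loop, not the documented implementation: the replica endpoints, channel name, and timeout are assumptions, and `mark_replica_unhealthy` stands in for the failover action shown in the documentation's pseudocode.

```python
# Minimal sketch of the pub/sub health check; adapt names and values to your deployment.
import time
import redis

TIMEOUT_SECONDS = 5            # assumed heartbeat window; tune for your network
CHANNEL = "health-check"       # illustrative channel name

replicas = {                   # hypothetical replica endpoints
    "replica-a": redis.Redis(host="replica-a.example.com", port=12000),
    "replica-b": redis.Redis(host="replica-b.example.com", port=12000),
}

def mark_replica_unhealthy(name):
    # Stand-in for the failover action described later in the document.
    print(f"replication to {name} appears broken")

def check_replication(source_name):
    """Publish a heartbeat through one replica and confirm delivery on the others."""
    listeners = {}
    for name, client in replicas.items():
        if name == source_name:
            continue
        pubsub = client.pubsub(ignore_subscribe_messages=True)
        pubsub.subscribe(CHANNEL)
        listeners[name] = pubsub

    # The heartbeat should reach the other replicas as a replicated effect.
    replicas[source_name].publish(CHANNEL, str(time.time()))

    for name, pubsub in listeners.items():
        received = False
        deadline = time.time() + TIMEOUT_SECONDS
        while time.time() < deadline and not received:
            # With ignore_subscribe_messages=True, only real messages are returned.
            received = pubsub.get_message(timeout=0.5) is not None
        if not received:
            mark_replica_unhealthy(name)
        pubsub.close()
```

Publishing through one replica and waiting for the message on the others exercises the replication link itself, which is the property the tip above relies on.
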
@@ -125,18 +129,38 @@ The most reliable way to detect replication failures is using Redis pub/sub.
     mark_replica_unhealthy(replica_name)
 ```
 
-{{< tip >}}
-**Why pub/sub works**: Pub/sub messages are delivered as replicated effects, making them a reliable indicator of active replication links. Unlike dataset changes, pub/sub doesn't make assumptions about your data structure.
-{{< /tip >}}
-
 ## Handle sharded databases
 
-If your Active-Active database uses sharding, you need to monitor each shard individually:
+If your Active-Active database uses sharding, you have several monitoring approaches:
+
+### Database-level monitoring (simpler approach)
+
+For many use cases, you can monitor the entire database using a single pub/sub channel per replica. This approach:
+
+- **Works well when**: All shards typically fail together (node failures, network partitions)
+- **Simpler to implement**: Uses the same monitoring logic as non-sharded databases
+- **May miss**: Individual shard failures that don't affect the entire database
+
+```python
+# Example implementation - adapt for your environment
+# Use the same approach as non-sharded databases
+for name, client in replicas.items():
+    client.subscribe(f'health-check-{name}')
+```
+
+### Per-shard monitoring (comprehensive approach)
+
+Monitor each shard individually when you need to detect partial database failures:
 
-### Symmetric sharding (recommended)
+#### Symmetric sharding (recommended)
 
 With symmetric sharding, all replicas have the same number of shards and hash slots.
 
+**When to use per-shard monitoring**:
+- You need to detect individual shard failures
+- Your application can handle partial database availability
+- You want maximum visibility into database health
+
 **Monitoring approach**:
 1. Use the Cluster API to get the sharding configuration
 2. Create one pub/sub channel per shard
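
For the per-shard approach introduced in this hunk, the following is a minimal sketch of how the subscriptions might be wired up. It assumes the `get_channels_per_shard` helper referenced in the next hunk returns one channel name per shard, and that `replicas` maps replica names to redis-py clients; the callback and timeout are illustrative, not documented values.

```python
# Minimal sketch: per-shard health checks layered on the get_channels_per_shard
# helper from the documentation. Callback, timeout, and replicas mapping are
# illustrative assumptions.
import time

TIMEOUT_SECONDS = 5  # assumed per-shard heartbeat window

def monitor_shards(replicas, get_channels_per_shard, mark_shard_unhealthy):
    """replicas: name -> redis-py client, one per Active-Active replica."""
    for name, client in replicas.items():
        for channel in get_channels_per_shard(client):
            pubsub = client.pubsub(ignore_subscribe_messages=True)
            pubsub.subscribe(channel)
            received = False
            deadline = time.time() + TIMEOUT_SECONDS
            while time.time() < deadline and not received:
                received = pubsub.get_message(timeout=0.5) is not None
            if not received:
                # Flags only this shard's channel, not the whole replica.
                mark_shard_unhealthy(name, channel)
            pubsub.close()
```
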
@@ -157,9 +181,9 @@ def get_channels_per_shard(redis_client):
     return channels
 ```
 
-### Asymmetric sharding (not recommended)
-
-Asymmetric configurations require monitoring every hash slot intersection, which is complex and error-prone.
+{{< note >}}
+**Asymmetric sharding**: Asymmetric configurations require monitoring every hash slot intersection, which is complex and error-prone. For asymmetric sharding, database-level monitoring is often more practical than per-shard monitoring.
+{{< /note >}}
 
 ## Implement failover
 
@@ -208,6 +232,10 @@ A replica is ready for failback when it's:
 2. **Synchronized**: Caught up with changes from other replicas.
 3. **Not stale**: You can read and write to the replica.
 
+{{< warning >}}
+**Avoid dataset-based monitoring**: Don't rely solely on reading/writing test keys to determine replica health. Replicas can appear healthy while still in stale mode or missing recent updates.
+{{< /warning >}}
+
 ### Failback process
 
 1. Verify replica health:
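
As a complement to the failback criteria in this hunk, here is a minimal sketch of a failback gate that relies on the pub/sub health signal rather than test keys, in line with the warning above. The threshold, interval, and callback are assumptions, not documented values.

```python
# Minimal sketch of a failback gate; not the documented process.
import time

REQUIRED_SUCCESSES = 3   # assumed number of consecutive healthy checks before failback
CHECK_INTERVAL = 10      # assumed seconds between checks

def ready_for_failback(replica_name, is_replica_healthy):
    """is_replica_healthy: callable returning True when the pub/sub heartbeat
    for replica_name arrives within the timeout window."""
    successes = 0
    while successes < REQUIRED_SUCCESSES:
        if is_replica_healthy(replica_name):
            successes += 1
        else:
            successes = 0   # a single failed check resets the streak
        time.sleep(CHECK_INTERVAL)
    return True
```
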
@@ -248,10 +276,6 @@ A replica is ready for failback when it's:
 redirect_writes_to(primary_replica)
 ```
 
-{{< warning >}}
-**Avoid dataset-based monitoring**: Don't rely solely on reading/writing test keys to determine replica health. Replicas can appear healthy while still in stale mode or missing recent updates.
-{{< /warning >}}
-
 ## Configuration best practices
 
 ### Application-side failover only
@@ -344,42 +368,10 @@ class FailoverRedisClient:
 pass
 ```
 
-## Next steps
-
-- [Configure Active-Active databases]({{< relref "/operate/rs/databases/active-active/create" >}})
-- [Monitor Active-Active replication]({{< relref "/operate/rs/databases/active-active/monitor" >}})
-- [Develop applications with Active-Active databases]({{< relref "/operate/rs/databases/active-active/develop" >}})
-
-## Troubleshooting common issues
-
-### False positive failure detection
-
-**Problem**: Application detects failures when replicas are actually healthy.
-
-**Solutions**:
-- Increase heartbeat timeout windows
-- Use multiple consecutive failures before triggering failover
-- Monitor network latency between replicas
-
-### Split-brain scenarios
-
-**Problem**: Network partition causes multiple replicas to appear as "primary" to different application instances.
-
-**Solutions**:
-- Implement consensus mechanisms in your application
-- Use external coordination services (like Consul or etcd)
-- Design for eventual consistency
-
-### Slow failback
-
-**Problem**: Replica appears healthy but failback causes performance issues.
-
-**Solutions**:
-- Implement gradual failback (reads first, then writes)
-- Monitor replica performance metrics during failback
-- Use canary deployments for failback testing
-
 ## Related topics
 
+- [Manage Active-Active databases]({{< relref "/operate/rs/databases/active-active/manage" >}})
+- [Active-Active database synchronization]({{< relref "/operate/rs/databases/active-active/syncer" >}})
+- [Monitor Redis Enterprise Software]({{< relref "/operate/rs/monitoring" >}})
 - [Redis pub/sub]({{< relref "/develop/interact/pubsub" >}})
 - [OSS Cluster API]({{< relref "/operate/rs/clusters/optimize/oss-cluster-api/" >}})
