Address PR feedback for Active-Active app failover documentation
- Make sharding monitoring requirements less prescriptive, offer database-level and per-shard approaches
- Convert asymmetric sharding section to a note for cleaner structure
- Move dataset monitoring warning to Failback criteria section for better context
- Fix Next steps section with appropriate links and remove broken monitoring link
- Remove inappropriate generic troubleshooting content to keep focus on Redis Enterprise specifics
content/operate/rs/databases/active-active/develop/app-failover-active-active.md
40 additions & 48 deletions
@@ -70,6 +70,10 @@ Your application should monitor local replica failures and replication failures.
 
 The most reliable way to detect replication failures is using Redis pub/sub.
 
+{{< tip >}}
+**Why pub/sub works**: Pub/sub messages are delivered as replicated effects and are a more reliable indicator of a live replication link. In certain cases, dataset keys may appear to be modified even if the replication link fails. This happens because keys may receive updates through full-state replication (re-sync) or through online replication of effects. Unlike dataset changes, pub/sub doesn't make assumptions about your data structure.
+{{< /tip >}}
+
 ### How it works
 
 1. Subscribe to a dedicated health-check channel on each replica.
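
For readers adapting this hunk, here is a minimal, self-contained sketch of the cross-replica heartbeat loop the section describes. It assumes redis-py; the endpoints in `replicas`, the `health-check-*` channel scheme, the timeout, and `mark_replica_unhealthy` are all placeholders to adapt for your environment:

```python
import time
import redis

# Hypothetical replica endpoints - adapt for your environment.
replicas = {
    'us-east': redis.Redis(host='redis-us-east.example.com', port=12000),
    'eu-west': redis.Redis(host='redis-eu-west.example.com', port=12000),
}

TIMEOUT = 10  # seconds without a heartbeat before a link is suspect
last_seen = {}

# Subscribe on every replica to the channels the *other* replicas
# publish on; a heartbeat arrives only if replication delivered it.
subs = {}
for name, client in replicas.items():
    pubsub = client.pubsub()
    for source in replicas:
        if source != name:
            pubsub.subscribe(f'health-check-{source}')
            last_seen[(source, name)] = time.monotonic()
    subs[name] = pubsub

def mark_replica_unhealthy(replica_name):
    print(f'replica {replica_name} unhealthy')  # hook up your failover logic

while True:
    # Each replica publishes its own heartbeat locally...
    for name, client in replicas.items():
        client.publish(f'health-check-{name}', str(time.time()))

    # ...and we check whether it reached the other replicas.
    for name, pubsub in subs.items():
        msg = pubsub.get_message(ignore_subscribe_messages=True)
        if msg:
            source = msg['channel'].decode().removeprefix('health-check-')
            last_seen[(source, name)] = time.monotonic()

    for (source, _dest), seen in last_seen.items():
        if time.monotonic() - seen > TIMEOUT:
            mark_replica_unhealthy(source)

    time.sleep(1)
```

Publishing on one replica and timing out the subscription on the others exercises the replication link itself, which is exactly the point of the tip above: a heartbeat that stops arriving indicates a dead link, regardless of what the dataset looks like.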
@@ -125,18 +129,38 @@ The most reliable way to detect replication failures is using Redis pub/sub.
     mark_replica_unhealthy(replica_name)
 ```
 
-{{< tip >}}
-**Why pub/sub works**: Pub/sub messages are delivered as replicated effects, making them a reliable indicator of active replication links. Unlike dataset changes, pub/sub doesn't make assumptions about your data structure.
-{{< /tip >}}
-
 ## Handle sharded databases
 
-If your Active-Active database uses sharding, you need to monitor each shard individually:
+If your Active-Active database uses sharding, you have several monitoring approaches:
+
+### Database-level monitoring (simpler approach)
+
+For many use cases, you can monitor the entire database using a single pub/sub channel per replica. This approach:
+
+- **Works well when**: All shards typically fail together (node failures, network partitions)
+- **Simpler to implement**: Uses the same monitoring logic as non-sharded databases
+- **May miss**: Individual shard failures that don't affect the entire database
+
+```python
+# Example implementation - adapt for your environment
+# Use the same approach as non-sharded databases
+for name, client in replicas.items():
+    client.subscribe(f'health-check-{name}')
+```
+
+### Per-shard monitoring (comprehensive approach)
+
+Monitor each shard individually when you need to detect partial database failures:
 
-### Symmetric sharding (recommended)
+#### Symmetric sharding (recommended)
 
 With symmetric sharding, all replicas have the same number of shards and hash slots.
 
+**When to use per-shard monitoring**:
+- You need to detect individual shard failures
+- Your application can handle partial database availability
+- You want maximum visibility into database health
+
 **Monitoring approach**:
 1. Use the Cluster API to get the sharding configuration

…

-Asymmetric configurations require monitoring every hash slot intersection, which is complex and error-prone.
+{{< note >}}
+**Asymmetric sharding**: Asymmetric configurations require monitoring every hash slot intersection, which is complex and error-prone. For asymmetric sharding, database-level monitoring is often more practical than per-shard monitoring.
+{{< /note >}}
 
 ## Implement failover
 
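
For the per-shard approach, one way to derive a distinct health-check channel per shard is to pick hashtags that hash into each shard's slot range. Below is a sketch that assumes the slot ranges were already fetched via the Cluster API (step 1 above) and that redis-py's `redis.crc.key_slot` CRC16 helper is available; `shard_slot_ranges` and the `hc-` tag scheme are hypothetical:

```python
from redis.crc import key_slot  # redis-py's CRC16 hash-slot helper

# Hypothetical slot ranges per shard, as returned by your Cluster API
# query; adapt to your actual configuration.
shard_slot_ranges = {
    'shard-1': (0, 8191),
    'shard-2': (8192, 16383),
}

def tag_for_shard(lo: int, hi: int) -> str:
    """Brute-force a hashtag whose slot lands in [lo, hi]."""
    i = 0
    while True:
        tag = f'hc-{i}'
        if lo <= key_slot(tag.encode()) <= hi:
            return tag
        i += 1

# One health-check channel per shard, named so that each channel's
# {hashtag} hashes into that shard's slot range.
channels = {
    shard: f'health-check-{{{tag_for_shard(lo, hi)}}}'
    for shard, (lo, hi) in shard_slot_ranges.items()
}
```

Each channel can then be fed into the same heartbeat loop shown earlier. If your deployment supports sharded pub/sub (`SSUBSCRIBE`/`SPUBLISH`, Redis 7+), delivery is confined to the shard owning the hashtag's slot, so a channel that goes silent isolates the failed shard rather than the whole database.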
@@ -208,6 +232,10 @@ A replica is ready for failback when it's:
 2. **Synchronized**: Caught up with changes from other replicas.
 3. **Not stale**: You can read and write to the replica.
 
+{{< warning >}}
+**Avoid dataset-based monitoring**: Don't rely solely on reading/writing test keys to determine replica health. Replicas can appear healthy while still in stale mode or missing recent updates.
+{{< /warning >}}
+
 ### Failback process
 
 1. Verify replica health:
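
A sketch of the three criteria as a single gate, reusing the `last_seen` heartbeat bookkeeping from the monitoring sketch earlier; the `failback-probe` key name and freshness threshold are assumptions:

```python
import time
import redis

HEARTBEAT_FRESHNESS = 10  # seconds; assumed threshold, tune for your lag

def replica_ready_for_failback(name, client, last_seen, peers):
    """Gate the three failback criteria for one replica."""
    try:
        # Available and not stale: a stale replica rejects commands,
        # so a successful write/read round trip rules that out.
        client.set('failback-probe', str(time.time()))
        client.get('failback-probe')
    except redis.exceptions.RedisError:
        return False

    # Synchronized: require a recent replicated heartbeat from every
    # peer, rather than trusting the probe key alone (see warning).
    return all(
        time.monotonic() - last_seen[(peer, name)] < HEARTBEAT_FRESHNESS
        for peer in peers
    )
```

Combining the probe with heartbeat freshness honors the warning above: the write proves the replica accepts commands, while the heartbeats prove it is actually receiving replicated effects.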
@@ -248,10 +276,6 @@ A replica is ready for failback when it's:
     redirect_writes_to(primary_replica)
 ```
 
-{{< warning >}}
-**Avoid dataset-based monitoring**: Don't rely solely on reading/writing test keys to determine replica health. Replicas can appear healthy while still in stale mode or missing recent updates.
-{{< /warning >}}
-