Summary
The coordinator's load balancer exhibits shard ping-pong behavior — shards are swapped from server A to B, then immediately back from B to A within 30 seconds. This creates unnecessary churn, increases client-visible latency during leader transitions, and wastes resources on follower catch-up.
Environment
- Oxia image: oxia/oxia@sha256:bb56deee95e0ca24589333c7c6b032a2b220c892e0afd8e689204c7c6535684b (main branch)
- Cluster: 7 data servers, 32 shards, replication factor 3, notifications enabled
- Active chaos testing: pod kills every 5 min, network delays (100ms±50ms) every 3 min, coordinator kills every 10 min
- OKK test cases: 4 running (basic-kv, streaming-sequence, metadata-ephemeral, notification)
Observed Behavior
10 out of 32 shards exhibited A→B→A bounce-back according to the coordinator's logs over a ~12-minute window:
| Shard | Bounces | Pattern | Time Between Swaps |
|---|---|---|---|
| 0 | 1 | okk-2 → okk-6 → okk-2 | 30s |
| 6 | 1 | okk-6 → okk-3 → okk-6 | 30s |
| 8 | 1 | okk-6 → okk-2 → okk-6 | 30s |
| 17 | 1 | okk-4 → okk-2 → okk-4 | 43s |
| 19 | 1 | okk-1 → okk-4 → okk-1 | 2.5 min |
| 20 | 1 | okk-6 → okk-2 → okk-6 | 30s |
| 21 | 1 | okk-4 → okk-2 → okk-4 | 43s |
| 22 | 1 | okk-6 → okk-2 → okk-6 | 30s |
| 24 | 2 | okk-6 → okk-5 → okk-6, okk-6 → okk-2 → okk-6 | 30s each |
| 26 | 2 | okk-6 → okk-2 → okk-6, okk-2 → okk-6 → okk-2 | 30s each |
Excessive movement: Most shards had 8-13 total moves in ~12 minutes. Example shard movement paths:
Shard 1 (10 moves): okk-1 → okk-6 → okk-3 → okk-4 → okk-2 → okk-6 → okk-5 → okk-6 → okk-1 → okk-2 → okk-6
Shard 6 (13 moves): okk-1 → okk-6 → okk-3 → okk-4 → okk-2 → okk-6 → okk-0 → okk-5 → okk-1 → okk-6 → okk-2 → okk-6 → okk-3 → okk-6
Shard 26 (13 moves): okk-1 → okk-6 → okk-3 → okk-4 → okk-2 → okk-6 → okk-0 → okk-5 → okk-2 → okk-6 → okk-2 → okk-6 → okk-1 → okk-4
Coordinator Log Evidence
// 00:57:03 - shard 6 moved from okk-6 to okk-3
{"component":"load-balancer","shard":6,"from":"data-server-okk-6","to":"data-server-okk-3","message":"propose to swap the shard"}
// 00:57:33 - shard 6 moved BACK from okk-3 to okk-6 (30 seconds later)
{"component":"load-balancer","shard":6,"from":"data-server-okk-3","to":"data-server-okk-6","message":"propose to swap the shard"}
Expected Behavior
After a shard swap, the load balancer should have a stabilization/cooldown window before considering the same shard for another swap. This would prevent oscillation and allow the cluster to converge to a stable state.
Suggested Fix
Consider one or more of:
- Cooldown per shard: After swapping shard X, don't propose another swap for shard X for N seconds (e.g., 5 min)
- Dampening threshold: Only swap if the imbalance exceeds a minimum threshold (e.g., >2 leaders difference)
- Swap history check: Before proposing A→B, check if B→A happened recently and skip if so
- Batch evaluation: Evaluate balance globally rather than per-tick, considering pending swaps
Impact
- Unnecessary leader transitions increase client-visible latency
- Follower catch-up after each swap wastes network/CPU
- Under chaos testing, the oscillation is amplified — shards never settle
- Despite the churn, no data loss was observed (DB checksums remain consistent)
Context
Discovered during OKK testing framework verification of proposal #849 (Shard Fingerprint Consistency Validation). The excessive shard movement may also be contributing to the WAL checksum divergences reported in #898.