
Load balancer shard ping-pong: shards oscillate between servers without stabilization #899

@mattisonchao

Description

Summary

The coordinator's load balancer exhibits shard ping-pong behavior — shards are swapped from server A to B, then immediately back from B to A within 30 seconds. This creates unnecessary churn, increases client-visible latency during leader transitions, and wastes resources on follower catch-up.

Environment

  • Oxia image: oxia/oxia@sha256:bb56deee95e0ca24589333c7c6b032a2b220c892e0afd8e689204c7c6535684b (main branch)
  • Cluster: 7 data servers, 32 shards, replication factor 3, notifications enabled
  • Active chaos testing: pod kills every 5 min, network delays (100ms±50ms) every 3 min, coordinator kills every 10 min
  • OKK test cases: 4 running (basic-kv, streaming-sequence, metadata-ephemeral, notification)

Observed Behavior

10 of the 32 shards exhibited an A→B→A bounce-back in the coordinator's logs over a ~12-minute window:

| Shard | Bounces | Pattern                                      | Time Between |
|-------|---------|----------------------------------------------|--------------|
| 0     | 1       | okk-2 → okk-6 → okk-2                        | 30s          |
| 6     | 1       | okk-6 → okk-3 → okk-6                        | 30s          |
| 8     | 1       | okk-6 → okk-2 → okk-6                        | 30s          |
| 17    | 1       | okk-4 → okk-2 → okk-4                        | 43s          |
| 19    | 1       | okk-1 → okk-4 → okk-1                        | 2.5 min      |
| 20    | 1       | okk-6 → okk-2 → okk-6                        | 30s          |
| 21    | 1       | okk-4 → okk-2 → okk-4                        | 43s          |
| 22    | 1       | okk-6 → okk-2 → okk-6                        | 30s          |
| 24    | 2       | okk-6 → okk-5 → okk-6, okk-6 → okk-2 → okk-6 | 30s each     |
| 26    | 2       | okk-6 → okk-2 → okk-6, okk-2 → okk-6 → okk-2 | 30s each     |

Excessive movement: Most shards had 8-13 total moves in ~12 minutes. Example shard movement paths:

Shard  1 (10 moves): okk-1 → okk-6 → okk-3 → okk-4 → okk-2 → okk-6 → okk-5 → okk-6 → okk-1 → okk-2 → okk-6
Shard  6 (13 moves): okk-1 → okk-6 → okk-3 → okk-4 → okk-2 → okk-6 → okk-0 → okk-5 → okk-1 → okk-6 → okk-2 → okk-6 → okk-3 → okk-6
Shard 26 (13 moves): okk-1 → okk-6 → okk-3 → okk-4 → okk-2 → okk-6 → okk-0 → okk-5 → okk-2 → okk-6 → okk-2 → okk-6 → okk-1 → okk-4

Coordinator Log Evidence

// 00:57:03 - shard 6 moved from okk-6 to okk-3
{"component":"load-balancer","shard":6,"from":"data-server-okk-6","to":"data-server-okk-3","message":"propose to swap the shard"}

// 00:57:33 - shard 6 moved BACK from okk-3 to okk-6 (30 seconds later)
{"component":"load-balancer","shard":6,"from":"data-server-okk-3","to":"data-server-okk-6","message":"propose to swap the shard"}

Expected Behavior

After a shard swap, the load balancer should have a stabilization/cooldown window before considering the same shard for another swap. This would prevent oscillation and allow the cluster to converge to a stable state.

Suggested Fix

Consider one or more of:

  1. Cooldown per shard: After swapping shard X, don't propose another swap for shard X for N seconds (e.g., 5 min)
  2. Dampening threshold: Only swap if the imbalance exceeds a minimum threshold (e.g., >2 leaders difference)
  3. Swap history check: Before proposing A→B, check if B→A happened recently and skip if so
  4. Batch evaluation: Evaluate balance globally rather than per-tick, considering pending swaps

Impact

  • Unnecessary leader transitions increase client-visible latency
  • Follower catch-up after each swap wastes network/CPU
  • Under chaos testing, the oscillation is amplified — shards never settle
  • Despite the churn, no data loss was observed (DB checksums remain consistent)

Context

Discovered during OKK testing framework verification of proposal #849 (Shard Fingerprint Consistency Validation). The excessive shard movement may also be contributing to the WAL checksum divergences reported in #898.
