Summary
The coordinator's load balancer exhibits shard ping-pong behavior — shards are swapped from server A to B, then immediately back from B to A within 30 seconds. This creates unnecessary churn, increases client-visible latency during leader transitions, and wastes resources on follower catch-up.
Environment
- Oxia image: oxia/oxia@sha256:bb56deee95e0ca24589333c7c6b032a2b220c892e0afd8e689204c7c6535684b (main branch)
- Cluster: 7 data servers, 32 shards, replication factor 3, notifications enabled
- Active chaos testing: pod kills every 5 min, network delays (100ms±50ms) every 3 min, coordinator kills every 10 min
- OKK test cases: 4 running (basic-kv, streaming-sequence, metadata-ephemeral, notification)
Observed Behavior
10 out of 32 shards exhibited A→B→A bounce-back according to the coordinator's logs over a ~12-minute window:
| Shard | Bounces | Pattern | Time Between Swaps |
|---|---|---|---|
| 0 | 1 | okk-2 → okk-6 → okk-2 | 30s |
| 6 | 1 | okk-6 → okk-3 → okk-6 | 30s |
| 8 | 1 | okk-6 → okk-2 → okk-6 | 30s |
| 17 | 1 | okk-4 → okk-2 → okk-4 | 43s |
| 19 | 1 | okk-1 → okk-4 → okk-1 | 2.5 min |
| 20 | 1 | okk-6 → okk-2 → okk-6 | 30s |
| 21 | 1 | okk-4 → okk-2 → okk-4 | 43s |
| 22 | 1 | okk-6 → okk-2 → okk-6 | 30s |
| 24 | 2 | okk-6 → okk-5 → okk-6, okk-6 → okk-2 → okk-6 | 30s each |
| 26 | 2 | okk-6 → okk-2 → okk-6, okk-2 → okk-6 → okk-2 | 30s each |
Excessive movement: Most shards had 8-13 total moves in ~12 minutes. Example shard movement paths:
Shard 1 (10 moves): okk-1 → okk-6 → okk-3 → okk-4 → okk-2 → okk-6 → okk-5 → okk-6 → okk-1 → okk-2 → okk-6
Shard 6 (13 moves): okk-1 → okk-6 → okk-3 → okk-4 → okk-2 → okk-6 → okk-0 → okk-5 → okk-1 → okk-6 → okk-2 → okk-6 → okk-3 → okk-6
Shard 26 (13 moves): okk-1 → okk-6 → okk-3 → okk-4 → okk-2 → okk-6 → okk-0 → okk-5 → okk-2 → okk-6 → okk-2 → okk-6 → okk-1 → okk-4
Coordinator Log Evidence
// 00:57:03 - shard 6 moved from okk-6 to okk-3
{"component":"load-balancer","shard":6,"from":"data-server-okk-6","to":"data-server-okk-3","message":"propose to swap the shard"}
// 00:57:33 - shard 6 moved BACK from okk-3 to okk-6 (30 seconds later)
{"component":"load-balancer","shard":6,"from":"data-server-okk-3","to":"data-server-okk-6","message":"propose to swap the shard"}
Expected Behavior
After a shard swap, the load balancer should have a stabilization/cooldown window before considering the same shard for another swap. This would prevent oscillation and allow the cluster to converge to a stable state.
Suggested Fix
Consider one or more of:
- Cooldown per shard: After swapping shard X, don't propose another swap for shard X for N seconds (e.g., 5 min)
- Dampening threshold: Only swap if the imbalance exceeds a minimum threshold (e.g., >2 leaders difference)
- Swap history check: Before proposing A→B, check if B→A happened recently and skip if so
- Batch evaluation: Evaluate balance globally rather than per-tick, considering pending swaps
Impact
- Unnecessary leader transitions increase client-visible latency
- Follower catch-up after each swap wastes network/CPU
- Under chaos testing, the oscillation is amplified — shards never settle
- Despite the churn, no data loss was observed (DB checksums remain consistent)
Context
Discovered during OKK testing framework verification of proposal #849 (Shard Fingerprint Consistency Validation). The excessive shard movement may also be contributing to the WAL checksum divergences reported in #898.