
Handle leader switch during Baseline Resync when new leader has all logs #835

@Besroy

Description

Problem Description

When a follower is catching up via Baseline Resync and a leader switch occurs where the new leader has all logs, the follower attempts to switch to incremental resync. However, this creates a stuck situation:

  1. During BR, the follower sets last_snapshot_lsn to the snapshot LSN in save_snp_resync_data
  2. The follower skips log commits when last_snapshot_lsn != 0 in commit_ext
    As a result, commit_lsn is stuck at 0 while append_lsn is non-zero (e.g., put_blob requests fail because shard-create messages cannot commit), as sketched below
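
A minimal sketch of this interaction, with illustrative names only (ReplDevSketch and on_log_append are assumptions based on the behavior described above, not the actual HomeStore code):

```cpp
#include <atomic>
#include <cstdint>

// Hedged sketch: only mirrors the behavior described in this issue.
struct ReplDevSketch {
    std::atomic<int64_t> last_snapshot_lsn{0};  // persisted in save_snp_resync_data during BR
    std::atomic<int64_t> commit_lsn{0};         // highest committed LSN
    std::atomic<int64_t> append_lsn{0};         // highest appended LSN

    // Called while Baseline Resync streams snapshot data to the follower.
    void save_snp_resync_data(int64_t snapshot_lsn) {
        last_snapshot_lsn.store(snapshot_lsn);
    }

    // Incremental resync after the leader switch keeps appending logs.
    void on_log_append(int64_t lsn) { append_lsn.store(lsn); }

    // Commit hook: while last_snapshot_lsn != 0 the follower assumes BR will
    // cover these LSNs and skips the commit, so commit_lsn never advances
    // (shard-create messages never commit and put_blob requests fail).
    void commit_ext(int64_t lsn) {
        if (last_snapshot_lsn.load() != 0) { return; }  // skipped
        commit_lsn.store(lsn);
    }
};
```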

Keeping incremental resync blocked actually seems to be a valid option. If it were allowed to proceed, there would be many corner cases, for example:

  1. If the follower has LSN_A (new_leader_start_lsn < LSN_A < snapshot_lsn) before doing BR:
  • The BR process will destroy all PGs and shards, then send PG/shard/blob data based on the snapshot (not in log-sequence order).
  • If a leader switch then occurs and incremental resync is allowed, the new leader would send logs starting from LSN_A + 1.
  • This would result in the loss of logs 1 through LSN_A unless additional effort is made to replay the old logs. However, that approach is complex and may not be feasible, as the logs might be incomplete.
  2. If another leader switch occurs during incremental resync:
  • The original BR process would resume (if the snapshot context still exists with the same snapshot index).
  • This leads to unexpected corner cases such as blob leakage.
  • Cleaning up the snapshot context to enable incremental resync requires significant effort (comparable to the complexity described in the first point) and might introduce additional corner cases (e.g., where to trigger the cleanup).

Therefore, the ideal behavior might be to allow the follower to recover only through BR.
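
One way this BR-only policy could be expressed, as a hedged sketch with hypothetical names (SnapshotContextSketch and allow_incremental_resync are not existing HomeStore/HomeObject APIs):

```cpp
#include <cstdint>
#include <optional>

// Hedged sketch: expresses the "BR-only recovery" policy argued above.
struct SnapshotContextSketch {
    int64_t snapshot_lsn;     // LSN the snapshot was taken at
    uint64_t snapshot_index;  // snapshot (obj) index used to resume BR
};

struct FollowerResyncPolicy {
    std::optional<SnapshotContextSketch> br_ctx;  // non-empty while BR is in progress

    // Hypothetical hook consulted when a new leader proposes incremental resync
    // starting from follower_last_lsn + 1.
    bool allow_incremental_resync(int64_t follower_last_lsn) const {
        if (br_ctx.has_value() && follower_last_lsn < br_ctx->snapshot_lsn) {
            // PGs and shards were already destroyed and are being rebuilt from
            // the snapshot, so logs 1..follower_last_lsn cannot simply be
            // replayed; only resuming Baseline Resync can bring this follower back.
            return false;
        }
        return true;
    }
};
```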

Current Workaround

Since the scenario where a member has all logs during BR is rare, the current workaround is to manually delete the current leader pod to trigger a re-election; the follower can then catch up by continuing the BR process.

Additional Issues
Even if we have a workaround to let BR continue, there are potential problems:

  • Incremental resync allocates space during log append operations, which means BR may encounter no_space_left errors and be unable to proceed
  • If incremental resync has appended logs up to LSN_A while the new leader's log starts at LSN_B with LSN_B <= LSN_A, the BR process will not continue (see the sketch after this list)
  • If the old leader becomes unhealthy, there is no way to let BR continue
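
For the LSN_B <= LSN_A case above, a generic raft-style sketch (illustrative only, not HomeStore/NuRaft code) of the catch-up decision the new leader makes, which is why the stalled BR never resumes:

```cpp
#include <cstdint>

enum class CatchupMode { SendSnapshot /* Baseline Resync */, SendLogs /* incremental */ };

// leader_start_lsn    (LSN_B): first LSN still present in the new leader's log store.
// follower_append_lsn (LSN_A): last LSN the follower has appended.
CatchupMode choose_catchup(int64_t leader_start_lsn, int64_t follower_append_lsn) {
    if (follower_append_lsn + 1 < leader_start_lsn) {
        // The follower is missing logs the leader has already compacted away,
        // so only a snapshot (Baseline Resync) can catch it up.
        return CatchupMode::SendSnapshot;
    }
    // LSN_B <= LSN_A: the leader believes plain log shipping is sufficient,
    // so the follower's stalled Baseline Resync is never resumed.
    return CatchupMode::SendLogs;
}
```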

The reason the follower does not continue the snapshot is that receiving_snapshot_ is lost during the svc_state transition.
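
A hedged sketch of that root cause (state names are illustrative; only receiving_snapshot_ and the notion of an svc_state transition come from this issue):

```cpp
#include <atomic>

enum class SvcState { INIT, FOLLOWER, CANDIDATE, LEADER };

struct SvcStateSketch {
    std::atomic<SvcState> svc_state{SvcState::INIT};
    std::atomic<bool> receiving_snapshot_{false};  // true while BR data is streaming in

    void transition_to(SvcState next) {
        // If the transition resets member state wholesale, receiving_snapshot_
        // is dropped and the follower never asks the new leader to continue the
        // snapshot. Carrying the flag across the transition keeps BR resumable.
        const bool resuming = receiving_snapshot_.load();
        svc_state.store(next);
        receiving_snapshot_.store(resuming);
    }
};
```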
