Improve robustness to incomplete ledgers during recovery #7891

@cjen1-msft

Currently, if we recover using a ledger that ends before the startup snapshot, recovery fails (#7890).

Recovery is expected to be robust to this kind of failure.

There are two action items.

  1. We need to fix the bug preventing these nodes from recovering.
  2. We need to be more principled in our handling of gaps in the ledger during join and recovery.

To be specific on the second point, the proposed behaviour is:

  • Any non-committed ledger files before the snapshot are marked .ignored, as we have no evidence that they were committed by the previous service.
    • Note that the .committed suffix is only a hint that a file's contents were committed, not a guarantee. This can be tightened using receipts, but if the service identity changed within the gap, these receipts are hard to verify automatically.
  • In join, anything newer than the snapshot should be marked .ignored and then re-replicated.
  • The ledger GC then deletes an .ignored file if there is a ledger file in the read-only mount which supersedes it.
    • For complete files the filename is sufficient for this, but for incomplete files we may need to inspect the file itself.
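To make the proposal concrete, here is a rough Python sketch of the classification rules above. It is illustrative only and not the CCF implementation. It assumes ledger files are named `ledger_<start>-<end>` when complete and `ledger_<start>` when incomplete, with optional `.committed` and `.ignored` suffixes. It also assumes that "before the snapshot" means the file's first seqno is at or below the snapshot seqno, and that "supersedes" means a committed file covering at least the same seqno range. The helper names (`parse_ledger_file`, `files_to_ignore`, `is_superseded`) are hypothetical.

```python
import re
from typing import List, NamedTuple, Optional

# Assumed naming scheme: "ledger_<start>-<end>" (complete) or
# "ledger_<start>" (incomplete), followed by optional dotted suffixes.
_LEDGER_RE = re.compile(
    r"^ledger_(?P<start>\d+)(?:-(?P<end>\d+))?(?P<suffixes>(?:\.\w+)*)$"
)


class LedgerFile(NamedTuple):
    name: str
    start: int
    end: Optional[int]  # None for an incomplete file
    committed: bool
    ignored: bool


def parse_ledger_file(name: str) -> Optional[LedgerFile]:
    """Parse a ledger file name, or return None if it does not match."""
    m = _LEDGER_RE.match(name)
    if m is None:
        return None
    suffixes = m.group("suffixes").split(".")[1:]
    end = m.group("end")
    return LedgerFile(
        name=name,
        start=int(m.group("start")),
        end=int(end) if end is not None else None,
        committed="committed" in suffixes,
        ignored="ignored" in suffixes,
    )


def files_to_ignore(names: List[str], snapshot_seqno: int, joining: bool) -> List[str]:
    """Return the files that the proposal would mark .ignored."""
    result = []
    for name in names:
        f = parse_ledger_file(name)
        if f is None or f.ignored:
            continue
        # Assumption: a file counts as "before the snapshot" if it starts
        # at or below the snapshot seqno.
        before_snapshot = f.start <= snapshot_seqno
        if before_snapshot and not f.committed:
            # No evidence that the previous service committed this file.
            result.append(name)
        elif joining and not before_snapshot:
            # On join, newer files are ignored and re-replicated.
            result.append(name)
    return result


def is_superseded(ignored_name: str, read_only_names: List[str]) -> bool:
    """Filename-only check used by the GC sketch; complete files only."""
    f = parse_ledger_file(ignored_name)
    if f is None or f.end is None:
        # Incomplete file: the name is not sufficient, the contents
        # would need to be inspected, so conservatively keep it.
        return False
    for other in read_only_names:
        o = parse_ledger_file(other)
        if o is None or o.end is None or not o.committed or o.ignored:
            continue
        if o.start <= f.start and o.end >= f.end:
            return True
    return False
```

Under these assumptions, a recovering node with a snapshot at seqno 25 would mark `ledger_11-20` and `ledger_21` as `.ignored` but keep `ledger_1-10.committed`. A joining node would additionally ignore anything starting after seqno 25. The sketch deliberately returns `False` for incomplete files in `is_superseded`, since the issue notes that those may require inspecting the file itself.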
