Improve robustness to incomplete ledgers during recovery

Currently, recovering with a ledger that ends before the startup snapshot causes the recovery to fail (#7890).

We expect to be robust to this kind of failure.

There are two action items:
1. We need to fix the bug preventing these nodes from recovering.
2. We need to be more principled in our handling of gaps in the ledger during join and recovery.

To be specific on the second point, the proposed behaviour is:
- Any non-committed ledger files before the snapshot are marked .ignored (as we have no evidence that they were committed by the previous service).
  - Note that .committed is only a hint that the files might have been committed, not a guarantee. This can be tightened using receipts, but if the service identity changed within the gap, these receipts are hard to verify automatically.
- In join, anything newer than the snapshot should be marked .ignored and then re-replicated.
- The ledger GC then deletes an .ignored file if there is a ledger file in the read-only mount which supersedes it.
  - For complete files, the filename is sufficient for this, but for incomplete files we may need to inspect the file itself.
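
The proposed rules could be sketched roughly as follows. This is an illustration only, not the actual implementation. It assumes file names of the form `ledger_<start>-<end>[.committed]` for complete files and `ledger_<start>` for incomplete ones. It treats a file as "before the snapshot" when its first seqno is at or below the snapshot seqno, and as "newer" otherwise. It also assumes any complete read-only file covering the same range supersedes an ignored one. The helper names (`parse`, `files_to_ignore`, `superseded_by_name`) are hypothetical.

```python
import re
from dataclasses import dataclass
from typing import List, Optional

# Assumed naming scheme for this sketch: ledger_<start>[-<end>][.suffix...]
_NAME = re.compile(r"^ledger_(\d+)(?:-(\d+))?((?:\.[a-z]+)*)$")


@dataclass(frozen=True)
class LedgerFile:
    name: str
    start: int
    end: Optional[int]  # None means the file is incomplete
    committed: bool
    ignored: bool


def parse(name: str) -> Optional[LedgerFile]:
    """Parse a ledger file name; return None if it does not match the scheme."""
    m = _NAME.match(name)
    if m is None:
        return None
    suffixes = m.group(3).split(".")[1:]
    end = int(m.group(2)) if m.group(2) else None
    return LedgerFile(name, int(m.group(1)), end,
                      "committed" in suffixes, "ignored" in suffixes)


def files_to_ignore(names: List[str], snapshot_seqno: int, mode: str) -> List[str]:
    """Select the files that would be marked .ignored under the proposal."""
    if mode not in ("recovery", "join"):
        raise ValueError(f"unknown mode: {mode}")
    selected = []
    for name in names:
        f = parse(name)
        if f is None or f.ignored:
            continue
        if mode == "recovery":
            # Non-committed files before the snapshot (assumed: start <= snapshot).
            hit = f.start <= snapshot_seqno and not f.committed
        else:
            # Join: anything newer than the snapshot, committed or not.
            hit = f.start > snapshot_seqno
        if hit:
            selected.append(name)
    return selected


def superseded_by_name(ignored_name: str, read_only_names: List[str]) -> bool:
    """Filename-only check used by GC; incomplete files are never deleted here."""
    f = parse(ignored_name)
    if f is None or not f.ignored or f.end is None:
        # Incomplete or unparsable: the file contents would need inspecting.
        return False
    for other in read_only_names:
        g = parse(other)
        if (g is not None and g.end is not None and not g.ignored
                and g.start <= f.start and g.end >= f.end):
            return True
    return False
```

Returning `False` for incomplete files keeps the filename-only check conservative: such files are left in place until a content-based check decides whether they are superseded.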

Improve robustness to incomplete ledgers during recovery #7891