Currently, recovering with a ledger that ends before the startup snapshot causes the recovery to fail (#7890).
We expect to be robust to this kind of failure.
There are two action items.
- We need to fix the bug preventing these nodes from recovering.
- We need to be more principled in our handling of gaps in the ledger during join and recovery.
To be specific on the second point, the proposed behaviour is:
- Any non-committed ledger files before the snapshot are marked .ignored, as we have no evidence that they were committed by the previous service.
  - Note that .committed is only a hint that the files might have been committed, not a guarantee. This can be tightened using receipts, but if the service identity changed within the gap, these receipts are hard to verify automatically.
- In join, anything newer than the snapshot should be marked .ignored and then re-replicated.
- The ledger GC then deletes an .ignored file if there is a ledger file in the read-only-mount which supersedes it.
  - For complete files, the filename is sufficient for this, but for incomplete files we may need to inspect the file itself.
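
The proposed rules can be sketched roughly as below. This is only an illustration, not the actual implementation: the filename pattern `ledger_<start>[-<end>][.committed][.ignored]`, the helper names (`parse`, `files_to_ignore`, `gc_decision`), and the reading of "before" / "newer than" the snapshot as a comparison on the file's start seqno are all assumptions made for the example. "Supersedes" is likewise assumed to mean that a complete read-only file covers the ignored file's whole range.

```python
import re
from typing import Iterable, List, NamedTuple, Optional

# Assumed naming scheme (hypothetical for this sketch):
#   ledger_<start>[-<end>][.committed][.ignored]
_NAME = re.compile(r"^ledger_(\d+)(?:-(\d+))?(\.committed)?(\.ignored)?$")


class LedgerFile(NamedTuple):
    name: str
    start: int
    end: Optional[int]  # None for an incomplete file
    committed: bool
    ignored: bool


def parse(name: str) -> Optional[LedgerFile]:
    """Parse a ledger filename, or return None if it does not match."""
    m = _NAME.match(name)
    if m is None:
        return None
    start, end, committed, ignored = m.groups()
    return LedgerFile(
        name,
        int(start),
        int(end) if end is not None else None,
        committed is not None,
        ignored is not None,
    )


def files_to_ignore(
    names: Iterable[str], snapshot_seqno: int, joining: bool
) -> List[str]:
    """Return the files that would be marked .ignored.

    Assumption: a file is "newer than the snapshot" when its start
    seqno is greater than the snapshot seqno, and "before" otherwise.
    """
    result = []
    for f in filter(None, map(parse, names)):
        if f.ignored:
            continue  # already marked
        newer = f.start > snapshot_seqno
        if joining and newer:
            # Join: everything after the snapshot is re-replicated.
            result.append(f.name)
        elif not newer and not f.committed:
            # No evidence the previous service committed this file.
            result.append(f.name)
    return result


def gc_decision(ignored_name: str, read_only_names: Iterable[str]) -> str:
    """Return "delete", "inspect" or "keep" for one .ignored file."""
    f = parse(ignored_name)
    if f is None or not f.ignored:
        return "keep"
    if f.end is None:
        # Incomplete file: the filename alone is not sufficient.
        return "inspect"
    for r in filter(None, map(parse, read_only_names)):
        covers = r.end is not None and r.start <= f.start and f.end <= r.end
        if covers and not r.ignored:
            return "delete"
    return "keep"
```

For example, with files `ledger_1-10.committed`, `ledger_11-20`, `ledger_21-30.committed` and `ledger_31` and a snapshot at seqno 25, this sketch would ignore only `ledger_11-20` on recovery, and both `ledger_11-20` and `ledger_31` on join.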