More resilient snapshots and some de/commit fixes. #2533

Open
v0d1ch wants to merge 16 commits into master from resilient-snapshots
@v0d1ch v0d1ch commented Mar 5, 2026


  • CHANGELOG updated or not needed
  • Documentation updated or not needed
  • Haddocks updated or not needed
  • No new TODOs introduced or explained hereafter

@v0d1ch v0d1ch self-assigned this Mar 5, 2026
v0d1ch added 14 commits March 5, 2026 16:47
Signed-off-by: Sasha Bogicevic <sasha.bogicevic@iohk.io>
Signed-off-by: Sasha Bogicevic <sasha.bogicevic@iohk.io>
Signed-off-by: Sasha Bogicevic <sasha.bogicevic@iohk.io>
Signed-off-by: Sasha Bogicevic <sasha.bogicevic@iohk.io>
It was green for the wrong reason.

Signed-off-by: Sasha Bogicevic <sasha.bogicevic@iohk.io>
Signed-off-by: Sasha Bogicevic <sasha.bogicevic@iohk.io>
Signed-off-by: Sasha Bogicevic <sasha.bogicevic@iohk.io>
Signed-off-by: Sasha Bogicevic <sasha.bogicevic@iohk.io>
Signed-off-by: Sasha Bogicevic <sasha.bogicevic@iohk.io>
Signed-off-by: Sasha Bogicevic <sasha.bogicevic@iohk.io>
Signed-off-by: Sasha Bogicevic <sasha.bogicevic@iohk.io>
  When DecommitFinalized bumps the version while a snapshot request is
  already in-flight, nodes reject the stale request with a version
  mismatch error and the head gets permanently stuck. Add three tests
  that fail against the current code and will pass once the fix is in:

  - stale ReqSn with old version should be ignored, not rejected
  - DecommitFinalized should reset seenSnapshot to the confirmed number
  - BehaviorSpec end-to-end scenario where the race causes a deadlock

Signed-off-by: Sasha Bogicevic <sasha.bogicevic@iohk.io>
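
The first two test cases in the commit above can be illustrated with a small, self-contained model. This is only a sketch under stated assumptions: `OpenState`, `Outcome`, `onReqSn` and `onDecommitFinalized` are hypothetical, simplified stand-ins, not the actual hydra-node types or handlers, and versions and snapshot numbers are modelled as plain `Natural` values.

```haskell
import Numeric.Natural (Natural)

-- Hypothetical, simplified stand-ins; NOT the real hydra-node types.
data SeenSnapshot
  = LastSeenSnapshot Natural  -- nothing in flight; last snapshot number seen
  | RequestedSnapshot Natural -- a ReqSn for this number is in flight
  deriving (Eq, Show)

data OpenState = OpenState
  { version :: Natural
  , confirmedNumber :: Natural
  , seenSnapshot :: SeenSnapshot
  }
  deriving (Eq, Show)

data Outcome = Noop | Accept
  deriving (Eq, Show)

-- A ReqSn whose version differs from ours, or whose number is already
-- confirmed, is treated as a benign race and dropped instead of rejected.
onReqSn :: OpenState -> Natural -> Natural -> Outcome
onReqSn st reqVersion reqNumber
  | reqVersion /= version st = Noop
  | reqNumber <= confirmedNumber st = Noop
  | otherwise = Accept

-- Bump the version and reset seenSnapshot to the confirmed number,
-- forgetting any request that was in flight under the old version.
onDecommitFinalized :: OpenState -> OpenState
onDecommitFinalized st =
  st
    { version = version st + 1
    , seenSnapshot = LastSeenSnapshot (confirmedNumber st)
    }

main :: IO ()
main = do
  let before = OpenState{version = 0, confirmedNumber = 1, seenSnapshot = RequestedSnapshot 2}
      after = onDecommitFinalized before
  print after
  print (onReqSn after 0 2) -- stale request carrying the old version
  print (onReqSn after 1 2) -- fresh request carrying the bumped version
```

In this model the stale request is a no-op, so a later request carrying the bumped version can still be accepted rather than the head getting stuck.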
Signed-off-by: Sasha Bogicevic <sasha.bogicevic@iohk.io>
  Replace the immediate per-ReqTx snapshot trigger with a timer-driven
  model: the periodic timer (TimerInput/onOpenTimer) batches pending work
  into ReqSn requests, while AckSn confirmation still chains consecutive
  snapshots for throughput.

  - Remove maybeRequestSnapshot from onOpenNetworkReqTx
  - Drop stale/duplicate ReqSn and AckSn silently (noop) instead of
    waiting or erroring; remove the now-unused WaitOn* variants
  - Reset localUTxO/localTxs/allTxs to confirmed state on version bump
    when a snapshot was in-flight, so the timer can build a fresh ReqSn
  - onOpenTimer re-broadcasts ReqSn + own AckSn when stuck in SeenSnapshot

Signed-off-by: Sasha Bogicevic <sasha.bogicevic@iohk.io>
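
A rough sketch of the timer-driven model described in the commit above, assuming a heavily simplified node: `ReqTx` only queues a transaction, and a timer tick either batches everything pending into one `ReqSn` or re-broadcasts the request it is still waiting on. All names here (`Node`, `Phase`, `step`, `run`, `BroadcastReqSn`) are hypothetical and do not correspond to the real `onOpenTimer` implementation. The re-broadcast of the node's own AckSn and the AckSn-driven chaining of consecutive snapshots are left out.

```haskell
-- Hypothetical, simplified model; NOT the hydra-node HeadLogic.
type Tx = String

data Input = ReqTx Tx | TimerTick
  deriving (Eq, Show)

data Effect = BroadcastReqSn Int [Tx]
  deriving (Eq, Show)

-- Either nothing is in flight, or we wait for acks on a requested snapshot.
data Phase = Idle | AwaitingAcks Int [Tx]

data Node = Node
  { confirmed :: Int
  , pending :: [Tx] -- newest first
  , phase :: Phase
  }

step :: Node -> Input -> (Node, [Effect])
-- ReqTx no longer triggers a snapshot; it only records pending work.
step node (ReqTx tx) = (node{pending = tx : pending node}, [])
step node TimerTick = case phase node of
  -- Stuck waiting for acks: re-broadcast the same request.
  AwaitingAcks sn txs -> (node, [BroadcastReqSn sn txs])
  Idle
    | null (pending node) -> (node, [])
    | otherwise ->
        let sn = confirmed node + 1
            txs = reverse (pending node) -- restore arrival order
         in (node{pending = [], phase = AwaitingAcks sn txs}, [BroadcastReqSn sn txs])

-- Feed a list of inputs through the node, collecting all effects in order.
run :: Node -> [Input] -> (Node, [Effect])
run start = foldl go (start, [])
 where
  go (node, acc) input =
    let (node', effects) = step node input
     in (node', acc ++ effects)

main :: IO ()
main =
  print (snd (run (Node 0 [] Idle) [ReqTx "a", ReqTx "b", TimerTick, TimerTick]))
```

Here the two transactions produce no effect on their own. The first tick batches both into a single request and the second tick repeats that request because no confirmation arrived in between.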
@v0d1ch v0d1ch force-pushed the resilient-snapshots branch from 3146382 to 766de0c Compare March 5, 2026 15:47
Signed-off-by: Sasha Bogicevic <sasha.bogicevic@iohk.io>
github-actions bot commented Mar 5, 2026

Transaction cost differences

No cost or size differences found

  - snapshotInFlight: allow leader to process its own ReqSn echo by
    changing RequestedSnapshot{} -> True to RequestedSnapshot{requested}
    -> sn /= requested. The old behaviour prevented the leader from ever
    signing its own broadcast, causing a permanent liveness deadlock.

  - onOpenTimer: compute nextSn as max(confSn, latestSeenSnapshotNumber)
    + 1 instead of confSn + 1. seenSnapshot and confirmedSnapshot can
    diverge when DecommitFinalized/CommitFinalized resets seenSnapshot via
    toLastSeenSnapshot while confirmedSnapshot is still behind. The old
    calculation made nextSn too small, causing snapshotInFlight to
    incorrectly block fresh snapshot requests.

  - requireReqSn: return noop for version mismatches, too-old snapshots,
    and overlapping requests instead of Error. The timer will retry with
    the correct version, so these are expected races, not protocol
    violations.

  - Update tests to reflect timer-driven batching: ReqTx no longer
    immediately triggers ReqSn, all pending txs batch into one snapshot
    per timer tick, and TxInvalid precedes SnapshotConfirmed by ~10s.

Signed-off-by: Sasha Bogicevic <sasha.bogicevic@iohk.io>
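
The `snapshotInFlight` and `nextSn` changes from the commit above can be sketched as follows. This is a minimal illustration, not the code in the PR: the `SeenSnapshot` type is reduced to two constructors, the `lastSeen` field on `RequestedSnapshot` and the body of `latestSeenSnapshotNumber` are assumptions, and `nextSnapshotNumber` is a hypothetical helper name for the calculation the commit describes.

```haskell
import Numeric.Natural (Natural)

-- Simplified stand-in for the real type; other constructors are omitted
-- and the lastSeen field on RequestedSnapshot is an assumption.
data SeenSnapshot
  = LastSeenSnapshot {lastSeen :: Natural}
  | RequestedSnapshot {lastSeen :: Natural, requested :: Natural}
  deriving (Eq, Show)

-- Only a request for a different number counts as "in flight", so the
-- leader can still process the echo of its own ReqSn.
snapshotInFlight :: Natural -> SeenSnapshot -> Bool
snapshotInFlight sn seen = case seen of
  LastSeenSnapshot{} -> False
  RequestedSnapshot{requested = r} -> sn /= r

-- Assumed behaviour: the number recorded as last seen.
latestSeenSnapshotNumber :: SeenSnapshot -> Natural
latestSeenSnapshotNumber = lastSeen

-- max(confSn, latestSeenSnapshotNumber) + 1, instead of confSn + 1.
nextSnapshotNumber :: Natural -> SeenSnapshot -> Natural
nextSnapshotNumber confSn seen = max confSn (latestSeenSnapshotNumber seen) + 1

main :: IO ()
main = do
  -- seenSnapshot was reset ahead of confirmedSnapshot (confSn = 1, lastSeen = 2).
  let seen = LastSeenSnapshot 2
      sn = nextSnapshotNumber 1 seen
  print sn
  -- After the leader broadcasts ReqSn for sn, its own echo is not blocked.
  print (snapshotInFlight sn (RequestedSnapshot 2 sn))
  -- A request for any other number is still considered blocked.
  print (snapshotInFlight (sn + 1) (RequestedSnapshot 2 sn))
```

With `confSn + 1` the diverged state above would yield 2, a number that was already seen. Taking the maximum moves the next request past it.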