Skip to content

Reset propagation_delay to 0 after we catch up on all consensus commits#25157

Draft
mystenmark wants to merge 1 commit intomainfrom
mlogan-propagation-delay-fix
Draft

Reset propagation_delay to 0 after we catch up on all consensus commits#25157
mystenmark wants to merge 1 commit intomainfrom
mlogan-propagation-delay-fix

Conversation

@mystenmark
Copy link
Contributor

This change seems to fix test failures in which a node cannot submit EOP to consensus after the cluster restarts.

According to claude, the root cause is as follows: I'm not endorsing this analysis, but it sounds plausible and the suggested fix works.

  • ROOT CAUSE:

    1. When node 11 restarts, it recovers with last_proposed_round around 105
    2. The round_prober calculates propagation_delay as:
      propagation_delay = last_proposed_round - quorum_received_rounds[own_index]
    3. Since node 11 was down, the quorum hasn't received its recent blocks
      The difference is 105 - 91 = 14 rounds of delay
    4. propagation_delay_stop_proposal_threshold = 2 (from consensus parameters)
    5. Since 14 > 2, should_propose() returns false
    6. Node 11 NEVER proposes any blocks, even though:
      • ThresholdClock is properly advancing
      • clock_round > last_known_proposed_round would normally allow proposing
    7. Without proposing blocks, node 11 cannot include its EndOfPublish transaction
    8. The epoch change requires quorum (3) validators' EndOfPublish, but can only get 2
  • ANALYSIS: This is a consensus LIVENESS BUG in the propagation delay mechanism:

    • The propagation delay check is designed to prevent slow validators from proposing
    • But it creates a deadlock for catching-up nodes:
      • Node can't reduce propagation delay without proposing blocks
      • Node can't propose blocks due to high propagation delay
    • The node is stuck forever with propagation_delay = 14
  • RELEVANT CODE:

    • consensus/core/src/core.rs:938-981 - should_propose() with propagation delay check
    • consensus/core/src/round_tracker.rs:98-172 - calculate_propagation_delay()
    • Threshold: propagation_delay_stop_proposal_threshold = 2 (consensus parameters)
  • POTENTIAL FIXES (not implementing, just documenting):

    1. Reset propagation_delay to 0 after successful commit sync catch-up
    2. Add a "catch-up mode" that bypasses the propagation delay check
    3. Use a different metric for catching-up nodes (e.g., threshold clock vs last proposed)
    4. Gradually decrease propagation delay over time if node is synced

iteration 5 - FIX IMPLEMENTED

  • IMPLEMENTED FIX Windows test fixes #1: Reset propagation_delay to 0 after successful commit sync catch-up

  • CHANGES:

    • File: consensus/core/src/commit_syncer.rs
    • Added syncing_up: bool field to CommitSyncer struct to track if the node was lagging
    • Added logic in try_schedule_once() to detect when node catches up and reset propagation delay:
      • A node is "syncing up" when: synced_commit_index + batch_size < quorum_commit_index OR has pending/inflight fetches
      • When node transitions from syncing to caught up, call core_thread_dispatcher.set_propagation_delay(0)
  • RESULT: TEST PASSES

    • Before fix: Test times out waiting for epoch change
    • After fix: Test passes in ~114-122 seconds

@mystenmark mystenmark temporarily deployed to sui-typescript-aws-kms-test-env January 29, 2026 01:52 — with GitHub Actions Inactive
@vercel
Copy link

vercel bot commented Jan 29, 2026

The latest updates on your projects. Learn more about Vercel for GitHub.

Project Deployment Review Updated (UTC)
sui-docs Ready Ready Preview, Comment Jan 29, 2026 1:55am
2 Skipped Deployments
Project Deployment Review Updated (UTC)
multisig-toolkit Ignored Ignored Preview Jan 29, 2026 1:55am
sui-kiosk Ignored Ignored Preview Jan 29, 2026 1:55am

Request Review

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant