
FCU to canonical ancestor silently ignored, leaving descendants canonical#10876

Draft
svlachakis wants to merge 12 commits into master from canonical-fix

Conversation

@svlachakis (Contributor) commented Mar 19, 2026

Closes #9906 and #10861

Changes

Fix: Stale canonical markers after beacon sync cause wrong eth_getBlockByNumber results

Problem

BlockTree.UpdateMainChain uses Head?.Number as the baseline for its unmark loop:

long previousHeadNumber = Head?.Number ?? 0L;
if (previousHeadNumber > lastNumber)
{
    // unmarks levels previousHeadNumber..lastNumber+1
}

During beacon sync, BlockDownloader calls:

UpdateMainChain([block], wereProcessed: false)

This marks downloaded blocks as canonical (HasBlockOnMainChain = true) without updating Head. The head update in MoveToMain is gated on
wasProcessed = true, so Head remains at whatever height the last FCU set it to.

After syncing blocks at heights H+1..Hn this way, Head remains stale at H.

When a reorg then arrives — FCU(B) where B is a sibling of the canonical block at height H — UpdateMainChain is called with:

  • lastNumber = H
  • previousHeadNumber = H (stale)

Since H > H is false, the unmark loop is skipped entirely. As a result, orphaned blocks at H+1..Hn retain HasBlockOnMainChain = true.

This causes eth_getBlockByNumber to return the wrong block at those heights — the canonical mismatch observed on Gnosis.

A second related case exists in ePBS, where the CL can issue FCU to an ancestor rather than a sibling. If Head is stale at H and beacon sync has
marked H+1..Hn canonical, a subsequent FCU(H-1) triggers the downward unmark loop which correctly clears H but cannot see above the stale head —
leaving H+1..Hn canonical.
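To make the failure concrete, here is a minimal Python model of the marker bookkeeping (illustrative only: the dict of booleans stands in for ChainLevelInfo.HasBlockOnMainChain, and update_main_chain_old is a sketch of the pre-fix logic, not Nethermind code):

```python
# Minimal model of the buggy bookkeeping (illustrative, not Nethermind code).
# Levels are represented by a dict: height -> HasBlockOnMainChain.

def update_main_chain_old(levels, head, last_number):
    """Old logic: only a downward unmark, gated on previous_head > last_number."""
    previous_head = head
    if previous_head > last_number:
        # Clears previousHeadNumber..lastNumber+1.
        for n in range(previous_head, last_number, -1):
            levels[n] = False
    levels[last_number] = True  # mark the new canonical block

H = 100
levels = {H: True}
head = H

# Beacon sync marks H+1..H+3 canonical without advancing Head.
for n in range(H + 1, H + 4):
    levels[n] = True

# Reorg: FCU(B), a sibling at height H. previous_head == last_number == H,
# so the downward loop is skipped and the stale markers survive.
update_main_chain_old(levels, head, last_number=H)
stale = [n for n in range(H + 1, H + 4) if levels.get(n)]
print(stale)  # [101, 102, 103] — the orphaned levels still claim to be canonical
```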

Fix

Replaced the if/else with a downward unmark (unchanged) followed by an unconditional upward scan:

// Downward unmark — handles normal reorg where new chain is shorter than old head.
if (previousHeadNumber > lastNumber)
{
    for (long i = 0; i < previousHeadNumber - lastNumber; i++)
    {
        // clear previousHeadNumber..lastNumber+1
    }
}

// Upward scan — clears stale beacon-sync markers above max(previousHead, lastNumber).
// Covers both FCU at same/higher height and ePBS FCU to ancestor with stale head.
for (long levelNumber = Math.Max(previousHeadNumber, lastNumber) + 1; ; levelNumber++)
{
    ChainLevelInfo? level = LoadLevel(levelNumber);
    if (level is null || !level.HasBlockOnMainChain)
        break;

    level.HasBlockOnMainChain = false;
    _chainLevelInfoRepository.PersistLevel(levelNumber, level, batch);
}
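As a quick sanity check, here is a Python sketch of the combined downward unmark and upward scan (illustrative only; a dict of height -> HasBlockOnMainChain stands in for the chain levels):

```python
# Illustrative model of the fixed logic (not Nethermind code).

def update_main_chain_fixed(levels, head, last_number):
    previous_head = head
    # Downward unmark: normal reorg where the new chain is shorter than the old head.
    if previous_head > last_number:
        for n in range(previous_head, last_number, -1):
            levels[n] = False
    levels[last_number] = True  # mark the new canonical block
    # Unconditional upward scan: clear stale beacon-sync markers above
    # max(previous_head, last_number); stop at a missing or non-canonical level.
    n = max(previous_head, last_number) + 1
    while levels.get(n):
        levels[n] = False
        n += 1

H = 100
levels = {n: True for n in range(H, H + 4)}  # stale markers at H+1..H+3

# FCU(B), sibling at height H, with a stale head at H: the upward scan
# now clears the orphaned levels that the old code left canonical.
update_main_chain_fixed(levels, head=H, last_number=H)
print([n for n in range(H + 1, H + 4) if levels[n]])  # []
```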

The IsOnMainChainBehindHead early-return in ForkchoiceUpdatedHandler is not changed — it is spec-compliant, and geth's forkchoiceUpdated has
identical behaviour.


Startup healing flag for nodes already affected

Nodes that hit this bug before the fix may have corrupted canonical markers persisted to disk. The fix prevents new corruption from occurring,
but does not repair existing state.

A new startup flag triggers a one-time repair pass at boot:

--Init.HealCanonicalChainOnStartup true
--Init.HealCanonicalChainDepth 8192 (default, adjustable)

When enabled, ReviewBlockTree invokes BlockTree.HealCanonicalChain(head, depth) before the normal startup traversal. The heal runs in two phases
within a single BatchWrite (atomic, deferred flush):

  1. Upward scan (ClearStaleMarkersAbove): walks from head + 1 upward and clears HasBlockOnMainChain on any level left canonical by the sync path
    but no longer reachable from the true head.
  2. Backward walk (RepairMarkersBelow): follows ParentHash from the head for up to HealCanonicalChainDepth blocks, ensuring every ancestor has
    HasBlockOnMainChain = true and sits at index 0 in its ChainLevelInfo slot. Stops early at genesis, a missing parent (pruned chain), or a block
    hash not found in any slot.
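The two phases can be sketched with a minimal Python model (illustrative only; heal_canonical_chain and the level/block dictionaries are stand-ins, not the actual Nethermind API):

```python
# Illustrative model of the two-phase heal. A level holds an ordered list of
# block hashes (index 0 = canonical slot) plus a canonical marker; blocks
# maps hash -> (height, parent_hash).

def heal_canonical_chain(levels, blocks, head_hash, max_depth):
    if head_hash not in blocks:
        return  # unknown start hash: safe no-op
    head_height, _ = blocks[head_hash]

    # Phase 1 (upward scan): clear stale canonical markers above the head.
    n = head_height + 1
    while n in levels and levels[n]["canonical"]:
        levels[n]["canonical"] = False
        n += 1

    # Phase 2 (backward walk): follow parent hashes from the head, forcing
    # every ancestor canonical and into slot 0 of its level.
    h = head_hash
    for _ in range(max_depth):
        height, parent = blocks[h]
        level = levels[height]
        if h not in level["hashes"]:
            break  # hash not found in any slot
        level["hashes"].remove(h)
        level["hashes"].insert(0, h)
        level["canonical"] = True
        if height == 0 or parent not in blocks:
            break  # genesis or pruned parent
        h = parent

# Head is block "a" at height 1; "b" wrongly occupies the canonical slot at
# height 1, and beacon sync left a stale marker at height 2.
blocks = {"g": (0, None), "a": (1, "g"), "b": (1, "g"), "s": (2, "a")}
levels = {
    0: {"hashes": ["g"], "canonical": True},
    1: {"hashes": ["b", "a"], "canonical": True},
    2: {"hashes": ["s"], "canonical": True},
}
heal_canonical_chain(levels, blocks, head_hash="a", max_depth=8192)
print(levels[1]["hashes"][0], levels[2]["canonical"])  # a False
```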

HealCanonicalChain is a void command on IBlockTree. ReadOnlyBlockTree and StatelessBlockTree throw on invocation; BlockTreeOverlay delegates to
the inner tree.


Tests

BlockTreeTests.cs:

  • UpdateMainChain_sync_marks_descendant_canonical_then_reorg_to_sibling_decanonicalizes_it — regression test for the original fix. Failed before
    the upward scan, passes after.
  • UpdateMainChain_fcu_to_ancestor_with_stale_head_clears_all_beacon_synced_descendants — ePBS scenario: FCU to ancestor with beacon-synced blocks
    above the stale head. The downward loop clears the stale head level; the upward scan clears the rest. Failed before Math.Max, passes after.
  • HealCanonicalChain_clears_stale_marker_above_head_left_by_sync — upward scan removes single stale level.
  • HealCanonicalChain_clears_multiple_stale_levels_above_head — upward scan removes three consecutive stale levels.
  • HealCanonicalChain_fixes_wrong_canonical_block_at_height — downward walk swaps wrong canonical block back to index 0.
  • HealCanonicalChain_fixes_both_directions_in_one_pass — combined scenario: wrong canonical at H=1 and stale marker at H=2, both repaired in one
    call.
  • HealCanonicalChain_does_not_alter_already_consistent_chain — idempotency: clean chain is unchanged.
  • HealCanonicalChain_does_nothing_for_unknown_start_hash — unknown hash is a safe no-op.
  • HealCanonicalChain_respects_maxBlockDepth — depth limit prevents repairing beyond the configured window.

Types of changes

What types of changes does your code introduce?

  • Bugfix (a non-breaking change that fixes an issue)
  • New feature (a non-breaking change that adds functionality)
  • Breaking change (a change that causes existing functionality not to work as expected)
  • Optimization
  • Refactoring
  • Documentation update
  • Build-related changes
  • Other: Description

Testing

Requires testing

  • Yes
  • No

If yes, did you write tests?

  • Yes
  • No

@smartprogrammer93 (Contributor) commented:

@claude review


claude bot commented Mar 19, 2026

Claude finished @smartprogrammer93's task in 3m 53s


Claude's Review

  • Gather context
  • Read changed files
  • Read relevant rules
  • Provide review feedback

Summary

The fix is correct and minimal. Removing the 5-line early-return restores the semantics demanded by the Engine API spec: a FCU to any already-known block must always update chain state, not just return Valid silently.


The Fix (ForkchoiceUpdatedHandler.cs)

The removed block (in the WasProcessed = true branch, after EnsureNewHead) was:

// REMOVED:
if (!IsOnMainChainBehindHead(newHeadBlock, forkchoiceState, out ResultWrapper<...>? result))
    return result;

This fired when block.Number < Head.Number && IsMainChain(block), skipping the UpdateMainChain call that de-canonicalises descendants. The existing guard below it already handles the true no-op case:

bool newHeadTheSameAsCurrentHead = _blockTree.Head!.Hash == newHeadBlock.Hash;
bool shouldUpdateHead = !newHeadTheSameAsCurrentHead && blocks is not null;
if (shouldUpdateHead)
    _blockTree.UpdateMainChain(blocks!, true, true);

This correctly avoids redundant work when FCU targets the current head while still processing ancestor FCUs.

Interesting finding: TaikoForkchoiceUpdatedHandler had already worked around this exact bug by overriding IsOnMainChainBehindHead to be a no-op (return true always). The PR correctly fixes the root cause in the base class instead.


Issues

1. The same early-return pattern survives in the !WasProcessed branch (line 170)

if (!blockInfo.WasProcessed)
{
    if (!IsOnMainChainBehindHead(newHeadBlock, forkchoiceState, out ... errorResult))
        return errorResult;   // ← still returns Valid here
    ...
    return ForkchoiceUpdatedV1Result.Syncing;
}

If a block exists in the chain with WasProcessed = false but is already marked canonical (possible in certain beacon-sync edge cases), this path also returns Valid without calling UpdateMainChain. This is a narrower scenario and likely only affects sync paths, but the same class of bug exists. Consider whether this branch also needs addressing.

2. IsOnMainChainBehindHead virtual method is now misleadingly positioned

The virtual method (lines 94–106) only affects the !WasProcessed path now. Its log message "already in canonical chain" combined with returning Valid is inconsistent with what that path does (all other branches in !WasProcessed return Syncing). Worth a follow-up cleanup or comment.


Test Coverage

Thorough. The regression test (fcu_to_ancestor_below_head_must_update_head_and_decanonicalise_descendants) captures the exact failure mode. The 18 BlockTreeTests unit tests cover:

  • Double reorgs and lower-height reorgs
  • SwapToMain with 3+ siblings
  • DB persistence after reload
  • IsMainChain / FindHeader / FindBlock consistency
  • wereProcessed = false marking
  • Best-difficulty fallback

Test structure and naming follow repo conventions correctly.


Verdict

Approve. The core fix is correct, minimal, and well-tested. The residual IsOnMainChainBehindHead usage in the !WasProcessed path (issue #1) is a pre-existing concern outside this PR's scope.

Copilot AI left a comment

Pull request overview

Fixes a forkchoice update edge case where FCU targeting an already-canonical ancestor below the current head was previously short-circuited, leaving the higher descendants incorrectly marked canonical and the head pointer unchanged. This aligns ForkchoiceUpdatedHandler.ApplyForkchoiceUpdate behavior with the intended canonical reorg semantics and adds regression/unit coverage around canonical marking.

Changes:

  • Remove the early-return optimization in ForkchoiceUpdatedHandler so UpdateMainChain runs even when FCU targets a canonical ancestor below head.
  • Add an Engine API regression test to ensure FCU to an ancestor moves head back and de-canonicalizes descendants.
  • Add a suite of BlockTree unit tests covering canonical marking/unmarking across reorg and lookup scenarios (including persistence/reload).

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 1 comment.

File Description
src/Nethermind/Nethermind.Merge.Plugin/Handlers/ForkchoiceUpdatedHandler.cs Removes the processed-branch early return so FCU to a canonical ancestor triggers main-chain update/de-canonicalization.
src/Nethermind/Nethermind.Merge.Plugin.Test/EngineModuleTests.V1.cs Adds an Engine API regression test reproducing and preventing the “ancestor FCU ignored” canonical mismatch.
src/Nethermind/Nethermind.Blockchain.Test/BlockTreeTests.cs Adds canonical-marking/unit tests for reorg, lookup consistency, and persistence/reload edge cases.


@smartprogrammer93 (Contributor) commented:

@claude review


claude bot commented Mar 20, 2026

Claude finished @smartprogrammer93's task in 5m 56s


Claude's Review

  • Gather context
  • Read rules files
  • Read changed files
  • Provide review feedback

Context

This is a different implementation from what the previous Claude review described. The current branch does not touch ForkchoiceUpdatedHandler.cs. Instead it fixes the root cause in BlockTree.UpdateMainChain and adds a startup healing path.


Summary

The fix is correct and well-targeted. The else branch in UpdateMainChain handles exactly the scenario where previousHeadNumber == lastNumber (stale head after wereProcessed=false sync), and it also correctly no-ops in the normal forward-advancing case. The HealCanonicalChain startup recovery is well-designed with atomic writes. Test coverage is thorough.


Core Fix (BlockTree.cs:1016–1034)

The else branch fires when previousHeadNumber <= lastNumber, covering two cases:

Bug scenario (previousHeadNumber == lastNumber): BlockDownloader calls UpdateMainChain([C], false) marking C canonical at H+1 without advancing Head (which stays at A at H). When FCU(B, same height H) arrives, the old previousHeadNumber > lastNumber check is false (H > H = false), so the unmark loop was skipped entirely. The else branch now scans upward from lastNumber+1, finds C's stale marker, and clears it. ✓

Normal advancing (previousHeadNumber < lastNumber): The else branch also fires here but quickly breaks — LoadLevel(lastNumber + 1) returns null or a non-canonical level on every normal block. The behavior is a no-op except for one extra DB read per block. ✓

The fix is also correct in the re-sync-after-crash case, where stale markers from a previous run are progressively cleared as the sync path re-marks blocks one by one.
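The no-op claim for the normal advancing case can be checked with a small Python sketch (illustrative only; load_level stands in for LoadLevel and counts lookups, mirroring the break condition `level is null || !level.HasBlockOnMainChain`):

```python
# Illustrative check that the upward scan is a single-lookup no-op when the
# chain advances normally (no stale markers above the new head).

lookups = 0

def load_level(levels, n):
    global lookups
    lookups += 1
    return levels.get(n)

def upward_scan(levels, start):
    n = start + 1
    while True:
        level = load_level(levels, n)
        if level is None or not level:
            break  # missing or non-canonical level: stop scanning
        levels[n] = False
        n += 1

levels = {n: True for n in range(0, 11)}  # clean chain, head at 10
upward_scan(levels, start=10)
print(lookups)  # 1
```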


Issues

1. Extra LoadLevel read on the block-processing hot path (minor)

Every UpdateMainChain call where previousHeadNumber <= lastNumber — which is every normal block-processing call — now performs one additional LoadLevel(lastNumber + 1) read. In the common case this immediately breaks (nothing found), so the overhead is a single key-value lookup per block. Given that LoadLevel results are likely cached, this is low cost in practice, but it is a measurable regression on the forward-processing hot path.

2. HealCanonicalChainOnStartup not auto-disabled after first run (usability)

The PR description says "use this flag once after observing a canonical-mismatch". If a user enables it and forgets to turn it off, HealCanonicalChain runs on every restart (idempotently, but wastefully scanning up to 8192 blocks). The existing info log ("Healing canonical chain from head...") communicates it's running, but it'd be safer to add a warning like "Consider setting --Init.HealCanonicalChainOnStartup false after recovery" after the heal completes.

3. Config items missing HiddenFromDocs (minor, coding style)

HealCanonicalChainOnStartup and HealCanonicalChainDepth are recovery-oriented one-time flags. Most similar diagnostic/technical flags in IInitConfig (e.g., DisableGcOnNewPayload, InRunnerTest, ExitOnBlockNumber) carry HiddenFromDocs = true to avoid surfacing them in standard docs. These should too.

4. Typo in test name (trivial)

UpdateMainChain_sync_marks_descendant_canonical_then_reorg_to_sibling_decanonalizes_it — "decanonalizes" should be "decanonicalizes".


Observations

  • BatchWrite atomicity: The HealCanonicalChain implementation correctly wraps both phases (upward clear + backward repair) in a single BatchWrite. A crash before Dispose() leaves the DB unchanged — well reasoned.

  • else branch in the sync path: When BlockDownloader calls UpdateMainChain([N+1], false) one block at a time, the else branch fires (since previousHeadNumber < lastNumber) and scans from N+2. If there's a pre-existing stale marker at N+2 from a crashed run, it will be cleared — then re-marked on the next sync call. Functionally correct, adds extra write traffic in re-sync scenarios but this is acceptable.

  • RepairMarkersBelow gracefully handles pruned chains: The parent is null guard with a Logger.Warn is correct. Good defensive coding.

  • ReadOnlyBlockTree throws, BlockTreeOverlay delegates: Both are correct. PerfTest/Program.cs stub no-op is consistent with how other unrelated methods are stubbed there.

  • Copilot's "Ahmad's change" comment: Already addressed — the test comment now says "RequireCanonical-by-number behavior change" which is neutral and appropriate.

  • 26 new tests: Well-structured. Scenarios are distinct enough that they don't mechanically differ by parameters alone, so not parameterizing them is reasonable under the repo test guidelines.


Verdict

Approve with minor suggestions. The core fix is correct, minimal, and properly tested. Items 1–4 are non-blocking — item 1 is a micro-optimization opportunity, items 2–4 are polish.

LukaszRozmej (Member) commented Mar 20, 2026

BTW, in ePBS we will get an FCU to a block and can then get an FCU to an ancestor (not sure if it is always the parent), and we need to support that. How does this fix handle that scenario? Is it compatible with this during sync?


Successfully merging this pull request may close these issues.

Gnosis chain - invalid block in canonical index
