
heimdall: consensus, eth: add Heimdall HTTP/gRPC/WS failover with recovery #2050

Open
pratikspatil024 wants to merge 13 commits into develop from psp-pos-3181

Conversation

@pratikspatil024
Member

@pratikspatil024 pratikspatil024 commented Feb 11, 2026

Description

  • Add transparent failover for all Heimdall client transports (HTTP, gRPC, WebSocket) so Bor stays operational when the primary Heimdall node goes down
  • All three flags (--bor.heimdall, --bor.heimdallgRPC, --bor.heimdallWS) now accept comma-separated lists of URLs for failover (e.g. "url1,url2,url3")
  • Introduce FailoverHeimdallClient that wraps N Heimdall endpoints, cascading through them on transport errors while staying sticky on the active endpoint until a cooldown-based probe recovers the primary
  • WS client (HeimdallWSClient) gains similar multi-URL failover with configurable attempt threshold and cooldown
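As a minimal sketch of the comma-separated flag handling described above (splitURLs is an illustrative helper name, not necessarily the PR's), the list parsing could look like:

```go
package main

import (
	"fmt"
	"strings"
)

// splitURLs shows how a comma-separated --bor.heimdall value can be split
// into an ordered failover list, dropping empty entries and whitespace.
// This is a sketch of the documented flag format, not the PR's exact code.
func splitURLs(s string) []string {
	var out []string
	for _, p := range strings.Split(s, ",") {
		if p = strings.TrimSpace(p); p != "" {
			out = append(out, p)
		}
	}
	return out
}

func main() {
	// primary first, failovers after, mirroring the "url1,url2,url3" format
	fmt.Println(splitURLs("http://heimdall1:1317, http://heimdall2:1317,"))
}
```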

Changes

  • Bugfix (non-breaking change that solves an issue)
  • Hotfix (change that solves an urgent issue, and requires immediate attention)
  • New feature (non-breaking change that adds functionality)
  • Breaking change (change that is not backwards-compatible and/or changes current functionality)
  • Changes only for a subset of nodes

Breaking changes

Please complete this section if any breaking changes have been made, otherwise delete it

Nodes audience

In case this PR includes changes that must be applied only to a subset of nodes, please specify how you handled it (e.g. by adding a flag with a default value...)

Checklist

  • I have added at least 2 reviewers or the whole pos-v1 team
  • I have added sufficient documentation in code
  • I will be resolving comments - if any - by pushing each fix in a separate commit and linking the commit hash in the comment reply
  • Created a task in Jira and informed the team for implementation in Erigon client (if applicable)
  • Includes RPC methods changes, and the Notion documentation has been updated

Cross repository changes

  • This PR requires changes to heimdall
    • If so, link the PR here:
  • This PR requires changes to matic-cli
    • If so, link the PR here:

Testing

  • I have added unit tests
  • I have added tests to CI
  • I have tested this code manually on local environment
  • I have tested this code manually on remote devnet using express-cli
  • I have tested this code manually on amoy
  • I have created new e2e tests into express-cli

Manual tests

Please complete this section with the steps you performed if you ran manual tests for this functionality, otherwise delete it

Additional comments

Please post additional comments in this section if you have them, otherwise delete it

@pratikspatil024 pratikspatil024 requested a review from a team February 11, 2026 04:58
@claude

claude bot commented Feb 11, 2026

Code review

Bug: Missing timeout for secondary client calls

File: consensus/bor/heimdall/failover_client.go

The primary client calls use a 30-second timeout via context.WithTimeout (lines 122-123 and 157-158), but the secondary client calls at lines 148, 153, and 178 use the original ctx directly without any timeout.

If a caller passes context.Background() (which has no deadline) and the secondary endpoint is also unresponsive, these calls will hang indefinitely. This defeats the purpose of having a failover mechanism since the system could still hang when both endpoints are problematic.

Affected lines:

  • Line 148: return fn(ctx, f.clients[1]) - after probe fails
  • Line 153: return fn(ctx, f.clients[1]) - on secondary, not yet time to probe
  • Line 178: return fn(ctx, f.clients[1]) - failover to secondary

Suggested fix: Apply the same timeout to secondary calls:

// For each secondary call, wrap with timeout:
subCtx, cancel := context.WithTimeout(ctx, f.attemptTimeout)
defer cancel()
return fn(subCtx, f.clients[1])

@codecov

codecov bot commented Feb 11, 2026

Codecov Report

❌ Patch coverage is 95.75290% with 11 lines in your changes missing coverage. Please review.
✅ Project coverage is 50.51%. Comparing base (6f40e51) to head (27f53b9).
⚠️ Report is 1 commit behind head on develop.

Files with missing lines Patch % Lines
consensus/bor/heimdall/failover_client.go 94.96% 6 Missing and 2 partials ⚠️
eth/ethconfig/config.go 96.15% 0 Missing and 2 partials ⚠️
consensus/bor/heimdallws/client.go 97.77% 1 Missing ⚠️
Additional details and impacted files

Impacted file tree graph

@@             Coverage Diff             @@
##           develop    #2050      +/-   ##
===========================================
+ Coverage    50.32%   50.51%   +0.19%     
===========================================
  Files          871      872       +1     
  Lines       150917   151151     +234     
===========================================
+ Hits         75946    76353     +407     
+ Misses       69911    69733     -178     
- Partials      5060     5065       +5     
Files with missing lines Coverage Δ
internal/cli/server/config.go 63.14% <ø> (ø)
internal/cli/server/flags.go 100.00% <100.00%> (ø)
consensus/bor/heimdallws/client.go 79.87% <97.77%> (+79.87%) ⬆️
eth/ethconfig/config.go 81.11% <96.15%> (+43.61%) ⬆️
consensus/bor/heimdall/failover_client.go 94.96% <94.96%> (ø)

... and 17 files with indirect coverage changes



@pratikspatil024
Member Author

Quoting the review above ("Bug: Missing timeout for secondary client calls" in consensus/bor/heimdall/failover_client.go):

@claude - This is intentional. The 30s timeout on primary exists specifically to bound the failover decision.
Once we're on secondary, there's nothing left to switch to, so a timeout would just cause unnecessary failures.

Adding a timeout on secondary would actually be harmful in two ways:

  1. Both endpoints down - the call fails after 30s, but the caller (FetchWithRetry) just retries the whole thing. Each retry cycle would now burn 60s (30s primary + 30s secondary) for no benefit.
  2. Secondary is healthy but slow - a large state sync query that legitimately takes >30s would get killed prematurely.

Callers in Bor never pass context.Background(); they always use contexts tied to the node lifecycle (via closeCh or consensus engine contexts), so there is always a cancellation path. Added a comment clarifying this rationale.

@claude

claude bot commented Feb 11, 2026

Claude finished @pratikspatil024's task in 46s


Reviewing Design Decision

  • Read the failover client implementation to understand the context handling
  • Verify the caller patterns and context usage
  • Review the added comment clarifying the rationale

Thanks for the explanation! I've verified the design decision and the new comment at lines 148-151.

Your reasoning is sound:

  1. Timeout purpose is clear: The 30s timeout on primary bounds the failover decision window. Once on secondary, there's no further fallback, so no additional timeout is needed.

  2. Caller contexts have cancellation: Looking at the code, callers use:

    • Contexts with cancel functions (e.g., span_store.go:57 creates a cancellable context)
    • The closeCh shutdown channel in FetchWithRetry (lines 305-308 in client.go)
  3. Retry semantics make sense: If both endpoints are down, FetchWithRetry already handles retries at lines 294-328. Adding a secondary timeout would just waste 30s per retry cycle.

  4. The comment is helpful: The comment at lines 148-151 clearly explains the rationale for future maintainers.

One minor note: I did see some context.Background() usages in bor.go and snapshot.go (e.g., lines 586, 696, 775, 985, 1493, 1514, 1695, 1697). These aren't going through FetchWithRetry directly in most cases, but the tests in failover_client_test.go do use context.Background(). This is fine for tests since they're short-lived and controlled, but it validates why the comment is important—it documents the assumption that production callers use lifecycle-bound contexts.



@pratikspatil024 pratikspatil024 changed the title from "heimdall: added secondary URL failover for Heimdall RPC" to "heimdall: consensus, eth: add Heimdall HTTP/gRPC/WS failover with recovery" Feb 12, 2026
@pratikspatil024
Member Author

@claude - I updated the PR by extending the failover logic to the gRPC and WS clients. It also now accepts a list of URLs instead of just a secondary URL. Please review this PR again.

@claude

claude bot commented Feb 12, 2026

Claude finished @pratikspatil024's task in 1m 16s


Code Review

  • Read failover client implementation (consensus/bor/heimdall/failover_client.go)
  • Read failover client tests (consensus/bor/heimdall/failover_client_test.go)
  • Read WS client changes (consensus/bor/heimdallws/client.go)
  • Read WS client tests (consensus/bor/heimdallws/client_test.go)
  • Read config changes (eth/ethconfig/config.go)
  • Read config tests (eth/ethconfig/config_test.go)
  • Read CLI flag changes (cmd/utils/bor_flags.go, internal/cli/server/)
  • Provide comprehensive review feedback

Overview

This PR adds a well-designed failover mechanism for Heimdall clients across all three transports (HTTP, gRPC, WebSocket). The implementation is solid with good test coverage (~96% patch coverage). Here's my detailed review:

Strengths

  1. Clean API design: The extension of existing flags to accept comma-separated URLs is backwards-compatible and intuitive.

  2. Sticky behavior with recovery probing: The failover client correctly stays on a working secondary endpoint while periodically probing the primary (after cooldown), which prevents unnecessary switching.

  3. Smart error classification: isFailoverError() distinguishes between transport errors (worth retrying on another endpoint) and application-level errors like ErrServiceUnavailable or ErrShutdownDetected (not worth retrying).

  4. Generic implementation: Using Go generics for callWithFailover[T any] and cascadeClients[T any] keeps the code DRY.

  5. Comprehensive tests: 800+ lines of tests covering normal operation, failover scenarios, probe-back behavior, and edge cases.

Issues and Suggestions

Medium: Potential nil pointer in WS client Close()

File: consensus/bor/heimdallws/client.go:256-260

func (c *HeimdallWSClient) Close() error {
    c.mu.Lock()
    defer c.mu.Unlock()
    return c.conn.Close()
}

If Close() is called before a successful connection (e.g., during startup failure), c.conn will be nil and this will panic. Consider adding a nil check:

func (c *HeimdallWSClient) Close() error {
    c.mu.Lock()
    defer c.mu.Unlock()
    if c.conn == nil {
        return nil
    }
    return c.conn.Close()
}

Fix this →

Low: WS failover counter resets on URL switch, not on connection success

File: consensus/bor/heimdallws/client.go:125-131

When switching URLs at lines 129-131, primaryAttempts is reset to 0. However, if the new URL also fails repeatedly, it will switch again after primaryAttempts failures. This behavior is fine, but the name primaryAttempts is misleading since the counter applies to all URLs, not just the primary. Consider renaming it to currentURLAttempts for clarity.

Low: Config client initialization could use index bounds more defensively

File: eth/ethconfig/config.go:359

httpURL := httpURLs[min(i, len(httpURLs)-1)]

This falls back to the last HTTP URL when there are more gRPC addresses than HTTP URLs. This behavior may not be obvious to operators. Consider adding a log message when this fallback occurs, or documenting this in the flag description.

Nit: Consider exposing cooldown/timeout as config options

The failover cooldown (2 minutes) and attempt timeout (30 seconds) are hardcoded. For production flexibility, consider exposing these as optional CLI flags with sensible defaults. This would allow operators to tune behavior based on their infrastructure.

Test Coverage Observations

The Codecov report shows 6 missing lines and 2 partials in failover_client.go. Looking at the code, these are likely:

  • Error logging paths that are hard to trigger in unit tests
  • The ErrNoResponse branch (line 257-259) which requires specific server behavior

These gaps are acceptable given the overall coverage.

Summary

This is a well-implemented feature that significantly improves Bor's resilience when connecting to Heimdall. The code is clean, the tests are thorough, and the design decisions (like the sticky-with-probe-back behavior) are sound.

Recommended action: Address the nil pointer issue in HeimdallWSClient.Close() before merging.


// FailoverHeimdallClient wraps N heimdall clients (primary at index 0, failovers
// at 1..N-1) and transparently cascades through them when the active client is
// unreachable. After a cooldown period it probes the primary again.
type FailoverHeimdallClient struct {
Contributor

Would MultiHeimdallClient be a better name? This structure holds not only the failover clients but also the primary client.


// If on a non-primary client and cooldown has elapsed, probe primary
if shouldProbe {
subCtx, cancel := context.WithTimeout(ctx, f.attemptTimeout)
Contributor

This may be risky if the primary is still not up after the cooldown, resulting in lag. Instead of probing on each new request, we can probe the primary in a background goroutine and, when it is back up, update the active client accordingly.

// Cascade through clients [1, 2, ..., N-1]
log.Warn("Heimdall failover: primary failed, cascading to next client", "err", err)

return cascadeClients(f, ctx, fn, 0, err)
Contributor

Same as above: by the time we discover a failure in the backup client, it might be too late to switch to the next one if the request times out, causing a span rotation. We can probe the active client in a background goroutine, which would automatically update the active client if something goes wrong with the current one.


var out []string
for _, p := range parts {
p = strings.TrimSpace(p)
Collaborator

Shall we verify these are valid URLs, or leave it up to the operator?

} else {
heimdallClient = grpcClient
heimdallClient = heimdall.NewFailoverHeimdallClient(heimdallClients...)
log.Info("Heimdall failover enabled", "endpoints", len(heimdallClients))
Collaborator

If we rename from failover to multiple, it is worth updating this log as well, as it could be misleading.

}

if len(wsAddrs) > 1 {
log.Info("Heimdall WS failover enabled", "endpoints", len(wsAddrs))
Collaborator

Same as before
