
heimdall: consensus, eth: add Heimdall HTTP/gRPC/WS failover with recovery #2050

Open
pratikspatil024 wants to merge 13 commits into develop from psp-pos-3181

Conversation

@pratikspatil024
Member

@pratikspatil024 pratikspatil024 commented Feb 11, 2026

Description

  • Add transparent failover for all Heimdall client transports (HTTP, gRPC, WebSocket) so Bor stays operational when the primary Heimdall node goes down
  • All three flags (--bor.heimdall, --bor.heimdallgRPC, --bor.heimdallWS) now accept comma-separated lists of URLs for failover (e.g. "url1,url2,url3")
  • Introduce FailoverHeimdallClient that wraps N Heimdall endpoints, cascading through them on transport errors while staying sticky on the active endpoint until a cooldown-based probe recovers the primary
  • WS client (HeimdallWSClient) gains similar multi-URL failover with configurable attempt threshold and cooldown
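As a minimal sketch of the comma-separated flag handling described above (splitURLs is an illustrative helper name, not necessarily the PR's), the list parsing could look like:

```go
package main

import (
	"fmt"
	"strings"
)

// splitURLs shows how a comma-separated --bor.heimdall value can be split
// into an ordered failover list, dropping empty entries and whitespace.
// This is a sketch of the documented flag format, not the PR's exact code.
func splitURLs(s string) []string {
	var out []string
	for _, p := range strings.Split(s, ",") {
		if p = strings.TrimSpace(p); p != "" {
			out = append(out, p)
		}
	}
	return out
}

func main() {
	// primary first, failovers after, mirroring the "url1,url2,url3" format
	fmt.Println(splitURLs("http://heimdall1:1317, http://heimdall2:1317,"))
}
```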

Changes

  • Bugfix (non-breaking change that solves an issue)
  • Hotfix (change that solves an urgent issue, and requires immediate attention)
  • New feature (non-breaking change that adds functionality)
  • Breaking change (change that is not backwards-compatible and/or changes current functionality)
  • Changes only for a subset of nodes

Breaking changes

Please complete this section if any breaking changes have been made, otherwise delete it

Nodes audience

In case this PR includes changes that must be applied only to a subset of nodes, please specify how you handled it (e.g. by adding a flag with a default value...)

Checklist

  • I have added at least 2 reviewers or the whole pos-v1 team
  • I have added sufficient documentation in code
  • I will be resolving comments - if any - by pushing each fix in a separate commit and linking the commit hash in the comment reply
  • Created a task in Jira and informed the team for implementation in Erigon client (if applicable)
  • Includes RPC methods changes, and the Notion documentation has been updated

Cross repository changes

  • This PR requires changes to heimdall
    • If so, link the PR here:
  • This PR requires changes to matic-cli
    • If so, link the PR here:

Testing

  • I have added unit tests
  • I have added tests to CI
  • I have tested this code manually on local environment
  • I have tested this code manually on remote devnet using express-cli
  • I have tested this code manually on amoy
  • I have created new e2e tests into express-cli

Manual tests

Please complete this section with the steps you performed if you ran manual tests for this functionality, otherwise delete it

Additional comments

Please post additional comments in this section if you have them, otherwise delete it

@pratikspatil024 pratikspatil024 requested a review from a team February 11, 2026 04:58
@claude

claude bot commented Feb 11, 2026

Code review

Bug: Missing timeout for secondary client calls

File: consensus/bor/heimdall/failover_client.go

The primary client calls use a 30-second timeout via context.WithTimeout (lines 122-123 and 157-158), but the secondary client calls at lines 148, 153, and 178 use the original ctx directly without any timeout.

If a caller passes context.Background() (which has no deadline) and the secondary endpoint is also unresponsive, these calls will hang indefinitely. This defeats the purpose of having a failover mechanism since the system could still hang when both endpoints are problematic.

Affected lines:

  • Line 148: return fn(ctx, f.clients[1]) - after probe fails
  • Line 153: return fn(ctx, f.clients[1]) - on secondary, not yet time to probe
  • Line 178: return fn(ctx, f.clients[1]) - failover to secondary

Suggested fix: Apply the same timeout to secondary calls:

// For each secondary call, wrap with timeout:
subCtx, cancel := context.WithTimeout(ctx, f.attemptTimeout)
defer cancel()
return fn(subCtx, f.clients[1])

@codecov

codecov bot commented Feb 11, 2026

Codecov Report

❌ Patch coverage is 95.75290% with 11 lines in your changes missing coverage. Please review.
✅ Project coverage is 50.51%. Comparing base (6f40e51) to head (27f53b9).
⚠️ Report is 1 commit behind head on develop.

Files with missing lines Patch % Lines
consensus/bor/heimdall/failover_client.go 94.96% 6 Missing and 2 partials ⚠️
eth/ethconfig/config.go 96.15% 0 Missing and 2 partials ⚠️
consensus/bor/heimdallws/client.go 97.77% 1 Missing ⚠️
Additional details and impacted files

Impacted file tree graph

@@             Coverage Diff             @@
##           develop    #2050      +/-   ##
===========================================
+ Coverage    50.32%   50.51%   +0.19%     
===========================================
  Files          871      872       +1     
  Lines       150917   151151     +234     
===========================================
+ Hits         75946    76353     +407     
+ Misses       69911    69733     -178     
- Partials      5060     5065       +5     
Files with missing lines Coverage Δ
internal/cli/server/config.go 63.14% <ø> (ø)
internal/cli/server/flags.go 100.00% <100.00%> (ø)
consensus/bor/heimdallws/client.go 79.87% <97.77%> (+79.87%) ⬆️
eth/ethconfig/config.go 81.11% <96.15%> (+43.61%) ⬆️
consensus/bor/heimdall/failover_client.go 94.96% <94.96%> (ø)

... and 17 files with indirect coverage changes



@pratikspatil024
Member Author

Quoting the review above ("Bug: Missing timeout for secondary client calls" in consensus/bor/heimdall/failover_client.go):

@claude - This is intentional. The 30s timeout on primary exists specifically to bound the failover decision.
Once we're on secondary, there's nothing left to switch to, so a timeout would just cause unnecessary failures.

Adding a timeout on secondary would actually be harmful in two ways:

  1. Both endpoints down - the call fails after 30s, but the caller (FetchWithRetry) just retries the whole thing. Each retry cycle would now burn 60s (30s primary + 30s secondary) for no benefit.
  2. Secondary is healthy but slow - a large state sync query that legitimately takes >30s would get killed prematurely.

Callers in Bor never pass context.Background(); they always use contexts tied to the node lifecycle (via closeCh or consensus engine contexts), so there is always a cancellation path. Added a comment clarifying this rationale.

@claude

claude bot commented Feb 11, 2026

Claude finished @pratikspatil024's task in 46s


Reviewing Design Decision

  • Read the failover client implementation to understand the context handling
  • Verify the caller patterns and context usage
  • Review the added comment clarifying the rationale

Thanks for the explanation! I've verified the design decision and the new comment at lines 148-151.

Your reasoning is sound:

  1. Timeout purpose is clear: The 30s timeout on primary bounds the failover decision window. Once on secondary, there's no further fallback, so no additional timeout is needed.

  2. Caller contexts have cancellation: Looking at the code, callers use:

    • Contexts with cancel functions (e.g., span_store.go:57 creates a cancellable context)
    • The closeCh shutdown channel in FetchWithRetry (lines 305-308 in client.go)
  3. Retry semantics make sense: If both endpoints are down, FetchWithRetry already handles retries at lines 294-328. Adding a secondary timeout would just waste 30s per retry cycle.

  4. The comment is helpful: The comment at lines 148-151 clearly explains the rationale for future maintainers.

One minor note: I did see some context.Background() usages in bor.go and snapshot.go (e.g., lines 586, 696, 775, 985, 1493, 1514, 1695, 1697). These aren't going through FetchWithRetry directly in most cases, but the tests in failover_client_test.go do use context.Background(). This is fine for tests since they're short-lived and controlled, but it validates why the comment is important—it documents the assumption that production callers use lifecycle-bound contexts.



@pratikspatil024 pratikspatil024 changed the title from "heimdall: added secondary URL failover for Heimdall RPC" to "heimdall: consensus, eth: add Heimdall HTTP/gRPC/WS failover with recovery" Feb 12, 2026
@pratikspatil024
Member Author

@claude - I updated the PR by extending the failover logic to the gRPC and WS clients. It also now accepts a list of URLs instead of just a secondary URL. Please review this PR again.

@claude

claude bot commented Feb 12, 2026

Claude finished @pratikspatil024's task in 1m 16s


Code Review

  • Read failover client implementation (consensus/bor/heimdall/failover_client.go)
  • Read failover client tests (consensus/bor/heimdall/failover_client_test.go)
  • Read WS client changes (consensus/bor/heimdallws/client.go)
  • Read WS client tests (consensus/bor/heimdallws/client_test.go)
  • Read config changes (eth/ethconfig/config.go)
  • Read config tests (eth/ethconfig/config_test.go)
  • Read CLI flag changes (cmd/utils/bor_flags.go, internal/cli/server/)
  • Provide comprehensive review feedback

Overview

This PR adds a well-designed failover mechanism for Heimdall clients across all three transports (HTTP, gRPC, WebSocket). The implementation is solid with good test coverage (~96% patch coverage). Here's my detailed review:

Strengths

  1. Clean API design: The extension of existing flags to accept comma-separated URLs is backwards-compatible and intuitive.

  2. Sticky behavior with recovery probing: The failover client correctly stays on a working secondary endpoint while periodically probing the primary (after cooldown), which prevents unnecessary switching.

  3. Smart error classification: isFailoverError() distinguishes between transport errors (worth retrying on another endpoint) and application-level errors like ErrServiceUnavailable or ErrShutdownDetected (not worth retrying).

  4. Generic implementation: Using Go generics for callWithFailover[T any] and cascadeClients[T any] keeps the code DRY.

  5. Comprehensive tests: 800+ lines of tests covering normal operation, failover scenarios, probe-back behavior, and edge cases.

Issues and Suggestions

Medium: Potential nil pointer in WS client Close()

File: consensus/bor/heimdallws/client.go:256-260

func (c *HeimdallWSClient) Close() error {
    c.mu.Lock()
    defer c.mu.Unlock()
    return c.conn.Close()
}

If Close() is called before a successful connection (e.g., during startup failure), c.conn will be nil and this will panic. Consider adding a nil check:

func (c *HeimdallWSClient) Close() error {
    c.mu.Lock()
    defer c.mu.Unlock()
    if c.conn == nil {
        return nil
    }
    return c.conn.Close()
}

Fix this →

Low: WS failover counter resets on URL switch, not on connection success

File: consensus/bor/heimdallws/client.go:125-131

When switching URLs at lines 129-131, primaryAttempts is reset to 0. However, if the new URL also fails repeatedly, it will switch again after primaryAttempts failures. This behavior is fine, but the name primaryAttempts is misleading since the counter applies to all URLs, not just the primary. Consider renaming it to currentURLAttempts for clarity.

Low: Config client initialization could use index bounds more defensively

File: eth/ethconfig/config.go:359

httpURL := httpURLs[min(i, len(httpURLs)-1)]

This falls back to the last HTTP URL when there are more gRPC addresses than HTTP URLs. This behavior may not be obvious to operators. Consider adding a log message when this fallback occurs, or documenting this in the flag description.

Nit: Consider exposing cooldown/timeout as config options

The failover cooldown (2 minutes) and attempt timeout (30 seconds) are hardcoded. For production flexibility, consider exposing these as optional CLI flags with sensible defaults. This would allow operators to tune behavior based on their infrastructure.

Test Coverage Observations

The Codecov report shows 6 missing lines and 2 partials in failover_client.go. Looking at the code, these are likely:

  • Error logging paths that are hard to trigger in unit tests
  • The ErrNoResponse branch (line 257-259) which requires specific server behavior

These gaps are acceptable given the overall coverage.

Summary

This is a well-implemented feature that significantly improves Bor's resilience when connecting to Heimdall. The code is clean, the tests are thorough, and the design decisions (like the sticky-with-probe-back behavior) are sound.

Recommended action: Address the nil pointer issue in HeimdallWSClient.Close() before merging.


// FailoverHeimdallClient wraps N heimdall clients (primary at index 0, failovers
// at 1..N-1) and transparently cascades through them when the active client is
// unreachable. After a cooldown period it probes the primary again.
type FailoverHeimdallClient struct {
Contributor

Would MultiHeimdallClient be a better name? This structure holds not only the failover clients but also the primary client.


// If on a non-primary client and cooldown has elapsed, probe primary
if shouldProbe {
subCtx, cancel := context.WithTimeout(ctx, f.attemptTimeout)
Contributor

This may be risky if the primary is still not up after the cooldown, resulting in lag. Instead of probing on each new request, we can probe the primary in a background goroutine and, when it is back up, update the active client accordingly.

// Cascade through clients [1, 2, ..., N-1]
log.Warn("Heimdall failover: primary failed, cascading to next client", "err", err)

return cascadeClients(f, ctx, fn, 0, err)
Contributor

Same as above: by the time we discover a failure in the backup client, it might be too late to switch to the next one if the request times out, causing a span rotation. We can probe the active client in a background goroutine, which would automatically update the active client if something goes wrong with the current one.


var out []string
for _, p := range parts {
p = strings.TrimSpace(p)
Collaborator

Shall we verify these are valid URLs, or leave it up to the operator?

} else {
heimdallClient = grpcClient
heimdallClient = heimdall.NewFailoverHeimdallClient(heimdallClients...)
log.Info("Heimdall failover enabled", "endpoints", len(heimdallClients))
Collaborator

If we rename from failover to multiple, it is worth updating this log as well, as it could be misleading.

}

if len(wsAddrs) > 1 {
log.Info("Heimdall WS failover enabled", "endpoints", len(wsAddrs))
Collaborator

Same as before
