heimdall: consensus, eth: add Heimdall HTTP/gRPC/WS failover with recovery #2050
pratikspatil024 wants to merge 13 commits into develop from
Conversation
Code review

Bug: Missing timeout for secondary client calls
File: consensus/bor/heimdall/failover_client.go

The primary client calls are bounded by a 30-second timeout via `context.WithTimeout(ctx, f.attemptTimeout)`, but the secondary client calls are not. If a caller passes a context without a deadline, a call against the secondary client can block indefinitely.

Suggested fix: apply the same timeout to secondary calls:

```go
// For each secondary call, wrap with timeout:
subCtx, cancel := context.WithTimeout(ctx, f.attemptTimeout)
defer cancel()
return fn(subCtx, f.clients[1])
```
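To make the suggestion concrete beyond the two-client case, here is a minimal sketch of cascading through N clients with a per-attempt timeout. The type and function names (`failoverClient`, `cascade`, `client`) are hypothetical, not the PR's actual code:

```go
package heimdall

import (
	"context"
	"time"
)

// client is a placeholder for the Heimdall client interface.
type client interface{}

type failoverClient struct {
	clients        []client
	attemptTimeout time.Duration // e.g. 30 * time.Second
}

// cascade tries each client in order, bounding every attempt (primary and
// secondary alike) with attemptTimeout so a caller context without a deadline
// can never hang on a single endpoint.
func cascade[T any](ctx context.Context, f *failoverClient, fn func(context.Context, client) (T, error)) (T, error) {
	var zero T
	var lastErr error
	for _, c := range f.clients {
		subCtx, cancel := context.WithTimeout(ctx, f.attemptTimeout)
		res, err := fn(subCtx, c)
		cancel()
		if err == nil {
			return res, nil
		}
		lastErr = err
		if ctx.Err() != nil { // the caller's own context expired or was cancelled
			return zero, ctx.Err()
		}
	}
	return zero, lastErr
}
```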
Codecov Report

❌ Patch coverage is

Additional details and impacted files

@@ Coverage Diff @@
## develop #2050 +/- ##
===========================================
+ Coverage 50.32% 50.51% +0.19%
===========================================
Files 871 872 +1
Lines 150917 151151 +234
===========================================
+ Hits 75946 76353 +407
+ Misses 69911 69733 -178
- Partials 5060 5065 +5
... and 17 files with indirect coverage changes
@claude - This is intentional. The 30s timeout on primary exists specifically to bound the failover decision. Adding a timeout on secondary would actually be harmful in two ways:
Callers in Bor never pass a context without a deadline.
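For context, here is a minimal sketch of the calling pattern being described, assuming (as stated above) that Bor call sites always derive a deadline-bounded context; `fetchSpan` is a hypothetical stand-in for a real call site:

```go
package main

import (
	"context"
	"fmt"
	"time"
)

// fetchSpan stands in for any Heimdall query issued through the failover
// client; the real call sites and names differ.
func fetchSpan(ctx context.Context) error {
	if _, ok := ctx.Deadline(); !ok {
		return fmt.Errorf("expected a deadline-bounded context")
	}
	// ... perform the request; the failover client can cascade within
	// whatever time the caller's deadline allows.
	return nil
}

func main() {
	// The caller bounds each request with its own deadline, so even the
	// secondary attempt cannot outlive the caller's budget.
	ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
	defer cancel()
	if err := fetchSpan(ctx); err != nil {
		fmt.Println("request failed:", err)
	}
}
```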
Claude finished @pratikspatil024's task in 46s.

Reviewing Design Decision
Thanks for the explanation! I've verified the design decision and the new comment at lines 148-151. Your reasoning is sound.

One minor note: I did see some
@claude - I updated the PR by extending the failover logic to the gRPC and WS clients. It now accepts a list of URLs (multiple) instead of just a secondary URL. Please review this PR again.
Claude finished @pratikspatil024's task in 1m 16s.

Code Review
Overview

This PR adds a well-designed failover mechanism for Heimdall clients across all three transports (HTTP, gRPC, WebSocket). The implementation is solid with good test coverage (~96% patch coverage). Here's my detailed review.

Strengths
Issues and Suggestions

Medium: Potential nil pointer in WS client Close()

```go
func (c *HeimdallWSClient) Close() error {
	c.mu.Lock()
	defer c.mu.Unlock()
	return c.conn.Close()
}
```

If `c.conn` is nil (for example, when the client is closed before a connection was ever established), this panics. Consider a nil check:

```go
func (c *HeimdallWSClient) Close() error {
	c.mu.Lock()
	defer c.mu.Unlock()
	if c.conn == nil {
		return nil
	}
	return c.conn.Close()
}
```

Low: WS failover counter resets on URL switch, not on connection success

When switching URLs at lines 129-131, the attempt counter is reset on the switch itself rather than after a successful connection.

Low: Config client initialization could use index bounds more defensively

```go
httpURL := httpURLs[min(i, len(httpURLs)-1)]
```

This falls back to the last HTTP URL when there are more gRPC addresses than HTTP URLs. This behavior may not be obvious to operators. Consider adding a log message when this fallback occurs (see the sketch after this review), or documenting it in the flag description.

Nit: Consider exposing cooldown/timeout as config options

The failover cooldown (2 minutes) and attempt timeout (30 seconds) are hardcoded. For production flexibility, consider exposing these as optional CLI flags with sensible defaults. This would allow operators to tune behavior based on their infrastructure.

Test Coverage Observations

The Codecov report shows 6 missing lines and 2 partials. These gaps are acceptable given the overall coverage.

Summary

This is a well-implemented feature that significantly improves Bor's resilience when connecting to Heimdall. The code is clean, the tests are thorough, and the design decisions (like the sticky-with-probe-back behavior) are sound.

Recommended action: Address the nil pointer issue in the WS client's Close().
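Following up on the fallback note above, here is a minimal sketch of logging when gRPC endpoints outnumber HTTP URLs; the endpoint values and variable names are hypothetical, not taken from the PR:

```go
package main

import "log"

func main() {
	// Hypothetical endpoint lists for illustration.
	httpURLs := []string{"http://heimdall-a:1317", "http://heimdall-b:1317"}
	grpcAddrs := []string{"heimdall-a:3476", "heimdall-b:3476", "heimdall-c:3476"}

	for i := range grpcAddrs {
		// Same fallback as the PR: reuse the last HTTP URL once we run out.
		// min is the Go 1.21+ built-in.
		httpURL := httpURLs[min(i, len(httpURLs)-1)]
		if i >= len(httpURLs) {
			log.Printf("more Heimdall gRPC endpoints than HTTP URLs; reusing %s for gRPC endpoint %d", httpURL, i)
		}
		_ = httpURL // build the paired client here
	}
}
```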
```go
// FailoverHeimdallClient wraps N heimdall clients (primary at index 0, failovers
// at 1..N-1) and transparently cascades through them when the active client is
// unreachable. After a cooldown period it probes the primary again.
type FailoverHeimdallClient struct {
```
Does `MultiHeimdallClient` sound better? This structure contains not only the failover clients but also the primary client.
```go
// If on a non-primary client and cooldown has elapsed, probe primary
if shouldProbe {
	subCtx, cancel := context.WithTimeout(ctx, f.attemptTimeout)
```
This may be risky if the primary is still not up after the cooldown, resulting in a lag. Instead of probing on receipt of a new request, we could probe the primary in a background goroutine and, when it is back up, update the active client accordingly.
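A rough sketch of that background-probe idea, with hypothetical names and a placeholder health check rather than the PR's actual types:

```go
package heimdall

import (
	"context"
	"sync/atomic"
	"time"
)

type prober struct {
	active       atomic.Int32                // index of the client currently in use
	cooldown     time.Duration               // e.g. 2 * time.Minute
	probeTimeout time.Duration               // e.g. 30 * time.Second
	pingPrimary  func(context.Context) error // placeholder health check against clients[0]
}

// probePrimaryLoop runs in its own goroutine and, while a non-primary client is
// active, periodically pings the primary; once the primary responds it flips
// the active index back to 0, so request paths never pay the probing latency.
func (p *prober) probePrimaryLoop(ctx context.Context) {
	ticker := time.NewTicker(p.cooldown)
	defer ticker.Stop()
	for {
		select {
		case <-ctx.Done():
			return
		case <-ticker.C:
			if p.active.Load() == 0 {
				continue // already on primary, nothing to do
			}
			probeCtx, cancel := context.WithTimeout(ctx, p.probeTimeout)
			err := p.pingPrimary(probeCtx)
			cancel()
			if err == nil {
				p.active.Store(0) // primary recovered; switch back
			}
		}
	}
}
```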
```go
// Cascade through clients [1, 2, ..., N-1]
log.Warn("Heimdall failover: primary failed, cascading to next client", "err", err)

return cascadeClients(f, ctx, fn, 0, err)
```
Same as above: by the time we discover a failure in a backup client, it might be too late to switch over to the next one if the request times out, causing a span rotation. We could continuously probe the active client in a background goroutine, which would automatically update the active client if something goes wrong with the current one.
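In the same spirit as the sketch above, a health monitor on the currently active client could advance the index on failure; again the names and the ping hook are hypothetical:

```go
package heimdall

import (
	"context"
	"sync/atomic"
	"time"
)

type activeMonitor struct {
	active      atomic.Int32
	numClients  int32
	interval    time.Duration                      // how often to check the active client
	pingTimeout time.Duration                      // bound for each health check
	ping        func(context.Context, int32) error // placeholder: ping client at index
}

// watchActive periodically health-checks whichever client is active and, on
// failure, advances to the next index so later requests never have to wait for
// an in-band timeout before failing over.
func (m *activeMonitor) watchActive(ctx context.Context) {
	ticker := time.NewTicker(m.interval)
	defer ticker.Stop()
	for {
		select {
		case <-ctx.Done():
			return
		case <-ticker.C:
			cur := m.active.Load()
			pingCtx, cancel := context.WithTimeout(ctx, m.pingTimeout)
			err := m.ping(pingCtx, cur)
			cancel()
			if err != nil {
				// Only advance if no other goroutine has already switched.
				m.active.CompareAndSwap(cur, (cur+1)%m.numClients)
			}
		}
	}
}
```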
```go
var out []string
for _, p := range parts {
	p = strings.TrimSpace(p)
```
Shall we verify these are valid URLs, or leave it up to the operator?
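If validation is preferred, a minimal sketch using net/url could look like the following (`parseEndpoints` is a hypothetical helper, and gRPC addresses without a scheme would need a looser check):

```go
package main

import (
	"fmt"
	"net/url"
	"strings"
)

// parseEndpoints splits a comma-separated flag value and rejects entries that
// do not parse as absolute URLs with a host.
func parseEndpoints(raw string) ([]string, error) {
	var out []string
	for _, p := range strings.Split(raw, ",") {
		p = strings.TrimSpace(p)
		if p == "" {
			continue
		}
		u, err := url.Parse(p)
		if err != nil || u.Scheme == "" || u.Host == "" {
			return nil, fmt.Errorf("invalid Heimdall URL %q", p)
		}
		out = append(out, p)
	}
	return out, nil
}

func main() {
	urls, err := parseEndpoints("http://heimdall-a:1317, http://heimdall-b:1317")
	fmt.Println(urls, err)
}
```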
```go
} else {
	heimdallClient = grpcClient
	heimdallClient = heimdall.NewFailoverHeimdallClient(heimdallClients...)
	log.Info("Heimdall failover enabled", "endpoints", len(heimdallClients))
```
If we change the naming to multiple instead of failover, it would be worth updating this log as well. It could be misleading otherwise.
```go
}

if len(wsAddrs) > 1 {
	log.Info("Heimdall WS failover enabled", "endpoints", len(wsAddrs))
```
Description
- Heimdall endpoint flags (`--bor.heimdall`, `--bor.heimdallgRPC`, `--bor.heimdallWS`) now accept comma-separated lists of URLs for failover (e.g. `"url1,url2,url3"`)
- New `FailoverHeimdallClient` that wraps N Heimdall endpoints, cascading through them on transport errors while staying sticky on the active endpoint until a cooldown-based probe recovers the primary
- The WebSocket client (`HeimdallWSClient`) gains similar multi-URL failover with configurable attempt threshold and cooldown