Skip to content

Conversation

@yashnevatia
Copy link
Contributor

@yashnevatia yashnevatia commented Jul 17, 2025

https://smartcontract-it.atlassian.net/browse/PLEX-1596

If the rpc is a proxy rpc, then it will be the only node in the node list (it will always be the only available node).
In such cases we dont want to apply the logic where we dont disconnect to an out of sync node if its the only available node.
BCY wants us to reconnect to the proxy and they will return a better rpc to us.
So we are adding a feature flag which will tell us if its a load balanced rpc or a normal rpc.

  • For aliveLoop()

    • if pollingThreshold breached or polling out of sync -> reconnect if rpc proxy
    • if no new heads/finalizedHeads threashold breached -> reconnect if rpc proxy
  • For outOfSyncLoop()

    • no new heads/finalizedHeads threashold breached -> reconnect if rpc proxy

@yashnevatia yashnevatia requested a review from a team as a code owner July 17, 2025 12:04
@github-actions
Copy link

👋 yashnevatia, thanks for creating this pull request!

To help reviewers, please consider creating future PRs as drafts first. This allows you to self-review and make any final changes before notifying the team.

Once you're ready, you can mark it as "Ready for review" to request feedback. Thanks!

@yashnevatia yashnevatia marked this pull request as draft July 17, 2025 12:05
@yashnevatia yashnevatia marked this pull request as ready for review July 18, 2025 15:59
@yashnevatia yashnevatia changed the title rpc proxy flag Add IsRPCPRoxy flag Jul 18, 2025
@yashnevatia yashnevatia changed the title Add IsRPCPRoxy flag Add IsLoadBalancedRPC flag Jul 22, 2025
Copy link
Contributor

@Unheilbar Unheilbar left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, I'll wait for corresponding PRs with updates for consumers

@Unheilbar Unheilbar self-requested a review July 28, 2025 14:38
@yashnevatia yashnevatia merged commit d49ca2e into main Jul 28, 2025
23 checks passed
@yashnevatia yashnevatia deleted the rpc-proxy-flag branch July 28, 2025 14:42
pollError := errors.New("failed to get ClientVersion")
rpc.On("ClientVersion", mock.Anything).Return("", pollError)
node.declareAlive()
tests.AssertLogEventually(t, observedLogs, fmt.Sprintf("RPC endpoint failed to respond to %d consecutive polls", pollFailureThreshold))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

RPC endpoint failed to respond to is logged before RPC has transitioned to unreachable. To avoid flakiness, we should wait for RPC Node is unreachable msg.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fair, will assertEventually for the nodeState as well, like done in other tests.

if n.poolInfoProvider != nil {
if l, _ := n.poolInfoProvider.LatestChainInfo(); l < 1 {
if n.isLoadBalancedRPC {
n.declareUnreachable()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should declareOutOfSync here to avoid cases when an RPC that does not produce new blocks gets stuck in a loop of alive->outOfSync->Unreachable->alive. declareOutOfSync guarantees reconnection and allows us to keep track of previous issues like being out of sync with the pool or not generating new heads.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

func (n *node[CHAIN_ID, HEAD, RPC]) transitionToOutOfSync(fn func()) {
	ctx, cancel := n.stopCh.NewCtx()
	defer cancel()
	n.metrics.IncrementNodeTransitionsToOutOfSync(ctx, n.name)
	n.stateMu.Lock()
	defer n.stateMu.Unlock()
	if n.state == nodeStateClosed {
		return
	}
	switch n.state {
	case nodeStateAlive:
		n.rpc.Close()
		n.state = nodeStateOutOfSync
	default:
		panic(transitionFail(n.state, nodeStateOutOfSync))
	}
	fn()
}

i cannot transitionToOutOfSync from an outOfSyncState

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

declareUnreachable forces reconnection as well.
would like to understand more about how issue tracking will be made easier with declareOutOfSync

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  1. We can modify transitionToOutOfSync to allow self-transition just to force reconnect.

would like to understand more about how issue tracking will be made easier with declareOutOfSync

Let's say we have one load-balanced RPC. All RPCs fail to produce new blocks.

In the current implementation, the node will transition through the following states: alive (no new heads timeout) -> outOfSync (zombie check timeout) -> unreachable (successful dial) -> alive. The problem here is that RPC failed to overcome the initial health issue but was declared alive.

If we replace declareUnreachable with declareOutOfSync in this case, we'll still force RPC to reconnect, but will wait for a new head before declaring RPC alive. We'll still have a state transition loop, but in this case, we won't falsely transition to the alive state.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants