Add IsLoadBalancedRPC flag #64
Conversation
👋 yashnevatia, thanks for creating this pull request! To help reviewers, please consider creating future PRs as drafts first. This allows you to self-review and make any final changes before notifying the team. Once you're ready, you can mark it as "Ready for review" to request feedback. Thanks!
Unheilbar left a comment:
LGTM, I'll wait for corresponding PRs with updates for consumers
pollError := errors.New("failed to get ClientVersion")
rpc.On("ClientVersion", mock.Anything).Return("", pollError)
node.declareAlive()
tests.AssertLogEventually(t, observedLogs, fmt.Sprintf("RPC endpoint failed to respond to %d consecutive polls", pollFailureThreshold))
"RPC endpoint failed to respond to %d consecutive polls" is logged before the RPC has transitioned to unreachable. To avoid flakiness, we should wait for the "RPC Node is unreachable" message instead.
Fair, will assertEventually on the nodeState as well, as done in other tests.
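A minimal sketch of that assertion, assuming testify's require.Eventually is available and that the node exposes a State() accessor; the accessor name and durations here are assumptions, not this repo's exact helpers:

pollError := errors.New("failed to get ClientVersion")
rpc.On("ClientVersion", mock.Anything).Return("", pollError)
node.declareAlive()
// Wait for the terminal log line rather than the early poll-failure one...
tests.AssertLogEventually(t, observedLogs, "RPC Node is unreachable")
// ...and additionally assert the state transition itself.
require.Eventually(t, func() bool {
	return node.State() == nodeStateUnreachable // State() is an assumed accessor
}, time.Second, 10*time.Millisecond)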
if n.poolInfoProvider != nil {
	if l, _ := n.poolInfoProvider.LatestChainInfo(); l < 1 {
		if n.isLoadBalancedRPC {
			n.declareUnreachable()
We should declareOutOfSync here to avoid cases where an RPC that does not produce new blocks gets stuck in an alive -> outOfSync -> unreachable -> alive loop. declareOutOfSync guarantees reconnection and lets us keep track of prior issues, such as being out of sync with the pool or not producing new heads.
func (n *node[CHAIN_ID, HEAD, RPC]) transitionToOutOfSync(fn func()) {
	ctx, cancel := n.stopCh.NewCtx()
	defer cancel()
	n.metrics.IncrementNodeTransitionsToOutOfSync(ctx, n.name)
	n.stateMu.Lock()
	defer n.stateMu.Unlock()
	if n.state == nodeStateClosed {
		return
	}
	switch n.state {
	case nodeStateAlive:
		n.rpc.Close()
		n.state = nodeStateOutOfSync
	default:
		panic(transitionFail(n.state, nodeStateOutOfSync))
	}
	fn()
}

I cannot transitionToOutOfSync from an outOfSync state.
declareUnreachable forces reconnection as well. I would like to understand more about how issue tracking will be made easier with declareOutOfSync.
We can modify transitionToOutOfSync to allow a self-transition just to force a reconnect (see the sketch below).

> would like to understand more about how issue tracking will be made easier with declareOutOfSync

Let's say we have one load-balanced RPC, and all RPCs fail to produce new blocks. In the current implementation, the node transitions through the following states: alive (no-new-heads timeout) -> outOfSync (zombie check timeout) -> unreachable (successful dial) -> alive. The problem here is that the RPC failed to overcome the initial health issue but was still declared alive.

If we replace declareUnreachable with declareOutOfSync in this case, we'll still force the RPC to reconnect, but we'll wait for a new head before declaring it alive. We'll still have a state-transition loop, but we won't falsely transition to the alive state.
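A minimal sketch of that modification, based on the transitionToOutOfSync function quoted earlier; this is the proposal being discussed, not merged code:

func (n *node[CHAIN_ID, HEAD, RPC]) transitionToOutOfSync(fn func()) {
	ctx, cancel := n.stopCh.NewCtx()
	defer cancel()
	n.metrics.IncrementNodeTransitionsToOutOfSync(ctx, n.name)
	n.stateMu.Lock()
	defer n.stateMu.Unlock()
	if n.state == nodeStateClosed {
		return
	}
	switch n.state {
	case nodeStateAlive, nodeStateOutOfSync: // allow outOfSync -> outOfSync to force a reconnect
		n.rpc.Close() // close so the node re-dials while staying (or landing) in outOfSync
		n.state = nodeStateOutOfSync
	default:
		panic(transitionFail(n.state, nodeStateOutOfSync))
	}
	fn()
}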
https://smartcontract-it.atlassian.net/browse/PLEX-1596
If the RPC is a proxy RPC, it will be the only node in the node list (it will always be the only available node). In such cases we don't want to apply the logic where we avoid disconnecting from an out-of-sync node when it's the only available node. BCY wants us to reconnect to the proxy, and they will return a better RPC to us. So we are adding a feature flag that tells us whether this is a load-balanced RPC or a normal RPC; a sketch of the flag's surface follows the loop notes below.
For aliveLoop()
For outOfSyncLoop()
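For illustration, a minimal sketch of how the flag might be surfaced on the node configuration; the IsLoadBalancedRPC name comes from this PR's title, while the surrounding type and wiring are assumptions:

// nodeConfig is a hypothetical configuration surface; only IsLoadBalancedRPC
// is taken from this PR, the rest is illustrative.
type nodeConfig struct {
	// IsLoadBalancedRPC marks the endpoint as a load-balancing proxy (e.g.
	// BCY). A proxy is always the only node in the list, so instead of the
	// usual "keep the only available node connected even when out of sync"
	// behavior, aliveLoop and outOfSyncLoop force a reconnect and let the
	// proxy hand back a healthier RPC.
	IsLoadBalancedRPC bool
}

In aliveLoop the flag would gate the declareUnreachable/declareOutOfSync call shown in the diff above; outOfSyncLoop would similarly skip the only-available-node exception.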