Contributor

@Garandor Garandor commented Jan 15, 2026

What

Improved RPC fallback selection logic to choose default-priority (0) entries when no healthy RPCs are available.

New (not-yet-polled) RPCs are now initialized as Unhealthy, instead of Healthy with a fake latency derived from their order of definition.

This change removes the reliance on the order in which RPCs are defined in rpc-config.yaml.

Follow-up to #73.

Why

Some of our services (prov bootstrap) request RPCs before the healthcheck poller has had a chance to run. We thus have no idea which RPC from the list to return and should prefer the one with the highest likelihood of being available.

Previously, entries that had not yet been polled were initialized as Healthy. Under the new selection method, this caused the replica IP (highest priority) to be selected before the poller could prove it Unhealthy, which broke prov bootstrap.
The previous selection algorithm simply picked by order of definition in this case (the first RPC defined in the file), which is brittle and darkmagic-y as well.

This change ensures we fall back to a more reliable default-priority remote RPC whenever we don't know which RPC is actually available. That entry is typically a public or private remote RPC for the chain and can more safely be assumed to be available.

This default (priority-0) RPC is now mandatory for every chain that might be used in this way, and prio 0 now has a special meaning.
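
A minimal sketch of the selection rule described above, in Rust. This is not the code from this PR: the fields of `RpcEntry`, the `select_rpc` function, the `DEFAULT_PRIORITY` constant, and the placeholder URLs in the note below are hypothetical names chosen for illustration. It also assumes that a larger `priority` value means "more preferred" and that latency only breaks ties, neither of which is confirmed here.

```rust
/// Health state as last reported by the healthcheck poller.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Health {
    Healthy,
    Unhealthy,
}

/// Hypothetical shape of one RPC entry for a chain.
#[derive(Debug, Clone)]
struct RpcEntry {
    url: String,
    priority: u32,
    health: Health,
    latency_ms: Option<u64>,
}

/// Priority value that marks the fallback entry (assumed name).
const DEFAULT_PRIORITY: u32 = 0;

impl RpcEntry {
    /// New entries start Unhealthy, with no latency, until the poller has seen them.
    fn new(url: &str, priority: u32) -> Self {
        Self {
            url: url.to_string(),
            priority,
            health: Health::Unhealthy,
            latency_ms: None,
        }
    }
}

/// Pick the best healthy entry; if none is healthy, fall back to the
/// priority-0 entry, independent of the order of definition.
fn select_rpc(entries: &[RpcEntry]) -> Option<&RpcEntry> {
    let best_healthy = entries
        .iter()
        .filter(|e| e.health == Health::Healthy)
        .max_by(|a, b| {
            // Assumption: higher priority wins; lower latency breaks ties.
            a.priority.cmp(&b.priority).then_with(|| {
                let la = a.latency_ms.unwrap_or(u64::MAX);
                let lb = b.latency_ms.unwrap_or(u64::MAX);
                lb.cmp(&la)
            })
        });

    best_healthy.or_else(|| entries.iter().find(|e| e.priority == DEFAULT_PRIORITY))
}
```

With two freshly constructed entries, say `RpcEntry::new("http://replica.invalid:8547", 10)` and `RpcEntry::new("https://remote.example", DEFAULT_PRIORITY)`, this sketch returns the priority-0 entry until the poller marks the replica Healthy, which is the behavior prov bootstrap relies on.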

Background

During prov bootstrap, the healthcheck poller requests a yellowstone RPC immediately on boot without waiting for a healthcheck cycle. Previously, a replica URL was selected because of its high prio, since not-yet-polled RPCs were initialized as Healthy, and that made prov boot fail.

Contributor Author

Garandor commented Jan 15, 2026

This stack of pull requests is managed by Graphite. Learn more about stacking.

@Garandor Garandor marked this pull request as ready for review January 15, 2026 06:25

github-actions bot commented Jan 15, 2026

PASS [ 44.694s] (3/3) lit_node::test toxiproxy::perf_tests::load_with_no_latency
PASS [ 44.783s] (2/3) lit_node::test toxiproxy::perf_tests::load_with_50ms_latency_single_link
PASS [ 91.215s] (1/3) lit_node::test toxiproxy::perf_tests::load_with_50ms_latency_all_links

Contributor

@GTC6244 GTC6244 left a comment

lgtm.

For a future PR, I'd suggest maybe doing away with the "convention" for when things are not healthy, and adding some sort of indicator to the RpcEntry itself (like a default_public_endpoint: bool or something...)

Contributor Author

Using default prio is mainly to save effort and keep backward compatibility, since we can just assign 0 to the currently existing RPCs, but I agree an explicit (and user-selectable) marker per chain is desirable.
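
For reference, a rough sketch of what the explicit marker discussed in this thread could look like, in Rust. Nothing here exists in the codebase as far as this conversation shows: `MarkedRpcEntry`, `select_with_marker`, and `validate_chain` are hypothetical names, only the `default_public_endpoint` field name comes from the suggestion above, and the "exactly one default per chain" check is an assumption of this sketch.

```rust
/// Hypothetical entry with an explicit fallback marker instead of the prio-0 convention.
#[derive(Debug, Clone)]
struct MarkedRpcEntry {
    url: String,
    priority: u32,
    healthy: bool,
    /// Explicit marker: use this entry when nothing is known to be healthy.
    default_public_endpoint: bool,
}

/// Prefer the highest-priority healthy entry; otherwise return the marked default.
fn select_with_marker(entries: &[MarkedRpcEntry]) -> Option<&MarkedRpcEntry> {
    entries
        .iter()
        .filter(|e| e.healthy)
        .max_by_key(|e| e.priority)
        .or_else(|| entries.iter().find(|e| e.default_public_endpoint))
}

/// Config-time check (assumed rule): each chain marks exactly one default entry.
fn validate_chain(entries: &[MarkedRpcEntry]) -> Result<(), String> {
    match entries.iter().filter(|e| e.default_public_endpoint).count() {
        1 => Ok(()),
        n => Err(format!("expected exactly one default endpoint, found {n}")),
    }
}
```

The trade-off is the one already named in this thread: the marker frees priority 0 from its special meaning, but every existing chain definition would need the new field set, whereas the prio-0 convention only requires assigning 0 to existing RPCs.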

Contributor Author

For some reason, RPC selection is still broken for nodes even with this change: it's still selecting the replica.
Investigating some more.

lit os guest instance create node \
--net4-ip 23.105.38.175/26 \
--net4-gw 23.105.38.190 \
--subnet-id 149a054CE79A379Ae5E97f5B984B993233b28061 \
--node-staker-address 0x1A3596441024a3CA8b33AD996e50c492aFd42fC7 \
--node-admin-address 0x4C06111c11556284cA3A9660Eae340c6485C2BAD \
--vcpus 16 \
--mem 20G" /dev/null

"delta": "0:00:00.054469"
"end": "2026-01-15 14:10:11.086831"
"finished": true
"msg": "non-zero return code"
"rc": 101
"results_file": "/root/.ansible_async/j838247541378.4112"
"start": "2026-01-15 14:10:11.032362"
"started": true
"stderr": ""
"stderr_lines": []
"stdout":

thread 'main' panicked at /opt/assets/lit-assets/rust/lit-os/lit-cli-os/src/guest/instance/release.rs:74:55:
failed to construct ProvApiClient: lit_blockchain::Error { kind: Blockchain, msg: "failed to call contract_resolver on staking contract (subnet id: 149a054ce79a379ae5e97f5b984b993233b28061, contract address: 0x149a…8061, chain_id: 175188, chain_name: yellowstone)", source: MiddlewareError { e: HTTPError(reqwest::Error { kind: Request, url: Url { scheme: "http", cannot_be_a_base: false, username: "", password: None, host: Some(Ipv4(172.30.0.1)), port: Some(8547), path: "/", query: None, fragment: None }, source: hyper::Error(Connect, ConnectError("tcp connect error", Os { code: 111, kind: ConnectionRefused, message: "Connection refused" })) }) }, caller:  { file: "/opt/assets/lit-assets/rust/lit-core/lit-blockchain/src/resolver/contract/mod.rs:189:26" } }
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace

Contributor Author

RPC selection is still broken for nodes even with this change for some reason, it's still selecting the replica. Investigating some more

Scratch the above, I was observing a stale build. Works as advertised.

@Garandor Garandor changed the title from "If all are unhealthy, return the default RPC" to "RPC Resolver: If all are unhealthy, return the default RPC" Jan 21, 2026
Contributor

@GTC6244 GTC6244 left a comment

lgtm - minor comment that may or may not be relevant to code.

Contributor Author

Garandor commented Jan 22, 2026

Merge activity

  • Jan 22, 1:43 PM UTC: A user started a stack merge that includes this pull request via Graphite.
  • Jan 22, 1:44 PM UTC: Graphite rebased this pull request as part of a merge.
  • Jan 22, 2:26 PM UTC: Graphite couldn't merge this PR because it failed for an unknown reason (GitHub threw an unexpected error that did not resolve after multiple retries. Please try again later or contact Graphite support if this continues.).
  • Jan 22, 3:21 PM UTC: A user started a stack merge that includes this pull request via Graphite.
  • Jan 22, 3:22 PM UTC: @Garandor merged this pull request with Graphite.

@Garandor Garandor merged commit 09fcd78 into master Jan 22, 2026
85 of 95 checks passed
@Garandor Garandor self-assigned this Jan 22, 2026
@Garandor Garandor deleted the rework_default_rpc branch January 22, 2026 15:22

3 participants