feat(sync): multi-endpoint round-robin + batched writes + monotonic cursor (#37)
Conversation
Backfill went 38 → 268 b/s (7x) on the live mainnet catch-up by combining four changes:

1. Multi-endpoint RPC pool. `ChainProvider` + `RestClient` now accept comma-separated URLs and round-robin requests via an atomic counter. Lets the indexer hit fullnodes directly (bypassing the public Caddy edge and its per-IP rate limit) while spreading load across N nodes.
2. REST_URL env split. Direct fullnode access needs `/rpc` for JSON-RPC but the root path for native REST. The Caddy edge papered over this with a path rewrite; the public-facing single URL doesn't work once we bypass it. `REST_URL` falls back to `RPC_URL` when unset (no behaviour change for single-endpoint deployments).
3. Batched writes. Backfill now buffers up to `INDEXER_BACKFILL_BATCH` (default 100) bundles and flushes them as one transaction with multi-row INSERTs (`insert_batch` helpers added to blocks/transactions/logs). One commit per batch instead of one per block.
4. Monotonic cursor. The tail loop's `ingest_one` path advances the cursor per-block; running in parallel with the batched backfill, its slower commits clobbered the batched writer's higher cursor values, causing visible regressions of 100k+ blocks in seconds. `write_cursor` now uses `GREATEST(_meta.value::int8, EXCLUDED.value::int8)` at the SQL level so the on-disk cursor is monotonic regardless of which writer commits last.

Also: new `INDEXER_BACKFILL_BATCH` env (1..1000, default 100) for tuning.
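The clobbering scenario in change 4 can be illustrated with a toy max-merge. `write_cursor` is the name from the PR; everything else here is illustrative, with the in-SQL `GREATEST` replaced by `i64::max` for a self-contained sketch:

```rust
// Illustrative sketch only: the real write_cursor does an upsert where
// GREATEST(_meta.value::int8, EXCLUDED.value::int8) performs this max-merge
// inside Postgres, so a slower, lower commit cannot roll the cursor back.
fn write_cursor(stored: &mut i64, proposed: i64) {
    *stored = (*stored).max(proposed);
}

fn main() {
    let mut cursor = 0_i64;
    write_cursor(&mut cursor, 1_500_000); // batched backfill commits a high value first
    write_cursor(&mut cursor, 1_400_000); // tail loop's slower per-block commit lands later
    assert_eq!(cursor, 1_500_000);        // no 100k+ block regression
}
```

Without the merge, the second (later) commit would win and the cursor would visibly jump backwards, which is exactly what was observed live.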
📝 Walkthrough

The PR adds round-robin selection for JSON-RPC and REST endpoints, an optional IndexerConfig …

```mermaid
sequenceDiagram
    participant Backfill
    participant ChainProvider
    participant RestClient
    participant batch_write_blocks
    participant PgPool
    participant Analytics
    Backfill->>ChainProvider: fetch blocks via JSON-RPC (round-robin)
    Backfill->>RestClient: fetch native block/tx via HTTP (round-robin)
    Backfill->>batch_write_blocks: submit Vec<BlockBundle>
    batch_write_blocks->>PgPool: transactional multi-row inserts (blocks, txs, logs)
    PgPool-->>batch_write_blocks: commit
    batch_write_blocks->>Analytics: best-effort per-tx pushes
    batch_write_blocks-->>Backfill: flush complete
```
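The round-robin selection both clients use can be sketched with an atomic counter over a list of bases. This is a minimal sketch, assuming a `Vec` of endpoints and `Relaxed` ordering; the struct and field names are illustrative, not the actual `ChainProvider`/`RestClient` internals:

```rust
use std::sync::atomic::{AtomicUsize, Ordering};

// Hypothetical pool type standing in for the real clients.
struct EndpointPool {
    bases: Vec<String>,
    next: AtomicUsize,
}

impl EndpointPool {
    fn next_base(&self) -> &str {
        // fetch_add wraps on overflow, so the modulo keeps rotation stable
        // across long-running processes.
        let i = self.next.fetch_add(1, Ordering::Relaxed) % self.bases.len();
        &self.bases[i]
    }
}

fn main() {
    let pool = EndpointPool {
        bases: vec!["http://node-a:8545".into(), "http://node-b:8545".into()],
        next: AtomicUsize::new(0),
    };
    assert_eq!(pool.next_base(), "http://node-a:8545");
    assert_eq!(pool.next_base(), "http://node-b:8545");
    assert_eq!(pool.next_base(), "http://node-a:8545"); // wraps around
}
```

`Relaxed` is enough here because the counter only distributes load; no other state is synchronized through it.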
Sequence Diagram(s)

```mermaid
sequenceDiagram
    participant Backfill
    participant Fetcher
    participant Buffer
    participant batch_write_blocks
    participant PgPool
    participant Cursor
    Backfill->>Fetcher: concurrently fetch BlockBundle stream
    Fetcher->>Buffer: push BlockBundle
    Buffer->>Buffer: when buf.len() >= batch_size -> prepare batch
    Buffer->>batch_write_blocks: submit batch Vec<BlockBundle>
    batch_write_blocks->>PgPool: begin tx, insert chunks (blocks, txs, logs)
    PgPool-->>batch_write_blocks: commit
    batch_write_blocks->>Cursor: write_cursor(max_height, ts)
    batch_write_blocks-->>Backfill: ack flushed
```
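The Buffer → batch_write_blocks hop in the diagram is what cuts commits from one per block to one per batch. A self-contained sketch of that loop, where `BlockBundle` and `flush` are stand-ins for the real types:

```rust
// Stand-in for the real BlockBundle (blocks + txs + logs).
struct BlockBundle {
    height: i64,
}

// Stand-in for batch_write_blocks: the real code opens one transaction,
// runs multi-row INSERTs, commits, then calls write_cursor(max_height).
fn flush(batch: &[BlockBundle]) -> Option<i64> {
    batch.iter().map(|b| b.height).max()
}

fn main() {
    let batch_size = 3; // real default: INDEXER_BACKFILL_BATCH = 100
    let mut buf: Vec<BlockBundle> = Vec::new();
    let mut commits = 0;
    let mut last_cursor = None;

    for height in 0..10 {
        buf.push(BlockBundle { height });
        if buf.len() >= batch_size {
            last_cursor = flush(&buf);
            commits += 1;
            buf.clear();
        }
    }
    // flush the final partial batch so no blocks are stranded in the buffer
    if !buf.is_empty() {
        last_cursor = flush(&buf);
        commits += 1;
    }

    assert_eq!(commits, 4);             // 3 full batches + 1 partial, not 10 commits
    assert_eq!(last_cursor, Some(9));   // cursor advances to the batch max
}
```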
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~45 minutes
🚥 Pre-merge checks: ✅ 5 passed
Actionable comments posted: 2
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@crates/chain/src/provider.rs`:
- Around line 17-18: The file fails rustfmt checks; run `cargo fmt --all` (or
`cargo fmt` in the crate) to reformat crates/chain/src/provider.rs and commit
the formatting-only changes so the import block and surrounding regions
(including the AtomicUsize/Ordering and Arc use lines and nearby code around the
earlier reported regions) match rustfmt output; ensure no logic changes are
introduced—only whitespace/formatting fixes—to satisfy `cargo fmt --check`.
In `@crates/chain/src/rest.rs`:
- Around line 64-68: The current new() construction only trims and splits the
raw endpoints into bases, but does not validate them; update new() to parse and
validate each configured endpoint (the raw string entries and the resulting
bases Vec<String>) during construction and return an Err (or panic) if any entry
is malformed so the program fails fast; use a URL parser (e.g., url::Url::parse
or reqwest::Url::parse) on each trimmed base and ensure trailing slashes are
normalized only after successful parse, referencing the raw variable, the bases
collection, and the new() constructor to locate where to add the validation.
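One way to read that suggestion, as a std-only sketch: the review proposes `url::Url::parse`, but to stay dependency-free this stand-in only checks the scheme, so treat the validation itself (and the `parse_bases` name) as assumptions:

```rust
// Hypothetical fail-fast parsing for new(): reject malformed entries up
// front instead of failing on the first request. A real implementation
// would use url::Url::parse per the review comment.
fn parse_bases(raw: &str) -> Result<Vec<String>, String> {
    let bases: Vec<String> = raw
        .split(',')
        .map(str::trim)
        .filter(|s| !s.is_empty())
        .map(|s| {
            if s.starts_with("http://") || s.starts_with("https://") {
                // normalize trailing slashes only after the entry validates
                Ok(s.trim_end_matches('/').to_string())
            } else {
                Err(format!("malformed endpoint: {s}"))
            }
        })
        .collect::<Result<_, _>>()?;
    if bases.is_empty() {
        return Err("no endpoints configured".to_string());
    }
    Ok(bases)
}

fn main() {
    assert_eq!(
        parse_bases(" http://a/ , http://b ").unwrap(),
        vec!["http://a".to_string(), "http://b".to_string()]
    );
    assert!(parse_bases("ftp://bad").is_err()); // fail fast on malformed entries
    assert!(parse_bases("").is_err());          // and on an empty list
}
```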
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: ASSERTIVE
Plan: Pro Plus
Run ID: b29b317b-2609-4f2a-8702-8f15666d4bd6
📒 Files selected for processing (9)
- bin/indexer.rs
- crates/chain/src/provider.rs
- crates/chain/src/rest.rs
- crates/db/src/blocks.rs
- crates/db/src/logs.rs
- crates/db/src/transactions.rs
- crates/sync/src/backfill.rs
- crates/sync/src/block_writer.rs
- crates/sync/src/cursor.rs
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@crates/chain/src/rest.rs`:
- Around line 59-62: Add unit tests in crates/chain/src/rest.rs that exercise
the pub fn new(base_url: impl Into<String>) -> ChainResult<Self>: one test
should pass a comma-separated string like "http://a,http://b" to new and assert
the resulting client normalizes/splits into the two distinct base URLs; another
test should instantiate the client with at least two bases and then call the
public operation that selects/uses the base (the method that triggers
round‑robin/base selection) multiple times to assert deterministic rotation
between the bases (e.g., first call uses base A, second uses base B, third uses
base A). Ensure the tests cover trimming/normalization of whitespace and
deterministic ordering across repeated calls.
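A sketch of the requested tests. `RestClient` here is a stand-in with the comma-splitting `new` and the `next_base` rotation the review describes; the real type in crates/chain/src/rest.rs returns `ChainResult` and holds an HTTP client, so adapt accordingly:

```rust
use std::sync::atomic::{AtomicUsize, Ordering};

// Stand-in client so the tests below are self-contained.
struct RestClient {
    bases: Vec<String>,
    next: AtomicUsize,
}

impl RestClient {
    fn new(base_url: impl Into<String>) -> Self {
        let raw: String = base_url.into();
        let bases = raw
            .split(',')
            .map(|s| s.trim().trim_end_matches('/').to_string())
            .filter(|s| !s.is_empty())
            .collect();
        Self { bases, next: AtomicUsize::new(0) }
    }

    fn next_base(&self) -> &str {
        let i = self.next.fetch_add(1, Ordering::Relaxed) % self.bases.len();
        &self.bases[i]
    }
}

#[test]
fn splits_and_trims_comma_separated_bases() {
    let c = RestClient::new(" http://a , http://b/ ");
    assert_eq!(c.bases, vec!["http://a", "http://b"]);
}

#[test]
fn rotation_is_deterministic() {
    let c = RestClient::new("http://a,http://b");
    assert_eq!(c.next_base(), "http://a");
    assert_eq!(c.next_base(), "http://b");
    assert_eq!(c.next_base(), "http://a"); // wraps back to the first base
}
```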
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: ASSERTIVE
Plan: Pro Plus
Run ID: f8d62489-5685-439a-8f3a-44495f5680f1
📒 Files selected for processing (2)
- crates/chain/src/provider.rs
- crates/chain/src/rest.rs
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
crates/chain/src/rest.rs (1)
103-116: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win: Add per-request failover across bases.
Round-robin alone means one bad endpoint causes every Nth request to fail, even when other bases are healthy. Please retry on the next base(s) for transport errors and 5xx before returning an error.
Proposed direction
```diff
+    async fn get_with_failover(&self, path: &str) -> ChainResult<reqwest::Response> {
+        let mut last_err: Option<ChainError> = None;
+        for _ in 0..self.bases.len() {
+            let url = format!("{}/{}", self.next_base(), path.trim_start_matches('/'));
+            match self.http.get(&url).send().await {
+                Ok(resp) if !resp.status().is_server_error() => return Ok(resp),
+                Ok(resp) => {
+                    let status = resp.status();
+                    let body = resp.text().await.unwrap_or_default();
+                    last_err = Some(ChainError::Rpc(format!("native rest {status}: {body}")));
+                }
+                Err(e) => last_err = Some(e.into()),
+            }
+        }
+        Err(last_err.unwrap_or_else(|| ChainError::Rpc("all rest endpoints failed".to_string())))
+    }
+
     pub async fn tx(&self, hash: &Hash) -> ChainResult<Option<NativeTxResponse>> {
-        let url = format!("{}/tx/{}", self.next_base(), hash);
-        let resp = self.http.get(&url).send().await?;
+        let resp = self.get_with_failover(&format!("tx/{hash}")).await?;
         if resp.status() == reqwest::StatusCode::NOT_FOUND {
             return Ok(None);
         }
         ...
     }
```

Also applies to: 124-134, 139-152
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Outside diff comments:
In `@crates/chain/src/rest.rs`:
- Around line 103-116: The tx handler currently hits a single base returned by
next_base() and fails immediately on transport errors or 5xx; change it to
attempt the request across all configured bases
(round-robin-starting-from-current) and only return an error after exhausting
bases: for each base build the url (use the same url formation code with
self.next_base()/or a base-at-index helper), call
self.http.get(...).send().await and if send() returns Err (transport error) or
resp.status().is_server_error() then log/continue to the next base and try the
next URL; keep the existing behavior for 404 (return Ok(None) if you get
NOT_FOUND from a healthy response) and for non-success non-5xx responses return
the parsed error immediately; when a request succeeds parse bytes ->
serde_json::from_slice and return Ok(Some(parsed)). Apply the same
retry-across-bases pattern to the other similar methods referenced in the
comment (the blocks at lines 124-134 and 139-152) so all native REST requests
retry on transport errors and 5xx before failing.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: ASSERTIVE
Plan: Pro Plus
Run ID: 7014dc59-f018-45ea-8291-1d001372a6d5
📒 Files selected for processing (2)
- crates/chain/src/rest.rs
- scripts/smoke.sh
Summary
Backfill 38 → 268 b/s (7x) on live mainnet catch-up (verified during ongoing 1.78M-block walk).
Four changes that compound:
- Multi-endpoint RPC pool: `ChainProvider` + `RestClient` accept comma-separated URLs and round-robin via an atomic counter. Lets the indexer hit fullnodes directly (bypassing the public Caddy edge and its per-IP rate limit) while spreading load across N nodes.
- REST_URL env split: direct fullnode access needs `/rpc` for JSON-RPC but root for native REST. The Caddy edge papered over this with a path rewrite; bypassing the edge needs the URLs split. `REST_URL` falls back to `RPC_URL` when unset, so single-endpoint deployments are unchanged.
- Batched writes: backfill buffers up to `INDEXER_BACKFILL_BATCH` (default 100) bundles and flushes as one transaction with multi-row INSERTs (`insert_batch` helpers added to blocks/transactions/logs). One commit per batch instead of one per block.
- Monotonic cursor: the tail loop's `ingest_one` path advances the cursor per-block. Running in parallel with the batched backfill, its slower commits clobbered the batched writer's higher cursor values, observed live as a 100k+ block regression in seconds. `write_cursor` now uses `GREATEST(_meta.value::int8, EXCLUDED.value::int8)` at the SQL level so the cursor is monotonic regardless of commit order.

Throughput history
Env
- `RPC_URL`: comma-separated JSON-RPC endpoints (existing, now multi)
- `REST_URL`: optional, comma-separated REST base URLs; falls back to `RPC_URL`
- `INDEXER_BACKFILL_BATCH`: 1..1000, default 100 (new)
- `INDEXER_BACKFILL_CONCURRENCY`: 1..500, default 50 (existing)

Test plan
- `cargo check --release -D warnings` clean

Summary by CodeRabbit
- New Features
- Refactor
- Bug Fixes
- Tests
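The `INDEXER_BACKFILL_BATCH` bounds listed under Env (1..1000, default 100) amount to a clamped env read. A minimal sketch; the helper name is hypothetical, the real parsing in bin/indexer.rs may differ, and the upper bound is treated as inclusive here:

```rust
// Hypothetical helper: takes the raw env value as a parameter so the
// sketch avoids mutating process env; a call site would pass
// std::env::var("INDEXER_BACKFILL_BATCH").ok().as_deref().
fn backfill_batch(raw: Option<&str>) -> usize {
    raw.and_then(|v| v.trim().parse::<usize>().ok())
        .map(|n| n.clamp(1, 1000)) // documented range: 1..1000
        .unwrap_or(100)            // documented default
}

fn main() {
    assert_eq!(backfill_batch(None), 100);          // unset -> default
    assert_eq!(backfill_batch(Some("250")), 250);   // in range, used as-is
    assert_eq!(backfill_batch(Some("0")), 1);       // clamped to lower bound
    assert_eq!(backfill_batch(Some("9999")), 1000); // clamped to upper bound
}
```

Unparseable values fall back to the default rather than aborting, which matches the tolerant tone of the other env fallbacks (`REST_URL` → `RPC_URL`), but the real binary may choose to error instead.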