
Conversation

Contributor

@cjonas9 cjonas9 commented Dec 15, 2025

What

This PR adds synchronous history backfilling of ledgers through the new configuration parameter backfill. It fills the local SQL database with the most recent HISTORY_RETENTION_WINDOW ledgers, fetched from CDP. Usage: ./stellar-rpc --backfill; the option can also be set as a config parameter in the TOML file or as an environment variable.

Notes:

  • After verifying the local DB state, this will backfill the specified number of ledgers from CDP backwards/"leftward", then backfill forwards/"rightward" up to the new most current ledger (a sketch of this two-phase flow follows this list). On termination, the database is guaranteed to contain the most recent HISTORY_RETENTION_WINDOW ledgers.
    • Leftward/backwards backfill: fills the range [oldest ledger we want in our DB <- oldest ledger in the local DB - 1] into the local DB. This phase is skipped if the DB is empty, and it is generally slower than the forwards backfill.
    • Rightward/forwards backfill: fills the range (max(newest ledger currently in the local DB, oldest ledger we want in our DB) -> current tip of the datastore] into the local DB. This phase runs twice if the DB is empty (on the second run, it just refreshes the current DB tip).
  • After completion, it starts RPC, which begins filling in ledgers live through captive core. See my design document for details on how this is done.
  • This design guarantees that, regardless of faults, the database is never left with gaps. The client may supply a partially filled database, but backfill will error if that database initially has gaps. Note that (because the BufferedStorageBackend prevents backwards iteration) supplying a partially filled DB may be slower than the empty-DB case if it is (1) minimally filled and/or (2) close to the datastore's current tip ledger (i.e. the larger the backwards backfill run, the slower this can be).
  • The backfilled ledgers are ingested from CDP through the BufferedStorageBackend, which is datastore-agnostic and has been tested on GCS and S3.
  • Using the (rate-limited) public S3 datastore in the k8s soroban-rpc-pubnet-dev deployment, ingesting one week of ledgers (120,960 ledgers, ~150 GB) takes about 3 hours.
    • Across runs backfilling from an empty local DB, we observed the following timings: 3h6m, 3h7m, 2h41m, 2h48m. I suspect that if any ledgers are already in the DB (so a backwards phase runs), runtimes would be longer.
    • In a truly pessimistic case (e.g. a maximal backwards backfill phase), the runtime is around 3h45m.
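
For illustration, a minimal sketch of that two-phase flow is below. BackfillMeta, backfillBounds, backfillChunks, and runFrontfill are names from this PR, but the fields, signatures, and bodies here are simplified stand-ins, not the real implementation:

```go
// Sketch only: illustrates the ordering of the two phases described above.
package backfill

import (
	"context"
	"fmt"
)

type backfillBounds struct {
	lBound, rBound uint32 // oldest and newest ledger we want in the local DB
	dbWasEmpty     bool   // hypothetical flag: true if the local DB started empty
}

type BackfillMeta struct{} // the real struct holds the logger, datastore config, DB handle, ...

func (b *BackfillMeta) backfillChunks(ctx context.Context, bounds backfillBounds) error { return nil }
func (b *BackfillMeta) runFrontfill(ctx context.Context, bounds backfillBounds) error   { return nil }

// run orders the two phases: backwards first (skipped when the DB is empty), then
// forwards up to the refreshed datastore tip.
func (b *BackfillMeta) run(ctx context.Context, bounds backfillBounds) error {
	if !bounds.dbWasEmpty {
		// Phase 1: leftward fill of [lBound, oldest-ledger-in-DB - 1].
		if err := b.backfillChunks(ctx, bounds); err != nil {
			return fmt.Errorf("backwards backfill failed: %w", err)
		}
	}
	// Phase 2: rightward fill up to the datastore tip (run twice when the DB was empty,
	// with the second pass just refreshing the tip).
	return b.runFrontfill(ctx, bounds)
}
```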

Why

There is currently no way to backfill RPC's local database with historical ledgers, let alone a time-efficient way to do so. This PR provides one.
See issues/discussions on this: #203, 1718

Known limitations

[N/A]

@cjonas9 cjonas9 linked an issue Dec 15, 2025 that may be closed by this pull request
Contributor

@Shaptic Shaptic left a comment

another pass

info, err := i.getCoreInfo()
return err == nil && info.Info.Ledger.Num >= ledger
},
90*time.Second,
Contributor

Should probably be scaled based on the ledger count and close time, otherwise this will time out for large enough values w/o much explanation
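
For instance, something roughly like this (a sketch, not this PR's code; the ~6s close time, the helper name, and its placement are all assumptions):

```go
package backfilltest // hypothetical placement for this sketch

import "time"

// waitTimeout sketches scaling the wait deadline with the number of ledgers core still
// has to apply, assuming roughly 6 seconds per ledger, while keeping the existing 90s floor.
func waitTimeout(ledgersToApply uint32) time.Duration {
	t := time.Duration(ledgersToApply*6) * time.Second
	if t < 90*time.Second {
		t = 90 * time.Second
	}
	return t
}
```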

lChunkBound, rChunkBound, 100*(rBound-lChunkBound)/max(rBound-lBound, 1))

if err := tempBackend.Close(); err != nil {
backfill.logger.Warnf("error closing temporary backend: %v", err)
Contributor

no return here? is the implication that we can just keep going? fine with that, just want to make that clear

Contributor

wait this should probably be a defer (they're scoped) cuz otherwise you're not closing it in any error case
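
e.g. something shaped like this (a sketch; only the Close call and log message mirror the snippet above, everything else is a stand-in):

```go
package backfill // sketch only; names other than Close and the log message are stand-ins

import (
	"context"
	"log"
)

type tempBackend interface{ Close() error }

// withTempBackend shows the scoped-defer shape being suggested: the temporary backend is
// closed on every exit path, and a failed Close is logged but treated as recoverable.
func withTempBackend(ctx context.Context,
	open func(context.Context) (tempBackend, error),
	ingest func(context.Context, tempBackend) error,
) error {
	backend, err := open(ctx)
	if err != nil {
		return err
	}
	defer func() {
		if cerr := backend.Close(); cerr != nil {
			log.Printf("error closing temporary backend: %v", cerr)
		}
	}()
	return ingest(ctx, backend)
}
```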

Contributor Author

no return was intentional, I saw this as recoverable. will scope!

Comment on lines 234 to 235
return fmt.Errorf("post-backfill verification failed: expected at least %d ledgers, "+
"got %d ledgers (exceeds acceptable threshold of %d missing ledgers)", nBackfill, count, ledgerThreshold)
Contributor

We should give a hint to the operator about what to actually do in this case, even if that's just "Try again". Otherwise they need support and we'd ideally want them to be self-sufficient as much as possible. In fact it might be worth doing a pass and adding info like this to any other "recoverable"-ish error.

Contributor Author

@cjonas9 cjonas9 Jan 23, 2026

absolutely agreed! this threshold is arbitrary and I feel like making this a warnf/"you may want to try again to avoid a longer catch up" is totally defensible
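
for example, roughly (a sketch; the interface and wording are illustrative, and whether this stays an error or becomes a warning is exactly the open question here):

```go
package backfill // sketch only

// warnLogger is a stand-in for the PR's logger; only a Warnf method is assumed.
type warnLogger interface {
	Warnf(format string, args ...interface{})
}

// warnShortBackfill sketches an operator-facing hint: say what happened, why it is
// survivable, and what the operator can do next.
func warnShortBackfill(logger warnLogger, count, nBackfill, ledgerThreshold uint32) {
	logger.Warnf("post-backfill verification found %d ledgers, expected at least %d "+
		"(more than %d missing); RPC will catch up live via captive core, but you may "+
		"want to re-run with --backfill to avoid a longer catch-up",
		count, nBackfill, ledgerThreshold)
}
```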

// Backfills the local DB with older ledgers from oldest to newest within the retention window
func (backfill *BackfillMeta) runFrontfill(ctx context.Context, bounds *backfillBounds) error {
numIterations := 1
// If we skipped backfilling, do a second forwards push to a refreshed current tip
Contributor

I feel like I know what you mean but this comment could use some reclarifying. Something like,

If we skipped backfilling, we want to fill forwards twice because the latest ledger may be significantly further in the future after the first fill completes and fills are faster than catch-up.

Contributor Author

much more clear!


// Backfills the local DB with ledgers in [lBound, rBound] from the cloud datastore
// Used to fill local DB backwards towards older ledgers (starting from newest)
func (backfill *BackfillMeta) backfillChunks(ctx context.Context, bounds *backfillBounds) error {
Contributor

You should just pass bounds by value so there's no risk of a nil pointer; it's a small structure.

Contributor Author

@cjonas9 cjonas9 Jan 23, 2026

got it, this is because bounds are updated in backfill for later checking. i can just have it be returned here if it's really preferable, but I liked making the fn signatures match across back/frontfill (of course that's not very deep, but I thought it emphasized that these functions do the same thing in opposite directions). also, a nil pointer would alert to something being borked; if that happens then (imo) it shouldn't pretend everything is ok or that no-op here is normal

Contributor

Ah I see, I missed that nuance. Yeah generally modifying parameters by reference is a bit of an anti-pattern, let's do return/reassigns everywhere, instead. nil is 2 spooky 👻
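
Sketched out, that might look like (signatures and fields are illustrative; only the names mirror the snippets quoted in this review):

```go
package backfill // sketch only

import "context"

type backfillBounds struct {
	lBound, rBound uint32
}

type BackfillMeta struct{}

// backfillChunks takes bounds by value and returns the (possibly adjusted) bounds rather
// than mutating them through a pointer.
func (b *BackfillMeta) backfillChunks(ctx context.Context, bounds backfillBounds) (backfillBounds, error) {
	// ... ingest [bounds.lBound, bounds.rBound] chunk by chunk, adjusting bounds as needed ...
	return bounds, nil
}

// Caller side: reassign rather than relying on mutation through a shared pointer.
func (b *BackfillMeta) run(ctx context.Context, bounds backfillBounds) error {
	bounds, err := b.backfillChunks(ctx, bounds)
	if err != nil {
		return err
	}
	_ = bounds // later verification would read the updated bounds
	return nil
}
```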

```

### Added
- Added `--backfill` configuration parameter providing synchronous backfilling of `HISTORY_RETENTION_WINDOW` ledgers to the local DB prior to RPC starting ([#571](https://github.com/stellar/stellar-rpc/pull/571)).
Contributor

We should add some details about resource consumption and timing here, and also note that you can only use this if you have set up a datastore (which also enables getLedger).

Validate: func(_ *Option) error {
// Ensure config is valid for backfill
if cfg.Backfill && !cfg.ServeLedgersFromDatastore {
return errors.New("backfill requires serving ledgers from datastore to be enabled")
Contributor

a ref to the actual flag so someone can grok the --help afterwards would be 💯 here
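
e.g. something like (a sketch of just the reworded message; the exact datastore flag spelling is deliberately left to --help rather than guessed here):

```go
Validate: func(_ *Option) error {
	// Ensure config is valid for backfill
	if cfg.Backfill && !cfg.ServeLedgersFromDatastore {
		return errors.New("--backfill requires serving ledgers from the datastore to be " +
			"enabled; see the datastore-related options in --help")
	}
	return nil
},
```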

return err
}
for _, event := range allLedgerEvents {
query, err = insertEvents(query, lcm, event)
Contributor

this is fine but it'd be a smaller diff if you kept building the insert statement in the loop (this has no db effect) and only moved the exec part outside of it. no strong preference but just wanted to note that
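
i.e. roughly this shape (a sketch; see the comments for which names are stand-ins):

```go
// Sketch of the smaller-diff shape: keep building the insert statement inside the existing
// loop (building touches no DB), and move only the execution outside of it. execInsert and
// events are hypothetical stand-ins; query, insertEvents, and lcm are from the snippet above.
for _, event := range events {
	if query, err = insertEvents(query, lcm, event); err != nil {
		return err
	}
}
if err = execInsert(ctx, query); err != nil {
	return err
}
```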

//
var beforeIndex, afterIndex uint32
// Accumulate all ledger events to insert
var allLedgerEvents []dbEvent
Contributor

recommend pre-sizing this to something reasonable based on transaction count, obviously can't be as precise as before but still nice
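
for example (a sketch; the transaction-count source and the per-transaction multiplier are both illustrative):

```go
// Sketch: give the slice a starting capacity derived from the ledger's transaction count.
// txCount stands in for however the caller already knows that number, and the 4-per-tx
// multiplier is an arbitrary illustrative guess; the point is to avoid repeated regrowth.
allLedgerEvents := make([]dbEvent, 0, txCount*4)
```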


Development

Successfully merging this pull request may close these issues.

Implement synchronous history backfilling

3 participants