
Conversation

rustyrussell (Contributor)

Fixes: #8614

When you have 1.6 million channelmoves (migrated from accounts.db), the first call to any bookkeeper function will cause it to digest them all. This proved very slow, and eventually crashed the node.

  1. libplugin attaches a destructor to the command for every request. Destructors live in a singly-linked list which is iterated repeatedly, so making a million requests is quadratically expensive.
  2. the bookkeeper plugin tried to find circular payments by calling the sql plugin to look up a given payment_hash: without an index, this scans the entire table each time.
  3. when it finally does finish, it submits all 1.6M log messages and a similar number of requests for invoices/sendpays at once. This caused lightningd to OOM and crash.

These three fix this.

…own on large numbers of requests.

Note that we create a destructor on the command to reset the request->cmd
pointer if the cmd is freed (so we know not to call the callback).
But attaching hundreds of thousands of them is slow: they live in a
singly-linked list, which is iterated in several places.

That destructor is now redundant: the request is allocated off the cmd,
so freeing the command will free the request anyway.
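
As a rough sketch of the change (illustrative names, not the actual
libplugin code), tal-parenting the request on the command is what makes
the per-request destructor unnecessary:

	/* Before: a destructor on cmd cleared request->cmd so a late reply
	 * wouldn't touch a freed command; attaching one per request walks a
	 * singly-linked list each time.  After: the request is a tal child
	 * of the command, so freeing the command frees the request too. */
	#include <ccan/tal/tal.h>

	struct command;

	struct request {
		struct command *cmd;
		/* ... method, params, callbacks ... */
	};

	static struct request *new_request(struct command *cmd)
	{
		/* tal-parenting on cmd replaces tal_add_destructor(). */
		struct request *req = tal(cmd, struct request);
		req->cmd = cmd;
		return req;
	}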

Hacking in something to print progress to a file, here's the number of
requests processed every 10 seconds before and after:
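
(A guess at that throwaway hack, for reference: something that appends
one line per processed request, so the `wc -l` loop below shows
throughput.  Not part of the actual patch.)

	#include <stdio.h>

	/* Throwaway instrumentation: one line per processed request. */
	static void note_progress(void)
	{
		FILE *f = fopen("/tmp/bkpr-progress", "a");
		if (f) {
			fprintf(f, ".\n");
			fclose(f);
		}
	}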

Before:
	$ while sleep 10; do wc -l /tmp/bkpr-progress; done
	181529 /tmp/bkpr-progress
	195994 /tmp/bkpr-progress
	207083 /tmp/bkpr-progress
	226336 /tmp/bkpr-progress
	234319 /tmp/bkpr-progress
	241514 /tmp/bkpr-progress
	247421 /tmp/bkpr-progress
	255292 /tmp/bkpr-progress
	261367 /tmp/bkpr-progress
	269085 /tmp/bkpr-progress
	276953 /tmp/bkpr-progress
	282233 /tmp/bkpr-progress
	286193 /tmp/bkpr-progress
	290930 /tmp/bkpr-progress
	295276 /tmp/bkpr-progress
	301086 /tmp/bkpr-progress

After:
	169505 /tmp/bkpr-progress
	196010 /tmp/bkpr-progress
	219370 /tmp/bkpr-progress
	235671 /tmp/bkpr-progress
	244242 /tmp/bkpr-progress
	255362 /tmp/bkpr-progress
	265636 /tmp/bkpr-progress
	276966 /tmp/bkpr-progress
	284451 /tmp/bkpr-progress
	288836 /tmp/bkpr-progress
	296578 /tmp/bkpr-progress
	304571 /tmp/bkpr-progress

Signed-off-by: Rusty Russell <[email protected]>
@rustyrussell rustyrussell added this to the v25.12 milestone Oct 17, 2025
@rustyrussell rustyrussell added the 25.09.2 Candidates if we do a 25.09.2 label Oct 17, 2025
@rustyrussell rustyrussell force-pushed the guilt/bkpr-post-migration branch from c4b321a to f04241b on October 17, 2025 13:25

This significantly speeds up the query which bookkeeper often does:

		      "SELECT created_index"
		      " FROM channelmoves"
		      " WHERE payment_hash = X'%s'"
		      "   AND credit_msat = %"PRIu64
		      "   AND created_index <= %"PRIu64,

On large databases this scan is expensive, and a payment_hash index
cuts it down a great deal.  It does take longer to load the channelmoves
in the first place though (about 3x).
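
The fix (see the Changelog line below) is to have the sql plugin index
payment_hash; the exact index name and statement here are assumptions,
but it amounts to:

	/* Assumed shape of the new index (the real name in the sql plugin
	 * may differ): a plain index on payment_hash turns the scan above
	 * into a direct lookup. */
	"CREATE INDEX channelmoves_payment_hash_idx"
	" ON channelmoves (payment_hash);"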

Before:
	$ while sleep 10; do wc -l /tmp/bkpr-progress; done
	169505 /tmp/bkpr-progress
	196010 /tmp/bkpr-progress
	219370 /tmp/bkpr-progress
	235671 /tmp/bkpr-progress
	244242 /tmp/bkpr-progress
	255362 /tmp/bkpr-progress
	265636 /tmp/bkpr-progress
	276966 /tmp/bkpr-progress
	284451 /tmp/bkpr-progress
	288836 /tmp/bkpr-progress
	296578 /tmp/bkpr-progress
	304571 /tmp/bkpr-progress

After:
	$ while sleep 10; do wc -l /tmp/bkpr-progress; done
	161421 /tmp/bkpr-progress
	238273 /tmp/bkpr-progress
	281185 /tmp/bkpr-progress
	305787 /tmp/bkpr-progress

Signed-off-by: Rusty Russell <[email protected]>
Changelog-Changed: plugins: the sql plugin now keeps an index on `channelmoves` by `payment_hash`.

If we read all of them, we might get 1.6M at once (after initial
migration).  Then we submit a few hundred thousand simultaneous
requests to lightningd, and it gets upset, queueing them all on the
xpay command hook and running out of memory.

Signed-off-by: Rusty Russell <[email protected]>
Changelog-Fixed: plugins: bookkeeper's first invocation after migration from a release prior to 25.09 no longer crashes with very large databases.
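
A rough sketch of the batching idea (hypothetical names and helpers,
not the real bookkeeper/libplugin code, and assuming listchannelmoves
supports the usual index/start/limit pagination of list commands):
read a page of entries, digest it, and only then ask for the next one.

	#include <stddef.h>

	#define BATCH_SIZE 1000

	typedef unsigned long long u64;	/* stand-in for CLN's u64 */

	struct channelmove {
		u64 created_index;
		/* ... payment_hash, credit/debit msat, timestamp ... */
	};

	/* Hypothetical helpers, not the real API. */
	static void digest_move(const struct channelmove *mv);
	static void fetch_channelmoves(u64 start, u64 limit,
				       void (*cb)(const struct channelmove *, size_t));

	static void handle_batch(const struct channelmove *moves, size_t n)
	{
		for (size_t i = 0; i < n; i++)
			digest_move(&moves[i]);

		/* A short batch means we've caught up; otherwise request the
		 * next page, so lightningd never sees hundreds of thousands
		 * of simultaneous follow-up requests. */
		if (n == BATCH_SIZE)
			fetch_channelmoves(moves[n - 1].created_index + 1,
					   BATCH_SIZE, handle_batch);
	}
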
@rustyrussell rustyrussell force-pushed the guilt/bkpr-post-migration branch from f04241b to bd8c907 on October 17, 2025 22:21
@sangbida (Collaborator) left a comment:

This looks good to me :) I did have a couple of questions:

  • Do folks who have already migrated have to do anything extra (another migration?) to get the new index, or does a migration just happen when a database change is made?
  • Are the changes to the trace array not required after all?
  • Is it worth writing a test for reading listchannelmoves 1000 entries at a time? Just wondering if there's a chance we could hit funny edge cases here in the case of a database rollback/deletion. I remember you mentioned that the cln database is not just append-only; not sure if that's entirely relevant here or not though.
