
Conversation

rustyrussell (Contributor)

Fixes: #8614

When you have 1.6 million channelmoves (migrated from accounts.db), the first call to any bookkeeper function will cause it to digest them all. This proved very slow, and eventually crashed the node.

  1. libplugin attaches a destructor to the command for every request. Destructors live in a singly-linked list which is iterated repeatedly, so making a million requests is quadratically expensive.
  2. the bookkeeper plugin tried to find circular payments by calling the sql plugin to look up a given payment_hash: without an index, this scans the entire table each time.
  3. when it finally does finish, it submits all 1.6M log messages and a similar number of requests for invoices/sendpays at once. This caused lightningd to OOM and crash.

These three fix this.

…own on large numbers of requests.

Note that we create a destructor on the command to reset the request->cmd
pointer if the cmd is freed (so we know not to call the callback).
But attaching hundreds of thousands of them is slow: they live in a
singly-linked list, which is iterated in several places.

That destructor is now redundant: the request is allocated off the cmd,
so freeing the command will free the request anyway.
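
As a rough sketch of the change (illustrative names, not the actual
libplugin code), tal-parenting the request on the command is what makes
the per-request destructor unnecessary:

	/* Before: a destructor on cmd cleared request->cmd so a late reply
	 * wouldn't touch a freed command; attaching one per request walks a
	 * singly-linked list each time.  After: the request is a tal child
	 * of the command, so freeing the command frees the request too. */
	#include <ccan/tal/tal.h>

	struct command;

	struct request {
		struct command *cmd;
		/* ... method, params, callbacks ... */
	};

	static struct request *new_request(struct command *cmd)
	{
		/* tal-parenting on cmd replaces tal_add_destructor(). */
		struct request *req = tal(cmd, struct request);
		req->cmd = cmd;
		return req;
	}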

Hacking in something to print progress to a file, here's the number of
requests processed every 10 seconds before and after:
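
(A guess at that throwaway hack, for reference: something that appends
one line per processed request, so the `wc -l` loop below shows
throughput.  Not part of the actual patch.)

	#include <stdio.h>

	/* Throwaway instrumentation: one line per processed request. */
	static void note_progress(void)
	{
		FILE *f = fopen("/tmp/bkpr-progress", "a");
		if (f) {
			fprintf(f, ".\n");
			fclose(f);
		}
	}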

Before:
	$ while sleep 10; do wc -l /tmp/bkpr-progress; done
	181529 /tmp/bkpr-progress
	195994 /tmp/bkpr-progress
	207083 /tmp/bkpr-progress
	226336 /tmp/bkpr-progress
	234319 /tmp/bkpr-progress
	241514 /tmp/bkpr-progress
	247421 /tmp/bkpr-progress
	255292 /tmp/bkpr-progress
	261367 /tmp/bkpr-progress
	269085 /tmp/bkpr-progress
	276953 /tmp/bkpr-progress
	282233 /tmp/bkpr-progress
	286193 /tmp/bkpr-progress
	290930 /tmp/bkpr-progress
	295276 /tmp/bkpr-progress
	301086 /tmp/bkpr-progress

After:
	169505 /tmp/bkpr-progress
	196010 /tmp/bkpr-progress
	219370 /tmp/bkpr-progress
	235671 /tmp/bkpr-progress
	244242 /tmp/bkpr-progress
	255362 /tmp/bkpr-progress
	265636 /tmp/bkpr-progress
	276966 /tmp/bkpr-progress
	284451 /tmp/bkpr-progress
	288836 /tmp/bkpr-progress
	296578 /tmp/bkpr-progress
	304571 /tmp/bkpr-progress

Signed-off-by: Rusty Russell <[email protected]>
@rustyrussell rustyrussell added this to the v25.12 milestone Oct 17, 2025
@rustyrussell rustyrussell added the 25.09.2 Candidates if we do a 25.09.2 label Oct 17, 2025
@rustyrussell rustyrussell force-pushed the guilt/bkpr-post-migration branch from c4b321a to f04241b on October 17, 2025 13:25

This significantly speeds up the query which bookkeeper often does:

		      "SELECT created_index"
		      " FROM channelmoves"
		      " WHERE payment_hash = X'%s'"
		      "   AND credit_msat = %"PRIu64
		      "   AND created_index <= %"PRIu64,

On large databases this scan is expensive, and a payment_hash index
cuts it down a great deal.  It does take longer to load the channelmoves
in the first place though (about 3x).
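
The fix (see the Changelog line below) is to have the sql plugin index
payment_hash; the exact index name and statement here are assumptions,
but it amounts to:

	/* Assumed shape of the new index (the real name in the sql plugin
	 * may differ): a plain index on payment_hash turns the scan above
	 * into a direct lookup. */
	"CREATE INDEX channelmoves_payment_hash_idx"
	" ON channelmoves (payment_hash);"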

Before:
	$ while sleep 10; do wc -l /tmp/bkpr-progress; done
	169505 /tmp/bkpr-progress
	196010 /tmp/bkpr-progress
	219370 /tmp/bkpr-progress
	235671 /tmp/bkpr-progress
	244242 /tmp/bkpr-progress
	255362 /tmp/bkpr-progress
	265636 /tmp/bkpr-progress
	276966 /tmp/bkpr-progress
	284451 /tmp/bkpr-progress
	288836 /tmp/bkpr-progress
	296578 /tmp/bkpr-progress
	304571 /tmp/bkpr-progress

After:
	$ while sleep 10; do wc -l /tmp/bkpr-progress; done
	161421 /tmp/bkpr-progress
	238273 /tmp/bkpr-progress
	281185 /tmp/bkpr-progress
	305787 /tmp/bkpr-progress

Signed-off-by: Rusty Russell <[email protected]>
Changelog-Changed: plugins: the sql plugin now keeps an index on `channelmoves` by `payment_hash`.

If we read all of them, we might get 1.6M at once (after initial
migration).  Then we submit a few hundred thousand simultaneous
requests to lightningd, and it gets upset, queueing them all on the
xpay command hook and running out of memory.

Signed-off-by: Rusty Russell <[email protected]>
Changelog-Fixed: plugins: bookkeeper's first invocation after migration from a release prior to 25.09 no longer crashes with very large databases.
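
A rough sketch of the batching idea (hypothetical names and helpers,
not the real bookkeeper/libplugin code, and assuming listchannelmoves
supports the usual index/start/limit pagination of list commands):
read a page of entries, digest it, and only then ask for the next one.

	#include <stddef.h>

	#define BATCH_SIZE 1000

	typedef unsigned long long u64;	/* stand-in for CLN's u64 */

	struct channelmove {
		u64 created_index;
		/* ... payment_hash, credit/debit msat, timestamp ... */
	};

	/* Hypothetical helpers, not the real API. */
	static void digest_move(const struct channelmove *mv);
	static void fetch_channelmoves(u64 start, u64 limit,
				       void (*cb)(const struct channelmove *, size_t));

	static void handle_batch(const struct channelmove *moves, size_t n)
	{
		for (size_t i = 0; i < n; i++)
			digest_move(&moves[i]);

		/* A short batch means we've caught up; otherwise request the
		 * next page, so lightningd never sees hundreds of thousands
		 * of simultaneous follow-up requests. */
		if (n == BATCH_SIZE)
			fetch_channelmoves(moves[n - 1].created_index + 1,
					   BATCH_SIZE, handle_batch);
	}
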
@rustyrussell rustyrussell force-pushed the guilt/bkpr-post-migration branch from f04241b to bd8c907 on October 17, 2025 22:21
@sangbida (Collaborator) left a comment:

This looks good to me :) I did have a couple of questions:

  • Do folks who have already migrated have to do anything extra (another migration?) to get the new index, or does a migration just happen when a database change is made?
  • Are the changes to the trace array not required after all?
  • Is it worth writing a test for reading listchannelmoves 1000 entries at a time? Just wondering if there's a chance we could hit funny edge cases here in the case of a database rollback/deletion. I remember you mentioned that the cln database is not just append-only; not sure if that's entirely relevant here or not though.
