Conversation

@jimmygchen (Member) commented Aug 22, 2025

Issue Addressed

Part of #7866

In the above PR, we enabled rayon for batch KZG verification in chain segment processing. However, using the global rayon thread pool for backfill is likely to create resource contention with higher-priority beacon processor work.

Proposed Changes

This PR introduces a dedicated low-priority rayon thread pool, LOW_PRIORITY_RAYON_POOL, and uses it for processing backfill chain segments.

This prevents backfill KZG verification from using the global rayon thread pool and competing with high-priority beacon processor tasks for CPU resources.
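
For illustration, here is a minimal sketch of what such a dedicated pool could look like. The pool name comes from this PR; the LazyLock static, the ~25%-of-cores sizing (based on the testing notes further down in this thread), and the run_low_priority helper are assumptions for the sketch, not the actual implementation.

```rust
use std::sync::LazyLock;

use rayon::{ThreadPool, ThreadPoolBuilder};

/// Dedicated pool for low-priority work such as backfill KZG verification.
/// Sizing it to roughly a quarter of the available cores mirrors the ~25%
/// figure reported in testing below (the exact heuristic is an assumption).
static LOW_PRIORITY_RAYON_POOL: LazyLock<ThreadPool> = LazyLock::new(|| {
    let num_threads = std::thread::available_parallelism()
        .map(|n| n.get())
        .unwrap_or(1)
        .div_ceil(4);
    ThreadPoolBuilder::new()
        .num_threads(num_threads)
        .thread_name(|i| format!("low_priority_rayon_{i}"))
        .build()
        .expect("failed to build low-priority rayon pool")
});

/// Run `f` on the dedicated pool instead of the global one. Any rayon
/// parallel iterators used inside `f` are scheduled on this pool's threads,
/// so backfill KZG verification cannot starve the global pool.
pub fn run_low_priority<R: Send>(f: impl FnOnce() -> R + Send) -> R {
    LOW_PRIORITY_RAYON_POOL.install(f)
}
```

Backfill chain segment processing would then wrap its batch KZG verification in run_low_priority(...) so that only this pool's threads are used.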

However, this PR by itself doesn't prevent CPU oversubscription, because other tasks could still fill up the global rayon thread pool, and having an extra thread pool could make things worse. To address this, we need the beacon processor to coordinate total CPU allocation across all tasks, which is covered in:

@jimmygchen added the blocked, work-in-progress, do-not-merge, and optimization labels on Aug 22, 2025
@jimmygchen changed the base branch from stable to unstable on August 22, 2025 15:43
@jimmygchen force-pushed the backfill-verify-kzg-use-scoped-rayon branch from 0e9d888 to 47a80e5 on August 22, 2025 15:43
@eserilev (Member) commented Aug 29, 2025

In the non-scoped rayon pool case, with zero-delay reconstruction, I did see the node briefly become unsynced. Additionally, I saw some instances of these log messages:

WARN  Delayed proposer preparation                  prepare_slot: 263808, validator: 72
WARN  Error signalling fork choice waiter           error: ForkChoiceSignalOutOfOrder { current: Slot(263831), latest: Slot(263830) }, slot: 263830

I also saw spikes in gossip attestation times (up to 10s)
[tracing screenshot]

The numbers dropped to the 20ms range as soon as I switched to scoped rayon pool usage (at the 6:00-ish mark in the screenshot above). I also saw no sync issues or WARN log messages. This was also with column reconstruction set to zero delay (i.e. reconstructing as soon as we get half the columns).

With scoped rayon pool usage, reconstruction during backfill slows down a bit (as expected):
[tracing screenshot]

@eserilev marked this pull request as ready for review on September 1, 2025 01:59
@eserilev requested a review from jxs as a code owner on September 1, 2025 01:59
@eserilev added the ready-for-review, v8.0.0-rc.0, and das labels and removed the blocked, work-in-progress, and do-not-merge labels on Sep 1, 2025
@eserilev (Member) commented Sep 1, 2025

I've done a bunch of testing on backfill with the global rayon pool vs scoped rayon pool usage. The biggest difference is that KZG verification takes more than double the time with a scoped rayon pool (using a max of ~25% of CPU threads) compared to the global rayon pool. Additionally, I saw lower CPU usage on average in the scoped pool case. My node did not have any issues following the chain in either case during backfill, so I can't argue that this change is absolutely necessary. But it is a safe and relatively simple optimization, and it can potentially help protect nodes from having issues during backfill sync, so maybe that's a good enough reason to include it in our 8.0.0 release candidate.

I haven't seen any evidence that something like #7789 is required. It seems like the OS scheduler is good enough at figuring things out with the current scoped rayon pool usage. If we were to expand our scoped rayon pool usage to other work events, #7789 could potentially become more relevant.

In a future iteration (or in this PR) we could make the low-priority rayon pool configurable via a CLI flag to give users additional control over the speed of backfill. This is probably unnecessary at the moment, but could potentially become useful if we expand scoped rayon thread pool usage.

Another TODO could be to introduce a high priority rayon pool and always avoid scheduling rayon tasks on the global thread pool. It remains to be seen if that would be useful considering our current rayon usage.

mergify bot commented Sep 1, 2025

Some required checks have failed. Could you please take a look @jimmygchen? 🙏

The mergify bot added the waiting-on-author label and removed the ready-for-review label on Sep 1, 2025
@eserilev (Member) commented Sep 2, 2025

I've added a high-priority rayon pool that's used during backfill when the --disable-backfill-rate-limiting flag is set. This effectively speeds up KZG verification 2x compared to using the low-priority pool. The flag itself already warns that the node may have degraded attestation performance, so I don't think we need to warn users more than we already do.
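
As a rough sketch of the dispatch (the helper name and signature here are hypothetical; only the flag and the two-pool idea come from this PR):

```rust
use rayon::ThreadPool;

/// Hypothetical helper: choose which rayon pool backfill KZG verification
/// runs on, based on the `--disable-backfill-rate-limiting` CLI flag.
pub fn backfill_pool<'a>(
    disable_backfill_rate_limiting: bool,
    high_priority: &'a ThreadPool,
    low_priority: &'a ThreadPool,
) -> &'a ThreadPool {
    if disable_backfill_rate_limiting {
        // The flag already warns about degraded attestation performance,
        // so the larger, faster pool is used without additional warnings.
        high_priority
    } else {
        low_priority
    }
}
```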

@eserilev (Member) commented Sep 2, 2025

Something about the changes to the backfill work event is causing a test to fail non-deterministically. I'm wondering if it's because the backfill task is now blocking instead of async?

@eserilev (Member) commented Sep 2, 2025

I made a small tweak to assert_event_journal_with_timeout to ignore WORKER_FREED and NOTHING_TO_DO events for the test case that was failing non-deterministically.
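
The filtering idea, as a hypothetical sketch (the real assert_event_journal_with_timeout has a different signature; this only illustrates dropping the non-deterministic bookkeeping events before comparing):

```rust
/// Compare the observed event journal against the expected sequence while
/// ignoring bookkeeping events whose ordering is non-deterministic.
/// Helper name and types here are illustrative, not the actual test code.
fn assert_journal_ignoring_noise(observed: &[&str], expected: &[&str]) {
    const IGNORED: &[&str] = &["WORKER_FREED", "NOTHING_TO_DO"];
    let filtered: Vec<&str> = observed
        .iter()
        .copied()
        .filter(|event| !IGNORED.contains(event))
        .collect();
    assert_eq!(filtered, expected, "unexpected event journal: {observed:?}");
}
```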

@eserilev added the ready-for-review label and removed the waiting-on-author label on Sep 2, 2025
mergify bot commented Sep 2, 2025

Some required checks have failed. Could you please take a look @jimmygchen? 🙏

The mergify bot added the waiting-on-author label and removed the ready-for-review label on Sep 2, 2025
The mergify bot added the ready-for-review label and removed the waiting-on-author label on Sep 3, 2025