[storage] Replace pruner worker polling with channel-based wake #18874

Merged
wqfish merged 1 commit into main from pr18874
Feb 28, 2026

@wqfish wqfish commented Feb 27, 2026

Replace the 1ms sleep-poll loop in PrunerWorker with a
sync_channel(1)-based blocking wait.

Before: Each pruner worker thread (state_merkle, ledger, state_kv) woke up
every 1ms to check is_pruning_pending(), causing ~1000 context switches/sec
per thread even when fully idle.

After: The worker blocks on recv() when there's no pending work, and is
woken only when:

  • set_target_db_version is called with a new target (try_send(()))
  • The PrunerWorker is dropped (sender dropped → recv() returns Err)

This also eliminates the PrunerWorkerInner struct — the channel naturally
separates the producer (sender on PrunerWorker) from the consumer (receiver
moved into the worker thread). Error backoff is bumped from 1ms to 100ms.
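The wake-and-shutdown flow described above can be sketched roughly as follows. This is a minimal illustration, not the code from this PR: `Worker`, `set_target`, `progress`, and the atomic counters are hypothetical stand-ins for `PrunerWorker` and its pruner state, and joining the thread in `Drop` is an assumption.

```rust
use std::sync::atomic::{AtomicU64, Ordering};
use std::sync::mpsc::{sync_channel, SyncSender};
use std::sync::Arc;
use std::thread::{self, JoinHandle};

/// Hypothetical stand-in for the pruner worker; all names are illustrative.
pub struct Worker {
    target: Arc<AtomicU64>,
    progress: Arc<AtomicU64>,
    // Producer side of the wake channel; dropping it disconnects the channel.
    wake: Option<SyncSender<()>>,
    handle: Option<JoinHandle<()>>,
}

impl Worker {
    pub fn new() -> Self {
        let target = Arc::new(AtomicU64::new(0));
        let progress = Arc::new(AtomicU64::new(0));
        // Capacity 1: at most one pending wake token is ever buffered.
        let (tx, rx) = sync_channel::<()>(1);
        let (t, p) = (Arc::clone(&target), Arc::clone(&progress));
        let handle = thread::spawn(move || loop {
            // Drain all pending work before going idle.
            while p.load(Ordering::SeqCst) < t.load(Ordering::SeqCst) {
                p.fetch_add(1, Ordering::SeqCst); // pretend to prune one version
            }
            // Idle: block until a wake token arrives or the sender is dropped.
            if rx.recv().is_err() {
                break; // channel disconnected, so shut down
            }
        });
        Self { target, progress, wake: Some(tx), handle: Some(handle) }
    }

    /// Advances the target and wakes the worker only if the target moved forward.
    pub fn set_target(&self, version: u64) {
        // fetch_max returns the previous value.
        if self.target.fetch_max(version, Ordering::SeqCst) < version {
            if let Some(tx) = &self.wake {
                // A full buffer means a wake is already pending; ignoring is safe.
                let _ = tx.try_send(());
            }
        }
    }

    pub fn progress(&self) -> u64 {
        self.progress.load(Ordering::SeqCst)
    }
}

impl Drop for Worker {
    fn drop(&mut self) {
        // Dropping the sender makes recv() return Err in the worker thread.
        drop(self.wake.take());
        if let Some(handle) = self.handle.take() {
            let _ = handle.join();
        }
    }
}
```

In this sketch the target is updated before the token is sent, and the worker re-checks for pending work after every wake, so a target update that races with the worker going idle still leaves a token in the buffer for the next `recv()`.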

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>


Note

Medium Risk
Changes the pruner worker’s concurrency model from tight sleep-polling to channel-based blocking wakeups, which can affect shutdown/wakeup behavior and risk hangs if signaling is missed. Scope is limited to PrunerWorker, but it runs continuously in production so regressions would be noisy.

Overview
Replaces PrunerWorker’s 1ms sleep/poll loop with a sync_channel(1) wake mechanism so idle pruner threads block on recv() and only run when signaled.

set_target_db_version() now calls try_send(()) to wake the worker after advancing the target version, and Drop signals shutdown by dropping the sender (the worker exits when the channel disconnects). The old PrunerWorkerInner struct and its AtomicBool stop flag are removed, and the pruning error backoff is increased from 1ms to 100ms.
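To make the signaling semantics concrete, here is a small standalone sketch of how `std::sync::mpsc::sync_channel(1)` behaves under repeated `try_send(())` calls and after the sender is dropped. The helper names `buffered_wakeups` and `recv_fails_after_sender_drop` are hypothetical and exist only for this illustration; they are not part of the PR.

```rust
use std::sync::mpsc::{sync_channel, TrySendError};

/// Sends `signals` wake tokens while nobody is receiving, then counts how
/// many tokens were actually buffered by the capacity-1 channel.
pub fn buffered_wakeups(signals: usize) -> usize {
    let (tx, rx) = sync_channel::<()>(1);
    for _ in 0..signals {
        match tx.try_send(()) {
            // The token was buffered: the worker will wake once.
            Ok(()) => {}
            // A token is already pending, so this wake is coalesced into it.
            Err(TrySendError::Full(())) => {}
            Err(TrySendError::Disconnected(())) => {
                unreachable!("receiver is still alive in this sketch")
            }
        }
    }
    let mut count = 0;
    // try_recv never blocks; it returns Err once the buffer is empty.
    while rx.try_recv().is_ok() {
        count += 1;
    }
    count
}

/// Returns true if a blocking recv() reports an error once the sender is gone.
pub fn recv_fails_after_sender_drop() -> bool {
    let (tx, rx) = sync_channel::<()>(1);
    drop(tx);
    rx.recv().is_err()
}
```

Because any number of signals collapses into a single buffered token, the receiving side has to re-check for pending work after each wake rather than assume one token means one unit of work.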

Written by Cursor Bugbot for commit e4a6b8e.

@wqfish wqfish marked this pull request as ready for review February 27, 2026 04:49
@wqfish wqfish requested a review from zekun000 February 27, 2026 04:49
@wqfish wqfish enabled auto-merge (rebase) February 27, 2026 22:15
@wqfish wqfish disabled auto-merge February 27, 2026 23:46
@wqfish wqfish enabled auto-merge (rebase) February 28, 2026 04:07

@github-actions

✅ Forge suite compat success on 411b9c779ae61d440e96591183edd7ea0e0bb563 ==> e4a6b8e6df9d590914111648df0694eee02c05d3

Compatibility test results for 411b9c779ae61d440e96591183edd7ea0e0bb563 ==> e4a6b8e6df9d590914111648df0694eee02c05d3 (PR)
1. Check liveness of validators at old version: 411b9c779ae61d440e96591183edd7ea0e0bb563
compatibility::simple-validator-upgrade::liveness-check : committed: 13593.21 txn/s, latency: 2551.91 ms, (p50: 2500 ms, p70: 2700, p90: 3400 ms, p99: 3900 ms), latency samples: 446260
2. Upgrading first Validator to new version: e4a6b8e6df9d590914111648df0694eee02c05d3
compatibility::simple-validator-upgrade::single-validator-upgrade : committed: 6186.91 txn/s, latency: 5489.87 ms, (p50: 6000 ms, p70: 6100, p90: 6300 ms, p99: 6400 ms), latency samples: 212920
3. Upgrading rest of first batch to new version: e4a6b8e6df9d590914111648df0694eee02c05d3
compatibility::simple-validator-upgrade::half-validator-upgrade : committed: 6059.56 txn/s, latency: 5561.51 ms, (p50: 6100 ms, p70: 6200, p90: 6400 ms, p99: 6600 ms), latency samples: 210040
4. upgrading second batch to new version: e4a6b8e6df9d590914111648df0694eee02c05d3
compatibility::simple-validator-upgrade::rest-validator-upgrade : committed: 2941.45 txn/s, submitted: 3180.68 txn/s, expired: 239.23 txn/s, latency: 3577.31 ms, (p50: 3200 ms, p70: 4500, p90: 5100 ms, p99: 8800 ms), latency samples: 240241
5. check swarm health
Compatibility test for 411b9c779ae61d440e96591183edd7ea0e0bb563 ==> e4a6b8e6df9d590914111648df0694eee02c05d3 passed
Test Ok

@github-actions

✅ Forge suite realistic_env_max_load success on e4a6b8e6df9d590914111648df0694eee02c05d3

two traffics test: inner traffic : committed: 16041.60 txn/s, latency: 2342.71 ms, (p50: 2300 ms, p70: 2400, p90: 2700 ms, p99: 3000 ms), latency samples: 5971320
two traffics test : committed: 100.02 txn/s, latency: 804.09 ms, (p50: 800 ms, p70: 800, p90: 900 ms, p99: 1200 ms), latency samples: 1760
Latency breakdown for phase 0: ["MempoolToBlockCreation: max: 1.791, avg: 1.686", "ConsensusProposalToOrdered: max: 0.179, avg: 0.175", "ConsensusOrderedToCommit: max: 0.123, avg: 0.102", "ConsensusProposalToCommit: max: 0.296, avg: 0.277"]
Max non-epoch-change gap was: 0 rounds at version 0 (avg 0.00) [limit 4], 0.85s no progress at version 16239 (avg 0.08s) [limit 15].
Max epoch-change gap was: 0 rounds at version 0 (avg 0.00) [limit 4], 0.29s no progress at version 2738479 (avg 0.29s) [limit 16].
Test Ok

@github-actions

✅ Forge suite framework_upgrade success on 411b9c779ae61d440e96591183edd7ea0e0bb563 ==> e4a6b8e6df9d590914111648df0694eee02c05d3

Compatibility test results for 411b9c779ae61d440e96591183edd7ea0e0bb563 ==> e4a6b8e6df9d590914111648df0694eee02c05d3 (PR)
Upgrade the nodes to version: e4a6b8e6df9d590914111648df0694eee02c05d3
framework_upgrade::framework-upgrade::full-framework-upgrade : committed: 2302.15 txn/s, submitted: 2307.44 txn/s, failed submission: 5.29 txn/s, expired: 5.29 txn/s, latency: 1289.41 ms, (p50: 1200 ms, p70: 1500, p90: 1800 ms, p99: 2400 ms), latency samples: 208980
framework_upgrade::framework-upgrade::full-framework-upgrade : committed: 2503.12 txn/s, submitted: 2513.65 txn/s, failed submission: 10.53 txn/s, expired: 10.53 txn/s, latency: 1141.35 ms, (p50: 1200 ms, p70: 1200, p90: 1500 ms, p99: 2100 ms), latency samples: 228161
5. check swarm health
Compatibility test for 411b9c779ae61d440e96591183edd7ea0e0bb563 ==> e4a6b8e6df9d590914111648df0694eee02c05d3 passed
Upgrade the remaining nodes to version: e4a6b8e6df9d590914111648df0694eee02c05d3
framework_upgrade::framework-upgrade::full-framework-upgrade : committed: 2546.41 txn/s, submitted: 2553.34 txn/s, failed submission: 6.93 txn/s, expired: 6.93 txn/s, latency: 1167.66 ms, (p50: 1200 ms, p70: 1200, p90: 1500 ms, p99: 2100 ms), latency samples: 227741
Test Ok

@wqfish wqfish merged commit 9db2bb4 into main Feb 28, 2026
132 of 143 checks passed
@wqfish wqfish deleted the pr18874 branch February 28, 2026 04:54