make event dispatching non-blocking #6762
benjamin-stacks wants to merge 28 commits into stacks-network:develop
Conversation
@@ -0,0 +1,322 @@
use std::path::PathBuf;
Hey! Please add the copyright header to each source file. Thanks!
Ah, good call, will do. Is "Stacks Open Internet Foundation" still the correct copyright holder?
Also, there's a whole bunch of files that lack that header, I'm wondering if there's a good way to automate this.
Codecov Report
@@ Coverage Diff @@
## develop #6762 +/- ##
===========================================
+ Coverage 72.67% 76.86% +4.19%
===========================================
Files 411 419 +8
Lines 221663 223340 +1677
Branches 0 338 +338
===========================================
+ Hits 161086 171678 +10592
+ Misses 60577 51662 -8915
... and 253 files with indirect coverage changes
Force-pushed from 98b3395 to 0d62bd4
This commit is the main implementation work for stacks-network#6543. It moves event dispatcher HTTP requests to a separate thread, so that a slow event observer doesn't block the node from continuing its work. The node will only start blocking again if your event observers are so slow that it continuously produces events faster than they can be delivered, because the queue of pending requests is bounded (at 1,000 right now; I picked that number out of a hat and am happy to change it if anyone has thoughts).

Each new event payload is stored in the event observer DB, and its ID is then sent to the subthread, which makes the request and then deletes the DB entry. That way, if a node is shut down while there are pending requests, they're in the DB ready to be retried after restart via `process_pending_payloads()` (which blocks until completion). That's exactly as before (except that previously there couldn't have been more than one or two pending payloads).
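The flow described above (persist first, send only the ID over a bounded channel, delete after delivery) can be sketched roughly like this. Note this is a hedged simplification, not the PR's actual code: `dispatch` and `deliver` are hypothetical names, and a `HashMap` stands in for the persistent observer DB.

```rust
use std::collections::HashMap;
use std::sync::mpsc::{sync_channel, SyncSender};
use std::sync::{Arc, Mutex};
use std::thread;

// Stand-in for the event observer DB; the real code persists payloads
// so they survive a restart.
type PayloadDb = Arc<Mutex<HashMap<u64, String>>>;

// Placeholder for the real HTTP delivery to an event observer.
fn deliver(payload: &str) -> bool {
    !payload.is_empty()
}

// Persist the payload first, then hand only its ID to the worker thread.
// `send` blocks only once the bounded queue is full.
fn dispatch(db: &PayloadDb, tx: &SyncSender<u64>, id: u64, payload: String) {
    db.lock().unwrap().insert(id, payload);
    tx.send(id).unwrap();
}

fn main() {
    let db: PayloadDb = Arc::new(Mutex::new(HashMap::new()));
    let (tx, rx) = sync_channel::<u64>(1_000);

    let worker_db = Arc::clone(&db);
    let worker = thread::spawn(move || {
        for id in rx {
            let payload = worker_db.lock().unwrap().get(&id).cloned();
            if let Some(p) = payload {
                if deliver(&p) {
                    // Delete only after successful delivery, so pending
                    // payloads survive a shutdown and can be retried.
                    worker_db.lock().unwrap().remove(&id);
                }
            }
        }
    });

    dispatch(&db, &tx, 1, "block event".to_string());
    drop(tx); // closing the channel ends the worker loop
    worker.join().unwrap();
    assert!(db.lock().unwrap().is_empty());
}
```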
This fixes [this integration test failure](https://github.com/stacks-network/stacks-core/actions/runs/20749024845/job/59577684952?pr=6762), caused by the fact that event delivery wasn't complete by the time the assertions were made.
Doing this work in the RunLoop implementations' startup code is *almost* the same thing, but not quite, since the nakamoto run loop might be started later (after an epoch 3 transition), at which point the event DB may already have new items from the current run of the application, which should *not* be touched by `process_pending_payloads`. This used to not be a problem, but now that that DB is used for the actual queue of the (concurrently running) EventDispatcherWorker, it has become one.
This is like 72437b2, but it works for all the tests instead of only the one. While only that one test failed blatantly, the issue exists for pretty much all of the integration tests, because they rely on the test_observer to capture all relevant data. Things are usually fast enough, which is why we've only seen one blatant failure so far, but 1) it's going to be flaky (I can produce a whole lot of test failures by adding a small artificial delay to event delivery), and 2) it might actually be *hiding* test failures (in some cases, such as neon_integrations::deep_contract, we're asserting that certain things are *not* in the data, and if the data is incomplete to begin with, those assertions are moot).
Force-pushed from 478efa3 to d5fa2fc
When switching runloops at the epoch 2/3 transition, this ensures that the same event dispatcher worker thread is handling delivery, which in turn ensures that all payloads are delivered in order
Thanks Hank for the tip!
Not sure why this wasn't caught in the pre-commit hook, I'd have assumed the checks are the same.
The logic is slightly tricky here, because the size of the queue (the max number of in-flight requests before we start blocking the thread) is implemented through the `bound` parameter of the `sync_channel`, but those two values aren't actually the same. See the comment at the top of `EventDispatcherWorker::new()` for details.
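For illustration, here is a minimal demonstration of the `sync_channel` bound semantics at play here: the `bound` counts only *buffered* messages, so a message the worker has already received no longer counts toward it, which is presumably why the configured queue size and the channel bound aren't the same value. With a bound of 0, every `send` blocks until the receiver picks the message up.

```rust
use std::sync::mpsc::sync_channel;
use std::thread;
use std::time::{Duration, Instant};

fn main() {
    // Bound 0 makes this a "rendezvous channel": no buffering at all,
    // so `send` blocks until the receiver actually calls `recv`.
    let (tx, rx) = sync_channel::<u32>(0);
    let start = Instant::now();
    let receiver = thread::spawn(move || {
        thread::sleep(Duration::from_millis(100));
        rx.recv().unwrap()
    });
    tx.send(7).unwrap(); // blocks ~100ms until the recv happens
    assert!(start.elapsed() >= Duration::from_millis(100));
    assert_eq!(receiver.join().unwrap(), 7);
}
```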
See the discussion in stacks-network#6543 for some background.
/// to `true`, as no in-flight requests are allowed.
/// ---
/// @default: `1_000`
pub event_dispatcher_queue_size: usize,
This value ultimately ends up as the bound argument to sync_channel, which is of type usize.
I don't know if we have any concerns about having a platform-dependent number type on the configuration object -- if so, we can also use something else here. Arguably, any reasonable value for this setting should fit into 16 bits anyway. If you need your queue to be bigger than 64k, you should reconsider your architecture.
// we waited 500ms previously, so it should take on the order of 1.5s until
// the first request is complete
assert!(
So, this logic has the potential to be flaky since it's making assumptions about how long processed_mined_nakamoto_block_event() will take. Is there a way to make this more robust?
I thought about that, but I don't think there's a way to completely get around this theoretical possibility, short of #[cfg(test)] code that reports "I promise, I blocked!", which wouldn't be a true test of actual behavior.
The only way to assert that thread A blocked until the completion of thread B is to assert that B finished before A, but that will always make the assumption that waiting for B was the reason that A didn't finish earlier.
And conversely, the only way to assert that thread A was not blocked is to assert that it continues before B is finished, but there can't be a 100% guarantee that B isn't faster -- the CPU starvation theory might as well apply to thread A.
In addition, some such assertions would require a complex setup needing a third thread to coordinate, which would make the test harder to reason about and increase the chance of testing the wrong thing.
That is why I decided to rely on timing conditions that realistically could only be achieved by correct behavior. I also made sure to avoid false negatives by asserting that the measured durations are neither too short nor too long. And I picked durations that I felt are long enough to make the likelihood extremely small that other effects are causing the behavior -- hundreds and thousands of milliseconds seem like an eternity in CPU land (but I'm happy to increase them even more if you disagree).
Since these aren't end-to-end tests, there's no interplay between a bitcoin daemon, a signer, a chainstate coordinator, etc. This further reduces the chance for such flakes. Yes, process_mined_nakamoto_block_event() could theoretically take a long time, but realistically it's simple enough that that's unlikely.
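The two-sided timing window described here can be sketched like so; `blocking_dispatch` is a hypothetical stand-in that simulates the observer delay with a sleep, not the PR's actual test code.

```rust
use std::thread;
use std::time::{Duration, Instant};

// Hypothetical stand-in for the code path under test: a dispatch that is
// expected to block until a slow observer (simulated by a sleep) finishes.
fn blocking_dispatch(observer_delay: Duration) {
    thread::sleep(observer_delay);
}

fn main() {
    let start = Instant::now();
    blocking_dispatch(Duration::from_millis(300));
    let elapsed = start.elapsed();
    // Bound the measurement on both sides: long enough to show we really
    // waited for the observer, short enough to rule out unrelated stalls.
    assert!(elapsed >= Duration::from_millis(300), "dispatcher did not block");
    assert!(elapsed < Duration::from_secs(5), "dispatcher blocked far too long");
}
```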
assert_eq!(start_count.load(Ordering::SeqCst), 2);
assert_eq!(end_count.load(Ordering::SeqCst), 1);
thread::sleep(Duration::from_secs(2));
Same here -- this logic can easily flake out if the event observer thread is starved of CPU time for long enough. Can we use something other than wall-clock time?
"dispatcher did not block while sending event"
);
thread::sleep(Duration::from_millis(100));
Flagging this as well, to try and use something more robust to verify that the event dispatcher is making progress correctly.
debug!("Event Dispatcher Worker: doing payload {id}");
// This will block forever if we were passed a non-existing ID. Don't do that.
Can the worker just abort in this case? Or better, can the worker send the process a SIGTERM to initiate a clean shutdown?
Yeah that's a fair point.
This code needs to be retry-able because it involves I/O, but the behavior should differ between "failure because the record simply doesn't exist" and "failure because of an I/O timeout". Will change.
// If the sending fails (i.e. the receiver has been dropped), that means a logic bug
// has been introduced to the code -- at time of writing, the main function is waiting
// for this message a few lines down, outside the thread closure.
// We log this, but we still start the loop.
I'm not sure I agree with this line of reasoning. If the worker thread cannot reliably communicate with the supervisor, then the worker should terminate as soon as possible. Otherwise, we'd make it impossible to shut down the Stacks node with a termination signal, since this thread may still be running in the background with no means of cancellation short of a SIGKILL.
This is about the channel over which the worker thread reports to the main thread that it's ready, not about communication to the worker.
Right -- if the worker can't reach the supervisor, then the worker should die and (ideally) the node would shut down since something is seriously amiss.
Yeah, that's fair. I'll change this to panic instead of just logging. Our panic hook will then kill the whole node.
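A minimal sketch of that change, assuming a simple ready-handshake channel between worker and supervisor (all names here are illustrative, not the PR's actual code):

```rust
use std::sync::mpsc::channel;
use std::thread;

fn main() {
    // Hypothetical "ready" handshake: instead of merely logging a failed
    // send, the worker panics, so a process-wide panic hook can take the
    // whole node down rather than leaving an orphaned background thread.
    let (ready_tx, ready_rx) = channel::<()>();
    let worker = thread::spawn(move || {
        ready_tx
            .send(())
            .expect("supervisor dropped the ready channel; aborting worker");
        // ... the event delivery loop would start here ...
    });
    ready_rx.recv().expect("worker died before signaling readiness");
    worker.join().unwrap();
}
```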
let mut payload = conn.get_payload_with_retry(id);
// Deliberately not handling the error case of `duration_since()` -- if the `timestamp`
// is *after* `now` (which should be extremely rare), the most likely reason is a *slight*
This very failure mode has happened to us before, and has led to node crashes that warranted an emergency hotfix. Please gracefully handle the case where duration_since() fails, since time can go backwards due to NTP sync (as you mention).
It is handled gracefully -- by simply doing nothing. This is only used for logging a warning if events are old (i.e. have been stuck in the queue). If the payload comes from the future (for all we know), it's definitely not late.
// is *after* `now` (which should be extremely rare), the most likely reason is a *slight*
// adjustment to the system clock (e.g. NTP sync) that happened between storing the
// entity and retrieving it, and that should be fine.
// If there was a *major* adjustment, all bets are off anyway. You shouldn't mess with your
No, we should be robust in the face of clock changes. Please use the system monotonic clock instead of the wall clock. We do elsewhere.
Can you point me to an example of where we do that? I honestly don't even understand how it could be possible to do that. And all instances of serializing time stamps that I've seen in the code are either strings or unix time stamp integers, all based on wall clock time.
In my understanding, a monotonic clock can at most be expected to be reliable between operating system restarts. But a database file survives a restart, and could even theoretically be moved to a different machine (even a different platform!).
Now, what we could do is additionally pass an Instant to the worker from the calling thread if the event was in fact generated in the current execution of the application, and only fall back to the DB-stored timestamp if it's a retry across restarts.
However, if we want to log a warning if payloads are older than a certain threshold, we would still have to keep this code around that you're objecting to. And if we don't want to log that, I could just remove all the timestamp-related code here, since the logging is all we're using it for.
Thoughts?
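For what it's worth, the "do nothing on skew" behavior described in this thread can be sketched as follows; `payload_age` is a hypothetical helper, not the PR's actual code. A stored timestamp that lies in the future (e.g. after an NTP step backwards) yields an age of zero instead of an error, so the staleness warning is simply skipped.

```rust
use std::time::{Duration, SystemTime};

// If `stored` is after `now`, the payload is definitely not late:
// report an age of zero rather than propagating the SystemTimeError.
fn payload_age(stored: SystemTime, now: SystemTime) -> Duration {
    now.duration_since(stored).unwrap_or(Duration::ZERO)
}

fn main() {
    let now = SystemTime::now();
    // A payload stored 5s ago is 5s old.
    assert_eq!(payload_age(now - Duration::from_secs(5), now), Duration::from_secs(5));
    // A payload "from the future": zero age, no panic.
    assert_eq!(payload_age(now + Duration::from_secs(5), now), Duration::ZERO);
}
```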
// Cap the backoff at 3x the timeout
let max_backoff = data.timeout.saturating_mul(3);
loop {
Could this be factored to use the with_retry() function above?
Also, I see that this code performs retry jitter, whereas with_retry() does not. Is there a reason for this?
> Could this be factored to use the with_retry() function above?
Possibly, but as you noted, the two behave differently. I didn't use with_retry() for the HTTP request because I didn't want to introduce unrelated functional changes or make with_retry() more complex to support a single use case.
As for the question why that difference is there, I could guess, but I don't know for sure because that code predates me by a long time.
The with_retry logic was added in #5358, I just moved it to a helper so I could reuse it for get_payload_with_retry().
The backoff jitter for the HTTP request retries was added in #5327.
Since both of those came from @brice-stacks, I guess he could add some color here, but in either case, I don't think a change to any of this needs to be in this PR.
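As a rough illustration of the capped-backoff-with-jitter scheme discussed above (all names and constants here are assumptions; the real code's parameters and RNG differ, and the xorshift step is only a toy, dependency-free jitter source):

```rust
use std::time::Duration;

// Exponential backoff, capped at 3x the request timeout, plus up to
// 24% jitter so that retrying clients don't synchronize.
fn next_backoff(attempt: u32, timeout: Duration, seed: &mut u64) -> Duration {
    let max_backoff = timeout.saturating_mul(3);
    let base = Duration::from_millis(100).saturating_mul(2u32.saturating_pow(attempt));
    let capped = base.min(max_backoff);
    // xorshift64: cheap pseudo-randomness, fine for jitter purposes
    *seed ^= *seed << 13;
    *seed ^= *seed >> 7;
    *seed ^= *seed << 17;
    let jitter_pct = (*seed % 25) as u32;
    capped + capped / 100 * jitter_pct
}

fn main() {
    let mut seed = 0x9E37_79B9_7F4A_7C15u64;
    let timeout = Duration::from_secs(1);
    for attempt in 0..8 {
        let d = next_backoff(attempt, timeout, &mut seed);
        // Never more than the 3s cap plus 24% jitter.
        assert!(d <= Duration::from_secs(3) + Duration::from_millis(720));
        println!("attempt {attempt}: backing off {d:?}");
    }
}
```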
stacks-node/src/event_dispatcher.rs (outdated)
pub stackerdb_channel: Arc<Mutex<StackerDBChannel>>,
/// Path to the database where pending payloads are stored.
db_path: PathBuf,
/// The worker thread that performs the actual HTTP requests so that they don't block
stacks-node/src/event_dispatcher.rs (outdated)
static ALL_WORKERS: Mutex<Vec<Weak<EventDispatcherWorker>>> = Mutex::new(Vec::new());

#[cfg(test)]
pub fn catch_up_all_event_dispatchers() {
I'm not sure why this and ALL_WORKERS are necessary? Isn't it the case that an event dispatcher configured with a queue size of zero ought to synchronously deliver events? And, isn't that what all the existing tests expect the event dispatcher to do?
Ooohh this is an interesting point. You're absolutely right, but there's some history and nuance here.
I originally implemented all this to always be non-blocking. This meant that the e2e tests had to explicitly wait for the dispatcher to catch up in order to be able to assert on certain events.
Then later, per your and Aaron's feedback in this thread, I added back the option to make it blocking and also made that the default behavior.
Indeed that means this catch_up code isn't actually necessary right now.
However, this also means that the asynchronous logic no longer gets any coverage from the integration tests. We could change the config for the integration tests to run with a positive queue size instead, in which case this catching up would be needed again.
The more I think about this though, the less I think this is necessary. Even with a queue size of zero, all the bits and pieces of the asynchronous implementation (DB persistence, channel message, worker thread) still get coverage. And for the tricky bits (like blocking on a full queue), I added unit tests.
Long story short, I tend to agree with you that this can be removed again, and will do that. But let me know if you have any additional thoughts based on this context.
jcnelson left a comment
This overall looks good, but I have a few concerns about the threadpool and the handling of clock skew (among other lesser things).
as [Jude points out](stacks-network#6762 (comment)), it's no longer necessary since the default behavior is still blocking
addresses #6543
Checklist
- docs/rpc/openapi.yaml and rpc-endpoints.md for v2 endpoints, event-dispatcher.md for new events
- New clarity functions have corresponding PR in clarity-benchmarking repo