Simplify SRS alarm guarantees #5325

MellowYarker · 2025-10-15T22:14:13Z

The existing SRS alarms implementation prioritizes moving the alarm time earlier and deprioritizes moving it later. In theory, this means that if we hit a catastrophic failure while trying to move the alarm earlier, we silently "roll it back" in armAlarmHandler, as if the partial write never happened. If we fail while trying to move the alarm time later, then when the alarm runs early we see that it is supposed to run later (according to SRS, which committed the update) and move the execution time later in armAlarmHandler.
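A rough sketch of the fire-time reconciliation described above, for illustration only. All names here (`FireDecision`, `decideOnFire`) are hypothetical and this is not the real armAlarmHandler signature; treating "no durable alarm" as a cancel is also an assumption. The point is that in both partial-failure cases the alarm fires earlier than SRS says it should, so the handler defers to SRS.

```cpp
#include <cassert>
#include <optional>

// Hypothetical names throughout; this is not the workerd API.
enum class FireDecision { Run, RescheduleLater, Cancel };

// `firedAt` is when the alarm actually fired; `durable` is the alarm time
// committed to SRS (nullopt if no alarm is stored).
FireDecision decideOnFire(long firedAt, std::optional<long> durable) {
  if (!durable) return FireDecision::Cancel;  // assumption: nothing stored, nothing to run
  if (firedAt < *durable) return FireDecision::RescheduleLater;  // fired early: defer to SRS
  return FireDecision::Run;  // on time (or late): run the handler
}

// Usage sketch covering the two failure cases from the description.
void checkOldBehaviour() {
  // Move-earlier failed after scheduling: fired at 100, SRS still says 200.
  assert(decideOnFire(100, 200) == FireDecision::RescheduleLater);
  // Move-later committed to SRS but was never scheduled: fired at 100, SRS says 300.
  assert(decideOnFire(100, 300) == FireDecision::RescheduleLater);
  // Both sides agree: run the alarm.
  assert(decideOnFire(200, 200) == FireDecision::Run);
}
```

Either way the durable SRS value wins, which is why an early fire is recoverable and a late fire is not.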

In practice, this implementation is hard to reason about (especially with foreground and background tasks that may interleave upon retries), and if the client sees an exception from storage, it should not be making assumptions about the state of storage. Instead, it should re-run the DO and reconcile storage in a new request however it sees fit.

This PR attempts to loosen some guarantees of SRS alarms by relying on a few things:

1. Startup Reconciliation: Upon construction of ActorSqlite, we scheduleRun() with whatever durable alarm time is in SRS. This way, if scheduleRun() completes and the SRS write does not, we roll back on startup. If they are already in sync, it's just an extra scheduleRun() call.
2. Atomic operations via Output Gate: If the output gate breaks during a scheduleRun() request, the client knows the state of storage is unknown, so it has to retry with a new request. The gate only opens successfully if both our scheduleRun() succeeds (first) and our commit to SRS succeeds (second).
3. Serialization of requests: We rely on commitTasks and on waiting for the previous commit to finish, and we only send one scheduleRun() per commit. We also send a "fixup" scheduleRun() once if there are any waiters of a pendingCommit. For example, if there are 4 waiting commits, each having called setAlarm(), the first waiter calls scheduleRun() with the most recently set alarm time.
4. Eagerly scheduling all alarms: We don't distinguish earlier vs. later alarms; we always scheduleRun() if a new, different alarm state is being requested. There are no background tasks (which may interleave and cause races). We do this before we commit to SRS, and we can reconcile catastrophic loss by restarting the actor on the next request (which, as previously mentioned, the client should initiate).
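To make the interaction of these points concrete, here is a minimal in-memory sketch. It is not the ActorSqlite code: `AlarmModel`, `commitAlarm`, `restartAndReconcile` and the failure-injection flags are all hypothetical, and alarm times are plain integers. It only models the ordering (scheduleRun() first, SRS commit second), the gate breaking on either failure, and reconciliation on restart.

```cpp
#include <cassert>
#include <optional>

using AlarmTime = std::optional<long>;  // nullopt means "no alarm"

// Hypothetical model: `scheduled` is what scheduleRun() last set,
// `durable` is the alarm time committed to SRS.
struct AlarmModel {
  AlarmTime scheduled;
  AlarmTime durable;
  bool gateBroken = false;

  // One commit that carries an alarm change. `scheduleOk` and `commitOk`
  // inject failures for illustration. Returns true only if the gate opens.
  bool commitAlarm(AlarmTime requested, bool scheduleOk, bool commitOk) {
    if (gateBroken) return false;  // storage is already broken
    if (requested != scheduled) {  // eager: any new, different state is scheduled
      if (!scheduleOk) { gateBroken = true; return false; }
      scheduled = requested;       // step 1: scheduleRun()
    }
    if (!commitOk) { gateBroken = true; return false; }
    durable = requested;           // step 2: SRS commit
    assert(scheduled == durable);  // gate opens only when both agree
    return true;
  }

  // Models constructing a fresh actor: scheduleRun() with the durable time.
  void restartAndReconcile() {
    scheduled = durable;
    gateBroken = false;
  }
};

// Usage sketch: scheduleRun() succeeds, the SRS commit fails, restart heals.
void checkPartialFailure() {
  AlarmModel m;
  bool first = m.commitAlarm(100, true, true);
  bool second = m.commitAlarm(200, true, false);
  assert(first && !second && m.gateBroken);
  assert(m.scheduled == AlarmTime(200) && m.durable == AlarmTime(100));
  m.restartAndReconcile();
  assert(m.scheduled == AlarmTime(100));
}
```

In this model the client never observes a successful request while the two values disagree; the divergence exists only behind a broken gate and disappears on the next startup.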

In theory, if we scheduleAlarm() to a later time, and this request succeeds, but we fail to make the alarm durable in SRS, then we will run the alarm at the new later time, and see that SRS had an earlier alarm marked. While this may seem bad, in reality, the output gate would have been broken during the alarm write and so the client should retry the operation (or reconcile storage somehow). If the DO starts up for any reason, then by (1 -- Startup Reconciliation) we would self-heal/rollback to SRS's confirmed alarm time. Additionally, the client would likely setAlarm() again.

Even if the application uses allowUnconfirmed = true with setAlarm(), it wouldn't know that the alarm was confirmed durable until it successfully sync()ed. If the alarm write experienced a full or partial failure, we'd break the output gate, and so the sync() would fail. If the sync() succeeds, then both the scheduleRun() and the SRS commit were confirmed.

src/workerd/io/actor-sqlite.c++

MellowYarker · 2025-10-17T22:22:44Z

I put up a test fix + rebased. I want to add a few more tests to cover some new behavior (mostly just reconciliation after crashing), but after that I'll move it out of draft. Only 1 internal test is failing. I'll do all that on Monday, brain is mush after going through all the tests today 😅.

jqmmes · 2025-10-22T15:10:34Z

src/workerd/io/actor-sqlite.c++

-      alarmScheduledNoLaterThan = requestedTime;
-    }
+  return hooks.scheduleRun(requestedTime).then([this, requestedTime]() {
+    KJ_LOG(INFO, "scheduleRun set alarm time to ", requestedTime);


jqmmes: Isn't this going to get really noisy, really fast? Isn't this also going to go to Sentry?

MellowYarker: We don't print INFO logs in production; it's only there so I can expect it in a test in the internal repo.

justin-mp: I initially thought that this would spam every Durable Object started by workerd because this function is called by tryReconcileAlarm() in the constructor, but I guess the hooks check would fix that.

jqmmes · 2025-10-22T16:06:49Z

Overall, I think this new "simplified" (or maybe more strict) alarm synchronization looks good, but I'd probably be more comfortable if we released this gradually with an autogate. After all, this is changing how we are setting alarms in SRS.

MellowYarker · 2025-10-22T23:48:37Z

src/workerd/io/actor-sqlite.h

+        // If the setAlarm fails, we need to break the output gate. We can set `broken`
+        // so subsequent storage operations fail with an exception.
+        //
+        // TODO(now): I want this to log for us internally, but print as an internal error to the


@jclee or @jqmmes either of you know how I can do this?

MellowYarker · 2025-10-22T23:49:44Z

> Overall, I think this new "simplified" (or maybe more strict) alarm synchronization looks good, but I'd probably be more comfortable if we released this gradually with an autogate. After all, this is changing how we are setting alarms in SRS.

@jqmmes I agree, though it's going to be rough having both versions of the code 🙁. I'll try and push something tomorrow.

codspeed-hq · 2025-10-23T01:17:37Z

CodSpeed Performance Report

Merging #5325 will improve performance by 9.61%

Comparing milan/STOR-4521 (50fbfaa) with main (6fd515f)

Summary

⚡ 1 improvement
✅ 32 untouched
⏩ 9 skipped¹

Benchmarks breakdown

| | Benchmark | `BASE` | `HEAD` | Change |
|---|---|---|---|---|
| ⚡ | `simpleStringBody[Response]` | 22.9 µs | 20.9 µs | +9.61% |

¹ 9 benchmarks were skipped, so the baseline results were used instead. If they were deleted from the codebase, they can be archived to remove them from the performance reports.

src/workerd/io/actor-sqlite.h

src/workerd/io/actor-sqlite-test.c++

justin-mp · 2025-10-23T19:22:55Z

src/workerd/io/actor-sqlite.c++

-      alarmScheduledNoLaterThan = requestedTime;
-    }
+  return hooks.scheduleRun(requestedTime).then([this, requestedTime]() {
+    KJ_LOG(INFO, "scheduleRun set alarm time to ", requestedTime);


I initially thought that this would spam every Durable Object started by workerd because this function is called by tryReconcileAlarm() in the constructor, but I guess the hooks check would fix that.

The next commit will add new SRS alarm logic, so we want to put the old logic behind an autogate check first.

Previously, we tried to maintain an invariant where we would always strive to update alarms to run earlier, but deprioritize setting alarms later. We did this to try to ensure alarms always fire at the durably committed time (since, if they fired early, we could just reschedule them, but if they fired late, then they fired late, and that's bad).

This was a bit confusing to reason about, and it looked like retries in backgrounded (deprioritized) "setAlarm to run later" tasks could actually interleave with later commits that set the alarm earlier.

This commit attempts to simplify the flow by doing the following:

- Renaming `alarmScheduledNoLaterThan` to `currentGuaranteedAlarmTime`.
- Removing while loops that updated the current alarm time possibly multiple times per commit.
- Adding reconciliation on startup to sync the source of truth with the durable value from SRS, handling cases where SRS persisted but scheduleRun() failed.

The new approach relies on three guarantees instead of the previous invariant:

1. Serialization: All alarm updates go through `commitTasks`, preventing interleaving and ensuring one update completes before the next begins.
2. Atomicity: The output gate ensures clients only see success if both SRS and scheduleRun() succeed. On failure, clients know to retry.
3. Reconciliation: On actor startup, we sync the source of truth with the durable SRS value, fixing any divergence from partial failures.
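The serialization guarantee, combined with the "fixup" scheduleRun() from the PR description, might look roughly like the sketch below. `CommitQueue`, `setAlarmAndCommit` and `drain` are hypothetical names, not `commitTasks` itself, and it assumes that a waiter whose target time has already been sent has nothing further to schedule.

```cpp
#include <cassert>
#include <deque>
#include <optional>
#include <vector>

using AlarmTime = std::optional<long>;  // nullopt means "no alarm"

// Hypothetical model of commits waiting behind a pending commit.
struct CommitQueue {
  AlarmTime lastScheduled;                  // last value sent via scheduleRun()
  std::deque<AlarmTime> waiting;            // alarm time requested by each waiting commit
  std::vector<AlarmTime> scheduleRunCalls;  // record of what would be sent

  void setAlarmAndCommit(AlarmTime t) { waiting.push_back(t); }

  // Runs the waiting commits strictly one after another (no interleaving).
  void drain() {
    while (!waiting.empty()) {
      // The waiter at the front uses the most recently requested time.
      AlarmTime latest = waiting.back();
      if (latest != lastScheduled) {
        scheduleRunCalls.push_back(latest);
        lastScheduled = latest;
      }
      waiting.pop_front();  // this commit is done; the next one may start
    }
  }
};

// Usage sketch: four waiting commits produce a single scheduleRun(40).
void checkFourWaiters() {
  CommitQueue q;
  for (long t : {10L, 20L, 30L, 40L}) q.setAlarmAndCommit(t);
  q.drain();
  assert(q.scheduleRunCalls.size() == 1);
  assert(q.scheduleRunCalls[0] == AlarmTime(40));
}
```

Because each commit finishes before the next starts, there is no window in which a retry of an older update can overwrite a newer one.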

When the `SERIALIZE_SRS_ALARMS` autogate is enabled, we always send a `scheduleRun()` before we commit to SRS. This changes the expected order of RPCs to our mock servers.
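As an illustration of why the mock expectations change, here is a small model of the RPC order a mock server might observe for one alarm-changing commit. `expectedRpcOrder` and the string labels are hypothetical; the old-path ordering (commit first when moving the alarm later, scheduleRun() first when moving it earlier) is inferred from the description of the previous behaviour in this PR, not taken from the test file.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical: returns the order of RPCs for a commit that changes the alarm
// from `current` to `requested`.
std::vector<std::string> expectedRpcOrder(bool serializeSrsAlarms, long current, long requested) {
  assert(requested != current);  // precondition: only a changed alarm is scheduled
  if (serializeSrsAlarms || requested < current) {
    // New path always, and old path when moving earlier: schedule, then commit.
    return {"scheduleRun", "commit"};
  }
  // Old path when moving later: commit, then the deprioritized scheduleRun().
  return {"commit", "scheduleRun"};
}

// Usage sketch: only the "move later" case changes order under the autogate.
void checkOrderChange() {
  assert(expectedRpcOrder(false, 100, 200) != expectedRpcOrder(true, 100, 200));
  assert(expectedRpcOrder(false, 200, 100) == expectedRpcOrder(true, 200, 100));
}
```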

MellowYarker requested review from a team as code owners October 15, 2025 22:14

MellowYarker requested review from jclee and justin-mp October 16, 2025 02:24

justin-mp reviewed Oct 16, 2025

jclee reviewed Oct 16, 2025

MellowYarker force-pushed the milan/STOR-4521 branch from 82c92f5 to adafad2 Compare October 17, 2025 20:27

MellowYarker marked this pull request as draft October 17, 2025 20:28

MellowYarker force-pushed the milan/STOR-4521 branch 4 times, most recently from 0356b62 to 9823e33 Compare October 17, 2025 21:45

MellowYarker changed the title ~~Schedule the latest alarm when we merge commits~~ Simplify SRS alarm guarantees Oct 17, 2025

MellowYarker force-pushed the milan/STOR-4521 branch 2 times, most recently from 12ed2d8 to 84f3fb1 Compare October 21, 2025 20:04

MellowYarker marked this pull request as ready for review October 21, 2025 20:10

MellowYarker force-pushed the milan/STOR-4521 branch 3 times, most recently from 45e0a20 to 1b3d636 Compare October 21, 2025 22:40

jqmmes reviewed Oct 22, 2025

MellowYarker force-pushed the milan/STOR-4521 branch 2 times, most recently from 1694eb0 to 508a8fc Compare October 22, 2025 23:46

MellowYarker commented Oct 22, 2025

MellowYarker force-pushed the milan/STOR-4521 branch 2 times, most recently from da7b2a3 to af8e8fa Compare October 23, 2025 20:40

justin-mp reviewed Oct 23, 2025

MellowYarker force-pushed the milan/STOR-4521 branch 3 times, most recently from 3bb9d5d to 80e5cf0 Compare October 24, 2025 12:49

MellowYarker added 3 commits October 24, 2025 12:42

Add SERIALIZE_SRS_ALARMS autogate

ba62845

Modify actor-sqlite-test RPC order expectations

50fbfaa

MellowYarker force-pushed the milan/STOR-4521 branch from 80e5cf0 to 50fbfaa Compare October 24, 2025 16:42
