Skip to content

Conversation

@MellowYarker
Copy link
Contributor

@MellowYarker MellowYarker commented Oct 15, 2025

The existing SRS alarms implementation prioritizes moving the alarm time forward, and deprioritizes moving it backwards. In theory, this means upon catastrophic failure when we try to move it earlier, we'll silently "roll it back" when we armAlarmHandler, as if the partial write never happened. If we fail while trying to move the alarm time to later and experience a failure, when we run the alarm early we'll see the alarm is supposed to run later (according to SRS, which committed the update) and move the execution time later in armAlarmHandler.

In practice, this implementation is hard to reason about (especially with foreground and background tasks that may interleave upon retries), and if the client sees an exception from storage, it should not be making assumptions about the state of storage. Instead, it should re-run the DO and reconcile storage in a new request however it sees fit.

This PR attempts to loosen some guarantees of SRS alarms by relying on a few things:

  1. Startup Reconciliation: Upon the construction of ActorSqlite, we scheduleRun() with whatever durable alarm time is in SRS. This way, if scheduleRun() completes and the SRS write does not, we rollback on start up. If they are synced, then it's just an extra scheduleRun() call.
  2. Atomic operations via Output Gate: If the output gate breaks during a scheduleRun() request, the client knows the state of storage is unknown and so they have to retry with a new request. The gate only opens successfully if both our scheduleRun() succeeds (first), and our commit to SRS succeeds (second).
  3. Serialization of requests: We rely on commitTasks and waiting on the previous commit to finish, and only send one scheduleRun() per-commit. We also send a "fixup" scheduleRun() once if there are any waiters of a pendingCommit. Ex. if there are 4 waiting commits, each having setAlarm(), the first waiter calls scheduleRun() with the most recently set alarm time.
  4. Eagerly scheduling all alarms: We don't distinguish earlier vs. later alarms, we just always scheduleRun() if there is a new, different alarm state being requested. No background tasks (that may interleave and cause races). We do this before we commit to SRS, and can reconcile catastrophic loss by restarting the actor on the next request (which as previously mentioned, the client should initiate).

In theory, if we scheduleAlarm() to a later time, and this request succeeds, but we fail to make the alarm durable in SRS, then we will run the alarm at the new later time, and see that SRS had an earlier alarm marked. While this may seem bad, in reality, the output gate would have been broken during the alarm write and so the client should retry the operation (or reconcile storage somehow). If the DO starts up for any reason, then by (1 -- Startup Reconciliation) we would self-heal/rollback to SRS's confirmed alarm time. Additionally, the client would likely setAlarm() again.

Even if the application uses allowUnconfirmed = true with setAlarm(), it wouldn't know that the alarm was confirmed durable until it successfully sync()ed. If the alarm write experienced a full or parital failure, we'd break the output gate, and so the sync() would fail. If the sync() succeeds, then both the scheduleRun() and the SRS commit were confirmed.

@MellowYarker MellowYarker requested review from a team as code owners October 15, 2025 22:14
@MellowYarker MellowYarker marked this pull request as draft October 17, 2025 20:28
@MellowYarker MellowYarker force-pushed the milan/STOR-4521 branch 4 times, most recently from 0356b62 to 9823e33 Compare October 17, 2025 21:45
@MellowYarker
Copy link
Contributor Author

I put up a test fix + rebased. I want to add a few more tests to cover some new behavior (mostly just reconciliation after crashing), but after that I'll move it out of draft. Only 1 internal test is failing. I'll do all that on Monday, brain is mush after going through all the tests today 😅.

@MellowYarker MellowYarker changed the title Schedule the latest alarm when we merge commits Simplify SRS alarm guarantees Oct 17, 2025
@MellowYarker MellowYarker force-pushed the milan/STOR-4521 branch 2 times, most recently from 12ed2d8 to 84f3fb1 Compare October 21, 2025 20:04
@MellowYarker MellowYarker marked this pull request as ready for review October 21, 2025 20:10
@MellowYarker MellowYarker force-pushed the milan/STOR-4521 branch 3 times, most recently from 45e0a20 to 1b3d636 Compare October 21, 2025 22:40
alarmScheduledNoLaterThan = requestedTime;
}
return hooks.scheduleRun(requestedTime).then([this, requestedTime]() {
KJ_LOG(INFO, "scheduleRun set alarm time to ", requestedTime);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Isn't this going to get really noisy, really fast? Isn't this also going to sentry?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We don't print INFO logs in production, it's only there so I can expect it in a test in the internal repo.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I initially thought that this would spam every Durable Object started by workerd because this function is called by tryReconcileAlarm() in the constructor, but I guess the hooks check would fix that.

@jqmmes
Copy link
Contributor

jqmmes commented Oct 22, 2025

Overall, I think this new "simplified" (or maybe more strict) alarm synchronization looks good, but I'd probably be more comfortable if we released this gradually with an autogate. After all, this is changing how we are setting alarms in SRS.

@MellowYarker MellowYarker force-pushed the milan/STOR-4521 branch 2 times, most recently from 1694eb0 to 508a8fc Compare October 22, 2025 23:46
// If the setAlarm fails, we need to break the output gate. We can set `broken`
// so subsequent storage operations fail with an exception.
//
// TODO(now): I want this to log for us internally, but print as an internal error to the
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jclee or @jqmmes either of you know how I can do this?

@MellowYarker
Copy link
Contributor Author

Overall, I think this new "simplified" (or maybe more strict) alarm synchronization looks good, but I'd probably be more comfortable if we released this gradually with an autogate. After all, this is changing how we are setting alarms in SRS.

@jqmmes I agree, though it's going to be rough having both versions of the code 🙁. I'll try and push something tomorrow.

@codspeed-hq
Copy link

codspeed-hq bot commented Oct 23, 2025

CodSpeed Performance Report

Merging #5325 will improve performances by 9.61%

Comparing milan/STOR-4521 (50fbfaa) with main (6fd515f)

Summary

⚡ 1 improvement
✅ 32 untouched
⏩ 9 skipped1

Benchmarks breakdown

Benchmark BASE HEAD Change
simpleStringBody[Response] 22.9 µs 20.9 µs +9.61%

Footnotes

  1. 9 benchmarks were skipped, so the baseline results were used instead. If they were deleted from the codebase, click here and archive them to remove them from the performance reports.

@MellowYarker MellowYarker force-pushed the milan/STOR-4521 branch 2 times, most recently from da7b2a3 to af8e8fa Compare October 23, 2025 20:40
alarmScheduledNoLaterThan = requestedTime;
}
return hooks.scheduleRun(requestedTime).then([this, requestedTime]() {
KJ_LOG(INFO, "scheduleRun set alarm time to ", requestedTime);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I initially thought that this would spam every Durable Object started by workerd because this function is called by tryReconcileAlarm() in the constructor, but I guess the hooks check would fix that.

@MellowYarker MellowYarker force-pushed the milan/STOR-4521 branch 3 times, most recently from 3bb9d5d to 80e5cf0 Compare October 24, 2025 12:49
The next commit will add new SRS alarm logic, so we want to put the old
logic behind an autogate check first.
Previously, we tried to maintain an invariant where we would always
strive to update alarms to run earlier, but deprioritize setting alarms
later. We did this to try and ensure alarms always fire at the durably
committed time (since, if they fired early, we could just reschedule
them, but if they fired late, then they fired late, and that's bad).

This was a bit confusing to reason about, and it looked like retries in
backgrounded (deprioritized) "setAlarm to run later" tasks could
actually interleave with later commits that set the alarm early.

This commit attempts to simplify the flow by doing the following:

	- Renaming `alarmScheduledNoLaterThan` to
	  `currentGuaranteedAlarmTime`.

	- Removing while loops that updated the current alarm time
	  possibly multiple times per commit.

 	- Add reconciliation on startup to sync source of truth with the
	  durable value from SRS, handling cases where SRS persisted but
	  scheduleRun() failed.

The new approach relies on three guarantees instead of the previous
invariant:

  1. Serialization: All alarm updates go through `commitTasks`,
     preventing interleaving and ensuring one update completes before
     the next begins.

  2. Atomicity: The output gate ensures clients only see success
     if both SRS and scheduleRun() succeed. On failure, clients know to
     retry.

  3. Reconciliation: On actor startup, we sync the source of truth
     with the durable SRS value, fixing any divergence from partial
     failures.
When the `SERIALIZE_SRS_ALARMS` autogate is enabled, we always send a
`scheduleRun()` before we commit to SRS. This changes the expected order
of RPCs to our mock servers.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants