-
Notifications
You must be signed in to change notification settings - Fork 461
Simplify SRS alarm guarantees #5325
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
82c92f5 to
adafad2
Compare
0356b62 to
9823e33
Compare
|
I put up a test fix + rebased. I want to add a few more tests to cover some new behavior (mostly just reconciliation after crashing), but after that I'll move it out of draft. Only 1 internal test is failing. I'll do all that on Monday, brain is mush after going through all the tests today 😅. |
12ed2d8 to
84f3fb1
Compare
45e0a20 to
1b3d636
Compare
src/workerd/io/actor-sqlite.c++
Outdated
| alarmScheduledNoLaterThan = requestedTime; | ||
| } | ||
| return hooks.scheduleRun(requestedTime).then([this, requestedTime]() { | ||
| KJ_LOG(INFO, "scheduleRun set alarm time to ", requestedTime); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Isn't this going to get really noisy, really fast? Isn't this also going to sentry?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We don't print INFO logs in production, it's only there so I can expect it in a test in the internal repo.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I initially thought that this would spam every Durable Object started by workerd because this function is called by tryReconcileAlarm() in the constructor, but I guess the hooks check would fix that.
|
Overall, I think this new "simplified" (or maybe more strict) alarm synchronization looks good, but I'd probably be more comfortable if we released this gradually with an autogate. After all, this is changing how we are setting alarms in SRS. |
1694eb0 to
508a8fc
Compare
src/workerd/io/actor-sqlite.h
Outdated
| // If the setAlarm fails, we need to break the output gate. We can set `broken` | ||
| // so subsequent storage operations fail with an exception. | ||
| // | ||
| // TODO(now): I want this to log for us internally, but print as an internal error to the |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@jqmmes I agree, though it's going to be rough having both versions of the code 🙁. I'll try and push something tomorrow. |
CodSpeed Performance ReportMerging #5325 will improve performances by 9.61%Comparing Summary
Benchmarks breakdown
Footnotes
|
da7b2a3 to
af8e8fa
Compare
src/workerd/io/actor-sqlite.c++
Outdated
| alarmScheduledNoLaterThan = requestedTime; | ||
| } | ||
| return hooks.scheduleRun(requestedTime).then([this, requestedTime]() { | ||
| KJ_LOG(INFO, "scheduleRun set alarm time to ", requestedTime); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I initially thought that this would spam every Durable Object started by workerd because this function is called by tryReconcileAlarm() in the constructor, but I guess the hooks check would fix that.
3bb9d5d to
80e5cf0
Compare
The next commit will add new SRS alarm logic, so we want to put the old logic behind an autogate check first.
Previously, we tried to maintain an invariant where we would always
strive to update alarms to run earlier, but deprioritize setting alarms
later. We did this to try and ensure alarms always fire at the durably
committed time (since, if they fired early, we could just reschedule
them, but if they fired late, then they fired late, and that's bad).
This was a bit confusing to reason about, and it looked like retries in
backgrounded (deprioritized) "setAlarm to run later" tasks could
actually interleave with later commits that set the alarm early.
This commit attempts to simplify the flow by doing the following:
- Renaming `alarmScheduledNoLaterThan` to
`currentGuaranteedAlarmTime`.
- Removing while loops that updated the current alarm time
possibly multiple times per commit.
- Add reconciliation on startup to sync source of truth with the
durable value from SRS, handling cases where SRS persisted but
scheduleRun() failed.
The new approach relies on three guarantees instead of the previous
invariant:
1. Serialization: All alarm updates go through `commitTasks`,
preventing interleaving and ensuring one update completes before
the next begins.
2. Atomicity: The output gate ensures clients only see success
if both SRS and scheduleRun() succeed. On failure, clients know to
retry.
3. Reconciliation: On actor startup, we sync the source of truth
with the durable SRS value, fixing any divergence from partial
failures.
When the `SERIALIZE_SRS_ALARMS` autogate is enabled, we always send a `scheduleRun()` before we commit to SRS. This changes the expected order of RPCs to our mock servers.
80e5cf0 to
50fbfaa
Compare
The existing SRS alarms implementation prioritizes moving the alarm time forward, and deprioritizes moving it backwards. In theory, this means upon catastrophic failure when we try to move it earlier, we'll silently "roll it back" when we
armAlarmHandler, as if the partial write never happened. If we fail while trying to move the alarm time to later and experience a failure, when we run the alarm early we'll see the alarm is supposed to run later (according to SRS, which committed the update) and move the execution time later inarmAlarmHandler.In practice, this implementation is hard to reason about (especially with foreground and background tasks that may interleave upon retries), and if the client sees an exception from storage, it should not be making assumptions about the state of storage. Instead, it should re-run the DO and reconcile storage in a new request however it sees fit.
This PR attempts to loosen some guarantees of SRS alarms by relying on a few things:
ActorSqlite, wescheduleRun()with whatever durable alarm time is in SRS. This way, ifscheduleRun()completes and the SRS write does not, we rollback on start up. If they are synced, then it's just an extrascheduleRun()call.scheduleRun()request, the client knows the state of storage is unknown and so they have to retry with a new request. The gate only opens successfully if both ourscheduleRun()succeeds (first), and our commit to SRS succeeds (second).commitTasksand waiting on the previous commit to finish, and only send onescheduleRun()per-commit. We also send a "fixup"scheduleRun()once if there are any waiters of apendingCommit. Ex. if there are 4 waiting commits, each havingsetAlarm(), the first waiter callsscheduleRun()with the most recently set alarm time.scheduleRun()if there is a new, different alarm state being requested. No background tasks (that may interleave and cause races). We do this before we commit to SRS, and can reconcile catastrophic loss by restarting the actor on the next request (which as previously mentioned, the client should initiate).In theory, if we
scheduleAlarm()to a later time, and this request succeeds, but we fail to make the alarm durable in SRS, then we will run the alarm at the new later time, and see that SRS had an earlier alarm marked. While this may seem bad, in reality, the output gate would have been broken during the alarm write and so the client should retry the operation (or reconcile storage somehow). If the DO starts up for any reason, then by (1 -- Startup Reconciliation) we would self-heal/rollback to SRS's confirmed alarm time. Additionally, the client would likelysetAlarm()again.Even if the application uses
allowUnconfirmed = truewithsetAlarm(), it wouldn't know that the alarm was confirmed durable until it successfullysync()ed. If the alarm write experienced a full or parital failure, we'd break the output gate, and so thesync()would fail. If thesync()succeeds, then both thescheduleRun()and the SRS commit were confirmed.