-
Notifications
You must be signed in to change notification settings - Fork 67
[nexus] FM alert requests and rendezvous task #9552
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
smklein
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Overall structure looks good, but I've got some questions about lifetimes of things
| lookup_ereports_assigned_to_fm_case | ||
| ON omicron.public.fm_ereport_in_case (sitrep_id, case_id); | ||
|
|
||
| CREATE TABLE IF NOT EXISTS omicron.public.fm_alert_request ( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just because it's a little quirky - could we document the lifetime of entries in the fm_alert_request table?
You said in the PR description that these get inserted "at sitrep insertion time", and then they may or may not be used depending on whether the sitrep gets activated. When do we remove them?
| -- UUID of the sitrep in which the alert is requested. | ||
| sitrep_id UUID NOT NULL, | ||
| -- UUID of the sitrep in which the alert request was created. | ||
| requested_sitrep_id UUID NOT NULL, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
To confirm: This value of requested_sitrep_id is static, and the sitrep_id value will be the one changing as the fm_alert_request is carried from sitrep to sitrep, right?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's correct --- is there something that would have made the comment clearer here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This feels like a nitpick; maybe just:
-- UUID of the ongoing sitrep in which the alert is requested.
...
-- UUID of the original sitrep in which the alert request was created.
(it's clear when seeing the pattern in the other tables, but it has been a few weeks since I looked at this code, and I had to check things)
| variant.as_str().starts_with("test.") && !variant.is_test() | ||
| }) | ||
| .collect::<Vec<_>>(); | ||
| assert_eq!( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice quality-of-life test, though it makes me wonder if test. as a stringified prefix should just imply variant.is_test() for us
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ehh I'm seeing this is just moved code; consider this alternate is_test suggestion just musing rather than a request.
| // XXX(eliza): is it better to allocate all of these into a big array | ||
| // and do a single `INSERT INTO` query, or iterate over them one by one | ||
| // (not allocating) but insert one at a time? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Something to consider: There are a ton of sources of conflicts due to raciness here:
- If we have multiple Nexus instances trying to perform rendezvous for a single sitrep simultaneously, they'll potentially overlap
- They could also be operating on distinct sitreps (e.g., a slow Nexus is doing rendezvous for "what it thinks is the latest sitrep", but it becomes out-of-date immediately after it starts the rendezvous process) and they'll have partially overlapping sets of alerts
I think this would require a batched version of alert_create (maybe alerts_create?) to use on_conflict...do_nothing.
However, I think I'm okay with the usual: "Let's keep it simple now, instrument it, and optimize it later when we need to"
| let class = class.into(); | ||
| match self | ||
| .datastore | ||
| .alert_create(&opctx, id, class, payload.clone()) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think we're deleting alert records yet - AFAICT, we're marking them dispatched, but leaving rows in CRDB for them - but when we do, this will be something we need to consider.
- Suppose we want to delete an alert record from cockroachdb
- Suppose there is a really laggy Nexus somewhere, running this rendezvous task. It's stuck doing rendezvous for a very old sitrep.
- If we do "actual SQL DELETE" of the alert, this background task could theoretically bring it back to life (which would be a bug)
I don't think this problem has been totally solved for blueprints either - I'm not seeing such guards in reconcile_blueprint_rendezvous_tables either - but from a discussion with @jgallagher , the priority there was lower, because the rendezvous tables for blueprints are much lower-churn than they presumably will be for alerts.
I wrote up an issue for this on the blueprint side with #9592 , but I think it'll be relevant here much sooner, especially as each alert is injecting an arbitrary JSON payload, which means the table is going to grow in size more quickly.
Co-authored-by: Sean Klein <[email protected]>
This branch builds on #9492 by adding alert requests to fault management cases. This is a mechanism to allow a sitrep to specify that a set of alerts should exist. UUIDs and payloads for these alerts are specified in the sitrep. We don't want entries in the
alerttable to be created immediately when a sitrep is inserted, as that sitrep may not be made current. If the alert dispatcher operates on alerts created in sitreps that were not made current, it could dispatch duplicate or spurious alerts. Instead, we indirect the creation of alert records by having the sitrep insertion create alert requests, and if that sitrep becomes current, a backgroundfm_rendezvoustask reconciles the requested alerts in the sitrep with the actualalerttable. Eventually, this task will be responsible for updating other rendezvous tables based on the current sitrep.I also did a bit of refactoring of the alert class types so that the structured enum of alert classes could be used by the sitrep.
This change was originally factored out from #9346, but I ultimately ended up rewriting a lot of it.