[nexus] FM alert requests and rendezvous task #9552

hawkw · 2025-12-19T21:57:23Z

This branch builds on #9492 by adding alert requests to fault management cases. An alert request lets a sitrep specify that a set of alerts should exist; the UUIDs and payloads for these alerts are specified in the sitrep.

We don't want entries in the alert table to be created immediately when a sitrep is inserted, as that sitrep may never be made current. If the alert dispatcher operated on alerts created by sitreps that were not made current, it could dispatch duplicate or spurious alerts. Instead, we indirect the creation of alert records: sitrep insertion creates alert requests, and if that sitrep becomes current, a background fm_rendezvous task reconciles the alerts requested in the sitrep with the actual alert table. Eventually, this task will also be responsible for updating other rendezvous tables based on the current sitrep.
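To make the indirection concrete, here is a minimal in-memory sketch of the flow described above. It is not the code in this branch: `Db`, `AlertRequest`, `insert_sitrep`, and `rendezvous` are hypothetical names, integer IDs stand in for UUIDs, and a pair of maps stands in for the database tables.

```rust
use std::collections::BTreeMap;

/// Hypothetical stand-ins for the real UUID types.
type SitrepId = u32;
type AlertId = u32;

/// An alert requested by a sitrep: the sitrep chooses the ID and payload.
#[derive(Clone, Debug, PartialEq)]
struct AlertRequest {
    id: AlertId,
    payload: String,
}

/// Toy model of the database state that matters here.
#[derive(Default)]
struct Db {
    /// Alert requests, keyed by the sitrep that carries them.
    requests: BTreeMap<SitrepId, Vec<AlertRequest>>,
    /// The sitrep that has been made current, if any.
    current_sitrep: Option<SitrepId>,
    /// The alert table that the dispatcher reads from.
    alerts: BTreeMap<AlertId, String>,
}

impl Db {
    /// Inserting a sitrep records its alert requests only; the alert table
    /// is left untouched because the sitrep may never become current.
    fn insert_sitrep(&mut self, id: SitrepId, requests: Vec<AlertRequest>) {
        self.requests.insert(id, requests);
    }

    /// Rendezvous: create any alert requested by the current sitrep that
    /// does not exist yet. Returns the number of alerts created.
    fn rendezvous(&mut self) -> usize {
        let Some(current) = self.current_sitrep else { return 0 };
        let mut created = 0;
        for req in self.requests.get(&current).into_iter().flatten() {
            if !self.alerts.contains_key(&req.id) {
                self.alerts.insert(req.id, req.payload.clone());
                created += 1;
            }
        }
        created
    }
}
```

In this model, a sitrep that is inserted but never made current contributes nothing to `alerts`, and running `rendezvous` twice against the same current sitrep creates nothing the second time.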

I also did a bit of refactoring of the alert class types so that the structured enum of alert classes could be used by the sitrep.

This change was originally factored out from #9346, but I ultimately ended up rewriting a lot of it.

smklein

Overall structure looks good, but I've got some questions about lifetimes of things

smklein · 2026-01-05T19:28:10Z

schema/crdb/dbinit.sql

    lookup_ereports_assigned_to_fm_case
 ON omicron.public.fm_ereport_in_case (sitrep_id, case_id);

+CREATE TABLE IF NOT EXISTS omicron.public.fm_alert_request (


Just because it's a little quirky - could we document the lifetime of entries in the fm_alert_request table?

You said in the PR description that these get inserted "at sitrep insertion time", and then they may or may not be used depending on whether the sitrep gets activated. When do we remove them?

smklein · 2026-01-05T19:30:24Z

schema/crdb/dbinit.sql

+    -- UUID of the sitrep in which the alert is requested.
+    sitrep_id UUID NOT NULL,
+    -- UUID of the sitrep in which the alert request was created.
+    requested_sitrep_id UUID NOT NULL,


To confirm: This value of requested_sitrep_id is static, and the sitrep_id value will be the one changing as the fm_alert_request is carried from sitrep to sitrep, right?

hawkw · 2026-01-05

That's correct. Is there something that would have made the comment clearer here?

smklein · 2026-01-05

This feels like a nitpick; maybe just:

    -- UUID of the ongoing sitrep in which the alert is requested.
    ...
    -- UUID of the original sitrep in which the alert request was created.

(it's clear when seeing the pattern in the other tables, but it has been a few weeks since I looked at this code, and I had to check things)
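For anyone else who has to re-derive this, a small sketch of the pattern confirmed in this thread. The struct and method names are hypothetical, integer IDs stand in for UUIDs, and it assumes each sitrep carries its own copy of the row, which is how the "carried from sitrep to sitrep" wording reads.

```rust
/// Hypothetical stand-in for the real UUID type.
type SitrepId = u32;

/// One row of a simplified `fm_alert_request` table.
#[derive(Clone, Debug, PartialEq)]
struct AlertRequestRow {
    /// The sitrep this row belongs to; changes every time the request is
    /// carried forward into a newer sitrep.
    sitrep_id: SitrepId,
    /// The sitrep that first created the request; never changes.
    requested_sitrep_id: SitrepId,
    /// Simplified alert ID.
    alert_id: u32,
}

impl AlertRequestRow {
    /// A request created for the first time in `sitrep`.
    fn new(sitrep: SitrepId, alert_id: u32) -> Self {
        Self { sitrep_id: sitrep, requested_sitrep_id: sitrep, alert_id }
    }

    /// The copy of this request that a child sitrep would carry.
    fn carry_forward(&self, child_sitrep: SitrepId) -> Self {
        Self { sitrep_id: child_sitrep, ..self.clone() }
    }
}
```

A request created in sitrep 1 and carried through sitreps 2 and 3 ends up with `sitrep_id == 3` and `requested_sitrep_id == 1`.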

smklein · 2026-01-05T19:34:23Z

nexus/types/src/alert.rs

+                variant.as_str().starts_with("test.") && !variant.is_test()
+            })
+            .collect::<Vec<_>>();
+        assert_eq!(


Nice quality-of-life test, though it makes me wonder whether a "test." prefix on the stringified class should just imply variant.is_test() for us.

ehh I'm seeing this is just moved code; consider this alternate is_test suggestion just musing rather than a request.
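For reference, a sketch of what that musing could look like. The variants and the `ALL` constant here are hypothetical and much smaller than the real enum in nexus/types/src/alert.rs; the point is only that `is_test` is derived from the string form rather than maintained separately.

```rust
/// Hypothetical alert classes; the real enum has a different set of variants.
#[derive(Clone, Copy, Debug, PartialEq)]
enum AlertClass {
    Probe,
    TestFoo,
    TestFooBar,
}

impl AlertClass {
    /// Every variant, so checks can iterate over all of them.
    const ALL: [AlertClass; 3] = [Self::Probe, Self::TestFoo, Self::TestFooBar];

    fn as_str(self) -> &'static str {
        match self {
            Self::Probe => "probe",
            Self::TestFoo => "test.foo",
            Self::TestFooBar => "test.foo.bar",
        }
    }

    /// Derived from the string form, so the two cannot disagree.
    fn is_test(self) -> bool {
        self.as_str().starts_with("test.")
    }
}

/// The same filter as the test in the diff: classes named `test.*` that are
/// not reported as test classes. With the derived `is_test` this is always
/// empty by construction.
fn mislabeled_test_classes() -> Vec<AlertClass> {
    AlertClass::ALL
        .iter()
        .copied()
        .filter(|v| v.as_str().starts_with("test.") && !v.is_test())
        .collect()
}
```

With this shape the existing test becomes trivially true, which is the trade-off: one less thing to keep in sync, but also no independent check that a class was intended to be a test class.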

nexus/types/src/internal_api/background.rs

smklein · 2026-01-05T19:50:26Z

nexus/src/app/background/tasks/fm_rendezvous.rs

+        // XXX(eliza): is it better to allocate all of these into a big array
+        // and do a single `INSERT INTO` query, or iterate over them one by one
+        // (not allocating) but insert one at a time?


Something to consider: There are a ton of sources of conflicts due to raciness here:

If we have multiple Nexus instances trying to perform rendezvous for a single sitrep simultaneously, their inserts will potentially overlap.

They could also be operating on distinct sitreps (e.g., a slow Nexus is doing rendezvous for "what it thinks is the latest sitrep", but it becomes out-of-date immediately after it starts the rendezvous process), in which case they'll have partially overlapping sets of alerts.

I think this would require a batched version of alert_create (maybe alerts_create?) to use on_conflict...do_nothing.

However, I think I'm okay with the usual: "Let's keep it simple now, instrument it, and optimize it later when we need to"
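As a reference point for that later optimization, here is an in-memory model of the semantics a batched insert would need. `alerts_create` is the hypothetical name floated above, not an existing datastore method, and the map stands in for the alert table; a real version would presumably be a single `INSERT ... ON CONFLICT DO NOTHING` statement.

```rust
use std::collections::btree_map::Entry;
use std::collections::BTreeMap;

/// Hypothetical stand-in for the real UUID type.
type AlertId = u32;

/// Toy alert table keyed by alert ID (the primary key).
#[derive(Default)]
struct AlertTable {
    rows: BTreeMap<AlertId, String>,
}

impl AlertTable {
    /// Hypothetical batched counterpart to `alert_create`, modelling
    /// `INSERT ... ON CONFLICT (id) DO NOTHING`: rows whose ID already
    /// exists are skipped rather than treated as errors.
    /// Returns the number of rows actually inserted.
    fn alerts_create(
        &mut self,
        batch: impl IntoIterator<Item = (AlertId, String)>,
    ) -> usize {
        let mut inserted = 0;
        for (id, payload) in batch {
            if let Entry::Vacant(slot) = self.rows.entry(id) {
                slot.insert(payload);
                inserted += 1;
            }
        }
        inserted
    }
}
```

Two callers submitting overlapping batches, in either order, leave the table with the union of their rows, and the returned count gives the task something to instrument (requested versus actually created).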

smklein · 2026-01-05T20:33:59Z

nexus/src/app/background/tasks/fm_rendezvous.rs

+            let class = class.into();
+            match self
+                .datastore
+                .alert_create(&opctx, id, class, payload.clone())


I don't think we're deleting alert records yet - AFAICT, we're marking them dispatched, but leaving rows in CRDB for them - but when we do, this will be something we need to consider.

Suppose we want to delete an alert record from CockroachDB.

Suppose there is a really laggy Nexus somewhere, running this rendezvous task. It's stuck doing rendezvous for a very old sitrep.

If we do an actual SQL DELETE of the alert, this background task could theoretically bring it back to life (which would be a bug).

I don't think this problem has been totally solved for blueprints either (I'm not seeing such guards in reconcile_blueprint_rendezvous_tables), but from a discussion with @jgallagher, the priority there was lower, because the rendezvous tables for blueprints are much lower-churn than they presumably will be for alerts.

I wrote up an issue for this on the blueprint side with #9592, but I think it'll be relevant here much sooner, especially as each alert carries an arbitrary JSON payload, which means the table is going to grow in size more quickly.
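One possible shape for such a guard, sketched in memory. This is not what the PR does and not a decision from #9592; `alert_create_if_current` and `CreateOutcome` are hypothetical, and the sketch assumes an alert is only deleted after the sitrep that requested it has been superseded.

```rust
use std::collections::BTreeMap;

/// Hypothetical stand-ins for the real UUID types.
type SitrepId = u32;
type AlertId = u32;

#[derive(Debug, PartialEq)]
enum CreateOutcome {
    Created,
    AlreadyExists,
    /// The caller's sitrep is no longer current, so nothing was written.
    StaleSitrep,
}

/// Toy model: the current sitrep pointer plus the alert table.
#[derive(Default)]
struct Db {
    current_sitrep: Option<SitrepId>,
    alerts: BTreeMap<AlertId, String>,
}

impl Db {
    /// Hypothetical guarded insert. In a real database the currency check
    /// and the insert would have to happen in one statement or transaction;
    /// here `&mut self` stands in for that atomicity.
    fn alert_create_if_current(
        &mut self,
        sitrep: SitrepId,
        id: AlertId,
        payload: &str,
    ) -> CreateOutcome {
        if self.current_sitrep != Some(sitrep) {
            return CreateOutcome::StaleSitrep;
        }
        if self.alerts.contains_key(&id) {
            return CreateOutcome::AlreadyExists;
        }
        self.alerts.insert(id, payload.to_string());
        CreateOutcome::Created
    }
}
```

Under that assumption, a laggy Nexus still holding an old sitrep gets `StaleSitrep` back and cannot resurrect a deleted row. Whether the extra check is affordable on every insert is a separate question.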

nexus/db-queries/src/db/datastore/fm.rs

Co-authored-by: Sean Klein <[email protected]>

hawkw added 10 commits December 17, 2025 17:04

most of alert requests

e48ad41

reticulating

f6f5adf

draw more of the owl

69dcfbd

reticulating docs

5f0ae87

tests

c9da743

migrations

742de01

add config files

9e59159

actually delete cases

4197cdf

fixup stuff

98d9dd5

omdb

68a78cd

hawkw requested review from davepacheco and smklein December 19, 2025 21:57

hawkw self-assigned this Dec 19, 2025

hawkw added labels nexus ("Related to nexus") and fault-management ("Everything related to the fault-management initiative (RFD480 and others)") Dec 19, 2025

Merge branch 'main' into eliza/fm-alerts

6d2ca31

hawkw changed the title ~~[nexus] fm alert requests and rendezvous task~~ [nexus] FM alert requests and rendezvous task Dec 19, 2025

hawkw mentioned this pull request Dec 23, 2025

[fm] add a SitrepBuilder, to help with building sitreps #9566

Draft

smklein assigned smklein and unassigned hawkw Jan 5, 2026

smklein mentioned this pull request Jan 5, 2026

Blueprint Rendezvous Table Garbage Collection #9592

Open

smklein reviewed Jan 5, 2026


smklein assigned hawkw and unassigned smklein Jan 5, 2026

Update nexus/types/src/internal_api/background.rs

20b53eb

Co-authored-by: Sean Klein <[email protected]>

smklein mentioned this pull request Jan 5, 2026

It's possible to load a torn blueprint / sitrep / inventory #9594

Closed

improve commentary

1a87101
