TQ: Support adding sleds via trust quorum#9650
Conversation
04173b6 to
2d8fc67
Compare
jgallagher
left a comment
There was a problem hiding this comment.
Looks good! Just a bunch of nits and small suggestions
|
Thanks for the comprehensive review @jgallagher. I think I fixed up everything. Let me know if you see anything else! |
jgallagher
left a comment
There was a problem hiding this comment.
Changes LGTM, just a couple minor style suggestions I didn't notice the first time around.
Happy to approve with a couple caveats:
- Your comment about resetting
TRUST_QUORUM_INTEGRATION_ENABLEDstill needs to be applied - Would strongly prefer an external-API-focused set of eyes to review those bits (maybe @ahl?)
nexus/external-api/src/lib.rs
Outdated
| rqctx: RequestContext<Self::Context>, | ||
| path_params: Path<params::RackPath>, | ||
| req: TypedBody<params::AddSledsRequest>, | ||
| ) -> Result<HttpResponseOk<Epoch>, HttpError>; |
There was a problem hiding this comment.
Do the type names leak out into the API doc? If so, maybe RackMembershipEpoch would be more clear? If not, ignore this.
There was a problem hiding this comment.
No leaking as far as I can tell.
"properties": {
"epoch": {
"description": "The generation / version of the configuration",
"type": "integer",
"format": "uint64",
"minimum": 0
},
There was a problem hiding this comment.
Yeah--in this case the type is #[serde(transparent)] so it just looks like a number. However, @andrewjstone, I think you're looking at the wrong place in the OpenAPI output:
"responses": {
"200": {
"description": "successful operation",
"content": {
"application/json": {
"schema": {
"title": "uint64",
"type": "integer",
"format": "uint64",
"minimum": 0
}
}
}It is at least a little weird (for us) to be returning a literal integer rather than, say, an object with a field that is an integer.
|
Tested in a4x2 with all the latest changes. I even was able to catch the bg task doing something in OMDB: All that remains is for someone to take a look over the external API. |
ahl
left a comment
There was a problem hiding this comment.
I think I've added a bunch of quibbling comments that reflect a lack of my own understanding about how this is intended to be exposed to and used by customers. If there's something I should read or if you have some time to talk, I'd love to understand this better.
nexus/external-api/src/lib.rs
Outdated
| rqctx: RequestContext<Self::Context>, | ||
| path_params: Path<params::RackPath>, | ||
| req: TypedBody<params::AddSledsRequest>, | ||
| ) -> Result<HttpResponseOk<Epoch>, HttpError>; |
There was a problem hiding this comment.
Yeah--in this case the type is #[serde(transparent)] so it just looks like a number. However, @andrewjstone, I think you're looking at the wrong place in the OpenAPI output:
"responses": {
"200": {
"description": "successful operation",
"content": {
"application/json": {
"schema": {
"title": "uint64",
"type": "integer",
"format": "uint64",
"minimum": 0
}
}
}It is at least a little weird (for us) to be returning a literal integer rather than, say, an object with a field that is an integer.
This PR introduces two new external APIs to allow adding multiple sleds to a rack
at once and to query status about the ongoing operation. Both are
currently experimental and live under `/v1/trust-quorum`. They need to
be moved under `/system/hardware` like the original `sled-add` command.
They also need to be reworked to not report trust quorum specific
details if it can be avoided. Most of that should be in omdb for
debugging. I may add some of that support to this PR.
This PR also introduces a background task for driving the trust quorum
reconfiguration to completion. Reconfiguration is driven by two steps.
Synchronously updating the DB in the new external endpoint handler and
then asynchronously trying to commit the operation via the background
task.
I tested this on a4x2 and it works as expected. See the trace from the original external API test below:
```
➜ oxide.rs git:(main) ✗ echo '{"rack_id": "0dbef452-a6dd-4831-bbdc-769ea3353f28", "sled_ids": [{"part": "PPP-PPPPPPP","serial": "00000000002"}]}' | target/debug/oxide --profile recovery api /v1/trust-quorum/new-members --method POST --input -
➜ oxide.rs git:(main) ✗ target/debug/oxide --profile recovery api /v1/trust-quorum/config/latest/0dbef452-a6dd-4831-bbdc-769ea3353f28
{
"abort_reason": null,
"commit_crash_tolerance": 1,
"coordinator": {
"part_number": "PPP-PPPPPPP",
"serial_number": "00000000003"
},
"encrypted_rack_secrets": null,
"epoch": 2,
"last_committed_epoch": 1,
"members": {
"PPP-PPPPPPP:00000000000": {
"share_digest": null,
"state": "unacked",
"time_committed": null,
"time_prepared": null
},
"PPP-PPPPPPP:00000000001": {
"share_digest": null,
"state": "unacked",
"time_committed": null,
"time_prepared": null
},
"PPP-PPPPPPP:00000000002": {
"share_digest": null,
"state": "unacked",
"time_committed": null,
"time_prepared": null
},
"PPP-PPPPPPP:00000000003": {
"share_digest": null,
"state": "unacked",
"time_committed": null,
"time_prepared": null
}
},
"rack_id": "0dbef452-a6dd-4831-bbdc-769ea3353f28",
"state": "preparing",
"threshold": 3,
"time_aborted": null,
"time_committed": null,
"time_committing": null,
"time_created": "2026-01-14T21:32:18.780136Z"
}
➜ oxide.rs git:(main) ✗ target/debug/oxide --profile recovery api /v1/trust-quorum/config/latest/0dbef452-a6dd-4831-bbdc-769ea3353f28
{
"abort_reason": null,
"commit_crash_tolerance": 1,
"coordinator": {
"part_number": "PPP-PPPPPPP",
"serial_number": "00000000003"
},
"encrypted_rack_secrets": null,
"epoch": 2,
"last_committed_epoch": 1,
"members": {
"PPP-PPPPPPP:00000000000": {
"share_digest": "fcfb09128c84d82cc81b200c6c682510f63160a4417856f4041b1886445e8b14",
"state": "prepared",
"time_committed": null,
"time_prepared": "2026-01-14T21:32:55.826622Z"
},
"PPP-PPPPPPP:00000000001": {
"share_digest": "d8cad02bd3bccd08109a79e3bf6d8dab0d460a0ba879bf42887dc0fc8d855786",
"state": "prepared",
"time_committed": null,
"time_prepared": "2026-01-14T21:32:55.848235Z"
},
"PPP-PPPPPPP:00000000002": {
"share_digest": "dd57ad8e271734fabfe97d6180d6da3e5c3805e17dacf58e0f2a6d5ed7f1242b",
"state": "prepared",
"time_committed": null,
"time_prepared": "2026-01-14T21:32:55.806644Z"
},
"PPP-PPPPPPP:00000000003": {
"share_digest": "6b27327ca49976ccca83972e6578ef195c99489e62811e8d0a0cb061fca9c0c4",
"state": "prepared",
"time_committed": null,
"time_prepared": "2026-01-14T21:32:55.837154Z"
}
},
"rack_id": "0dbef452-a6dd-4831-bbdc-769ea3353f28",
"state": "preparing",
"threshold": 3,
"time_aborted": null,
"time_committed": null,
"time_committing": null,
"time_created": "2026-01-14T21:32:18.780136Z"
}
➜ oxide.rs git:(main) ✗ target/debug/oxide --profile recovery api /v1/trust-quorum/config/latest/0dbef452-a6dd-4831-bbdc-769ea3353f28
{
"abort_reason": null,
"commit_crash_tolerance": 1,
"coordinator": {
"part_number": "PPP-PPPPPPP",
"serial_number": "00000000003"
},
"encrypted_rack_secrets": {
"data": "53de7731deec3f298a7f5067e256a63bb2869a91c9710d9b23dbf3d261d1b730039d9cb11b543c14906ff77cd409d32953959e9ff8933858",
"salt": "ec609ed5ff7aee94e2e88ad94af56e0cbb8a66a683294005c7888f60a627956a"
},
"epoch": 2,
"last_committed_epoch": 1,
"members": {
"PPP-PPPPPPP:00000000000": {
"share_digest": "fcfb09128c84d82cc81b200c6c682510f63160a4417856f4041b1886445e8b14",
"state": "committed",
"time_committed": "2026-01-14T21:33:03.864617Z",
"time_prepared": "2026-01-14T21:32:55.826622Z"
},
"PPP-PPPPPPP:00000000001": {
"share_digest": "d8cad02bd3bccd08109a79e3bf6d8dab0d460a0ba879bf42887dc0fc8d855786",
"state": "committed",
"time_committed": "2026-01-14T21:33:03.864617Z",
"time_prepared": "2026-01-14T21:32:55.848235Z"
},
"PPP-PPPPPPP:00000000002": {
"share_digest": "dd57ad8e271734fabfe97d6180d6da3e5c3805e17dacf58e0f2a6d5ed7f1242b",
"state": "committed",
"time_committed": "2026-01-14T21:33:03.864617Z",
"time_prepared": "2026-01-14T21:32:55.806644Z"
},
"PPP-PPPPPPP:00000000003": {
"share_digest": "6b27327ca49976ccca83972e6578ef195c99489e62811e8d0a0cb061fca9c0c4",
"state": "committed",
"time_committed": "2026-01-14T21:33:03.864617Z",
"time_prepared": "2026-01-14T21:32:55.837154Z"
}
},
"rack_id": "0dbef452-a6dd-4831-bbdc-769ea3353f28",
"state": "committed",
"threshold": 3,
"time_aborted": null,
"time_committed": "2026-01-14T21:33:04.652543Z",
"time_committing": "2026-01-14T21:32:55.861158Z",
"time_created": "2026-01-14T21:32:18.780136Z"
}
➜ oxide.rs git:(main) ✗
```
ahl
left a comment
There was a problem hiding this comment.
love it! some small doc suggestions that you should feel free to ignore.
| pub rack_id: Uuid, | ||
| pub version: RackMembershipVersion, | ||
| pub state: RackMembershipChangeState, | ||
| /// All members of the rack for this epoch |
There was a problem hiding this comment.
| /// All members of the rack for this epoch | |
| /// All members of the rack for this version |
|
|
||
| #[derive(Clone, Debug, Deserialize, Serialize, JsonSchema)] | ||
| pub struct RackMembershipVersionParam { | ||
| pub version: Option<shared::RackMembershipVersion>, |
There was a problem hiding this comment.
I'm not sure this is right
| pub version: Option<shared::RackMembershipVersion>, | |
| /// Membership version as returned from other membership interfaces | |
| pub version: Option<shared::RackMembershipVersion>, |
There was a problem hiding this comment.
Skipped this one.
| #[derive(Clone, Debug, Deserialize, Serialize, JsonSchema)] | ||
| pub struct RackMembershipStatus { | ||
| pub rack_id: Uuid, | ||
| pub version: RackMembershipVersion, |
There was a problem hiding this comment.
Suggesting a comment, but I don't love what I've written:
| pub version: RackMembershipVersion, | |
| /// Membership version that can be used to check status. | |
| pub version: RackMembershipVersion, |
There was a problem hiding this comment.
I used a slightly different wording
| /// Status of the rack membership uniquely identified by the (rack_id, version) | ||
| /// pair | ||
| #[derive(Clone, Debug, Deserialize, Serialize, JsonSchema)] | ||
| pub struct RackMembershipStatus { |
There was a problem hiding this comment.
Comments on these fields become part of the public docs so it might be worthwhile to put something for each of them (though not critical).
There was a problem hiding this comment.
Mostly skipped.
416eaaf to
2f873e2
Compare
|
Tested via latest API changes: |
This PR introduces two new external APIs to allow adding multiple sleds to a rack at once and to query status about the ongoing operation. It also adds an omdb command for more detailed status. Much more omdb to come in the near future.
This PR also introduces a background task for driving the trust quorum reconfiguration to completion. Reconfiguration is driven by two steps. Synchronously updating the DB in the new external endpoint handler and then asynchronously trying to commit the operation via the background task.
I tested this on a4x2 and it works as expected. See the trace from the original external API test below: