feat: Gate Flow power/firmware ops on host assignment state#581
feat: Gate Flow power/firmware ops on host assignment state#581kunzhao-nv wants to merge 2 commits into
Conversation
|
Important Review skippedDraft detected. Please check the settings in the CodeRabbit UI or the ⚙️ Run configurationConfiguration used: Path: .coderabbit.yaml Review profile: CHILL Plan: Enterprise Run ID: You can disable this status message by setting the Use the checkbox below for a quick retry:
WalkthroughThis PR implements rack-assignment safety gates for compute and fabric operations. It extends the NICo API with topology queries (host machines by rack, switch-to-rack, power-shelf-to-rack mappings), introduces an AssignmentChecker that polls for machines and racks to exit assigned lifecycle states, and integrates this checker across compute, NVSwitch, and power-shelf component managers to prevent disruptive operations during active tenant assignments. ChangesRack Assignment Safety Infrastructure
🎯 3 (Moderate) | ⏱️ ~22 minutes 🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Comment |
There was a problem hiding this comment.
Actionable comments posted: 2
🧹 Nitpick comments (2)
flow/internal/task/componentmanager/compute/nico/nico_test.go (1)
544-558: 💤 Low valueConsider adding a test for
FirmwareControlrefusal when assigned.The test suite covers
PowerControlandBringUpControlrefusal scenarios, butFirmwareControlalso invokesensureMachinesOperable. Adding a parallel test would ensure complete coverage of the safety gate across all three operations.🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@flow/internal/task/componentmanager/compute/nico/nico_test.go` around lines 544 - 558, Add a parallel unit test that verifies FirmwareControl refuses machines in the "Assigned/Provisioning" state: create a mock client via nicoapi.NewMockClient(), AddMachine with MachineDetail{MachineID: "machine-1", State: "Assigned/Provisioning"}, construct the manager using newManagerForSafetyTest(t, client), build a common.Target with Type devicetypes.ComponentTypeCompute and ComponentIDs ["machine-1"], call m.FirmwareControl(context.Background(), target) and assert an error is returned and its message contains both "refused" and "Assigned state" (mirroring the existing TestBringUpControl_RefusesAssignedMachine pattern and exercising ensureMachinesOperable).flow/internal/task/componentmanager/nvswitch/nico/nico.go (1)
144-183: 💤 Low valueMinor inefficiency:
rackIDsmay contain duplicates when multiple switches belong to the same rack.While
WaitForRacksUnassignedinternally deduplicates viadedupSorted, the slice passed to it could contain duplicates. This is a minor inefficiency—the downstream call handles it—but deduplicating here would improve clarity and reduce unnecessary iterations.♻️ Proposed optimization
- rackIDs := make([]string, 0, len(rackBySwitch)) - for _, rid := range rackBySwitch { - rackIDs = append(rackIDs, rid) - } + seen := make(map[string]struct{}, len(rackBySwitch)) + rackIDs := make([]string, 0, len(rackBySwitch)) + for _, rid := range rackBySwitch { + if _, ok := seen[rid]; !ok { + seen[rid] = struct{}{} + rackIDs = append(rackIDs, rid) + } + }🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@flow/internal/task/componentmanager/nvswitch/nico/nico.go` around lines 144 - 183, The rackIDs slice built in ensureRackOperable may contain duplicate rack IDs when multiple switches map to the same rack; before calling m.assignment.WaitForRacksUnassigned(ctx, rackIDs) deduplicate rackIDs (e.g. use a temporary map[string]struct{} to collect unique IDs or sort+unique) so you pass only unique rack IDs. Update the logic around rackBySwitch and rackIDs (keeping orphan handling unchanged) to populate a unique set and then convert it back to a slice for the WaitForRacksUnassigned call.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@flow/internal/nicoapi/mock.go`:
- Around line 146-149: The mock implementation of
mockClient.FindHostMachineIdsByRack currently returns (nil, nil) when rackID ==
"" which diverges from the real client that rejects empty rack IDs; update
mockClient.FindHostMachineIdsByRack to return the same error the real client
uses for invalid/empty rack IDs (e.g., ErrInvalidRackID or a descriptive
fmt.Errorf("empty rackID")) instead of nil, so tests mirror production
validation behavior.
In `@flow/internal/task/componentmanager/providers/nico/assignment.go`:
- Around line 123-125: Split the inline assignment in the if-statement: call
sleep(ctx, c.pollInterval) and assign its result to a named variable (e.g., err)
on its own line, then follow with a separate if err != nil { return err } check;
update the code around the sleep(ctx, c.pollInterval) invocation in
assignment.go so the assignment and condition are two statements (referencing
the sleep function, ctx, and c.pollInterval).
---
Nitpick comments:
In `@flow/internal/task/componentmanager/compute/nico/nico_test.go`:
- Around line 544-558: Add a parallel unit test that verifies FirmwareControl
refuses machines in the "Assigned/Provisioning" state: create a mock client via
nicoapi.NewMockClient(), AddMachine with MachineDetail{MachineID: "machine-1",
State: "Assigned/Provisioning"}, construct the manager using
newManagerForSafetyTest(t, client), build a common.Target with Type
devicetypes.ComponentTypeCompute and ComponentIDs ["machine-1"], call
m.FirmwareControl(context.Background(), target) and assert an error is returned
and its message contains both "refused" and "Assigned state" (mirroring the
existing TestBringUpControl_RefusesAssignedMachine pattern and exercising
ensureMachinesOperable).
In `@flow/internal/task/componentmanager/nvswitch/nico/nico.go`:
- Around line 144-183: The rackIDs slice built in ensureRackOperable may contain
duplicate rack IDs when multiple switches map to the same rack; before calling
m.assignment.WaitForRacksUnassigned(ctx, rackIDs) deduplicate rackIDs (e.g. use
a temporary map[string]struct{} to collect unique IDs or sort+unique) so you
pass only unique rack IDs. Update the logic around rackBySwitch and rackIDs
(keeping orphan handling unchanged) to populate a unique set and then convert it
back to a slice for the WaitForRacksUnassigned call.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: 3ecf4648-0d82-4bd5-9b37-db4f9c8c3fa1
📒 Files selected for processing (13)
flow/internal/nicoapi/grpc.goflow/internal/nicoapi/mock.goflow/internal/nicoapi/mod.goflow/internal/task/componentmanager/builtin/builtin_test.goflow/internal/task/componentmanager/compute/nico/nico.goflow/internal/task/componentmanager/compute/nico/nico_test.goflow/internal/task/componentmanager/nvswitch/nico/nico.goflow/internal/task/componentmanager/nvswitch/nico/nico_test.goflow/internal/task/componentmanager/nvswitch/nvswitchmanager/nvswitchmanager.goflow/internal/task/componentmanager/powershelf/nico/nico.goflow/internal/task/componentmanager/powershelf/nico/nico_test.goflow/internal/task/componentmanager/providers/nico/assignment.goflow/internal/task/componentmanager/providers/nico/assignment_test.go
Disruptive operations on a Machine — power control, firmware update, and bring-up — power-cycle the host, which is unsafe to run while that host is still attached to a tenant Instance. Same applies, at rack scope, to NVSwitch and PowerShelf operations: those reset the rack's fabric/power feed, so any Assigned host on the rack would see the disturbance. This change adds a per-Manager policy gate (`ensureMachinesOperable` for compute, `ensureRackOperable` for nvswitch/powershelf) that blocks the operation until every relevant host has left `ManagedHostState::Assigned/*`, polling Core at a fixed interval and failing the task with a clear error after a 30-minute timeout. Today the gate delegates to a new `AssignmentChecker` primitive in the nicoprovider package; future operator-approved overrides will be composed inside the same `ensure...Operable` helpers without touching the call sites. For the NSM-based NVSwitch manager (which previously had no Core client), the `nico` provider is now also required so the rack check can run; the existing `nico`-based managers already had it. To support the rack-scope check, three nicoapi.Client methods are added: `FindHostMachineIdsByRack`, `FindSwitchRackIDs`, and `FindPowerShelfRackIDs`. Mock-only setters mirror them so tests can configure rack topology without standing up a fake Core. The mock's `FindHostMachineIdsByRack` rejects empty rack IDs to match the gRPC client, so tests can't pass an empty input silently and miss the production validation path. Switches and power shelves that Core does not yet associate with a rack (e.g. mid bring-up, pre-ingest) are logged and skipped rather than blocked: failing closed on a missing rack association would deadlock the very flow that's supposed to populate it. Tests cover the AssignmentChecker primitive (timeout, cancellation, rack→machine resolution) and add focused integration-style tests on each Manager that exercise both the refused (host Assigned) and allowed (host Ready) paths via the mock client for PowerControl, FirmwareControl, and BringUpControl. The NSM nvswitch manager descriptor test is updated to expect the new `nico` required-provider entry. Signed-off-by: Kun Zhao <kunzhao@nvidia.com>
675da63 to
ca28792
Compare
🔐 TruffleHog Secret Scan✅ No secrets or credentials found! Your code has been scanned for 700+ types of secrets and credentials. All clear! 🎉 🕐 Last updated: 2026-05-28 06:14:14 UTC | Commit: ca28792 |
There was a problem hiding this comment.
Actionable comments posted: 1
🧹 Nitpick comments (2)
flow/internal/task/componentmanager/nvswitch/nico/nico_test.go (2)
221-230: 💤 Low valueConsider extracting timeout constants for reusability.
The timeout and poll interval values (50ms, 10ms) are reasonable but hard-coded. If additional safety tests are added in the future, extracting these as package-level test constants would improve maintainability.
♻️ Proposed refactor
+const ( + testAssignmentTimeout = 50 * time.Millisecond + testAssignmentInterval = 10 * time.Millisecond +) + // newManagerForSafetyTest swaps the long default 30-minute assignment // timeout for a tight one so the wait loop actually times out within the // test budget. Tests in this file use the same package, so they can reach // the unexported assignment field directly. func newManagerForSafetyTest(t *testing.T, client nicoapi.Client) *Manager { t.Helper() m := New(client) - m.assignment = nicoprovider.NewAssignmentChecker(client, 50*time.Millisecond, 10*time.Millisecond) + m.assignment = nicoprovider.NewAssignmentChecker(client, testAssignmentTimeout, testAssignmentInterval) return m }🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@flow/internal/task/componentmanager/nvswitch/nico/nico_test.go` around lines 221 - 230, Extract the hard-coded timeouts in newManagerForSafetyTest into package-level test constants (e.g., safetyAssignmentTimeout and safetyPollInterval) and use them when calling nicoprovider.NewAssignmentChecker; update the function newManagerForSafetyTest (which creates a Manager via New and sets m.assignment) to reference those constants so future tests can reuse and adjust the timeout/poll values centrally.
271-290: ⚡ Quick winAdd complementary test for FirmwareControl allowance.
PowerControl has both refusal (line 232) and allowance (line 253) tests, but FirmwareControl only tests the refusal path. Since both operations share the same
ensureRackOperablesafety gate, symmetric test coverage is warranted.🧪 Recommended test addition
func TestFirmwareControl_AllowsWhenRackHostsReady(t *testing.T) { client := nicoapi.NewMockClient() client.SetSwitchRackID("sw-1", "rack-A") client.SetRackHostMachineIDs("rack-A", []string{"host-1"}) client.AddMachine(nicoapi.MachineDetail{MachineID: "host-1", State: "Ready"}) m := newManagerForSafetyTest(t, client) target := common.Target{ Type: devicetypes.ComponentTypeNVSwitch, ComponentIDs: []string{"sw-1"}, } err := m.FirmwareControl(context.Background(), target, operations.FirmwareControlTaskInfo{ Operation: operations.FirmwareOperationUpgrade, TargetVersion: "1.0.0", }) require.NoError(t, err) }🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@flow/internal/task/componentmanager/nvswitch/nico/nico_test.go` around lines 271 - 290, Add a complementary positive test for FirmwareControl to mirror the existing refusal test: create a test (e.g., TestFirmwareControl_AllowsWhenRackHostsReady) that uses nicoapi.NewMockClient(), SetSwitchRackID("sw-1","rack-A"), SetRackHostMachineIDs("rack-A", []string{"host-1"}), and AddMachine with MachineDetail.State "Ready"; instantiate the manager with newManagerForSafetyTest and call m.FirmwareControl(context.Background(), target, operations.FirmwareControlTaskInfo{Operation: operations.FirmwareOperationUpgrade, TargetVersion: "1.0.0"}) and assert require.NoError(t, err). This ensures FirmwareControl (which uses ensureRackOperable) allows the operation when rack hosts are Ready.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@flow/internal/task/componentmanager/nvswitch/nico/nico_test.go`:
- Around line 271-290: The test TestFirmwareControl_RefusesWhenRackHostAssigned
should also assert that the machine ID is included in the error message; after
calling m.FirmwareControl and verifying err is non-nil and contains "refused"
and "Assigned state", add an assertion that err.Error() contains "host-1" to
match the behavior of the assignment checker (same pattern as the PowerControl
test), referencing the test name and the m.FirmwareControl call to locate the
change.
---
Nitpick comments:
In `@flow/internal/task/componentmanager/nvswitch/nico/nico_test.go`:
- Around line 221-230: Extract the hard-coded timeouts in
newManagerForSafetyTest into package-level test constants (e.g.,
safetyAssignmentTimeout and safetyPollInterval) and use them when calling
nicoprovider.NewAssignmentChecker; update the function newManagerForSafetyTest
(which creates a Manager via New and sets m.assignment) to reference those
constants so future tests can reuse and adjust the timeout/poll values
centrally.
- Around line 271-290: Add a complementary positive test for FirmwareControl to
mirror the existing refusal test: create a test (e.g.,
TestFirmwareControl_AllowsWhenRackHostsReady) that uses nicoapi.NewMockClient(),
SetSwitchRackID("sw-1","rack-A"), SetRackHostMachineIDs("rack-A",
[]string{"host-1"}), and AddMachine with MachineDetail.State "Ready";
instantiate the manager with newManagerForSafetyTest and call
m.FirmwareControl(context.Background(), target,
operations.FirmwareControlTaskInfo{Operation:
operations.FirmwareOperationUpgrade, TargetVersion: "1.0.0"}) and assert
require.NoError(t, err). This ensures FirmwareControl (which uses
ensureRackOperable) allows the operation when rack hosts are Ready.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: bbb1a347-1e55-4084-8dba-f389b39a37b0
📒 Files selected for processing (13)
flow/internal/nicoapi/grpc.goflow/internal/nicoapi/mock.goflow/internal/nicoapi/mod.goflow/internal/task/componentmanager/builtin/builtin_test.goflow/internal/task/componentmanager/compute/nico/nico.goflow/internal/task/componentmanager/compute/nico/nico_test.goflow/internal/task/componentmanager/nvswitch/nico/nico.goflow/internal/task/componentmanager/nvswitch/nico/nico_test.goflow/internal/task/componentmanager/nvswitch/nvswitchmanager/nvswitchmanager.goflow/internal/task/componentmanager/powershelf/nico/nico.goflow/internal/task/componentmanager/powershelf/nico/nico_test.goflow/internal/task/componentmanager/providers/nico/assignment.goflow/internal/task/componentmanager/providers/nico/assignment_test.go
🚧 Files skipped from review as they are similar to previous changes (12)
- flow/internal/task/componentmanager/builtin/builtin_test.go
- flow/internal/nicoapi/mod.go
- flow/internal/task/componentmanager/powershelf/nico/nico.go
- flow/internal/nicoapi/mock.go
- flow/internal/nicoapi/grpc.go
- flow/internal/task/componentmanager/powershelf/nico/nico_test.go
- flow/internal/task/componentmanager/compute/nico/nico.go
- flow/internal/task/componentmanager/providers/nico/assignment_test.go
- flow/internal/task/componentmanager/compute/nico/nico_test.go
- flow/internal/task/componentmanager/nvswitch/nico/nico.go
- flow/internal/task/componentmanager/providers/nico/assignment.go
- flow/internal/task/componentmanager/nvswitch/nvswitchmanager/nvswitchmanager.go
| func TestFirmwareControl_RefusesWhenRackHostAssigned(t *testing.T) { | ||
| client := nicoapi.NewMockClient() | ||
| client.SetSwitchRackID("sw-1", "rack-A") | ||
| client.SetRackHostMachineIDs("rack-A", []string{"host-1"}) | ||
| client.AddMachine(nicoapi.MachineDetail{MachineID: "host-1", State: "Assigned/Provisioning"}) | ||
|
|
||
| m := newManagerForSafetyTest(t, client) | ||
| target := common.Target{ | ||
| Type: devicetypes.ComponentTypeNVSwitch, | ||
| ComponentIDs: []string{"sw-1"}, | ||
| } | ||
|
|
||
| err := m.FirmwareControl(context.Background(), target, operations.FirmwareControlTaskInfo{ | ||
| Operation: operations.FirmwareOperationUpgrade, | ||
| TargetVersion: "1.0.0", | ||
| }) | ||
| require.Error(t, err) | ||
| assert.Contains(t, err.Error(), "refused") | ||
| assert.Contains(t, err.Error(), "Assigned state") | ||
| } |
There was a problem hiding this comment.
Add assertion for machine ID in error message.
The error message from the assignment checker includes the machine IDs that remain in Assigned state (as verified in the PowerControl test at line 250). For consistency and thoroughness, this test should also verify that "host-1" appears in the error message.
🔍 Proposed fix
require.Error(t, err)
assert.Contains(t, err.Error(), "refused")
assert.Contains(t, err.Error(), "Assigned state")
+ assert.Contains(t, err.Error(), "host-1")
}🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@flow/internal/task/componentmanager/nvswitch/nico/nico_test.go` around lines
271 - 290, The test TestFirmwareControl_RefusesWhenRackHostAssigned should also
assert that the machine ID is included in the error message; after calling
m.FirmwareControl and verifying err is non-nil and contains "refused" and
"Assigned state", add an assertion that err.Error() contains "host-1" to match
the behavior of the assignment checker (same pattern as the PowerControl test),
referencing the test name and the m.FirmwareControl call to locate the change.
Description
Adds a safety gate that refuses Flow's disruptive operations — power control, firmware update, and bring-up — while any affected host is still attached to a tenant Instance (
ManagedHostState::Assigned/*). Why this matters: all three operations power-cycle the host.Type of Change
Services Affected
Related Issues (Optional)
Breaking Changes
Testing
Additional Notes