Skip to content

feat: Gate Flow power/firmware ops on host assignment state#581

Draft
kunzhao-nv wants to merge 2 commits into
mainfrom
feat/flow-check-assigned
Draft

feat: Gate Flow power/firmware ops on host assignment state#581
kunzhao-nv wants to merge 2 commits into
mainfrom
feat/flow-check-assigned

Conversation

@kunzhao-nv
Copy link
Copy Markdown
Contributor

@kunzhao-nv kunzhao-nv commented May 28, 2026

Description

Adds a safety gate that refuses Flow's disruptive operations — power control, firmware update, and bring-up — while any affected host is still attached to a tenant Instance (ManagedHostState::Assigned/*). Why this matters: all three operations power-cycle the host.

Type of Change

  • Feature - New feature or functionality (feat:)
  • Fix - Bug fixes (fix:)
  • Chore - Modification or removal of existing functionality (chore:)
  • Refactor - Refactoring of existing functionality (refactor:)
  • Docs - Changes in documentation or OpenAPI schema (docs:)
  • CI - Changes in GitHub workflows. Requires additional scrutiny (ci:)
  • Version - Issuing a new release version (version:)

Services Affected

  • API - API models or endpoints updated
  • Workflow - Workflow service updated
  • DB - DB DAOs or migrations updated
  • Site Manager - Site Manager updated
  • Cert Manager - Cert Manager updated
  • Site Agent - Site Agent updated
  • Flow - Flow service updated
  • Powershelf Manager - Powershelf Manager updated
  • NVSwitch Manager - NVSwitch Manager updated

Related Issues (Optional)

Breaking Changes

  • This PR contains breaking changes

Testing

  • Unit tests added/updated
  • Integration tests added/updated
  • Manual testing performed
  • No testing required (docs, internal refactor, etc.)

Additional Notes

@kunzhao-nv kunzhao-nv requested a review from a team as a code owner May 28, 2026 01:03
@copy-pr-bot
Copy link
Copy Markdown

copy-pr-bot Bot commented May 28, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@coderabbitai
Copy link
Copy Markdown
Contributor

coderabbitai Bot commented May 28, 2026

Review Change Stack

Important

Review skipped

Draft detected.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 811f09fe-87d2-46ad-8905-8176269efad4

You can disable this status message by setting the reviews.review_status to false in the CodeRabbit configuration file.

Use the checkbox below for a quick retry:

  • 🔍 Trigger review

Walkthrough

This PR implements rack-assignment safety gates for compute and fabric operations. It extends the NICo API with topology queries (host machines by rack, switch-to-rack, power-shelf-to-rack mappings), introduces an AssignmentChecker that polls for machines and racks to exit assigned lifecycle states, and integrates this checker across compute, NVSwitch, and power-shelf component managers to prevent disruptive operations during active tenant assignments.

Changes

Rack Assignment Safety Infrastructure

Layer / File(s) Summary
NICo topology query API and implementations
flow/internal/nicoapi/mod.go, flow/internal/nicoapi/grpc.go, flow/internal/nicoapi/mock.go
Extends the Client interface with FindHostMachineIdsByRack, FindSwitchRackIDs, FindPowerShelfRackIDs and adds mock state and setters. Implements GRPC helpers that filter/omit unmapped entries.
AssignmentChecker state polling implementation
flow/internal/task/componentmanager/providers/nico/assignment.go
Defines AssignmentChecker with configurable timeout/interval, IsAssignedState predicate, WaitForMachinesUnassigned polling (dedupe, deadline, transient-missing tolerant), WaitForRacksUnassigned resolving racks→machines, and helpers sleep/dedupSorted.
AssignmentChecker unit tests
flow/internal/task/componentmanager/providers/nico/assignment_test.go
Covers IsAssignedState, short-circuits (nil/empty inputs), success/timeouts, missing-state handling, rack-resolution, empty-rack pass-through, and context-cancellation behavior.
Compute NICo component host-level assignment gates
flow/internal/task/componentmanager/compute/nico/nico.go, flow/internal/task/componentmanager/compute/nico/nico_test.go
Manager adds AssignmentChecker. PowerControl, FirmwareControl, and BringUpControl call ensureMachinesOperable (WaitForMachinesUnassigned) and return refused: errors when targets remain assigned. Tests validate refusal/allow scenarios.
NVSwitch NICo component rack-level assignment gates
flow/internal/task/componentmanager/nvswitch/nico/nico.go, flow/internal/task/componentmanager/nvswitch/nico/nico_test.go
Manager adds AssignmentChecker and ensureRackOperable which maps switches→racks, logs orphan switches, and waits for racks to be unassigned before disruptive operations. Tests added for refusal and allowance cases.
NVSwitch Manager rack-level assignment gates with dual provider wiring
flow/internal/task/componentmanager/nvswitch/nvswitchmanager/nvswitchmanager.go
Manager now accepts NSM and NICo clients and initializes AssignmentChecker. Factory retrieves both providers and Descriptor lists both required providers. PowerControl and FirmwareControl invoke ensureRackOperable before NSM operations.
Power shelf NICo component rack-level assignment gates
flow/internal/task/componentmanager/powershelf/nico/nico.go, flow/internal/task/componentmanager/powershelf/nico/nico_test.go
Manager adds AssignmentChecker. PowerControl and FirmwareControl call ensureRackOperable resolving shelves→racks and block when racks/hosts remain assigned; tests cover refusal and allow scenarios with short-timeout helpers.
Test fixture update for dual-provider dependency
flow/internal/task/componentmanager/builtin/builtin_test.go
Updates TestServiceCatalog nvswitch/nvswitchmanager fixture to expect both nsmprovider and nicoprovider in RequiredProviders.

🎯 3 (Moderate) | ⏱️ ~22 minutes

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 31.82% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (4 passed)
Check name Status Explanation
Title check ✅ Passed The PR title 'feat: Gate Flow power/firmware ops on host assignment state' accurately describes the primary change: introducing a safety gate for Flow's disruptive operations based on host assignment state.
Description check ✅ Passed The PR description provides clear context about the safety gate mechanism, explains the rationale (power-cycling hosts), and aligns with the changeset across multiple managers and the assignment checker implementation.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch feat/flow-check-assigned

Comment @coderabbitai help to get the list of available commands and usage tips.

@github-actions
Copy link
Copy Markdown

github-actions Bot commented May 28, 2026

Test Results

9 799 tests  +20   9 799 ✅ +20   7m 13s ⏱️ -13s
  161 suites ± 0       0 💤 ± 0 
   14 files   ± 0       0 ❌ ± 0 

Results for commit 779031c. ± Comparison against base commit 13b919f.

♻️ This comment has been updated with latest results.

Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

🧹 Nitpick comments (2)
flow/internal/task/componentmanager/compute/nico/nico_test.go (1)

544-558: 💤 Low value

Consider adding a test for FirmwareControl refusal when assigned.

The test suite covers PowerControl and BringUpControl refusal scenarios, but FirmwareControl also invokes ensureMachinesOperable. Adding a parallel test would ensure complete coverage of the safety gate across all three operations.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@flow/internal/task/componentmanager/compute/nico/nico_test.go` around lines
544 - 558, Add a parallel unit test that verifies FirmwareControl refuses
machines in the "Assigned/Provisioning" state: create a mock client via
nicoapi.NewMockClient(), AddMachine with MachineDetail{MachineID: "machine-1",
State: "Assigned/Provisioning"}, construct the manager using
newManagerForSafetyTest(t, client), build a common.Target with Type
devicetypes.ComponentTypeCompute and ComponentIDs ["machine-1"], call
m.FirmwareControl(context.Background(), target) and assert an error is returned
and its message contains both "refused" and "Assigned state" (mirroring the
existing TestBringUpControl_RefusesAssignedMachine pattern and exercising
ensureMachinesOperable).
flow/internal/task/componentmanager/nvswitch/nico/nico.go (1)

144-183: 💤 Low value

Minor inefficiency: rackIDs may contain duplicates when multiple switches belong to the same rack.

While WaitForRacksUnassigned internally deduplicates via dedupSorted, the slice passed to it could contain duplicates. This is a minor inefficiency—the downstream call handles it—but deduplicating here would improve clarity and reduce unnecessary iterations.

♻️ Proposed optimization
-	rackIDs := make([]string, 0, len(rackBySwitch))
-	for _, rid := range rackBySwitch {
-		rackIDs = append(rackIDs, rid)
-	}
+	seen := make(map[string]struct{}, len(rackBySwitch))
+	rackIDs := make([]string, 0, len(rackBySwitch))
+	for _, rid := range rackBySwitch {
+		if _, ok := seen[rid]; !ok {
+			seen[rid] = struct{}{}
+			rackIDs = append(rackIDs, rid)
+		}
+	}
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@flow/internal/task/componentmanager/nvswitch/nico/nico.go` around lines 144 -
183, The rackIDs slice built in ensureRackOperable may contain duplicate rack
IDs when multiple switches map to the same rack; before calling
m.assignment.WaitForRacksUnassigned(ctx, rackIDs) deduplicate rackIDs (e.g. use
a temporary map[string]struct{} to collect unique IDs or sort+unique) so you
pass only unique rack IDs. Update the logic around rackBySwitch and rackIDs
(keeping orphan handling unchanged) to populate a unique set and then convert it
back to a slice for the WaitForRacksUnassigned call.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@flow/internal/nicoapi/mock.go`:
- Around line 146-149: The mock implementation of
mockClient.FindHostMachineIdsByRack currently returns (nil, nil) when rackID ==
"" which diverges from the real client that rejects empty rack IDs; update
mockClient.FindHostMachineIdsByRack to return the same error the real client
uses for invalid/empty rack IDs (e.g., ErrInvalidRackID or a descriptive
fmt.Errorf("empty rackID")) instead of nil, so tests mirror production
validation behavior.

In `@flow/internal/task/componentmanager/providers/nico/assignment.go`:
- Around line 123-125: Split the inline assignment in the if-statement: call
sleep(ctx, c.pollInterval) and assign its result to a named variable (e.g., err)
on its own line, then follow with a separate if err != nil { return err } check;
update the code around the sleep(ctx, c.pollInterval) invocation in
assignment.go so the assignment and condition are two statements (referencing
the sleep function, ctx, and c.pollInterval).

---

Nitpick comments:
In `@flow/internal/task/componentmanager/compute/nico/nico_test.go`:
- Around line 544-558: Add a parallel unit test that verifies FirmwareControl
refuses machines in the "Assigned/Provisioning" state: create a mock client via
nicoapi.NewMockClient(), AddMachine with MachineDetail{MachineID: "machine-1",
State: "Assigned/Provisioning"}, construct the manager using
newManagerForSafetyTest(t, client), build a common.Target with Type
devicetypes.ComponentTypeCompute and ComponentIDs ["machine-1"], call
m.FirmwareControl(context.Background(), target) and assert an error is returned
and its message contains both "refused" and "Assigned state" (mirroring the
existing TestBringUpControl_RefusesAssignedMachine pattern and exercising
ensureMachinesOperable).

In `@flow/internal/task/componentmanager/nvswitch/nico/nico.go`:
- Around line 144-183: The rackIDs slice built in ensureRackOperable may contain
duplicate rack IDs when multiple switches map to the same rack; before calling
m.assignment.WaitForRacksUnassigned(ctx, rackIDs) deduplicate rackIDs (e.g. use
a temporary map[string]struct{} to collect unique IDs or sort+unique) so you
pass only unique rack IDs. Update the logic around rackBySwitch and rackIDs
(keeping orphan handling unchanged) to populate a unique set and then convert it
back to a slice for the WaitForRacksUnassigned call.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 3ecf4648-0d82-4bd5-9b37-db4f9c8c3fa1

📥 Commits

Reviewing files that changed from the base of the PR and between 10d3187 and 675da63.

📒 Files selected for processing (13)
  • flow/internal/nicoapi/grpc.go
  • flow/internal/nicoapi/mock.go
  • flow/internal/nicoapi/mod.go
  • flow/internal/task/componentmanager/builtin/builtin_test.go
  • flow/internal/task/componentmanager/compute/nico/nico.go
  • flow/internal/task/componentmanager/compute/nico/nico_test.go
  • flow/internal/task/componentmanager/nvswitch/nico/nico.go
  • flow/internal/task/componentmanager/nvswitch/nico/nico_test.go
  • flow/internal/task/componentmanager/nvswitch/nvswitchmanager/nvswitchmanager.go
  • flow/internal/task/componentmanager/powershelf/nico/nico.go
  • flow/internal/task/componentmanager/powershelf/nico/nico_test.go
  • flow/internal/task/componentmanager/providers/nico/assignment.go
  • flow/internal/task/componentmanager/providers/nico/assignment_test.go

Comment thread flow/internal/nicoapi/mock.go
Comment thread flow/internal/task/componentmanager/providers/nico/assignment.go Outdated
Disruptive operations on a Machine — power control, firmware update,
and bring-up — power-cycle the host, which is unsafe to run while
that host is still attached to a tenant Instance. Same applies, at
rack scope, to NVSwitch and PowerShelf operations: those reset the
rack's fabric/power feed, so any Assigned host on the rack would see
the disturbance.

This change adds a per-Manager policy gate (`ensureMachinesOperable`
for compute, `ensureRackOperable` for nvswitch/powershelf) that
blocks the operation until every relevant host has left
`ManagedHostState::Assigned/*`, polling Core at a fixed interval and
failing the task with a clear error after a 30-minute timeout. Today
the gate delegates to a new `AssignmentChecker` primitive in the
nicoprovider package; future operator-approved overrides will be
composed inside the same `ensure...Operable` helpers without touching
the call sites.

For the NSM-based NVSwitch manager (which previously had no Core
client), the `nico` provider is now also required so the rack check
can run; the existing `nico`-based managers already had it.

To support the rack-scope check, three nicoapi.Client methods are
added: `FindHostMachineIdsByRack`, `FindSwitchRackIDs`, and
`FindPowerShelfRackIDs`. Mock-only setters mirror them so tests can
configure rack topology without standing up a fake Core. The mock's
`FindHostMachineIdsByRack` rejects empty rack IDs to match the gRPC
client, so tests can't pass an empty input silently and miss the
production validation path.

Switches and power shelves that Core does not yet associate with a
rack (e.g. mid bring-up, pre-ingest) are logged and skipped rather
than blocked: failing closed on a missing rack association would
deadlock the very flow that's supposed to populate it.

Tests cover the AssignmentChecker primitive (timeout, cancellation,
rack→machine resolution) and add focused integration-style tests on
each Manager that exercise both the refused (host Assigned) and
allowed (host Ready) paths via the mock client for PowerControl,
FirmwareControl, and BringUpControl. The NSM nvswitch manager
descriptor test is updated to expect the new `nico` required-provider
entry.

Signed-off-by: Kun Zhao <kunzhao@nvidia.com>
@kunzhao-nv kunzhao-nv force-pushed the feat/flow-check-assigned branch from 675da63 to ca28792 Compare May 28, 2026 06:13
@github-actions
Copy link
Copy Markdown

🔐 TruffleHog Secret Scan

No secrets or credentials found!

Your code has been scanned for 700+ types of secrets and credentials. All clear! 🎉

🔗 View scan details

🕐 Last updated: 2026-05-28 06:14:14 UTC | Commit: ca28792

Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🧹 Nitpick comments (2)
flow/internal/task/componentmanager/nvswitch/nico/nico_test.go (2)

221-230: 💤 Low value

Consider extracting timeout constants for reusability.

The timeout and poll interval values (50ms, 10ms) are reasonable but hard-coded. If additional safety tests are added in the future, extracting these as package-level test constants would improve maintainability.

♻️ Proposed refactor
+const (
+	testAssignmentTimeout  = 50 * time.Millisecond
+	testAssignmentInterval = 10 * time.Millisecond
+)
+
 // newManagerForSafetyTest swaps the long default 30-minute assignment
 // timeout for a tight one so the wait loop actually times out within the
 // test budget. Tests in this file use the same package, so they can reach
 // the unexported assignment field directly.
 func newManagerForSafetyTest(t *testing.T, client nicoapi.Client) *Manager {
 	t.Helper()
 	m := New(client)
-	m.assignment = nicoprovider.NewAssignmentChecker(client, 50*time.Millisecond, 10*time.Millisecond)
+	m.assignment = nicoprovider.NewAssignmentChecker(client, testAssignmentTimeout, testAssignmentInterval)
 	return m
 }
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@flow/internal/task/componentmanager/nvswitch/nico/nico_test.go` around lines
221 - 230, Extract the hard-coded timeouts in newManagerForSafetyTest into
package-level test constants (e.g., safetyAssignmentTimeout and
safetyPollInterval) and use them when calling nicoprovider.NewAssignmentChecker;
update the function newManagerForSafetyTest (which creates a Manager via New and
sets m.assignment) to reference those constants so future tests can reuse and
adjust the timeout/poll values centrally.

271-290: ⚡ Quick win

Add complementary test for FirmwareControl allowance.

PowerControl has both refusal (line 232) and allowance (line 253) tests, but FirmwareControl only tests the refusal path. Since both operations share the same ensureRackOperable safety gate, symmetric test coverage is warranted.

🧪 Recommended test addition
func TestFirmwareControl_AllowsWhenRackHostsReady(t *testing.T) {
	client := nicoapi.NewMockClient()
	client.SetSwitchRackID("sw-1", "rack-A")
	client.SetRackHostMachineIDs("rack-A", []string{"host-1"})
	client.AddMachine(nicoapi.MachineDetail{MachineID: "host-1", State: "Ready"})

	m := newManagerForSafetyTest(t, client)
	target := common.Target{
		Type:         devicetypes.ComponentTypeNVSwitch,
		ComponentIDs: []string{"sw-1"},
	}

	err := m.FirmwareControl(context.Background(), target, operations.FirmwareControlTaskInfo{
		Operation:     operations.FirmwareOperationUpgrade,
		TargetVersion: "1.0.0",
	})
	require.NoError(t, err)
}
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@flow/internal/task/componentmanager/nvswitch/nico/nico_test.go` around lines
271 - 290, Add a complementary positive test for FirmwareControl to mirror the
existing refusal test: create a test (e.g.,
TestFirmwareControl_AllowsWhenRackHostsReady) that uses nicoapi.NewMockClient(),
SetSwitchRackID("sw-1","rack-A"), SetRackHostMachineIDs("rack-A",
[]string{"host-1"}), and AddMachine with MachineDetail.State "Ready";
instantiate the manager with newManagerForSafetyTest and call
m.FirmwareControl(context.Background(), target,
operations.FirmwareControlTaskInfo{Operation:
operations.FirmwareOperationUpgrade, TargetVersion: "1.0.0"}) and assert
require.NoError(t, err). This ensures FirmwareControl (which uses
ensureRackOperable) allows the operation when rack hosts are Ready.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@flow/internal/task/componentmanager/nvswitch/nico/nico_test.go`:
- Around line 271-290: The test TestFirmwareControl_RefusesWhenRackHostAssigned
should also assert that the machine ID is included in the error message; after
calling m.FirmwareControl and verifying err is non-nil and contains "refused"
and "Assigned state", add an assertion that err.Error() contains "host-1" to
match the behavior of the assignment checker (same pattern as the PowerControl
test), referencing the test name and the m.FirmwareControl call to locate the
change.

---

Nitpick comments:
In `@flow/internal/task/componentmanager/nvswitch/nico/nico_test.go`:
- Around line 221-230: Extract the hard-coded timeouts in
newManagerForSafetyTest into package-level test constants (e.g.,
safetyAssignmentTimeout and safetyPollInterval) and use them when calling
nicoprovider.NewAssignmentChecker; update the function newManagerForSafetyTest
(which creates a Manager via New and sets m.assignment) to reference those
constants so future tests can reuse and adjust the timeout/poll values
centrally.
- Around line 271-290: Add a complementary positive test for FirmwareControl to
mirror the existing refusal test: create a test (e.g.,
TestFirmwareControl_AllowsWhenRackHostsReady) that uses nicoapi.NewMockClient(),
SetSwitchRackID("sw-1","rack-A"), SetRackHostMachineIDs("rack-A",
[]string{"host-1"}), and AddMachine with MachineDetail.State "Ready";
instantiate the manager with newManagerForSafetyTest and call
m.FirmwareControl(context.Background(), target,
operations.FirmwareControlTaskInfo{Operation:
operations.FirmwareOperationUpgrade, TargetVersion: "1.0.0"}) and assert
require.NoError(t, err). This ensures FirmwareControl (which uses
ensureRackOperable) allows the operation when rack hosts are Ready.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: bbb1a347-1e55-4084-8dba-f389b39a37b0

📥 Commits

Reviewing files that changed from the base of the PR and between 675da63 and ca28792.

📒 Files selected for processing (13)
  • flow/internal/nicoapi/grpc.go
  • flow/internal/nicoapi/mock.go
  • flow/internal/nicoapi/mod.go
  • flow/internal/task/componentmanager/builtin/builtin_test.go
  • flow/internal/task/componentmanager/compute/nico/nico.go
  • flow/internal/task/componentmanager/compute/nico/nico_test.go
  • flow/internal/task/componentmanager/nvswitch/nico/nico.go
  • flow/internal/task/componentmanager/nvswitch/nico/nico_test.go
  • flow/internal/task/componentmanager/nvswitch/nvswitchmanager/nvswitchmanager.go
  • flow/internal/task/componentmanager/powershelf/nico/nico.go
  • flow/internal/task/componentmanager/powershelf/nico/nico_test.go
  • flow/internal/task/componentmanager/providers/nico/assignment.go
  • flow/internal/task/componentmanager/providers/nico/assignment_test.go
🚧 Files skipped from review as they are similar to previous changes (12)
  • flow/internal/task/componentmanager/builtin/builtin_test.go
  • flow/internal/nicoapi/mod.go
  • flow/internal/task/componentmanager/powershelf/nico/nico.go
  • flow/internal/nicoapi/mock.go
  • flow/internal/nicoapi/grpc.go
  • flow/internal/task/componentmanager/powershelf/nico/nico_test.go
  • flow/internal/task/componentmanager/compute/nico/nico.go
  • flow/internal/task/componentmanager/providers/nico/assignment_test.go
  • flow/internal/task/componentmanager/compute/nico/nico_test.go
  • flow/internal/task/componentmanager/nvswitch/nico/nico.go
  • flow/internal/task/componentmanager/providers/nico/assignment.go
  • flow/internal/task/componentmanager/nvswitch/nvswitchmanager/nvswitchmanager.go

Comment on lines +271 to +290
func TestFirmwareControl_RefusesWhenRackHostAssigned(t *testing.T) {
client := nicoapi.NewMockClient()
client.SetSwitchRackID("sw-1", "rack-A")
client.SetRackHostMachineIDs("rack-A", []string{"host-1"})
client.AddMachine(nicoapi.MachineDetail{MachineID: "host-1", State: "Assigned/Provisioning"})

m := newManagerForSafetyTest(t, client)
target := common.Target{
Type: devicetypes.ComponentTypeNVSwitch,
ComponentIDs: []string{"sw-1"},
}

err := m.FirmwareControl(context.Background(), target, operations.FirmwareControlTaskInfo{
Operation: operations.FirmwareOperationUpgrade,
TargetVersion: "1.0.0",
})
require.Error(t, err)
assert.Contains(t, err.Error(), "refused")
assert.Contains(t, err.Error(), "Assigned state")
}
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Add assertion for machine ID in error message.

The error message from the assignment checker includes the machine IDs that remain in Assigned state (as verified in the PowerControl test at line 250). For consistency and thoroughness, this test should also verify that "host-1" appears in the error message.

🔍 Proposed fix
 	require.Error(t, err)
 	assert.Contains(t, err.Error(), "refused")
 	assert.Contains(t, err.Error(), "Assigned state")
+	assert.Contains(t, err.Error(), "host-1")
 }
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@flow/internal/task/componentmanager/nvswitch/nico/nico_test.go` around lines
271 - 290, The test TestFirmwareControl_RefusesWhenRackHostAssigned should also
assert that the machine ID is included in the error message; after calling
m.FirmwareControl and verifying err is non-nil and contains "refused" and
"Assigned state", add an assertion that err.Error() contains "host-1" to match
the behavior of the assignment checker (same pattern as the PowerControl test),
referencing the test name and the m.FirmwareControl call to locate the change.

@kunzhao-nv kunzhao-nv marked this pull request as draft May 29, 2026 18:46
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant