WIP: refactor Start and Stop. #1607

Draft
elezar wants to merge 10 commits into NVIDIA:main from elezar:healthcheck

Conversation

elezar (Member) commented Jan 29, 2026

No description provided.

elezar and others added 10 commits January 28, 2026 12:27
Add NVML mock packages from go-nvml for use in health check tests.
This includes the dgxa100 mock server for simulating GPU hardware.

Signed-off-by: Carlos Eduardo Arango Gutierrez <eduardoa@nvidia.com>

Refactor the health monitoring implementation:
- Extract nvmlHealthProvider struct to encapsulate health check state
- Add registerDeviceEvents() and runEventMonitor() methods
- Rename xids field to xidsDisabled for clarity
- Migrate from stop channel to context.Context for cancellation

Add comprehensive tests:
- TestCheckHealth validates XID event handling with mocks
- TestRegisterDeviceEventsNotSupported ensures old GPUs returning
  ERROR_NOT_SUPPORTED are not incorrectly marked unhealthy

Signed-off-by: Carlos Eduardo Arango Gutierrez <eduardoa@nvidia.com>
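
The behaviour that TestRegisterDeviceEventsNotSupported guards can be sketched as a small classification step. This is only an illustration: `ret`, its constants, and `registrationOutcome` are hypothetical stand-ins local to this sketch, not the plugin's or go-nvml's actual types, and the treatment of other failures in the default branch is an assumption rather than something the commit message states.

```go
package main

import "fmt"

// ret is a hypothetical stand-in for an NVML return code.
// The constants below are local to this sketch.
type ret int

const (
	success ret = iota
	errNotSupported
	errUnknown
)

// registrationOutcome classifies the result of registering one device for
// health events. A device that does not support event registration is
// skipped rather than marked unhealthy.
func registrationOutcome(r ret) (monitored, unhealthy bool) {
	switch r {
	case success:
		return true, false
	case errNotSupported:
		// Older GPUs: not monitored, but still considered healthy.
		return false, false
	default:
		// Assumption: any other failure marks the device unhealthy.
		return false, true
	}
}

func main() {
	for _, r := range []ret{success, errNotSupported, errUnknown} {
		monitored, unhealthy := registrationOutcome(r)
		fmt.Printf("ret=%d monitored=%t unhealthy=%t\n", r, monitored, unhealthy)
	}
}
```
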
Update ResourceManager interface and implementations to use
context.Context for cancellation instead of a stop channel.
This is more idiomatic Go and allows for better lifecycle control.

Signed-off-by: Carlos Eduardo Arango Gutierrez <eduardoa@nvidia.com>
Replace stop channel with context-based cancellation for health checks:
- Add healthCtx, healthCancel, and healthWg for lifecycle management
- Capture context/channel references in ListAndWatch to avoid a race
  with cleanup(), which may set these fields to nil
- Properly wait for health goroutine completion during cleanup

Signed-off-by: Carlos Eduardo Arango Gutierrez <eduardoa@nvidia.com>
Initialize healthCtx and healthCancel in devicePluginForResource()
instead of in initialize() to eliminate a race condition in which
ListAndWatch() might access healthCtx before initialization completes.

This addresses Elezar's concern about synchronization guarantees
and follows Go best practices for immutable initialization.

Refs: NVIDIA#1601
Task: 1/6

Co-authored-by: Evan Lezar <elezar@nvidia.com>
Signed-off-by: Carlos Eduardo Arango Gutierrez <eduardoa@nvidia.com>

Remove healthCtx and healthCancel initialization from initialize()
since they are now initialized at construction time in
devicePluginForResource().

This eliminates redundant initialization and ensures context
lifetime is tied to the plugin instance rather than the
start/stop cycle.

Refs: NVIDIA#1601
Task: 2/6
Signed-off-by: Carlos Eduardo Arango Gutierrez <eduardoa@nvidia.com>

Recreate healthCtx and healthCancel in cleanup() after cancellation
to support plugin restart. Remove nil assignments for these fields
as they need to persist across restart cycles.

This addresses Elezar's concern #2 about why we nil these fields:
we no longer do. The context is recreated fresh for each restart,
ensuring health checks work correctly when the plugin is restarted.

Fixes plugin restart blocker identified in architecture review.

Refs: NVIDIA#1601
Task: 3/6

Co-authored-by: Evan Lezar <elezar@nvidia.com>
Signed-off-by: Carlos Eduardo Arango Gutierrez <eduardoa@nvidia.com>
Close the health channel in cleanup() before setting it to nil to
prevent panics in ListAndWatch() if cleanup happens during a channel read.

The WaitGroup ensures the health goroutine has completed before
we close the channel, making this operation safe.

Fixes critical blocker: health channel was never closed, leading
to potential panics and resource leaks.

Refs: NVIDIA#1601
Task: 4/6
Signed-off-by: Carlos Eduardo Arango Gutierrez <eduardoa@nvidia.com>

Add ok check when receiving from health channel to gracefully
handle channel closure during cleanup. This prevents potential
panics and ensures clean shutdown when the plugin stops.

Reading from a closed channel returns the zero value and ok=false,
which we now check for and return gracefully.

Refs: NVIDIA#1601
Task: 5/6
Signed-off-by: Carlos Eduardo Arango Gutierrez <eduardoa@nvidia.com>
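
The two commits above fit together: the sender side closes the channel only after the WaitGroup confirms the writer has finished, and the receiver side uses the two-value receive to notice the closure. A minimal sketch, with `produceAndClose` and `drain` as hypothetical helper names:

```go
package main

import (
	"fmt"
	"sync"
)

// drain is a hypothetical stand-in for the receive loop in ListAndWatch.
// It returns as soon as the channel is closed (ok == false).
func drain(health <-chan string) []string {
	var seen []string
	for {
		d, ok := <-health
		if !ok {
			return seen // channel closed: shut down gracefully
		}
		seen = append(seen, d)
	}
}

// produceAndClose sends every device on a buffered channel, waits for the
// sending goroutine to finish, and only then closes the channel. Closing
// after Wait guarantees no send can hit a closed channel.
func produceAndClose(devices []string) <-chan string {
	health := make(chan string, len(devices))
	var wg sync.WaitGroup
	wg.Add(1)
	go func() {
		defer wg.Done()
		for _, d := range devices {
			health <- d
		}
	}()
	wg.Wait()
	close(health)
	return health
}

func main() {
	// Prints: [GPU-0 GPU-1]
	fmt.Println(drain(produceAndClose([]string{"GPU-0", "GPU-1"})))
}
```

Values buffered before the close are still delivered; only after they are consumed does the receive report ok=false.
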
Replace the simple if statement with a switch to distinguish between
three outcomes:
1. Success (nil) - log completion
2. Canceled (context.Canceled) - log clean shutdown at V(4)
3. Error (other) - log error

This provides better observability and distinguishes expected
shutdown from actual errors. Addresses Elezar's concern #3
about error handling improvements.

Task: 6/6

Co-authored-by: Evan Lezar <elezar@nvidia.com>
Signed-off-by: Carlos Eduardo Arango Gutierrez <eduardoa@nvidia.com>

2 participants