WIP: refactor Start and Stop. by elezar · Pull Request #1607 · NVIDIA/k8s-device-plugin

elezar · 2026-01-29T11:13:55Z

No description provided.

Add NVML mock packages from go-nvml for use in health check tests. This includes the dgxa100 mock server for simulating GPU hardware. Signed-off-by: Carlos Eduardo Arango Gutierrez <eduardoa@nvidia.com>

Refactor the health monitoring implementation: - Extract nvmlHealthProvider struct to encapsulate health check state - Add registerDeviceEvents() and runEventMonitor() methods - Rename xids field to xidsDisabled for clarity - Migrate from stop channel to context.Context for cancellation Add comprehensive tests: - TestCheckHealth validates XID event handling with mocks - TestRegisterDeviceEventsNotSupported ensures old GPUs returning ERROR_NOT_SUPPORTED are not incorrectly marked unhealthy Signed-off-by: Carlos Eduardo Arango Gutierrez <eduardoa@nvidia.com>

Update ResourceManager interface and implementations to use context.Context for cancellation instead of a stop channel. This is more idiomatic Go and allows for better lifecycle control. Signed-off-by: Carlos Eduardo Arango Gutierrez <eduardoa@nvidia.com>

Replace stop channel with context-based cancellation for health checks: - Add healthCtx, healthCancel, and healthWg for lifecycle management - Capture context/channel references in ListAndWatch to avoid race with cleanup() which may nil these fields - Properly wait for health goroutine completion during cleanup Signed-off-by: Carlos Eduardo Arango Gutierrez <eduardoa@nvidia.com>

Initialize healthCtx and healthCancel in devicePluginForResource() instead of in initialize() to eliminate race condition where ListAndWatch() might access healthCtx before initialization completes. This addresses Elezar's concern about synchronization guarantees and follows Go best practices for immutable initialization. Refs: NVIDIA#1601 Task: 1/6 Co-authored-by: Evan Lezar <elezar@nvidia.com> Signed-off-by: Carlos Eduardo Arango Gutierrez <eduardoa@nvidia.com>

Remove healthCtx and healthCancel initialization from initialize() since they are now initialized at construction time in devicePluginForResource(). This eliminates redundant initialization and ensures context lifetime is tied to the plugin instance rather than the start/stop cycle. Refs: NVIDIA#1601 Task: 2/6 Signed-off-by: Carlos Eduardo Arango Gutierrez <eduardoa@nvidia.com>

Recreate healthCtx and healthCancel in cleanup() after cancellation to support plugin restart. Remove nil assignments for these fields as they need to persist across restart cycles. This addresses Elezar's concern NVIDIA#2 about why we nil these fields - we no longer do. The context is recreated fresh for each restart, ensuring health checks work correctly when the plugin is restarted. Fixes plugin restart blocker identified in architecture review. Refs: NVIDIA#1601 Task: 3/6 Co-authored-by: Evan Lezar <elezar@nvidia.com> Signed-off-by: Carlos Eduardo Arango Gutierrez <eduardoa@nvidia.com>

Close the health channel in cleanup() before niling to prevent panics in ListAndWatch() if cleanup happens during channel read. The WaitGroup ensures the health goroutine has completed before we close the channel, making this operation safe. Fixes critical blocker: health channel was never closed, leading to potential panics and resource leaks. Refs: NVIDIA#1601 Task: 4/6 Signed-off-by: Carlos Eduardo Arango Gutierrez <eduardoa@nvidia.com>

Add ok check when receiving from health channel to gracefully handle channel closure during cleanup. This prevents potential panics and ensures clean shutdown when the plugin stops. Reading from a closed channel returns zero value and ok=false, which we now check and return gracefully. Refs: NVIDIA#1601 Task: 5/6 Signed-off-by: Carlos Eduardo Arango Gutierrez <eduardoa@nvidia.com>

Replace simple if statement with switch to distinguish between three error cases: 1. Success (nil) - log completion 2. Canceled (context.Canceled) - log clean shutdown at V(4) 3. Error (other) - log error This provides better observability and distinguishes expected shutdown from actual errors. Addresses Elezar's concern NVIDIA#3 about error handling improvements. Task: 6/6 Co-authored-by: Evan Lezar <elezar@nvidia.com> Signed-off-by: Carlos Eduardo Arango Gutierrez <eduardoa@nvidia.com>

elezar and others added 10 commits January 28, 2026 12:27

chore(deps): vendor nvml mock packages

786ee5b

Add NVML mock packages from go-nvml for use in health check tests. This includes the dgxa100 mock server for simulating GPU hardware. Signed-off-by: Carlos Eduardo Arango Gutierrez <eduardoa@nvidia.com>

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

WIP: refactor Start and Stop.#1607

WIP: refactor Start and Stop.#1607
elezar wants to merge 10 commits intoNVIDIA:mainfrom
elezar:healthcheck

elezar commented Jan 29, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

elezar commented Jan 29, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants