Conversation
Pull request overview
Refactors the device health-check lifecycle to use context.Context cancellation instead of a stop channel, and adds NVML mock infrastructure/tests to validate health-check behavior.
Changes:
- Switch `ResourceManager.CheckHealth` to `CheckHealth(ctx context.Context, unhealthy chan<- *Device)` and propagate through plugin + managers.
- Refactor NVML health monitoring into a provider (`nvmlHealthProvider`) and add tests covering XID handling and `ERROR_NOT_SUPPORTED` behavior.
- Vendor in NVML mock packages (including a dgxa100 mock server) to support deterministic health-check testing.
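For reference, a minimal sketch of the new context-based interface described in the first bullet above. The `Device` type is a simplified stand-in and the `error` return is assumed from the later discussion of error handling; this is not the PR's exact code.

```go
package rm

import "context"

// Device is a simplified stand-in for the plugin's device type.
type Device struct{ ID string }

// ResourceManager now takes a context instead of a stop channel; cancelling
// ctx stops the health-check loop, and unhealthy devices are reported on the
// write-only channel.
type ResourceManager interface {
	CheckHealth(ctx context.Context, unhealthy chan<- *Device) error
}
```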
Reviewed changes
Copilot reviewed 7 out of 20 changed files in this pull request and generated 1 comment.
| File | Description |
|---|---|
| internal/rm/rm.go | Updates ResourceManager interface to use context.Context for health checks. |
| internal/rm/nvml_manager.go | Adapts NVML RM CheckHealth to the new context-based signature. |
| internal/rm/tegra_manager.go | Updates Tegra RM CheckHealth signature (still effectively disabled). |
| internal/rm/health.go | Refactors health-check implementation into nvmlHealthProvider and a context-driven loop. |
| internal/rm/health_test.go | Adds tests using NVML mocks to validate event-driven health behavior. |
| internal/rm/rm_mock.go | Regenerates moq mock to match the new CheckHealth signature. |
| internal/plugin/server.go | Replaces stop channel with cancelable context + waitgroup for health goroutine lifecycle. |
| vendor/modules.txt | Adds vendored paths for newly included NVML mock packages. |
| vendor/github.com/NVIDIA/go-nvml/pkg/nvml/mock/*.go | Adds generated moq mocks for NVML interfaces used in tests. |
| vendor/github.com/NVIDIA/go-nvml/pkg/nvml/mock/dgxa100/mig-profile.go | Adds MIG profile/placement data for the dgxa100 mock server. |
| vendor/github.com/NVIDIA/go-nvml/pkg/nvml/mock/dgxa100/dgxa100.go | Adds dgxa100 NVML mock server used by health-check tests. |
Force-pushed from ad1263e to c352a5a.
elezar left a comment
Thanks @ArangoGutierrez. I have a minor question regarding the cleaning up of the context.
internal/plugin/server.go
Outdated
```go
// Capture references at start to avoid race with cleanup() which may nil these fields.
healthCtx := plugin.healthCtx
health := plugin.health
```
Do we need to add synchronisation to ensure that these are actually initialised by the time that ListAndWatch is called, or is it guaranteed that this will only be called after the context has been initialized? Looking at the cleanup code, it is not clear why we would set plugin.healthCtx to nil there.
Note that if we were to pass the context in to the server construction, then we would not have to do this since the lifetime would be controlled at the caller and we would not set the context to nil in plugin cleanup.
Addressed: healthCtx and healthCancel are now created in devicePluginForResource().
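For illustration, a minimal sketch of construction-time initialization as described here. The struct is heavily simplified and devicePluginForResource normally takes more arguments; this is a sketch of the pattern, not the PR's code.

```go
package plugin

import (
	"context"
	"sync"
)

type nvidiaDevicePlugin struct {
	ctx          context.Context
	healthCtx    context.Context
	healthCancel context.CancelFunc
	healthWg     sync.WaitGroup
}

// devicePluginForResource derives the health-check context once, at
// construction time, so ListAndWatch can never observe a nil healthCtx.
func devicePluginForResource(ctx context.Context) *nvidiaDevicePlugin {
	plugin := &nvidiaDevicePlugin{ctx: ctx}
	plugin.healthCtx, plugin.healthCancel = context.WithCancel(ctx)
	return plugin
}
```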
internal/plugin/server.go
Outdated
```go
plugin.healthCtx = nil
plugin.healthCancel = nil
```
Why are we setting these to nil now?
Removed; the fields are now preserved (not niled) to support restart.
internal/plugin/server.go
Outdated
```go
if err != nil && !errors.Is(err, context.Canceled) {
	klog.Errorf("Failed to start health check: %v; continuing with health checks disabled", err)
}
```
What about logging the happy path case? (Also, in which cases will we return context.Canceled?)
Suggested change:
```go
switch {
case err == nil:
	fallthrough
case errors.Is(err, context.Canceled):
	klog.Infof("Stopping health check")
case err != nil:
	klog.Errorf("Failed to start health check: %v; continuing with health checks disabled", err)
}
```
Initialize healthCtx and healthCancel in devicePluginForResource() instead of in initialize() to eliminate race condition where ListAndWatch() might access healthCtx before initialization completes. This addresses Elezar's concern about synchronization guarantees and follows Go best practices for immutable initialization. Refs: NVIDIA#1601 Task: 1/6 Co-authored-by: Evan Lezar <elezar@nvidia.com> Signed-off-by: Carlos Eduardo Arango Gutierrez <eduardoa@nvidia.com>
Remove healthCtx and healthCancel initialization from initialize() since they are now initialized at construction time in devicePluginForResource(). This eliminates redundant initialization and ensures context lifetime is tied to the plugin instance rather than the start/stop cycle. Refs: NVIDIA#1601 Task: 2/6
Recreate healthCtx and healthCancel in cleanup() after cancellation to support plugin restart. Remove nil assignments for these fields as they need to persist across restart cycles. This addresses Elezar's concern NVIDIA#2 about why we nil these fields - we no longer do. The context is recreated fresh for each restart, ensuring health checks work correctly when the plugin is restarted. Fixes plugin restart blocker identified in architecture review. Refs: NVIDIA#1601 Task: 3/6 Co-authored-by: Evan Lezar <elezar@nvidia.com> Signed-off-by: Carlos Eduardo Arango Gutierrez <eduardoa@nvidia.com>
Close the health channel in cleanup() before niling to prevent panics in ListAndWatch() if cleanup happens during channel read. The WaitGroup ensures the health goroutine has completed before we close the channel, making this operation safe. Fixes critical blocker: health channel was never closed, leading to potential panics and resource leaks. Refs: NVIDIA#1601 Task: 4/6
Add ok check when receiving from health channel to gracefully handle channel closure during cleanup. This prevents potential panics and ensures clean shutdown when the plugin stops. Reading from a closed channel returns zero value and ok=false, which we now check and return gracefully. Refs: NVIDIA#1601 Task: 5/6
Force-pushed from c352a5a to 7a0f2aa.
Initialize healthCtx and healthCancel in devicePluginForResource() instead of in initialize() to eliminate race condition where ListAndWatch() might access healthCtx before initialization completes. This addresses Elezar's concern about synchronization guarantees and follows Go best practices for immutable initialization. Refs: NVIDIA#1601 Task: 1/6 Co-authored-by: Evan Lezar <elezar@nvidia.com> Signed-off-by: Carlos Eduardo Arango Gutierrez <eduardoa@nvidia.com>
Remove healthCtx and healthCancel initialization from initialize() since they are now initialized at construction time in devicePluginForResource(). This eliminates redundant initialization and ensures context lifetime is tied to the plugin instance rather than the start/stop cycle. Refs: NVIDIA#1601 Task: 2/6 Signed-off-by: Carlos Eduardo Arango Gutierrez <eduardoa@nvidia.com>
Recreate healthCtx and healthCancel in cleanup() after cancellation to support plugin restart. Remove nil assignments for these fields as they need to persist across restart cycles. This addresses Elezar's concern NVIDIA#2 about why we nil these fields - we no longer do. The context is recreated fresh for each restart, ensuring health checks work correctly when the plugin is restarted. Fixes plugin restart blocker identified in architecture review. Refs: NVIDIA#1601 Task: 3/6 Co-authored-by: Evan Lezar <elezar@nvidia.com> Signed-off-by: Carlos Eduardo Arango Gutierrez <eduardoa@nvidia.com>
Force-pushed from 7a0f2aa to 14c5914.
Close the health channel in cleanup() before niling to prevent panics in ListAndWatch() if cleanup happens during channel read. The WaitGroup ensures the health goroutine has completed before we close the channel, making this operation safe. Fixes critical blocker: health channel was never closed, leading to potential panics and resource leaks. Refs: NVIDIA#1601 Task: 4/6 Signed-off-by: Carlos Eduardo Arango Gutierrez <eduardoa@nvidia.com>
Add ok check when receiving from health channel to gracefully handle channel closure during cleanup. This prevents potential panics and ensures clean shutdown when the plugin stops. Reading from a closed channel returns zero value and ok=false, which we now check and return gracefully. Refs: NVIDIA#1601 Task: 5/6 Signed-off-by: Carlos Eduardo Arango Gutierrez <eduardoa@nvidia.com>
internal/plugin/server.go
Outdated
```go
// healthCtx and healthCancel control the health check goroutine lifecycle.
// healthWg is used to wait for the health check goroutine to complete during cleanup.
healthCtx    context.Context
healthCancel context.CancelFunc
healthWg     sync.WaitGroup
```
Suggested change:
```go
// healthCtx and healthCancel control the health check goroutine lifecycle.
healthCtx    context.Context
healthCancel context.CancelFunc
// healthWg is used to wait for the health check goroutine to complete during cleanup.
healthWg     sync.WaitGroup
```
Thanks, will implement.
```go
// nvidiaDevicePlugin implements the Kubernetes device plugin API
type nvidiaDevicePlugin struct {
	pluginapi.UnimplementedDevicePluginServer
	ctx context.Context
```
Question: Why do we need a second context?
So we can cancel the health checks independently.
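For illustration only (not the PR's code): a tiny runnable sketch showing that cancelling a context derived from the plugin context stops only the health checks, while the parent plugin context stays alive.

```go
package main

import (
	"context"
	"fmt"
)

func main() {
	// Parent context drives the plugin as a whole.
	pluginCtx, pluginCancel := context.WithCancel(context.Background())
	defer pluginCancel()

	// Derived context drives only the health-check goroutine.
	healthCtx, healthCancel := context.WithCancel(pluginCtx)

	healthCancel() // stop health checks...
	// ...without stopping the plugin: prints "true true".
	fmt.Println(healthCtx.Err() != nil, pluginCtx.Err() == nil)
}
```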
internal/plugin/server.go
Outdated
```go
// Recreate context for potential plugin restart. The same plugin instance
// may be restarted via Start() after Stop(), so we need a fresh context.
plugin.healthCtx, plugin.healthCancel = context.WithCancel(plugin.ctx)
```
When are we ever setting these to nil now? This should ONLY be done in the constructor.
Initialize healthCtx and healthCancel in devicePluginForResource() instead of in initialize() to eliminate race condition where ListAndWatch() might access healthCtx before initialization completes. This addresses Elezar's concern about synchronization guarantees and follows Go best practices for immutable initialization. Refs: NVIDIA#1601 Task: 1/6 Co-authored-by: Evan Lezar <elezar@nvidia.com> Signed-off-by: Carlos Eduardo Arango Gutierrez <eduardoa@nvidia.com>
Remove healthCtx and healthCancel initialization from initialize() since they are now initialized at construction time in devicePluginForResource(). This eliminates redundant initialization and ensures context lifetime is tied to the plugin instance rather than the start/stop cycle. Refs: NVIDIA#1601 Task: 2/6 Signed-off-by: Carlos Eduardo Arango Gutierrez <eduardoa@nvidia.com>
Recreate healthCtx and healthCancel in cleanup() after cancellation to support plugin restart. Remove nil assignments for these fields as they need to persist across restart cycles. This addresses Elezar's concern NVIDIA#2 about why we nil these fields - we no longer do. The context is recreated fresh for each restart, ensuring health checks work correctly when the plugin is restarted. Fixes plugin restart blocker identified in architecture review. Refs: NVIDIA#1601 Task: 3/6 Co-authored-by: Evan Lezar <elezar@nvidia.com> Signed-off-by: Carlos Eduardo Arango Gutierrez <eduardoa@nvidia.com>
Force-pushed from 5a400f5 to f5db035.
Add ok check when receiving from health channel to gracefully handle channel closure during cleanup. This prevents potential panics and ensures clean shutdown when the plugin stops. Reading from a closed channel returns zero value and ok=false, which we now check and return gracefully. Refs: NVIDIA#1601 Task: 5/6 Signed-off-by: Carlos Eduardo Arango Gutierrez <eduardoa@nvidia.com>
```diff
 eventMask := uint64(nvml.EventTypeXidCriticalError | nvml.EventTypeDoubleBitEccError | nvml.EventTypeSingleBitEccError)
 for _, d := range devices {
-	uuid, gi, ci, err := r.getDevicePlacement(d)
+	uuid, gi, ci, err := (&withDevicePlacements{r.nvml}).getDevicePlacement(d)
```
Could you provide some context as to why we landed on this change / refactor?
The goal of this refactor is to bring the DevicePlugin to a similar state as our DRA driver after @guptaNswati's health check work. This is in order to prepare both DRA and the DevicePlugin for NVsentinel integration via the Device-API.
Let's add a detailed comment on how it will help; I also had to look into it. Right now, when we are only using NVML, it does not make much sense to decouple the DevicePlacement logic from NVMLResourceManager. We need to note how this placement code can be reused with multiple providers, not just NVML.
Going from:
```go
type nvmlResourceManager struct {
	resourceManager
	nvml nvml.Interface
}
```
to having a new wrapper:
```go
type withDevicePlacements struct {
	nvml.Interface
}
```
internal/rm/health.go
Outdated
```diff
@@ -107,14 +137,16 @@ func (r *nvmlResourceManager) checkHealth(stop <-chan interface{}, devices Devic
 	klog.Warningf("Device %v is too old to support healthchecking.", d.ID)
 case ret != nvml.SUCCESS:
 	klog.Infof("Marking device %v as unhealthy: %v", d.ID, ret)
```
nit: could we update the log message here to indicate why we are marking the device unhealthy? (we failed during NVML event registration)
+1. On DRA it's `"unable to register events for %s: %v; marking it as unhealthy", d.ID, ret`.
AGENTS.md
Outdated
Could you explain how you are using this file / how it was generated? If you used a coding agent, could you acknowledge that in the PR description and describe how you used it?
This file needs to be removed before we can merge this PR.
internal/plugin/server.go
Outdated
```go
	return nil
}
// We stop the health checks
plugin.healthCancel()
```
nit: add a nil check:
```go
if plugin.healthCancel != nil {
	plugin.healthCancel()
}
```
internal/plugin/server.go
Outdated
```go
}
klog.Infof("Registered device plugin for '%s' with Kubelet", plugin.rm.Resource())

plugin.health = make(chan *rm.Device)
```
should these be initialized before registering the plugin?
internal/plugin/server.go
Outdated
```go
plugin.healthCancel()
plugin.healthWg.Wait()
// Close the health channel after waiting for the health check goroutine to terminate.
close(plugin.health)
```
nil check here also:
```go
if plugin.health != nil {
	close(plugin.health)
	plugin.health = nil
}
```
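Taken together with the snippet above, a hedged sketch of the Stop()-side ordering (simplified struct and device type; not the PR's exact code): cancel first, wait for the goroutine, then close and reset the channel.

```go
package plugin

import (
	"context"
	"sync"
)

// device is a simplified stand-in for rm.Device.
type device struct{ ID string }

type nvidiaDevicePlugin struct {
	healthCancel context.CancelFunc
	healthWg     sync.WaitGroup
	health       chan *device
}

func (plugin *nvidiaDevicePlugin) cleanup() {
	// 1. Ask the health goroutine to stop.
	if plugin.healthCancel != nil {
		plugin.healthCancel()
	}
	// 2. Wait until it has actually returned, so nothing can still send.
	plugin.healthWg.Wait()
	// 3. Only now is it safe to close (and reset) the channel.
	if plugin.health != nil {
		close(plugin.health)
		plugin.health = nil
	}
}
```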
internal/plugin/server.go
Outdated
```diff
-case d := <-plugin.health:
+case d, ok := <-plugin.health:
+	if !ok {
+		// Health channel closed, health checks stopped
```
I would log here also; it helps in debugging.
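A minimal sketch of what the receive side could look like with the suggested logging (stream handling and types are simplified stand-ins; not the PR's exact code):

```go
package plugin

import (
	"context"

	"k8s.io/klog/v2"
)

// device is a simplified stand-in for rm.Device.
type device struct{ ID string }

// watchHealth is a simplified stand-in for the ListAndWatch receive loop.
func watchHealth(ctx context.Context, health <-chan *device) {
	for {
		select {
		case <-ctx.Done():
			klog.Infof("ListAndWatch exiting: %v", ctx.Err())
			return
		case d, ok := <-health:
			if !ok {
				// Health channel closed during cleanup; log it so shutdown is visible.
				klog.Infof("Health channel closed; stopping health watch")
				return
			}
			klog.Infof("Device %s marked unhealthy", d.ID)
		}
	}
}
```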
```go
	_ = eventSet.Free()
}()

// Construct the device maps.
```
Oh, you already called it out. Yes, we should refactor it into helpers.
```go
	envEnableHealthChecks = "DP_ENABLE_HEALTHCHECKS"
)

type nvmlHealthProvider struct {
```
Why not nvmlDeviceHealthProvider?
```diff
 // getMigDeviceParts returns the parent GI and CI ids of the MIG device.
-func (r *nvmlResourceManager) getMigDeviceParts(d *Device) (string, uint32, uint32, error) {
+func (r *withDevicePlacements) getMigDeviceParts(d *Device) (string, uint32, uint32, error) {
```
nit: can we call it dp or d instead of r
```go
type nvmlHealthProvider struct {
	nvmllib           nvml.Interface
	devices           Devices
	parentToDeviceMap map[string]*Device
```
Can we make it a type, similar to `type devicePlacementMap map[string]map[uint32]map[uint32]*Device`?
https://github.com/NVIDIA/k8s-dra-driver-gpu/blob/main/cmd/gpu-kubelet-plugin/device_health.go#L36
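A hedged sketch of the nested-map shape being suggested, mirroring the DRA type linked above; the Device type and the insert helper are illustrative stand-ins, not code from either repository.

```go
package rm

// Device is a simplified stand-in for the plugin's device type.
type Device struct{ ID string }

// devicePlacementMap maps parent GPU UUID -> GPU instance ID -> compute
// instance ID -> device, so several MIG devices that share a parent GPU can
// all be resolved from one event.
type devicePlacementMap map[string]map[uint32]map[uint32]*Device

func (m devicePlacementMap) insert(uuid string, gi, ci uint32, d *Device) {
	if m[uuid] == nil {
		m[uuid] = make(map[uint32]map[uint32]*Device)
	}
	if m[uuid][gi] == nil {
		m[uuid][gi] = make(map[uint32]*Device)
	}
	m[uuid][gi][ci] = d
}
```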
internal/rm/health.go
Outdated
```diff
 }

-	gpu, ret := r.nvml.DeviceGetHandleByUUID(uuid)
+func (r *nvmlHealthProvider) registerDeviceEvents(eventSet nvml.EventSet) {
```
```go
}

p := nvmlHealthProvider{
	nvmllib: r.nvml,
```
is this needed for tests?
@ArangoGutierrez why do we need to have the lib here? nvmlResourceManager.nvml is used everywhere
Sorry for the delayed review. @ArangoGutierrez thank you for your patience. I looked at it from the perspective of what we did for DRA health checks, as the goal is to align with DRA and integrate with the device-api. I have minor nits but overall the refactor LGTM. I still want to spend some more time on reading and trying the mock tests.
Force-pushed from f5db035 to 4a84232.
Add vendored NVML mock packages from go-nvml including the dgxa100 mock server. These mocks enable deterministic unit testing of NVML health check behavior without requiring actual GPU hardware. Signed-off-by: Carlos Eduardo Arango Gutierrez <eduardoa@nvidia.com>
Refactor health check implementation:
- Extract health monitoring logic into nvmlHealthProvider struct with registerDeviceEvents() and runEventMonitor() methods, making health checks testable in isolation
- Migrate CheckHealth interface from stop channel to context.Context for idiomatic cancellation
- Extract device placement logic into withDevicePlacements struct to decouple from resource manager
- Clarify log message for event registration failure to explain why device is marked unhealthy (aligns with DRA driver convention)
- Add unit tests for XID event handling and ERROR_NOT_SUPPORTED regression using NVML mock infrastructure
Signed-off-by: Carlos Eduardo Arango Gutierrez <eduardoa@nvidia.com>
Replace the stop channel with context.Context cancellation and sync.WaitGroup for health check goroutine lifecycle management:
- Initialize health channel and healthCtx before Serve() to eliminate race where kubelet could call ListAndWatch before fields are set
- Use sync.WaitGroup to ensure health goroutine completes before channel close in Stop()
- Add nil guards in Stop() to handle partial Start() failure safely
- Add structured error handling for health check completion (success, canceled, error cases)
- Add debug logging in ListAndWatch for context cancellation and channel closure
- Fix updateResponseForMPS receiver from value to pointer
Signed-off-by: Carlos Eduardo Arango Gutierrez <eduardoa@nvidia.com> Co-authored-by: Evan Lezar <elezar@nvidia.com>
Force-pushed from 4a84232 to d6c787c.
I have addressed all your comments @guptaNswati / @cdesiniotis PTAL
Thank you. I didn't get a chance to test it. Can you please paste the test logs?
Unfortunately not; I didn't save the logs, and the cluster I used is down.
Pull request overview
Copilot reviewed 7 out of 20 changed files in this pull request and generated 4 comments.
```go
	return fmt.Errorf("error waiting for MPS daemon: %w", err)
}

plugin.health = make(chan *rm.Device)
```
plugin.health is created as an unbuffered channel and the health-check goroutine can send to it even when no ListAndWatch stream is active. Since Stop() now cancels the context and then waits for the health goroutine to exit, a blocked send on plugin.health can deadlock Stop() (context cancellation won’t unblock a channel send). Consider making the channel buffered (e.g., sized to number of devices) and/or ensuring all sends in health-check code are non-blocking/select on ctx.Done().
Suggested change:
```go
// Use a buffered channel for health notifications to avoid blocking when
// no ListAndWatch stream is currently consuming from plugin.health.
healthBufSize := len(plugin.rm.Devices())
if healthBufSize < 1 {
	healthBufSize = 1
}
plugin.health = make(chan *rm.Device, healthBufSize)
```
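One way to make the sends themselves context-aware, as the comment above also suggests. This is a sketch under the assumption of a stand-in Device type, not the PR's code:

```go
package rm

import "context"

// Device is a simplified stand-in for the plugin's device type.
type Device struct{ ID string }

// notifyUnhealthy sends d on the unhealthy channel unless ctx is cancelled
// first, so a blocked send can never prevent shutdown.
func notifyUnhealthy(ctx context.Context, unhealthy chan<- *Device, d *Device) bool {
	select {
	case unhealthy <- d:
		return true
	case <-ctx.Done():
		return false
	}
}
```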
```go
err := plugin.Serve()
if err != nil {
	klog.Errorf("Could not start device plugin for '%s': %s", plugin.rm.Resource(), err)
	plugin.cleanup()
	return err
}
```
On Serve() failure, Start() returns immediately after creating plugin.health and plugin.healthCtx but does not cancel/close them. If the same plugin instance is retried or kept around, this leaks resources and leaves fields in a partially-initialized state. Consider deferring a cleanup (cancel + close channel) on the Serve() error path.
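A hedged sketch of the deferred-cleanup shape being suggested. Field names follow the discussion above, Serve is passed in as a plain function purely so the sketch compiles on its own, and the teardown details are illustrative:

```go
package plugin

import (
	"context"
	"fmt"
)

// device is a simplified stand-in for rm.Device.
type device struct{ ID string }

type nvidiaDevicePlugin struct {
	health       chan *device
	healthCancel context.CancelFunc
}

// start sketches the deferred-cleanup pattern: if serve fails, the health
// context and channel created earlier in Start() are torn down before
// returning, so a retried Start() begins from a clean state.
func (plugin *nvidiaDevicePlugin) start(parent context.Context, serve func() error) (rerr error) {
	var healthCtx context.Context
	healthCtx, plugin.healthCancel = context.WithCancel(parent)
	_ = healthCtx // handed to the health goroutine in the real code
	plugin.health = make(chan *device)

	defer func() {
		if rerr != nil {
			plugin.healthCancel()
			close(plugin.health)
			plugin.health = nil
		}
	}()

	if err := serve(); err != nil {
		return fmt.Errorf("could not start device plugin: %w", err)
	}
	return nil
}
```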
```go
deviceIDToGiMap[d.ID] = gi
deviceIDToCiMap[d.ID] = ci
parentToDeviceMap[uuid] = d
```
parentToDeviceMap is keyed by uuid returned from getDevicePlacement(), which is the parent UUID for MIG devices. If multiple MIG devices share the same parent GPU, this assignment overwrites earlier entries, so events for that parent can only ever map to one MIG device. To correctly handle multiple MIG instances per parent, use a mapping that can represent multiple devices per parent (e.g., parent UUID -> list, or parent UUID -> (GI,CI)->device).
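A hedged sketch of the parent-UUID-to-list option mentioned above (types and helpers are illustrative stand-ins, not the PR's code):

```go
package rm

// Device is a simplified stand-in for the plugin's device type.
type Device struct{ ID string }

// parentToDevices maps a parent GPU UUID to every MIG device backed by that
// GPU, so an event on the parent can fan out to all of its MIG instances
// instead of only the last one inserted.
type parentToDevices map[string][]*Device

func (m parentToDevices) add(parentUUID string, d *Device) {
	m[parentUUID] = append(m[parentUUID], d)
}

func (m parentToDevices) markAllUnhealthy(parentUUID string, unhealthy chan<- *Device) {
	for _, d := range m[parentUUID] {
		unhealthy <- d
	}
}
```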
```diff
 if ret != nvml.SUCCESS {
 	klog.Infof("Error waiting for event: %v; Marking all devices as unhealthy", ret)
-	for _, d := range devices {
-		unhealthy <- d
+	for _, d := range h.devices {
+		h.unhealthy <- d
 	}
 	continue
```
When eventSet.Wait() returns a non-timeout error, the loop marks all devices unhealthy and continues. If the error is persistent, this can spin forever and repeatedly attempt blocking sends to h.unhealthy, preventing shutdown if the receiver isn’t draining fast enough. Consider returning the error (or backing off) and make unhealthy notifications context-aware (select on ctx.Done()) to avoid blocking indefinitely.
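A hedged sketch of the backoff-and-bail shape being suggested, with context-aware notifications; waitForEvent stands in for eventSet.Wait, the Device type is simplified, and the failure threshold is arbitrary:

```go
package rm

import (
	"context"
	"fmt"
	"time"
)

// Device is a simplified stand-in for the plugin's device type.
type Device struct{ ID string }

// runEventMonitor marks devices unhealthy without ever blocking past
// cancellation, backs off on repeated wait errors, and eventually returns an
// error instead of spinning forever.
func runEventMonitor(ctx context.Context, devices []*Device, unhealthy chan<- *Device, waitForEvent func() error) error {
	const maxFailures = 5
	failures := 0
	for {
		if err := ctx.Err(); err != nil {
			return err
		}
		werr := waitForEvent()
		if werr == nil {
			failures = 0
			// ... decode the event and notify only the affected device here ...
			continue
		}

		// Mark everything unhealthy, but never block past cancellation.
		for _, d := range devices {
			select {
			case unhealthy <- d:
			case <-ctx.Done():
				return ctx.Err()
			}
		}

		failures++
		if failures >= maxFailures {
			return fmt.Errorf("event wait failed %d times in a row: %w", failures, werr)
		}
		select {
		case <-ctx.Done():
			return ctx.Err()
		case <-time.After(time.Duration(failures) * time.Second): // simple backoff
		}
	}
}
```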
I know these are straightforward changes and you must have tested them very well. I feel more comfortable when I see the test logs, so I will try to set up a cluster next week and see if we can test them.