Refactor health checks #1601

Open

ArangoGutierrez wants to merge 3 commits into NVIDIA:main from ArangoGutierrez:healthcheck

Conversation

@ArangoGutierrez
Collaborator

No description provided.


Copilot AI left a comment


Pull request overview

Refactors the device health-check lifecycle to use context.Context cancellation instead of a stop channel, and adds NVML mock infrastructure/tests to validate health-check behavior.

Changes:

  • Switch ResourceManager.CheckHealth to CheckHealth(ctx context.Context, unhealthy chan<- *Device) and propagate through plugin + managers.
  • Refactor NVML health monitoring into a provider (nvmlHealthProvider) and add tests covering XID handling and ERROR_NOT_SUPPORTED behavior.
  • Vendor in NVML mock packages (including a dgxa100 mock server) to support deterministic health-check testing.

Reviewed changes

Copilot reviewed 7 out of 20 changed files in this pull request and generated 1 comment.

File Description
internal/rm/rm.go Updates ResourceManager interface to use context.Context for health checks.
internal/rm/nvml_manager.go Adapts NVML RM CheckHealth to the new context-based signature.
internal/rm/tegra_manager.go Updates Tegra RM CheckHealth signature (still effectively disabled).
internal/rm/health.go Refactors health-check implementation into nvmlHealthProvider and context-driven loop.
internal/rm/health_test.go Adds tests using NVML mocks to validate event-driven health behavior.
internal/rm/rm_mock.go Regenerates moq mock to match the new CheckHealth signature.
internal/plugin/server.go Replaces stop channel with cancelable context + waitgroup for health goroutine lifecycle.
vendor/modules.txt Adds vendored paths for newly included NVML mock packages.
vendor/github.com/NVIDIA/go-nvml/pkg/nvml/mock/*.go Adds generated moq mocks for NVML interfaces used in tests.
vendor/github.com/NVIDIA/go-nvml/pkg/nvml/mock/dgxa100/mig-profile.go Adds MIG profile/placement data for the dgxa100 mock server.
vendor/github.com/NVIDIA/go-nvml/pkg/nvml/mock/dgxa100/dgxa100.go Adds dgxa100 NVML mock server used by health-check tests.


Member

@elezar elezar left a comment


Thanks @ArangoGutierrez. I have a minor question regarding the cleaning up of the context.

Comment on lines 279 to 281
// Capture references at start to avoid race with cleanup() which may nil these fields.
healthCtx := plugin.healthCtx
health := plugin.health
Member


Do we need to add synchronisation to ensure that these are actually initialised by the time that ListAndWatch is called, or is it guaranteed that this will be called only after the context has been initialized. Looking at the cleanup code, it is not clear why we would set plugin.healthCtx to nil there.

Note that if we were to pass the context in to the server construction, then we would not have to do this since the lifetime would be controlled at the caller and we would not set the context to nil in plugin cleanup.

Collaborator Author


Addressed: healthCtx and healthCancel are now created in devicePluginForResource().

Comment on lines 127 to 128
plugin.healthCtx = nil
plugin.healthCancel = nil
Member


Why are we setting these to nil now?

Copy link
Collaborator Author

@ArangoGutierrez ArangoGutierrez Jan 28, 2026


Removed. The fields are now preserved (not set to nil) to support restart.

Comment on lines 165 to 167
if err != nil && !errors.Is(err, context.Canceled) {
klog.Errorf("Failed to start health check: %v; continuing with health checks disabled", err)
}
Member


What about logging the happy-path case? (Also, in which cases will we return context.Canceled?)

Suggested change

    if err != nil && !errors.Is(err, context.Canceled) {
        klog.Errorf("Failed to start health check: %v; continuing with health checks disabled", err)
    }

    switch {
    case err == nil:
        fallthrough
    case errors.Is(err, context.Canceled):
        klog.Infof("Stopping health check")
    case err != nil:
        klog.Errorf("Failed to start health check: %v; continuing with health checks disabled", err)
    }

Collaborator Author


Addressed

ArangoGutierrez added a commit to ArangoGutierrez/k8s-device-plugin that referenced this pull request Jan 28, 2026
Initialize healthCtx and healthCancel in devicePluginForResource()
instead of in initialize() to eliminate race condition where
ListAndWatch() might access healthCtx before initialization completes.

This addresses Elezar's concern about synchronization guarantees
and follows Go best practices for immutable initialization.

Refs: NVIDIA#1601
Task: 1/6

Co-authored-by: Evan Lezar <elezar@nvidia.com>
Signed-off-by: Carlos Eduardo Arango Gutierrez <eduardoa@nvidia.com>
ArangoGutierrez added a commit to ArangoGutierrez/k8s-device-plugin that referenced this pull request Jan 28, 2026
Remove healthCtx and healthCancel initialization from initialize()
since they are now initialized at construction time in
devicePluginForResource().

This eliminates redundant initialization and ensures context
lifetime is tied to the plugin instance rather than the
start/stop cycle.

Refs: NVIDIA#1601
Task: 2/6
ArangoGutierrez added a commit to ArangoGutierrez/k8s-device-plugin that referenced this pull request Jan 28, 2026
Recreate healthCtx and healthCancel in cleanup() after cancellation
to support plugin restart. Remove nil assignments for these fields
as they need to persist across restart cycles.

This addresses Elezar's concern NVIDIA#2 about why we nil these fields -
we no longer do. The context is recreated fresh for each restart,
ensuring health checks work correctly when the plugin is restarted.

Fixes plugin restart blocker identified in architecture review.

Refs: NVIDIA#1601
Task: 3/6

Co-authored-by: Evan Lezar <elezar@nvidia.com>
Signed-off-by: Carlos Eduardo Arango Gutierrez <eduardoa@nvidia.com>
ArangoGutierrez added a commit to ArangoGutierrez/k8s-device-plugin that referenced this pull request Jan 28, 2026
Close the health channel in cleanup() before niling to prevent
panics in ListAndWatch() if cleanup happens during channel read.

The WaitGroup ensures the health goroutine has completed before
we close the channel, making this operation safe.

Fixes critical blocker: health channel was never closed, leading
to potential panics and resource leaks.

Refs: NVIDIA#1601
Task: 4/6
ArangoGutierrez added a commit to ArangoGutierrez/k8s-device-plugin that referenced this pull request Jan 28, 2026
Add ok check when receiving from health channel to gracefully
handle channel closure during cleanup. This prevents potential
panics and ensures clean shutdown when the plugin stops.

Reading from a closed channel returns zero value and ok=false,
which we now check and return gracefully.

Refs: NVIDIA#1601
Task: 5/6
Comment on lines 67 to 71
// healthCtx and healthCancel control the health check goroutine lifecycle.
// healthWg is used to wait for the health check goroutine to complete during cleanup.
healthCtx context.Context
healthCancel context.CancelFunc
healthWg sync.WaitGroup
Member


Suggested change

    // healthCtx and healthCancel control the health check goroutine lifecycle.
    // healthWg is used to wait for the health check goroutine to complete during cleanup.
    healthCtx    context.Context
    healthCancel context.CancelFunc
    healthWg     sync.WaitGroup

    // healthCtx and healthCancel control the health check goroutine lifecycle.
    healthCtx    context.Context
    healthCancel context.CancelFunc
    // healthWg is used to wait for the health check goroutine to complete during cleanup.
    healthWg     sync.WaitGroup

Collaborator Author


Thanks, will implement.

// nvidiaDevicePlugin implements the Kubernetes device plugin API
type nvidiaDevicePlugin struct {
pluginapi.UnimplementedDevicePluginServer
ctx context.Context
Member


Question: Why do we need a second context?

Collaborator Author


So we can cancel the healthchecks independently.

Comment on lines 131 to 133
// Recreate context for potential plugin restart. The same plugin instance
// may be restarted via Start() after Stop(), so we need a fresh context.
plugin.healthCtx, plugin.healthCancel = context.WithCancel(plugin.ctx)
Member


When are we ever setting these to nil now? This should ONLY be done in the constructor.

Collaborator Author


Addressed by da507ef

eventMask := uint64(nvml.EventTypeXidCriticalError | nvml.EventTypeDoubleBitEccError | nvml.EventTypeSingleBitEccError)
for _, d := range devices {
uuid, gi, ci, err := r.getDevicePlacement(d)
uuid, gi, ci, err := (&withDevicePlacements{r.nvml}).getDevicePlacement(d)
Contributor


Could you provide some context as to why we landed on this change / refactor?

Collaborator Author


The goal of this refactor is to take the DevicePlugin to a similar state as our DRA driver after @guptaNswati's health-check work.
This is in order to prepare both DRA and DevicePlugin for NVSentinel integration via the Device-API.

Contributor


Let's add a detailed comment on how it will help. I also had to look into it, because right now, when we are only using NVML, it doesn't make sense to decouple the DevicePlacement logic from NVMLResourceManager. We need to note how this placement code can be reused with multiple providers, not just NVML.

type nvmlResourceManager struct {
	resourceManager
	nvml nvml.Interface

to having a new wrapper

type withDevicePlacements struct {
	nvml.Interface
}

@@ -107,14 +137,16 @@ func (r *nvmlResourceManager) checkHealth(stop <-chan interface{}, devices Devic
klog.Warningf("Device %v is too old to support healthchecking.", d.ID)
case ret != nvml.SUCCESS:
klog.Infof("Marking device %v as unhealthy: %v", d.ID, ret)
Contributor


nit: could we update the log message here to indicate why we are marking the device unhealthy? (we failed during NVML event registration)

Contributor


+1. On DRA it's: "unable to register events for %s: %v; marking it as unhealthy", d.ID, ret

AGENTS.md Outdated
Contributor


Could you explain how you are using this file / how it was generated? If you used a coding agent, could you acknowledge that in the PR description and describe how you used it?

This file needs to be removed before we can merge this PR.

Collaborator Author


DONE

return nil
}
// We stop the health checks
plugin.healthCancel()
Contributor


nit: add a nil check
if plugin.healthCancel != nil { plugin.healthCancel() }

}
klog.Infof("Registered device plugin for '%s' with Kubelet", plugin.rm.Resource())

plugin.health = make(chan *rm.Device)
Contributor


should these be initialized before registering the plugin?

plugin.healthCancel()
plugin.healthWg.Wait()
// Close the health channel after waiting for the health check goroutine to terminate.
close(plugin.health)
Contributor


nil check here also.
if plugin.health != nil { close(plugin.health); plugin.health = nil }

case d := <-plugin.health:
case d, ok := <-plugin.health:
if !ok {
// Health channel closed, health checks stopped
Contributor


I would log here also; it helps in debugging.
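A sketch of what the comma-ok receive with a log line on closure could look like. `watchHealth` and the log text are hypothetical stand-ins; the real code lives inside ListAndWatch and uses klog:

```go
package main

import "fmt"

// Device is a stand-in for the plugin's rm.Device type.
type Device struct{ ID string }

// watchHealth drains health events until the channel is closed during
// cleanup. The comma-ok form distinguishes a real event from closure,
// and closure is logged so shutdowns are visible in the logs.
func watchHealth(health <-chan *Device) []string {
	var events []string
	for {
		d, ok := <-health
		if !ok {
			// Stand-in for klog.Infof in the real plugin.
			fmt.Println("health channel closed; stopping health watch")
			return events
		}
		events = append(events, d.ID)
	}
}

func main() {
	health := make(chan *Device, 2)
	health <- &Device{ID: "GPU-0"}
	close(health)
	fmt.Println(watchHealth(health)) // [GPU-0]
}
```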

_ = eventSet.Free()
}()

// Construct the device maps.
Contributor


Oh, you already called it out. Yes, we should refactor it into helpers.

envEnableHealthChecks = "DP_ENABLE_HEALTHCHECKS"
)

type nvmlHealthProvider struct {
Contributor


Why not nvmlDeviceHealthProvider?


// getMigDeviceParts returns the parent GI and CI ids of the MIG device.
func (r *nvmlResourceManager) getMigDeviceParts(d *Device) (string, uint32, uint32, error) {
func (r *withDevicePlacements) getMigDeviceParts(d *Device) (string, uint32, uint32, error) {
Contributor


nit: can we call it dp or d instead of r

type nvmlHealthProvider struct {
nvmllib nvml.Interface
devices Devices
parentToDeviceMap map[string]*Device
Contributor


can we make it a type similar to type devicePlacementMap map[string]map[uint32]map[uint32]*Device
https://github.com/NVIDIA/k8s-dra-driver-gpu/blob/main/cmd/gpu-kubelet-plugin/device_health.go#L36

}

gpu, ret := r.nvml.DeviceGetHandleByUUID(uuid)
func (r *nvmlHealthProvider) registerDeviceEvents(eventSet nvml.EventSet) {
Contributor


nit: r -> h or hp

}

p := nvmlHealthProvider{
nvmllib: r.nvml,
Contributor


is this needed for tests?

Contributor


@ArangoGutierrez why do we need to have the lib here? nvmlResourceManager.nvml is used everywhere

@guptaNswati
Contributor

Sorry for the delayed review. @ArangoGutierrez thank you for your patience.

I looked at it from the perspective of what we did for the DRA health checks, as the goal is to align with DRA and integrate with the device-api.

I have minor nits but overall the refactor LGTM.

I still want to spend some more time reading and trying the mock tests.

ArangoGutierrez and others added 3 commits February 12, 2026 15:24
Add vendored NVML mock packages from go-nvml including the dgxa100 mock
server. These mocks enable deterministic unit testing of NVML health
check behavior without requiring actual GPU hardware.

Signed-off-by: Carlos Eduardo Arango Gutierrez <eduardoa@nvidia.com>
Refactor health check implementation:

- Extract health monitoring logic into nvmlHealthProvider struct with
  registerDeviceEvents() and runEventMonitor() methods, making health
  checks testable in isolation
- Migrate CheckHealth interface from stop channel to context.Context for
  idiomatic cancellation
- Extract device placement logic into withDevicePlacements struct to
  decouple from resource manager
- Clarify log message for event registration failure to explain why
  device is marked unhealthy (aligns with DRA driver convention)
- Add unit tests for XID event handling and ERROR_NOT_SUPPORTED
  regression using NVML mock infrastructure

Signed-off-by: Carlos Eduardo Arango Gutierrez <eduardoa@nvidia.com>
Replace the stop channel with context.Context cancellation and
sync.WaitGroup for health check goroutine lifecycle management:

- Initialize health channel and healthCtx before Serve() to eliminate
  race where kubelet could call ListAndWatch before fields are set
- Use sync.WaitGroup to ensure health goroutine completes before
  channel close in Stop()
- Add nil guards in Stop() to handle partial Start() failure safely
- Add structured error handling for health check completion (success,
  canceled, error cases)
- Add debug logging in ListAndWatch for context cancellation and
  channel closure
- Fix updateResponseForMPS receiver from value to pointer

Signed-off-by: Carlos Eduardo Arango Gutierrez <eduardoa@nvidia.com>
Co-authored-by: Evan Lezar <elezar@nvidia.com>
@ArangoGutierrez
Collaborator Author

I have addressed all your comments @guptaNswati / @cdesiniotis PTAL

@guptaNswati
Contributor

Thank you. I didn't get a chance to test it. Can you please paste the test logs?

@ArangoGutierrez
Collaborator Author

Thank you. I didn't get a chance to test it. Can you please paste the test logs?

Unfortunately not; I didn't save the logs, and the cluster I used is down.


Copilot AI left a comment


Pull request overview

Copilot reviewed 7 out of 20 changed files in this pull request and generated 4 comments.



return fmt.Errorf("error waiting for MPS daemon: %w", err)
}

plugin.health = make(chan *rm.Device)

Copilot AI Feb 12, 2026


plugin.health is created as an unbuffered channel and the health-check goroutine can send to it even when no ListAndWatch stream is active. Since Stop() now cancels the context and then waits for the health goroutine to exit, a blocked send on plugin.health can deadlock Stop() (context cancellation won’t unblock a channel send). Consider making the channel buffered (e.g., sized to number of devices) and/or ensuring all sends in health-check code are non-blocking/select on ctx.Done().

Suggested change

    plugin.health = make(chan *rm.Device)

    // Use a buffered channel for health notifications to avoid blocking when
    // no ListAndWatch stream is currently consuming from plugin.health.
    healthBufSize := len(plugin.rm.Devices())
    if healthBufSize < 1 {
        healthBufSize = 1
    }
    plugin.health = make(chan *rm.Device, healthBufSize)

Comment on lines 125 to 129
err := plugin.Serve()
if err != nil {
klog.Errorf("Could not start device plugin for '%s': %s", plugin.rm.Resource(), err)
plugin.cleanup()
return err
}

Copilot AI Feb 12, 2026


On Serve() failure, Start() returns immediately after creating plugin.health and plugin.healthCtx but does not cancel/close them. If the same plugin instance is retried or kept around, this leaks resources and leaves fields in a partially-initialized state. Consider deferring a cleanup (cancel + close channel) on the Serve() error path.

Comment on lines 99 to 101
deviceIDToGiMap[d.ID] = gi
deviceIDToCiMap[d.ID] = ci
parentToDeviceMap[uuid] = d

Copilot AI Feb 12, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

parentToDeviceMap is keyed by uuid returned from getDevicePlacement(), which is the parent UUID for MIG devices. If multiple MIG devices share the same parent GPU, this assignment overwrites earlier entries, so events for that parent can only ever map to one MIG device. To correctly handle multiple MIG instances per parent, use a mapping that can represent multiple devices per parent (e.g., parent UUID -> list, or parent UUID -> (GI,CI)->device).
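One possible shape for the multi-device-per-parent mapping this comment suggests, keyed by parent UUID and then (GI, CI). The type and method names here are illustrative only, not code from the PR:

```go
package main

import "fmt"

// Device is a stand-in for the plugin's rm.Device type.
type Device struct{ ID string }

// placementKey identifies a MIG instance within its parent GPU,
// following the (GI, CI) suggestion in the review comment.
type placementKey struct{ gi, ci uint32 }

// parentDeviceMap maps parent GPU UUID -> (GI, CI) -> MIG device, so two
// MIG instances sharing a parent no longer overwrite each other.
type parentDeviceMap map[string]map[placementKey]*Device

func (m parentDeviceMap) add(parentUUID string, gi, ci uint32, d *Device) {
	if m[parentUUID] == nil {
		m[parentUUID] = map[placementKey]*Device{}
	}
	m[parentUUID][placementKey{gi, ci}] = d
}

func main() {
	m := parentDeviceMap{}
	m.add("GPU-parent", 1, 0, &Device{ID: "MIG-a"})
	m.add("GPU-parent", 2, 0, &Device{ID: "MIG-b"})
	fmt.Println(len(m["GPU-parent"])) // 2 — both MIG devices survive
}
```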

Comment on lines 157 to 162
if ret != nvml.SUCCESS {
klog.Infof("Error waiting for event: %v; Marking all devices as unhealthy", ret)
for _, d := range devices {
unhealthy <- d
for _, d := range h.devices {
h.unhealthy <- d
}
continue

Copilot AI Feb 12, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When eventSet.Wait() returns a non-timeout error, the loop marks all devices unhealthy and continues. If the error is persistent, this can spin forever and repeatedly attempt blocking sends to h.unhealthy, preventing shutdown if the receiver isn’t draining fast enough. Consider returning the error (or backing off) and make unhealthy notifications context-aware (select on ctx.Done()) to avoid blocking indefinitely.

@guptaNswati
Contributor

Thank you. I didn't get a chance to test it. Can you please paste the test logs?

Unfortunately not; I didn't save the logs, and the cluster I used is down.

I know these are straightforward changes and you must have tested them very well. I feel more comfortable when I see the test logs, so I will try to set up a cluster next week and see if we can test them.
