Draft
Conversation
Add NVML mock packages from go-nvml for use in health check tests. This includes the dgxa100 mock server for simulating GPU hardware. Signed-off-by: Carlos Eduardo Arango Gutierrez <eduardoa@nvidia.com>
Refactor the health monitoring implementation: - Extract nvmlHealthProvider struct to encapsulate health check state - Add registerDeviceEvents() and runEventMonitor() methods - Rename xids field to xidsDisabled for clarity - Migrate from stop channel to context.Context for cancellation Add comprehensive tests: - TestCheckHealth validates XID event handling with mocks - TestRegisterDeviceEventsNotSupported ensures old GPUs returning ERROR_NOT_SUPPORTED are not incorrectly marked unhealthy Signed-off-by: Carlos Eduardo Arango Gutierrez <eduardoa@nvidia.com>
Update ResourceManager interface and implementations to use context.Context for cancellation instead of a stop channel. This is more idiomatic Go and allows for better lifecycle control. Signed-off-by: Carlos Eduardo Arango Gutierrez <eduardoa@nvidia.com>
Replace stop channel with context-based cancellation for health checks: - Add healthCtx, healthCancel, and healthWg for lifecycle management - Capture context/channel references in ListAndWatch to avoid race with cleanup() which may nil these fields - Properly wait for health goroutine completion during cleanup Signed-off-by: Carlos Eduardo Arango Gutierrez <eduardoa@nvidia.com>
Initialize healthCtx and healthCancel in devicePluginForResource() instead of in initialize() to eliminate race condition where ListAndWatch() might access healthCtx before initialization completes. This addresses Elezar's concern about synchronization guarantees and follows Go best practices for immutable initialization. Refs: NVIDIA#1601 Task: 1/6 Co-authored-by: Evan Lezar <elezar@nvidia.com> Signed-off-by: Carlos Eduardo Arango Gutierrez <eduardoa@nvidia.com>
Remove healthCtx and healthCancel initialization from initialize() since they are now initialized at construction time in devicePluginForResource(). This eliminates redundant initialization and ensures context lifetime is tied to the plugin instance rather than the start/stop cycle. Refs: NVIDIA#1601 Task: 2/6 Signed-off-by: Carlos Eduardo Arango Gutierrez <eduardoa@nvidia.com>
Recreate healthCtx and healthCancel in cleanup() after cancellation to support plugin restart. Remove nil assignments for these fields as they need to persist across restart cycles. This addresses Elezar's concern NVIDIA#2 about why we nil these fields - we no longer do. The context is recreated fresh for each restart, ensuring health checks work correctly when the plugin is restarted. Fixes plugin restart blocker identified in architecture review. Refs: NVIDIA#1601 Task: 3/6 Co-authored-by: Evan Lezar <elezar@nvidia.com> Signed-off-by: Carlos Eduardo Arango Gutierrez <eduardoa@nvidia.com>
Close the health channel in cleanup() before niling to prevent panics in ListAndWatch() if cleanup happens during channel read. The WaitGroup ensures the health goroutine has completed before we close the channel, making this operation safe. Fixes critical blocker: health channel was never closed, leading to potential panics and resource leaks. Refs: NVIDIA#1601 Task: 4/6 Signed-off-by: Carlos Eduardo Arango Gutierrez <eduardoa@nvidia.com>
Add ok check when receiving from health channel to gracefully handle channel closure during cleanup. This prevents potential panics and ensures clean shutdown when the plugin stops. Reading from a closed channel returns zero value and ok=false, which we now check and return gracefully. Refs: NVIDIA#1601 Task: 5/6 Signed-off-by: Carlos Eduardo Arango Gutierrez <eduardoa@nvidia.com>
Replace simple if statement with switch to distinguish between three error cases: 1. Success (nil) - log completion 2. Canceled (context.Canceled) - log clean shutdown at V(4) 3. Error (other) - log error This provides better observability and distinguishes expected shutdown from actual errors. Addresses Elezar's concern NVIDIA#3 about error handling improvements. Task: 6/6 Co-authored-by: Evan Lezar <elezar@nvidia.com> Signed-off-by: Carlos Eduardo Arango Gutierrez <eduardoa@nvidia.com>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
No description provided.