Skip to content

Commit 22afc2a

Browse files
refactor(plugin): initialize health context at construction time
Initialize healthCtx and healthCancel in devicePluginForResource() instead of in initialize() to eliminate race condition where ListAndWatch() might access healthCtx before initialization completes. This addresses Elezar's concern about synchronization guarantees and follows Go best practices for immutable initialization. Refs: NVIDIA#1601 Task: 1/6 Co-authored-by: Evan Lezar <elezar@nvidia.com> Signed-off-by: Carlos Eduardo Arango Gutierrez <eduardoa@nvidia.com>
1 parent 5809a17 commit 22afc2a

File tree

2 files changed

+72
-0
lines changed

2 files changed

+72
-0
lines changed

AGENTS.md

Lines changed: 64 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,64 @@
1+
# Implementation Tasks: PR #1601 - Address Elezar's Review Concerns
2+
3+
## Issue Reference
4+
- PR: #1601
5+
- Review: Elezar's concerns about health check refactoring
6+
- Document: FINAL_SYNTHESIS_PR1601.md
7+
8+
## Tasks
9+
10+
### Phase 1: Constructor Initialization (Eliminates Race Condition)
11+
12+
- [DONE] **Task 1**: Modify `devicePluginForResource()` to initialize health context at construction time
13+
- File: `internal/plugin/server.go`
14+
- Lines: 78-104
15+
- Changes: Create `healthCtx` and `healthCancel` before plugin struct initialization
16+
- Addresses: Elezar's concern #1 (line 281 - synchronization)
17+
- Commit: (pending)
18+
19+
- [TODO] **Task 2**: Remove health context creation from `initialize()`
20+
- File: `internal/plugin/server.go`
21+
- Lines: 114-118
22+
- Changes: Remove `context.WithCancel()` call (already done in constructor)
23+
- Addresses: Cleanup redundant initialization
24+
25+
### Phase 2: Restart-Safe Cleanup
26+
27+
- [TODO] **Task 3**: Modify `cleanup()` to recreate context after cancellation
28+
- File: `internal/plugin/server.go`
29+
- Lines: 120-129
30+
- Changes: Recreate `healthCtx` and `healthCancel` after cancelling for restart support
31+
- Addresses: Elezar's concern #2 (line 128 - why nil these fields), fixes plugin restart
32+
33+
### Phase 3: Health Channel Lifecycle
34+
35+
- [TODO] **Task 4**: Close health channel properly in `cleanup()`
36+
- File: `internal/plugin/server.go`
37+
- Lines: 120-129
38+
- Changes: Close channel before niling to prevent panics
39+
- Addresses: Devil's advocate blocker - channel never closed
40+
41+
- [TODO] **Task 5**: Handle closed channel in `ListAndWatch()`
42+
- File: `internal/plugin/server.go`
43+
- Lines: 287-298
44+
- Changes: Add `ok` check when receiving from health channel
45+
- Addresses: Graceful handling of channel closure
46+
47+
### Phase 4: Error Handling Improvements
48+
49+
- [TODO] **Task 6**: Improve error handling in health check goroutine
50+
- File: `internal/plugin/server.go`
51+
- Lines: 160-168
52+
- Changes: Use switch statement to distinguish error types and log success
53+
- Addresses: Elezar's concern #3 (line 167 - error handling)
54+
55+
## Progress
56+
- Total Tasks: 6
57+
- Completed: 0
58+
- In Progress: 0
59+
- Blocked: 0
60+
61+
## Notes
62+
- All changes are in a single file: `internal/plugin/server.go`
63+
- Changes are backward compatible
64+
- Plugin restart functionality is critical - must test after implementation

internal/plugin/server.go

Lines changed: 8 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -82,6 +82,10 @@ func (o *options) devicePluginForResource(ctx context.Context, resourceManager r
8282
return nil, err
8383
}
8484

85+
// Initialize health context at construction time to eliminate race condition
86+
// where ListAndWatch() might access healthCtx before initialize() completes.
87+
healthCtx, healthCancel := context.WithCancel(ctx)
88+
8589
plugin := nvidiaDevicePlugin{
8690
ctx: ctx,
8791
rm: resourceManager,
@@ -100,6 +104,10 @@ func (o *options) devicePluginForResource(ctx context.Context, resourceManager r
100104
// time the plugin server is restarted.
101105
server: nil,
102106
health: nil,
107+
108+
// Health context initialized at construction to prevent race conditions
109+
healthCtx: healthCtx,
110+
healthCancel: healthCancel,
103111
}
104112
return &plugin, nil
105113
}

0 commit comments

Comments
 (0)