-
Notifications
You must be signed in to change notification settings - Fork 0
handle kubelet restart #4
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
Signed-off-by: Simon Davies <[email protected]>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull request overview
This PR implements automatic restart handling for the Hyperlight device plugin to recover from kubelet restarts. When kubelet restarts, it deletes all device plugin sockets, requiring plugins to detect this event and re-register.
Key changes:
- Adds socket monitoring via
watchKubeletRestart()that polls for socket deletion - Wraps plugin startup in a restart loop that attempts to recover from failures
- Resets the stop channel in
Start()to support multiple start/stop cycles
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| // Reset stop channel for restart scenarios | ||
| p.stopCh = make(chan struct{}) | ||
|
|
Copilot
AI
Jan 7, 2026
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Recreating the stop channel in Start() creates a race condition. The watchKubeletRestart() goroutine and ListAndWatch() function may be reading from the old stopCh while a new one is created here. When Stop() closes the old channel, these goroutines will still be blocked on the new channel. This causes the plugin to not shut down properly on restart cycles.
| // Reset stop channel for restart scenarios | |
| p.stopCh = make(chan struct{}) |
| // If we get here, kubelet restarted - stop current server and re-register | ||
| log.Println("Detected kubelet restart, re-registering...") | ||
| plugin.server.Stop() |
Copilot
AI
Jan 7, 2026
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Calling plugin.server.Stop() directly without calling plugin.Stop() bypasses the cleanup logic in the Stop() method. This leaves the old stopCh channel open (not closed), which means any goroutines waiting on that channel will continue to wait indefinitely. Additionally, the socket file is not removed. The Stop() method should be called instead to ensure proper cleanup.
| // If we get here, kubelet restarted - stop current server and re-register | |
| log.Println("Detected kubelet restart, re-registering...") | |
| plugin.server.Stop() | |
| // If we get here, kubelet restarted - stop current plugin and re-register | |
| log.Println("Detected kubelet restart, re-registering...") | |
| plugin.Stop() |
| go func() { | ||
| for { | ||
| if err := plugin.Start(); err != nil { | ||
| log.Printf("Failed to start device plugin: %v", err) | ||
| time.Sleep(5 * time.Second) | ||
| continue | ||
| } | ||
|
|
||
| // Watch for kubelet restart (socket deletion) | ||
| // When kubelet restarts, it deletes all sockets in /var/lib/kubelet/device-plugins/ | ||
| plugin.watchKubeletRestart() | ||
|
|
||
| // If we get here, kubelet restarted - stop current server and re-register | ||
| log.Println("Detected kubelet restart, re-registering...") | ||
| plugin.server.Stop() | ||
| time.Sleep(time.Second) // Brief pause before restart | ||
| } |
Copilot
AI
Jan 7, 2026
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The restart loop does not properly handle the case where watchKubeletRestart() returns due to stopCh being closed (during graceful shutdown). When plugin.Stop() is called in line 371, it closes stopCh, causing watchKubeletRestart() to return. The loop then calls plugin.server.Stop() and attempts to restart, but the program is already in shutdown mode. The loop should check if it's in shutdown mode after watchKubeletRestart() returns.
| ticker := time.NewTicker(time.Second) | ||
| defer ticker.Stop() | ||
|
|
||
| for { | ||
| select { | ||
| case <-p.stopCh: | ||
| return | ||
| case <-ticker.C: | ||
| // Check if our socket still exists | ||
| if _, err := os.Stat(serverSock); os.IsNotExist(err) { | ||
| log.Println("Plugin socket deleted - kubelet may have restarted") | ||
| return | ||
| } | ||
| } | ||
| } |
Copilot
AI
Jan 7, 2026
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Polling the socket every second is inefficient. Consider using a filesystem watcher (such as fsnotify) to detect socket deletion immediately instead of polling. This would reduce unnecessary CPU usage and provide faster detection of kubelet restarts.
Handle Kubelet restarts in the device plugin