Skip to content

Conversation

@simongdavies
Copy link
Contributor

Handle Kubelet restarts in the device plugin

Signed-off-by: Simon Davies <[email protected]>
Copilot AI review requested due to automatic review settings January 7, 2026 21:14
@simongdavies simongdavies enabled auto-merge (squash) January 7, 2026 21:14
Copy link

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR implements automatic restart handling for the Hyperlight device plugin to recover from kubelet restarts. When kubelet restarts, it deletes all device plugin sockets, requiring plugins to detect this event and re-register.

Key changes:

  • Adds socket monitoring via watchKubeletRestart() that polls for socket deletion
  • Wraps plugin startup in a restart loop that attempts to recover from failures
  • Resets the stop channel in Start() to support multiple start/stop cycles

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment on lines +240 to +242
// Reset stop channel for restart scenarios
p.stopCh = make(chan struct{})

Copy link

Copilot AI Jan 7, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Recreating the stop channel in Start() creates a race condition. The watchKubeletRestart() goroutine and ListAndWatch() function may be reading from the old stopCh while a new one is created here. When Stop() closes the old channel, these goroutines will still be blocked on the new channel. This causes the plugin to not shut down properly on restart cycles.

Suggested change
// Reset stop channel for restart scenarios
p.stopCh = make(chan struct{})

Copilot uses AI. Check for mistakes.
Comment on lines +362 to +364
// If we get here, kubelet restarted - stop current server and re-register
log.Println("Detected kubelet restart, re-registering...")
plugin.server.Stop()
Copy link

Copilot AI Jan 7, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Calling plugin.server.Stop() directly without calling plugin.Stop() bypasses the cleanup logic in the Stop() method. This leaves the old stopCh channel open (not closed), which means any goroutines waiting on that channel will continue to wait indefinitely. Additionally, the socket file is not removed. The Stop() method should be called instead to ensure proper cleanup.

Suggested change
// If we get here, kubelet restarted - stop current server and re-register
log.Println("Detected kubelet restart, re-registering...")
plugin.server.Stop()
// If we get here, kubelet restarted - stop current plugin and re-register
log.Println("Detected kubelet restart, re-registering...")
plugin.Stop()

Copilot uses AI. Check for mistakes.
Comment on lines +350 to +366
go func() {
for {
if err := plugin.Start(); err != nil {
log.Printf("Failed to start device plugin: %v", err)
time.Sleep(5 * time.Second)
continue
}

// Watch for kubelet restart (socket deletion)
// When kubelet restarts, it deletes all sockets in /var/lib/kubelet/device-plugins/
plugin.watchKubeletRestart()

// If we get here, kubelet restarted - stop current server and re-register
log.Println("Detected kubelet restart, re-registering...")
plugin.server.Stop()
time.Sleep(time.Second) // Brief pause before restart
}
Copy link

Copilot AI Jan 7, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The restart loop does not properly handle the case where watchKubeletRestart() returns due to stopCh being closed (during graceful shutdown). When plugin.Stop() is called in line 371, it closes stopCh, causing watchKubeletRestart() to return. The loop then calls plugin.server.Stop() and attempts to restart, but the program is already in shutdown mode. The loop should check if it's in shutdown mode after watchKubeletRestart() returns.

Copilot uses AI. Check for mistakes.
Comment on lines +319 to +333
ticker := time.NewTicker(time.Second)
defer ticker.Stop()

for {
select {
case <-p.stopCh:
return
case <-ticker.C:
// Check if our socket still exists
if _, err := os.Stat(serverSock); os.IsNotExist(err) {
log.Println("Plugin socket deleted - kubelet may have restarted")
return
}
}
}
Copy link

Copilot AI Jan 7, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Polling the socket every second is inefficient. Consider using a filesystem watcher (such as fsnotify) to detect socket deletion immediately instead of polling. This would reduce unnecessary CPU usage and provide faster detection of kubelet restarts.

Copilot uses AI. Check for mistakes.
@simongdavies simongdavies merged commit ff6cc35 into main Jan 7, 2026
10 checks passed
@simongdavies simongdavies deleted the handle-restart branch January 7, 2026 21:22
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant