Fix goroutine leak in monitor workOne due to unbuffered channel #4714
Merged
When the 50s context timeout fires, workOne returns but the execute goroutine blocks forever on sending to the unbuffered allJobsDone channel, since no receiver is waiting anymore. This leaks the goroutine and the entire Monitor struct (K8s clientsets, HTTP connection pools, TLS state) on every timeout. Buffering the channel with capacity 1 allows execute to send and exit even when workOne has already returned. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Pull request overview
Fixes a goroutine leak in the monitor worker’s workOne timeout path by buffering the allJobsDone completion channel so execute can signal completion without a concurrent receiver.
Changes:
- Change allJobsDone from an unbuffered channel to a buffered channel (capacity 1) to prevent execute from blocking forever after workOne returns on timeout.
TestExecuteReturnsWhenNoReceiver simulates the workOne timeout scenario: context is cancelled and nobody reads from the done channel. With the previous unbuffered channel, execute would block forever on done <- true. With the buffered channel fix, execute returns promptly. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
hawkowl
approved these changes
Mar 27, 2026
Summary
Fixes a goroutine leak in the monitor's workOne function caused by an unbuffered channel that permanently blocks the execute goroutine when the monitoring context times out.

Follows up on the analysis in #4706 (comment).
The Bug
In pkg/monitor/worker.go, the allJobsDone channel is created unbuffered.

The workOne function then launches execute as a background goroutine and waits in a select for either completion or timeout (source: worker.go lines 257–264).
When the timeout fires (ctx.Done() wins the select), workOne returns immediately. Meanwhile, the execute goroutine is still running: it waits for all monitor goroutines to finish via wg.Wait(), then attempts to signal completion (source: worker.go lines 267–285).
Since workOne already returned, nobody is reading from allJobsDone. Because the channel is unbuffered, done <- true blocks indefinitely. The execute goroutine can never exit.
What Leaks
Each leaked execute goroutine holds references (via closures and the WaitGroup) to:
- The cluster.Monitor struct, which contains:
  - kubernetes.Interface, configclient.Interface, operatorclient.Interface, aroclient.Interface
  - rest.Config and rest.Interface
  - clienthelper.Interface
  - logrus.Entry, dimensions map, and metrics emitter

This memory cannot be garbage collected because the blocked goroutine keeps everything reachable.
Fleet-Scale Impact
The worker function calls workOne in a ticker loop (line 187) with context.Background(), meaning there is one worker goroutine per monitored cluster. Every time a cluster's monitoring cycle times out, a new execute goroutine leaks. Over hours of operation across hundreds of clusters, this leads to steady memory growth and goroutine accumulation.
The Fix
Buffering the channel with capacity 1 allows the execute goroutine to send its completion signal into the channel and return, even when no goroutine is receiving. The channel (and its single buffered value) is then garbage collected normally.
Test Plan
- make fmt passes
- make lint-go passes (via pre-commit hook)
- go test -v ./pkg/monitor/... -run TestExecute passes
- TestExecute validates the happy path (monitors complete, allJobsDone is received)
- execute can always exit

🤖 Generated with Claude Code
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>