Skip to content

Fix goroutine leak in monitor workOne due to unbuffered channel#4714

Merged
hawkowl merged 2 commits intomasterfrom
tuxerrante/fix-monitor-goroutine-leak
Mar 27, 2026
Merged

Fix goroutine leak in monitor workOne due to unbuffered channel#4714
hawkowl merged 2 commits intomasterfrom
tuxerrante/fix-monitor-goroutine-leak

Conversation

@tuxerrante
Copy link
Copy Markdown
Collaborator

Summary

Fixes a goroutine leak in the monitor's workOne function caused by an unbuffered channel that permanently blocks the execute goroutine when the monitoring context times out.

Follows up on the analysis in #4706 (comment).

The Bug

In pkg/monitor/worker.go, the allJobsDone channel is created unbuffered:

allJobsDone := make(chan bool)    // ← unbuffered

The workOne function then launches execute as a background goroutine and waits in a select for either completion or timeout:

go execute(ctx, log, allJobsDone, monitors, onPanic)     // line 257

select {
case <-allJobsDone:       // happy path: all monitors finished in time
case <-ctx.Done():        // timeout path: 50s elapsed
}

(source: worker.go lines 257–264)

When the timeout fires (ctx.Done() wins the select), workOne returns immediately. Meanwhile, the execute goroutine is still running — it waits for all monitor goroutines to finish via wg.Wait(), then attempts to signal completion:

func execute(ctx context.Context, log *logrus.Entry, done chan<- bool, monitors []monitoring.Monitor, onPanic func(monitoring.Monitor)) {
    var wg sync.WaitGroup
    // ... launches monitor goroutines ...
    wg.Wait()
    done <- true    // ← BLOCKS FOREVER: channel is unbuffered, no receiver exists
}

(source: worker.go lines 267–285)

Since workOne already returned, nobody is reading from allJobsDone. Because the channel is unbuffered, done <- true blocks indefinitely. The execute goroutine can never exit.

What Leaks

Each leaked execute goroutine holds references (via closures and the WaitGroup) to:

  • The full cluster.Monitor struct, which contains:
    • Kubernetes clientsets (kubernetes.Interface, configclient.Interface, operatorclient.Interface, aroclient.Interface)
    • HTTP connection pools and TLS state (via rest.Config and rest.Interface)
    • Controller-runtime client (clienthelper.Interface)
  • All collector function closures (27 collectors)
  • The logrus.Entry, dimensions map, and metrics emitter

This memory cannot be garbage collected because the blocked goroutine keeps everything reachable.

Fleet-Scale Impact

The worker function calls workOne in a ticker loop (line 187) with context.Background() — meaning there is one worker goroutine per monitored cluster. Every time a cluster's monitoring cycle times out, a new execute goroutine leaks. Over hours of operation across hundreds of clusters, this leads to steady memory growth and goroutine accumulation.

The Fix

-	allJobsDone := make(chan bool)
+	allJobsDone := make(chan bool, 1)

Buffering the channel with capacity 1 allows the execute goroutine to send its completion signal into the channel and return, even when no goroutine is receiving. The channel (and its single buffered value) are then garbage collected normally.

Test Plan

  • make fmt passes
  • make lint-go passes (via pre-commit hook)
  • go test -v ./pkg/monitor/... -run TestExecute passes
  • Existing TestExecute validates the happy path (monitors complete, allJobsDone is received)
  • The timeout path (the bug) was previously untestable without a race — now it works correctly because execute can always exit

🤖 Generated with Claude Code

Co-Authored-By: Claude Opus 4.6 noreply@anthropic.com

When the 50s context timeout fires, workOne returns but the execute
goroutine blocks forever on sending to the unbuffered allJobsDone
channel, since no receiver is waiting anymore. This leaks the goroutine
and the entire Monitor struct (K8s clientsets, HTTP connection pools,
TLS state) on every timeout.

Buffering the channel with capacity 1 allows execute to send and exit
even when workOne has already returned.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Fixes a goroutine leak in the monitor worker’s workOne timeout path by buffering the allJobsDone completion channel so execute can signal completion without a concurrent receiver.

Changes:

  • Change allJobsDone from an unbuffered channel to a buffered channel (capacity 1) to prevent execute from blocking forever after workOne returns on timeout.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

TestExecuteReturnsWhenNoReceiver simulates the workOne timeout scenario:
context is cancelled and nobody reads from the done channel. With the
previous unbuffered channel, execute would block forever on done <- true.
With the buffered channel fix, execute returns promptly.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@hawkowl hawkowl merged commit 50065e4 into master Mar 27, 2026
31 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants