@ipochi ipochi commented Oct 13, 2025

fix(server): Goroutine leak in HTTP-Connect tunnel

This commit fixes a goroutine leak in the HTTP-Connect tunnel
(tunnel.go) that could occur during connection setup.

The leak happened when a backend agent disconnected at a very specific
time: after the server sent a DIAL_REQ but before the connection was
fully established. In this scenario, the cleanup logic was never called,
and the handler goroutine would hang forever.

I've refactored ServeHTTP to make it more robust and prevent this leak:

Added a deferred cleanup function: A defer block now acts as a safety
net. It uses a flag (established) to track whether the connection
succeeded. If the function exits for any reason before the connection is
established, this deferred code runs and guarantees that the pending
dial is removed from our tracking map.

Fixed a race condition with a single select: The old code had separate,
racy checks for different failure modes. I've replaced this with a
single, atomic select block that waits for all possible outcomes at
once: a successful connection, the client disconnecting, or the agent's
context being cancelled. This makes the logic much safer and easier to
follow.

Fixes: #789

@k8s-ci-robot k8s-ci-robot added the cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. label Oct 13, 2025
@k8s-ci-robot k8s-ci-robot added approved Indicates a PR has been approved by an approver from all required OWNERS files. size/XS Denotes a PR that changes 0-9 lines, ignoring generated files. labels Oct 13, 2025
@ipochi ipochi force-pushed the imran/fix-mem-leak-pending-dials branch from 82146cd to 1ccdfbe Compare October 13, 2025 20:30
@k8s-ci-robot k8s-ci-robot added size/M Denotes a PR that changes 30-99 lines, ignoring generated files. and removed size/XS Denotes a PR that changes 0-9 lines, ignoring generated files. labels Oct 13, 2025
klog.ErrorS(err, "no tunnels available")
conn.Write([]byte(fmt.Sprintf("HTTP/1.1 500 Internal Server Error\r\nContent-Type: text/plain\r\n\r\ncurrently no tunnels available: %v", err)))
conn.Close()
// conn.Close() is handled by the top-level defer

(non-blocking) Can we fix the comment to indicate that it should be closed by the closeOnce above?

established := false
defer func() {
if !established {
if t.Server.PendingDial.Remove(random) != nil {

A little concerned that established does not guarantee that random is in PendingDial. If we got a response (DIAL_CLS or DIAL_RSP) but the select below goes to a different case than connection.connected, established will be false but random will be absent from PendingDial and I do not think Remove() is resilient to that case.

@cheftako I hear your concern, but Remove uses the built-in delete, which treats removal of a non-existent entry as a no-op:

https://pkg.go.dev/builtin#delete

ipochi commented Oct 17, 2025

/test pull-apiserver-network-proxy-test-master

Signed-off-by: Imran Pochi <[email protected]>
@ipochi ipochi force-pushed the imran/fix-mem-leak-pending-dials branch from 61b8654 to e12c96f Compare October 17, 2025 20:03
@ipochi ipochi requested a review from cheftako October 17, 2025 20:12
@cheftako

/lgtm
/approve

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Oct 23, 2025
@k8s-ci-robot

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: cheftako, ipochi

@k8s-ci-robot k8s-ci-robot merged commit b79f70b into kubernetes-sigs:master Oct 23, 2025
22 checks passed
ipochi pushed a commit to kinvolk/apiserver-network-proxy that referenced this pull request Oct 30, 2025
…k-pending-dials

fix: memory leak on account of pending dials
k8s-ci-robot added a commit that referenced this pull request Oct 31, 2025
Backporting #790 to release-0.32 from kinvolk/imran/fix-mem-leak-pending-dials
k8s-ci-robot added a commit that referenced this pull request Oct 31, 2025
Backporting #790 to release-0.31 from kinvolk/imran/fix-mem-leak-pending-dials
k8s-ci-robot added a commit that referenced this pull request Oct 31, 2025
Backporting #790 to release-0.33 from kinvolk/imran/fix-mem-leak-pending-dials
Development

Successfully merging this pull request may close these issues.

Memory consumption becomes high leading to oom kill due to pending dials in http-connect mode