events: Cancel ProtoStream context on timeout#64460

Open: juliaogris wants to merge 3 commits into master from julia/app/stream-cancel

Conversation

@juliaogris juliaogris commented Mar 10, 2026

Cancel orphaned ProtoStream goroutines on session close timeout.

When ProtoStream.Complete() or Close() times out, s.cancel() is
never called, leaving upload goroutines and their ~5 MB slice buffers
alive until the slow writes finish. Under sustained session churn with
exhausted IOPS, orphaned goroutines accumulate faster than they drain.

Add defer s.cancel() to both Complete() and Close() so the
stream's internal context is always canceled when these methods return,
whether they succeed or time out.
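
The pattern can be sketched in isolation. The following is a minimal illustration, not the actual ProtoStream code: the `stream` type, its fields, `newStream`, and `demoTimeout` are all hypothetical stand-ins, and the slow write is simulated with a timer. It only aims to show that a deferred cancel releases the upload goroutine when `Complete` gives up on the caller's deadline.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// stream is a hypothetical stand-in for ProtoStream. It owns a context
// rooted in context.Background() that its upload goroutine listens on.
type stream struct {
	ctx      context.Context
	cancel   context.CancelFunc
	uploaded chan struct{} // closed when the simulated write finishes
	exited   chan struct{} // closed when the upload goroutine returns
}

func newStream(writeDuration time.Duration) *stream {
	ctx, cancel := context.WithCancel(context.Background())
	s := &stream{
		ctx:      ctx,
		cancel:   cancel,
		uploaded: make(chan struct{}),
		exited:   make(chan struct{}),
	}
	go func() {
		defer close(s.exited)
		select {
		case <-time.After(writeDuration): // simulated slow write
			close(s.uploaded)
		case <-s.ctx.Done(): // abandoned by Complete
		}
	}()
	return s
}

// Complete waits for the upload or for the caller's context to expire.
// The deferred cancel runs on both the success and the timeout path.
func (s *stream) Complete(ctx context.Context) error {
	defer s.cancel()
	select {
	case <-s.uploaded:
		return nil
	case <-ctx.Done():
		return ctx.Err()
	}
}

// demoTimeout calls Complete with a deadline far shorter than the write
// and reports whether the upload goroutine exited within one second.
func demoTimeout() (exited bool, err error) {
	s := newStream(time.Hour)
	ctx, cancel := context.WithTimeout(context.Background(), 10*time.Millisecond)
	defer cancel()
	err = s.Complete(ctx)
	select {
	case <-s.exited:
		return true, err
	case <-time.After(time.Second):
		return false, err
	}
}

func main() {
	exited, err := demoTimeout()
	fmt.Println("timed out:", errors.Is(err, context.DeadlineExceeded))
	fmt.Println("goroutine exited:", exited)
}
```

Without the `defer s.cancel()` line, the goroutine in this sketch would stay parked on the one-hour timer, which is the orphaning behavior described above.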

A/B testing under IOPS-throttled conditions shows faster memory drain
after load ends, periodic mid-load memory drops when cancel fires on
timed-out sessions, and roughly half the allocation churn.

@juliaogris juliaogris force-pushed the julia/app/stream-cancel branch 17 times, most recently from 35af006 to 40a353d on March 13, 2026 01:29

Add tests that verify Complete() and Close() cancel the stream's
internal context when they time out. Without the corresponding fix,
upload goroutines rooted in context.Background() continue running
until their slow writes complete, holding ~5 MB slice buffers each.
Under sustained session churn this causes OOM.

The blockingUploader simulates a permanently stuck disk write by
blocking UploadPart on ctx.Done(). The tests verify that after
Complete/Close time out, the goroutine exits via context cancellation
rather than remaining blocked indefinitely.
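
A rough sketch of what such a test double could look like is shown below. The `uploader` interface here is simplified and hypothetical (the real uploader interface takes more parameters), and `uploadExitsOnCancel` and `demoBlockingUploader` are illustrative helpers rather than the tests added in this PR. The point is only that `UploadPart` stays blocked until the context is canceled and then returns the context's error.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// uploader is a simplified, hypothetical interface used only for this sketch.
type uploader interface {
	UploadPart(ctx context.Context, part []byte) error
}

// blockingUploader never completes a write on its own. UploadPart returns
// only when ctx is canceled, which models a permanently stuck disk write.
// It is single-use: started is closed on the first call.
type blockingUploader struct {
	started chan struct{}
}

var _ uploader = (*blockingUploader)(nil)

func (b *blockingUploader) UploadPart(ctx context.Context, part []byte) error {
	close(b.started)
	<-ctx.Done()
	return ctx.Err()
}

// uploadExitsOnCancel starts one upload goroutine, checks that it stays
// blocked, cancels its context, and reports whether it then returned
// context.Canceled within the wait period.
func uploadExitsOnCancel(wait time.Duration) bool {
	ctx, cancel := context.WithCancel(context.Background())
	defer cancel()

	b := &blockingUploader{started: make(chan struct{})}
	done := make(chan error, 1)
	go func() { done <- b.UploadPart(ctx, make([]byte, 1024)) }()

	<-b.started
	select {
	case <-done:
		return false // must not return before cancellation
	case <-time.After(20 * time.Millisecond):
	}

	cancel()
	select {
	case err := <-done:
		return errors.Is(err, context.Canceled)
	case <-time.After(wait):
		return false // still blocked: the goroutine would be orphaned
	}
}

func demoBlockingUploader() bool {
	return uploadExitsOnCancel(time.Second)
}

func main() {
	fmt.Println("exits on cancel:", demoBlockingUploader())
}
```

Blocking on `ctx.Done()` rather than on a long sleep keeps the test deterministic: the only way out of `UploadPart` is the cancellation that the fix is supposed to deliver.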

Failing tests before fix:
- TestCompleteTimeoutCancelsStream
- TestCloseTimeoutCancelsStream/timeout

Passing test before fix:
- TestCloseTimeoutCancelsStream/success

@juliaogris juliaogris force-pushed the julia/app/stream-cancel branch from 40a353d to dd48652 on March 13, 2026 02:50

Add `defer s.cancel()` to both `Complete()` and `Close()` so the
stream's internal context is always cancelled when these methods
return, whether they succeed or time out.

Previously, when `Complete()` timed out, `s.cancel()` was never
called because it only ran on the success path. This left upload
goroutines rooted in `context.Background()` running until their
slow writes finished, each holding a ~5 MB slice buffer. Under
sustained session churn with exhausted emptyDir IOPS, orphaned
goroutines accumulated faster than they drained and the agent
OOMed.

Move the existing `s.cancel()` call in `Complete()` to a defer at
the top so it fires on both success and timeout paths. Add the
same defer to `Close()` which was missing it entirely.
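
The difference between the two placements of `s.cancel()` can be shown side by side. This is an illustrative sketch under assumed names, not the code changed in this PR: `stream`, `completeBefore`, `completeAfter`, and `canceledAfterTimeout` are hypothetical, and the `done` channel is never closed so that every call takes the timeout path.

```go
package main

import (
	"context"
	"fmt"
	"time"
)

// stream is a hypothetical stand-in with an internal cancelable context.
type stream struct {
	ctx    context.Context
	cancel context.CancelFunc
	done   chan struct{} // never closed here: models uploads that never finish
}

func newStream() *stream {
	ctx, cancel := context.WithCancel(context.Background())
	return &stream{ctx: ctx, cancel: cancel, done: make(chan struct{})}
}

// completeBefore cancels only on the success path, as in the described bug.
func (s *stream) completeBefore(ctx context.Context) error {
	select {
	case <-s.done:
		s.cancel()
		return nil
	case <-ctx.Done():
		return ctx.Err() // s.cancel() is skipped on timeout
	}
}

// completeAfter defers the cancel so it fires on every return path.
func (s *stream) completeAfter(ctx context.Context) error {
	defer s.cancel()
	select {
	case <-s.done:
		return nil
	case <-ctx.Done():
		return ctx.Err()
	}
}

// canceledAfterTimeout runs complete with a short deadline and reports
// whether the stream's internal context ended up canceled.
func canceledAfterTimeout(complete func(*stream, context.Context) error) bool {
	s := newStream()
	ctx, cancel := context.WithTimeout(context.Background(), 10*time.Millisecond)
	defer cancel()
	_ = complete(s, ctx)
	return s.ctx.Err() != nil
}

func main() {
	fmt.Println("before:", canceledAfterTimeout((*stream).completeBefore)) // before: false
	fmt.Println("after: ", canceledAfterTimeout((*stream).completeAfter))  // after:  true
}
```

Calling a `context.CancelFunc` more than once is safe, so a deferred cancel does not conflict with any other code path that also cancels the same context.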

@juliaogris juliaogris force-pushed the julia/app/stream-cancel branch from dd48652 to 846a8c9 on March 13, 2026 03:10
@juliaogris juliaogris added the no-test-plan Bypasses the test plan validation bot label Mar 13, 2026
@juliaogris juliaogris requested review from avatus and zmb3 March 13, 2026 05:25
@juliaogris juliaogris marked this pull request as ready for review March 13, 2026 05:25
@github-actions github-actions bot added audit-log Issues related to Teleports Audit Log size/md labels Mar 13, 2026
@juliaogris juliaogris added the no-changelog Indicates that a PR does not require a changelog entry label Mar 13, 2026