events: Cancel ProtoStream context on timeout#64460
Open
juliaogris wants to merge 3 commits intomasterfrom
Open
events: Cancel ProtoStream context on timeout#64460juliaogris wants to merge 3 commits intomasterfrom
juliaogris wants to merge 3 commits intomasterfrom
Conversation
35af006 to
40a353d
Compare
Add tests that verify Complete() and Close() cancel the stream's internal context when they time out. Without the corresponding fix, upload goroutines rooted in context.Background() continue running until their slow writes complete, holding ~5 MB slice buffers each. Under sustained session churn this causes OOM. The blockingUploader simulates a permanently stuck disk write by blocking UploadPart on ctx.Done(). The tests verify that after Complete/Close time out, the goroutine exits via context cancellation rather than remaining blocked indefinitely. Failing tests before fix: - TestCompleteTimeoutCancelsStream - TestCloseTimeoutCancelsStream/timeout Passing test before fix: - TestCloseTimeoutCancelsStream/success
40a353d to
dd48652
Compare
Add `defer s.cancel()` to both `Complete()` and `Close()` so the stream's internal context is always cancelled when these methods return, whether they succeed or time out. Previously, when `Complete()` timed out, `s.cancel()` was never called because it only ran on the success path. This left upload goroutines rooted in `context.Background()` running until their slow writes finished, each holding a ~5 MB slice buffer. Under sustained session churn with exhausted emptyDir IOPS, orphaned goroutines accumulated faster than they drained and the agent OOMed. Move the existing `s.cancel()` call in `Complete()` to a defer at the top so it fires on both success and timeout paths. Add the same defer to `Close()` which was missing it entirely.
dd48652 to
846a8c9
Compare
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Cancel orphaned ProtoStream goroutines on session close timeout.
When
ProtoStream.Complete()orClose()return,s.cancel()isnever called, leaving upload goroutines and their ~5 MB slice buffers
alive until the slow writes finish. Under sustained session churn with
exhausted IOPS, orphaned goroutines accumulate faster than they drain.
Add
defer s.cancel()to bothComplete()andClose()so thestream's internal context is always canceled when these methods return,
whether they succeed or time out.
A/B testing under IOPS-throttled conditions shows faster memory drain
after load ends, periodic mid-load memory drops when cancel fires on
timed-out sessions, and roughly half the allocation churn.