app: Bound memory growth under high session load #64595
Draft
juliaogris wants to merge 2 commits into master
Add tests that exercise the three memory-bounding mechanisms before the implementation commits land. Each test verifies a specific invariant:

- `TestSessionChunkSemaphore`: verify that `close()` drains the semaphore slot both in the normal case and when force-closing with in-flight requests.
- `TestMaxActiveSessionChunksDefault`: verify that `CheckAndSetDefaults` sets `MaxActiveSessionChunks` to `DefaultMaxActiveSessionChunks` when the caller leaves it at zero.
- `TestSessionWriterConfigMaxBufferSize`: verify that the `MaxBufferSize` config defaults to `DefaultMaxBufferSize` when unset and preserves an explicit value.
- `TestUpdateStatusTrimsAtIndexZero`: verify that `updateStatus` trims the buffer when the only confirmed event is at buffer index 0.
- `TestForwarderUsesBufferPool`: verify that the reverse proxy `Forwarder` is configured with a `BufferPool` to reuse `io.Copy` buffers.
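For reference, a `sync.Pool`-backed buffer pool of the kind `TestForwarderUsesBufferPool` checks for might look like the sketch below. It is not the PR's implementation: the `bufferPool` type, `newBufferPool`, and the target URL are hypothetical names, and the standard library's `httputil.ReverseProxy` stands in for the `Forwarder`. Only the `httputil.BufferPool` interface (`Get() []byte`, `Put([]byte)`) and the 32 KiB size come from the standard library and the description above.

```go
package main

import (
	"fmt"
	"net/http/httputil"
	"net/url"
	"sync"
)

// bufferSize matches the 32 KiB copy buffer mentioned in the PR description.
const bufferSize = 32 * 1024

// bufferPool is a hypothetical sync.Pool-backed httputil.BufferPool.
type bufferPool struct {
	pool sync.Pool
}

func newBufferPool() *bufferPool {
	return &bufferPool{
		pool: sync.Pool{
			// Store *[]byte so that Put does not allocate a new
			// interface value for the slice header on every call.
			New: func() any {
				b := make([]byte, bufferSize)
				return &b
			},
		},
	}
}

// Get returns a full-length buffer, reusing a pooled one when available.
func (p *bufferPool) Get() []byte {
	return *p.pool.Get().(*[]byte)
}

// Put returns a buffer to the pool. Undersized buffers are dropped so
// that Get always hands out bufferSize bytes.
func (p *bufferPool) Put(b []byte) {
	if cap(b) < bufferSize {
		return
	}
	b = b[:bufferSize]
	p.pool.Put(&b)
}

// Compile-time check that bufferPool satisfies the stdlib interface.
var _ httputil.BufferPool = (*bufferPool)(nil)

func main() {
	// Hypothetical upstream address, used only to construct the proxy.
	target, err := url.Parse("http://127.0.0.1:8080")
	if err != nil {
		panic(err)
	}
	proxy := httputil.NewSingleHostReverseProxy(target)
	proxy.BufferPool = newBufferPool()

	buf := proxy.BufferPool.Get()
	fmt.Println(len(buf))
	proxy.BufferPool.Put(buf)
}
```

Because `sync.Pool` may discard idle entries during garbage collection, the pool reduces allocation churn under load without pinning memory when traffic drops.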
Add three mechanisms that cap memory growth on app-access agents handling high session volumes with stalled upload streams.

Reverse proxy buffer pool: add a `sync.Pool`-backed `httputil.BufferPool` to the reverse proxy `Forwarder`. Without a pool, every proxied request allocates a fresh 32 KiB buffer for `io.Copy` that becomes garbage immediately after the request completes. Under high concurrency this creates GC pressure.

Session writer buffer cap: add a `MaxBufferSize` config field (default 4096) to `SessionWriter`. When the internal `[]PreparedSessionEvent` buffer reaches capacity, `processEvents` stops reading from `eventsCh`, creating backpressure through the unbuffered channel back to `RecordEvent` callers. Extract a `handleStreamDone()` helper to deduplicate stream recovery logic between the backpressure and main select blocks. Fix a pre-existing off-by-one in `updateStatus` where `lastIndex > 0` prevented trimming when only one event (index 0) was confirmed.

Session chunk semaphore: add a `chunkSem` buffered channel to `ConnectionsHandler` that limits the number of concurrently active session chunks per agent to `MaxActiveSessionChunks` (default 256). A slot is acquired in `newSessionChunk` before opening the recording stream and released in `close()` after the stream shuts down. Use a `success` flag to release the slot on error paths. Log a warning when a chunk is rejected so the rejection is observable without correlating with generic request failure logs. Set `ReloadOnErr: true` on the `FnCache` so that `LimitExceeded` errors from a full semaphore are not cached for the full TTL.
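The acquire/release pattern described for the session chunk semaphore can be sketched as follows. This is an illustration under assumptions, not the code in this PR: the `handler` and `sessionChunk` types, the `openStream` callback, and `errLimitExceeded` are hypothetical stand-ins for `ConnectionsHandler`, the real chunk type, the recording-stream setup, and the `LimitExceeded` error. The `chunkSem`, `newSessionChunk`, `close`, and `success` names follow the commit message.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

// errLimitExceeded is a hypothetical stand-in for the LimitExceeded error.
var errLimitExceeded = errors.New("too many active session chunks")

// handler is a hypothetical stand-in for ConnectionsHandler.
type handler struct {
	// chunkSem is a counting semaphore: one buffered element per active chunk.
	chunkSem chan struct{}
}

func newHandler(maxActive int) *handler {
	return &handler{chunkSem: make(chan struct{}, maxActive)}
}

type sessionChunk struct {
	h         *handler
	closeOnce sync.Once
}

// newSessionChunk takes a slot without blocking and rejects the chunk
// when the semaphore is full. openStream stands in for opening the
// recording stream.
func (h *handler) newSessionChunk(openStream func() error) (*sessionChunk, error) {
	select {
	case h.chunkSem <- struct{}{}:
	default:
		return nil, errLimitExceeded
	}

	// Release the slot on every error path below.
	success := false
	defer func() {
		if !success {
			<-h.chunkSem
		}
	}()

	if err := openStream(); err != nil {
		return nil, err
	}

	success = true
	return &sessionChunk{h: h}, nil
}

// close releases the slot exactly once, even if called repeatedly.
func (c *sessionChunk) close() {
	c.closeOnce.Do(func() {
		<-c.h.chunkSem
	})
}

func main() {
	ok := func() error { return nil }
	h := newHandler(1)

	first, _ := h.newSessionChunk(ok)
	_, err := h.newSessionChunk(ok)
	fmt.Println("second chunk while full:", err)

	first.close()
	_, err = h.newSessionChunk(ok)
	fmt.Println("after close:", err)
}
```

The `success` flag keeps the release logic in one deferred function, so an early return added later cannot leak a slot.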
Under sustained app session churn with degraded I/O (e.g., IOPS-exhausted
emptyDir volumes), several unbounded resource paths cause the app agent to
OOM. This PR adds three resource limits that accept graceful degradation
(event loss, request rejection) in exchange for process survival.
- Add a `sync.Pool`-backed `BufferPool` to the reverse proxy `Forwarder`. Without it, every proxied request allocates a 32 KiB buffer for `io.Copy` that becomes garbage immediately, creating GC pressure.
- Cap the `SessionWriter` internal event buffer at `MaxBufferSize` (default 4096 events). When the upload stream stalls, the buffer previously grew without limit. When full, `processEvents` stops reading from `eventsCh`, creating backpressure so `RecordEvent` enters backoff and drops events rather than consuming unbounded memory. Extract a `handleStreamDone()` helper to deduplicate stream recovery between the backpressure and main select blocks.
- Add a `chunkSem` buffered channel to `ConnectionsHandler` that limits concurrent active chunks per agent to `MaxActiveSessionChunks` (default 256). A slot is acquired in `newSessionChunk` before opening the recording stream and released in `close()` after the stream shuts down. Set `ReloadOnErr: true` on the `FnCache` so `LimitExceeded` errors from a full semaphore are not cached for the full TTL.
- Fix an off-by-one in `SessionWriter.updateStatus` where `lastIndex > 0` prevented trimming the buffer when only one event (index 0) was confirmed. Change the condition to `lastIndex >= 0`.
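To illustrate the off-by-one, here is a minimal sketch of the trimming logic. It is a simplification, not the actual `updateStatus` body: the `event` type and the `trimConfirmed` function are hypothetical, and the sketch assumes the buffer is ordered by ascending event index and that `lastIndex` starts at -1 to mean "nothing confirmed".

```go
package main

import "fmt"

// event is a hypothetical stand-in for PreparedSessionEvent.
type event struct {
	index int64
}

// trimConfirmed drops every buffered event whose index is at or below
// the last index confirmed by the upload stream. The buffer is assumed
// to be sorted by ascending index.
func trimConfirmed(buffer []event, confirmed int64) []event {
	lastIndex := -1 // position of the last confirmed event; -1 means none
	for i := range buffer {
		if buffer[i].index > confirmed {
			break
		}
		lastIndex = i
	}
	// With `lastIndex > 0` this branch was skipped when only buffer[0]
	// was confirmed, so that event was never trimmed.
	if lastIndex >= 0 {
		buffer = buffer[lastIndex+1:]
	}
	return buffer
}

func main() {
	buffer := []event{{index: 0}, {index: 1}, {index: 2}}

	// Only the event at buffer index 0 is confirmed.
	fmt.Println(len(trimConfirmed(buffer, 0))) // prints 2

	// Nothing is confirmed yet, so nothing is trimmed.
	fmt.Println(len(trimConfirmed(buffer, -1))) // prints 3
}
```

In the single-event case the old condition left a confirmed event in the buffer until a second one was confirmed, which matters more once the buffer has a hard cap.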