
app: Bound memory growth under high session load #64595

Draft
juliaogris wants to merge 2 commits into master from julia/app/memory-bounds

juliaogris commented Mar 13, 2026

Under sustained app session churn with degraded I/O (e.g., IOPS-exhausted
emptyDir volumes), several unbounded resource paths cause the app agent to
OOM. This PR adds three resource limits that accept graceful degradation
(event loss, request rejection) in exchange for process survival.

  • Add a sync.Pool-backed BufferPool to the reverse proxy Forwarder.
    Without it, every proxied request allocates a 32 KiB buffer for io.Copy
    that becomes garbage immediately, creating GC pressure.
  • Cap the SessionWriter internal event buffer at MaxBufferSize (default
    4096 events). When the upload stream stalls, the buffer previously grew
    without limit. When full, processEvents stops reading from eventsCh,
    creating backpressure so RecordEvent enters backoff and drops events
    rather than consuming unbounded memory. Extract a handleStreamDone()
    helper to deduplicate stream recovery between the backpressure and main
    select blocks.
  • Add a chunkSem buffered channel to ConnectionsHandler that limits
    concurrent active chunks per agent to MaxActiveSessionChunks (default
    256). A slot is acquired in newSessionChunk before opening the recording
    stream and released in close() after the stream shuts down. Set
    ReloadOnErr: true on the FnCache so LimitExceeded errors from a full
    semaphore are not cached for the full TTL.
  • Fix a pre-existing off-by-one in SessionWriter.updateStatus where
    lastIndex > 0 prevented trimming the buffer when only one event (index 0)
    was confirmed. Change the condition to lastIndex >= 0.
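
The off-by-one in the last bullet can be illustrated with a minimal sketch. The `event` type and the `confirmedPos` and `trimConfirmed` helpers are hypothetical stand-ins, not the actual `SessionWriter` code; the sketch assumes `lastIndex` is a buffer position that starts at -1 when nothing is confirmed.

```go
package main

import "fmt"

// event is a hypothetical stand-in for a buffered session event.
type event struct{ index int64 }

// confirmedPos returns the buffer position of the last event whose index is
// at or below lastUploaded, or -1 if no buffered event has been confirmed.
func confirmedPos(buf []event, lastUploaded int64) int {
	last := -1
	for i, e := range buf {
		if e.index > lastUploaded {
			break
		}
		last = i
	}
	return last
}

// trimConfirmed drops every confirmed event from the front of the buffer.
// With `last > 0` a lone confirmed event at position 0 would never be
// trimmed; `last >= 0` covers that case.
func trimConfirmed(buf []event, lastUploaded int64) []event {
	if last := confirmedPos(buf, lastUploaded); last >= 0 {
		return buf[last+1:]
	}
	return buf
}

func main() {
	buf := []event{{index: 0}}
	fmt.Println(len(trimConfirmed(buf, 0))) // prints 0
}
```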

juliaogris added the no-test-plan label (Bypasses the test plan validation bot) Mar 13, 2026
juliaogris force-pushed the julia/app/memory-bounds branch 8 times, most recently from 30bc2b7 to 174b143 March 13, 2026 05:20

Add tests that exercise the three memory-bounding mechanisms before
the implementation commits land. Each test verifies a specific
invariant:

- `TestSessionChunkSemaphore`: verify that `close()` releases the
  semaphore slot both in the normal case and when force-closing with
  in-flight requests.

- `TestMaxActiveSessionChunksDefault`: verify that
  `CheckAndSetDefaults` sets `MaxActiveSessionChunks` to
  `DefaultMaxActiveSessionChunks` when the caller leaves it at zero.

- `TestSessionWriterConfigMaxBufferSize`: verify that the
  `MaxBufferSize` config defaults to `DefaultMaxBufferSize` when unset
  and preserves an explicit value.

- `TestUpdateStatusTrimsAtIndexZero`: verify that `updateStatus` trims
  the buffer when the only confirmed event is at buffer index 0.

- `TestForwarderUsesBufferPool`: verify that the reverse proxy
  `Forwarder` is configured with a `BufferPool` to reuse io.Copy
  buffers.

Add three mechanisms that cap memory growth on app-access agents
handling high session volumes with stalled upload streams.

Reverse proxy buffer pool: add a `sync.Pool`-backed
`httputil.BufferPool` to the reverse proxy `Forwarder`. Without a
pool, every proxied request allocates a fresh 32 KiB buffer for
`io.Copy` that becomes garbage immediately after the request
completes. Under high concurrency this creates GC pressure.
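
A minimal sketch of such a pool, using the standard library `httputil.ReverseProxy` rather than the actual `Forwarder` type. The names `bufferPool`, `newBufferPool`, and `newProxy` and the target URL are assumptions made for illustration.

```go
package main

import (
	"fmt"
	"net/http/httputil"
	"net/url"
	"sync"
)

// bufferSize matches the 32 KiB buffer that io.Copy allocates by default.
const bufferSize = 32 * 1024

// bufferPool adapts sync.Pool to the httputil.BufferPool interface.
type bufferPool struct {
	pool sync.Pool
}

func newBufferPool() *bufferPool {
	return &bufferPool{
		pool: sync.Pool{
			// Store *[]byte so that pooling does not copy the slice
			// header into a fresh interface value on every Get.
			New: func() any {
				b := make([]byte, bufferSize)
				return &b
			},
		},
	}
}

// Get returns a buffer from the pool, allocating one if the pool is empty.
func (p *bufferPool) Get() []byte { return *p.pool.Get().(*[]byte) }

// Put hands a buffer back to the pool for reuse by a later request.
func (p *bufferPool) Put(b []byte) { p.pool.Put(&b) }

// Compile-time check that bufferPool satisfies httputil.BufferPool.
var _ httputil.BufferPool = (*bufferPool)(nil)

// newProxy builds a reverse proxy whose copy loop reuses pooled buffers.
func newProxy(target *url.URL) *httputil.ReverseProxy {
	proxy := httputil.NewSingleHostReverseProxy(target)
	proxy.BufferPool = newBufferPool()
	return proxy
}

func main() {
	target, err := url.Parse("http://127.0.0.1:8080") // placeholder upstream
	if err != nil {
		panic(err)
	}
	proxy := newProxy(target)
	buf := proxy.BufferPool.Get()
	fmt.Println(len(buf) == bufferSize)
	proxy.BufferPool.Put(buf)
}
```

Because `sync.Pool` may drop idle entries during garbage collection, the pool does not pin memory when traffic subsides.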

Session writer buffer cap: add a `MaxBufferSize` config field
(default 4096) to `SessionWriter`. When the internal
`[]PreparedSessionEvent` buffer reaches capacity, `processEvents`
stops reading from `eventsCh`, creating backpressure through the
unbuffered channel back to `RecordEvent` callers. Extract a
`handleStreamDone()` helper to deduplicate stream recovery logic
between the backpressure and main select blocks. Fix a pre-existing
off-by-one in `updateStatus` where `lastIndex > 0` prevented
trimming when only one event (index 0) was confirmed.
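
The backpressure idea can be sketched as follows. This is a simplified illustration, not the `SessionWriter` implementation: it disables the receive case with a nil channel instead of using two select blocks, it replaces the backoff in `RecordEvent` with a fixed send timeout, and `writer`, `ackCh`, and `sendTimeout` are hypothetical names.

```go
package main

import (
	"fmt"
	"time"
)

// sendTimeout is a hypothetical stand-in for the caller-side backoff.
const sendTimeout = 100 * time.Millisecond

type writer struct {
	maxBufferSize int
	buffer        []int
	eventsCh      chan int      // unbuffered: senders block until received
	ackCh         chan int      // hypothetical: count of confirmed events
	done          chan struct{} // closed to stop processEvents
}

func newWriter(maxBufferSize int) *writer {
	return &writer{
		maxBufferSize: maxBufferSize,
		eventsCh:      make(chan int),
		ackCh:         make(chan int),
		done:          make(chan struct{}),
	}
}

// processEvents stops receiving new events while the buffer is full.
func (w *writer) processEvents() {
	for {
		in := w.eventsCh
		if len(w.buffer) >= w.maxBufferSize {
			in = nil // a receive on a nil channel never proceeds
		}
		select {
		case ev := <-in:
			w.buffer = append(w.buffer, ev)
		case n := <-w.ackCh:
			if n > len(w.buffer) {
				n = len(w.buffer)
			}
			w.buffer = w.buffer[n:] // trim confirmed events
		case <-w.done:
			return
		}
	}
}

// recordEvent reports whether the event was accepted; it gives up and
// drops the event if the writer does not receive it within sendTimeout.
func (w *writer) recordEvent(ev int) bool {
	select {
	case w.eventsCh <- ev:
		return true
	case <-time.After(sendTimeout):
		return false
	}
}

func main() {
	w := newWriter(2)
	go w.processEvents()
	fmt.Println(w.recordEvent(1), w.recordEvent(2)) // buffer has room
	fmt.Println(w.recordEvent(3))                   // buffer is full
	w.ackCh <- 1                                    // one event confirmed
	fmt.Println(w.recordEvent(4))                   // room again
	close(w.done)
}
```

Since the channel is unbuffered, a full buffer is felt directly by the sender, so memory stays bounded by `maxBufferSize` at the cost of dropped events.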

Session chunk semaphore: add a `chunkSem` buffered channel to
`ConnectionsHandler` that limits the number of concurrently active
session chunks per agent to `MaxActiveSessionChunks` (default 256).
A slot is acquired in `newSessionChunk` before opening the recording
stream and released in `close()` after the stream shuts down. Use a
`success` flag to release the slot on error paths. Log a warning
when a chunk is rejected so the rejection is observable without
correlating with generic request failure logs. Set
`ReloadOnErr: true` on the `FnCache` so that `LimitExceeded` errors
from a full semaphore are not cached for the full TTL.
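
The acquire and release pattern described above might look roughly like this. The sketch is not the `ConnectionsHandler` code: `handler`, `chunk`, `errLimitExceeded`, `errOpenFailed`, and the `open` callback are hypothetical stand-ins, and the warning log and `FnCache` wiring are omitted.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

// Hypothetical stand-ins for the LimitExceeded error and a stream failure.
var (
	errLimitExceeded = errors.New("too many active session chunks")
	errOpenFailed    = errors.New("recording stream unavailable")
)

type handler struct {
	chunkSem chan struct{} // buffered channel used as a counting semaphore
}

func newHandler(maxActive int) *handler {
	return &handler{chunkSem: make(chan struct{}, maxActive)}
}

type chunk struct {
	release   func()
	closeOnce sync.Once
}

// newSessionChunk takes a slot before opening the stream and gives it back
// on every error path via the success flag.
func (h *handler) newSessionChunk(open func() error) (*chunk, error) {
	select {
	case h.chunkSem <- struct{}{}:
	default:
		return nil, errLimitExceeded // reject instead of queueing
	}
	success := false
	defer func() {
		if !success {
			<-h.chunkSem
		}
	}()
	if err := open(); err != nil {
		return nil, err
	}
	success = true
	return &chunk{release: func() { <-h.chunkSem }}, nil
}

// close releases the slot exactly once, even if called repeatedly.
func (c *chunk) close() { c.closeOnce.Do(c.release) }

func main() {
	h := newHandler(1)
	ok := func() error { return nil }
	first, err := h.newSessionChunk(ok)
	fmt.Println("first:", err)
	_, err = h.newSessionChunk(ok)
	fmt.Println("second:", err)
	first.close()
	_, err = h.newSessionChunk(func() error { return errOpenFailed })
	fmt.Println("failed open:", err)
	_, err = h.newSessionChunk(ok)
	fmt.Println("after failure:", err)
}
```

The non-blocking acquire means callers fail fast when the limit is reached rather than piling up behind the semaphore.
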
juliaogris force-pushed the julia/app/memory-bounds branch from 174b143 to b082b62 March 13, 2026 05:31