Skip to content

fix(sync-service): Reduce memory retention in Bandit handler processes#3955

Draft
alco wants to merge 3 commits intomainfrom
alco/garbage-collect-live-requests
Draft

fix(sync-service): Reduce memory retention in Bandit handler processes#3955
alco wants to merge 3 commits intomainfrom
alco/garbage-collect-live-requests

Conversation

@alco
Copy link
Member

@alco alco commented Mar 4, 2026

Summary

  • Garbage collect before blocking in long-poll receive to release accumulated garbage
  • Add ELECTRIC_TWEAKS_HANDLER_FULLSWEEP_AFTER config to control fullsweep frequency in handler processes

Problem

Bandit handler processes can grow their heap to 8MB across multiple requests while live heap remains small. When these processes block in receive waiting for shape changes (up to 40s), they hold onto garbage memory.

Solution

  1. Call :erlang.garbage_collect() before entering the receive block
  2. Allow configuring fullsweep_after spawn option for handler processes via ELECTRIC_TWEAKS_HANDLER_FULLSWEEP_AFTER env var (e.g., set to 5-10 to force more frequent full GC sweeps)

Test plan

  • Manual verification in production environment
  • Monitor handler process memory with different ELECTRIC_TWEAKS_HANDLER_FULLSWEEP_AFTER values

🤖 Generated with Claude Code

alco and others added 2 commits March 4, 2026 17:56
Bandit reuses handler processes across requests (up to conn_max_requests).
These processes can accumulate garbage from previous requests or from
building the current response. Before blocking in receive for up to
long_poll_timeout seconds (40s default), run garbage collection so we
don't hold onto memory while idle.

This addresses observed behavior in production where handler processes
grow their heap to 8MB but have small live heap, indicating accumulated
garbage.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Add configuration to control the fullsweep_after spawn option for Bandit
handler processes. This helps address memory accumulation in long-lived
handler processes by forcing a full GC sweep after N minor collections.

Bandit reuses handler processes across requests (capped by conn_max_requests).
BEAM's generational GC only does fullsweep after many minor collections by
default. When handler processes block in receive waiting for shape changes,
they can hold onto old-heap garbage for the full long_poll_timeout (40s).

Setting handler_fullsweep_after to a low value (e.g., 5-10) increases GC
frequency but ensures old-heap garbage is reclaimed sooner. This trades
some CPU overhead for lower memory footprint.

Usage: ELECTRIC_TWEAKS_HANDLER_FULLSWEEP_AFTER=10

Reference: https://www.erlang.org/doc/apps/erts/erlang.html#spawn_opt/4

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
@codecov
Copy link

codecov bot commented Mar 4, 2026

❌ 1 Tests Failed:

Tests completed Failed Passed Skipped
2488 1 2487 1
View the top 2 failed test(s) by shortest run time
Elixir.Electric.ShapeCacheTest::test get_or_create_shape_handle/2 should not return the same shape_handle for different shapes despite hash collision
Stack Traces | 0.806s run time
35) test get_or_create_shape_handle/2 should not return the same shape_handle for different shapes despite hash collision (Electric.ShapeCacheTest)
     test/electric/shape_cache_test.exs:102
     Assertion failed, no matching message after 400ms
     The following variables were pinned:
       handle = "1234-1772644363396349"
     The process mailbox is empty.
     code: assert_receive {:snapshot, ^handle, _snapshotter_pid}
     stacktrace:
       test/electric/shape_cache_test.exs:1330: anonymous fn/1 in Electric.ShapeCacheTest.wait_snapshot/1
       (elixir 1.19.1) lib/enum.ex:1688: Enum."-map/2-lists^map/1-1-"/2
       test/electric/shape_cache_test.exs:1310: Electric.ShapeCacheTest.wait_shape_init/2
       test/electric/shape_cache_test.exs:124: (test)
Elixir.Electric.Shapes.ConsumerTest::test transaction handling with real storage transactions are buffered until snapshot xmin is known
Stack Traces | 1.54s run time
26) test transaction handling with real storage transactions are buffered until snapshot xmin is known (Electric.Shapes.ConsumerTest)
     .../electric/shapes/consumer_test.exs:790
     match (=) failed
     code:  assert [
              {Storage, :append_to_log!, [log_items_txn2, _]},
              {Storage, :append_to_log!, [log_items_txn4, _]},
              {Storage, :append_to_log!, [log_items_txn5, _]}
            ] = Support.StorageTracer.collect_traced_calls()
     left:  [
              {Electric.ShapeCache.Storage, :append_to_log!, [log_items_txn2, _]},
              {Electric.ShapeCache.Storage, :append_to_log!, [log_items_txn4, _]},
              {Electric.ShapeCache.Storage, :append_to_log!, [log_items_txn5, _]}
            ]
     right: []
     stacktrace:
       .../electric/shapes/consumer_test.exs:896: (test)

To view more test analytics, go to the Test Analytics Dashboard
📋 Got 3 mins? Take this short survey to help us improve Test Analytics.

@alco alco added the claude label Mar 4, 2026
…aks schema

The NimbleOptions schema for tweaks was missing the new
handler_fullsweep_after option, causing startup failures.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@claude
Copy link

claude bot commented Mar 4, 2026

Claude Code Review

Summary

This PR reduces memory retention in Bandit handler processes by (1) calling :erlang.garbage_collect() before entering a long-poll receive block, and (2) adding a ELECTRIC_TWEAKS_HANDLER_FULLSWEEP_AFTER env var to control the fullsweep_after GC spawn option for those processes. The approach is sound and well-targeted at a real production memory concern.

What's Working Well

  • Correct diagnosis and fix: Calling GC before a blocking receive is a well-established BEAM pattern. Handler processes can hold large binary references or accumulated heap from processing previous requests, and triggering a collection before a 40s idle window is exactly the right moment.
  • Good inline documentation: The comments in api.ex and config.ex explain the why clearly — Bandit reusing handler processes, generational GC not fullsweeping, etc. This is appreciated for a non-obvious optimization.
  • Consistent with existing tweaks pattern: The new config follows the same shape as ELECTRIC_TWEAKS_CONN_MAX_REQUESTS and ELECTRIC_TWEAKS_HTTP_API_NUM_ACCEPTORS, including NimbleOptions schema registration in StackSupervisor.
  • Minimal change: Only 35 additions across 5 files. No unnecessary refactoring.

Issues Found

Critical (Must Fix)

None.

Important (Should Fix)

Missing changeset file

packages/sync-service is a publishable package. This PR adds a new env var (ELECTRIC_TWEAKS_HANDLER_FULLSWEEP_AFTER) which is a visible operational change that users may want to know about. A .changeset/*.md file should be included.

No linked issue

The PR has no linked issue. The problem described (Bandit handler heap retention during long-polls) would benefit from an issue documenting the observed memory behaviour, reproduction steps, and any metrics. This helps with future reference and allows others to verify the fix.

Failing CI test

The CI run shows a test failure in Electric.ShapeCacheTest — the process was killed, suggesting a timeout-related race. While likely pre-existing and unrelated, it is worth confirming before merging, particularly since this PR changes GC behaviour which can affect process timing.

Suggestions (Nice to Have)

:pos_integer type excludes fullsweep_after: 0

fullsweep_after: 0 is a valid Erlang option meaning fullsweep on every minor GC. The :pos_integer NimbleOptions type silently rejects it. Since 0 would be a pathological setting in practice, :pos_integer is defensible, but a comment explaining the exclusion would help future readers.

ti_opts flow is slightly indirect

handler_fullsweep_after travels: tweaks -> ti_opts -> merged into opts via opts ++ ti_opts -> read back via Keyword.get(opts, :handler_fullsweep_after) in thousand_island_options/1. This works correctly but is non-obvious. Consider passing the value more directly if thousand_island_options grows further, though this is a minor style concern not worth addressing in isolation.

Issue Conformance

No linked issue. The PR description is clear about the problem and solution, but a linked issue would be better for traceability.

Previous Review Status

This is the first review.


Review iteration: 1 | 2026-03-04

@blacksmith-sh
Copy link
Contributor

blacksmith-sh bot commented Mar 4, 2026

Found 2 test failures on Blacksmith runners:

Failures

Test View Logs
Elixir.Electric.ShapeCacheTest/test get_or_create_shape_handle/
2 should not return the same shape_handle for different shapes despite hash collision
View Logs
Elixir.Electric.Shapes.ConsumerTest/
test transaction handling with real storage transactions are buffered until snapshot xm
in is known
View Logs

Fix in Cursor

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant