
fix: reduce in-process cluster memory usage #33

Merged
huntharo merged 2 commits into main from codex/cluster-oom-fix-pr
Mar 24, 2026


@huntharo
Contributor

Summary

I added opt-in heap diagnostics for the cluster and refresh commands, then reduced peak memory usage in the in-process clustering path. That path now loads and normalizes one source kind at a time instead of materializing the full parsed embedding cache for the repo up front.
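
To illustrate the shape of the change, here is a minimal sketch of building edges one source kind at a time. It is not the code from this PR: the type names, `loadRows`, `buildEdgesStreaming`, the JSON storage format, L2 normalization, and the brute-force cosine comparison are all assumptions made for the example.

```typescript
// Hypothetical types; the real ghcrawl schema is not shown in this PR description.
type RawRow = { threadId: number; embeddingJson: string };
type Edge = { a: number; b: number; kind: string; score: number };

// Parse one stored embedding and L2-normalize it (assumed normalization).
function parseAndNormalize(json: string): Float32Array {
  const values = JSON.parse(json) as number[];
  let sumSquares = 0;
  for (const v of values) sumSquares += v * v;
  const norm = Math.sqrt(sumSquares) || 1;
  return Float32Array.from(values, (v) => v / norm);
}

// Build edges for a single source kind. The parsed vectors are local to this
// call, so they become collectable as soon as it returns.
function edgesForKind(kind: string, rows: RawRow[], minScore: number): Edge[] {
  const vectors = rows.map((row) => ({
    id: row.threadId,
    vec: parseAndNormalize(row.embeddingJson),
  }));
  const edges: Edge[] = [];
  for (let i = 0; i < vectors.length; i++) {
    for (let j = i + 1; j < vectors.length; j++) {
      const x = vectors[i].vec;
      const y = vectors[j].vec;
      let score = 0;
      for (let d = 0; d < x.length; d++) score += x[d] * y[d];
      if (score >= minScore) {
        edges.push({ a: vectors[i].id, b: vectors[j].id, kind, score });
      }
    }
  }
  return edges;
}

// Stream source kinds: load, parse, and compare one kind before touching the
// next, instead of parsing every kind into a repo-wide cache first.
function buildEdgesStreaming(
  kinds: string[],
  loadRows: (kind: string) => RawRow[],
  minScore: number,
): Edge[] {
  const edges: Edge[] = [];
  for (const kind of kinds) {
    for (const edge of edgesForKind(kind, loadRows(kind), minScore)) {
      edges.push(edge);
    }
  }
  return edges;
}
```

With this structure, peak memory for parsed vectors is bounded by the largest single source kind rather than by the sum over all kinds.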

What I changed

  • I added --heap-snapshot-dir and --heap-log-interval-ms to the CLI cluster/refresh flows so I can capture heap snapshots and memory samples during long runs.
  • I added a SIGUSR2 heap snapshot hook and lightweight memory logging for cluster investigations.
  • I changed the non-worker clustering path to stream source kinds one at a time instead of building a repo-wide parsed embedding cache before edge generation.
  • I added a regression test that asserts in-process clustering no longer populates the parsed embedding cache.
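
The diagnostics described above can be approximated with Node.js built-ins. The following is a sketch under assumptions, not the PR's implementation: `installHeapDiagnostics`, `formatMemorySample`, the log format, and the snapshot file naming are hypothetical, while `v8.writeHeapSnapshot`, `process.memoryUsage`, and the `SIGUSR2` signal event are standard Node.js APIs. The two options are meant to correspond to `--heap-snapshot-dir` and `--heap-log-interval-ms`.

```typescript
import { mkdirSync } from "node:fs";
import { join } from "node:path";
import { writeHeapSnapshot } from "node:v8";

type MemorySample = { rss: number; heapTotal: number; heapUsed: number; external: number };

const toMiB = (bytes: number): string => (bytes / (1024 * 1024)).toFixed(1);

// Hypothetical log format for one memory sample.
function formatMemorySample(label: string, m: MemorySample): string {
  return (
    `[heap] ${label} rss=${toMiB(m.rss)}MiB heapUsed=${toMiB(m.heapUsed)}MiB ` +
    `heapTotal=${toMiB(m.heapTotal)}MiB external=${toMiB(m.external)}MiB`
  );
}

// Hypothetical helper: both features are opt-in and the returned function
// removes the signal handler and stops the timer.
function installHeapDiagnostics(opts: {
  snapshotDir?: string;
  logIntervalMs?: number;
  log?: (line: string) => void;
}): () => void {
  const log = opts.log ?? ((line: string) => console.error(line));
  const cleanups: Array<() => void> = [];

  if (opts.snapshotDir) {
    const dir = opts.snapshotDir;
    mkdirSync(dir, { recursive: true });
    const onSignal = (): void => {
      // writeHeapSnapshot is synchronous and blocks the event loop while it runs.
      const name = `heap-${process.pid}-${Date.now()}.heapsnapshot`;
      const file = writeHeapSnapshot(join(dir, name));
      log(`[heap] wrote snapshot ${file}`);
    };
    process.on("SIGUSR2", onSignal);
    cleanups.push(() => process.off("SIGUSR2", onSignal));
  }

  if (opts.logIntervalMs && opts.logIntervalMs > 0) {
    const timer = setInterval(() => {
      log(formatMemorySample("sample", process.memoryUsage()));
    }, opts.logIntervalMs);
    timer.unref(); // do not keep the process alive just for logging
    cleanups.push(() => clearInterval(timer));
  }

  return () => cleanups.forEach((fn) => fn());
}
```

With a hook like this installed, running `kill -USR2 <pid>` against a long clustering run would write a snapshot into the configured directory. Writing a snapshot needs a substantial amount of extra memory itself, so it is best triggered before the process is close to its heap limit. `SIGUSR2` is not available on Windows.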

Verification

  • I ran pnpm --filter ghcrawl test.
  • I ran pnpm --filter @ghcrawl/api-core test.
  • I reproduced the OOM on origin/main with:
    • pnpm --filter ghcrawl cli cluster openclaw/openclaw --heap-snapshot-dir /tmp/ghcrawl-heaps-origin-main --heap-log-interval-ms 5000
  • I reran the same clustering flow on this branch and confirmed it progressed into repeated edge-building updates instead of aborting immediately after loading embeddings.

Notes

  • I also checked the vectorlite experiment path separately and it still shows the same class of memory problem; I left that follow-up out of this PR to keep the scope tight.

(cherry picked from commit 0308a94f6b65257c8f4a28a5a68c971b882e785b)
(cherry picked from commit 509de2e838f47cadc1934a7156b7a01846d2bc67)
@github-actions

Cluster Performance

  • Status: PASS
  • Fixture median: 445.2 ms (12 samples, 3 cluster rebuilds/sample)
  • Fixture baseline: 535.1 ms
  • Fixture delta: -89.9 ms (-16.8%)
  • Projected openclaw/openclaw duration: 8m 19.2s
  • Projected openclaw/openclaw baseline: 10m 0.0s
  • Projected delta: -100810.7 ms (-16.8%)
  • Regression threshold: +50.0%
  • Fixture shape: 512 threads x 3 source kinds
  • Sample durations: 449.5 ms, 460.1 ms, 441.0 ms, 450.4 ms, 442.2 ms, 448.2 ms, 437.4 ms, 574.0 ms, 437.3 ms, 453.2 ms, 434.7 ms, 441.1 ms

Run: workflow run for 43a9de8
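
The fixture figures in the report can be re-derived from the listed sample durations. The sketch below assumes the status is decided by comparing the percentage delta against the regression threshold and that the projection scales the projected baseline by the fixture ratio; neither rule is stated in the report, so treat both as guesses that happen to be consistent with the numbers shown.

```typescript
// Values copied from the report above.
const samplesMs = [449.5, 460.1, 441.0, 450.4, 442.2, 448.2, 437.4, 574.0, 437.3, 453.2, 434.7, 441.1];
const baselineMs = 535.1;
const thresholdPct = 50.0;
const projectedBaselineMs = 10 * 60 * 1000; // 10m 0.0s

// Median of a list; for an even count, the mean of the two middle values.
function median(values: number[]): number {
  const sorted = [...values].sort((a, b) => a - b);
  const mid = Math.floor(sorted.length / 2);
  return sorted.length % 2 === 0 ? (sorted[mid - 1] + sorted[mid]) / 2 : sorted[mid];
}

const medianMs = median(samplesMs);             // about 445.2 ms
const deltaMs = medianMs - baselineMs;          // about -89.9 ms
const deltaPct = (deltaMs / baselineMs) * 100;  // about -16.8 %

// Assumed rule: fail only when the slowdown exceeds the threshold.
const status = deltaPct > thresholdPct ? "FAIL" : "PASS";

// Assumed projection: scale the projected baseline by the fixture ratio.
const projectedMs = projectedBaselineMs * (medianMs / baselineMs);

console.log(medianMs.toFixed(1), deltaMs.toFixed(1), deltaPct.toFixed(1), status, Math.floor(projectedMs / 1000));
```

Because the median is insensitive to the single 574.0 ms outlier, that sample does not move the reported result. The projection computed from the rounded samples lands at roughly 8m 19s, in line with the report; the few-millisecond difference from the reported projected delta presumably comes from the report using unrounded sample values.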

@huntharo huntharo merged commit 7fb20f1 into main Mar 24, 2026
8 checks passed
@huntharo huntharo deleted the codex/cluster-oom-fix-pr branch March 24, 2026 19:02
