Contributor

@ScotTFO ScotTFO commented Jan 26, 2026

Summary

Fixes a critical memory leak on Windows in which Node.js processes spawned during auto mode persist indefinitely and eventually consume all available memory. Orphaned processes remain even after features are stopped, for three reasons:

  1. Windows process tree issue: Processes spawned via cmd /c or sh.exe create process trees in which killing the parent does not kill the children
  2. stopFeature() didn't wait for cleanup: It called abort() and immediately removed the feature from the Map, leaving async streams running
  3. No startup cleanup: Processes orphaned by previously crashed sessions were never cleaned up

Changes

New: Cross-platform process tree management (libs/platform/src/process-utils.ts)

  • killProcessTree(pid, signal) - Uses tree-kill package to terminate entire process trees
  • waitForProcessExit(pid, timeoutMs) - Polls for process exit with timeout
  • forceKillProcessTree(pid, gracePeriodMs) - SIGTERM → SIGKILL escalation (Windows uses SIGKILL immediately)
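
The polling and escalation behaviour listed above can be sketched roughly as follows. This is a minimal illustration for a single process, not the code in process-utils.ts: it does not use tree-kill, and the names isProcessAlive, waitForExit and terminateWithEscalation, along with the pollMs parameter and the one-second final wait, are hypothetical.

```typescript
/** Returns true if a process with this pid currently exists. */
function isProcessAlive(pid: number): boolean {
  try {
    // Signal 0 only checks for existence; nothing is delivered to the process.
    process.kill(pid, 0);
    return true;
  } catch (err) {
    // EPERM means the process exists but we are not allowed to signal it.
    return (err as NodeJS.ErrnoException).code === 'EPERM';
  }
}

/** Polls until the process is gone or timeoutMs elapses; resolves true if it exited. */
async function waitForExit(pid: number, timeoutMs: number, pollMs = 50): Promise<boolean> {
  const deadline = Date.now() + timeoutMs;
  while (Date.now() < deadline) {
    if (!isProcessAlive(pid)) return true;
    await new Promise<void>((resolve) => setTimeout(resolve, pollMs));
  }
  return !isProcessAlive(pid);
}

/** Sends SIGTERM, waits for the grace period, then falls back to SIGKILL. */
async function terminateWithEscalation(pid: number, gracePeriodMs: number): Promise<boolean> {
  if (!isProcessAlive(pid)) return true;
  try {
    process.kill(pid, 'SIGTERM');
  } catch {
    // The process may have exited between the check and the signal.
  }
  if (await waitForExit(pid, gracePeriodMs)) return true;
  try {
    process.kill(pid, 'SIGKILL');
  } catch {
    // Already gone.
  }
  // Hypothetical final wait of one second for the kill to take effect.
  return waitForExit(pid, 1000);
}
```

Note that this sketch only signals one pid. Terminating the children as well is the part the PR delegates to tree-kill.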

Fix: Graceful feature cleanup (apps/server/src/services/auto-mode-service.ts)

  • Added completionPromise/signalCompletion to RunningFeature for tracking when execution actually finishes
  • stopFeature() now waits for completion with configurable timeout before force-releasing
  • Added cleanupOrphanedProcesses() for per-feature Node.js process cleanup on Windows
  • Added cleanupOrphanedProcessesOnStartup() to kill leftover processes from previous sessions
  • Added initialize() method called at server startup
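
As a rough illustration of the completion tracking described above, the sketch below pairs a promise with its resolver and waits on it with a timeout. Only the names completionPromise and signalCompletion come from this PR; createCompletion, waitForCompletion and the boolean return value are assumptions made for the example.

```typescript
interface Completion {
  completionPromise: Promise<void>;
  signalCompletion: () => void;
}

/** Creates a promise plus the function that resolves it. */
function createCompletion(): Completion {
  let signalCompletion: () => void = () => {};
  const completionPromise = new Promise<void>((resolve) => {
    signalCompletion = resolve;
  });
  return { completionPromise, signalCompletion };
}

/** Resolves true if completion is signalled within timeoutMs, false otherwise. */
async function waitForCompletion(
  completionPromise: Promise<void>,
  timeoutMs: number
): Promise<boolean> {
  let timeoutId: ReturnType<typeof setTimeout> | undefined;
  const timedOut = new Promise<false>((resolve) => {
    timeoutId = setTimeout(() => resolve(false), timeoutMs);
  });
  try {
    return await Promise.race([completionPromise.then(() => true as const), timedOut]);
  } finally {
    // Always clear the timer so it cannot fire after the race has settled.
    if (timeoutId !== undefined) clearTimeout(timeoutId);
  }
}
```

In this shape, an execute method would call signalCompletion() from a finally block, and a stop routine would call abort() and then await waitForCompletion(...) before releasing the feature, force-releasing when the result is false.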

Fix: Windows process termination in services

  • dev-server-service.ts - Uses killProcessTree instead of process.kill('SIGTERM') on Windows
  • codex-app-server-service.ts - Uses killProcessTree instead of process.kill('SIGTERM') on Windows
  • subprocess.ts - Uses killProcessTree for abort handling and timeout kills on Windows

Other

  • stop-feature.ts route accepts optional waitForCleanup parameter
  • apps/server/src/index.ts calls autoModeService.initialize() at startup
  • Added tree-kill dependency to @automaker/platform
  • Added unit test coverage for killProcessTree in dev-server-service tests

Test plan

  • Unit tests pass (dev-server-service.test.ts updated with killProcessTree mock)
  • Start auto mode on Windows with MCP servers enabled
  • Run a feature, then stop it — verify no orphaned node.exe processes in Task Manager
  • Stop auto mode entirely — all processes should terminate
  • Monitor memory over time — should remain stable
  • Verify Unix behavior unchanged (uses SIGTERM, no tree-kill)

Platform notes

  • On Windows: Uses tree-kill with SIGKILL (force) since SIGTERM isn't reliably supported
  • On Unix/macOS: Uses standard SIGTERM → SIGKILL escalation
  • PowerShell output parsing handles edge cases ("null" string output, empty results)
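
The output-parsing edge cases can be illustrated with a small sketch. It assumes the PowerShell command pipes its results through ConvertTo-Json and that each entry carries a numeric ProcessId; the function name parseProcessList and the ProcessInfo shape are hypothetical and may not match the parsing in auto-mode-service.ts.

```typescript
interface ProcessInfo {
  ProcessId: number;
  CommandLine?: string | null;
}

/** Normalises PowerShell JSON output into an array, tolerating empty and "null" output. */
function parseProcessList(stdout: string): ProcessInfo[] {
  const text = stdout.trim();
  // No output at all, or the literal string "null" when nothing matched.
  if (text === '' || text === 'null') return [];

  let parsed: unknown;
  try {
    parsed = JSON.parse(text);
  } catch {
    // Unparseable output is treated as "no processes" in this sketch.
    return [];
  }
  if (parsed === null || parsed === undefined) return [];

  // ConvertTo-Json emits a bare object rather than an array for a single result.
  const items: unknown[] = Array.isArray(parsed) ? parsed : [parsed];
  return items.filter(
    (item): item is ProcessInfo =>
      typeof item === 'object' &&
      item !== null &&
      typeof (item as ProcessInfo).ProcessId === 'number'
  );
}
```

Returning an empty array in every "nothing found" case lets the caller iterate without separate null checks.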

🤖 Generated with Claude Code

Summary by CodeRabbit

  • New Features

    • Startup initialization to perform process cleanup on Windows
    • Stop action can optionally wait for cleanup completion
  • Bug Fixes

    • Improved orphaned process cleanup and more reliable process-tree termination across services
  • Chores

    • Added cross-platform process-tree management utilities
  • Tests

    • Updated tests to cover cross-platform process termination behavior

ScotTFO and others added 3 commits January 26, 2026 08:49
On Windows, child processes spawned via cmd.exe or sh.exe don't terminate
when the parent process is killed (unlike Unix where SIGTERM propagates).
This caused dev servers and MCP processes to accumulate, consuming 30GB+
memory over time.

Changes:
- Add tree-kill integration for cross-platform process tree termination
- Add cleanupOrphanedProcesses() to kill lingering processes when features complete
- Add startup cleanup to catch orphans from previous runs
- Fix PowerShell pattern matching (no backslash escaping needed for -like)
- Fix PowerShell script syntax (semicolons required for single-line execution)
- Apply tree-kill to subprocess.ts, dev-server-service, codex-app-server-service
- Add completion tracking for graceful feature cleanup on stop
- Add pid to mock process in tests for tree-kill condition

Co-Authored-By: Claude Opus 4.5 <[email protected]>

- Create completionPromise and signalCompletion in acquireRunningFeature
- Call signalCompletion in finally blocks of all execute methods
- Fixes CodeRabbit review: completionPromise was defined but never wired

Co-Authored-By: Claude Opus 4.5 <[email protected]>

PowerShell returns the literal string "null" when no processes match,
which JSON.parse accepts as valid JSON (returning null). Added check
for this case to prevent null reference errors.

Co-Authored-By: Claude Opus 4.5 <[email protected]>

coderabbitai bot commented Jan 26, 2026

📝 Walkthrough

Adds cross-platform process-tree management and Windows-specific orphan cleanup; exposes AutoModeService.initialize() for startup cleanup; adds per-feature completion signaling and an enhanced stopFeature API that can wait (with timeout) for cleanup before releasing features.

Changes

| Cohort / File(s) | Summary |
|---|---|
| Platform process utilities: libs/platform/package.json, libs/platform/src/index.ts, libs/platform/src/process-utils.ts | New process-utils module and exports: killProcessTree, waitForProcessExit, forceKillProcessTree; adds tree-kill dependency and cross-platform escalation/wait logic. |
| Subprocess integration: libs/platform/src/subprocess.ts | Use killProcessTree and a cross-platform killProcess helper for aborts, timeouts, and finally paths; replace direct SIGTERM with platform-aware process-tree kills on Windows. |
| AutoMode lifecycle & API: apps/server/src/index.ts, apps/server/src/services/auto-mode-service.ts, apps/server/src/routes/auto-mode/routes/stop-feature.ts | Add AutoModeService.initialize() and call it at startup; add RunningFeature completionPromise/signalCompletion; implement cleanupOrphanedProcesses() (Windows); update stopFeature(featureId, waitForCleanup = true, timeoutMs = 10000) to optionally await feature completion before release. |
| Service-level cleanup: apps/server/src/services/codex-app-server-service.ts, apps/server/src/services/dev-server-service.ts | Replace direct SIGTERM-only kills with platform-aware tree-kill on Windows via killProcessTree, retaining SIGTERM on POSIX; ensure child process trees are terminated in success/finally paths. |
| Tests: apps/server/tests/unit/services/dev-server-service.test.ts | Mock killProcessTree, add pid to mock processes, isolate HOSTNAME env, and update assertions to handle Windows vs POSIX kill behavior. |

Sequence Diagram(s)

sequenceDiagram
    actor Server
    participant AutoModeService
    participant ProcessCleanup
    participant RunningFeature

    Server->>AutoModeService: initialize()
    activate AutoModeService
    AutoModeService->>ProcessCleanup: cleanupOrphanedProcesses()
    ProcessCleanup-->>AutoModeService: cleanup complete
    deactivate AutoModeService

    Server->>AutoModeService: startFeature(featureId)
    activate AutoModeService
    AutoModeService->>RunningFeature: create (completionPromise)
    RunningFeature->>RunningFeature: run feature
    RunningFeature-->>AutoModeService: signalCompletion()
    AutoModeService-->>Server: feature finished
    deactivate AutoModeService
sequenceDiagram
    actor Client
    participant StopFeatureAPI
    participant AutoModeService
    participant RunningFeature
    participant ProcessCleanup

    Client->>StopFeatureAPI: POST /stop-feature (waitForCleanup=true)
    activate StopFeatureAPI
    StopFeatureAPI->>AutoModeService: stopFeature(featureId, true, 10000)
    activate AutoModeService
    AutoModeService->>RunningFeature: abort execution
    AutoModeService->>RunningFeature: await completionPromise (timeout)
    alt completion within timeout
        RunningFeature-->>AutoModeService: resolved
        AutoModeService->>ProcessCleanup: cleanup resources
        ProcessCleanup-->>AutoModeService: done
        AutoModeService-->>StopFeatureAPI: true
    else timeout / not resolved
        AutoModeService->>AutoModeService: force release running feature
        AutoModeService-->>StopFeatureAPI: false
    end
    deactivate AutoModeService
    StopFeatureAPI-->>Client: 200/timeout response
    deactivate StopFeatureAPI

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Suggested labels

Testers-Requested, Do Not Merge, Performance

Suggested reviewers

  • Shironex

Poem

🐇 I hopped in at server start,
Found stray nodes that broke my heart,
I trimmed the trees on Windows side,
Sent completion signals far and wide,
Now processes rest—I've done my part.

🚥 Pre-merge checks | ✅ 3
✅ Passed checks (3 passed)
| Check name | Status | Explanation |
|---|---|---|
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The title clearly and specifically identifies the main problem being fixed: preventing a Windows memory leak from orphaned Node.js processes, which aligns with the primary objective of the changeset. |
| Docstring Coverage | ✅ Passed | Docstring coverage is 85.71% which is sufficient. The required threshold is 80.00%. |

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 1

🤖 Fix all issues with AI agents
In `@apps/server/src/services/auto-mode-service.ts`:
- Around line 103-120: The PowerShell command in psCommand interpolates
workDir-derived strings directly (normalizedDir, unixStyleDir) which allows
single quotes or wildcard chars to break or over-match the -like filters; fix by
escaping those values before embedding: implement an escape routine used to
produce escapedNormalizedDir and escapedUnixStyleDir that doubles single quotes
and escapes PowerShell wildcard/special characters (*, ?, [, ]) (e.g., prefix
with backtick or otherwise neutralize them for -like) and then use those escaped
variables in psCommand (retain processFilter and the Get-CimInstance |
Where-Object ... construction).
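
The escaping described in this comment could look something like the sketch below. The function name escapeForPowerShellLike is hypothetical, and the sketch assumes the value ends up inside a single-quoted PowerShell string used as a -like pattern. Windows reserves * and ? in file names but permits single quotes and square brackets, so those are the cases most likely to occur in a real path. A literal backtick in the path is not handled here.

```typescript
/**
 * Escapes a literal string for use inside a single-quoted PowerShell -like pattern.
 * Wildcard characters are prefixed with a backtick, and single quotes are doubled.
 */
function escapeForPowerShellLike(value: string): string {
  return value
    // Neutralise the wildcard characters *, ?, [ and ] for -like.
    .replace(/[*?\[\]]/g, (ch) => '`' + ch)
    // Double single quotes so the value cannot end the quoted string early.
    .replace(/'/g, "''");
}

// Example: build a prefix-match pattern from a directory path.
const dir = "C:\\work\\it's[1]";
const pattern = `'*${escapeForPowerShellLike(dir)}*'`;
console.log(pattern);
```

Backslashes are left untouched, which is consistent with the commit note that -like needs no backslash escaping.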
🧹 Nitpick comments (5)
apps/server/src/services/dev-server-service.ts (1)

597-606: Consider using forceKillProcessTree on Unix for robustness.

The current implementation uses SIGTERM on Unix, which may leave processes running if they ignore the signal. The forceKillProcessTree utility was added specifically to handle escalation from SIGTERM to SIGKILL. This would provide more consistent cleanup behavior across platforms.

♻️ Optional: Use forceKillProcessTree for consistent cleanup
-import { killProcessTree } from '@automaker/platform';
+import { killProcessTree, forceKillProcessTree } from '@automaker/platform';
     // Kill the process tree (important on Windows where child processes aren't auto-terminated)
     if (server.process && !server.process.killed && server.process.pid) {
       if (IS_WINDOWS) {
         // Use tree-kill to terminate the entire process tree
         // This prevents orphaned child processes (e.g., Next.js start-server.js)
         await killProcessTree(server.process.pid);
       } else {
-        server.process.kill('SIGTERM');
+        // Use forceKillProcessTree for graceful termination with escalation
+        await forceKillProcessTree(server.process.pid, 3000);
       }
     }
libs/platform/src/subprocess.ts (1)

261-272: Consider reusing the killProcess helper to reduce duplication.

The abort handler in spawnProcess duplicates the Windows/Unix conditional logic that's already encapsulated in the killProcess helper defined earlier in this file.

♻️ Optional: Reuse killProcess helper
     if (abortController) {
       abortHandler = () => {
         cleanupAbortListener();
-        // Use tree-kill on Windows to terminate entire process tree
-        if (IS_WINDOWS && childProcess.pid) {
-          killProcessTree(childProcess.pid).catch((err) => {
-            console.error('[SubprocessManager] Error killing process tree:', err);
-          });
-        } else {
-          childProcess.kill('SIGTERM');
+        if (childProcess.pid) {
+          killProcess().catch((err) => {
+            console.error('[SubprocessManager] Error killing process:', err);
+          });
         }
         reject(new Error('Process aborted'));
       };

Note: This would require moving killProcess to be defined outside the generator function or passing childProcess as a parameter.

apps/server/tests/unit/services/dev-server-service.test.ts (1)

271-277: Platform-specific assertion may not exercise Windows path in CI.

The test correctly branches on process.platform, but if CI runs on Linux/macOS, the Windows code path with killProcessTree won't be exercised. Consider adding a dedicated test that mocks process.platform to ensure the Windows branch is covered.

♻️ Add explicit Windows path test
it('should use killProcessTree on Windows', async () => {
  // Temporarily override the platform check
  const originalPlatform = process.platform;
  Object.defineProperty(process, 'platform', { value: 'win32' });

  try {
    // Reset the module cache so the service re-evaluates its Windows detection
    vi.resetModules();
    const { getDevServerService } = await import('@/services/dev-server-service.js');
    // ... test setup and assertions for killProcessTree
  } finally {
    // Restore the real platform even if an assertion throws
    Object.defineProperty(process, 'platform', { value: originalPlatform });
  }
});

Alternatively, consider refactoring the service to accept the platform as a dependency for easier testing.
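
One possible shape for that dependency-injection refactor is sketched below. The names createProcessTerminator, KillTree and KillableProcess are hypothetical; the branch itself mirrors the Windows/POSIX split described in this PR, with the tree-kill call passed in rather than imported.

```typescript
type KillTree = (pid: number) => Promise<void>;

/** The subset of a child process that the terminator needs. */
interface KillableProcess {
  pid?: number;
  killed: boolean;
  kill(signal?: NodeJS.Signals): boolean;
}

/** Builds a terminate function for the given platform and tree-kill implementation. */
function createProcessTerminator(platform: NodeJS.Platform, killTree: KillTree) {
  return async function terminate(proc: KillableProcess): Promise<void> {
    if (proc.killed || proc.pid === undefined) return;
    if (platform === 'win32') {
      // Windows: terminate the whole tree so children are not orphaned.
      await killTree(proc.pid);
    } else {
      // POSIX: keep the plain SIGTERM behaviour.
      proc.kill('SIGTERM');
    }
  };
}

// Usage in a test: no need to override process.platform.
const killedPids: number[] = [];
const terminateOnWindows = createProcessTerminator('win32', async (pid) => {
  killedPids.push(pid);
});
```

Production code would pass process.platform and the real killProcessTree, while tests can pass 'win32' and a fake on any CI host.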

apps/server/src/services/auto-mode-service.ts (1)

2095-2109: Clear the timeout after Promise.race to avoid stray timers.
Even if cleanup finishes early, the timeout still fires later.

♻️ Suggested refactor
-      try {
-        await Promise.race([
-          running.completionPromise,
-          new Promise<void>((_, reject) =>
-            setTimeout(() => reject(new Error('Cleanup timeout')), timeoutMs)
-          ),
-        ]);
+      let timeoutId: NodeJS.Timeout | undefined;
+      const timeoutPromise = new Promise<void>((_, reject) => {
+        timeoutId = setTimeout(() => reject(new Error('Cleanup timeout')), timeoutMs);
+      });
+      try {
+        await Promise.race([running.completionPromise, timeoutPromise]);
         logger.info(`Feature ${featureId} cleanup completed gracefully`);
       } catch (error) {
         // Timeout or other error - force release
         logger.warn(`Feature ${featureId} cleanup timed out or failed, forcing removal:`, error);
+      } finally {
+        if (timeoutId) clearTimeout(timeoutId);
       }
apps/server/src/index.ts (1)

262-265: Prefer logger.error over console.error for consistency.
Keeps log routing/levels consistent with the rest of the server.

♻️ Suggested tweak
-autoModeService.initialize().catch((err) => {
-  console.error('[AutoModeService] Initialization error:', err);
-});
+autoModeService.initialize().catch((err) => {
+  logger.error('[AutoModeService] Initialization error:', err);
+});


ScotTFO commented Jan 26, 2026

Re: PowerShell -like escaping (review comment)

This is a false positive. The characters that would need escaping (*, ?, [, ], ') are all illegal in Windows file/directory names (NTFS). Since workDir is a validated worktree path that already exists on disk, these characters cannot appear in it.

The -like operator is only used for prefix matching against real filesystem paths, so no injection or over-matching is possible here.


Re: Nitpick comments

1. forceKillProcessTree on Unix for dev-server — Intentionally kept simple. Dev servers respond to SIGTERM reliably; adding escalation adds complexity without clear benefit for this use case.

2. Reuse killProcess helper in spawnProcess — The helper is scoped inside the generator function spawnJSONLProcess. Refactoring to share it would require restructuring the module for minimal gain.

3. Platform-specific test coverage — Valid observation, but mocking process.platform is fragile. The Windows path is exercised when running tests on Windows (which is the target platform for this fix).

4. Clear timeout after Promise.race — Good catch, will fix.

5. logger.error over console.error — Good catch, will fix.

Address CodeRabbit review feedback:
- Clear the timeout timer after Promise.race resolves to avoid stray timers
- Use logger.error instead of console.error for consistency

Co-Authored-By: Claude Opus 4.5 <[email protected]>