fix(core): Prevent fatal startup crash in cluster/sandbox environments by deferring psutil calls #68

Open
Tianshi-Xu wants to merge 1 commit into bytedance:main from Tianshi-Xu:fix/cluster-psutil-startup

@Tianshi-Xu

Problem

The application crashes during the import phase when launching the service (uvicorn) or running tests (pytest) in a cluster or sandbox environment (e.g., AMLT).

The error typically manifests as psutil.NoSuchProcess or FileNotFoundError. This issue is cluster-specific and not reproducible locally.

Root Cause

This is a startup race condition.

The sandbox/utils/execution.py file executes psutil code at the module's top level (global scope), attempting to walk up the process tree.

In the cluster environment, our service is spawned by a very short-lived "Launcher" process. This launcher process exits almost immediately after starting our service.

The psutil code, running during import, executes too early and tries to access this just-exited launcher process. This results in a NoSuchProcess error, causing the entire application startup to fail.

Solution

The fix is to defer execution of the problematic code.

The process tree-walking logic (the while loop) has been moved entirely from the module's top level into the cleanup_process() function. This is the only place this logic is actually used.

This fix ensures:

  • Importing sandbox/utils/execution is now a safe, side-effect-free operation, allowing the service to start correctly.
  • By the time cleanup_process is actually called, the launcher has long since exited, so the race condition no longer applies and the function computes the process tree as it exists at that moment.

Changes

  • Modified: sandbox/utils/execution.py
    • Removed current_pid, root_pid, and the while loop from the global scope.
    • Moved this logic to the beginning of the cleanup_process function.
    • Added a try...except psutil.NoSuchProcess block inside the loop for robustness in sandboxed environments.
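
The changes above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the code from sandbox/utils/execution.py: the helper name find_root_pid, the stop conditions (no parent, PID 1, or a cycle), and the return value of cleanup_process are all hypothetical. Only psutil.Process, Process.parent(), and psutil.NoSuchProcess are real psutil APIs. psutil is imported inside the function here purely so that loading the sketch has no side effects; the real module may well import it at the top.

```python
import os


def find_root_pid(start_pid=None):
    """Walk up the process tree and return the topmost PID still reachable.

    Hypothetical helper: the name and stop conditions are assumptions,
    not taken from the PR.
    """
    # Imported lazily so that loading this sketch never touches the process table.
    import psutil

    root_pid = os.getpid() if start_pid is None else start_pid
    seen = {root_pid}
    while True:
        try:
            parent = psutil.Process(root_pid).parent()
        except psutil.NoSuchProcess:
            # An ancestor (e.g. a short-lived launcher) has already exited:
            # stop at the last PID that was reachable instead of crashing.
            break
        # Stop at the top of the tree, at PID 1 (init), or if a PID repeats.
        if parent is None or parent.pid <= 1 or parent.pid in seen:
            break
        root_pid = parent.pid
        seen.add(root_pid)
    return root_pid


def cleanup_process():
    """Compute the process tree at call time rather than at import time."""
    root_pid = find_root_pid()
    # The actual cleanup work is omitted from this sketch.
    return root_pid
```

Because nothing runs at module level, importing this code cannot raise NoSuchProcess. The walk only happens when cleanup_process() is invoked, and a vanished ancestor simply ends the walk early.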

In cluster or sandbox environments, services are often spawned by
short-lived launcher processes.

The sandbox/utils/execution.py module previously executed a process-tree walk
at the module's top level (import time). This created a race condition
where psutil would try to access the parent launcher process just as
it was exiting, causing a fatal NoSuchProcess crash.

This commit resolves the race condition by deferring execution.
The process-walking logic is moved from the global scope entirely
into the cleanup_process function, which is its only user.

This ensures the code runs long after the unstable startup phase
has passed and prevents the service from crashing on import.