fix(core): Prevent fatal startup crash in cluster/sandbox environments by deferring psutil calls #68
Open
Tianshi-Xu wants to merge 1 commit into bytedance:main
In cluster or sandbox environments, services are often spawned by short-lived launcher processes. The module previously walked the process tree at its top level, that is, at import time. This created a race condition: psutil would try to access the parent launcher process just as it was exiting, causing a fatal crash. This commit resolves the race by deferring execution. The process-walking logic is moved out of the global scope and into cleanup_process, the only function that uses it. The walk now runs well after the unstable startup phase has passed, so the service no longer crashes on import.
Problem
The application crashes during the import phase when launching the service (uvicorn) or running tests (pytest) in a cluster or sandbox environment (e.g., AMLT).
The error typically manifests as psutil.NoSuchProcess or FileNotFoundError. This issue is cluster-specific and not reproducible locally.

Root Cause
This is a startup race condition.
The sandbox/utils/execution.py file executes psutil code at the module's top level (global scope), attempting to walk up the process tree.
In the cluster environment, our service is spawned by a very short-lived "Launcher" process. This launcher process exits almost immediately after starting our service.
The psutil code, running during import, executes too early and tries to access this just-exited launcher process. This results in a NoSuchProcess error, causing the entire application startup to fail.

Solution
The solution is to defer execution of the problematic code.
The process tree-walking logic (the while loop) has been moved entirely from the module's top level into the cleanup_process() function. This is the only place this logic is actually used.
This fix ensures:
- Importing sandbox/utils/execution is now a safe, side-effect-free operation, allowing the service to start correctly.
- When cleanup_process is actually called later, the startup race condition is long gone, and it will safely compute the real-time process tree.

Changes
sandbox/utils/execution.py
- Removed current_pid, root_pid, and the while loop from the global scope.
- Moved this logic into the cleanup_process function.
- Added a try...except psutil.NoSuchProcess block inside the loop for robustness in sandboxed environments.
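
The deferred walk described above could look roughly like the sketch below. It is not the code from this PR: the helper find_root_pid, its parameters, and the body of cleanup_process are assumptions made for illustration, and the real cleanup work is not shown in this description. The sketch only shows two ideas: nothing touches the process tree at import time, and a vanished ancestor ends the walk instead of raising.

```python
import os


def find_root_pid(start_pid, parent_of, gone_errors):
    """Return the highest ancestor of start_pid whose lookup succeeded.

    Hypothetical helper (not from the PR).
    parent_of(pid) returns the parent PID, or 0/None when there is none.
    gone_errors is the exception type (or tuple of types) raised when a
    process has already exited.
    """
    confirmed = start_pid  # last PID we could actually query
    candidate = start_pid
    seen = set()           # guards against PID cycles
    while candidate and candidate not in seen:
        seen.add(candidate)
        try:
            parent = parent_of(candidate)
        except gone_errors:
            # The candidate exited mid-walk (e.g. a short-lived launcher):
            # stop and keep the last PID that was still reachable.
            break
        confirmed = candidate
        candidate = parent
    return confirmed


def cleanup_process():
    """Illustrative stand-in: only the root lookup is sketched here."""
    # Imported and used only when cleanup runs, so importing this module
    # has no side effects and cannot hit the startup race.
    import psutil

    return find_root_pid(
        os.getpid(),
        parent_of=lambda pid: psutil.Process(pid).ppid(),
        gone_errors=(psutil.NoSuchProcess, FileNotFoundError),
    )
```

Passing the parent lookup in as a callable keeps the walk testable without a real process tree; with psutil available, cleanup_process() resolves the root PID at call time rather than at import.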