-
Notifications
You must be signed in to change notification settings - Fork 2.3k
Improve child process termination on POSIX & Windows #1078
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
0205aa0
to
2be6f09
Compare
80f8217
to
12114f7
Compare
@dsp-ant as discussed platform native ways to terminate processes for Unix & Windows without race conditions between child enumeration and parent shutdown. @jingx8885 I had to break this out of #1044 to isolate this particular problem. If you'd be willing to test whether this branch still resolves your issue like you did on #1044 that would be greatly appreciated! The tests still pass, but just to make sure there's nothing we missed. |
1a561c6
to
f0633a3
Compare
@felixweinberger got it, I'll deal with it in a few days. |
7af9e65
to
c9b43bc
Compare
f0633a3
to
54be4d6
Compare
@Kludex @agronholm given I've seen you both on Python process management related PRs, tagging y'all here in case you have any thoughts on whether this could or should be simpler with This is an edge case where terminating a process doesn't automatically terminate children via For context, we need the subprocesses to be killed in the |
@@ -212,3 +215,306 @@ def sigint_handler(signum, frame): | |||
) | |||
else: | |||
raise | |||
|
|||
|
|||
class TestChildProcessCleanup: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Aware that using "timeout" anywhere in a test is usually not great practice - however, cross platform challenges and the subject matter being tested (processes) make this a little trickier to resolve than usual, as these tests are fundamentally there to detect whether processes are being closed (which in these tests should take a negligible, but non-zero amount of time).
I ran a deep research on alternative approaches - here's @claude 's POV:
⏺ Deep Analysis: Is the Time-Based Testing Approach Justified?
After researching testing best practices and ultrathinking about this specific case, yes, the approach is justified. Here's why:
The Tests Are Actually Condition-Based, Not Time-Based
The tests check a condition: "Is the file size stable?" This indicates process termination. The sleep() calls create measurement windows, not synchronization points. This is fundamentally
different from hoping timing works out.Why Alternative Approaches Don't Work Here
- Process polling (poll()/wait()): Only works for known process handles. This PR tests termination of entire process trees including unknown grandchildren.
- Mock-based testing: Process groups and Windows Job Objects require real OS interaction. The bugs being fixed are OS-level issues that mocks can't reproduce.
- Event/signal-based coordination: Would require modifying child processes, but the goal is to test termination of arbitrary processes (like npx).
- OS monitoring APIs: Platform-specific and still require polling. The file-based approach is simpler and more portable.
The Approach Is Well-Designed
- Black-box testing: Verifies actual behavior (processes stop working) rather than just process state
- Cross-platform: Works identically on Windows and Unix without platform-specific test code
- Real-world scenario: Tests continuous I/O patterns common in development servers
- Minimal timing: Uses 0.1s intervals and reasonable timeouts
- Comprehensive: Tests single children, nested trees, and race conditions
Critical Context
These tests provide regression coverage for production bugs where child processes weren't properly terminated. The ability to reliably test this behavior outweighs the theoretical concerns
about time-based tests.Recommendations
- Consider adding a comment explaining why the file-based approach is used:
# We use file writing as a cross-platform way to verify process liveness # without needing to know child PIDs or use platform-specific APIs
- The ResourceWarning filter is appropriately documented and justified.
- Future enhancement: Consider adding a process count assertion on Unix using ps to supplement the file-based checks.
Verdict
This PR provides a robust, well-tested solution to critical process management bugs. The testing approach, while using timing windows, is actually condition-based and represents the most
pragmatic way to verify process tree termination across platforms. The implementation quality is high with proper platform abstractions and error handling.
import pywintypes | ||
import win32api | ||
import win32con | ||
import win32job |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@felixweinberger I saw you used win32api instead of psutil, and used win32job.TerminateJobObject in terminate_windows_process_tree function to kill all processes, which is more simple and elegance, I tested in local and it‘s worked, great job!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you for testing it, really appreciate you taking the time!
c7221f3
to
046bed4
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This looks great. Thank you for working through this. My main comment here is about just improving the way we organize platform code. I would consider doing that before pushing the PR.
a328c9b
to
8de813b
Compare
The base branch was changed.
When terminating MCP servers, child processes were being orphaned because only the parent process was killed. This caused resource leaks and prevented proper cleanup, especially with tools like npx that spawn child processes for the actual server implementation. This was happening on both POSIX and Windows systems - however because of implementation details, resolving this is non-trivial and requires introducing psutil to introduce cross-platform utilities for dealing with children and process trees. This addresses critical issues where MCP servers using process spawning tools would leave zombie processes running after client shutdown. resolves #850 resolves #729 Co-authored-by: jingx8885 <[email protected]> Co-authored-by: Surya Prakash Susarla <[email protected]>
Replace manual deadline checking with anyio.move_on_after context manager for cleaner and more idiomatic timeout handling in Unix process group termination.
Per @dsp-ant's review feedback, reorganize platform-specific utilities: - Create src/mcp/os/ with win32/ and posix/ subdirectories - Move all Windows-specific code to mcp.os.win32.utilities - Extract POSIX process termination to mcp.os.posix.utilities - Update logger to use __name__ instead of hardcoded string - Rename timeout parameter to timeout_seconds for clarity - Document that missing PID means no process to terminate This structure allows platform utilities to be shared across client, server, and other components as needed.
8de813b
to
1a9b776
Compare
Thank you for pushing the fix! |
…tocol#1078) Co-authored-by: jingx8885 <[email protected]> Co-authored-by: Surya Prakash Susarla <[email protected]>
Motivation and Context
#729, and #850 address important failures to shut down childprocesses when MCP servers spawn subprocesses.
This is not a trivial thing to fix, as there's no great cross-platform library for terminating process-trees on all platforms so we have to fork between Unix (using
os.killpg
) and Windows (usingJobObjects
viapywin32
which is the best supported external library for this: https://github.com/mhammond/pywin32)This PR adds regressions tests to clearly demonstrate the edge cases addressed + unifies the approaches from the PRs listed above.
How Has This Been Tested?
New regression tests added (each fails without the relevant fix). These were developed and tested on both POSIX and Windows systems.
For #729 and #850 (properly terminate processes and their children across platforms):
Breaking Changes
None.
Types of changes
Checklist
Additional context