Skip to content

Commit b7680e3

Browse files
kiukchungfacebook-github-bot
authored andcommitted
(torchx/local_scheduler) Use os.kill instead of os.killpg when sending SIGTERM to the replica pid. Add runner.wait() for torchx.runner.test.api_test#test_empty_session_id to gracefully wait for the replicas to finish running
Summary: Fixes macos unittest failures: https://github.com/pytorch/torchx/actions/runs/14844253481/job/41674243877 When looking into the test failure I noticed two things: 1. `local_scheduler` was trying to SIGTERM the process group by passing the replica's pid: `os.killpg(replica.pid, signal.SIGTERM)` . Changed to call `os.kill`. (note that `os.killpg` is not available on iOS which is why the test was failing). 2. The `torchx.runner.test.api_test.test_empty_session_id()` test case doesn't wait for the `echo` test command to finish hence there was a race condition where in certain cases the runner's `__exit__()` SIGTERMs the replica pids but since the `local_scheduler` was (wronfully) using `os.killpg` not `os.kill` it threw an uncaught error in iOS. Differential Revision: D74197282
1 parent 83a2765 commit b7680e3

File tree

2 files changed

+2
-1
lines changed

2 files changed

+2
-1
lines changed

torchx/runner/test/api_test.py

Lines changed: 1 addition & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -153,6 +153,7 @@ def test_empty_session_id(self, _: MagicMock) -> None:
153153
)
154154

155155
app_handle = runner.run(app, "local", self.cfg)
156+
runner.wait(app_handle, wait_interval=0.1)
156157

157158
scheduler, session_name, app_id = parse_app_handle(app_handle)
158159
self.assertEqual(scheduler, "local")

torchx/schedulers/local_scheduler.py

Lines changed: 1 addition & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -311,7 +311,7 @@ def terminate(self) -> None:
311311
"""
312312
# safe to call terminate on a process that already died
313313
try:
314-
os.killpg(self.proc.pid, signal.SIGTERM)
314+
os.kill(self.proc.pid, signal.SIGTERM)
315315
except ProcessLookupError as e:
316316
log.debug(f"Process {self.proc.pid} already got terminated")
317317

0 commit comments

Comments
 (0)