Commit b7680e3
(torchx/local_scheduler) Use os.kill instead of os.killpg when sending SIGTERM to the replica pid. Add runner.wait() for torchx.runner.test.api_test#test_empty_session_id to gracefully wait for the replicas to finish running
Summary:
Fixes macos unittest failures: https://github.com/pytorch/torchx/actions/runs/14844253481/job/41674243877
When looking into the test failure I noticed two things:
1. `local_scheduler` was trying to SIGTERM the process group by passing the replica's pid: `os.killpg(replica.pid, signal.SIGTERM)` . Changed to call `os.kill`. (note that `os.killpg` is not available on iOS which is why the test was failing).
2. The `torchx.runner.test.api_test.test_empty_session_id()` test case doesn't wait for the `echo` test command to finish hence there was a race condition where in certain cases the runner's `__exit__()` SIGTERMs the replica pids but since the `local_scheduler` was (wronfully) using `os.killpg` not `os.kill` it threw an uncaught error in iOS.
Differential Revision: D741972821 parent 83a2765 commit b7680e3
2 files changed
+2
-1
lines changed| Original file line number | Diff line number | Diff line change | |
|---|---|---|---|
| |||
153 | 153 | | |
154 | 154 | | |
155 | 155 | | |
| 156 | + | |
156 | 157 | | |
157 | 158 | | |
158 | 159 | | |
| |||
| Original file line number | Diff line number | Diff line change | |
|---|---|---|---|
| |||
311 | 311 | | |
312 | 312 | | |
313 | 313 | | |
314 | | - | |
| 314 | + | |
315 | 315 | | |
316 | 316 | | |
317 | 317 | | |
| |||
0 commit comments