Commit 924482a
Replace NUMA inheritance approach (pytorch#166026)
# Context
Previously, we would modify the parent process's NUMA bindings in order to force child processes to inherit them.
However, this did not work correctly with `start_method="forkserver"`, because the subprocesses inherit their bindings from the forkserver middleman process rather than from the parent. The forkserver process is created lazily, so it inherits and then keeps the bindings intended for the first subprocess. As a result, the inherited affinity was incorrect for all but the first subprocess.
# This PR
* `str` entrypoints: Use `numactl` CLI
* `Callable` entrypoints: Wrap the `Callable` entrypoint and call `os.sched_setaffinity` inside it.
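The two strategies above can be sketched roughly as follows. This is a minimal illustration, not the code from this PR: the helper names `numactl_command`, `wrap_with_affinity`, and `_run_with_affinity` are hypothetical, and the choice of the `--physcpubind` option for `numactl` is an assumption made to mirror the CPU set passed to `os.sched_setaffinity`.

```python
import os
from functools import partial
from typing import Any, Callable, Iterable, Sequence, Tuple


def numactl_command(
    entrypoint: str, args: Sequence[str], cpus: Iterable[int]
) -> Tuple[str, ...]:
    """Hypothetical helper: prefix a `str` entrypoint with the numactl CLI.

    Assumption: binding is expressed as a physical CPU list.
    """
    cpu_list = ",".join(str(cpu) for cpu in sorted(cpus))
    return ("numactl", f"--physcpubind={cpu_list}", entrypoint, *args)


def _run_with_affinity(
    cpus: frozenset, fn: Callable[..., Any], *args: Any, **kwargs: Any
) -> Any:
    # Runs inside the child process, so the binding applies to the child
    # itself and does not depend on what the parent or a forkserver
    # middleman happened to be bound to. Linux-only API.
    os.sched_setaffinity(0, cpus)  # pid 0 means "the calling process"
    return fn(*args, **kwargs)


def wrap_with_affinity(fn: Callable[..., Any], cpus: Iterable[int]) -> Callable[..., Any]:
    """Hypothetical helper: wrap a `Callable` entrypoint so it pins itself."""
    # A partial over a module-level function stays picklable (as long as
    # `fn` is), which spawn and forkserver start methods require.
    return partial(_run_with_affinity, frozenset(cpus), fn)
```

Because the affinity is set by the subprocess itself in both cases, the result no longer depends on which process the subprocess was forked or spawned from.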
Hopefully this will be the last necessary iteration.
# Test Plan
## Automated
`$ pytest test/test_numa_binding.py`
## Manual
Verified flops/sec and memory locality wins on several different types of jobs:
* `Callable` with forkserver
* `str` entrypoint with spawn
* `Callable` entrypoint with spawn
More details in [this doc (Meta-only).](https://docs.google.com/document/d/1vxD-OKYBTT27jbBwtW9iz9g0tNM0u-i0tiTJg_ieQA8/edit?tab=t.scjv58yswi64)
# Later PR
Update all the documentation when we're confident this has stabilized.
Pull Request resolved: pytorch#166026
Approved by: https://github.com/d4l3k
Co-authored-by: PyTorch MergeBot <[email protected]>

1 parent 20be077 commit 924482a
File tree
5 files changed, +382 -257 lines changed
- test
- torch
- distributed/elastic/multiprocessing
- subprocess_handler
- multiprocessing
- numa