Spawning isolated processes for each test #67

Open
jiannanWang wants to merge 16 commits into main from jiannanWang/eval_multiprocessing
Conversation

jiannanWang (Contributor) commented Aug 12, 2025

Fixes #45.

This PR introduces a --num-workers flag to enable multiprocessing evaluation. Its main purpose is to spawn isolated processes so that CUDA errors do not waterfall across tests during benchmarking. Note that multiprocessing does not improve speed: the tests run quickly, and most of the overhead comes from initialization, which multiprocessing cannot accelerate. Therefore, multiprocessing is disabled by default and should only be enabled when CUDA errors block the experiment.
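
For reference, the core isolation pattern looks roughly like the sketch below. This is not the actual eval_multiprocessing.py from this PR; the helper names (_worker, run_isolated) and the fixed device index are illustrative, and with the "spawn" start method the test function and its result must be picklable (defined at module level).

import multiprocessing as mp

import torch


def _worker(test_fn, args, queue):
    # Runs in a fresh interpreter: a CUDA failure here cannot corrupt the parent.
    try:
        if torch.cuda.is_available():
            torch.cuda.set_device(0)  # illustrative: pin the worker to one device
        queue.put(("ok", test_fn(*args)))
    except Exception as exc:
        queue.put(("error", repr(exc)))


def run_isolated(test_fn, args=(), timeout=60):
    # "spawn" avoids inheriting an already-initialized (and possibly poisoned) CUDA context.
    ctx = mp.get_context("spawn")
    queue = ctx.Queue()
    proc = ctx.Process(target=_worker, args=(test_fn, args, queue))
    proc.start()
    proc.join(timeout)
    if proc.is_alive():
        proc.terminate()
        proc.join()
        return ("timeout", None)
    if proc.exitcode != 0:
        # The child aborted hard (e.g. a c10::Error terminating the process).
        return ("crashed", proc.exitcode)
    return queue.get()

In this sketch, "spawn" rather than "fork" is what provides the isolation: each worker initializes CUDA from scratch, so an illegal memory access in one test cannot leave a corrupted context behind for the next one.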

jiannanWang requested review from msaroufim and PaliC August 12, 2025 17:17
meta-cla bot added the CLA Signed label August 12, 2025
jiannanWang marked this pull request as draft August 12, 2025 17:17
jiannanWang (Contributor, Author) commented:

Add eval_multiprocessing.py to run each test in a separate subprocess for better CUDA error isolation.
Enable _set_gpu_device for both correctness and performance testing.
Add test_eval_multiprocessing.py to ensure the correctness of eval_multiprocessing.py.
The log from test_adverse_cases.py is shown below to confirm CUDA error isolation.

Log from test_adverse_cases.py
(pytorch) [[email protected] ~/Workspace/BackendBench (jiannanWang/eval_multiprocessing)]$ CUDA_LAUNCH_BLOCKING=1 uv run --active python test/test_adverse_cases.py
BackendBench: Monkey patched 0 operations with aten backend
================================================================================================ test session starts =================================================================================================
platform linux -- Python 3.12.11+meta, pytest-8.4.1, pluggy-1.6.0 -- /home/jiannanwang/pytorch/bin/python3
cachedir: .pytest_cache
hypothesis profile 'default'
rootdir: /home/jiannanwang/Workspace/BackendBench
configfile: pytest.ini
plugins: hypothesis-6.136.6, anyio-4.9.0, timeout-2.4.0, mock-3.14.1, cov-6.2.1
collected 1 item

test/test_adverse_cases.py::TestAdaptiveAvgPool2dBackward::test_adaptive_avg_pool2d_backward_gpu BackendBench: Monkey patched 0 operations with aten backend
terminate called after throwing an instance of 'c10::Error'
  what():  CUDA error: an illegal memory access was encountered
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.

Exception raised from c10_cuda_check_implementation at /pytorch/c10/cuda/CUDAException.cpp:43 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) + 0x98 (0x7f317ad785e8 in /home/jiannanwang/pytorch/lib/python3.12/site-packages/torch/lib/libc10.so)
frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) + 0xe0 (0x7f317ad0d4a2 in /home/jiannanwang/pytorch/lib/python3.12/site-packages/torch/lib/libc10.so)
frame #2: c10::cuda::c10_cuda_check_implementation(int, char const*, char const*, int, bool) + 0x3c2 (0x7f317b1d3422 in /home/jiannanwang/pytorch/lib/python3.12/site-packages/torch/lib/libc10_cuda.so)
frame #3: <unknown function> + 0x32953 (0x7f317b1af953 in /home/jiannanwang/pytorch/lib/python3.12/site-packages/torch/lib/libc10_cuda.so)
frame #4: <unknown function> + 0x32b01 (0x7f317b1afb01 in /home/jiannanwang/pytorch/lib/python3.12/site-packages/torch/lib/libc10_cuda.so)
frame #5: <unknown function> + 0x65c522 (0x7f317285c522 in /home/jiannanwang/pytorch/lib/python3.12/site-packages/torch/lib/libtorch_python.so)
frame #6: <unknown function> + 0x44c5d8 (0x7f317264c5d8 in /home/jiannanwang/pytorch/lib/python3.12/site-packages/torch/lib/libtorch_python.so)
frame #7: <unknown function> + 0x7b385 (0x7f317ad59385 in /home/jiannanwang/pytorch/lib/python3.12/site-packages/torch/lib/libc10.so)
frame #8: c10::TensorImpl::~TensorImpl() + 0x9 (0x7f317ad52f39 in /home/jiannanwang/pytorch/lib/python3.12/site-packages/torch/lib/libc10.so)
frame #9: <unknown function> + 0x704f48 (0x7f3172904f48 in /home/jiannanwang/pytorch/lib/python3.12/site-packages/torch/lib/libtorch_python.so)
frame #10: <unknown function> + 0x705370 (0x7f3172905370 in /home/jiannanwang/pytorch/lib/python3.12/site-packages/torch/lib/libtorch_python.so)
frame #11: /home/jiannanwang/pytorch/bin/python3() [0x402d24]
frame #12: /home/jiannanwang/pytorch/bin/python3() [0x402ea7]
frame #13: _PyEval_EvalFrameDefault + 0xa92 (0x457bb2 in /home/jiannanwang/pytorch/bin/python3)
frame #14: PyEval_EvalCode + 0x89 (0x5e8899 in /home/jiannanwang/pytorch/bin/python3)
frame #15: /home/jiannanwang/pytorch/bin/python3() [0x5e86e8]
frame #16: PyRun_SimpleStringFlags + 0xe1 (0x70d0a1 in /home/jiannanwang/pytorch/bin/python3)
frame #17: Py_RunMain + 0x72f (0x5ed89f in /home/jiannanwang/pytorch/bin/python3)
frame #18: Py_BytesMain + 0x2a (0x69615a in /home/jiannanwang/pytorch/bin/python3)
frame #19: <unknown function> + 0x2c657 (0x7f318982c657 in /usr/local/fbcode/platform010/lib/libc.so.6)
frame #20: __libc_start_main + 0x88 (0x7f318982c718 in /usr/local/fbcode/platform010/lib/libc.so.6)
frame #21: /home/jiannanwang/pytorch/bin/python3() [0x743f81]

BackendBench: Monkey patched 0 operations with aten backend
BackendBench: Monkey patched 0 operations with aten backend
PASSED

================================================================================================= 1 passed in 19.72s =================================================================================================
[W813 14:26:40.170087195 CudaIPCTypes.cpp:16] Producer process has been terminated before all shared CUDA tensors released. See Note [Sharing CUDA tensors]
Log from test_eval_multiprocessing.py
(pytorch) [[email protected] ~/Workspace/BackendBench (jiannanWang/eval_multiprocessing)]$ uv run --active pytest test/test_eval_multiprocessing.py
================================================================================================ test session starts =================================================================================================
platform linux -- Python 3.12.11+meta, pytest-8.4.1, pluggy-1.6.0 -- /home/jiannanwang/pytorch/bin/python3
cachedir: .pytest_cache
hypothesis profile 'default'
rootdir: /home/jiannanwang/Workspace/BackendBench
configfile: pytest.ini
plugins: hypothesis-6.136.6, anyio-4.9.0, timeout-2.4.0, mock-3.14.1, cov-6.2.1
collected 2 items

test/test_eval_multiprocessing.py::TestEvalCorrectnessMultiprocessing::test_eval_correctness_multiple_tests PASSED                                                                                             [ 50%]
test/test_eval_multiprocessing.py::TestEvalOneOp::test_eval_one_op PASSED                                                                                                                                      [100%]

================================================================================================= 2 passed in 20.62s =================================================================================================

jiannanWang changed the title from [WIP] Spawning isolated processes for each test to Spawning isolated processes for each test Aug 13, 2025
jiannanWang marked this pull request as ready for review August 13, 2025 22:09
PaliC (Contributor) commented Aug 13, 2025

@jiannanWang It would also be useful to add some timings for running the op_info suite against aten here. You mentioned there was a large slowdown, so it's worth seeing how bad it is. Feel free to just use time {command}.

jiannanWang (Contributor, Author) commented Aug 18, 2025

The nvidia-smi results below show that the evaluation is running with multiprocessing. I'm not sure why device 0 got so many processes.

[[email protected] ~]$ nvidia-smi
Mon Aug 18 15:38:49 2025       
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.154.05             Driver Version: 535.154.05   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA H100                    On  | 00000000:06:00.0 Off |                    0 |
| N/A   37C    P0             118W / 500W |   6463MiB / 97871MiB |     36%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+
|   1  NVIDIA H100                    On  | 00000000:26:00.0 Off |                    0 |
| N/A   35C    P0             114W / 500W |    617MiB / 97871MiB |      0%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+
|   2  NVIDIA H100                    On  | 00000000:46:00.0 Off |                    0 |
| N/A   35C    P0             116W / 500W |    617MiB / 97871MiB |      0%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+
|   3  NVIDIA H100                    On  | 00000000:66:00.0 Off |                    0 |
| N/A   38C    P0             113W / 500W |    617MiB / 97871MiB |      0%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+
|   4  NVIDIA H100                    On  | 00000000:86:00.0 Off |                    0 |
| N/A   38C    P0             118W / 500W |    617MiB / 97871MiB |      0%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+
|   5  NVIDIA H100                    On  | 00000000:A6:00.0 Off |                    0 |
| N/A   37C    P0             118W / 500W |    743MiB / 97871MiB |      0%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+
|   6  NVIDIA H100                    On  | 00000000:C6:00.0 Off |                    0 |
| N/A   35C    P0             120W / 500W |    727MiB / 97871MiB |      0%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+
|   7  NVIDIA H100                    On  | 00000000:EC:00.0 Off |                    0 |
| N/A   36C    P0             111W / 500W |    617MiB / 97871MiB |      0%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+
                                                                                         
+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|    0   N/A  N/A   3319071      C   ...pace/BackendBench/.venv/bin/python3      740MiB |
|    0   N/A  N/A   3324869      C   ...pace/BackendBench/.venv/bin/python3      708MiB |
|    0   N/A  N/A   3324870      C   ...pace/BackendBench/.venv/bin/python3      708MiB |
|    0   N/A  N/A   3324871      C   ...pace/BackendBench/.venv/bin/python3      708MiB |
|    0   N/A  N/A   3324872      C   ...pace/BackendBench/.venv/bin/python3      708MiB |
|    0   N/A  N/A   3324873      C   ...pace/BackendBench/.venv/bin/python3      708MiB |
|    0   N/A  N/A   3324874      C   ...pace/BackendBench/.venv/bin/python3      710MiB |
|    0   N/A  N/A   3324875      C   ...pace/BackendBench/.venv/bin/python3      710MiB |
|    0   N/A  N/A   3324876      C   ...pace/BackendBench/.venv/bin/python3      708MiB |
|    1   N/A  N/A   3324870      C   ...pace/BackendBench/.venv/bin/python3      608MiB |
|    2   N/A  N/A   3324871      C   ...pace/BackendBench/.venv/bin/python3      608MiB |
|    3   N/A  N/A   3324872      C   ...pace/BackendBench/.venv/bin/python3      608MiB |
|    4   N/A  N/A   3324873      C   ...pace/BackendBench/.venv/bin/python3      608MiB |
|    5   N/A  N/A   3324874      C   ...pace/BackendBench/.venv/bin/python3      608MiB |
|    6   N/A  N/A   3324875      C   ...pace/BackendBench/.venv/bin/python3      608MiB |
|    7   N/A  N/A   3324876      C   ...pace/BackendBench/.venv/bin/python3      608MiB |
+---------------------------------------------------------------------------------------+

Runtime:
Using the command time uv run python BackendBench/scripts/main.py --backend aten --suite opinfo, we got:
8 GPUs:

correctness score (mean pass rate over all operators): 1.00
performance score (geomean speedup over all operators): nan

real    0m22.078s
user    1m33.075s
sys     0m22.264s

1 GPU:

correctness score (mean pass rate over all operators): 1.00
performance score (geomean speedup over all operators): nan

real    0m18.184s
user    0m27.285s
sys     0m3.711s

Main (no multiprocessing):

correctness score (mean pass rate over all operators): 1.00
performance score (geomean speedup over all operators): nan

real    0m12.281s
user    0m17.201s
sys     0m1.757s

It turns out that multiprocessing does not provide a speedup in our case. This is likely because our tests run very quickly, and most of the overhead comes from initialization, which multiprocessing cannot accelerate.

For now, we might need to avoid using multiprocessing as the default solution. Instead, we can offer it as an alternative option for users who encounter frequent CUDA errors in their experiments.
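
For anyone who does need it, opting in would look something like the command below (the worker count here is arbitrary and only illustrative; --num-workers is the flag this PR adds):

time uv run python BackendBench/scripts/main.py --backend aten --suite opinfo --num-workers 8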

PaliC (Contributor) commented Aug 19, 2025

@jiannanWang These results make sense, I think, but what command are you using for testing, and what are the backend and suite? It's unclear whether we should be concerned about the run using 8 GPUs.

Also, can you run a similar time trial on main for record keeping?

jiannanWang (Contributor, Author) replied:

Added in the previous comment. I'm using time uv run python BackendBench/scripts/main.py --backend aten --suite opinfo.

@@ -37,4 +37,4 @@ jobs:
        run: uv run python -m BackendBench.scripts.main --suite facto --backend aten --ops "add.Tensor"

      - name: Run pytest tests
-       run: uv run pytest test/
+       run: uv run pytest test/ --deselect test/test_adverse_cases.py
Review comment (Member):

Let's not do this here, because then everyone would need to be aware of this test locally; you can add a skip directly in the test file.
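
For example, a minimal sketch of such a skip (the reason string is illustrative, and in the real file the test lives inside a test class):

import pytest


# Skipped by default because it intentionally triggers a CUDA illegal memory access;
# remove or condition the marker when exercising error isolation on purpose.
@pytest.mark.skip(reason="intentionally triggers a CUDA illegal memory access")
def test_adaptive_avg_pool2d_backward_gpu():
    ...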

import pytest
import torch

import BackendBench.multiprocessing_eval as multiprocessing_eval
Review comment (Member):

Can we merge this file into the test_adverse_cases file?

msaroufim (Member) left a comment:

Thanks! Please address the feedback before merging.

Labels: CLA Signed (managed by the Meta Open Source bot)

Successfully merging this pull request may close this issue: Ensure that cuda errors don't waterfall when benchmarking by spawning an isolated process