Spawning isolated processes for each test #67

Open
jiannanWang wants to merge 16 commits into main from jiannanWang/eval_multiprocessing
Conversation

jiannanWang (Contributor) commented Aug 12, 2025

Fixes #45.

This PR introduces a --num-workers flag to enable multiprocessing evaluation. Its main purpose is to spawn isolated processes so that CUDA errors do not waterfall across tests during benchmarking. Note that multiprocessing does not improve speed: the tests run quickly, and most of the overhead comes from initialization, which multiprocessing cannot accelerate. Therefore, multiprocessing is disabled by default and should only be enabled when CUDA errors block the experiment.
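
For reference, the core isolation pattern looks roughly like the sketch below. This is not the actual eval_multiprocessing.py from this PR; the helper names (_worker, run_isolated) and the fixed device index are illustrative, and with the "spawn" start method the test function and its result must be picklable (defined at module level).

import multiprocessing as mp

import torch


def _worker(test_fn, args, queue):
    # Runs in a fresh interpreter: a CUDA failure here cannot corrupt the parent.
    try:
        if torch.cuda.is_available():
            torch.cuda.set_device(0)  # illustrative: pin the worker to one device
        queue.put(("ok", test_fn(*args)))
    except Exception as exc:
        queue.put(("error", repr(exc)))


def run_isolated(test_fn, args=(), timeout=60):
    # "spawn" avoids inheriting an already-initialized (and possibly poisoned) CUDA context.
    ctx = mp.get_context("spawn")
    queue = ctx.Queue()
    proc = ctx.Process(target=_worker, args=(test_fn, args, queue))
    proc.start()
    proc.join(timeout)
    if proc.is_alive():
        proc.terminate()
        proc.join()
        return ("timeout", None)
    if proc.exitcode != 0:
        # The child aborted hard (e.g. a c10::Error terminating the process).
        return ("crashed", proc.exitcode)
    return queue.get()

In this sketch, "spawn" rather than "fork" is what provides the isolation: each worker initializes CUDA from scratch, so an illegal memory access in one test cannot leave a corrupted context behind for the next one.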

jiannanWang requested review from msaroufim and PaliC August 12, 2025 17:17
meta-cla bot added the CLA Signed label August 12, 2025
jiannanWang marked this pull request as draft August 12, 2025 17:17
jiannanWang (Contributor, Author) commented:

Add eval_multiprocessing.py to run each test in a separate subprocess for better CUDA error isolation.
Enable _set_gpu_device for both correctness and performance testing.
Add test_eval_multiprocessing.py to ensure the correctness of eval_multiprocessing.py.
The log from test_adverse_cases.py is shown below to confirm CUDA error isolation.

Log from test_adverse_cases.py
(pytorch) [[email protected] ~/Workspace/BackendBench (jiannanWang/eval_multiprocessing)]$ CUDA_LAUNCH_BLOCKING=1 uv run --active python test/test_adverse_cases.py
BackendBench: Monkey patched 0 operations with aten backend
================================================================================================ test session starts =================================================================================================
platform linux -- Python 3.12.11+meta, pytest-8.4.1, pluggy-1.6.0 -- /home/jiannanwang/pytorch/bin/python3
cachedir: .pytest_cache
hypothesis profile 'default'
rootdir: /home/jiannanwang/Workspace/BackendBench
configfile: pytest.ini
plugins: hypothesis-6.136.6, anyio-4.9.0, timeout-2.4.0, mock-3.14.1, cov-6.2.1
collected 1 item

test/test_adverse_cases.py::TestAdaptiveAvgPool2dBackward::test_adaptive_avg_pool2d_backward_gpu BackendBench: Monkey patched 0 operations with aten backend
terminate called after throwing an instance of 'c10::Error'
  what():  CUDA error: an illegal memory access was encountered
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.

Exception raised from c10_cuda_check_implementation at /pytorch/c10/cuda/CUDAException.cpp:43 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) + 0x98 (0x7f317ad785e8 in /home/jiannanwang/pytorch/lib/python3.12/site-packages/torch/lib/libc10.so)
frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) + 0xe0 (0x7f317ad0d4a2 in /home/jiannanwang/pytorch/lib/python3.12/site-packages/torch/lib/libc10.so)
frame #2: c10::cuda::c10_cuda_check_implementation(int, char const*, char const*, int, bool) + 0x3c2 (0x7f317b1d3422 in /home/jiannanwang/pytorch/lib/python3.12/site-packages/torch/lib/libc10_cuda.so)
frame #3: <unknown function> + 0x32953 (0x7f317b1af953 in /home/jiannanwang/pytorch/lib/python3.12/site-packages/torch/lib/libc10_cuda.so)
frame #4: <unknown function> + 0x32b01 (0x7f317b1afb01 in /home/jiannanwang/pytorch/lib/python3.12/site-packages/torch/lib/libc10_cuda.so)
frame #5: <unknown function> + 0x65c522 (0x7f317285c522 in /home/jiannanwang/pytorch/lib/python3.12/site-packages/torch/lib/libtorch_python.so)
frame #6: <unknown function> + 0x44c5d8 (0x7f317264c5d8 in /home/jiannanwang/pytorch/lib/python3.12/site-packages/torch/lib/libtorch_python.so)
frame #7: <unknown function> + 0x7b385 (0x7f317ad59385 in /home/jiannanwang/pytorch/lib/python3.12/site-packages/torch/lib/libc10.so)
frame #8: c10::TensorImpl::~TensorImpl() + 0x9 (0x7f317ad52f39 in /home/jiannanwang/pytorch/lib/python3.12/site-packages/torch/lib/libc10.so)
frame #9: <unknown function> + 0x704f48 (0x7f3172904f48 in /home/jiannanwang/pytorch/lib/python3.12/site-packages/torch/lib/libtorch_python.so)
frame #10: <unknown function> + 0x705370 (0x7f3172905370 in /home/jiannanwang/pytorch/lib/python3.12/site-packages/torch/lib/libtorch_python.so)
frame #11: /home/jiannanwang/pytorch/bin/python3() [0x402d24]
frame #12: /home/jiannanwang/pytorch/bin/python3() [0x402ea7]
frame #13: _PyEval_EvalFrameDefault + 0xa92 (0x457bb2 in /home/jiannanwang/pytorch/bin/python3)
frame #14: PyEval_EvalCode + 0x89 (0x5e8899 in /home/jiannanwang/pytorch/bin/python3)
frame #15: /home/jiannanwang/pytorch/bin/python3() [0x5e86e8]
frame #16: PyRun_SimpleStringFlags + 0xe1 (0x70d0a1 in /home/jiannanwang/pytorch/bin/python3)
frame #17: Py_RunMain + 0x72f (0x5ed89f in /home/jiannanwang/pytorch/bin/python3)
frame #18: Py_BytesMain + 0x2a (0x69615a in /home/jiannanwang/pytorch/bin/python3)
frame #19: <unknown function> + 0x2c657 (0x7f318982c657 in /usr/local/fbcode/platform010/lib/libc.so.6)
frame #20: __libc_start_main + 0x88 (0x7f318982c718 in /usr/local/fbcode/platform010/lib/libc.so.6)
frame #21: /home/jiannanwang/pytorch/bin/python3() [0x743f81]

BackendBench: Monkey patched 0 operations with aten backend
BackendBench: Monkey patched 0 operations with aten backend
PASSED

================================================================================================= 1 passed in 19.72s =================================================================================================
[W813 14:26:40.170087195 CudaIPCTypes.cpp:16] Producer process has been terminated before all shared CUDA tensors released. See Note [Sharing CUDA tensors]
Log from test_eval_multiprocessing.py
(pytorch) [[email protected] ~/Workspace/BackendBench (jiannanWang/eval_multiprocessing)]$ uv run --active pytest test/test_eval_multiprocessing.py
================================================================================================ test session starts =================================================================================================
platform linux -- Python 3.12.11+meta, pytest-8.4.1, pluggy-1.6.0 -- /home/jiannanwang/pytorch/bin/python3
cachedir: .pytest_cache
hypothesis profile 'default'
rootdir: /home/jiannanwang/Workspace/BackendBench
configfile: pytest.ini
plugins: hypothesis-6.136.6, anyio-4.9.0, timeout-2.4.0, mock-3.14.1, cov-6.2.1
collected 2 items

test/test_eval_multiprocessing.py::TestEvalCorrectnessMultiprocessing::test_eval_correctness_multiple_tests PASSED                                                                                             [ 50%]
test/test_eval_multiprocessing.py::TestEvalOneOp::test_eval_one_op PASSED                                                                                                                                      [100%]

================================================================================================= 2 passed in 20.62s =================================================================================================

jiannanWang changed the title from [WIP] Spawning isolated processes for each test to Spawning isolated processes for each test Aug 13, 2025
jiannanWang marked this pull request as ready for review August 13, 2025 22:09
PaliC (Contributor) commented Aug 13, 2025

@jiannanWang It would also be useful to add some timings for running the op_info suite against aten here. You mentioned there was a large slowdown, so it's worth seeing how bad it is. Feel free to just use time {command}.

jiannanWang (Contributor, Author) commented Aug 18, 2025

The nvidia-smi results below show that the evaluation is running with multiprocessing. I'm not sure why device 0 got so many processes.

[[email protected] ~]$ nvidia-smi
Mon Aug 18 15:38:49 2025       
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.154.05             Driver Version: 535.154.05   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA H100                    On  | 00000000:06:00.0 Off |                    0 |
| N/A   37C    P0             118W / 500W |   6463MiB / 97871MiB |     36%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+
|   1  NVIDIA H100                    On  | 00000000:26:00.0 Off |                    0 |
| N/A   35C    P0             114W / 500W |    617MiB / 97871MiB |      0%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+
|   2  NVIDIA H100                    On  | 00000000:46:00.0 Off |                    0 |
| N/A   35C    P0             116W / 500W |    617MiB / 97871MiB |      0%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+
|   3  NVIDIA H100                    On  | 00000000:66:00.0 Off |                    0 |
| N/A   38C    P0             113W / 500W |    617MiB / 97871MiB |      0%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+
|   4  NVIDIA H100                    On  | 00000000:86:00.0 Off |                    0 |
| N/A   38C    P0             118W / 500W |    617MiB / 97871MiB |      0%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+
|   5  NVIDIA H100                    On  | 00000000:A6:00.0 Off |                    0 |
| N/A   37C    P0             118W / 500W |    743MiB / 97871MiB |      0%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+
|   6  NVIDIA H100                    On  | 00000000:C6:00.0 Off |                    0 |
| N/A   35C    P0             120W / 500W |    727MiB / 97871MiB |      0%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+
|   7  NVIDIA H100                    On  | 00000000:EC:00.0 Off |                    0 |
| N/A   36C    P0             111W / 500W |    617MiB / 97871MiB |      0%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+
                                                                                         
+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|    0   N/A  N/A   3319071      C   ...pace/BackendBench/.venv/bin/python3      740MiB |
|    0   N/A  N/A   3324869      C   ...pace/BackendBench/.venv/bin/python3      708MiB |
|    0   N/A  N/A   3324870      C   ...pace/BackendBench/.venv/bin/python3      708MiB |
|    0   N/A  N/A   3324871      C   ...pace/BackendBench/.venv/bin/python3      708MiB |
|    0   N/A  N/A   3324872      C   ...pace/BackendBench/.venv/bin/python3      708MiB |
|    0   N/A  N/A   3324873      C   ...pace/BackendBench/.venv/bin/python3      708MiB |
|    0   N/A  N/A   3324874      C   ...pace/BackendBench/.venv/bin/python3      710MiB |
|    0   N/A  N/A   3324875      C   ...pace/BackendBench/.venv/bin/python3      710MiB |
|    0   N/A  N/A   3324876      C   ...pace/BackendBench/.venv/bin/python3      708MiB |
|    1   N/A  N/A   3324870      C   ...pace/BackendBench/.venv/bin/python3      608MiB |
|    2   N/A  N/A   3324871      C   ...pace/BackendBench/.venv/bin/python3      608MiB |
|    3   N/A  N/A   3324872      C   ...pace/BackendBench/.venv/bin/python3      608MiB |
|    4   N/A  N/A   3324873      C   ...pace/BackendBench/.venv/bin/python3      608MiB |
|    5   N/A  N/A   3324874      C   ...pace/BackendBench/.venv/bin/python3      608MiB |
|    6   N/A  N/A   3324875      C   ...pace/BackendBench/.venv/bin/python3      608MiB |
|    7   N/A  N/A   3324876      C   ...pace/BackendBench/.venv/bin/python3      608MiB |
+---------------------------------------------------------------------------------------+

Runtime:
Using the command time uv run python BackendBench/scripts/main.py --backend aten --suite opinfo, we got:
8 GPUs:

correctness score (mean pass rate over all operators): 1.00
performance score (geomean speedup over all operators): nan

real    0m22.078s
user    1m33.075s
sys     0m22.264s

1 GPU:

correctness score (mean pass rate over all operators): 1.00
performance score (geomean speedup over all operators): nan

real    0m18.184s
user    0m27.285s
sys     0m3.711s

Main (no multiprocessing):

correctness score (mean pass rate over all operators): 1.00
performance score (geomean speedup over all operators): nan

real    0m12.281s
user    0m17.201s
sys     0m1.757s

It turns out that multiprocessing does not provide a speedup in our case. This is likely because our tests run very quickly, and most of the overhead comes from initialization, which multiprocessing cannot accelerate.

For now, we might need to avoid using multiprocessing as the default solution. Instead, we can offer it as an alternative option for users who encounter frequent CUDA errors in their experiments.
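
For anyone who does need it, opting in would look something like the command below (the worker count here is arbitrary and only illustrative; --num-workers is the flag this PR adds):

time uv run python BackendBench/scripts/main.py --backend aten --suite opinfo --num-workers 8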

PaliC (Contributor) commented Aug 19, 2025

@jiannanWang These results make sense, I think, but what command are you using for testing, and what are the backend and suite? It's unclear whether we should be concerned about the run using 8 GPUs.

Also, can you run a similar time trial on main for record keeping?

jiannanWang (Contributor, Author) replied:

Added in the previous comment. I'm using time uv run python BackendBench/scripts/main.py --backend aten --suite opinfo.

@@ -37,4 +37,4 @@ jobs:
        run: uv run python -m BackendBench.scripts.main --suite facto --backend aten --ops "add.Tensor"

      - name: Run pytest tests
-       run: uv run pytest test/
+       run: uv run pytest test/ --deselect test/test_adverse_cases.py
Review comment (Member):

Let's not do this here, because then everyone would need to be aware of this test locally; you can add a skip directly in the test file.
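
For example, a minimal sketch of such a skip (the reason string is illustrative, and in the real file the test lives inside a test class):

import pytest


# Skipped by default because it intentionally triggers a CUDA illegal memory access;
# remove or condition the marker when exercising error isolation on purpose.
@pytest.mark.skip(reason="intentionally triggers a CUDA illegal memory access")
def test_adaptive_avg_pool2d_backward_gpu():
    ...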

import pytest
import torch

import BackendBench.multiprocessing_eval as multiprocessing_eval
Review comment (Member):

Can we merge this file into the test_adverse_cases file?

msaroufim (Member) left a comment:

Thanks! Please address the feedback before merging.

Labels: CLA Signed (managed by the Meta Open Source bot)

Successfully merging this pull request may close this issue: Ensure that cuda errors don't waterfall when benchmarking by spawning an isolated process