Description
Describe the bug
We have a test that does essentially this:
import numba.cuda
from cuda.core import Stream
s = numba.cuda.stream()
s_ref = Stream.from_handle(s.handle.value)
# Use s and s_ref in different API calls.
We use this pattern because one of the libraries we're working with accepts cuda.core Stream types, and the other accepts numba.cuda.stream types.
Intermittently and nondeterministically, we see a crash with the confusing message: Call to cuStreamDestroy results in CUDA_ERROR_INVALID_CONTEXT
As far as we're concerned
Steps/Code to reproduce bug
Unfortunately, this is really difficult to reproduce, and seems to only happen if we run our whole 20+ minute test suite involving at least 2 processes coordinated by MPI.
The traceback looks something like this:
Call to cuStreamDestroy results in CUDA_ERROR_INVALID_CONTEXT
Traceback (most recent call last):
  File "/usr/lib/python3.12/weakref.py", line 666, in _exitfunc
    f()
  File "/usr/lib/python3.12/weakref.py", line 590, in __call__
    return info.func(*info.args, **(info.kwargs or {}))
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/workspace/nvshmem/nvshmem4py_venv/lib/python3.12/site-packages/numba_cuda/numba/cuda/cudadrv/driver.py", line 1476, in core
    dealloc.add_item(module_unload, key)
  File "/workspace/nvshmem/nvshmem4py_venv/lib/python3.12/site-packages/numba_cuda/numba/cuda/cudadrv/driver.py", line 1051, in add_item
    self.clear()
  File "/workspace/nvshmem/nvshmem4py_venv/lib/python3.12/site-packages/numba_cuda/numba/cuda/cudadrv/driver.py", line 1062, in clear
    dtor(handle)
  File "/workspace/nvshmem/nvshmem4py_venv/lib/python3.12/site-packages/numba_cuda/numba/cuda/cudadrv/driver.py", line 358, in safe_cuda_api_call
    return self._check_cuda_python_error(fname, libfn(*args))
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/workspace/nvshmem/nvshmem4py_venv/lib/python3.12/site-packages/numba_cuda/numba/cuda/cudadrv/driver.py", line 417, in _check_cuda_python_error
    raise CudaAPIError(retcode, msg)
numba.cuda.cudadrv.driver.CudaAPIError: [201] Call to cuStreamDestroy results in CUDA_ERROR_INVALID_CONTEXT
Expected behavior
We expect to see no crash at teardown since we've stopped using the stream by the time this happens.
Environment details (please complete the following information):
- Environment location: Bare Metal
- Method of numba-cuda install: From source
Additional context
If we explicitly del s_ref after we're done using it, the problem goes away.