[BUG] Nondeterministic crash on deleting Stream when interoperating with cuda-core #772

@benhg

Describe the bug

We have a test that does essentially this:

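from numba import cuda
from cuda.core import Stream

# Stream created and owned by Numba.
s = cuda.stream()
# cuda.core Stream wrapping the same underlying stream handle.
s_ref = Stream.from_handle(s.handle.value)

# Use s and s_ref in different API calls.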

We use this pattern because one of the libraries we're working with accepts cuda.core Stream objects, and the other accepts numba.cuda streams.

In some runs, nondeterministically, we see a crash with the confusing message "Call to cuStreamDestroy results in CUDA_ERROR_INVALID_CONTEXT".

As far as we're concerned

Steps/Code to reproduce bug

Unfortunately, this is really difficult to reproduce. It seems to happen only when we run our whole 20+ minute test suite, which involves at least 2 processes coordinated by MPI.

The traceback looks something like this:


Call to cuStreamDestroy results in CUDA_ERROR_INVALID_CONTEXT
Traceback (most recent call last):
  File "/usr/lib/python3.12/weakref.py", line 666, in _exitfunc
    f()
  File "/usr/lib/python3.12/weakref.py", line 590, in __call__
    return info.func(*info.args, **(info.kwargs or {}))
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/workspace/nvshmem/nvshmem4py_venv/lib/python3.12/site-packages/numba_cuda/numba/cuda/cudadrv/driver.py", line 1476, in core
    dealloc.add_item(module_unload, key)
  File "/workspace/nvshmem/nvshmem4py_venv/lib/python3.12/site-packages/numba_cuda/numba/cuda/cudadrv/driver.py", line 1051, in add_item
    self.clear()
  File "/workspace/nvshmem/nvshmem4py_venv/lib/python3.12/site-packages/numba_cuda/numba/cuda/cudadrv/driver.py", line 1062, in clear
    dtor(handle)
  File "/workspace/nvshmem/nvshmem4py_venv/lib/python3.12/site-packages/numba_cuda/numba/cuda/cudadrv/driver.py", line 358, in safe_cuda_api_call
    return self._check_cuda_python_error(fname, libfn(*args))
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/workspace/nvshmem/nvshmem4py_venv/lib/python3.12/site-packages/numba_cuda/numba/cuda/cudadrv/driver.py", line 417, in _check_cuda_python_error
    raise CudaAPIError(retcode, msg)
numba.cuda.cudadrv.driver.CudaAPIError: [201] Call to cuStreamDestroy results in CUDA_ERROR_INVALID_CONTEXT
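
For readers unfamiliar with the first two frames: `_exitfunc` in `weakref.py` is the hook that runs any `weakref.finalize` callbacks still pending when the interpreter exits, which is why the error only shows up at teardown. Below is a minimal, CUDA-free sketch of that mechanism. The `Resource` class and the printed messages are hypothetical stand-ins and are not taken from numba-cuda.

```python
import subprocess
import sys
import textwrap

# Child script: registers two finalizers and then simply falls off the end,
# so both callbacks are still pending when the interpreter shuts down.
SCRIPT = textwrap.dedent("""
    import weakref

    class Resource:  # hypothetical stand-in for an object owning a handle
        pass

    first = Resource()
    second = Resource()
    weakref.finalize(first, print, "finalize first")
    weakref.finalize(second, print, "finalize second")
    print("end of script")
""")

# Run the script in a fresh interpreter so that its exit hooks actually fire.
completed = subprocess.run(
    [sys.executable, "-c", SCRIPT],
    capture_output=True,
    text=True,
    check=True,
)
lines = completed.stdout.splitlines()
print(lines)
```

Pending finalizers run at exit in reverse order of creation, so cleanup code invoked this way cannot assume that everything it depends on is still in a usable state.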

Expected behavior

We expect to see no crash at teardown since we've stopped using the stream by the time this happens.

Environment details (please complete the following information):

  • Environment location: Bare Metal
  • Method of numba-cuda install: From source

Additional context

If we explicitly del s_ref after we're done using it, the problem goes away.
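
A minimal sketch of that workaround as a reusable helper, assuming all that matters is that the borrowed wrapper is dropped before the owning stream. `run_with_borrowed_stream`, `FakeStream`, and `FakeView` are hypothetical names used so the sketch runs without a GPU; we have not verified that this ordering addresses the underlying cause.

```python
import weakref


def run_with_borrowed_stream(make_owner, wrap, work):
    """Call work(owner, borrowed), then release borrowed before owner.

    make_owner: creates the owning stream object.
    wrap: builds a second object referring to the same stream.
    work: callable that uses both objects.
    """
    owner = make_owner()
    borrowed = wrap(owner)
    try:
        return work(owner, borrowed)
    finally:
        # Drop the borrowed wrapper explicitly, and only then the owner,
        # so the wrapper never outlives the stream it refers to.
        del borrowed
        del owner


# GPU-free demonstration with hypothetical stand-in classes.
class FakeStream:
    def __init__(self):
        self.handle = 1234


class FakeView:
    def __init__(self, handle):
        self.handle = handle


events = []


def make_owner():
    stream = FakeStream()
    weakref.finalize(stream, events.append, "owner released")
    return stream


def wrap(owner):
    view = FakeView(owner.handle)
    weakref.finalize(view, events.append, "borrowed released")
    return view


result = run_with_borrowed_stream(
    make_owner, wrap, lambda s, v: s.handle == v.handle
)
print(result, events)
```

With the real libraries, `make_owner` would be `cuda.stream` and `wrap` would be `lambda s: Stream.from_handle(s.handle.value)`, as in the snippet at the top of this report. The release order shown relies on CPython's reference counting. It only holds if `work` does not keep its own reference to either object.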
