Description
Describe the bug
It seems like the MemoryPointer object relies on the auto-casting behaviour of old numpy versions to upgrade int32 to int64 on demand. On newer numpy versions this results in semi-deterministic errors with large GPUs, at least in some cases. I can reproduce the error on T4 GPUs with driver version 535.161.07, which is what the Gitlab runners use.
The relevant method is view(self, start, stop=None). I see this with numpy==2.2.6 and numba_cuda==0.21.3.
E.g. pytest traceback:
self = <numba.cuda.cudadrv.driver.AutoFreePointer object at 0x7f4590b57f70>
start = np.int32(0), stop = np.int32(1152)

    def view(self, start, stop=None):
        if stop is None:
            size = self.size - start
        else:
            size = stop - start

        # Handle NULL/empty memory buffer
        if not self.device_pointer_value:
            if size != 0:
                raise RuntimeError("non-empty slice into empty slice")
            view = self  # new view is just a reference to self
        # Handle normal case
        else:
>           base = self.device_pointer_value + start
E           OverflowError: Python integer 139939146685440 out of bounds for int32

local_installation_linux/numba_cuda/numba/cuda/cudadrv/driver.py:1852: OverflowError
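The failing addition can probably be reproduced without a GPU. This is a minimal sketch, assuming device_pointer_value is a plain Python int (the wording of the error message suggests so). The helper name add_offset is hypothetical and only wraps the arithmetic from the traceback.

```python
import numpy as np

# Address copied from the traceback; assumed to be a plain Python int.
device_pointer_value = 139939146685440
start = np.int32(0)


def add_offset(pointer, offset):
    """Mimic `self.device_pointer_value + start` and return the outcome."""
    try:
        return pointer + offset
    except OverflowError as exc:
        # With NumPy >= 2 (NEP 50 promotion) the Python int is treated as
        # int32 here, and the value does not fit, so the addition raises.
        return exc


result = add_offset(device_pointer_value, start)
print(np.__version__, type(result).__name__, result)
```

With numpy below 2.0 I would expect this to print an int64 result, and with numpy 2.x the OverflowError from the traceback.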
Steps/Code to reproduce bug
It happens in my Gitlab CI when something much like the following is done:
import numba.cuda as cuda
import numpy as np
arr = np.arange(50 * 50 * 100, dtype=np.float32).reshape(50, 50, 100)
arr2 = cuda.to_device(arr)
The problem goes away when I downgrade numpy below 2.0. I can't reproduce it on e.g. an L40S with 580.95.05 drivers.
Expected behavior
There should be no reliance on auto-casting of typed integers in the code. Wrapping start (and stop) in int64, or converting them to plain Python integers, would probably fix the issue?
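As a rough sketch of what such a fix could look like, the offsets can be normalised to Python integers before any arithmetic. This is not the actual numba-cuda code: view_base_and_size is a hypothetical stand-alone helper that only mirrors the size/base computation in view, and it uses operator.index rather than an int64 cast, which is my own substitution.

```python
import operator

import numpy as np


def view_base_and_size(pointer_value, total_size, start, stop=None):
    """Hypothetical helper mirroring the arithmetic in `view`.

    `start` and `stop` are converted to Python ints first, so the result
    does not depend on NumPy's scalar promotion rules.
    """
    start = operator.index(start)  # accepts int, np.int32, np.int64, ...
    if stop is None:
        size = total_size - start
    else:
        size = operator.index(stop) - start
    base = pointer_value + start  # plain Python int arithmetic, no overflow
    return base, size


# Values from the traceback; 1_000_000 bytes matches the 50x50x100 float32 array.
base, size = view_base_and_size(
    139939146685440, 1_000_000, np.int32(0), np.int32(1152)
)
print(base, size)
```

operator.index also rejects floats, so a non-integer offset would fail loudly instead of being truncated.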
Environment details (please complete the following information):
- Environment location: Gitlab enterprise GPU runner (VM with T4 GPU)
- Method of numba-cuda install: pip
Additional context
It's a bit hard to deterministically reproduce since I don't know where device_pointer_value comes from, but it should be easy enough to make sure this doesn't happen at all.
Here's an example of a recent CI failure of mine:
https://gitlab.com/liebi-group/software/mumott/-/jobs/12288859623