[BUG] Integer overflow in device array when numpy version > 2.0 for T4 GPU:s #623

@lcniel


Describe the bug
It seems that the MemoryPointer object relies on the implicit upcasting behaviour of older NumPy versions to promote int32 to int64 on demand. On newer NumPy versions this results in semi-deterministic errors whenever the device pointer value doesn't fit in an int32, at least in some cases. I can reproduce the error on T4 GPUs with driver version 535.161.07, which is what GitLab runners use.


This is with numpy==2.2.6 and numba_cuda==0.21.3.
Example pytest traceback:

self = <numba.cuda.cudadrv.driver.AutoFreePointer object at 0x7f4590b57f70>
start = np.int32(0), stop = np.int32(1152)
    def view(self, start, stop=None):
        if stop is None:
            size = self.size - start
        else:
            size = stop - start
    
        # Handle NULL/empty memory buffer
        if not self.device_pointer_value:
            if size != 0:
                raise RuntimeError("non-empty slice into empty slice")
            view = self  # new view is just a reference to self
        # Handle normal case
        else:
>           base = self.device_pointer_value + start
E           OverflowError: Python integer 139939146685440 out of bounds for int32
local_installation_linux/numba_cuda/numba/cuda/cudadrv/driver.py:1852: OverflowError
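For context, the traceback boils down to a NumPy 2.x behaviour change (NEP 50): adding a Python int that doesn't fit in 32 bits to an np.int32 scalar now raises OverflowError instead of silently upcasting the result to int64. A minimal demonstration, using the pointer value from the traceback above:

```python
import numpy as np

# Pointer value taken from the traceback above; any value outside the
# int32 range triggers the same behaviour.
ptr = 139939146685440

try:
    base = ptr + np.int32(0)
    behaviour = "upcast"    # NumPy < 2.0: result silently promoted to int64
except OverflowError:
    behaviour = "overflow"  # NumPy >= 2.0 (NEP 50): the Python int must fit in int32

print(behaviour)
```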

Steps/Code to reproduce bug
It happens in my GitLab CI when something much like the following is done:

import numba.cuda as cuda
import numpy as np
arr = np.arange(50 * 50 * 100, dtype=np.float32).reshape(50, 50, 100)
arr2 = cuda.to_device(arr)

The problem goes away when I downgrade numpy below 2.0. I can't reproduce it on e.g. an L40S with 580.95.05 drivers.

Expected behavior
The code should not rely on implicit upcasting of typed integers. Converting start (and stop) to a plain Python int or int64 before the pointer arithmetic would probably fix the issue.
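A minimal sketch of what such a fix might look like. This is a hypothetical stand-alone helper, not the actual numba-cuda code (the real view() lives in cudadrv/driver.py and handles empty buffers etc.):

```python
import numpy as np

# Hypothetical sketch of the proposed fix: coerce the slice bounds to plain
# Python ints before doing pointer arithmetic, so the result is never
# constrained to int32 regardless of NumPy version.
def view_base(device_pointer_value, size, start, stop=None):
    start = int(start)                          # defuses np.int32 inputs
    stop = None if stop is None else int(stop)
    length = (size if stop is None else stop) - start
    base = device_pointer_value + start         # plain-int arithmetic, no OverflowError
    return base, length

# Same values as in the traceback above:
base, length = view_base(139939146685440, 1152, np.int32(0), np.int32(1152))
```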

Environment details (please complete the following information):

  • Environment location: GitLab enterprise GPU runner (VM with T4 GPU)
  • Method of numba-cuda install: pip

Additional context
It's a bit hard to reproduce deterministically since I don't know where device_pointer_value comes from, but it should be easy enough to make sure this can't happen at all.

Here's an example of a recent CI failure of mine:

https://gitlab.com/liebi-group/software/mumott/-/jobs/12288859623
