Describe the bug
The recently added `from_dtype` LRU caching uses the numpy dtype as the cache key. However, numpy's equality comparison does not distinguish between aligned and unaligned struct types:

```python
f16x2 = np.dtype([("x", np.float16), ("y", np.float16)])
f16x2_aligned = np.dtype([("x", np.float16), ("y", np.float16)], align=True)
```
This causes a caching collision: if an unaligned type is cached first, all subsequent lookups for the aligned variant will incorrectly return the cached unaligned type. As a result, numba will treat aligned types as unaligned, causing the PTX generator to emit fine-grained memory accesses instead of fast aligned accesses, leading to degraded performance.
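The collision can be demonstrated directly with `functools.lru_cache` (a minimal sketch; `cached_lookup` is a hypothetical stand-in for the real `from_dtype`, returning only the alignment flag):

```python
import numpy as np
from functools import lru_cache

f16x2 = np.dtype([("x", np.float16), ("y", np.float16)])
f16x2_aligned = np.dtype([("x", np.float16), ("y", np.float16)], align=True)

# The two dtypes compare and hash equal, yet carry different alignment flags.
assert f16x2 == f16x2_aligned
assert hash(f16x2) == hash(f16x2_aligned)
assert f16x2.isalignedstruct is False
assert f16x2_aligned.isalignedstruct is True

@lru_cache
def cached_lookup(dtype):
    # stand-in for from_dtype: report whether the struct is aligned
    return dtype.isalignedstruct

cached_lookup(f16x2)                 # caches the unaligned result first
print(cached_lookup(f16x2_aligned))  # cache hit: stale unaligned result
```

Because `lru_cache` keys on `__hash__`/`__eq__`, the second call never reaches `isalignedstruct` and silently reuses the unaligned entry.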
Steps/Code to reproduce bug
- Prepare a clean Python environment
- Execute `pip install numba-cuda`
- The code below demonstrates the behaviour:
```python
import numba.cuda

def div_by_2(x):
    return x / 2

def main():
    sig = numba.cuda.types.complex128(numba.cuda.types.complex128)
    func = numba.cuda.compile(div_by_2, sig, device=True, abi="c")
    print(func)

if __name__ == "__main__":
    main()
```
Expected behavior
Function will print:

```
True
f1 signature: [(Array(Record([('x', {'type': float16, 'offset': 0, 'alignment': None, 'title': None, }), ('y', {'type': float16, 'offset': 2, 'alignment': None, 'title': None, })], 4, False), 1, 'C', False, aligned=False), Array(int64, 1, 'C', False, aligned=True))]
f2 signature: [(Array(Record([('x', {'type': float16, 'offset': 0, 'alignment': None, 'title': None, }), ('y', {'type': float16, 'offset': 2, 'alignment': None, 'title': None, })], 4, False), 1, 'C', False, aligned=False), Array(int64, 1, 'C', False, aligned=True))]
```
Both signatures use the unaligned type.
Environment details (please complete the following information):
- Environment location: WSL (local machine)
- Method of numba-cuda install: pip install
- Python env:
numba-cuda==0.27.0
numba==0.63.1
Python 3.11.14
Additional context
numba_cuda/numba/cuda/np/numpy_support.py:124
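One possible mitigation (an illustrative sketch under my own naming, not the project's actual fix) is to fold `isalignedstruct` into the cache key, so the aligned and unaligned variants get distinct cache entries even though the dtypes compare equal:

```python
import numpy as np
from functools import lru_cache

@lru_cache
def _from_dtype_impl(dtype, aligned):
    # `aligned` is part of the cache key, so dtypes that numpy considers
    # equal but that differ in alignment no longer share an entry;
    # the return value is a stand-in for the real Numba type
    return ("record", aligned)

def from_dtype(dtype):
    # isalignedstruct distinguishes align=True struct dtypes even when
    # numpy's __eq__/__hash__ treat them as identical
    return _from_dtype_impl(dtype, dtype.isalignedstruct)

f16x2 = np.dtype([("x", np.float16), ("y", np.float16)])
f16x2_aligned = np.dtype([("x", np.float16), ("y", np.float16)], align=True)
assert from_dtype(f16x2) == ("record", False)
assert from_dtype(f16x2_aligned) == ("record", True)
```

An alternative would be keying on `(dtype, dtype.itemsize, dtype.isalignedstruct)` or bypassing the cache for struct dtypes entirely; the essential point is that the raw dtype alone is an ambiguous key.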