Use kernel device-specific descriptor to determine max-wg-size for this kernel
This resolves the error
```
RuntimeError: Exceeded the number of registers available on the hardware.
The number registers per work-group cannot exceed 65536 for this kernel on this device.
The kernel uses 108 registers per work-item for a total of 1024 work-items per work-group.
-54 (PI_ERROR_INVALID_WORK_GROUP_SIZE)
```
when running the following example:
```python
import dpctl.tensor as dpt
m1 = dpt.ones((1000, 1000), dtype="i4", device="cuda")
m2 = dpt.ones((1000, 1003), dtype="i4", device="cuda")
r = dpt.matmul(m1[:, :900], m2[:900, :])
```
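
The arithmetic behind the error message can be checked with a short sketch. The register counts below are taken from the message itself; the helper names and the optional rounding to a sub-group multiple of 32 are illustrative assumptions, not part of dpctl or of this change. Real hardware may also allocate registers at a coarser granularity, so the limit reported by the kernel's device-specific descriptor is the authoritative value.

```python
# Figures quoted in the error message above.
REGISTER_LIMIT_PER_WORK_GROUP = 65536
REGISTERS_PER_WORK_ITEM = 108
REQUESTED_WORK_GROUP_SIZE = 1024


def registers_needed(work_group_size, regs_per_item):
    """Total registers a work-group of the given size would require."""
    return work_group_size * regs_per_item


def max_work_group_size(reg_limit, regs_per_item, sub_group_size=1):
    """Largest work-group size that fits in reg_limit registers.

    Hypothetical helper: the result is rounded down to a multiple of
    sub_group_size (an assumed granularity, e.g. 32 on CUDA devices).
    """
    upper_bound = reg_limit // regs_per_item
    return (upper_bound // sub_group_size) * sub_group_size


needed = registers_needed(REQUESTED_WORK_GROUP_SIZE, REGISTERS_PER_WORK_ITEM)
fits = needed <= REGISTER_LIMIT_PER_WORK_GROUP
largest = max_work_group_size(
    REGISTER_LIMIT_PER_WORK_GROUP, REGISTERS_PER_WORK_ITEM
)
largest_aligned = max_work_group_size(
    REGISTER_LIMIT_PER_WORK_GROUP, REGISTERS_PER_WORK_ITEM, sub_group_size=32
)

print(f"requested work-group needs {needed} registers (fits: {fits})")
print(f"largest work-group size within the limit: {largest}")
print(f"largest multiple of 32 within the limit: {largest_aligned}")
```

A work-group of 1024 work-items at 108 registers each needs well over the 65536 registers available, which is why the device-wide maximum work-group size cannot be used for this kernel and a per-kernel limit is needed instead.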