Optimization of the `custom_reduce_over_group` function.
The function used to perform the custom reduction sequentially in a single work-item (the leader of the work-group).
It now performs a few iterations of the reduction cooperatively across the work-group, and the leading work-item then reduces the remaining elements sequentially.
This sped up `custom_reduce_over_group` by about a factor of 3.
Timing the float32 reduction kernel with
```
unitrace -d -v -i 20 python -c "import dpctl.tensor as dpt; dpt.min(dpt.ones(10**7, dtype=dpt.float32)).sycl_queue.wait()"
```
shows it is now on par with (less than 10% slower than) the int32 kernel, which uses the built-in `sycl::reduce_over_group`:
```
unitrace -d -v -i 20 python -c "import dpctl.tensor as dpt; dpt.min(dpt.ones(10**7, dtype=dpt.int32)).sycl_queue.wait()"
```