Changed the `queue.fill()` implementation to make use of the native
functions of the specific backend in use. Also unified the implementation
with the one for memset, since memset is just the 8-bit case of fill.
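For reference, the two entry points at the SYCL level relate like this (a minimal sketch using the standard SYCL 2020 USM API, not code from this patch):

```cpp
#include <sycl/sycl.hpp>

int main() {
  sycl::queue q;
  constexpr size_t n = 1024;
  unsigned char *ptr = sycl::malloc_device<unsigned char>(n, q);

  // memset: the pattern is always a single byte.
  q.memset(ptr, 0xAB, n);

  // fill: the pattern may be any trivially copyable type; with an 8-bit
  // pattern it is equivalent to the memset call above.
  q.fill(ptr, static_cast<unsigned char>(0xAB), n);

  q.wait();
  sycl::free(ptr, q);
}
```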
In the CUDA case, both memset and fill now call `urEnqueueUSMFill`, which,
depending on the size of the fill pattern, calls `cuMemsetD8Async`,
`cuMemsetD16Async`, `cuMemsetD32Async`, or `commonMemSetLargePattern`.
Before this patch, memset already took this path, but it always set
patternSize to 1 byte, so it ended up calling `cuMemsetD8Async`. The
behaviour in the other backends is analogous.
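A rough sketch of that dispatch using the CUDA driver API directly (the adapter's actual code differs; `fallbackLargePattern` below is a naive, hypothetical stand-in for `commonMemSetLargePattern`):

```cpp
#include <cuda.h>
#include <cstring>

// Hypothetical, inefficient stand-in for the adapter's
// commonMemSetLargePattern helper: repeatedly copy the pattern from the host.
static void fallbackLargePattern(CUdeviceptr dst, const void *pattern,
                                 size_t patternSize, size_t size,
                                 CUstream stream) {
  for (size_t off = 0; off + patternSize <= size; off += patternSize)
    cuMemcpyHtoDAsync(dst + off, pattern, patternSize, stream);
}

// Sketch of the pattern-size dispatch performed on the CUDA backend.
void usmFill(CUdeviceptr dst, const void *pattern, size_t patternSize,
             size_t size, CUstream stream) {
  switch (patternSize) {
  case 1: {
    unsigned char p;
    std::memcpy(&p, pattern, 1);
    cuMemsetD8Async(dst, p, size, stream);      // count of 8-bit elements
    break;
  }
  case 2: {
    unsigned short p;
    std::memcpy(&p, pattern, 2);
    cuMemsetD16Async(dst, p, size / 2, stream); // count of 16-bit elements
    break;
  }
  case 4: {
    unsigned int p;
    std::memcpy(&p, pattern, 4);
    cuMemsetD32Async(dst, p, size / 4, stream); // count of 32-bit elements
    break;
  }
  default:
    fallbackLargePattern(dst, pattern, patternSize, size, stream);
    break;
  }
}
```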
Previously, the fill method just launched a `parallel_for` that wrote the
pattern element by element, which made the operation quite slow.
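For comparison, the old path was roughly equivalent to launching a kernel like the following (an illustrative sketch, not the exact code that was removed):

```cpp
#include <sycl/sycl.hpp>

// Roughly what queue.fill() used to do: one work-item per element.
template <typename T>
sycl::event slowFill(sycl::queue &q, T *ptr, const T &pattern, size_t count) {
  return q.parallel_for(sycl::range<1>{count},
                        [=](sycl::id<1> i) { ptr[i] = pattern; });
}
```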