[SYCL] Extract args directly from kernel if we can (intel#18387)
In some cases, all values that need to be passed as kernel arguments are
stored within the kernel function object, and their offsets can be
calculated using the integration header or equivalent built-ins. In such
cases, we can therefore set kernel arguments directly without staging
via `MArgs`.
This first attempt is limited to the simplest cases where all kernel
arguments are either standard layout types or pointers. It may be
possible to extend this approach to cover other cases, but only if some
classes are redesigned.
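
As a rough illustration of the idea (not the code in this PR), the sketch
below reads each argument straight out of a kernel function object using
per-argument size and offset information. `ParamKind`, `ParamDesc`,
`Kernel`, `KernelParams`, `FakeBackend`, and `setArgsDirectly` are all
hypothetical stand-ins: the table plays the role of the integration
header, and `FakeBackend::setArg` plays the role of whatever call
actually sets a kernel argument.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <vector>

// Hypothetical descriptor for one captured value, loosely modeled on the
// kind/size/offset information an integration header could provide.
enum class ParamKind { StdLayout, Pointer, Accessor };

struct ParamDesc {
  ParamKind kind;
  std::size_t size;
  std::size_t offset;
};

// Stand-in for a kernel function object: a lambda's captures as a struct.
struct Kernel {
  float *data;
  int n;
  float scale;
};

// Stand-in for the integration header: one descriptor per capture.
constexpr ParamDesc KernelParams[] = {
    {ParamKind::Pointer, sizeof(float *), offsetof(Kernel, data)},
    {ParamKind::StdLayout, sizeof(int), offsetof(Kernel, n)},
    {ParamKind::StdLayout, sizeof(float), offsetof(Kernel, scale)},
};

// Hypothetical backend that records the bytes of each argument it is given.
struct RecordedArg {
  std::size_t size;
  std::vector<unsigned char> bytes;
};

struct FakeBackend {
  std::vector<RecordedArg> args;

  void setArg(std::size_t index, std::size_t size, const void *value) {
    assert(index == args.size() && "arguments must arrive in order");
    RecordedArg arg{size, std::vector<unsigned char>(size)};
    std::memcpy(arg.bytes.data(), value, size);
    args.push_back(arg);
  }
};

// Set every argument by pointing into the kernel object at the recorded
// offset; no intermediate argument vector is built first.
template <std::size_t N>
void setArgsDirectly(FakeBackend &backend, const void *kernelObj,
                     const ParamDesc (&params)[N]) {
  const auto *base = static_cast<const unsigned char *>(kernelObj);
  for (std::size_t i = 0; i < N; ++i)
    backend.setArg(i, params[i].size, base + params[i].offset);
}
```

Calling `setArgsDirectly(backend, &kernel, KernelParams)` on a `Kernel`
instance hands the backend one pointer per capture. This only works
because every capture here is a pointer or a standard layout value whose
bytes can be passed through unchanged.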
The implementation currently stores some information (e.g., the number
of kernel arguments) inside the handler, because there is no way to
access the kernel type within `handler::finalize()`.
---
Some notes for reviewers:
- This depends on the new `hasSpecialCaptures` functionality introduced
in intel#18386, which returns `false` for kernels that only capture
standard layout classes and pointers.
- There are some seemingly unrelated changes in kernel_desc.hpp and to
some of the unit tests. These changes were necessary because
`hasSpecialCaptures` requires `getParamDesc` to be `constexpr`. I think
this wasn't picked up during intel#18386 because `hasSpecialCaptures` wasn't
previously being run for every kernel.
- I'm not really satisfied with the solution of adding a lot more member
variables, but it was the best way I could think of to limit the scope
of the changes required. Long-term, it would be better to do one of the
following: extract everything (including the complicated cases)
directly from the lambda; design an abstraction that unifies the
`MArgs` and `MKernelFuncPtr` paths; or find a way to access the
required values without storing them in the handler (e.g., using
something like intel#18081).
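
To illustrate why a check like `hasSpecialCaptures` needs `getParamDesc`
to be `constexpr`, here is a minimal sketch, not the actual
implementation. It assumes, as the name suggests, that the check reports
`true` when at least one capture is neither standard layout nor a
pointer. `KernelInfo`, `PlainKernel`, `AccessorKernel`, `SubmitPath`, and
`choosePath` are hypothetical names, and the sizes and offsets in the
tables are made up.

```cpp
#include <cstddef>

enum class ParamKind { StdLayout, Pointer, Accessor };

struct ParamDesc {
  ParamKind kind;
  std::size_t size;
  std::size_t offset;
};

// Hypothetical kernel name types.
struct PlainKernel {};
struct AccessorKernel {};

// Hypothetical per-kernel tables standing in for the integration header.
template <typename K> struct KernelInfo;

template <> struct KernelInfo<PlainKernel> {
  static constexpr std::size_t numParams = 2;
  static constexpr ParamDesc getParamDesc(std::size_t i) {
    constexpr ParamDesc table[] = {{ParamKind::Pointer, 8, 0},
                                   {ParamKind::StdLayout, 4, 8}};
    return table[i];
  }
};

template <> struct KernelInfo<AccessorKernel> {
  static constexpr std::size_t numParams = 2;
  static constexpr ParamDesc getParamDesc(std::size_t i) {
    constexpr ParamDesc table[] = {{ParamKind::Accessor, 32, 0},
                                   {ParamKind::StdLayout, 4, 32}};
    return table[i];
  }
};

// Usable in a constant expression only because getParamDesc is constexpr.
template <typename K> constexpr bool hasSpecialCaptures() {
  for (std::size_t i = 0; i < KernelInfo<K>::numParams; ++i) {
    const ParamKind kind = KernelInfo<K>::getParamDesc(i).kind;
    if (kind != ParamKind::StdLayout && kind != ParamKind::Pointer)
      return true;
  }
  return false;
}

static_assert(!hasSpecialCaptures<PlainKernel>(),
              "pointer and standard layout captures are not special");
static_assert(hasSpecialCaptures<AccessorKernel>(),
              "an accessor capture is special");

enum class SubmitPath { Direct, Staged };

// Pick the submission path at compile time, once per kernel type.
template <typename K> constexpr SubmitPath choosePath() {
  if constexpr (hasSpecialCaptures<K>())
    return SubmitPath::Staged; // e.g., stage the arguments via `MArgs`
  else
    return SubmitPath::Direct; // read arguments from the kernel object
}
```

Because the check is evaluated for every kernel type that is
instantiated, any non-`constexpr` step in the chain becomes a compile
error rather than a runtime cost. The sketch requires C++17 for
`if constexpr`.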
---
This change was motivated by profiling of the `SubmitKernel` benchmark
in the https://github.com/intel/compute-benchmarks/ suite, which can be
run with a command similar to:
```
/build/bin/api_overhead_benchmark_sycl --test=SubmitKernel --csv --noHeaders --Ioq=1 --MeasureCompletion=0 --iterations=100000 --Profiling=0 --NumKernels=10 --KernelExecTime=1 --UseEvents=0
```
This is the simplest submission case there is, appending a kernel with
no special arguments to an in-order queue. In the benchmarks on my
machine, I saw around 1-2% of execution time spent in calls to
`extractArgsAndReqsFromLambda`, attributed to populating the `MArgs`
vector using information from the integration headers. This PR removes
the need to call `extractArgsAndReqsFromLambda` entirely in the
submission path used by this benchmark, thus improving performance.
---------
Signed-off-by: John Pennycook <[email protected]>