-
Notifications
You must be signed in to change notification settings - Fork 1
[SYCL] Documentation for the proposed design for kernel argument decomposition. #54
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Changes from 1 commit
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -5,6 +5,7 @@ SYCL Compiler and Runtime architecture design | |
.. contents:: | ||
:local: | ||
|
||
|
||
Introduction | ||
============ | ||
|
||
|
@@ -13,6 +14,7 @@ library. More details are provided in | |
`external document <https://github.com/intel/llvm/blob/sycl/sycl/doc/design/CompilerAndRuntimeDesign.md>`_\ , | ||
which are going to be added to clang documentation in the future. | ||
|
||
|
||
Address space handling | ||
====================== | ||
|
||
|
@@ -112,8 +114,293 @@ space attributes for pointers: | |
* - ``__attribute__((opencl_private))`` | ||
- private_space | ||
|
||
|
||
.. code-block:: C++ | ||
|
||
//TODO: add support for __attribute__((opencl_global_host)) and __attribute__((opencl_global_device)). | ||
|
||
|
||
Kernel argument validation and decomposition | ||
============================================ | ||
|
||
SYCL 2020 specifies requirements on the types of arguments that can be passed to | ||
a SYCL kernel. | ||
These requirements are enforced on the arguments passed to functions declared | ||
with the | ||
`[[clang::sycl_kernel_entry_point]] <https://clang.llvm.org/docs/AttributeReference.html#sycl-kernel-entry-point>`__ | ||
attribute. | ||
|
||
Valid kernel argument types can be broadly categorized in three groups per | ||
`section 3.13.1, "Device copyable" <https://registry.khronos.org/SYCL/specs/sycl-2020/html/sycl-2020.html#sec::device.copyable>`__ | ||
and | ||
`section 4.12.4, "Rules for parameter passing to kernels" <https://registry.khronos.org/SYCL/specs/sycl-2020/html/sycl-2020.html#sec:kernel.parameter.passing>`__. | ||
|
||
* Types that are implicitly device copyable because they satisfy the C++ definition | ||
of trivially copyable | ||
(`[basic.types.general]p9 <https://eel.is/c++draft/basic.types.general#9>`__, | ||
`[class.prop]p1 <https://eel.is/c++draft/class.prop#1>`__. | ||
Such types may be bit-copied to the device. | ||
* Types that are explicitly device copyable because, for a type ``T``, | ||
``sycl::is_device_copyable_v<T>`` is true. | ||
Such types may be bit-copied to the device. | ||
* Types that are device copyable by fiat (``sycl::accessor``, ``sycl::local_accessor``, | ||
``sycl::stream``, ``sycl::reducer``, etc...). | ||
Such types may require special handling. | ||
|
||
Support for the third category of types is provided through a *decomposition protocol* | ||
that such types opt in to as described below. | ||
The decomposition protocol facilitates transformation of an object of such a type | ||
into a sequence of objects, each of which has a type that satisfies one of the | ||
first two type categories. | ||
|
||
A type that opts into the decomposition protocol may be a type that also satisfies | ||
the C++ definition of a trivially copyable type and may therefore appear as a | ||
subobject | ||
(`[intro.object]p2 <https://eel.is/c++draft/intro.object#2>`__) | ||
type of another type that also satisfies the trivially copyable type requirements. | ||
tahonermann marked this conversation as resolved.
Outdated
Show resolved
Hide resolved
|
||
|
||
Types that opt in to the decomposition protocol, or directly or indirectly have a | ||
subobject type that opts in to the decomposition protocol, *require decomposition*. | ||
A kernel argument of such a type is transformed to a sequence of arguments that | ||
are substituted for the original argument. | ||
|
||
Given a call to a function declared with the ``[[clang::sycl_kernel_entry_point]]`` | ||
attribute, each argument ``A`` of parameter type ``P`` is processed as follows. | ||
The resulting sequence of replacement arguments constitutes the arguments to the | ||
``sycl_kernel_launch()`` function and their types constitute the corresponding | ||
parameters of the synthesized offload kernel entry point function (the SYCL kernel | ||
caller function). | ||
|
||
#. If ``P`` is a non-union class type with a ``sycl_deconstruct()`` member function, | ||
then the type is one that has opted in to the decomposition protocol and shall | ||
meet the requirements below. | ||
|
||
#. The ``sycl_deconstruct()`` member function shall be a non-static member | ||
function, shall declare no parameters, shall return a *tuple-like* type that | ||
satisfies the requirements for an initializer of a structured binding declaration | ||
(`[dcl.struct.bind] <https://eel.is/c++draft/dcl.struct.bind>`__), | ||
shall not be declared with the ``[[noreturn]]`` attribute, may have a | ||
potentially-throwing exception specification | ||
(`[except.spec]p1 <https://eel.is/c++draft/except.spec#1>`__), | ||
and may have a function body that throws exceptions. | ||
|
||
#. The class shall declare a ``sycl_reconstruct()`` member function. | ||
That function shall be a static member function, shall have a return type of | ||
cv-unqualified ``P``, shall not be declared with the ``[[noreturn]]`` | ||
attribute, may have a non-throwing exception specification, and shall declare | ||
parameters corresponding to the elements of the tuple-like type returned by | ||
``sycl_deconstruct()``. | ||
For each element of that tuple-like type, there shall be a corresponding in-order | ||
parameter with a type that is convertible from the tuple element type. | ||
Additional parameters with default arguments may be present. | ||
The body of the function shall abide by the device language restrictions | ||
specified in | ||
`section 5.4, "Language restrictions for device functions" <https://registry.khronos.org/SYCL/specs/sycl-2020/html/sycl-2020.html#sec:language.restrictions.kernels>`__. | ||
|
||
The original argument ``A`` is replaced with the sequence of elements returned | ||
in the tuple-like type for a call to ``sycl_deconstruct()`` on ``A``. | ||
Each element with a type that requires decomposition is recursively processed | ||
and replaced by its sequence of decomposed objects. | ||
|
||
#. Otherwise, if ``P`` is a non-union aggregate type | ||
(`[dcl.init.aggr]p1 <https://eel.is/c++draft/dcl.init.aggr#1>`__) | ||
or a lambda closure type | ||
(`[expr.prim.lambda.closure]p1 <https://eel.is/c++draft/expr.prim.lambda.closure#1>`__), | ||
its subobject types determine whether ``P`` requires decomposition. | ||
If any subobject type requires decomposition, then ``P`` requires decomposition. | ||
tahonermann marked this conversation as resolved.
Outdated
Show resolved
Hide resolved
|
||
Each subobject of ``A`` that has a type that requires decomposition is recursively | ||
processed and their replacement objects are sequenced after ``A``. | ||
If all subobject types require decomposition, then ``A`` is removed from the kernel | ||
argument list. | ||
Otherwise, the remaining subobjects of ``A`` that do not require decomposition may | ||
be passed as individual kernel arguments in place of ``A`` (the choice to pass ``A`` | ||
with the storage for its decomposed subobjects bit-copied and reinitialized in the | ||
offload entry point function or to pass each remaining subobject as a distinct | ||
argument in place of ``A`` is unspecified; the intent is to allow the most efficient | ||
choice to be made based on the cost of passing ``A`` vs the cost of passing | ||
additional arguments). | ||
|
||
#. Otherwise, if ``P`` is a trivially copyable non-class type, then ``A`` is passed as | ||
a bit-copyable argument. | ||
|
||
#. Otherwise, if the `SYCL_DEVICE_COPYABLE` macro is predefined with a value of ``1``, | ||
``P`` is a class type, and ``sycl::is_device_copyable_v<P>`` is true, ``P`` shall | ||
satisfy the constraints listed in | ||
`section 3.13.1, "Device copyable" <https://registry.khronos.org/SYCL/specs/sycl-2020/html/sycl-2020.html#sec::device.copyable>`__. | ||
``A`` is passed as a bit-copyable argument. | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. What about a case like this:
Do you expect this rule to apply to There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. The rule that governs There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. If this is your intent, then I think rule 2 needs to be clarified somehow. There is no mention in that rule about sub-object types that are marked There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Those cases fall into the "Otherwise ..." portion of the rule. I'll try to add some clarification. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I tried to clarify this. Please review. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I still think this is confusing. Rule 4 is very detailed about what constitutes "device copyability" (e.g. it talks about the I actually think it would be clearer to omit the details about when an object is "device copyable" -- these details are in the SYCL specification. Thus, I think you could simplify rule 4. In fact, I think rules 3, 4, and 6 could all be combined into a single rule that just says:
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. When writing these, I was thinking of things from an implementer perspective and, in particular, what type checking needs to be done in which order. That is why I separated the checks for trivial copyability and the However, given recent discussion, I think we'll be pursuing a different approach. I have a proto-POC that provides the capabilities that Andrei is requesting. If discussions with him go well, I'll submit a new PR with documentation for that approach. |
||
|
||
#. Otherwise, if ``P`` is a union type, then each of its non-static data members shall | ||
have a trivially copyable type, none of which requires decomposition. | ||
``A`` is passed as a bit-copyable argument. | ||
tahonermann marked this conversation as resolved.
Show resolved
Hide resolved
|
||
|
||
#. Otherwise, if ``P`` is a trivially copyable class type, none of its non-static data | ||
members shall require decomposition and its anonymous union members and non-static | ||
data members of union type shall satisfy the constraints on a union type described | ||
above. ``A`` is passed as a bit-copyable argument. | ||
|
||
#. Otherwise, ``P`` is not a valid kernel argument type and the program is ill-formed. | ||
|
||
If the sequence of replacement arguments contains an argument with a type that is, | ||
or has as a subobject type, a reference type, a type that is prohibited in device | ||
code (e.g., ``long double``), or a pointer to data member type (see | ||
`SYCL WG issue 612 <https://github.com/KhronosGroup/SYCL-Docs/issues/612>`__), | ||
then the program is ill-formed. | ||
|
||
The constraints on union types are derived from explicit restrictions specified in | ||
`section 5.4, "Language restrictions for device functions" <https://registry.khronos.org/SYCL/specs/sycl-2020/html/sycl-2020.html#sec:language.restrictions.kernels>`__ | ||
and practical restrictions that are not clearly addressed in SYCL 2020 (see | ||
`CMPLRLLVM-61883 <https://jira.devtools.intel.com/browse/CMPLRLLVM-61883>`__ | ||
for some previous discussion). | ||
|
||
The sequence of possibly decomposed arguments is passed to the | ||
``sycl_kernel_launch()`` function. | ||
Each argument is passed as an xvalue and may be move-constructed from. | ||
Destructors are invoked as usual for the argument/parameter pairs of the | ||
``[[clang::sycl_kernel_entry_point]]`` attributed function and for the | ||
``sycl_kernel_launch()`` function. | ||
|
||
The body of a function declared with the ``[[clang::sycl_kernel_entry_point]]`` | ||
attribute is incorporated in the body of the synthesized offload kernel entry point | ||
function. | ||
In order to execute the incorporated statements, objects matching the original | ||
sequence of arguments must be available. | ||
For arguments that were not subject to decomposition, the matching parameter is | ||
used. | ||
For arguments that were subject to decomposition, the original argument is | ||
reconstructed as a local variable from the parameters that correspond to the | ||
decomposed sequence of arguments. | ||
Within the variable initialization, parameters are referenced as xvalues and | ||
may be move-constructed from. | ||
Destructors are invoked as usual for the parameters of the synthesized offload | ||
kernel entry point and the local variables used to reconstruct the original | ||
kernel arguments. | ||
|
||
Consider the following example. | ||
|
||
.. code-block:: C++ | ||
|
||
#include <tuple> | ||
#include <sycl/sycl.hpp> | ||
|
||
template<typename KN, typename... Ts> | ||
void sycl_kernel_launch(const char* kn, Ts.. ts) { ... } | ||
|
||
template<typename KN, typename KT> | ||
[[clang::sycl_kernel_entry_point(KN)]] | ||
void kernel_entry_point(KT k) { | ||
k(); | ||
} | ||
|
||
struct X { | ||
~X(); | ||
int dm; | ||
}; | ||
template<> bool sycl::is_device_copyable_v<X> = true; | ||
|
||
struct special_type { | ||
~special_type(); | ||
std::tuple<int, X> sycl_deconstruct(); | ||
static special_type sycl_reconstruct(int, X) noexcept; | ||
}; | ||
|
||
struct aggregate { | ||
special_type sta[2]; | ||
}; | ||
|
||
struct kernel_name; | ||
|
||
void f() { | ||
int i; | ||
special_type st; | ||
aggregate a; | ||
auto k = [i, st, a] {}; | ||
kernel_entry_point<kernel_name>(k); | ||
} | ||
|
||
For host compilation, the body of ``kernel_entry_point<kernel_name>()`` is | ||
replaced with synthesized code that performs kernel argument decomposition | ||
and forwards the results to ``sycl_kernel_launch()``. | ||
The synthesized code looks approximately as follows. | ||
Structured bindings are used to illustrate the intent. | ||
Note that some of this code does not conform to standard C++ (a lambda | ||
closure type can not necessarily be decomposed as shown), but it hopefully | ||
suffices to convey the intent. | ||
|
||
.. code-block:: C++ | ||
|
||
// KT is decltype(k) in f(). | ||
void kernel_entry_point<kernel_name>(KT k) { | ||
// 'k' is a lambda closure type with captures that require decomposition. | ||
auto& [ i, st, a ] = k; | ||
// All of the captures of 'k' have been decomposed; it is eliminated as a kernel argument. | ||
// 'i' is a trivially copyable non-class type; no further decomposition required. | ||
// 'st' is of a type that opts into the decomposition protocol. | ||
auto&& [ st1, st2 ] = st.sycl_deconstruct(); | ||
// 'st1' is a trivially copyable non-class type; no further decomposition required. | ||
// 'st2' is an explicitly device copyable type; no further decomposition required. | ||
// 'a' is an aggregate with a member that requires decomposition. | ||
auto& [ asta ] = a; | ||
// All of the data members of 'a' have been decomposed; it is eliminated as a kernel argument. | ||
// 'asta' is an aggregate with elements that require decomposition. | ||
auto& [ asta1, asta2 ] = asta; | ||
// All of the elements of 'asta' have been decomposed; it is eliminated as a kernel argument. | ||
// 'asta1' is of a type that opts into the decomposition protocol. | ||
auto&& [ asta1_1, asta1_2 ] = asta1.sycl_deconstruct(); | ||
// All of the elements of 'asta1' have been decomposed; it is eliminated as a kernel argument. | ||
// 'asta1_1' is a trivially copyable non-class type; no further decomposition required. | ||
// 'asta1_2' is an explicitly device copyable type; no further decomposition required. | ||
// 'asta2' is of a type that opts into the decomposition protocol. | ||
auto&& [ asta2_1, asta2_2 ] = asta2.sycl_deconstruct(); | ||
// All of the elements of 'asta2' have been decomposed; it is eliminated as a kernel argument. | ||
// 'asta2_1' is a trivially copyable non-class type; no further decomposition required. | ||
// 'asta2_2' is an explicitly device copyable type; no further decomposition required. | ||
|
||
// Pass the decomposed arguments, all of which satisfy the SYCL 2020 device copyable | ||
// and kernel argument requirements, to the sycl_kernel_launch() function. | ||
sycl_kernel_launch<kernel_name>("kernel_name", | ||
std::move(i), | ||
std::move(st1), std::move(st2), | ||
std::move(asta1_1), std::move(asta1_2), | ||
std::move(asta2_1), std::move(asta2_2)); | ||
|
||
// Destructor runs for 'k'. | ||
} | ||
|
||
For device compilation, an offload kernel entry point function is synthesized that looks | ||
approximately as follows. | ||
Aggregate initialization is used to illustrate the intent with the acknowledgement | ||
that, for example, lambda closure types cannot be constructed with aggregate | ||
initialization in standard C++. | ||
Again, hopefully the intent is clear. | ||
|
||
.. code-block:: C++ | ||
|
||
void offload_kernel_entry_point<kernel_name>( | ||
int i, int st1, X st2, int asta1_1, X asta1_2, int asta2_1, X asta2_2) | ||
{ | ||
// KT is decltype(k) in f(). | ||
KT k = { | ||
i, | ||
special_type::sycl_reconstruct(std::move(st1), std::move(st2)), | ||
{ // 'a' | ||
{ // 'sta' | ||
special_type::sycl_reconstruct(std::move(asta1_1), std::move(asta1_2)), | ||
special_type::sycl_reconstruct(std::move(asta2_1), std::move(asta2_2)) | ||
} | ||
} | ||
}; | ||
k(); | ||
// Destructors runs for 'k', 'k.st', 'k.a.sta[0]', and 'k.a.sta[1]'. | ||
// Destructors run for 'st2', 'asta1_2', and 'asta2_2'. | ||
|
||
} | ||
|
||
The ``sycl_kernel_launch()`` function is then responsible for enqueuing the kernel | ||
invocation and arranging for each of its function arguments to be bit-copied to the | ||
device (or, for special types used to implement decomposition for types that require | ||
special handling, like ``local_accessor``, copied to the device in an appropriate way. | ||
Such special requirements are handled by the SYCL RT implementation). | ||
|
||
This design is not intended to address all possible transformations of kernel | ||
arguments that might be desired for performance optimization purposes. | ||
For example, a transformation to coalesce local accessors in order to perform | ||
a single allocation request rather than one for each local accessor might be desirable. | ||
This design leaves such transformations to the SYCL RT to implement via its own means. |
Uh oh!
There was an error while loading. Please reload this page.