Skip to content
Draft
Changes from 1 commit
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
289 changes: 288 additions & 1 deletion clang/docs/SYCLSupport.rst
Original file line number Diff line number Diff line change
Expand Up @@ -5,6 +5,7 @@ SYCL Compiler and Runtime architecture design
.. contents::
:local:


Introduction
============

Expand All @@ -13,6 +14,7 @@ library. More details are provided in
`external document <https://github.com/intel/llvm/blob/sycl/sycl/doc/design/CompilerAndRuntimeDesign.md>`_\ ,
which are going to be added to clang documentation in the future.


Address space handling
======================

Expand Down Expand Up @@ -112,8 +114,293 @@ space attributes for pointers:
* - ``__attribute__((opencl_private))``
- private_space


.. code-block:: C++

//TODO: add support for __attribute__((opencl_global_host)) and __attribute__((opencl_global_device)).


Kernel argument validation and decomposition
============================================

SYCL 2020 specifies requirements on the types of arguments that can be passed to
a SYCL kernel.
These requirements are enforced on the arguments passed to functions declared
with the
`[[clang::sycl_kernel_entry_point]] <https://clang.llvm.org/docs/AttributeReference.html#sycl-kernel-entry-point>`__
attribute.

Valid kernel argument types can be broadly categorized in three groups per
`section 3.13.1, "Device copyable" <https://registry.khronos.org/SYCL/specs/sycl-2020/html/sycl-2020.html#sec::device.copyable>`__
and
`section 4.12.4, "Rules for parameter passing to kernels" <https://registry.khronos.org/SYCL/specs/sycl-2020/html/sycl-2020.html#sec:kernel.parameter.passing>`__.

* Types that are implicitly device copyable because they satisfy the C++ definition
of trivially copyable
(`[basic.types.general]p9 <https://eel.is/c++draft/basic.types.general#9>`__,
`[class.prop]p1 <https://eel.is/c++draft/class.prop#1>`__.
Such types may be bit-copied to the device.
* Types that are explicitly device copyable because, for a type ``T``,
``sycl::is_device_copyable_v<T>`` is true.
Such types may be bit-copied to the device.
* Types that are device copyable by fiat (``sycl::accessor``, ``sycl::local_accessor``,
``sycl::stream``, ``sycl::reducer``, etc...).
Such types may require special handling.

Support for the third category of types is provided through a *decomposition protocol*
that such types opt in to as described below.
The decomposition protocol facilitates transformation of an object of such a type
into a sequence of objects, each of which has a type that satisfies one of the
first two type categories.

A type that opts into the decomposition protocol may be a type that also satisfies
the C++ definition of a trivially copyable type and may therefore appear as a
subobject
(`[intro.object]p2 <https://eel.is/c++draft/intro.object#2>`__)
type of another type that also satisfies the trivially copyable type requirements.

Types that opt in to the decomposition protocol, or directly or indirectly have a
subobject type that opts in to the decomposition protocol, *require decomposition*.
A kernel argument of such a type is transformed to a sequence of arguments that
are substituted for the original argument.

Given a call to a function declared with the ``[[clang::sycl_kernel_entry_point]]``
attribute, each argument ``A`` of parameter type ``P`` is processed as follows.
The resulting sequence of replacement arguments constitutes the arguments to the
``sycl_kernel_launch()`` function and their types constitute the corresponding
parameters of the synthesized offload kernel entry point function (the SYCL kernel
caller function).

#. If ``P`` is a non-union class type with a ``sycl_deconstruct()`` member function,
then the type is one that has opted in to the decomposition protocol and shall
meet the requirements below.

#. The ``sycl_deconstruct()`` member function shall be a non-static member
function, shall declare no parameters, shall return a *tuple-like* type that
satisfies the requirements for an initializer of a structured binding declaration
(`[dcl.struct.bind] <https://eel.is/c++draft/dcl.struct.bind>`__),
shall not be declared with the ``[[noreturn]]`` attribute, may have a
potentially-throwing exception specification
(`[except.spec]p1 <https://eel.is/c++draft/except.spec#1>`__),
and may have a function body that throws exceptions.

#. The class shall declare a ``sycl_reconstruct()`` member function.
That function shall be a static member function, shall have a return type of
cv-unqualified ``P``, shall not be declared with the ``[[noreturn]]``
attribute, may have a non-throwing exception specification, and shall declare
parameters corresponding to the elements of the tuple-like type returned by
``sycl_deconstruct()``.
For each element of that tuple-like type, there shall be a corresponding in-order
parameter with a type that is convertible from the tuple element type.
Additional parameters with default arguments may be present.
The body of the function shall abide by the device language restrictions
specified in
`section 5.4, "Language restrictions for device functions" <https://registry.khronos.org/SYCL/specs/sycl-2020/html/sycl-2020.html#sec:language.restrictions.kernels>`__.

The original argument ``A`` is replaced with the sequence of elements returned
in the tuple-like type for a call to ``sycl_deconstruct()`` on ``A``.
Each element with a type that requires decomposition is recursively processed
and replaced by its sequence of decomposed objects.

#. Otherwise, if ``P`` is a non-union aggregate type
(`[dcl.init.aggr]p1 <https://eel.is/c++draft/dcl.init.aggr#1>`__)
or a lambda closure type
(`[expr.prim.lambda.closure]p1 <https://eel.is/c++draft/expr.prim.lambda.closure#1>`__),
its subobject types determine whether ``P`` requires decomposition.
If any subobject type requires decomposition, then ``P`` requires decomposition.
Each subobject of ``A`` that has a type that requires decomposition is recursively
processed and their replacement objects are sequenced after ``A``.
If all subobject types require decomposition, then ``A`` is removed from the kernel
argument list.
Otherwise, the remaining subobjects of ``A`` that do not require decomposition may
be passed as individual kernel arguments in place of ``A`` (the choice to pass ``A``
with the storage for its decomposed subobjects bit-copied and reinitialized in the
offload entry point function or to pass each remaining subobject as a distinct
argument in place of ``A`` is unspecified; the intent is to allow the most efficient
choice to be made based on the cost of passing ``A`` vs the cost of passing
additional arguments).

#. Otherwise, if ``P`` is a trivially copyable non-class type, then ``A`` is passed as
a bit-copyable argument.

#. Otherwise, if the `SYCL_DEVICE_COPYABLE` macro is predefined with a value of ``1``,
``P`` is a class type, and ``sycl::is_device_copyable_v<P>`` is true, ``P`` shall
satisfy the constraints listed in
`section 3.13.1, "Device copyable" <https://registry.khronos.org/SYCL/specs/sycl-2020/html/sycl-2020.html#sec::device.copyable>`__.
``A`` is passed as a bit-copyable argument.
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What about a case like this:

struct dc {
  dc(const dc& other) {/*... not trivial ...*/}
};

template<>
struct sycl::is_device_copyable<dc> : std::true_type {};

struct udt {
  float f;
  dc d;
};

Do you expect this rule to apply to struct udt?

Copy link
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The rule that governs udt is the earlier non-union aggregate case. In this case, udt has no subobjects that require decomposition and is therefore bit-copyable. dc is presumed to satisfy the device copyable rules because sycl::is_device_copyable_v<dc> is true (presumed because the copy constructor cannot be confirmed to perform a bitwise copy).

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If this is your intent, then I think rule 2 needs to be clarified somehow. There is no mention in that rule about sub-object types that are marked is_device_copyable. For that matter, there is no mention in that rule about sub-object types that are (or are not) trivially copyable.

Copy link
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Those cases fall into the "Otherwise ..." portion of the rule. I'll try to add some clarification.

Copy link
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I tried to clarify this. Please review.

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I still think this is confusing. Rule 4 is very detailed about what constitutes "device copyability" (e.g. it talks about the SYCL_DEVICE_COPYABLE macro, etc.) However, rule 2 just says "All subobjects that do not require decomposition shall be device copyable", with no further explanation about what "device copyable" means. This gives the impression that the macro does not apply to rule 2, while it does apply to rule 4.

I actually think it would be clearer to omit the details about when an object is "device copyable" -- these details are in the SYCL specification. Thus, I think you could simplify rule 4.

In fact, I think rules 3, 4, and 6 could all be combined into a single rule that just says:

Otherwise, if P is a device copyable type, then A is passed as a bit-copyable argument.

Copy link
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When writing these, I was thinking of things from an implementer perspective and, in particular, what type checking needs to be done in which order. That is why I separated the checks for trivial copyability and the sycl::is_device_copyable trait. From a specification perspective, what you suggest makes sense.

However, given recent discussion, I think we'll be pursuing a different approach. I have a proto-POC that provides the capabilities that Andrei is requesting. If discussions with him go well, I'll submit a new PR with documentation for that approach.


#. Otherwise, if ``P`` is a union type, then each of its non-static data members shall
have a trivially copyable type, none of which requires decomposition.
``A`` is passed as a bit-copyable argument.

#. Otherwise, if ``P`` is a trivially copyable class type, none of its non-static data
members shall require decomposition and its anonymous union members and non-static
data members of union type shall satisfy the constraints on a union type described
above. ``A`` is passed as a bit-copyable argument.

#. Otherwise, ``P`` is not a valid kernel argument type and the program is ill-formed.

If the sequence of replacement arguments contains an argument with a type that is,
or has as a subobject type, a reference type, a type that is prohibited in device
code (e.g., ``long double``), or a pointer to data member type (see
`SYCL WG issue 612 <https://github.com/KhronosGroup/SYCL-Docs/issues/612>`__),
then the program is ill-formed.

The constraints on union types are derived from explicit restrictions specified in
`section 5.4, "Language restrictions for device functions" <https://registry.khronos.org/SYCL/specs/sycl-2020/html/sycl-2020.html#sec:language.restrictions.kernels>`__
and practical restrictions that are not clearly addressed in SYCL 2020 (see
`CMPLRLLVM-61883 <https://jira.devtools.intel.com/browse/CMPLRLLVM-61883>`__
for some previous discussion).

The sequence of possibly decomposed arguments is passed to the
``sycl_kernel_launch()`` function.
Each argument is passed as an xvalue and may be move-constructed from.
Destructors are invoked as usual for the argument/parameter pairs of the
``[[clang::sycl_kernel_entry_point]]`` attributed function and for the
``sycl_kernel_launch()`` function.

The body of a function declared with the ``[[clang::sycl_kernel_entry_point]]``
attribute is incorporated in the body of the synthesized offload kernel entry point
function.
In order to execute the incorporated statements, objects matching the original
sequence of arguments must be available.
For arguments that were not subject to decomposition, the matching parameter is
used.
For arguments that were subject to decomposition, the original argument is
reconstructed as a local variable from the parameters that correspond to the
decomposed sequence of arguments.
Within the variable initialization, parameters are referenced as xvalues and
may be move-constructed from.
Destructors are invoked as usual for the parameters of the synthesized offload
kernel entry point and the local variables used to reconstruct the original
kernel arguments.

Consider the following example.

.. code-block:: C++

#include <tuple>
#include <sycl/sycl.hpp>

template<typename KN, typename... Ts>
void sycl_kernel_launch(const char* kn, Ts.. ts) { ... }

template<typename KN, typename KT>
[[clang::sycl_kernel_entry_point(KN)]]
void kernel_entry_point(KT k) {
k();
}

struct X {
~X();
int dm;
};
template<> bool sycl::is_device_copyable_v<X> = true;

struct special_type {
~special_type();
std::tuple<int, X> sycl_deconstruct();
static special_type sycl_reconstruct(int, X) noexcept;
};

struct aggregate {
special_type sta[2];
};

struct kernel_name;

void f() {
int i;
special_type st;
aggregate a;
auto k = [i, st, a] {};
kernel_entry_point<kernel_name>(k);
}

For host compilation, the body of ``kernel_entry_point<kernel_name>()`` is
replaced with synthesized code that performs kernel argument decomposition
and forwards the results to ``sycl_kernel_launch()``.
The synthesized code looks approximately as follows.
Structured bindings are used to illustrate the intent.
Note that some of this code does not conform to standard C++ (a lambda
closure type can not necessarily be decomposed as shown), but it hopefully
suffices to convey the intent.

.. code-block:: C++

// KT is decltype(k) in f().
void kernel_entry_point<kernel_name>(KT k) {
// 'k' is a lambda closure type with captures that require decomposition.
auto& [ i, st, a ] = k;
// All of the captures of 'k' have been decomposed; it is eliminated as a kernel argument.
// 'i' is a trivially copyable non-class type; no further decomposition required.
// 'st' is of a type that opts into the decomposition protocol.
auto&& [ st1, st2 ] = st.sycl_deconstruct();
// 'st1' is a trivially copyable non-class type; no further decomposition required.
// 'st2' is an explicitly device copyable type; no further decomposition required.
// 'a' is an aggregate with a member that requires decomposition.
auto& [ asta ] = a;
// All of the data members of 'a' have been decomposed; it is eliminated as a kernel argument.
// 'asta' is an aggregate with elements that require decomposition.
auto& [ asta1, asta2 ] = asta;
// All of the elements of 'asta' have been decomposed; it is eliminated as a kernel argument.
// 'asta1' is of a type that opts into the decomposition protocol.
auto&& [ asta1_1, asta1_2 ] = asta1.sycl_deconstruct();
// All of the elements of 'asta1' have been decomposed; it is eliminated as a kernel argument.
// 'asta1_1' is a trivially copyable non-class type; no further decomposition required.
// 'asta1_2' is an explicitly device copyable type; no further decomposition required.
// 'asta2' is of a type that opts into the decomposition protocol.
auto&& [ asta2_1, asta2_2 ] = asta2.sycl_deconstruct();
// All of the elements of 'asta2' have been decomposed; it is eliminated as a kernel argument.
// 'asta2_1' is a trivially copyable non-class type; no further decomposition required.
// 'asta2_2' is an explicitly device copyable type; no further decomposition required.

// Pass the decomposed arguments, all of which satisfy the SYCL 2020 device copyable
// and kernel argument requirements, to the sycl_kernel_launch() function.
sycl_kernel_launch<kernel_name>("kernel_name",
std::move(i),
std::move(st1), std::move(st2),
std::move(asta1_1), std::move(asta1_2),
std::move(asta2_1), std::move(asta2_2));

// Destructor runs for 'k'.
}

For device compilation, an offload kernel entry point function is synthesized that looks
approximately as follows.
Aggregate initialization is used to illustrate the intent with the acknowledgement
that, for example, lambda closure types cannot be constructed with aggregate
initialization in standard C++.
Again, hopefully the intent is clear.

.. code-block:: C++

void offload_kernel_entry_point<kernel_name>(
int i, int st1, X st2, int asta1_1, X asta1_2, int asta2_1, X asta2_2)
{
// KT is decltype(k) in f().
KT k = {
i,
special_type::sycl_reconstruct(std::move(st1), std::move(st2)),
{ // 'a'
{ // 'sta'
special_type::sycl_reconstruct(std::move(asta1_1), std::move(asta1_2)),
special_type::sycl_reconstruct(std::move(asta2_1), std::move(asta2_2))
}
}
};
k();
// Destructors runs for 'k', 'k.st', 'k.a.sta[0]', and 'k.a.sta[1]'.
// Destructors run for 'st2', 'asta1_2', and 'asta2_2'.
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If one of the argument types was explicitly marked "device copyable", would you still generate a call to its destructor here? I think you said in chat that you would, with the rationale that the SYCL spec says that it is the user's responsibility to ensure that the destructor has no effect when run on the device (section 3.13.1 "Device copyable"). I agree that this is what the spec says.

@rolandschulz do you think this is what people expect when they mark a type as device copyable? Or, do they expect that the compiler will not generate a call to the destructor in device code?

Copy link
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes. I guess it would actually be up to the calling convention. If the convention is callee-destroy, then destructors for parameters will be called when the function returns. Since there is no traditional caller for the invocation of the entry point, then destructors would not be called for a caller-destroy convention. Perhaps I should change the comment to "Destructors may run" for parameters.

Eliding calls to destructors for objects that are not decomposed and passed as their own argument might be possible, but would be surprising in my opinion. In this example, k is illustrated as a local variable and I would expect implementation to do likewise. Eliding destructor calls might be problematic; see #53 for an example of a crash that Mariya fixed by adding missing cleanup handling.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We have a note on that in the spec "Likewise, it is unspecified whether the implementation actually calls the destructor for such a class on the device since the destructor must have no effect on the device.". If I recall the discussions correctly we did that to allow either implementation. My understanding is that #53 shouldn't be needed because it should only fix a problem for wrong code.

Copy link
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I thought I had mentioned that note somewhere, but I don't see it now. It probably is worth explicitly mentioning somewhere.

I don't think there is anything wrong with the test case added for that issue (aside from a missing specialization of sycl::is_device_copyable<DCopyable>). The relevant type is declared as:

struct DCopyable {
  int i;
  ~DCopyable();
};

The requirements in section 3.13.1 don't require a trivial destructor. The requirement is that the destructor is public, not defined as deleted, and "has no effect when executed on the device".

Copy link
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I added some more specificity around destructors and added a reference to 3.13.1 regarding it being unspecified if they are called.

}

The ``sycl_kernel_launch()`` function is then responsible for enqueuing the kernel
invocation and arranging for each of its function arguments to be bit-copied to the
device (or, for special types used to implement decomposition for types that require
special handling, like ``local_accessor``, copied to the device in an appropriate way.
Such special requirements are handled by the SYCL RT implementation).

This design is not intended to address all possible transformations of kernel
arguments that might be desired for performance optimization purposes.
For example, a transformation to coalesce local accessors in order to perform
a single allocation request rather than one for each local accessor might be desirable.
This design leaves such transformations to the SYCL RT to implement via its own means.