diff --git a/llvm/docs/SYCLFrontEndDeviceCompilation.rst b/llvm/docs/SYCLFrontEndDeviceCompilation.rst
new file mode 100644
index 0000000000000..4471e2e6c0612
--- /dev/null
+++ b/llvm/docs/SYCLFrontEndDeviceCompilation.rst
@@ -0,0 +1,135 @@
+=================================
+SYCL Front-end device compilation
+=================================
+
+.. contents::
+   :local:
+
+.. _sycl_frontend_device_compilation:
+
+The SYCL front-end device compiler parses the input source, outlines the device
+part of the code, applies additional restrictions to the device code (e.g. no
+exceptions or virtual calls), generates LLVM IR bitcode for the device code
+only, and produces an "integration header" which provides information such as
+the kernel name, parameter order, and data types for the runtime library.
+Multiple LLVM IR bitcode files (in the case of multiple targets) are packaged
+into a single object by the clang-offload-packager.
+
+An example front-end device compilation command is shown below:
+
+ .. code-block:: console
+
+   $ clang -cc1 -triple spirv64 -fsycl-is-device test.cpp -o test.bc
+
+Front-end device compilation for SYCL offloading can be split into the
+following components: device code outlining, SYCL kernel function object
+lowering, generation of device code diagnostics, and integration header
+generation. These components are explained in the sections below.
+
+*********************
+Device code outlining
+*********************
+ This component is responsible for identifying and outlining "device code" in
+ the single source.
+ Here is a code example of a SYCL program that demonstrates the compiler's
+ outlining work:
+
+ .. code-block:: c++
+    :linenos:
+
+    int foo(int x) { return ++x; }
+    int bar(int x) { throw std::runtime_error{"CPU code only!"}; }
+    // ...
+    using namespace sycl;
+    queue Q;
+    buffer<int> a{range<1>{1024}};
+    Q.submit([&](handler& cgh) {
+      auto A = a.get_access(cgh);
+      cgh.parallel_for(range<1>{1024}, [=](id<1> index) {
+        A[index] = index[0] * 2 + foo(42);
+      });
+    });
+
+ In this example, the compiler needs to compile the lambda expression passed
+ to the `sycl::handler::parallel_for` method, as well as the function `foo`
+ called from the lambda expression, for the device.
+
+ The compiler must also ignore the `bar` function when we compile the
+ "device" part of the single source code, as it is unused inside the device
+ portion of the source code (the contents of the lambda expression passed to
+ the `sycl::handler::parallel_for` and any function called from this lambda
+ expression).
+
+ The current approach is to use the SYCL kernel attribute in the runtime to
+ mark code passed to `sycl::handler::parallel_for` as "kernel functions".
+ The runtime library cannot mark `foo` as "device" code; this is the
+ compiler's job: to traverse all symbols accessible from kernel functions and
+ add them to the "device part" of the code, marking them with the new SYCL
+ device attribute.
+
+********************************************************
+SYCL kernel function object (functor or lambda) lowering
+********************************************************
+ This component creates an OpenCL kernel function interface for SYCL kernels.
+ All SYCL memory objects shared between host and device (buffers/images;
+ these objects map to OpenCL buffers and images) must be accessed through
+ special `accessor` classes. The "device" side implementation of these
+ classes contains pointers to the device memory. As there is no way in OpenCL
+ to pass structures with pointers inside as kernel arguments, all memory
+ objects shared between host and device must be passed to the kernel as raw
+ pointers.
+
+ SYCL also has a special mechanism for passing kernel arguments from host to
+ the device. 
In OpenCL, kernel arguments are set by calling the `clSetKernelArg`
+ function for each kernel argument, whereas in SYCL all the kernel arguments
+ are fields of the "SYCL kernel function", which can be defined as a lambda
+ function or a named function object and passed as an argument to a SYCL
+ function for invoking kernels (such as `parallel_for` or `single_task`). For
+ example, in the previous code snippet above, the `accessor` `A` is one such
+ captured kernel argument.
+
+ To facilitate the mapping of SYCL kernel data members to OpenCL kernel
+ arguments and overcome OpenCL limitations, we added the generation of an
+ OpenCL kernel function inside the compiler. An OpenCL kernel function
+ contains the body of the SYCL kernel function, receives OpenCL-like
+ parameters and additionally does some manipulation to initialize SYCL kernel
+ data members with these parameters. In pseudo-code, the OpenCL kernel
+ function for the previous code snippet looks like this:
+
+ .. code-block:: c++
+    :linenos:
+
+    // SYCL kernel is defined in SYCL headers:
+    template <typename KernelName, typename KernelType>
+    __attribute__((sycl_kernel)) void sycl_kernel_function(KernelType KernelFuncObj) {
+      // ...
+      KernelFuncObj();
+    }
+
+    // Generated OpenCL kernel function
+    __kernel void KernelName(global int* a) {
+      KernelType KernelFuncObj; // Actually the kernel function object
+                                // declaration doesn't have a name in the AST.
+      // Let the kernel function object have one captured field - accessor A.
+      // We need to init it with the global pointer from the arguments:
+      KernelFuncObj.A.__init(a);
+      // Body of the SYCL kernel from SYCL headers:
+      {
+        KernelFuncObj();
+      }
+    }
+
+ The OpenCL kernel function is generated by the compiler inside Sema using
+ AST nodes.
+
+*************************************
+Generation of device code diagnostics
+*************************************
+ This component enforces language restrictions on device code.
+
+*****************************
+Integration header generation
+*****************************
+ This component emits information required for binding the host and device
+ parts of the SYCL code via the OpenCL API. In the proposed design, we use
+ the SYCL device front-end compiler to produce the integration header for two
+ reasons. First, it must be possible to use any host compiler to produce SYCL
+ heterogeneous applications. Second, even if the same clang compiler is used
+ for the host compilation, the information provided in the integration header
+ is used (included) by the SYCL runtime implementation, so the header must be
+ available before the host compilation starts.
\ No newline at end of file
diff --git a/llvm/docs/SYCLOffloadingDesign.rst b/llvm/docs/SYCLOffloadingDesign.rst
new file mode 100644
index 0000000000000..7c6e83f52d4de
--- /dev/null
+++ b/llvm/docs/SYCLOffloadingDesign.rst
@@ -0,0 +1,677 @@
+==================================
+SYCL Offloading Design & Internals
+==================================
+
+.. contents::
+   :local:
+
+************
+Introduction
+************
+
+SYCL is a high-level, single-source, standard C++ programming model that can
+target a wide range of heterogeneous platforms. SYCL is an industry-driven
+standard developed by the Khronos Group and announced in March 2014 that
+allows data parallelism for heterogeneous systems based on C++17. It is a
+royalty-free, cross-platform abstraction layer that enables the development of
+applications and frameworks with the portability and efficiency of OpenCL,
+while providing a user interface that is flexible and easy to use. The SYCL
+standard is designed to be both highly portable and highly efficient and
+allows applications to offload compute kernels and data transfer to
+OpenCL-enabled accelerators such as GPUs, FPGAs, and other processors.
+
+A SYCL implementation refers to a software version or revision of the SYCL
+programming model. 
It consists of the necessary tools, libraries, and runtime
+support that enable developers to write SYCL code and execute it on
+compatible hardware platforms. Several SYCL implementations are available.
+This document focuses on DPC++ by Intel, which is part of the oneAPI solution
+that supports Intel CPUs, GPUs, and FPGAs as well as NVIDIA and AMD GPUs. For
+more details, please look at `sycl-overview `_.
+
+ .. code-block:: c++
+    :linenos:
+
+    #include <iostream>
+    #include <sycl/sycl.hpp>
+    using namespace sycl;
+    int main() {
+      queue Q;                   // The queue, Q, is the object that
+                                 // submits the task to a device.
+      int const size = 10;
+      buffer<int> A{ size };     // The buffer, A, is the memory used to
+                                 // transfer data between host and device.
+      Q.submit([&](handler& h) { // The handler, h, is the object that contains
+                                 // the single_task function to be used.
+        accessor A_acc(A, h);    // The accessor, A_acc, is the object that
+                                 // efficiently accesses the buffer elements.
+        h.single_task([=]() {
+          A_acc[5] = 77;
+        });
+      });
+      host_accessor result(A);   // host_accessor is the object that allows
+                                 // the host to access the buffer memory.
+      for (int i = 0; i < size; i++) // Print output
+        std::cout << result[i] << " ";
+      std::cout << "\n";
+      return 0;
+    }
+
+Here, we present code using a simple SYCL construct called `single_task`,
+whose closest equivalent in standard C++ is `std::thread`, to give a
+fundamental understanding of SYCL components. The `single_task` construct is
+used to define a unit of work that should be executed on a single processing
+element, typically a single CPU core or GPU thread. A SYCL kernel function is
+a lambda function that is passed as an argument to the `single_task`
+construct.
+
+This document describes the clang driver and code generation steps for
+creating offloading applications. Clang supports offloading to various
+architectures using programming models like CUDA, HIP, and OpenMP. Currently,
+Clang support for SYCL offloading is being added.
+
+Clang will support SYCL target offloading to several different architectures
+such as Intel CPUs and GPUs, NVPTX, and AMDGPU. Offloading code is generated
+by clang and then executed using the SYCL runtime. This section describes the
+steps necessary to create a functioning device image that can be loaded by
+the SYCL runtime.
+
+SYCL offloading can support two modes of compilation: (i) Just-In-Time (JIT)
+compilation and (ii) Ahead-Of-Time (AOT) compilation. In JIT compilation, the
+final wrapped device object is represented using the SPIR-V Intermediate
+Representation, and final device code generation is deferred until
+application runtime. JIT compilation provides portability of device code and
+allows target-specific optimizations to happen. In AOT compilation, the final
+device code is generated at compile time (before the wrapping step) to
+produce the final binary. AOT compilation reduces application execution time
+by skipping the device code generation step during runtime, and the final
+device code can also be tested before deploying.
+
+The remainder of this document illustrates the steps necessary to compile a
+SYCL offloading application using clang.
+
+*******************************************
+High-level SYCL Offloading Compilation Flow
+*******************************************
+
+.. figure:: figures/toplevel.png
+   :scale: 50%
+   :align: center
+
+   Figure 1: Top-level SYCL offloading compilation flow.
+
+Figure 1 showcases the SYCL offloading compilation flow. The SYCL compiler
+performs the following high-level actions to generate SYCL offloading code:
+
+ - Front-end device compilation step (`clang --target=`)
+ - Packaging of device code (`clang-offload-packager`)
+ - Front-end host compilation step (`clang --target= --fembed-offload-object`)
+ - Linking stage (`clang-linker-wrapper`)
+
+The set of possible 'clang -cc1' invocations for a given user command is
+shown below.
+
+.. 
code-block:: console
+   :linenos:
+
+   $ clang++ -fsycl --offload-arch=pvc,sm_52 test.cpp -###
+   clang -cc1 -triple nvptx64 ...    // AOT device compilation for sm_52
+   clang -cc1 -triple spir64_gen ... // AOT device compilation for pvc
+   clang -cc1 -triple spirv64 ...    // JIT device compilation (backup) for pvc
+   clang -cc1 -triple x86_64 ...     // Host compilation
+
+****************************
+Front-end device compilation
+****************************
+This step parses the input source, outlines the device part of the code,
+applies additional restrictions to the device code (e.g. no exceptions or
+virtual calls), generates LLVM IR bitcode for the device code only, and
+produces an "integration header" which provides information such as the
+kernel name, parameter order, and data types for the runtime library.
+Multiple LLVM IR bitcode files (in the case of multiple targets) are packaged
+into a single object by the clang-offload-packager.
+
+An example front-end device compilation command is shown below:
+
+ .. code-block:: console
+
+   $ clang -cc1 -triple spirv64 -fsycl-is-device test.cpp -o test.bc
+
+Front-end device compilation for SYCL offloading can be split into the
+following components: device code outlining, SYCL kernel function object
+lowering, generation of device code diagnostics, and integration header
+generation. These components are explained
+:ref:`in this document <sycl_frontend_device_compilation>`.
+
+*********************
+Device code packaging
+*********************
+When dealing with multiple device binaries, an additional step is performed
+to package the multiple device binaries before they are added to the host
+object. This additional step is performed with the clang-offload-packager,
+which takes image inputs containing information about the target triple, the
+architecture setting, and the offloading kind.
+
+The clang-offload-packager is run during 'fat object' generation regardless
+of the number of device binaries being added to the conglomerate fat object. 
The
+device binaries are contained in what is designated as an 'Offload Binary'.
+These binaries can reside in a variety of binary formats, including bitcode
+files, ELF objects, executables and shared objects, COFF objects, archives,
+or simply stored as an offload binary.
+
+Example usage of clang-offload-packager:
+
+ .. code-block:: console
+
+   $ clang-offload-packager --image=file=,triple=,kind=
+   $ clang-offload-packager --image=file=test.bc,triple=spirv64,kind=sycl -o device.out
+
+**************************
+Front-end host compilation
+**************************
+This step parses the input file and emits code for the host. This host
+compilation takes an additional argument which helps to embed the packaged
+device binary into the host code. This step generates a fat object that will
+be consumed during the link stage.
+
+An example front-end host compilation command is shown below:
+
+ .. code-block:: console
+
+   $ clang -cc1 -triple x86_64 -fsycl-is-host -fembed-offload-object=device.out test.cpp -o test.o
+
+*************
+Linking stage
+*************
+In this step, fat object file(s) are passed to the linker wrapper tool. The
+tool extracts the device objects and runs the device linking action on the
+extracted objects. Fully formed device images are then wrapped into host
+objects and then linked with the host image in the original fat object using
+the host linker tool.
+
+.. figure:: figures/linker_wrapper.png
+   :scale: 50%
+   :align: center
+
+   Figure 2: Device linking flow for SYCL offloading inside the linker-wrapper tool.
+
+Figure 2 shows the compilation flow inside the linker-wrapper tool. First,
+all device objects are extracted from the fat objects and grouped according
+to the target device. For each target device group, we invoke clang to link
+the device objects and generate a list of device images. All the device
+images are then wrapped together and embedded into a wrapped host object. 
This wrapped object is
+linked normally with the rest of the host objects and/or libraries using the
+usual linker - e.g. `ld` on Linux and `link.exe` on Windows.
+
+The remainder of this chapter provides a detailed explanation of the
+different steps of the SYCL compilation flow inside the linker-wrapper tool.
+
+Linking of device objects
+=========================
+ Device code linking for SYCL offloading will be performed by making a call
+ to the clang driver. A SYCL-specific option '--sycl-link' is passed to the
+ driver to control the linking process. In the presence of this option, the
+ clang driver will invoke a SYCL linker tool called 'clang-sycl-linker'. This
+ tool is responsible for linking the device bitcode and generating a group of
+ fully-formed device images. An example of a call to clang-sycl-linker is
+ shown below.
+
+ .. code-block:: console
+
+   $ clang++ --target --sycl-link test.bc test.o -Xlinker -###
+   clang-sycl-linker -triple test.bc test.o
+
+ .. figure:: figures/sycl_linker.png
+    :scale: 50%
+    :align: center
+
+    Figure 3: SYCL-specific device code linking for JIT compilation flow.
+
+ Figure 3 shows the device code linking process performed inside the SYCL
+ linker tool for JIT compilation. The following are the key steps performed:
+
+ 1. All the inputs of LLVM IR bitcode type are gathered and linked together
+ using the linkInModule API calls.
+
+ 2. If the user provides device library files, the output of the previous
+ step is linked with such files to get the final fully linked LLVM IR device
+ bitcode image.
+
+ 3. The fully linked LLVM IR device bitcode image undergoes several post-link
+ steps. These include device code splitting, specialization constant
+ lowering, symbol table generation, and property set generation.
+
+ 4. Each of the split device images in the LLVM IR format is translated to
+ SPIR-V IR using an external tool, 'llvm-spirv'.
+
+ The remainder of this section discusses the various post-link steps and the
+ translation to SPIR-V IR.
+
+Device code splitting
+---------------------
+ This process takes as input a fully linked device module with a set of SYCL
+ device kernels for a specific target and performs splitting to generate
+ several fully-contained device modules. Each of the newly formed modules
+ contains a subset of the original set of SYCL device kernels along with a
+ union of all the functions from each of their respective call graphs. Here,
+ the call graph of a SYCL kernel is the set of all functions reachable from
+ that kernel.
+ Device code splitting is performed for two reasons:
+
+ (1) The user requests a specific mode of code splitting. Three modes of code
+ splitting are supported. They are:
+
+ i. per-kernel - Every SYCL kernel and the functions that are contained in
+ its call graph go into a separate module.
+
+ ii. per-module - All SYCL kernels belonging to a specific user-specified
+ module and all functions from their respective call graphs go into a
+ separate module.
+
+ iii. auto - The per-kernel or per-module option is selected based on some
+ heuristics.
+
+ (2) SYCL kernels can be specialized for different devices. For example,
+ kernels that use 64-bit floating point (FP64) operations can be executed
+ only on devices that support such operations. Hence, it is required that all
+ kernels that require FP64 support be grouped separately from kernels that do
+ not require FP64 support. A point to note: a SYCL kernel is said to require
+ FP64 support if the kernel itself or any function in its call graph uses
+ FP64 operations. Figure 4 showcases this device code split.
+
+ .. figure:: figures/code_split.png
+    :scale: 50%
+    :align: center
+
+    Figure 4: An example of device code splitting for SYCL offloading.
+
+Specialization constant lowering
+--------------------------------
+ Specialization constants are implemented in accordance with how they are
+ defined by the SYCL 2020 specification. Here are links to `sycl-registry `_
+ and `sycl-2020-spec `_.
+
+ Specialization constants represent constants whose values can be set
+ dynamically during execution of the SYCL application. The values of these
+ constants are fixed when a SYCL kernel function is invoked, and they do not
+ change during the execution of the kernel. However, the application is able
+ to set a new value for a specialization constant each time a kernel is
+ invoked, so the values can be tuned differently for each invocation.
+ An example usage of specialization constants is shown below:
+
+ .. code-block:: c++
+    :linenos:
+
+    #include <sycl/sycl.hpp>
+    using namespace sycl;
+    using coeff_t = std::array<std::array<float, 3>, 3>;
+    // Read coefficients from somewhere.
+    coeff_t get_coefficients();
+    // Identify the specialization constant.
+    constexpr specialization_id<coeff_t> coeff_id;
+    void do_conv(buffer<float, 2> in, buffer<float, 2> out) {
+      queue myQueue;
+      myQueue.submit([&](handler &cgh) {
+        accessor in_acc { in, cgh, read_only };
+        accessor out_acc { out, cgh, write_only };
+        // Set the coefficients of the convolution as constant.
+        // This will build a specific kernel with the coefficients available
+        // as literals.
+        cgh.set_specialization_constant<coeff_id>(get_coefficients());
+        cgh.parallel_for(
+            in.get_range(), [=](item<2> item_id, kernel_handler h) {
+              float acc = 0;
+              coeff_t coeff = h.get_specialization_constant<coeff_id>();
+              for (int i = -1; i <= 1; i++) {
+                if (item_id[0] + i < 0 || item_id[0] + i >= in_acc.get_range()[0])
+                  continue;
+                for (int j = -1; j <= 1; j++) {
+                  if (item_id[1] + j < 0 || item_id[1] + j >= in_acc.get_range()[1])
+                    continue;
+                  // The underlying JIT can see all the values of the array
+                  // returned by coeff.get().
+                  acc += coeff[i + 1][j + 1] *
+                         in_acc[item_id[0] + i][item_id[1] + j];
+                }
+              }
+              out_acc[item_id] = acc;
+            });
+      });
+      myQueue.wait();
+    }
+
+Handling of specialization constants requires the compiler to have access to
+a fully linked device image and hence happens during the post-link step.
+This handling happens inside a dedicated LLVM IR pass named
+`SpecConstantsPass` and the subsequent LLVM to SPIR-V translation step:
+
+1. Assigns numeric IDs to the specialization constants found in the linked
+module.
+2. Transforms the IR to a form expected by the LLVM to SPIR-V translation
+step.
+3. Translates the SPIR-V friendly IR to SPIR-V IR during the LLVM to SPIR-V
+translation step.
+4. Collects and provides a <symbolic ID> => <list of numeric IDs> mapping,
+which is later used by the SYCL runtime to set the specialization constant
+values provided by the user.
+
+The remainder of this section explains these steps in detail.
+
+Assignment of numeric IDs to specialization constants
+`````````````````````````````````````````````````````
+
+This task is achieved by maintaining a map which holds a list of numeric IDs
+for each encountered symbolic ID of a specialization constant. These IDs are
+used to identify the specialization constants at the SPIR-V level.
+
+As noted above, one symbolic ID can have several numeric IDs assigned to it;
+such a 1:N mapping comes from the fact that, at the SPIR-V level, composite
+specialization constants don't have dedicated IDs and are identified and
+specialized through their scalar leaves and the corresponding numeric IDs.
+
+For example, the following code:
+
+ .. code-block:: c++
+    :linenos:
+
+    struct Nested {
+      float a, b;
+    };
+    struct A {
+      int x;
+      Nested n;
+    };
+    constexpr specialization_id<int> id_int;
+    constexpr specialization_id<A> id_A;
+    // ...
+    [=](kernel_handler h) {
+      h.get_specialization_constant<id_int>();
+      h.get_specialization_constant<id_A>();
+    }
+
+Will result in the following numeric IDs assignment:
+
+ .. 
code-block:: c++
+    :linenos:
+
+    // since `id_int` is a simple arithmetic specialization constant, we only
+    // have a single numeric ID associated with its symbolic ID
+    unique_symbolic_id_for_id_int -> { 0 }
+    // `id_A` is a composite with three leaves (scalar members, including
+    // ones located in nested composite types), which results in three
+    // numeric IDs associated with the same symbolic ID
+    unique_symbolic_id_for_id_A -> { 1, 2, 3 }
+
+As shown in the example above, if a composite specialization constant
+contains another composite within it, that nested composite is also
+"flattened" and its leaves are considered to be leaves of the parent
+specialization constant. This is done by a depth-first search through the
+composite elements.
+
+Transformation of LLVM IR to SPIR-V friendly IR form
+````````````````````````````````````````````````````
+
+The SPIR-V friendly IR form is a special representation of LLVM IR, where
+some functions are named in a particular way in order to be recognizable by
+the SPIR-V translator, which converts them into corresponding SPIR-V
+instructions later.
+The format is documented here: `spirv-friendly-ir `_.
+
+For specialization constants, we need to generate the following constructs:
+
+ .. code-block:: c++
+    :linenos:
+
+    template <typename T> // T is an arithmetic type
+    T __spirv_SpecConstant(int numericID, T default_value);
+
+    template <typename T, typename... Elements> // T is a composite type,
+    // Elements are arithmetic or composite types
+    T __spirv_SpecConstantComposite(Elements... elements);
+
+Particularly, `SpecConstantsPass` translates calls to the
+
+ .. code-block:: c++
+
+    T __sycl_getScalar2020SpecConstantValue(const char *SymbolicID, const void *DefaultValue, const char *RTBuffer)
+
+intrinsic into calls to
+
+ .. code-block:: c++
+
+    T __spirv_SpecConstant(int ID, T default_val)
+
+And for
+
+ .. 
code-block:: c++
+
+    T __sycl_getComposite2020SpecConstantValue(const char *SymbolicID, const void *DefaultValue, const char *RTBuffer)
+
+it generates a number of
+
+ .. code-block:: c++
+
+    T __spirv_SpecConstant(int ID, T default_val)
+
+calls, one for each leaf of the composite type, plus a
+
+ .. code-block:: c++
+
+    T __spirv_SpecConstantComposite(Elements... elements)
+
+call for each composite type (including the outermost one).
+
+An example of the LLVM IR transformation can be found below. Input LLVM IR:
+
+ .. code-block:: c++
+    :linenos:
+
+    %struct.POD = type { [2 x %struct.A], <2 x i32> }
+    %struct.A = type { i32, float }
+
+    @gold_scalar_default = global %class.specialization_id { i32 42 }
+    @gold_default = global %class.specialization_id { %struct.POD { [2 x %struct.A] [%struct.A { i32 1, float 2.000000e+00 }, %struct.A { i32 2, float 3.000000e+00 }], <2 x i32> <i32 44, i32 44> } }
+
+    ; the second argument of the intrinsics below is simplified a bit;
+    ; in real-life LLVM IR it looks like:
+    ; i8* bitcast (%class.specialization_id* @gold_scalar_default to i8*)
+    %gold_scalar = call i32 __sycl_getScalar2020SpecConstantValue ("gold_scalar_identifier", @gold_scalar_default, i8* %buffer)
+    %gold = call %struct.POD __sycl_getComposite2020SpecConstantValue ("gold_identifier", @gold_default, i8* %default)
+
+LLVM IR generated by `SpecConstantsPass`:
+
+ .. 
code-block:: c++
+    :linenos:
+
+    %gold_scalar = call i32 __spirv_SpecConstant(i32 0, i32 42)
+
+    %gold_POD_A0_x = call i32 __spirv_SpecConstant(i32 1, i32 1)
+    %gold_POD_A0_y = call float __spirv_SpecConstant(i32 2, float 2.000000e+00)
+
+    %gold_POD_A0 = call %struct.A __spirv_SpecConstantComposite(i32 %gold_POD_A0_x, float %gold_POD_A0_y)
+
+    %gold_POD_A1_x = call i32 __spirv_SpecConstant(i32 3, i32 2)
+    %gold_POD_A1_y = call float __spirv_SpecConstant(i32 4, float 3.000000e+00)
+
+    %gold_POD_A1 = call %struct.A __spirv_SpecConstantComposite(i32 %gold_POD_A1_x, float %gold_POD_A1_y)
+
+    %gold_POD_A = call [2 x %struct.A] __spirv_SpecConstantComposite(%struct.A %gold_POD_A0, %struct.A %gold_POD_A1)
+
+    %gold_POD_b0 = call i32 __spirv_SpecConstant(i32 5, i32 44)
+    %gold_POD_b1 = call i32 __spirv_SpecConstant(i32 6, i32 44)
+    %gold_POD_b = call <2 x i32> __spirv_SpecConstantComposite(i32 %gold_POD_b0, i32 %gold_POD_b1)
+
+    %gold = call %struct.POD __spirv_SpecConstantComposite([2 x %struct.A] %gold_POD_A, <2 x i32> %gold_POD_b)
+
+This SPIR-V friendly LLVM IR will be handled appropriately by the LLVM to
+SPIR-V translation step.
+
+Translating SPIR-V friendly LLVM IR to SPIR-V
+`````````````````````````````````````````````
+Given the `__spirv_SpecConstant` intrinsic calls produced by the
+`SpecConstants` pass:
+
+ .. 
code-block:: c++
+    :linenos:
+
+    ; Function Attrs: alwaysinline
+    define dso_local spir_func i32 @get() local_unnamed_addr #0 {
+      ; args are "ID" and "default value":
+      %1 = tail call spir_func i32 @_Z20__spirv_SpecConstantii(i32 42, i32 0)
+      ret i32 %1
+    }
+
+    %struct.A = type { i32, float }
+
+    ; Function Attrs: alwaysinline
+    define dso_local spir_func void @get2(%struct.A* sret %ret.ptr) local_unnamed_addr #0 {
+      ; args are "ID" and "default value":
+      %1 = tail call spir_func i32 @_Z20__spirv_SpecConstantii(i32 43, i32 0)
+      %2 = tail call spir_func float @_Z20__spirv_SpecConstantif(i32 44, float 0.000000e+00)
+      %ret = tail call spir_func %struct.A @_Z29__spirv_SpecConstantCompositeif(i32 %1, float %2)
+      store %struct.A %ret, %struct.A* %ret.ptr
+      ret void
+    }
+
+the translator will generate `OpSpecConstant` SPIR-V instructions with proper
+`SpecId` decorations:
+
+.. code-block:: c++
+   :linenos:
+
+   OpDecorate %i32 SpecId 42                 ; ID
+   %i32 = OpSpecConstant %int 0              ; Default value
+   OpDecorate %A.i32 SpecId 43               ; ID of the 1st member
+   OpDecorate %A.float SpecId 44             ; ID of the 2nd member
+   %A.i32 = OpSpecConstant %int.type 0       ; 1st member with default value
+   %A.float = OpSpecConstant %float.type 0.0 ; 2nd member with default value
+   %struct = OpSpecConstantComposite %struct.type %A.i32 %A.float ; Composite doesn't need an ID or default value
+   %1 = OpTypeFunction %int
+
+   %get = OpFunction %int None %1
+   %2 = OpLabel
+   OpReturnValue %i32
+   OpFunctionEnd
+
+   %3 = OpTypeFunction %struct.type
+
+   %get2 = OpFunction %struct.type None %3
+   %4 = OpLabel
+   OpReturnValue %struct
+   OpFunctionEnd
+
+Collecting spec constants info and communicating it to SYCL runtime
+```````````````````````````````````````````````````````````````````
+
+For each encountered specialization constant, we emit a property that encodes
+the information required by the SYCL runtime to set the value of the
+specialization constant through the corresponding API.
+
+These properties are stored in the "SYCL/specialization constants" property
+set and their names are the same as the symbolic IDs of the corresponding
+specialization constants.
+
+Each such property contains an array of tuples (descriptors)
+<leaf ID, offset, size>. This descriptor might be overcomplicated for simple
+arithmetic spec constants, but it is still used for them in order to unify
+the internal representation of scalar and composite spec constants and
+simplify their handling in the SYCL runtime.
+This descriptor is needed because, at the SYCL runtime level, composite
+constants are set by the user as a byte array and we have to break it down
+into the leaf members of the composite and set a value for each leaf as for
+a separate scalar specialization constant.
+
+For simple scalar specialization constants the array will only contain a
+single descriptor representing the constant itself. For composite
+specialization constants the array will contain a descriptor for each leaf
+of the composite type.
+
+The descriptor contains the following fields:
+
+- ID of a composite constant leaf, i.e. the ID of a scalar specialization
+constant which is a part of a composite type, or the ID of the constant
+itself if it is a scalar.
+- Offset from the beginning of the composite, which points to the location
+of the scalar value within the composite, i.e. the position where the scalar
+specialization constant resides within the byte array supplied by the user.
+For scalar specialization constants it will always be 0.
+- Size of the scalar specialization constant.
+
+For example, the following code:
+
+ .. code-block:: c++
+    :linenos:
+
+    struct Nested {
+      float a, b;
+    };
+    struct A {
+      int x;
+      Nested n;
+    };
+
+    constexpr specialization_id<int> id_int;
+    constexpr specialization_id<A> id_A;
+    // ...
+    [=](kernel_handler h) {
+      h.get_specialization_constant<id_int>();
+      h.get_specialization_constant<id_A>();
+    }
+
+Will result in the following property set being generated:
+
+ .. 
code-block:: c++
+    :linenos:
+
+    property_set {
+      Name = "SYCL/specialization constants",
+      properties: [
+        property {
+          Name: "id_int_symbolic_ID",
+          ValAddr: points to byte array [{0, 0, 4}],
+          Type: PI_PROPERTY_TYPE_BYTE_ARRAY,
+          Size: sizeof(byte array above)
+        },
+        property {
+          Name: "id_A_symbolic_ID",
+          ValAddr: points to byte array [{1, 0, 4}, {2, 4, 4}, {3, 8, 4}],
+          Type: PI_PROPERTY_TYPE_BYTE_ARRAY,
+          Size: sizeof(byte array above)
+        },
+      ]
+    }
+
+During the creation of the wrapped device image, these device image
+properties are embedded into the application together with the device code
+and used by the SYCL runtime while handling specialization constants during
+application execution. The runtime calls the appropriate API to set the
+value of a specialization constant.
+
+Symbol table generation
+-----------------------
+
+During the post-linking stage, a symbol table (a list of kernels) is
+generated for each produced device module. This facilitates proper module
+selection by the SYCL runtime. The symbol table consists of the names of all
+device entry points inside a given module. An example LLVM IR module is
+shown below.
+
+ .. code-block:: c++
+    :linenos:
+
+    define spir_kernel void @foo(%arg1, %arg2) {
+    }
+    define spir_kernel void @bar(%arg1, %arg2) {
+    }
+    define spir_kernel void @alpha(%arg1, %arg2) {
+    }
+
+For this example, the following symbol table is generated. All symbols in
+the symbol table are concatenated into a single string (line breaks are used
+as separators), which is then added to the StringData map of the offloading
+device image.
+
+ .. 
code-block:: c++ + :linenos: + + Image.StringData["SYMBOL_TABLE"] = "foo\nbar\nalpha\n" + +Property Set Generation +----------------------- + +SYCL LLVM to SPIR-V translation +=============================== + +AOT compilation flow for SPIR-V based targets +============================================= + +AOT compilation flow for non SPIR-V based targets (NVPTX AND AMD) +================================================================= + +Generation of wrapped object +============================ + +Generation of final host-linked object +====================================== + +SYCL runtime +============ diff --git a/llvm/docs/figures/code_split.png b/llvm/docs/figures/code_split.png new file mode 100644 index 0000000000000..33889ff6a785e Binary files /dev/null and b/llvm/docs/figures/code_split.png differ diff --git a/llvm/docs/figures/linker_wrapper.png b/llvm/docs/figures/linker_wrapper.png new file mode 100644 index 0000000000000..95f2c109948b0 Binary files /dev/null and b/llvm/docs/figures/linker_wrapper.png differ diff --git a/llvm/docs/figures/sycl_linker.png b/llvm/docs/figures/sycl_linker.png new file mode 100644 index 0000000000000..a73cc2404b9a3 Binary files /dev/null and b/llvm/docs/figures/sycl_linker.png differ diff --git a/llvm/docs/figures/sycl_linker_aot_intel.png b/llvm/docs/figures/sycl_linker_aot_intel.png new file mode 100644 index 0000000000000..5fa88e892d51d Binary files /dev/null and b/llvm/docs/figures/sycl_linker_aot_intel.png differ diff --git a/llvm/docs/figures/sycl_linker_aot_other.png b/llvm/docs/figures/sycl_linker_aot_other.png new file mode 100644 index 0000000000000..c8e0ab81585f2 Binary files /dev/null and b/llvm/docs/figures/sycl_linker_aot_other.png differ diff --git a/llvm/docs/figures/toplevel.png b/llvm/docs/figures/toplevel.png new file mode 100644 index 0000000000000..98b5303c4cf45 Binary files /dev/null and b/llvm/docs/figures/toplevel.png differ diff --git a/llvm/docs/index.rst b/llvm/docs/index.rst index 
b480729aaa5d9..d18124e132a3a 100644
--- a/llvm/docs/index.rst
+++ b/llvm/docs/index.rst
@@ -60,11 +60,15 @@ Getting Started, How-tos, Developer Guides, and Tutorials.
 .. toctree::
    :hidden:
 
+   SYCLOffloadingDesign
    GettingStartedTutorials
    Reference
    UserGuides
    DiscourseMigrationGuide
 
+:doc:`SYCLOffloadingDesign`
+  An overview of the design and internals of SYCL offloading support.
+
 :doc:`GettingStartedTutorials`
   For those new to the LLVM system.