diff --git a/llvm/docs/SYCLOffloadingDesign.rst b/llvm/docs/SYCLOffloadingDesign.rst
new file mode 100644
index 0000000000000..021d881cf7ce2
--- /dev/null
+++ b/llvm/docs/SYCLOffloadingDesign.rst
@@ -0,0 +1,743 @@
+==================================
+SYCL Offloading Design & Internals
+==================================
+
+.. contents::
+ :local:
+
+************
+Introduction
+************
+
+SYCL is a high-level, single-source, standard C++ programming model that can
+target a wide range of heterogeneous platforms. SYCL is an industry-driven
+standard, developed by the Khronos Group and announced in March 2014, that
+enables data parallelism for heterogeneous systems based on C++17. It is a
+royalty-free, cross-platform abstraction layer that enables the development of
+applications and frameworks with the portability and efficiency of OpenCL,
+while providing a user interface that is flexible and easy to use. The SYCL
+standard is designed to be both highly portable and highly efficient, allowing
+applications to offload compute kernels and data transfers to OpenCL-enabled
+accelerators such as GPUs, FPGAs, and other processors.
+
+A SYCL implementation refers to a software version or revision of the SYCL
+programming model. It consists of the tools, libraries, and runtime support
+that enable developers to write SYCL code and execute it on compatible
+hardware platforms. Several SYCL implementations are available. This document
+focuses on DPC++ by Intel, which is part of the oneAPI solution and supports
+Intel CPUs, GPUs, and FPGAs as well as NVIDIA and AMD GPUs. For more details,
+please see `sycl-overview `_.
+
+ .. code-block:: c++
+ :linenos:
+
+      #include <sycl/sycl.hpp>
+ using namespace sycl;
+ int main() {
+ queue Q; // The queue, Q, is the object that
+ // submits the task to a device.
+ int const size = 10;
+      buffer<int> A{ size };     // The buffer, A, is the memory used to
+ // transfer data between host and device.
+ Q.submit([&](handler& h) { // The handler, h, is the object that contains
+ // the single_task function to be used.
+ accessor A_acc(A, h); // The accessor, A_acc, is the object that
+ // efficiently accesses the buffer elements.
+ h.single_task([=]() {
+ A_acc[5] = 77;
+ });
+ });
+ host_accessor result(A); // host_accessor is the object that allows
+ // the host to access the buffer memory.
+ for (int i = 0; i < size; i++) // Print output
+        std::cout << result[i] << " ";
+      std::cout << "\n";
+ return 0;
+ }
+
+Here, we present a program built around a simple SYCL construct called
+`single_task`, whose closest equivalent in standard C++ is `std::thread`, to
+gain a fundamental understanding of SYCL components. The `single_task`
+construct defines a unit of work that is executed on a single processing
+element, typically a single CPU core or GPU thread. A SYCL kernel function is
+a lambda function that is passed as an argument to the `single_task` construct.
+
+This document describes the clang driver and code generation steps for
+creating offloading applications. Clang supports offloading to various
+architectures using programming models like CUDA, HIP, and OpenMP. Clang
+support for SYCL offloading is currently being added.
+
+Clang will support SYCL target offloading to several different architectures
+such as Intel CPUs and GPUs, NVPTX, and AMDGPU. Offloading code is generated
+by clang and then executed using the SYCL runtime. This section describes the
+steps necessary to create a functioning device image that can be loaded by
+the SYCL runtime.
+
+SYCL offloading supports two modes of compilation: (i) Just-In-Time (JIT)
+compilation and (ii) Ahead-Of-Time (AOT) compilation. In JIT compilation, the
+final wrapped device object is represented using the SPIR-V Intermediate
+Representation, and final device code generation is deferred until application
+runtime. JIT compilation provides portability of device code and allows
+target-specific optimizations to happen at runtime. In AOT compilation, final
+device code is generated at compile time (before the wrapping step) to produce
+the final binary. AOT compilation reduces application execution time by
+skipping the device code generation step at runtime, and the final device code
+can also be tested before deployment.
+
+The remainder of this document illustrates the steps necessary to compile a
+SYCL offloading application using clang.
+
+*******************************************
+High-level SYCL Offloading Compilation Flow
+*******************************************
+
+.. figure:: toplevel.png
+ :scale: 50%
+ :align: center
+
+ Top-level SYCL offloading compilation flow.
+
+Figure 1 showcases the SYCL offloading compilation flow. The SYCL compiler
+performs the following high-level actions to generate SYCL offloading code:
+
+ - Front-end device compilation step (`clang --target=`)
+ - Packaging of device code (`clang-offload-packager`)
+ - Front-end host compilation step (`clang --target= --fembed-offload-object`)
+ - Linking stage (`clang-linker-wrapper`)
+
+The set of possible 'clang -cc1' invocations for a given user command is shown
+below.
+
+.. code-block:: console
+ :linenos:
+
+ $ clang++ -fsycl --offload-arch=pvc,sm_52 test.cpp -###
+ clang -cc1 -triple nvptx64 ... // AOT device compilation for sm_52
+ clang -cc1 -triple spir64_gen ... // AOT device compilation for pvc
+ clang -cc1 -triple spir64 ... // JIT device compilation (backup) for pvc
+ clang -cc1 -triple x86_64 ... // Host compilation
+
+****************************
+Front-end device compilation
+****************************
+This step parses the input source, outlines the device part of the code,
+applies additional restrictions on the device code (e.g. no exceptions or
+virtual calls), and generates LLVM IR bitcode for the device code only, along
+with an "integration header" that provides information such as kernel names,
+parameter order, and data types to the runtime library. Multiple LLVM IR
+bitcode files (in the case of multiple targets) are packaged into a single
+object by the clang-offload-packager.
+Front-end device compilation for SYCL offloading can be split into the
+following components: device code outlining, SYCL kernel function object
+lowering, generation of device code diagnostics, and integration header
+generation. These components are explained in the sections below.
+
+Device code outlining
+=====================
+ This component is responsible for identifying and outlining "device code" in the
+ single source.
+ Here is a code example of a SYCL program that demonstrates the compiler's
+ outlining work:
+
+ .. code-block:: c++
+ :linenos:
+
+ int foo(int x) { return ++x; }
+    int bar(int x) { throw std::runtime_error{"CPU code only!"}; }
+ // ...
+ using namespace sycl;
+ queue Q;
+    buffer<int> a{range<1>{1024}};
+ Q.submit([&](handler& cgh) {
+ auto A = a.get_access(cgh);
+ cgh.parallel_for(range<1>{1024}, [=](id<1> index) {
+ A[index] = index[0] * 2 + foo(42);
+ });
+  });
+
+ In this example, the compiler needs to compile the lambda expression passed
+ to the `sycl::handler::parallel_for` method, as well as the function `foo`
+ called from the lambda expression for the device.
+
+ The compiler must also ignore the `bar` function when we compile the
+ "device" part of the single source code, as it's unused inside the device
+ portion of the source code (the contents of the lambda expression passed to the
+ `sycl::handler::parallel_for` and any function called from this lambda
+ expression).
+
+ The current approach is to use the SYCL kernel attribute in the runtime to
+ mark code passed to `sycl::handler::parallel_for` as "kernel functions".
+ The runtime library cannot mark `foo` as "device" code - this is the
+ compiler's job: to traverse all symbols accessible from kernel functions and
+ add them to the "device part" of the code, marking them with the SYCL device
+ attribute.
+
+SYCL kernel function object (functor or lambda) lowering
+========================================================
+ This component creates an OpenCL kernel function interface for SYCL kernels.
+ All SYCL memory objects shared between host and device (buffers/images -
+ these objects map to OpenCL buffers and images) must be accessed through
+ special `accessor` classes. The "device"-side implementation of these classes
+ contains pointers to the device memory. As there is no way in OpenCL to pass
+ structures with pointers inside as kernel arguments, all memory objects
+ shared between host and device must be passed to the kernel as raw pointers.
+
+ SYCL also has a special mechanism for passing kernel arguments from the host
+ to the device. In OpenCL, kernel arguments are set by calling the
+ `clSetKernelArg` function for each kernel argument, whereas in SYCL all the
+ kernel arguments are fields of the "SYCL kernel function", which can be
+ defined as a lambda function or a named function object and is passed as an
+ argument to a SYCL kernel invocation function (such as `parallel_for` or
+ `single_task`). For example, in the previous code snippet the `accessor` `A`
+ is one such captured kernel argument.
+
+ To facilitate the mapping of SYCL kernel data members to OpenCL kernel
+ arguments and to overcome OpenCL limitations, we added the generation of an
+ OpenCL kernel function inside the compiler. An OpenCL kernel function
+ contains the body of the SYCL kernel function, receives OpenCL-like
+ parameters, and additionally performs some manipulation to initialize SYCL
+ kernel data members with these parameters. In pseudo code, the OpenCL kernel
+ function for the previous code snippet looks like this:
+
+ .. code-block:: c++
+ :linenos:
+
+ // SYCL kernel is defined in SYCL headers:
+    template <typename KernelName, typename KernelType>
+ __attribute__((sycl_kernel)) void sycl_kernel_function(KernelType KernelFuncObj) {
+ // ...
+ KernelFuncObj();
+ }
+ // Generated OpenCL kernel function
+    __kernel void KernelName(global int* a) {
+ KernelType KernelFuncObj; // Actually kernel function object declaration
+ // doesn't have a name in AST.
+ // Let the kernel function object have one captured field - accessor A.
+ // We need to init it with global pointer from arguments:
+ KernelFuncObj.A.__init(a);
+ // Body of the SYCL kernel from SYCL headers:
+ {
+ KernelFuncObj();
+ }
+ }
+
+ The OpenCL kernel function is generated by the compiler inside Sema using
+ AST nodes.
+
+Generation of device code diagnostics
+=====================================
+ This component enforces language restrictions on device code.
+
+Integration header generation
+=============================
+ This component emits information required for binding the host and device
+ parts of the SYCL code via the OpenCL API. In the proposed design, we use the
+ SYCL device front-end compiler to produce the integration header for two
+ reasons. First, it must be possible to use any host compiler to produce SYCL
+ heterogeneous applications. Second, even if the same clang compiler is used
+ for the host compilation, information provided in the integration header is
+ used (included) by the SYCL runtime implementation, so the header must be
+ available before the host compilation starts.
+
+**************************
+Front-end host compilation
+**************************
+This step parses the input file and emits code for the host. This host
+compilation takes an additional argument which embeds the packaged device
+binary into the host code. This step generates a fat object that will be
+consumed during the link stage.
+
+*************
+Linking stage
+*************
+The fat object file(s) are passed to the linker-wrapper tool. The tool
+extracts the device objects and runs the device linking action on them.
+Fully formed device images are then wrapped into host objects and linked
+with the host image from the original fat object using the host linker tool.
+
+.. figure:: linker_wrapper.png
+ :scale: 50%
+ :align: center
+
+ Device linking flow for SYCL offloading inside the linker-wrapper tool.
+
+Figure 2 shows the compilation flow inside the linker-wrapper tool. First, all
+device objects are extracted from the fat objects and grouped according to the
+target device. For each target device group, we invoke clang to link the
+device objects and generate a list of device images. All the device images are
+then wrapped together and embedded into a wrapped host object. This wrapper
+object is linked normally with the rest of the host objects and/or libraries
+using the usual linker - e.g. `ld` on Linux and `link.exe` on Windows.
+
+Linking of device objects
+=========================
+ Device code linking for SYCL offloading will be performed by making a call to
+ the clang driver. A SYCL specific option '--sycl-link' is passed to the driver
+ to control the linking process. In the presence of this option, the clang driver
+ will invoke a SYCL linker tool called 'clang-sycl-linker'. This tool is
+ responsible for linking the device bitcode and generating a group of
+ fully-formed device images. An example of a call to clang-sycl-linker is shown
+ below.
+
+ .. code-block:: console
+
+    $ clang++ --target=<triple> --sycl-link test.bc test.o -###
+      clang-sycl-linker -triple <triple> test.bc test.o
+
+ .. figure:: sycl_linker.png
+ :scale: 50%
+ :align: center
+
+ SYCL-specific device code linking for JIT compilation flow.
+
+ Figure 3 shows the device code linking process performed inside the SYCL
+ linker tool for JIT compilation. The key steps are:
+
+ 1. All the inputs of LLVM IR bitcode type are gathered and linked together using
+ the llvm-link tool.
+
+ 2. If the user provides device library files, the output of the previous
+ step is linked with those files to get the final fully linked LLVM IR device
+ bitcode image.
+
+ 3. The fully linked LLVM IR device bitcode image undergoes several post-link
+ steps. These include device code splitting, specialization constant lowering,
+ symbol table generation, and property set generation.
+
+ 4. Each of the split device images in LLVM IR format is translated to SPIR-V
+ using the external tool 'llvm-spirv'.
+
+ The remainder of this section discusses the various post-link steps and the
+ translation to SPIR-V.
+
+Device code splitting
+---------------------
+ This process takes as input a fully linked device module with a set of SYCL
+ device kernels for a specific target and performs splitting to generate
+ several self-contained device modules. Each of the newly formed modules
+ contains a subset of the original set of SYCL device kernels along with the
+ union of all the functions from their respective call graphs. Here, the call
+ graph of a SYCL kernel is the set of all functions reachable from that
+ kernel.
+ Device code splitting is performed for two reasons:
+
+ (1) User requests a specific mode of code splitting. Three modes of code
+ splitting are supported. They are:
+
+ i. per-kernel - Every SYCL kernel and the functions that are contained in its
+ call graph go into a separate module.
+
+ ii. per-module - All SYCL kernels belonging to a specific user-specified
+ module and all functions from their respective call graphs go into a separate
+ module.
+
+ iii. auto - per-kernel or per-module option is selected based on some
+ heuristics.
+
+ (2) SYCL kernels can be specialized for different devices. For example, kernels
+ that use 64-bit floating point (FP64) operations can be executed only on devices
+ that support such operations. Hence, it is required that all kernels that
+ require FP64 support be grouped separately from kernels that do not require FP64
+ support. A point to note: a SYCL kernel is said to require FP64 support if
+ the kernel itself or any function in its call graph uses FP64 operations.
+ Figure 4 showcases this device code split.
+
+ .. figure:: code_split.png
+ :scale: 50%
+ :align: center
+
+ An example of device code splitting for SYCL offloading.
+
+Specialization constant lowering
+--------------------------------
+ Specialization constants are implemented in accordance with how they are
+ defined by the SYCL 2020 specification. Here are links to `sycl-registry `_
+ and `sycl-2020-spec `_.
+
+ Specialization constants represent constants whose values can be set
+ dynamically during execution of the SYCL application. The values of these
+ constants are fixed when a SYCL kernel function is invoked, and they do not
+ change during the execution of the kernel. However, the application is able to
+ set a new value for a specialization constant each time a kernel is invoked,
+ so the values can be tuned differently for each invocation.
+ An example usage of specialization constants is shown below:
+
+ .. code-block:: c++
+ :linenos:
+
+    #include <sycl/sycl.hpp>
+ using namespace sycl;
+    using coeff_t = std::array<std::array<float, 3>, 3>;
+ // Read coefficients from somewhere.
+ coeff_t get_coefficients();
+ // Identify the specialization constant.
+    constexpr specialization_id<coeff_t> coeff_id;
+    void do_conv(buffer<float, 2> in, buffer<float, 2> out) {
+ queue myQueue;
+ myQueue.submit([&](handler &cgh) {
+ accessor in_acc { in, cgh, read_only };
+ accessor out_acc { out, cgh, write_only };
+        // Set the coefficients of the convolution as constants.
+        // This will build a specific kernel with the coefficients available
+        // as literals.
+        cgh.set_specialization_constant<coeff_id>(get_coefficients());
+ cgh.parallel_for(
+ in.get_range(), [=](item<2> item_id, kernel_handler h) {
+ float acc = 0;
+            coeff_t coeff = h.get_specialization_constant<coeff_id>();
+ for (int i = -1; i <= 1; i++) {
+ if (item_id[0] + i < 0 || item_id[0] + i >= in_acc.get_range()[0])
+ continue;
+ for (int j = -1; j <= 1; j++) {
+ if (item_id[1] + j < 0 || item_id[1] + j >= in_acc.get_range()[1])
+ continue;
+ // The underlying JIT can see all the values of the array returned
+ // by coeff.get().
+ acc += coeff[i + 1][j + 1] *
+ in_acc[item_id[0] + i][item_id[1] + j];
+ }
+ }
+ out_acc[item_id] = acc;
+ });
+ });
+ myQueue.wait();
+ }
+
+Handling of specialization constants requires the compiler to have access to a
+fully linked device image and hence happens during the post-link step.
+This handling happens inside a dedicated LLVM IR pass named
+`SpecConstantsPass` which:
+
+1. Assigns numeric IDs to specialization constants found in the linked
+   module.
+2. Transforms the IR to a form expected by the LLVM to SPIR-V translation
+   step.
+3. Translates the SPIR-V friendly IR to SPIR-V during the LLVM to SPIR-V
+   translation step.
+4. Collects and provides a "symbolic ID => list of numeric IDs" mapping,
+   which is later used by the SYCL runtime to set specialization constant
+   values provided by the user.
+
+The remainder of this section explains these steps in detail.
+
+Assignment of numeric IDs to specialization constants
+`````````````````````````````````````````````````````
+
+This task is achieved by maintaining a map which holds a list of numeric IDs
+for each encountered symbolic ID of a specialization constant. These IDs are
+used to identify the specialization constants at the SPIR-V level.
+
+As noted above, one symbolic ID can have several numeric IDs assigned to it -
+such a 1:N mapping comes from the fact that at the SPIR-V level, composite
+specialization constants do not have dedicated IDs; they are identified and
+specialized through their scalar leaves and the corresponding numeric IDs.
+
+For example, the following code:
+
+ .. code-block:: c++
+ :linenos:
+
+ struct Nested {
+ float a, b;
+ };
+ struct A {
+ int x;
+ Nested n;
+ };
+    constexpr specialization_id<int> id_int;
+    constexpr specialization_id<A> id_A;
+ // ...
+ [=](kernel_handler h) {
+      h.get_specialization_constant<id_int>();
+      h.get_specialization_constant<id_A>();
+ }
+
+Will result in the following numeric ID assignment:
+
+ .. code-block:: c++
+ :linenos:
+
+ // since `id_int` is a simple arithmetic specialization constant, we only
+ // have a single numeric ID associated with its symbolic ID
+ unique_symbolic_id_for_id_int -> { 0 }
+    // `id_A` is a composite with three leaves (scalar members, including
+    // ones located in nested composite types), which results in three
+    // numeric IDs associated with the same symbolic ID
+ unique_symbolic_id_for_id_A -> { 1, 2, 3 }
+
+As shown in the example above, if a composite specialization constant
+contains another composite within it, that nested composite is also
+"flattened" and its leaves are considered to be leaves of the parent
+specialization constant. This is done by a depth-first search through the
+composite elements.
+
+Transformation of LLVM IR to SPIR-V friendly IR form
+````````````````````````````````````````````````````
+
+SPIR-V friendly IR form is a special representation of LLVM IR, where some
+functions are named in a particular way in order to be recognizable by the
+SPIR-V translator, which converts them into the corresponding SPIR-V
+instructions later. The format is documented in `spirv-friendly-ir `_.
+
+For specialization constants, we need to generate the following constructs:
+
+ .. code-block:: c++
+ :linenos:
+
+    template <typename T> // T is an arithmetic type
+    T __spirv_SpecConstant(int numericID, T default_value);
+
+    template <typename T, typename... Elements>
+    // T is a composite type; Elements are arithmetic or composite types
+    T __spirv_SpecConstantComposite(Elements... elements);
+
+Particularly, `SpecConstantsPass` translates calls to the
+
+ .. code-block:: c++
+
+ T __sycl_getScalar2020SpecConstantValue(const char *SymbolicID, const void *DefaultValue, const char *RTBuffer)
+
+intrinsic into calls to
+
+ .. code-block:: c++
+
+ T __spirv_SpecConstant(int ID, T default_val)
+
+And for
+
+ .. code-block:: c++
+
+    T __sycl_getComposite2020SpecConstantValue(const char *SymbolicID, const void *DefaultValue, const char *RTBuffer)
+
+it generates a number of
+
+ .. code-block:: c++
+
+ T __spirv_SpecConstant(int ID, T default_val)
+
+calls, one for each leaf of the composite type, plus a
+
+ .. code-block:: c++
+
+ T __spirv_SpecConstantComposite(Elements... elements)
+
+call for each composite type (including the outermost one).
+
+An example of the LLVM IR transformation can be found below. Input LLVM IR:
+
+ .. code-block:: c++
+ :linenos:
+
+ %struct.POD = type { [2 x %struct.A], <2 x i32> }
+ %struct.A = type { i32, float }
+
+ @gold_scalar_default = global %class.specialization_id { i32 42 }
+    @gold_default = global %class.specialization_id { %struct.POD { [2 x %struct.A] [%struct.A { i32 1, float 2.000000e+00 }, %struct.A { i32 2, float 3.000000e+00 }], <2 x i32> <i32 44, i32 44> } }
+
+    ; the second argument of the intrinsics below is simplified a bit;
+    ; in real-life LLVM IR it looks like:
+    ; i8* bitcast (%class.specialization_id* @gold_scalar_default to i8*)
+ %gold_scalar = call i32 __sycl_getScalar2020SpecConstantValue ("gold_scalar_identifier", @gold_scalar_default, i8* %buffer)
+ %gold = call %struct.POD __sycl_getComposite2020SpecConstantValue ("gold_identifier", @gold_default, i8* %default)
+
+LLVM IR generated by `SpecConstantsPass`:
+
+ .. code-block:: c++
+ :linenos:
+
+ %gold_scalar = call i32 __spirv_SpecConstant(i32 0, i32 42)
+
+ %gold_POD_A0_x = call i32 __spirv_SpecConstant(i32 1, i32 1)
+ %gold_POD_A0_y = call float __spirv_SpecConstant(i32 2, float 2.000000e+00)
+
+ %gold_POD_A0 = call %struct.A __spirv_SpecConstantComposite(i32 %gold_POD_A0_x, float %gold_POD_A0_y)
+
+ %gold_POD_A1_x = call i32 __spirv_SpecConstant(i32 3, i32 2)
+ %gold_POD_A1_y = call float __spirv_SpecConstant(i32 4, float 3.000000e+00)
+
+ %gold_POD_A1 = call %struct.A __spirv_SpecConstantComposite(i32 %gold_POD_A1_x, float %gold_POD_A1_y)
+
+ %gold_POD_A = call [2 x %struct.A] __spirv_SpecConstantComposite(%struct.A %gold_POD_A0, %struct.A %gold_POD_A1)
+
+    %gold_POD_b0 = call i32 __spirv_SpecConstant(i32 5, i32 44)
+    %gold_POD_b1 = call i32 __spirv_SpecConstant(i32 6, i32 44)
+    %gold_POD_b = call <2 x i32> __spirv_SpecConstantComposite(i32 %gold_POD_b0, i32 %gold_POD_b1)
+
+ %gold = call %struct.POD __spirv_SpecConstantComposite([2 x %struct.A] %gold_POD_A, <2 x i32> %gold_POD_b)
+
+This SPIR-V friendly LLVM IR will be handled appropriately by the LLVM to SPIR-V
+translation step.
+
+Translating SPIR-V friendly LLVM IR to SPIR-V
+`````````````````````````````````````````````
+Given the `__spirv_SpecConstant` intrinsic calls produced by the
+`SpecConstantsPass`:
+
+ .. code-block:: c++
+ :linenos:
+
+ ; Function Attrs: alwaysinline
+ define dso_local spir_func i32 @get() local_unnamed_addr #0 {
+ ; args are "ID" and "default value":
+ %1 = tail call spir_func i32 @_Z20__spirv_SpecConstantii(i32 42, i32 0)
+ ret i32 %1
+ }
+
+ %struct.A = type { i32, float }
+
+ ; Function Attrs: alwaysinline
+ define dso_local spir_func void @get2(%struct.A* sret %ret.ptr) local_unnamed_addr #0 {
+ ; args are "ID" and "default value":
+ %1 = tail call spir_func i32 @_Z20__spirv_SpecConstantii(i32 43, i32 0)
+ %2 = tail call spir_func float @_Z20__spirv_SpecConstantif(i32 44, float 0.000000e+00)
+    %ret = tail call spir_func %struct.A @_Z29__spirv_SpecConstantCompositeif(i32 %1, float %2)
+ store %struct.A %ret, %struct.A* %ret.ptr
+ ret void
+ }
+
+the translator will generate `OpSpecConstant` SPIR-V instructions with proper
+`SpecId` decorations:
+
+.. code-block:: c++
+ :linenos:
+
+ OpDecorate %i32 SpecId 42 ; ID
+ %i32 = OpSpecConstant %int 0 ; Default value
+ OpDecorate %A.i32 SpecId 43 ; ID of the 1st member
+ OpDecorate %A.float SpecId 44 ; ID of the 2nd member
+ %A.i32 = OpSpecConstant %int.type 0 ; 1st member with default value
+ %A.float = OpSpecConstant %float.type 0.0 ; 2nd member with default value
+   %struct = OpSpecConstantComposite %struct.type %A.i32 %A.float ; Composite doesn't need IDs or default values
+ %1 = OpTypeFunction %int
+
+ %get = OpFunction %int None %1
+ %2 = OpLabel
+ OpReturnValue %i32
+ OpFunctionEnd
+   %3 = OpTypeFunction %struct.type
+
+   %get2 = OpFunction %struct.type None %3
+   %4 = OpLabel
+ OpReturnValue %struct
+ OpFunctionEnd
+
+Collecting spec constants info and communicating it to SYCL runtime
+```````````````````````````````````````````````````````````````````
+
+For each encountered specialization constant, we emit a property which
+encodes the information required by the SYCL runtime to set the value of the
+specialization constant through the corresponding API.
+
+These properties are stored in the "SYCL/specialization constants" property
+set, and their names are the same as the symbolic IDs of the corresponding
+specialization constants.
+
+Each such property contains an array of tuples (descriptors) of the form
+{ID, offset, size}. This descriptor might be overcomplicated for simple
+arithmetic spec constants, but it is still used for them in order to unify
+the internal representation of scalar and composite spec constants and
+simplify their handling in the SYCL runtime.
+This descriptor is needed because, at the SYCL runtime level, composite
+constants are set by the user as a byte array, and we have to break it down
+to the leaf members of the composite and set a value for each leaf as for a
+separate scalar specialization constant.
+
+For simple scalar specialization constants, the array will only contain a
+single descriptor representing the constant itself. For composite
+specialization constants, the array will contain one descriptor for each
+leaf of the composite type.
+
+The descriptor contains the following fields:
+
+- The ID of a composite constant leaf, i.e. the ID of a scalar specialization
+  constant which is part of a composite type, or the ID of the constant
+  itself if it is a scalar.
+- The offset from the beginning of the composite, which points to the
+  location of a scalar value within the composite, i.e. the position where
+  the scalar specialization constant resides within the byte array supplied
+  by the user. For scalar specialization constants it will always be 0.
+- The size of the scalar specialization constant.
+
+For example, the following code:
+
+ .. code-block:: c++
+ :linenos:
+
+ struct Nested {
+ float a, b;
+ };
+ struct A {
+ int x;
+ Nested n;
+ };
+
+    constexpr specialization_id<int> id_int;
+    constexpr specialization_id<A> id_A;
+ // ...
+ [=](kernel_handler h) {
+      h.get_specialization_constant<id_int>();
+      h.get_specialization_constant<id_A>();
+ }
+
+Will result in the following property set being generated:
+
+ .. code-block:: c++
+ :linenos:
+
+ property_set {
+ Name = "SYCL/specialization constants",
+ properties: [
+ property {
+ Name: "id_int_symbolic_ID",
+ ValAddr: points to byte array [{0, 0, 4}],
+ Type: PI_PROPERTY_TYPE_BYTE_ARRAY,
+ Size: sizeof(byte array above)
+ },
+ property {
+ Name: "id_A_symbolic_ID",
+ ValAddr: points to byte array [{1, 0, 4}, {2, 4, 4}, {3, 8, 4}],
+ Type: PI_PROPERTY_TYPE_BYTE_ARRAY,
+ Size: sizeof(byte array above)
+ },
+ ]
+ }
+
+During the creation of the wrapped device image, these device image
+properties are embedded into the application together with the device code
+and are used by the SYCL runtime while handling specialization constants
+during application execution. The runtime calls the appropriate API to set
+the value of each specialization constant.
+
+Symbol table generation
+-----------------------
+
+During the post-linking stage, a symbol table (a list of kernels) is
+generated for each produced device module. This facilitates proper module
+selection by the SYCL runtime. The symbol table consists of the names of all
+device entry points inside a given module. An example LLVM IR module is
+shown below.
+
+ .. code-block:: c++
+ :linenos:
+
+ define spir_kernel void @foo(%arg1, %arg2) {
+ }
+ define spir_kernel void @bar(%arg1, %arg2) {
+ }
+ define spir_kernel void @alpha(%arg1, %arg2) {
+ }
+
+For this example, the following symbol table is generated. All symbols in
+the symbol table are concatenated into a single string (line breaks are used
+as separators), which is then added to the StringData map of the offloading
+device image.
+
+ .. code-block:: c++
+ :linenos:
+
+ Image.StringData["SYMBOL_TABLE"] = "foo\nbar\nalpha\n"
+
+Property Set Generation
+-----------------------
+
+SYCL LLVM to SPIR-V translation
+===============================
+
+AOT compilation flow for SPIR-V based targets
+=============================================
+
+AOT compilation flow for non SPIR-V based targets (NVPTX AND AMD)
+=================================================================
+
+Generation of wrapped object
+============================
+
+Generation of final host-linked object
+======================================
+
+SYCL runtime
+============
diff --git a/llvm/docs/figures/Slide1.PNG b/llvm/docs/figures/Slide1.PNG
new file mode 100644
index 0000000000000..eeeafc516e7df
Binary files /dev/null and b/llvm/docs/figures/Slide1.PNG differ
diff --git a/llvm/docs/figures/Slide2.PNG b/llvm/docs/figures/Slide2.PNG
new file mode 100644
index 0000000000000..82325357af72a
Binary files /dev/null and b/llvm/docs/figures/Slide2.PNG differ
diff --git a/llvm/docs/figures/code_split.png b/llvm/docs/figures/code_split.png
new file mode 100644
index 0000000000000..33889ff6a785e
Binary files /dev/null and b/llvm/docs/figures/code_split.png differ
diff --git a/llvm/docs/figures/linker_wrapper.png b/llvm/docs/figures/linker_wrapper.png
new file mode 100644
index 0000000000000..95f2c109948b0
Binary files /dev/null and b/llvm/docs/figures/linker_wrapper.png differ
diff --git a/llvm/docs/figures/sycl_linker.png b/llvm/docs/figures/sycl_linker.png
new file mode 100644
index 0000000000000..efb0018e12d5e
Binary files /dev/null and b/llvm/docs/figures/sycl_linker.png differ
diff --git a/llvm/docs/figures/sycl_linker_aot_intel.png b/llvm/docs/figures/sycl_linker_aot_intel.png
new file mode 100644
index 0000000000000..5fa88e892d51d
Binary files /dev/null and b/llvm/docs/figures/sycl_linker_aot_intel.png differ
diff --git a/llvm/docs/figures/sycl_linker_aot_other.png b/llvm/docs/figures/sycl_linker_aot_other.png
new file mode 100644
index 0000000000000..c8e0ab81585f2
Binary files /dev/null and b/llvm/docs/figures/sycl_linker_aot_other.png differ
diff --git a/llvm/docs/figures/toplevel.png b/llvm/docs/figures/toplevel.png
new file mode 100644
index 0000000000000..98b5303c4cf45
Binary files /dev/null and b/llvm/docs/figures/toplevel.png differ
diff --git a/llvm/docs/index.rst b/llvm/docs/index.rst
index b480729aaa5d9..d18124e132a3a 100644
--- a/llvm/docs/index.rst
+++ b/llvm/docs/index.rst
@@ -60,11 +60,15 @@ Getting Started, How-tos, Developer Guides, and Tutorials.
.. toctree::
:hidden:
+ SYCLOffloadingDesign
GettingStartedTutorials
Reference
UserGuides
DiscourseMigrationGuide
+:doc:`SYCLOffloadingDesign`
+   Design and internals of SYCL offloading support in Clang.
+
:doc:`GettingStartedTutorials`
For those new to the LLVM system.