@@ -364,6 +364,110 @@ Unpoisoning may not be an option, if (for example) you are not maintaining the a
 * You are using an allocator which does not call the destructor during deallocation.
 * You are aware that memory allocated with an allocator may be accessed, even when unused by the container.
 
+Offloading C++ Parallel Algorithms to GPUs
+------------------------------------------
+
+Experimental support for GPU offloading has been added to ``libc++``. The
+implementation uses OpenMP target offloading to leverage GPU compute resources.
+The OpenMP PSTL backend can target both NVIDIA and AMD GPUs.
+However, the implementation only supports contiguous iterators, such as
+iterators for ``std::vector`` or ``std::array``.
+To enable the OpenMP offloading backend, select it with
+``LIBCXX_PSTL_BACKEND=openmp`` when installing ``libc++``. Further, when
+compiling a program, the user must specify the command line options
+``-fopenmp -fexperimental-library``. To install LLVM with OpenMP offloading
+enabled, please read
+`the LLVM OpenMP FAQ <https://openmp.llvm.org/SupportAndFAQ.html>`_.
+You may also want to visit
+`the OpenMP offloading command-line argument reference <https://openmp.llvm.org/CommandLineArgumentReference.html#offload-command-line-arguments>`_.
+
+Example
+~~~~~~~
+
+The following is an example of offloading vector addition to a GPU using our
+standard library extension. It implements the classical vector addition from
+BLAS, overwriting the vector ``y`` with ``y = a*x + y``. Thus ``y.begin()`` is
+used as both an input and an output iterator in this example.
+
+.. code-block:: cpp
+
+  #include <algorithm>
+  #include <execution>
+  #include <vector>
+
+  template <typename T1, typename T2, typename T3>
+  void axpy(const T1 a, const std::vector<T2> &x, std::vector<T3> &y) {
+    std::transform(std::execution::par_unseq, x.begin(), x.end(), y.begin(),
+                   y.begin(), [=](T2 xi, T3 yi) { return a * xi + yi; });
+  }
+
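+For illustration, the function template above could be called as follows. This
+is a minimal usage sketch building on the previous snippet; the sizes and
+values are arbitrary.
+
+.. code-block:: cpp
+
+  int main() {
+    std::vector<float> x(10000, 1.0f);
+    std::vector<float> y(10000, 2.0f);
+    // After the call, every element of y holds 4.0f * 1.0f + 2.0f = 6.0f
+    axpy(4.0f, x, y);
+    return 0;
+  }
+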
+The execution policy ``std::execution::par_unseq`` states that the algorithm's
+execution may be parallelized, vectorized, and migrated across threads. This is
+the only execution policy that is safe to offload to GPUs; with all other
+execution policies, the algorithms will execute on the CPU.
+Special attention must be paid to the lambda captures when enabling GPU
+offloading. If the lambda captures by reference, the user must manually map the
+variables to the device. If capturing by reference, the above example could
+be implemented in the following way.
+
+.. code-block:: cpp
+
+  template <typename T1, typename T2, typename T3>
+  void axpy(const T1 a, const std::vector<T2> &x, std::vector<T3> &y) {
+  #pragma omp target data map(to : a)
+    std::transform(std::execution::par_unseq, x.begin(), x.end(), y.begin(),
+                   y.begin(), [&](T2 xi, T3 yi) { return a * xi + yi; });
+  }
+
+However, if unified shared memory (USM) is enabled, no additional data mapping
+is necessary when capturing by reference.
+
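+For example, on a system with unified shared memory, one could sketch the
+variant below. The ``omp requires unified_shared_memory`` declaration is an
+assumption about the target system; whether USM is actually available depends
+on the hardware and runtime.
+
+.. code-block:: cpp
+
+  #include <algorithm>
+  #include <execution>
+  #include <vector>
+
+  // Assumption: this translation unit requires unified shared memory
+  #pragma omp requires unified_shared_memory
+
+  template <typename T1, typename T2, typename T3>
+  void axpy(const T1 a, const std::vector<T2> &x, std::vector<T3> &y) {
+    // With USM, host and device share one address space, so capturing by
+    // reference needs no explicit data mapping
+    std::transform(std::execution::par_unseq, x.begin(), x.end(), y.begin(),
+                   y.begin(), [&](T2 xi, T3 yi) { return a * xi + yi; });
+  }
+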
+Compiling functions for GPUs with OpenMP
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+The C++ standard assumes that all memory accesses happen within a single
+address space. However, discrete GPU systems have distinct address spaces. A
+single address space can be emulated if your system supports unified shared
+memory. However, many discrete GPU systems do not, and in those cases it is
+important to pass device function pointers to the parallel algorithms. Below is
+an example of how the OpenMP ``declare target`` directive with the ``indirect``
+clause can be used to mark that a function should be compiled for both host and
+device.
+
+.. code-block:: cpp
+
+  #include <cassert>
+  #include <cmath>
+  #include <execution>
+  #include <functional>
+  #include <numeric>
+  #include <vector>
+
+  // This function computes the squared difference of two floating points
+  float squared(float a, float b) { return a * a - 2.0f * a * b + b * b; }
+
+  // Declare that the function must be compiled for both host and device
+  #pragma omp declare target indirect to(squared)
+
+  int main() {
+    std::vector<float> a(100, 1.0);
+    std::vector<float> b(100, 1.25);
+
+    // Pass the host function pointer to the parallel algorithm and let OpenMP
+    // translate it to the device function pointer internally
+    float sum =
+        std::transform_reduce(std::execution::par_unseq, a.begin(), a.end(),
+                              b.begin(), 0.0f, std::plus{}, squared);
+
+    // Validate that the result is approximately 6.25
+    assert(std::abs(sum - 6.25f) < 1e-10);
+    return 0;
+  }
+
+Without unified shared memory, the above example would not work if the raw
+host function pointer ``squared`` were passed to the parallel algorithm; the
+``indirect`` clause is what allows the OpenMP runtime to translate it to the
+corresponding device function pointer.
+
+Important notes about exception handling
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+GPU architectures do not support exception handling. With current versions of
+Clang, a program that contains exceptions in offloaded code regions will still
+compile, but it will terminate if an exception is thrown on the device. This
+does not conform to the C++ standard, and exception handling on GPUs will
+hopefully be better supported in future releases of LLVM.
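+
+As an illustration, consider the following sketch (hypothetical code, not part
+of the documentation above): it compiles, but if the algorithm is offloaded and
+the exception is thrown on the device, the program terminates instead of
+unwinding.
+
+.. code-block:: cpp
+
+  #include <algorithm>
+  #include <execution>
+  #include <stdexcept>
+  #include <vector>
+
+  int main() {
+    std::vector<int> v(100, 1);
+    // Throwing from offloaded code terminates the program rather than
+    // propagating the exception back to the host
+    std::for_each(std::execution::par_unseq, v.begin(), v.end(), [](int &x) {
+      if (x < 0)
+        throw std::runtime_error("unexpected negative value");
+      ++x;
+    });
+    return 0;
+  }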
470+
 Platform specific behavior
 ==========================
 