Commit 197f5bc
committed
Update hipstdpar and related docs
Change-Id: I2ff3afab40c93fd1a50917baf14cd4b526a7a2e0
1 parent 3ae4e13 commit 197f5bc

2 files changed: +395 -0 lines changed

clang/docs/HIPSupport.rst (390 additions & 0 deletions)
      Base* basePtr = &obj;
      basePtr->virtualFunction(); // Allowed since obj is constructed in device code
   }

C++ Standard Parallelism Offload Support: Compiler And Runtime
==============================================================

Introduction
============

This section describes the implementation of support for offloading the
execution of standard C++ algorithms to accelerators that can be targeted via
HIP. Furthermore, it enumerates restrictions on user defined code, as well as
the interactions with runtimes.
Algorithm Offload: What, Why, Where
===================================

C++17 introduced overloads
`for most algorithms in the standard library <https://www.open-std.org/jtc1/sc22/wg21/docs/papers/2016/p0024r2.html>`_
which allow the user to specify a desired
`execution policy <https://en.cppreference.com/w/cpp/algorithm#Execution_policies>`_.
The `parallel_unsequenced_policy <https://en.cppreference.com/w/cpp/algorithm/execution_policy_tag_t>`_
maps relatively well to the execution model of AMD GPUs. This, coupled with the
availability and maturity of GPU accelerated algorithm libraries that
implement most / all corresponding algorithms in the standard library
(e.g. `rocThrust <https://github.com/ROCmSoftwarePlatform/rocThrust>`_), makes
it feasible to provide seamless accelerator offload for supported algorithms,
when an accelerated version exists. Thus, it becomes possible to easily access
the computational resources of an AMD accelerator, via a well specified,
familiar, algorithmic interface, without having to delve into low-level
hardware specific details. Putting it all together:

- **What**: standard library algorithms, when invoked with the
  ``parallel_unsequenced_policy``
- **Why**: democratise AMDGPU accelerator programming, without loss of user
  familiarity
- **Where**: only AMDGPU accelerators targeted by Clang/LLVM via HIP
Small Example
=============

Given the following C++ code:

.. code-block:: C++

   bool has_the_answer(const std::vector<int>& v) {
      return std::find(std::execution::par_unseq, std::cbegin(v), std::cend(v), 42) != std::cend(v);
   }

if Clang is invoked with the ``--hipstdpar --offload-arch=foo`` flags, the call
to ``find`` will be offloaded to an accelerator that is part of the ``foo``
target family. If either ``foo`` or its runtime environment do not support
transparent on-demand paging (such as e.g. that provided in Linux via
`HMM <https://docs.kernel.org/mm/hmm.html>`_), it is necessary to also include
the ``--hipstdpar-interpose-alloc`` flag. If the accelerator specific algorithm
library ``foo`` uses doesn't have an implementation of a particular algorithm,
execution seamlessly falls back to the host CPU. It is legal to specify
multiple ``--offload-arch`` flags. All the flags we introduce, as well as a
thorough view of the various restrictions and their implications, will be
provided below.
Implementation - General View
=============================

We built Algorithm Offload support atop the pre-existing HIP infrastructure.
More specifically, when one requests offload via ``--hipstdpar``, compilation
is switched to HIP compilation, as if ``-x hip`` was specified. Similarly,
linking is also switched to HIP linking, as if ``--hip-link`` was specified.
Note that these switches are implicit, and one should not assume that any
interop with HIP specific language constructs is available e.g. ``__device__``
annotations are neither necessary nor guaranteed to work.

Since there are no language restriction mechanisms in place, it is necessary to
relax HIP language specific semantic checks performed by the FE; they would
otherwise identify valid, offloadable code as invalid HIP code. Given that we
know that the user intended only for certain algorithms to be offloaded, and
encoded this by specifying the ``parallel_unsequenced_policy``, we rely on a
pass over IR to clean up any and all code that was not "meant" for offload. If
requested, allocation interposition is also handled via a separate pass over
IR.

To interface with the client HIP runtime, and to forward offloaded algorithm
invocations to the corresponding accelerator specific library implementation,
an implementation detail forwarding header is implicitly included by the
driver, when compiling with ``--hipstdpar``. In what follows, we will delve
into each component that contributes to implementing Algorithm Offload
support.
Implementation - Driver
=======================

We augment the ``clang`` driver with the following flags:

- ``--hipstdpar`` enables algorithm offload, which, depending on phase, has the
  following effects:

  - when compiling:

    - ``-x hip`` gets prepended to enable HIP support;
    - the ``ROCmToolchain`` component checks for the ``hipstdpar_lib.hpp``
      forwarding header,
      `rocThrust <https://rocm.docs.amd.com/projects/rocThrust/en/latest/>`_ and
      `rocPrim <https://rocm.docs.amd.com/projects/rocPRIM/en/latest/>`_ in
      their canonical locations, which can be overridden via the flags listed
      below; if all are found, the forwarding header gets implicitly included,
      otherwise an error listing the missing component is generated;
    - the ``LangOpts.HIPStdPar`` member is set.

  - when linking:

    - ``--hip-link`` and ``-frtlib-add-rpath`` get appended to enable HIP
      support.

- ``--hipstdpar-interpose-alloc`` enables the interposition of standard
  allocation / deallocation functions with accelerator aware equivalents; the
  ``LangOpts.HIPStdParInterposeAlloc`` member is set;
- ``--hipstdpar-path=`` specifies a non-canonical path for the forwarding
  header; it must point to the folder where the header is located and not to
  the header itself;
- ``--hipstdpar-thrust-path=`` specifies a non-canonical path for
  `rocThrust <https://rocm.docs.amd.com/projects/rocThrust/en/latest/>`_; it
  must point to the folder where the library is installed / built under a
  ``/thrust`` subfolder;
- ``--hipstdpar-prim-path=`` specifies a non-canonical path for
  `rocPrim <https://rocm.docs.amd.com/projects/rocPRIM/en/latest/>`_; it must
  point to the folder where the library is installed / built under a
  ``/rocprim`` subfolder.

The `--offload-arch <https://llvm.org/docs/AMDGPUUsage.html#amdgpu-processors>`_
flag can be used to specify the accelerator for which offload code is to be
generated.
Implementation - Front-End
==========================

When ``LangOpts.HIPStdPar`` is set, we relax some of the HIP language specific
``Sema`` checks to account for the fact that we want to consume pure
unannotated C++ code:

1. ``__device__`` / ``__host__ __device__`` functions (which would originate in
   the accelerator specific algorithm library) are allowed to call implicitly
   ``__host__`` functions;
2. ``__global__`` functions (which would originate in the accelerator specific
   algorithm library) are allowed to call implicitly ``__host__`` functions;
3. resolving ``__builtin`` availability is deferred, because it is possible that
   a ``__builtin`` that is unavailable on the target accelerator is not
   reachable from any offloaded algorithm, and thus will be safely removed in
   the middle-end;
4. ASM parsing / checking is deferred, because it is possible that an ASM block
   that e.g. uses some constraints that are incompatible with the target
   accelerator is not reachable from any offloaded algorithm, and thus will be
   safely removed in the middle-end.

``CodeGen`` is similarly relaxed, with implicitly ``__host__`` functions being
emitted as well.
Implementation - Middle-End
===========================

We add two ``opt`` passes:

1. ``HipStdParAcceleratorCodeSelectionPass``

   - For all kernels in a ``Module``, compute reachability, where a function
     ``F`` is reachable from a kernel ``K`` if and only if there exists a
     direct call-chain rooted in ``K`` that includes ``F``;
   - Remove all functions that are not reachable from kernels;
   - This pass is only run when compiling for the accelerator.

The first pass assumes that the only code that the user intended to offload was
that which was directly or transitively invocable as part of an algorithm
execution. It also assumes that an accelerator aware algorithm implementation
would rely on accelerator specific special functions (kernels), and that these
effectively constitute the only roots for accelerator execution graphs. Both of
these assumptions are based on observing how widespread accelerators, such as
GPUs, work.
2. ``HipStdParAllocationInterpositionPass``

   - Iterate through all functions in a ``Module``, and replace standard
     allocation / deallocation functions with accelerator-aware equivalents,
     based on a pre-established table; the list of functions that can be
     interposed is available
     `here <https://github.com/ROCmSoftwarePlatform/roc-stdpar#allocation--deallocation-interposition-status>`_;
   - This pass is only run when compiling for the host.

The second pass is optional.
Implementation - Forwarding Header
==================================

The forwarding header implements two pieces of functionality:

1. It forwards algorithms to a target accelerator, which is done by relying on
   C++ language rules around overloading:

   - overloads taking an explicit argument of type
     ``parallel_unsequenced_policy`` are introduced into the ``std`` namespace;
   - these will get preferentially selected versus the master template;
   - the body forwards to the equivalent algorithm from the accelerator
     specific library.

2. It provides allocation / deallocation functions that are equivalent to the
   standard ones, but obtain memory by invoking
   `hipMallocManaged <https://rocm.docs.amd.com/projects/HIP/en/latest/.doxygen/docBin/html/group___memory_m.html#gab8cfa0e292193fa37e0cc2e4911fa90a>`_
   and release it via `hipFree <https://rocm.docs.amd.com/projects/HIP/en/latest/.doxygen/docBin/html/group___memory.html#ga740d08da65cae1441ba32f8fedb863d1>`_.
Predefined Macros
=================

.. list-table::
   :header-rows: 1

   * - Macro
     - Description
   * - ``__HIPSTDPAR__``
     - Defined when Clang is compiling code in algorithm offload mode, enabled
       with the ``--hipstdpar`` compiler option.
   * - ``__HIPSTDPAR_INTERPOSE_ALLOC__``
     - Defined only when compiling in algorithm offload mode, when the user
       enables interposition mode with the ``--hipstdpar-interpose-alloc``
       compiler option, indicating that all dynamic memory allocation /
       deallocation functions should be replaced with accelerator aware
       variants.
Restrictions
============

We define two modes in which runtime execution can occur:

1. **HMM Mode** - this assumes that the
   `HMM <https://docs.kernel.org/mm/hmm.html>`_ subsystem of the Linux kernel
   is used to provide transparent on-demand paging i.e. memory obtained from a
   system / OS allocator, such as via a call to ``malloc`` or ``operator new``,
   is directly accessible to the accelerator and it follows the C++ memory
   model;
2. **Interposition Mode** - this is a fallback mode for cases where transparent
   on-demand paging is unavailable (e.g. in the Windows OS), which means that
   memory must be allocated via an accelerator aware mechanism, and system
   allocated memory is inaccessible to the accelerator.

The following restrictions imposed on user code apply to both modes:

1. Pointers to function, and all associated features, such as e.g. dynamic
   polymorphism, cannot be used (directly or transitively) by the user provided
   callable passed to an algorithm invocation;
2. Global / namespace scope / ``static`` / ``thread`` storage duration
   variables cannot be used (directly or transitively) in name by the user
   provided callable;

   - When executing in **HMM Mode** they can be used in address e.g.:

   .. code-block:: C++

      namespace { int foo = 42; }

      bool never(const std::vector<int>& v) {
         return std::any_of(std::execution::par_unseq, std::cbegin(v), std::cend(v), [](auto&& x) {
            return x == foo;
         });
      }

      bool only_in_hmm_mode(const std::vector<int>& v) {
         return std::any_of(std::execution::par_unseq, std::cbegin(v), std::cend(v),
                            [p = &foo](auto&& x) { return x == *p; });
      }
3. Only algorithms that are invoked with the ``parallel_unsequenced_policy``
   are candidates for offload;
4. Only algorithms that are invoked with iterator arguments that model
   `random_access_iterator <https://en.cppreference.com/w/cpp/iterator/random_access_iterator>`_
   are candidates for offload;
5. `Exceptions <https://en.cppreference.com/w/cpp/language/exceptions>`_ cannot
   be used by the user provided callable;
6. Dynamic memory allocation (e.g. ``operator new``) cannot be used by the user
   provided callable;
7. Selective offload is not possible i.e. it is not possible to indicate that
   only some algorithms invoked with the ``parallel_unsequenced_policy`` are to
   be executed on the accelerator.
In addition to the above, using **Interposition Mode** imposes the following
additional restrictions:

1. All code that is expected to interoperate has to be recompiled with the
   ``--hipstdpar-interpose-alloc`` flag i.e. it is not safe to compose
   libraries that have been independently compiled;
2. Automatic storage duration (i.e. stack allocated) variables cannot be used
   (directly or transitively) by the user provided callable e.g.:

   .. code-block:: c++

      bool never(const std::vector<int>& v, int n) {
         return std::any_of(std::execution::par_unseq, std::cbegin(v), std::cend(v),
                            [p = &n](auto&& x) { return x == *p; });
      }
Current Support
===============

At the moment, C++ Standard Parallelism Offload is only available for AMD GPUs,
when the `ROCm <https://rocm.docs.amd.com/en/latest/>`_ stack is used, on the
Linux operating system. Support is summarised in the following table:

.. list-table::
   :header-rows: 1

   * - `Processor <https://llvm.org/docs/AMDGPUUsage.html#amdgpu-processors>`_
     - HMM Mode
     - Interposition Mode
   * - GCN GFX9 (Vega)
     - YES
     - YES
   * - GCN GFX10.1 (RDNA 1)
     - *NO*
     - YES
   * - GCN GFX10.3 (RDNA 2)
     - *NO*
     - YES
   * - GCN GFX11 (RDNA 3)
     - *NO*
     - YES

The minimum Linux kernel version for running in HMM mode is 6.4.

The forwarding header can be obtained from
`its GitHub repository <https://github.com/ROCmSoftwarePlatform/roc-stdpar>`_.
It will be packaged with a future `ROCm <https://rocm.docs.amd.com/en/latest/>`_
release. Because accelerated algorithms are provided via
`rocThrust <https://rocm.docs.amd.com/projects/rocThrust/en/latest/>`_, a
transitive dependency on
`rocPrim <https://rocm.docs.amd.com/projects/rocPRIM/en/latest/>`_ exists. Both
can be obtained either by installing their associated components of the
`ROCm <https://rocm.docs.amd.com/en/latest/>`_ stack, or from their respective
repositories. The list of algorithms that can be offloaded is available
`here <https://github.com/ROCmSoftwarePlatform/roc-stdpar#algorithm-support-status>`_.
HIP Specific Elements
---------------------

1. There is no defined interop with the
   `HIP kernel language <https://rocm.docs.amd.com/projects/HIP/en/latest/reference/kernel_language.html>`_;
   whilst things like using ``__device__`` annotations might accidentally
   "work", they are not guaranteed to, and thus cannot be relied upon by user
   code;

   - A consequence of the above is that both bitcode linking and linking
     relocatable object files will "work", but neither is guaranteed to keep
     working, nor is either actively tested at the moment; this restriction
     might be relaxed in the future.

2. Combining explicit HIP, CUDA or OpenMP Offload compilation with
   ``--hipstdpar`` based offloading is not allowed or supported in any way.
3. There is no way to target different accelerators via a standard algorithm
   invocation (`this might be addressed in future C++ standards <https://www.open-std.org/jtc1/sc22/wg21/docs/papers/2023/p2500r1.html>`_);
   an unsafe (per the point above) way of achieving this is to spawn new
   threads and invoke the `hipSetDevice <https://rocm.docs.amd.com/projects/HIP/en/latest/.doxygen/docBin/html/group___device.html#ga43c1e7f15925eeb762195ccb5e063eae>`_
   interface e.g.:

   .. code-block:: c++

      int accelerator_0 = ...;
      int accelerator_1 = ...;

      bool multiple_accelerators(const std::vector<int>& u, const std::vector<int>& v) {
         std::atomic<unsigned int> r{0u};

         std::thread t0{[&]() {
            hipSetDevice(accelerator_0);

            r += std::count(std::execution::par_unseq, std::cbegin(u), std::cend(u), 42);
         }};
         std::thread t1{[&]() {
            hipSetDevice(accelerator_1);

            r += std::count(std::execution::par_unseq, std::cbegin(v), std::cend(v), 314152);
         }};

         t0.join();
         t1.join();

         return r;
      }

   Note that this is a temporary, unsafe workaround for a deficiency in the C++
   Standard.
Open Questions / Future Developments
====================================

1. The restriction on the use of global / namespace scope / ``static`` /
   ``thread`` storage duration variables in offloaded algorithms will be lifted
   in the future, when running in **HMM Mode**;
2. The restriction on the use of dynamic memory allocation in offloaded
   algorithms will be lifted in the future;
3. The restriction on the use of pointers to function, and associated features
   such as dynamic polymorphism, might be lifted in the future, when running in
   **HMM Mode**;
4. Offload support might be extended to cases where the ``parallel_policy`` is
   used for some or all targets.
