Commit a05be06

committed: Deploying to main from @ AMReX-Codes/amrex@fb18121 🚀
1 parent 618a348

File tree
8 files changed: +137 −115 lines


amrex/docs_html/GPU.html

Lines changed: 62 additions & 53 deletions
Large diffs are not rendered by default.

amrex/docs_html/GPU_Chapter.html

Lines changed: 6 additions & 6 deletions
@@ -22,7 +22,7 @@
 <script src="_static/js/theme.js"></script>
 <link rel="index" title="Index" href="genindex.html" />
 <link rel="search" title="Search" href="search.html" />
-<link rel="next" title="Overview of AMReX GPU Strategy" href="GPU.html" />
+<link rel="next" title="Overview of AMReX GPU Support" href="GPU.html" />
 <link rel="prev" title="Time Integration" href="TimeIntegration_Chapter.html" />
 </head>
 
@@ -67,7 +67,7 @@
 <li class="toctree-l1"><a class="reference internal" href="FFT_Chapter.html">Discrete Fourier Transform</a></li>
 <li class="toctree-l1"><a class="reference internal" href="TimeIntegration_Chapter.html">Time Integration</a></li>
 <li class="toctree-l1 current"><a class="current reference internal" href="#">GPU</a><ul>
-<li class="toctree-l2"><a class="reference internal" href="GPU.html">Overview of AMReX GPU Strategy</a></li>
+<li class="toctree-l2"><a class="reference internal" href="GPU.html">Overview of AMReX GPU Support</a></li>
 <li class="toctree-l2"><a class="reference internal" href="GPU.html#building-gpu-support">Building GPU Support</a></li>
 <li class="toctree-l2"><a class="reference internal" href="GPU.html#gpu-namespace-and-macros">Gpu Namespace and Macros</a></li>
 <li class="toctree-l2"><a class="reference internal" href="GPU.html#memory-allocation">Memory Allocation</a></li>
@@ -121,9 +121,9 @@
 <section id="gpu">
 <span id="chap-gpu"></span><h1>GPU<a class="headerlink" href="#gpu" title="Permalink to this heading"></a></h1>
 <p>In this chapter, we will present the GPU support in AMReX. AMReX targets
-NVIDIA, AMD and Intel GPUs using their native vendor language and therefore
+NVIDIA, AMD and Intel GPUs using their native vendor languages and therefore
 requires CUDA, HIP/ROCm and SYCL, for NVIDIA, AMD and Intel GPUs, respectively.
-Users can also use OpenMP and/or OpenACC in their applications.</p>
+Users can also use OpenMP and/or OpenACC in their applications if desired.</p>
 <p>AMReX supports NVIDIA GPUs with compute capability &gt;= 6 and CUDA &gt;= 11, and
 AMD GPUs with ROCm &gt;= 5. While SYCL compilers are in development in
 preparation for Aurora, AMReX only officially supports the latest publicly
@@ -136,7 +136,7 @@
 <div class="toctree-wrapper compound">
 <p class="caption" role="heading"><span class="caption-text">Contents:</span></p>
 <ul>
-<li class="toctree-l1"><a class="reference internal" href="GPU.html">Overview of AMReX GPU Strategy</a></li>
+<li class="toctree-l1"><a class="reference internal" href="GPU.html">Overview of AMReX GPU Support</a></li>
 <li class="toctree-l1"><a class="reference internal" href="GPU.html#building-gpu-support">Building GPU Support</a></li>
 <li class="toctree-l1"><a class="reference internal" href="GPU.html#gpu-namespace-and-macros">Gpu Namespace and Macros</a></li>
 <li class="toctree-l1"><a class="reference internal" href="GPU.html#memory-allocation">Memory Allocation</a></li>
@@ -158,7 +158,7 @@
 </div>
 <footer><div class="rst-footer-buttons" role="navigation" aria-label="Footer">
 <a href="TimeIntegration_Chapter.html" class="btn btn-neutral float-left" title="Time Integration" accesskey="p" rel="prev"><span class="fa fa-arrow-circle-left" aria-hidden="true"></span> Previous</a>
-<a href="GPU.html" class="btn btn-neutral float-right" title="Overview of AMReX GPU Strategy" accesskey="n" rel="next">Next <span class="fa fa-arrow-circle-right" aria-hidden="true"></span></a>
+<a href="GPU.html" class="btn btn-neutral float-right" title="Overview of AMReX GPU Support" accesskey="n" rel="next">Next <span class="fa fa-arrow-circle-right" aria-hidden="true"></span></a>
 </div>
 
 <hr/>

amrex/docs_html/Visualization_Chapter.html

Lines changed: 2 additions & 2 deletions
@@ -23,7 +23,7 @@
 <link rel="index" title="Index" href="genindex.html" />
 <link rel="search" title="Search" href="search.html" />
 <link rel="next" title="Amrvis" href="Visualization.html" />
-<link rel="prev" title="Overview of AMReX GPU Strategy" href="GPU.html" />
+<link rel="prev" title="Overview of AMReX GPU Support" href="GPU.html" />
 </head>
 
 <body class="wy-body-for-nav">
@@ -133,7 +133,7 @@
 </div>
 </div>
 <footer><div class="rst-footer-buttons" role="navigation" aria-label="Footer">
-<a href="GPU.html" class="btn btn-neutral float-left" title="Overview of AMReX GPU Strategy" accesskey="p" rel="prev"><span class="fa fa-arrow-circle-left" aria-hidden="true"></span> Previous</a>
+<a href="GPU.html" class="btn btn-neutral float-left" title="Overview of AMReX GPU Support" accesskey="p" rel="prev"><span class="fa fa-arrow-circle-left" aria-hidden="true"></span> Previous</a>
 <a href="Visualization.html" class="btn btn-neutral float-right" title="Amrvis" accesskey="n" rel="next">Next <span class="fa fa-arrow-circle-right" aria-hidden="true"></span></a>
 </div>

Binary file not shown.

amrex/docs_html/_sources/GPU.rst.txt

Lines changed: 64 additions & 51 deletions
@@ -6,14 +6,17 @@
 
 .. _sec:gpu:overview:
 
-Overview of AMReX GPU Strategy
-==============================
+Overview of AMReX GPU Support
+=============================
 
-AMReX's GPU strategy focuses on providing performant GPU support
-with minimal changes and maximum flexibility. This allows
-application teams to get running on GPUs quickly while allowing
-long term performance tuning and programming model selection. AMReX
-uses the native programming language for GPUs: CUDA for NVIDIA, HIP
+AMReX's GPU support focuses on providing performance portability
+across a range of important architectures with minimal
+code changes required at the application level. This allows
+application teams to use a single, maintainable codebase that works
+on a variety of platforms while allowing for the performance tuning of specific,
+high-impact kernels if desired.
+
+Internally, AMReX uses the native programming languages for GPUs: CUDA for NVIDIA, HIP
 for AMD and SYCL for Intel. This will be designated with ``CUDA/HIP/SYCL``
 throughout the documentation. However, application teams can also use
 OpenACC or OpenMP in their individual codes.
@@ -22,33 +25,25 @@ At this time, AMReX does not support cross-native language compilation
 (HIP for non-AMD systems and SYCL for non Intel systems). It may work with
 a given version, but AMReX does not track or guarantee such functionality.
 
-When running AMReX on a CPU system, the parallelization strategy is a
-combination of MPI and OpenMP using tiling, as detailed in
-:ref:`sec:basics:mfiter:tiling`. However, tiling is ineffective on GPUs
-due to the overhead associated with kernel launching. Instead,
-efficient use of the GPU's resources is the primary concern. Improving
-resource efficiency allows a larger percentage of GPU threads to work
-simultaneously, increasing effective parallelism and decreasing the time
-to solution.
-
-When running on CPUs, AMReX uses an ``MPI+X`` strategy where the ``X``
-threads are used to perform parallelization techniques, like tiling.
-The most common ``X`` is ``OpenMP``. On GPUs, AMReX requires ``CUDA/HIP/SYCL``
-and can be further combined with other parallel GPU languages, including
-``OpenACC`` and ``OpenMP``, to control the offloading of subroutines
-to the GPU. This ``MPI+X+Y`` GPU strategy has been developed
-to give users the maximum flexibility to find the best combination of
-portability, readability and performance for their applications.
+AMReX uses an ``MPI+X`` approach to hierarchical parallelism. When running on
+CPUs, ``X`` is ``OpenMP``, and threads are used to process tiles assigned to the
+same MPI rank concurrently, as detailed in :ref:`sec:basics:mfiter:tiling`. On GPUs,
+``X`` is one of ``CUDA/HIP/SYCL``, and tiling is disabled by default
+to mitigate the overhead associated with kernel launching. Instead, kernels are usually
+launched at the ``Box`` level, and one or more cells
+in a given ``Box`` are mapped to each GPU thread, as detailed in :numref:`fig:gpu:threads`
+below.
 
 Presented here is an overview of important features of AMReX's GPU strategy.
 Additional information that is required for creating GPU applications is
 detailed throughout the rest of this chapter:
 
-- Each MPI rank offloads its work to a single GPU. ``(MPI ranks == Number of GPUs)``
+- Each MPI rank offloads its work to a single GPU. Multiple ranks can share the
+  same device, but for best performance we usually recommend ``(MPI ranks == Number of GPUs)``.
 
-- Calculations that can be offloaded efficiently to GPUs use GPU threads
-  to parallelize over a valid box at a time. This is done by launching over
-  a large number GPU threads that only work on a few cells each. This work
+- To provide performance portability, GPU kernels are usually launched through ``ParallelFor`` looping constructs
+  that use GPU extended lambdas to specify the work to be performed on each loop element. When compiled with GPU
+  support, these constructs launch kernels with a large number of GPU threads that only work on a few cells each. This work
   distribution is illustrated in :numref:`fig:gpu:threads`.
 
 .. |a| image:: ./GPU/gpu_2.png
@@ -70,31 +65,26 @@ detailed throughout the rest of this chapter:
 | The lo and hi of one tiled box are marked. | thread, each thread using a box with lo = hi. |
 +-----------------------------------------------------+------------------------------------------------------+
 
-- C++ macros and GPU extended lambdas are used to provide performance
-  portability while making the code as understandable as possible to
-  science-focused code teams.
+- These kernels are usually launched inside AMReX's :cpp:`MFIter` and :cpp:`ParIter`
+  loops, since in AMReX's approach to parallelism it is assumed that separate :cpp:`Box` objects
+  can be processed independently. However, AMReX also provides a :cpp:`MultiFab` version
+  of :cpp:`ParallelFor` that can process an entire level's worth of :cpp:`Box` objects in
+  a single kernel launch when it is safe to do so.
 
 - AMReX can utilize GPU managed memory to automatically handle memory
   movement for mesh and particle data. Simple data structures, such
   as :cpp:`IntVect`\s can be passed by value and complex data structures, such as
   :cpp:`FArrayBox`\es, have specialized AMReX classes to handle the
-  data movement for the user. Tests have shown CUDA managed memory
-  to be efficient and reliable, especially when applications remove
-  any unnecessary data accesses. However, managed memory is not used by
+  data movement for the user. This is particularly useful for the early stages
+  of porting an application to GPUs. However, for best performance on a
+  variety of platforms, we recommend disabling managed memory and handling
+  host/device data migration explicitly. Managed memory is not used by
   :cpp:`FArrayBox` and :cpp:`MultiFab` by default.
 
-- Application teams should strive to keep mesh and particle data structures
+- Best performance is usually achieved when keeping mesh and particle data structures
   on the GPU for as long as possible, minimizing movement back to the CPU.
-  This strategy lends itself to AMReX applications readily; the mesh and
-  particle data can stay on the GPU for most subroutines except for
-  of redistribution, communication and I/O operations.
-
-- AMReX's GPU strategy is focused on launching GPU kernels inside AMReX's
-  :cpp:`MFIter` and :cpp:`ParIter` loops. By performing GPU work within
-  :cpp:`MFIter` and :cpp:`ParIter` loops, GPU work is isolated to independent
-  data sets on well-established AMReX data objects, providing consistency and safety
-  that also matches AMReX's coding methodology. Similar tools are also available for
-  launching work outside of AMReX loops.
+  In many AMReX applications, the mesh and particle data can stay on the GPU for most
+  subroutines except for I/O operations.
 
 - AMReX further parallelizes GPU applications by utilizing streams.
   Streams guarantee execution order of kernels within the same stream, while
@@ -613,7 +603,7 @@ SUNDIALS CUDA vector:
 GPU Safe Classes and Functions
 ==============================
 
-AMReX GPU work takes place inside of MFIter and particle loops.
+AMReX GPU work takes place inside of MFIter and ParIter loops.
 Therefore, there are two ways classes and functions have been modified
 to interact with the GPU:
 
@@ -624,7 +614,7 @@ such as :cpp:`amrex::min` and :cpp:`amrex::max`. In specialized cases,
 classes are labeled such that the object can be constructed, destructed
 and its functions can be implemented on the device, including ``IntVect``.
 
-2. Functions that contain MFIter or particle loops have been rewritten
+2. Functions that contain MFIter or ParIter loops have been rewritten
    to contain device launches. For example, the :cpp:`FillBoundary`
    function cannot be called from device code, but calling it from
    CPU will launch GPU kernels if AMReX is compiled with GPU support.
@@ -1597,11 +1587,34 @@ Particle Support
 
 .. _sec:gpu:particle:
 
-As with ``MultiFab``, particle data stored in AMReX ``ParticleContainer`` classes are
-stored in GPU memory when AMReX is compiled with ``USE_CUDA=TRUE``. This means that the :cpp:`dataPtr` associated with particles
+As with ``MultiFab``, particle data stored in AMReX ``ParticleContainer`` classes can be
+stored in GPU-accessible memory when AMReX is compiled with GPU support. The type of memory used by a given ``ParticleContainer`` can be controlled
+by the ``Allocator`` template parameter. By default, when compiled with GPU support ``ParticleContainer`` uses ``The_Arena()``. This means that the :cpp:`dataPtr` associated with particle data
 can be passed into GPU kernels. These kernels can be launched with a variety of approaches,
-including Cuda C / Fortran and OpenACC. An example Fortran particle subroutine offloaded via OpenACC might
-look like the following:
+including AMReX's native kernel launching mechanisms as well as OpenMP and OpenACC. Using AMReX's C++ syntax, a kernel launch involving particle data might look like:
+
+.. highlight:: c++
+
+::
+
+    for (MyParIter pti(pc, lev); pti.isValid(); ++pti)
+    {
+        auto& ptile = pti.GetParticleTile();
+        auto ptd = ptile.getParticleTileData();
+        const auto np = ptile.numParticles();
+        amrex::ParallelFor(np,
+            [=] AMREX_GPU_DEVICE (const int ip) noexcept
+            {
+                ptd.id(ip).make_invalid();
+            });
+    }
+
+The above code simply invalidates all particles on all particle tiles. The ``ParticleTileData``
+object is analogous to ``Array4`` in that it stores pointers to particle data and can be used
+on either the host or the device. This is a convenient way to pass particle data into GPU kernels
+because the same object can be used regardless of whether the data layout is AoS or SoA.
+
+An example Fortran particle subroutine offloaded via OpenACC might look like the following:
 
 .. highlight:: fortran
 
amrex/docs_html/_sources/GPU_Chapter.rst.txt

Lines changed: 2 additions & 2 deletions
@@ -4,9 +4,9 @@ GPU
 ===
 
 In this chapter, we will present the GPU support in AMReX. AMReX targets
-NVIDIA, AMD and Intel GPUs using their native vendor language and therefore
+NVIDIA, AMD and Intel GPUs using their native vendor languages and therefore
 requires CUDA, HIP/ROCm and SYCL, for NVIDIA, AMD and Intel GPUs, respectively.
-Users can also use OpenMP and/or OpenACC in their applications.
+Users can also use OpenMP and/or OpenACC in their applications if desired.
 
 AMReX supports NVIDIA GPUs with compute capability >= 6 and CUDA >= 11, and
 AMD GPUs with ROCm >= 5. While SYCL compilers are in development in

amrex/docs_html/objects.inv

-3 Bytes
Binary file not shown.

amrex/docs_html/searchindex.js

Lines changed: 1 addition & 1 deletion
Some generated files are not rendered by default.

0 commit comments