- These kernels are usually launched inside AMReX's :cpp:`MFIter` and :cpp:`ParIter`
  loops, since in AMReX's approach to parallelism it is assumed that separate :cpp:`Box` objects
  can be processed independently. However, AMReX also provides a :cpp:`MultiFab` version
  of :cpp:`ParallelFor` that can process an entire level's worth of :cpp:`Box` objects in
  a single kernel launch when it is safe to do so.

- AMReX can utilize GPU managed memory to automatically handle memory
  movement for mesh and particle data. Simple data structures, such
  as :cpp:`IntVect`\s, can be passed by value, and complex data structures, such as
  :cpp:`FArrayBox`\es, have specialized AMReX classes to handle the
  data movement for the user. This is particularly useful for the early stages
  of porting an application to GPUs. However, for the best performance on a
  variety of platforms, we recommend disabling managed memory and handling
  host/device data migration explicitly. Managed memory is not used by
  :cpp:`FArrayBox` and :cpp:`MultiFab` by default.

- Best performance is usually achieved by keeping mesh and particle data structures
  on the GPU for as long as possible, minimizing movement back to the CPU.
  In many AMReX applications, the mesh and particle data can stay on the GPU for most
  subroutines except for I/O operations.

- AMReX further parallelizes GPU applications by utilizing streams.
  Streams guarantee execution order of kernels within the same stream, while
@@ -613,7 +603,7 @@ SUNDIALS CUDA vector:
GPU Safe Classes and Functions
==============================

AMReX GPU work takes place inside of MFIter and ParIter loops.
Therefore, there are two ways classes and functions have been modified
to interact with the GPU:
@@ -624,7 +614,7 @@ such as :cpp:`amrex::min` and :cpp:`amrex::max`. In specialized cases,
classes are labeled such that the object can be constructed, destructed
and its functions can be implemented on the device, including ``IntVect``.

2. Functions that contain MFIter or ParIter loops have been rewritten
   to contain device launches. For example, the :cpp:`FillBoundary`
   function cannot be called from device code, but calling it from
   the CPU will launch GPU kernels if AMReX is compiled with GPU support.
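
A minimal sketch of such a function, written with AMReX's C++ launch syntax
(assuming a filled :cpp:`MultiFab` named ``mf``; the function and variable names
here are illustrative, not part of AMReX):

::

    void plusOne (amrex::MultiFab& mf)
    {
        // The MFIter loop itself runs on the CPU; each iteration
        // launches a GPU kernel over one box of the MultiFab.
        for (amrex::MFIter mfi(mf); mfi.isValid(); ++mfi)
        {
            const amrex::Box& bx = mfi.validbox();
            const amrex::Array4<amrex::Real>& a = mf.array(mfi);
            amrex::ParallelFor(bx,
            [=] AMREX_GPU_DEVICE (int i, int j, int k) noexcept
            {
                a(i,j,k) += 1.0;
            });
        }
    }

Like :cpp:`FillBoundary`, such a function is called from CPU code; the GPU
offload happens inside the loop via :cpp:`amrex::ParallelFor`.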
@@ -1597,11 +1587,34 @@ Particle Support

.. _sec:gpu:particle:

As with ``MultiFab``, particle data stored in AMReX ``ParticleContainer`` classes can be
stored in GPU-accessible memory when AMReX is compiled with GPU support. The type of
memory used by a given ``ParticleContainer`` can be controlled by the ``Allocator``
template parameter. By default, when compiled with GPU support, ``ParticleContainer``
uses ``The_Arena()``. This means that the :cpp:`dataPtr` associated with particle data
can be passed into GPU kernels. These kernels can be launched with a variety of
approaches, including AMReX's native kernel launching mechanisms as well as OpenMP and
OpenACC. Using AMReX's C++ syntax, a kernel launch involving particle data might look like:

.. highlight:: c++

::

    for (MyParIter pti(pc, lev); pti.isValid(); ++pti)
    {
        auto& ptile = pti.GetParticleTile();
        auto ptd = ptile.getParticleTileData();
        const auto np = ptile.numParticles();
        amrex::ParallelFor(np,
        [=] AMREX_GPU_DEVICE (const int ip) noexcept
        {
            ptd.id(ip).make_invalid();
        });
    }

The above code simply invalidates all particles on all particle tiles. The
``ParticleTileData`` object is analogous to ``Array4`` in that it stores pointers to
particle data and can be used on either the host or the device. This is a convenient
way to pass particle data into GPU kernels because the same object can be used
regardless of whether the data layout is AoS or SoA.

An example Fortran particle subroutine offloaded via OpenACC might look like the following: