Commit 2211077

ax3l and AlexanderSinn authored
AMReX SIMD Helpers (#4520)
## Summary

AMReX does not yet have a concept to help users write effective SIMD code on CPU, beyond relying on auto-vectorization and pragmas, which are unreliable for any sufficiently complex code. [1]

Luckily, C++ `std::datapar` was just accepted into C++26, which gives us an easy way in to write portable SIMD/scalar code. Yet, I have not found a compiler/stdlib with support for it so far, so I finally had a play with the C++17 `<experimental/simd>` headers, which are not as complete as C++26 but a good entry point, especially when complemented with the https://github.com/mattkretz/vir-simd library.

This PR adds initial support for portable user code by providing:
- build system support: `AMReX_SIMD` (default is OFF), relying on [vir-simd](https://github.com/mattkretz/vir-simd)
- an `AMReX_SIMD.H` header that handles includes & helper types
- `ParallelForSIMD<SIMD_WIDTH>(...)`

## Additional background

[1] Fun fact one: As written in the [story behind Intel's ispc compiler](https://pharr.org/matt/blog/2018/04/18/ispc-origins) and credited to [Tim Foley](http://graphics.stanford.edu/~tfoley/), *auto-vectorization is not a programming model.*

Fun fact two: This is as ad hoc as the implementation of [data parallel types / SIMD in Kokkos](https://kokkos.org/kokkos-core-wiki/API/simd/simd.html), it seems.
## User-Code Examples & Benchmark

[Please see this ImpactX PR for details.](BLAST-ImpactX/impactx#1002)

## Checklist

- [x] clean up commits (separate commits)
- [x] finalize fallbacks & CI checks
  - [ ] add a `vir::stdx::simd` test in CI
  - [x] CMake
  - [ ] GnuMake
- [x] `AMReX_SIMD.H`
- [x] `ParallelForSIMD`
- [x] `ParticleIdWrapper::make_(in)valid(mask)`
- [x] clean up `sincos` support
- [x] `SmallMatrix`
- [x] Support for `GpuComplex` (minimal)
- [x] Support [passing WIDTH as compile-time meta-data](https://godbolt.org/z/7455hqrEc) to the callee in `ParallelForSIMD`
- [ ] include documentation in the code and/or rst files, if appropriate
- [x] add `vir::stdx::simd` in package managers:
  - [x] Spack [vir-simd](spack/spack-packages#332)
  - [x] Conda [vir-simd](conda-forge/staged-recipes#30377)

## Future Ideas / PRs

- allocate particle arrays aligned so we can use [stdx::vector_aligned](https://en.cppreference.com/w/cpp/experimental/simd/vector_aligned.html) (for [copies](https://en.cppreference.com/w/cpp/experimental/simd/simd/copy_from) into/out of vector registers - note: makes no difference anymore on modern CPUs)
- Support more/all functions in `ParticleIdWrapper`/`ParticleCpuWrapper`
- Support for [vir::simdize<std::complex<T>>](mattkretz/vir-simd#42) instead of `GpuComplex<SIMD>`
- `ParallelFor` ND support
- `ParallelFor`/`ParallelForSIMD`: one could, maybe, with enable-if magic, etc., fuse them into a single name again
- CMake superbuild: `vir-simd` auto-download for convenience (opt-out)
- Build system: "SIMD provider" selection, once we can opt in to a C++26 compiler+stdlib instead of C++17 TS2 + vir-simd
- Update AMReX package in package management:
  - Spack [vir-simd](spack/spack-packages#332)
  - Conda [vir-simd](conda-forge/staged-recipes#30377)

---------

Co-authored-by: Alexander Sinn <[email protected]>
1 parent fa6f5e3 commit 2211077

16 files changed: +379 −26 lines

Docs/sphinx_documentation/source/BuildingAMReX.rst

Lines changed: 4 additions & 0 deletions

```diff
@@ -455,6 +455,8 @@ The list of available options is reported in the :ref:`table <tab:cmakevar>` bel
 +------------------------------+-------------------------------------------------+-------------------------+-----------------------+
 | AMReX_MPI                    | Build with MPI support                          | YES                     | YES, NO               |
 +------------------------------+-------------------------------------------------+-------------------------+-----------------------+
+| AMReX_SIMD                   | Enable SIMD Primitives (using vir::stdx::simd)  | NO                      | YES, NO               |
++------------------------------+-------------------------------------------------+-------------------------+-----------------------+
 | AMReX_OMP                    | Build with OpenMP support                       | NO                      | YES, NO               |
 +------------------------------+-------------------------------------------------+-------------------------+-----------------------+
 | AMReX_GPU_BACKEND            | Build with on-node, accelerated GPU backend     | NONE                    | NONE, SYCL, HIP, CUDA |
@@ -683,6 +685,8 @@ A list of AMReX component names and related configure options are shown in the t
 +------------------------------+-----------------+
 | AMReX_MPI                    | MPI             |
 +------------------------------+-----------------+
+| AMReX_SIMD                   | SIMD            |
++------------------------------+-----------------+
 | AMReX_OMP                    | OMP             |
 +------------------------------+-----------------+
 | AMReX_GPU_BACKEND            | CUDA, HIP, SYCL |
```
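For context, the new option is enabled at configure time like any other AMReX CMake switch. A sketch (assuming the vir-simd headers are discoverable by the build, e.g. installed system-wide or vendored):

```shell
# Hypothetical configure invocation: enables the new SIMD primitives.
# Assumes vir-simd can be found by AMReX's build system.
cmake -S . -B build -DAMReX_SIMD=ON
cmake --build build -j 8
```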

Src/Base/AMReX_GpuComplex.H

Lines changed: 12 additions & 8 deletions

```diff
@@ -53,8 +53,9 @@ struct alignas(2*sizeof(T)) GpuComplex
     /**
      * \brief Add a real number to this complex number.
      */
+    template <typename U>
     AMREX_GPU_HOST_DEVICE AMREX_FORCE_INLINE
-    GpuComplex<T>& operator+= (const T& a_t) noexcept
+    GpuComplex<T>& operator+= (const U& a_t) noexcept
     {
         m_real += a_t;
         return *this;
@@ -63,8 +64,9 @@ struct alignas(2*sizeof(T)) GpuComplex
     /**
      * \brief Subtract a real number from this complex number.
      */
+    template <typename U>
     AMREX_GPU_HOST_DEVICE AMREX_FORCE_INLINE
-    GpuComplex<T>& operator-= (const T& a_t) noexcept
+    GpuComplex<T>& operator-= (const U& a_t) noexcept
     {
         m_real -= a_t;
         return *this;
@@ -73,8 +75,9 @@ struct alignas(2*sizeof(T)) GpuComplex
     /**
      * \brief Multiply this complex number by a real.
      */
+    template <typename U>
     AMREX_GPU_HOST_DEVICE AMREX_FORCE_INLINE
-    GpuComplex<T>& operator*= (const T& a_t) noexcept
+    GpuComplex<T>& operator*= (const U& a_t) noexcept
     {
         m_real *= a_t;
         m_imag *= a_t;
@@ -84,8 +87,9 @@ struct alignas(2*sizeof(T)) GpuComplex
     /**
      * \brief Divide this complex number by a real.
      */
+    template <typename U>
     AMREX_GPU_HOST_DEVICE AMREX_FORCE_INLINE
-    GpuComplex<T>& operator/= (const T& a_t) noexcept
+    GpuComplex<T>& operator/= (const U& a_t) noexcept
     {
         m_real /= a_t;
         m_imag /= a_t;
@@ -247,9 +251,9 @@ GpuComplex<T> operator+ (const T& a_x, const GpuComplex<T>& a_y) noexcept
 /**
  * \brief Multiply two complex numbers.
  */
-template <typename T>
+template <typename T, typename U>
 AMREX_GPU_HOST_DEVICE AMREX_FORCE_INLINE
-GpuComplex<T> operator* (const GpuComplex<T>& a_x, const GpuComplex<T>& a_y) noexcept
+GpuComplex<T> operator* (const GpuComplex<T>& a_x, const GpuComplex<U>& a_y) noexcept
 {
     GpuComplex<T> r = a_x;
     r *= a_y;
@@ -259,9 +263,9 @@ GpuComplex<T> operator* (const GpuComplex<T>& a_x, const GpuComplex<T>& a_y) noe
 /**
  * \brief Multiply a complex number by a real one.
  */
-template <typename T>
+template <typename T, typename U>
 AMREX_GPU_HOST_DEVICE AMREX_FORCE_INLINE
-GpuComplex<T> operator* (const GpuComplex<T>& a_x, const T& a_y) noexcept
+GpuComplex<T> operator* (const GpuComplex<T>& a_x, const U& a_y) noexcept
 {
     GpuComplex<T> r = a_x;
     r *= a_y;
```

Src/Base/AMReX_GpuLaunch.H

Lines changed: 1 addition & 0 deletions

```diff
@@ -234,6 +234,7 @@ namespace Gpu {
 #else
 #include <AMReX_GpuLaunchMacrosC.H>
 #include <AMReX_GpuLaunchFunctsC.H>
+#include <AMReX_SIMD.H>
 #endif
 
 #include <AMReX_GpuLaunch.nolint.H>
```

Src/Base/AMReX_GpuLaunchFunctsC.H

Lines changed: 43 additions & 0 deletions

```diff
@@ -4,6 +4,24 @@
 
 namespace amrex {
 
+/** Helper type to store/access the SIMD width in ParallelForSIMD lambdas
+ *
+ * Use instead of int as the running index i. Used to pass the
+ * SIMD WIDTH as compile-time meta-data into a called function/method.
+ *
+ * @tparam WIDTH SIMD width in elements
+ * @tparam N index type (integer)
+ */
+template<int WIDTH, class N=int>
+struct SIMDindex
+{
+    /** SIMD width in elements */
+    static constexpr int width = WIDTH;
+
+    /** The linear loop index of ParallelFor(SIMD) */
+    N index = 0;
+};
+
 namespace detail {
 
 // call_f_scalar_handler
@@ -175,6 +193,31 @@ void ParallelFor (T n, L&& f) noexcept
     ParallelFor(n, std::forward<L>(f));
 }
 
+/** ParallelFor with a SIMD Width (in elements)
+ *
+ * SIMD load/write-back operations need to be performed before/after calling this.
+ *
+ * @tparam WIDTH SIMD width in elements
+ * @tparam N index type (integer)
+ * @tparam L function/functor to call per SIMD set of elements
+ */
+template <int WIDTH, typename N, typename L, typename M=std::enable_if_t<std::is_integral_v<N>> >
+AMREX_ATTRIBUTE_FLATTEN_FOR
+void ParallelForSIMD (N n, L const& f) noexcept
+{
+    N i = 0;
+    // vectorize full lanes
+    for (; i + WIDTH <= n; i+=WIDTH) {
+        f(SIMDindex<WIDTH, N>{i});
+    }
+    // scalar handling of the remainder
+    // note: we could make the remainder calls faster, by repeatedly
+    // halving the SIMD width until we reach 1
+    for (; i < n; ++i) {
+        f(SIMDindex<1, N>{i});
+    }
+}
+
 template <typename T, typename L, typename M=std::enable_if_t<std::is_integral_v<T>> >
 void ParallelFor (Gpu::KernelInfo const&, T n, L&& f) noexcept
 {
```

Src/Base/AMReX_Math.H

Lines changed: 40 additions & 0 deletions

```diff
@@ -5,6 +5,7 @@
 #include <AMReX_GpuQualifiers.H>
 #include <AMReX_Extension.H>
 #include <AMReX_INT.H>
+#include <AMReX_SIMD.H>
 #include <AMReX_REAL.H>
 #include <cmath>
 #include <cstdlib>
@@ -133,6 +134,24 @@ namespace detail {
 }
 
 //! Return sine and cosine of given number
+template<typename T_Real>
+AMREX_GPU_HOST_DEVICE AMREX_FORCE_INLINE
+std::pair<T_Real,T_Real> sincos (T_Real x)
+{
+#ifdef AMREX_USE_SIMD
+    using namespace amrex::simd::stdx;
+#else
+    using namespace std;
+#endif
+
+    std::pair<T_Real,T_Real> r;
+    r.first = sin(x);
+    r.second = cos(x);
+    return r;
+}
+
+//! Return sine and cosine of given number
+template<>
 AMREX_GPU_HOST_DEVICE AMREX_FORCE_INLINE
 std::pair<double,double> sincos (double x)
 {
@@ -147,6 +166,7 @@ std::pair<double,double> sincos (double x)
 }
 
 //! Return sine and cosine of given number
+template<>
 AMREX_GPU_HOST_DEVICE AMREX_FORCE_INLINE
 std::pair<float,float> sincos (float x)
 {
@@ -161,6 +181,25 @@ std::pair<float,float> sincos (float x)
 }
 
 //! Return sin(pi*x) and cos(pi*x) given x
+template<typename T_Real>
+AMREX_GPU_HOST_DEVICE AMREX_FORCE_INLINE
+std::pair<T_Real,T_Real> sincospi (T_Real x)
+{
+#ifdef AMREX_USE_SIMD
+    using namespace amrex::simd::stdx;
+#else
+    using namespace std;
+#endif
+
+    T_Real const px = pi<T_Real>() * x;
+    std::pair<T_Real,T_Real> r;
+    r.first = sin(px);
+    r.second = cos(px);
+    return r;
+}
+
+//! Return sin(pi*x) and cos(pi*x) given x
+template<>
 AMREX_GPU_HOST_DEVICE AMREX_FORCE_INLINE
 std::pair<double,double> sincospi (double x)
 {
@@ -175,6 +214,7 @@ std::pair<double,double> sincospi (double x)
 }
 
 //! Return sin(pi*x) and cos(pi*x) given x
+template<>
 AMREX_GPU_HOST_DEVICE AMREX_FORCE_INLINE
 std::pair<float,float> sincospi (float x)
 {
```

Src/Base/AMReX_SIMD.H

Lines changed: 155 additions & 0 deletions (new file)

````cpp
#ifndef AMREX_SIMD_H_
#define AMREX_SIMD_H_

#include <AMReX_Config.H>

#include <AMReX_REAL.H>

#ifdef AMREX_USE_SIMD
// TODO make SIMD provider configurable: VIR (C++17 TS2) or C++26 (later)
#   include <vir/simd.h> // includes SIMD TS2 header <experimental/simd>
#endif

#include <cstdint>
#include <type_traits>


namespace amrex::simd
{
// TODO make SIMD provider configurable: VIR (C++17 TS2) or C++26 (later)
//namespace stdx = std::experimental;
// for https://en.cppreference.com/w/cpp/experimental/simd/simd_cast.html
namespace stdx {
#ifdef AMREX_USE_SIMD
    using namespace std::experimental;
    using namespace std::experimental::__proposed;
    using namespace vir::stdx;
#else
    // fallback implementations for functions that are commonly used in portable code paths

    AMREX_GPU_HOST_DEVICE AMREX_FORCE_INLINE
    bool any_of (bool const v) { return v; }
#endif
}

// TODO: move to AMReX_REAL.H?

#ifdef AMREX_USE_SIMD
// TODO: not sure why std::experimental::simd_abi::native<T> does not work, so we use this long version
constexpr auto native_simd_size_real = std::experimental::native_simd<amrex::Real>::size();
constexpr auto native_simd_size_particlereal = std::experimental::native_simd<amrex::ParticleReal>::size();

// Note: to make use of not only vector registers but also ILP, user might want to use * 2 or more of the native size
// for selected compute kernels.
// TODO Check if a default with * 2 or similar is sensible.
template<int SIMD_WIDTH = native_simd_size_real>
using SIMDReal = std::experimental::fixed_size_simd<amrex::Real, SIMD_WIDTH>;

template<int SIMD_WIDTH = native_simd_size_particlereal>
using SIMDParticleReal = std::experimental::fixed_size_simd<amrex::ParticleReal, SIMD_WIDTH>;

// Type that has the same amount of IdCpu SIMD elements as the SIMDParticleReal type
template<typename T_ParticleReal = SIMDParticleReal<>>
using SIMDIdCpu = std::experimental::rebind_simd_t<std::uint64_t, T_ParticleReal>;
#else
constexpr auto native_simd_size_real = 1;
constexpr auto native_simd_size_particlereal = 1;

template<int SIMD_WIDTH = native_simd_size_real>
using SIMDReal = amrex::Real;

template<int SIMD_WIDTH = native_simd_size_particlereal>
using SIMDParticleReal = amrex::ParticleReal;

// Type that has the same amount of IdCpu SIMD elements as the SIMDParticleReal type
template<typename T_ParticleReal = SIMDParticleReal<>>
using SIMDIdCpu = std::uint64_t;
#endif

namespace detail {
    struct InternalVectorized {};
}

/** Mixin Helper Class
 *
 * Use this class as a mixin (base) class to simplify writing functors that support/do not support
 * ParallelForSIMD.
 *
 * Example:
 * ```c++
 * struct Compute : public Vectorized<>
 * {
 *     template<typename T_Real>
 *     AMREX_GPU_HOST_DEVICE AMREX_FORCE_INLINE
 *     void operator() (T_Real & AMREX_RESTRICT x) { ... }
 * };
 *
 * // ... call site below
 * {
 *     Compute c;
 *
 *     if constexpr (amrex::simd::is_vectorized<Compute>) {
 *         ParallelForSIMD<...>(np, c);
 *     } else {
 *         ParallelFor(np, c);
 *     }
 * }
 * ```
 */
template<int SIMD_WIDTH = native_simd_size_real>
struct
Vectorized : detail::InternalVectorized
{
    //! SIMD width in elements
    static constexpr int simd_width = SIMD_WIDTH;
};

/** Check if a Functor Class works with amrex::ParallelForSIMD
 *
 * @see amrex::simd::Vectorized
 */
template<typename T>
constexpr bool is_vectorized = std::is_base_of_v<detail::InternalVectorized, T>;

/** Check if a function argument is declared as non-const
 *
 * Use in conjunction with conditional write-back logic from vector registers, e.g.,
 *
 * ```c++
 * template<typename T_Real>
 * void compute (T_Real & AMREX_RESTRICT x,
 *               T_Real const & AMREX_RESTRICT y) { ... }
 *
 * part_x.copy_from(&m_part_x[i], stdx::element_aligned);
 * part_y.copy_from(&m_part_y[i], stdx::element_aligned);
 *
 * compute(part_x, part_y);
 *
 * if constexpr (is_nth_arg_non_const(compute<double>, 0))
 *     part_x.copy_to(&m_part_x[i], stdx::element_aligned);
 * if constexpr (is_nth_arg_non_const(compute<double>, 1))
 *     part_y.copy_to(&m_part_y[i], stdx::element_aligned);
 * ```
 */
template<typename R, typename... Args>
constexpr bool is_nth_arg_non_const (R(*)(Args...), int n)
{
    constexpr bool val_arr[sizeof...(Args)] {!std::is_const_v<std::remove_reference_t<Args>>...};
    return val_arr[n];
}
// same for functors (const/non-const ::operator() members)
template<typename C, typename R, typename... Args>
constexpr bool is_nth_arg_non_const (R(C::*)(Args...), int n)
{
    constexpr bool val_arr[sizeof...(Args)] {!std::is_const_v<std::remove_reference_t<Args>>...};
    return val_arr[n];
}
template<typename C, typename R, typename... Args>
constexpr bool is_nth_arg_non_const (R(C::*)(Args...) const, int n)
{
    constexpr bool val_arr[sizeof...(Args)] {!std::is_const_v<std::remove_reference_t<Args>>...};
    return val_arr[n];
}

} // namespace amrex::simd

#endif
````
