Feat/mulhilo #1344
Conversation
Some toolchains (notably certain GCC builds) define shift and mul immediate intrinsics as macros that apply a textual C-style cast to their operand. That cast does not traverse the multi-level alias inheritance of simd_register (e.g. avx512bw -> avx512dq -> avx512cd -> avx512f), so a batch<T, ISA> fails to convert to its native register type in those contexts. Declare the conversion operator on batch itself so the native type is always one member-lookup away.
serge-sans-paille
left a comment
First round of comments, I haven't reviewed the avx512 nor the rvv parts yet.
| * textual C-style cast inside the macro does not traverse the alias
| * inheritance chain. Declaring the operator here makes it visible on
| * the batch type directly. */
| XSIMD_INLINE operator register_type() const noexcept
you could use using types::simd_register<T, A>::operator register_type instead.
Tried it, had to revert. The using-declaration is evaluated at the
point of class-template instantiation, not lazily on use. But
types::simd_register<T, A> only carries operator register_type for
supported (T, A) pairs — for unsupported pairs (e.g. batch<double, neon> on 32-bit ARM, where double isn't a native NEON element type)
the primary template is empty. With the using-decl in place, code that
merely names batch<double, neon> (xsimd's arch dispatcher walks the
list and instantiates each batch<T, Arch> it considers) fails to
compile:
error: 'operator xsimd::batch<double, xsimd::neon>::register_type' has
not been declared in 'struct xsimd::types::simd_register<double,
xsimd::neon>'
A redefined member operator is only instantiated when called, so
unsupported pairs stay well-formed up to the point a user actually
uses them. Restored the redefined inline operator and added a comment
explaining the constraint so this isn't re-suggested.
(Caught by the aarch64 cross-build / qemu run; it slips through the
x86 native run because the dispatcher there never walks neon.)
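For future readers, here is a minimal standalone sketch of the timing difference (names are illustrative, not xsimd's; for simplicity the "unsupported" base still exposes the `register_type` name but not the conversion operator, which is consistent with the error quoted above):

```cpp
// Sketch only: supported_base / unsupported_base stand in for the specialized
// vs. empty simd_register cases; the instantiation timing is the point here.
struct supported_base
{
    using register_type = int;                           // stand-in for the native register
    operator register_type() const noexcept { return 0; }
};

struct unsupported_base
{
    using register_type = int;                           // the name exists, the operator does not
};

template <class Base>
struct batch_with_using : Base
{
    using register_type = typename Base::register_type;
    using Base::operator register_type;                  // checked as soon as the class is instantiated
};

template <class Base>
struct batch_with_redefine : Base
{
    using register_type = typename Base::register_type;
    operator register_type() const noexcept              // body instantiated only when called
    {
        return static_cast<Base const&>(*this);          // checked only if the operator is actually used
    }
};

int main()
{
    batch_with_using<supported_base> a; (void)a;         // fine: the base provides the operator
    batch_with_redefine<unsupported_base> b; (void)b;    // fine: the body is never instantiated
    // batch_with_using<unsupported_base> c;             // error: 'operator ... register_type' has
    //                                                   // not been declared in 'unsupported_base'
}
```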
| #if defined(__SIZEOF_INT128__)
| template <class T>
| typename std::enable_if<std::is_integral<T>::value && (sizeof(T) == 8), T>::type
I fear the `is_integral` test might fail here, as this type is not a standard type, see https://godbolt.org/z/3nd7KEj9K (works on gcc though)
The reference helper is only ever instantiated with T ∈ {int8_t, …, int64_t, uint8_t, …, uint64_t} — those are the value_types of the
xsimd integer batches that exercise this path. None of them are
__int128, so std::is_integral<T> is well-defined and true on all
supported toolchains. __int128 only appears as the intermediate
widening type W inside the body, gated on __SIZEOF_INT128__; we
never feed __int128 through std::is_integral.
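To make the split concrete, a small illustration (not PR code) of which types the trait actually sees:

```cpp
#include <cstdint>
#include <type_traits>

// The enable_if in the diff only ever sees the standard fixed-width integers,
// for which std::is_integral is guaranteed to be true.
static_assert(std::is_integral<std::int64_t>::value, "dispatch type is a standard integer");
static_assert(std::is_integral<std::uint64_t>::value, "dispatch type is a standard integer");

#if defined(__SIZEOF_INT128__)
// __int128 only ever appears as the internal widening type; whether
// std::is_integral<__int128>::value holds varies with the standard library and
// dialect (see the godbolt link above), and nothing in the helper depends on it.
using widening_type = unsigned __int128;
#endif
```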
| array_type hi_expected;
| for (std::size_t i = 0; i < size; ++i)
| {
| lo_expected[i] = static_cast<value_type>(static_cast<UT>(a[i]) * static_cast<UT>(b[i]));
Can you explain this intermediate cast to an unsigned type?
Signed integer overflow is UB in C++, and the test explicitly feeds
vmin*vmin, vmin*-1, etc. — pairs that overflow the signed range.
Casting to the corresponding unsigned type makes the multiplication
well-defined (modulo 2^N), and the low N bits of an N×N→2N product are
bit-identical for the signed and unsigned interpretations, so the
unsigned wrap value is the correct expected result for mullo.
I've added a short comment to that effect at the call site so a future
reader doesn't have to reconstruct this.
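For concreteness, a sketch of the reference computation under discussion (types named illustratively; the test derives `UT` from the batch's `value_type`):

```cpp
#include <cstdint>

using value_type = std::int32_t;   // one of the signed value_types under test
using UT         = std::uint32_t;  // std::make_unsigned<value_type>::type

// Unsigned multiplication wraps modulo 2^32, so this is well-defined even for
// vmin * vmin; the low 32 bits match what the signed product would be, which
// is exactly what mullo returns.
value_type expected_mullo(value_type x, value_type y)
{
    return static_cast<value_type>(static_cast<UT>(x) * static_cast<UT>(y));
}
```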
| # So vlen=128 RVV coverage lives in this workflow, which runs the build
| # and test inside an `archlinux:latest` container (qemu 11 + gcc 15.1).
| # The matching ubuntu-runner workflow `cross-rvv.yml` keeps multi-compiler
| # matrix coverage (gcc-14, clang-17/18) for vlens >= 256, where the apt
I guess there's no way to have all compilers running on archlinux?
There might be a way. I am not an arch expert though. Maybe a follow-up
PR to clean this?
| +---------------------------------------+----------------------------------------------------+
| | :cpp:func:`mul` | per slot multiply |
| +---------------------------------------+----------------------------------------------------+
| | :cpp:func:`mullo` | low N bits of the 2N-bit integer product | |
indentation seems odd. / same here
Fixed. The mullo and mulhi rows were one column short, breaking
the right-hand pipe. Padded both rows to match the rest of the table.
| XSIMD_INLINE batch<uint16_t, A> mulhi(batch<uint16_t, A> const& self, batch<uint16_t, A> const& other, requires_arch<sse2>) noexcept
| {
| return _mm_mulhi_epu16(self, other);
| }
note for self: we would implement the 8 bit version of all those multiply, don't know if it makes sense though.
Done — added at the SSE2 baseline (so SSE4.1, AVX, … inherit it):

- `mul<uint8_t>` (the body) — split each 16-bit lane into even/odd bytes, do two 16-bit `pmullw`s, re-interleave with `pand`/`psllw`/`por`. No 8-bit blend needed (avoids requiring SSE4.1).
- `mul<int8_t>` is a one-line `bitwise_cast` forwarder to the unsigned body, since the low N bits of an N×N→2N product are bit-identical for both signed and unsigned. Codegen for the two overloads is byte-identical (gcc 15 folds the cast):

      mul_i8_sse: psrlw ; psrlw ; pmullw ; psrlw ; pand ; pand ; pmullw ; psllw ; pand ; por
      mul_u8_sse: psrlw ; psrlw ; pmullw ; psrlw ; pand ; pand ; pmullw ; psllw ; pand ; por

- `mulhi<int8_t>` and `mulhi<uint8_t>` get separate bodies because the results aren't bit-identical (sign- vs zero-extension changes the product). Both use the standard widen → pmullw → narrow pattern. Codegen:

      mulhi_i8: unpcklbw + unpckhbw ; psraw ×2 ; pmullw ×2 ; psraw ×2 ; packsswb
      mulhi_u8: unpcklbw/unpckhbw against zero ; pmullw ×2 ; psrlw ×2 ; packuswb
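For illustration, a sketch (assumed shape, not the PR's exact code) of the 8-bit mul body described in the first bullet:

```cpp
#include <emmintrin.h> // SSE2

// 8-bit mullo: the low 8 bits of a 16-bit product depend only on the low bytes
// of the operands, so two pmullw passes cover all lanes without an 8-bit blend.
__m128i mul_u8_sse2_sketch(__m128i a, __m128i b)
{
    const __m128i lo_mask = _mm_set1_epi16(0x00FF);
    // even byte positions: multiply in place, keep only the low byte of each 16-bit lane
    __m128i even = _mm_and_si128(_mm_mullo_epi16(a, b), lo_mask);
    // odd byte positions: shift them down, multiply, shift the result back up
    __m128i odd = _mm_slli_epi16(
        _mm_mullo_epi16(_mm_srli_epi16(a, 8), _mm_srli_epi16(b, 8)), 8);
    return _mm_or_si128(even, odd);
}
```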
AVX2 and AVX-512BW already had templated 8-bit mul. They now also
get native-width 8-bit mulhi overloads (added in this PR rather
than left as a follow-up): direct ports of the SSE2 algorithm to
_mm256_* and _mm512_* intrinsics. The per-128-bit-lane
vpunpcklbw / vpunpckhbw and vpacksswb / vpackuswb form an
inverse pair within each lane, so no inter-lane vpermq /
vpermq2-style fix-up is needed — the byte ordering is preserved end
to end. Codegen on Haswell / Skylake-X:
mulhi_i8 (avx2): vpunpcklbw + vpunpckhbw ×2 ; vpsraw ×4 ; vpmullw ×2 ; vpsraw ×2 ; vpacksswb
mulhi_u8 (avx2): vpunpcklbw + vpunpckhbw against zero ; vpmullw ×2 ; vpsrlw ×2 ; vpackuswb
mulhi_i8 (avx512bw): same shape on ZMM
mulhi_u8 (avx512bw): same shape on ZMM
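And a sketch of the unsigned widen → pmullw → narrow pattern the mulhi bodies follow (again an assumed shape, not the PR's exact code):

```cpp
#include <emmintrin.h> // SSE2

// Unsigned 8-bit mulhi: zero-extend each byte to 16 bits, take the full 16-bit
// product (255 * 255 fits), keep its high byte, and narrow back.
__m128i mulhi_u8_sse2_sketch(__m128i a, __m128i b)
{
    const __m128i zero = _mm_setzero_si128();
    __m128i a_lo = _mm_unpacklo_epi8(a, zero), a_hi = _mm_unpackhi_epi8(a, zero);
    __m128i b_lo = _mm_unpacklo_epi8(b, zero), b_hi = _mm_unpackhi_epi8(b, zero);
    // high byte of each 16-bit product
    __m128i p_lo = _mm_srli_epi16(_mm_mullo_epi16(a_lo, b_lo), 8);
    __m128i p_hi = _mm_srli_epi16(_mm_mullo_epi16(a_hi, b_hi), 8);
    // values are <= 0xFF, so the saturating pack is exact and restores byte order
    return _mm_packus_epi16(p_lo, p_hi);
}
```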
| XSIMD_INLINE batch<int32_t, A> mulhi(batch<int32_t, A> const& self, batch<int32_t, A> const& other, requires_arch<sse4_1>) noexcept
| {
| __m128i even = _mm_mul_epi32(self, other); // 64-bit products in lanes 0,2
| __m128i odd = _mm_mul_epi32(_mm_shuffle_epi32(self, _MM_SHUFFLE(3, 3, 1, 1)), |
you could just _mm_srli_epi32(self, 32) instead, right?
and same for other, of course
_mm_srli_epi32(x, 32) would clear every 32-bit lane to zero (the SSE
intrinsic saturates past-width shifts to 0). I think you meant
_mm_srli_epi64(x, 32) — same as the unsigned overload right below
already does. Done — switched the signed mulhi<int32> to use
_mm_srli_epi64(self, 32) and _mm_srli_epi64(other, 32) to produce
the odd-lane operands, matching the uint32 overload. pmuldq reads
the low 32 bits of each 64-bit lane as signed, and srli_epi64 puts
the original odd lane there with zero-extended high bits, so signedness
is preserved.
Codegen comparison (g++ -O3 -msse4.2 -mno-avx):
before (signed): pshufd ; pshufd ; pmuldq ; pmuldq ; psrlq ; pblendw
after (signed): psrlq ; psrlq ; pmuldq ; pmuldq ; psrlq ; pblendw
unsigned (unchanged): psrlq ; psrlq ; pmuludq ; pmuludq ; psrlq ; pblendw
Same op count, but the signed and unsigned overloads are now structural
duplicates — easier to read and easier to keep in sync.
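The resulting signed overload then has roughly this shape (a sketch matching the "after" sequence above, not a verbatim copy of the PR code):

```cpp
#include <smmintrin.h> // SSE4.1

__m128i mulhi_i32_sse41_sketch(__m128i a, __m128i b)
{
    // 64-bit signed products of lanes 0 and 2 (pmuldq reads the low 32 bits of each 64-bit lane)
    __m128i even = _mm_mul_epi32(a, b);
    // lanes 1 and 3: shift them into the low halves (zero-extended high bits), then multiply
    __m128i odd = _mm_mul_epi32(_mm_srli_epi64(a, 32), _mm_srli_epi64(b, 32));
    // high halves of the even products move down; those of the odd products are already in place
    __m128i even_hi = _mm_srli_epi64(even, 32);
    // interleave: 16-bit fields 2-3 and 6-7 come from odd, the rest from even_hi
    return _mm_blend_epi16(even_hi, odd, 0xCC);
}
```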
| __m128i odd = _mm_mul_epi32(_mm_shuffle_epi32(self, _MM_SHUFFLE(3, 3, 1, 1)),
| _mm_shuffle_epi32(other, _MM_SHUFFLE(3, 3, 1, 1)));
| // hi halves in the low 32 of each 64 lane of (even>>32), and in the high 32 of odd
| __m128i even_hi = _mm_srli_epi64(even, 32);
you could avoid this shift by using directly a call to _mm_shuffle_ps on even and odd
I tried this and it doesn't fall out as a single instruction:
_mm_shuffle_ps(even, odd, mask) always picks two lanes from even
then two from odd, so the best we get in one shuffle is
[E0hi, E1hi, O0hi, O1hi] — but the mulhi lane order has to be
[E0hi, O0hi, E1hi, O1hi] (lane i = high half of x[i]*y[i]).
Recovering the right interleave costs a second shuffle, which is the
same op-count as the current srli_epi64 + blend_epi16 and adds an
int<->fp domain crossing on most µarchs.
Codegen confirms (gcc 15, -O3 -msse4.2 -mno-avx):
current (srli + pblendw):
movdqa ; psrlq ; pmuldq ; psrlq ; pmuldq ; psrlq ; pblendw (7 ops)
shufps + fixup pshufd:
movdqa ; psrlq ; pmuldq ; psrlq ; pmuldq ; shufps ; pshufd (7 ops)
blendps (same shape, different blend):
movdqa ; psrlq ; pmuldq ; psrlq ; pmuldq ; psrlq ; blendps (7 ops)
Three 7-op sequences. Same instruction count, but the shufps + pshufd
variant pins two ops to port-5 (Skylake/Alderlake) on top of the
two pmuldqs, which already use port 5 — the srli + pblendw keeps
the trailing pair on the shift/blend ports (0/1/5) instead. So no win,
and a likely loss under back-to-back issue. Keeping the current
sequence.
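For completeness, the rejected variant spelled out (a sketch; the masks make the required fix-up shuffle explicit):

```cpp
#include <smmintrin.h> // SSE4.1

__m128i mulhi_i32_shufps_sketch(__m128i a, __m128i b)
{
    __m128i even = _mm_mul_epi32(a, b);                                          // products of lanes 0, 2
    __m128i odd  = _mm_mul_epi32(_mm_srli_epi64(a, 32), _mm_srli_epi64(b, 32));  // products of lanes 1, 3
    // shufps can only pick two lanes from each source: [E0hi, E1hi, O0hi, O1hi]
    __m128 gathered = _mm_shuffle_ps(_mm_castsi128_ps(even), _mm_castsi128_ps(odd),
                                     _MM_SHUFFLE(3, 1, 3, 1));
    // a second shuffle is needed to restore the lane order [E0hi, O0hi, E1hi, O1hi]
    return _mm_shuffle_epi32(_mm_castps_si128(gathered), _MM_SHUFFLE(3, 1, 2, 0));
}
```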
| {
| __m128i even = _mm_mul_epu32(self, other);
| __m128i odd = _mm_mul_epu32(_mm_srli_epi64(self, 32), _mm_srli_epi64(other, 32));
| __m128i even_hi = _mm_srli_epi64(even, 32);
same here for blend vs. shuffle. You did use _mm_srli_epi64 here though, so there is probably a reason I don't get ^^!
| {
| int16x8_t lo = vmull_s8(vget_low_s8(lhs), vget_low_s8(rhs));
| int16x8_t hi = vmull_s8(vget_high_s8(lhs), vget_high_s8(rhs));
| return vcombine_s8(vshrn_n_s16(lo, 8), vshrn_n_s16(hi, 8));
What about vuzpq_s8 (and same below)
Done. I am not super familiar with neon/arm so this slipped.
Switched all six byte/half/word mulhi overloads to
vuzpq_*(reinterpret(lo), reinterpret(hi)).val[1], dropping the pair
of vshrn_n_* + vcombine_*.
Codegen on aarch64 (aarch64-linux-gnu-g++-15 -O3):
mulhi int8 : smull v30.8h, ... ; smull v31.8h, ... ; uzp2 v31.16b, ... ; str
mulhi int16: smull v30.4s, ... ; smull v31.4s, ... ; uzp2 v31.8h , ... ; str
mulhi int32: smull v30.2d, ... ; smull v31.2d, ... ; uzp2 v31.4s , ... ; str
Two widening multiplies + a single uzp2 per overload. The high lane
of each widened product lives at every odd index of [lo | hi], so
uzp2 extracts them directly.
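The resulting shape, sketched for the int8 overload (assumed, not a verbatim copy of the PR code):

```cpp
#include <arm_neon.h>

int8x16_t mulhi_s8_neon_sketch(int8x16_t a, int8x16_t b)
{
    // widen each half to 16-bit products
    int16x8_t lo = vmull_s8(vget_low_s8(a), vget_low_s8(b));
    int16x8_t hi = vmull_s8(vget_high_s8(a), vget_high_s8(b));
    // viewed as bytes, the high byte of every product sits at an odd index,
    // so the uzp2 half of vuzpq collects them all in one go
    return vuzpq_s8(vreinterpretq_s8_s16(lo), vreinterpretq_s8_s16(hi)).val[1];
}
```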
Adds three integer-multiplication primitives exposed via the public API:
- mullo(x, y): low half of the lane-wise product (equivalent to x * y)
- mulhi(x, y): high half of the lane-wise product
- mulhilo(x, y): returns {mulhi, mullo} as a pair
Native kernels are provided for:
- NEON (vmull_* + vshrn for 8/16/32-bit; software path for 64-bit)
- SVE (svmulh_x)
- RVV (rvvmulh)
- SSE2 (mulhi_epi16 / mulhi_epu16)
- SSE4.1 (mul_epi32/mul_epu32 + blend for 32-bit; shared 64-bit core)
- AVX2 (mulhi_epi16/epu16, mul_epi32/mul_epu32 + blend; shared 64-bit core)
- AVX-512F (shared 64-bit core)
- AVX-512BW (mulhi_epi16/epu16)
The 64-bit x86 cores share a single implementation in common/xsimd_common_arithmetic.hpp:
mulhi_u64_core and mulhi_i64_core express the ll/lh/hl/hh decomposition with
xsimd batch operators (&, >>, +, -, bitwise_cast) plus an arch-specific
widening-mul functor (_mm*_mul_epu32). This eliminates three copies of the
same 64x64 -> hi software path and unifies the signed-fixup to a single
arithmetic-shift-by-63 pattern (maps to vpsraq on AVX-512, emulated on
SSE4.1/AVX2 via bitwise_rshift).
The generic fallback in common dispatches per-type through mulhi_helper,
using a wider native integer for <=32-bit types and software split-and-
multiply (or __int128 when available) for 64-bit.
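For reference, the ll/lh/hl/hh decomposition in scalar form (a sketch of the algorithm the shared core expresses with batch operators; function names here are illustrative):

```cpp
#include <cstdint>

// Unsigned 64x64 -> high 64: split into 32-bit halves, form the four partial
// products, and propagate the carry out of the low 64 bits.
uint64_t mulhi_u64_sketch(uint64_t a, uint64_t b)
{
    uint64_t a_lo = a & 0xFFFFFFFFu, a_hi = a >> 32;
    uint64_t b_lo = b & 0xFFFFFFFFu, b_hi = b >> 32;
    uint64_t ll = a_lo * b_lo, lh = a_lo * b_hi, hl = a_hi * b_lo, hh = a_hi * b_hi;
    uint64_t carry = (ll >> 32) + (lh & 0xFFFFFFFFu) + (hl & 0xFFFFFFFFu);
    return hh + (lh >> 32) + (hl >> 32) + (carry >> 32);
}

// Signed variant: correct the unsigned result; the (x >> 63) masks are the
// scalar analogue of the arithmetic-shift-by-63 fixup mentioned above.
int64_t mulhi_i64_sketch(int64_t a, int64_t b)
{
    uint64_t u = mulhi_u64_sketch(static_cast<uint64_t>(a), static_cast<uint64_t>(b));
    u -= static_cast<uint64_t>(a >> 63) & static_cast<uint64_t>(b);
    u -= static_cast<uint64_t>(b >> 63) & static_cast<uint64_t>(a);
    return static_cast<int64_t>(u);
}
```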
…vlen=128)

QEMU < 11's RVV TCG emulation is dramatically slower than scalar (QEMU issue #2137). At vlen=128, gcc's RVV codegen for our test_xsimd ends up running long enough under apt-shipped qemu-user-static (8.2.x noble, 9.x plucky, 10.x trixie) to overflow the 6h GHA job timeout while making no observable progress.

Measured locally:

    qemu 8.2.2 (Ubuntu 24.04 apt)    : test_xsimd at vlen=128 times out
    qemu 9.2.1 (Ubuntu 25.04 plucky) : ditto
    qemu 10.0.8 (Debian trixie)      : ditto
    qemu 11.0.0 (Arch) + gcc 15.1    : 367 cases / 5664 asserts in <10 min

Vlens >= 256 stay within the test step budget on apt qemu (smaller emulator slowdown per logical op).

Keep the existing cross-rvv.yml workflow as-is — multi-compiler matrix (gcc-14, clang-17/18), apt qemu-user-static — but drop vector_bits=128 from its matrix and add fail-fast: false plus a 15 min timeout-minutes safety net so a stuck entry doesn't cancel its peers or burn 6h.

Add a sibling workflow cross-rvv-arch.yml that runs the build and test inside archlinux:latest (qemu 11 + gcc 15.1) and covers vector_bits=128/256/512. This restores RVV vlen=128 coverage today without waiting for ubuntu-latest to ship qemu 11.

References:
- QEMU 11.0.0 release notes: https://www.qemu.org/2026/04/22/qemu-11-0-0/
- QEMU RVV slowdowns issue: https://gitlab.com/qemu-project/qemu/-/issues/2137
- Ubuntu RVV vstart bug: https://bugs.launchpad.net/ubuntu/+source/qemu/+bug/2095169
Successor to #1334 (closed). Adds mulhi/mullo/mulhilo for integer batches; CI split for RVV cross-compile to use qemu 11 (arch container) at vlen=128 due to known qemu RVV slowdowns at vlen=128 in qemu < 11 (QEMU issue #2137). All 20 fork CI jobs green on this branch.
Took me forever to find the issue as I am using qemu 11 locally...