Feat/mulhilo #1344
Conversation
Some toolchains (notably certain GCC builds) define shift and mul immediate intrinsics as macros that apply a textual C-style cast to their operand. That cast does not traverse the multi-level alias inheritance of simd_register (e.g. avx512bw -> avx512dq -> avx512cd -> avx512f), so a batch<T, ISA> fails to convert to its native register type in those contexts. Declare the conversion operator on batch itself so the native type is always one member-lookup away.
serge-sans-paille
left a comment
First round of comments, I haven't reviewed the avx512 nor the rvv parts yet.
| * textual C-style cast inside the macro does not traverse the alias
| * inheritance chain. Declaring the operator here makes it visible on
| * the batch type directly. */
| XSIMD_INLINE operator register_type() const noexcept
you could use using types::simd_register<T, A>::operator register_type instead.
Tried it, had to revert. The using-declaration is evaluated at the
point of class-template instantiation, not lazily on use. But
types::simd_register<T, A> only carries operator register_type for
supported (T, A) pairs — for unsupported pairs (e.g. batch<double, neon> on 32-bit ARM, where double isn't a native NEON element type)
the primary template is empty. With the using-decl in place, code that
merely names batch<double, neon> (xsimd's arch dispatcher walks the
list and instantiates each batch<T, Arch> it considers) fails to
compile:
error: 'operator xsimd::batch<double, xsimd::neon>::register_type' has
not been declared in 'struct xsimd::types::simd_register<double,
xsimd::neon>'
A redefined member operator is only instantiated when called, so
unsupported pairs stay well-formed up to the point a user actually
uses them. Restored the redefined inline operator and added a comment
explaining the constraint so this isn't re-suggested.
(Caught by the aarch64 cross-build / qemu run; it slips through the
x86 native run because the dispatcher there never walks neon.)
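For future readers, here is a minimal standalone sketch of the timing difference (names are illustrative, not xsimd's; for simplicity the "unsupported" base still exposes the `register_type` name but not the conversion operator, which is consistent with the error quoted above):

```cpp
// Sketch only: supported_base / unsupported_base stand in for the specialized
// vs. empty simd_register cases; the instantiation timing is the point here.
struct supported_base
{
    using register_type = int;                           // stand-in for the native register
    operator register_type() const noexcept { return 0; }
};

struct unsupported_base
{
    using register_type = int;                           // the name exists, the operator does not
};

template <class Base>
struct batch_with_using : Base
{
    using register_type = typename Base::register_type;
    using Base::operator register_type;                  // checked as soon as the class is instantiated
};

template <class Base>
struct batch_with_redefine : Base
{
    using register_type = typename Base::register_type;
    operator register_type() const noexcept              // body instantiated only when called
    {
        return static_cast<Base const&>(*this);          // checked only if the operator is actually used
    }
};

int main()
{
    batch_with_using<supported_base> a; (void)a;         // fine: the base provides the operator
    batch_with_redefine<unsupported_base> b; (void)b;    // fine: the body is never instantiated
    // batch_with_using<unsupported_base> c;             // error: 'operator ... register_type' has
    //                                                   // not been declared in 'unsupported_base'
}
```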
| #if defined(__SIZEOF_INT128__)
| template <class T>
| typename std::enable_if<std::is_integral<T>::value && (sizeof(T) == 8), T>::type
I fear the `is_integral` test might fail here, as this type is not a standard type, see https://godbolt.org/z/3nd7KEj9K (works on gcc though)
The reference helper is only ever instantiated with T ∈ {int8_t, …, int64_t, uint8_t, …, uint64_t} — those are the value_types of the
xsimd integer batches that exercise this path. None of them are
__int128, so std::is_integral<T> is well-defined and true on all
supported toolchains. __int128 only appears as the intermediate
widening type W inside the body, gated on __SIZEOF_INT128__; we
never feed __int128 through std::is_integral.
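To make the split concrete, a small illustration (not PR code) of which types the trait actually sees:

```cpp
#include <cstdint>
#include <type_traits>

// The enable_if in the diff only ever sees the standard fixed-width integers,
// for which std::is_integral is guaranteed to be true.
static_assert(std::is_integral<std::int64_t>::value, "dispatch type is a standard integer");
static_assert(std::is_integral<std::uint64_t>::value, "dispatch type is a standard integer");

#if defined(__SIZEOF_INT128__)
// __int128 only ever appears as the internal widening type; whether
// std::is_integral<__int128>::value holds varies with the standard library and
// dialect (see the godbolt link above), and nothing in the helper depends on it.
using widening_type = unsigned __int128;
#endif
```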
| array_type hi_expected;
| for (std::size_t i = 0; i < size; ++i)
| {
| lo_expected[i] = static_cast<value_type>(static_cast<UT>(a[i]) * static_cast<UT>(b[i]));
Can you explain this intermediate cast to an unsigned type?
Signed integer overflow is UB in C++, and the test explicitly feeds
vmin*vmin, vmin*-1, etc. — pairs that overflow the signed range.
Casting to the corresponding unsigned type makes the multiplication
well-defined (modulo 2^N), and the low N bits of an N×N→2N product are
bit-identical for the signed and unsigned interpretations, so the
unsigned wrap value is the correct expected result for mullo.
I've added a short comment to that effect at the call site so a future
reader doesn't have to reconstruct this.
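For concreteness, a sketch of the reference computation under discussion (types named illustratively; the test derives `UT` from the batch's `value_type`):

```cpp
#include <cstdint>

using value_type = std::int32_t;   // one of the signed value_types under test
using UT         = std::uint32_t;  // std::make_unsigned<value_type>::type

// Unsigned multiplication wraps modulo 2^32, so this is well-defined even for
// vmin * vmin; the low 32 bits match what the signed product would be, which
// is exactly what mullo returns.
value_type expected_mullo(value_type x, value_type y)
{
    return static_cast<value_type>(static_cast<UT>(x) * static_cast<UT>(y));
}
```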
| # So vlen=128 RVV coverage lives in this workflow, which runs the build
| # and test inside an `archlinux:latest` container (qemu 11 + gcc 15.1).
| # The matching ubuntu-runner workflow `cross-rvv.yml` keeps multi-compiler
| # matrix coverage (gcc-14, clang-17/18) for vlens >= 256, where the apt
I guess there's no way to have all compilers running on archlinux?
There might be a way. I am not an arch expert though. Maybe a follow-up
PR to clean this?
| +---------------------------------------+----------------------------------------------------+
| | :cpp:func:`mul` | per slot multiply |
| +---------------------------------------+----------------------------------------------------+
| | :cpp:func:`mullo` | low N bits of the 2N-bit integer product | |
indentation seems odd. / same here
Fixed. The mullo and mulhi rows were one column short, breaking
the right-hand pipe. Padded both rows to match the rest of the table.
| XSIMD_INLINE batch<uint16_t, A> mulhi(batch<uint16_t, A> const& self, batch<uint16_t, A> const& other, requires_arch<sse2>) noexcept
| {
| return _mm_mulhi_epu16(self, other);
| }
note for self: we would implement the 8 bit version of all those multiply, don't know if it makes sense though.
Done — added at the SSE2 baseline (so SSE4.1, AVX, … inherit it):

- `mul<uint8_t>` (the body) — split each 16-bit lane into even/odd bytes, do two 16-bit `pmullw`s, re-interleave with `pand`/`psllw`/`por`. No 8-bit blend needed (avoids requiring SSE4.1).
- `mul<int8_t>` is a one-line `bitwise_cast` forwarder to the unsigned body, since the low N bits of an N×N→2N product are bit-identical for both signed and unsigned. Codegen for the two overloads is byte-identical (gcc 15 folds the cast):

      mul_i8_sse: psrlw ; psrlw ; pmullw ; psrlw ; pand ; pand ; pmullw ; psllw ; pand ; por
      mul_u8_sse: psrlw ; psrlw ; pmullw ; psrlw ; pand ; pand ; pmullw ; psllw ; pand ; por

- `mulhi<int8_t>` and `mulhi<uint8_t>` get separate bodies because the results aren't bit-identical (sign- vs zero-extension changes the product). Both use the standard widen → pmullw → narrow pattern. Codegen:

      mulhi_i8: unpcklbw + unpckhbw ; psraw ×2 ; pmullw ×2 ; psraw ×2 ; packsswb
      mulhi_u8: unpcklbw/unpckhbw against zero ; pmullw ×2 ; psrlw ×2 ; packuswb
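For illustration, a sketch (assumed shape, not the PR's exact code) of the 8-bit mul body described in the first bullet:

```cpp
#include <emmintrin.h> // SSE2

// 8-bit mullo: the low 8 bits of a 16-bit product depend only on the low bytes
// of the operands, so two pmullw passes cover all lanes without an 8-bit blend.
__m128i mul_u8_sse2_sketch(__m128i a, __m128i b)
{
    const __m128i lo_mask = _mm_set1_epi16(0x00FF);
    // even byte positions: multiply in place, keep only the low byte of each 16-bit lane
    __m128i even = _mm_and_si128(_mm_mullo_epi16(a, b), lo_mask);
    // odd byte positions: shift them down, multiply, shift the result back up
    __m128i odd = _mm_slli_epi16(
        _mm_mullo_epi16(_mm_srli_epi16(a, 8), _mm_srli_epi16(b, 8)), 8);
    return _mm_or_si128(even, odd);
}
```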
AVX2 and AVX-512BW already had templated 8-bit mul. They now also
get native-width 8-bit mulhi overloads (added in this PR rather
than left as a follow-up): direct ports of the SSE2 algorithm to
_mm256_* and _mm512_* intrinsics. The per-128-bit-lane
vpunpcklbw / vpunpckhbw and vpacksswb / vpackuswb form an
inverse pair within each lane, so no inter-lane vpermq /
vpermq2-style fix-up is needed — the byte ordering is preserved end
to end. Codegen on Haswell / Skylake-X:
mulhi_i8 (avx2): vpunpcklbw + vpunpckhbw ×2 ; vpsraw ×4 ; vpmullw ×2 ; vpsraw ×2 ; vpacksswb
mulhi_u8 (avx2): vpunpcklbw + vpunpckhbw against zero ; vpmullw ×2 ; vpsrlw ×2 ; vpackuswb
mulhi_i8 (avx512bw): same shape on ZMM
mulhi_u8 (avx512bw): same shape on ZMM
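And a sketch of the unsigned widen → pmullw → narrow pattern the mulhi bodies follow (again an assumed shape, not the PR's exact code):

```cpp
#include <emmintrin.h> // SSE2

// Unsigned 8-bit mulhi: zero-extend each byte to 16 bits, take the full 16-bit
// product (255 * 255 fits), keep its high byte, and narrow back.
__m128i mulhi_u8_sse2_sketch(__m128i a, __m128i b)
{
    const __m128i zero = _mm_setzero_si128();
    __m128i a_lo = _mm_unpacklo_epi8(a, zero), a_hi = _mm_unpackhi_epi8(a, zero);
    __m128i b_lo = _mm_unpacklo_epi8(b, zero), b_hi = _mm_unpackhi_epi8(b, zero);
    // high byte of each 16-bit product
    __m128i p_lo = _mm_srli_epi16(_mm_mullo_epi16(a_lo, b_lo), 8);
    __m128i p_hi = _mm_srli_epi16(_mm_mullo_epi16(a_hi, b_hi), 8);
    // values are <= 0xFF, so the saturating pack is exact and restores byte order
    return _mm_packus_epi16(p_lo, p_hi);
}
```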
| XSIMD_INLINE batch<int32_t, A> mulhi(batch<int32_t, A> const& self, batch<int32_t, A> const& other, requires_arch<sse4_1>) noexcept
| {
| __m128i even = _mm_mul_epi32(self, other); // 64-bit products in lanes 0,2
| __m128i odd = _mm_mul_epi32(_mm_shuffle_epi32(self, _MM_SHUFFLE(3, 3, 1, 1)), |
you could just _mm_srli_epi32(self, 32) instead, right?
and same for other, of course
_mm_srli_epi32(x, 32) would clear every 32-bit lane to zero (the SSE
intrinsic saturates past-width shifts to 0). I think you meant
_mm_srli_epi64(x, 32) — same as the unsigned overload right below
already does. Done — switched the signed mulhi<int32> to use
_mm_srli_epi64(self, 32) and _mm_srli_epi64(other, 32) to produce
the odd-lane operands, matching the uint32 overload. pmuldq reads
the low 32 bits of each 64-bit lane as signed, and srli_epi64 puts
the original odd lane there with zero-extended high bits, so signedness
is preserved.
Codegen comparison (g++ -O3 -msse4.2 -mno-avx):
before (signed): pshufd ; pshufd ; pmuldq ; pmuldq ; psrlq ; pblendw
after (signed): psrlq ; psrlq ; pmuldq ; pmuldq ; psrlq ; pblendw
unsigned (unchanged): psrlq ; psrlq ; pmuludq ; pmuludq ; psrlq ; pblendw
Same op count, but the signed and unsigned overloads are now structural
duplicates — easier to read and easier to keep in sync.
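The resulting signed overload then has roughly this shape (a sketch matching the "after" sequence above, not a verbatim copy of the PR code):

```cpp
#include <smmintrin.h> // SSE4.1

__m128i mulhi_i32_sse41_sketch(__m128i a, __m128i b)
{
    // 64-bit signed products of lanes 0 and 2 (pmuldq reads the low 32 bits of each 64-bit lane)
    __m128i even = _mm_mul_epi32(a, b);
    // lanes 1 and 3: shift them into the low halves (zero-extended high bits), then multiply
    __m128i odd = _mm_mul_epi32(_mm_srli_epi64(a, 32), _mm_srli_epi64(b, 32));
    // high halves of the even products move down; those of the odd products are already in place
    __m128i even_hi = _mm_srli_epi64(even, 32);
    // interleave: 16-bit fields 2-3 and 6-7 come from odd, the rest from even_hi
    return _mm_blend_epi16(even_hi, odd, 0xCC);
}
```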
| __m128i odd = _mm_mul_epi32(_mm_shuffle_epi32(self, _MM_SHUFFLE(3, 3, 1, 1)),
| _mm_shuffle_epi32(other, _MM_SHUFFLE(3, 3, 1, 1)));
| // hi halves in the low 32 of each 64 lane of (even>>32), and in the high 32 of odd
| __m128i even_hi = _mm_srli_epi64(even, 32);
you could avoid this shift by using directly a call to _mm_shuffle_ps on even and odd
I tried this and it doesn't fall out as a single instruction:
_mm_shuffle_ps(even, odd, mask) always picks two lanes from even
then two from odd, so the best we get in one shuffle is
[E0hi, E1hi, O0hi, O1hi] — but the mulhi lane order has to be
[E0hi, O0hi, E1hi, O1hi] (lane i = high half of x[i]*y[i]).
Recovering the right interleave costs a second shuffle, which is the
same op-count as the current srli_epi64 + blend_epi16 and adds an
int<->fp domain crossing on most µarchs.
Codegen confirms (gcc 15, -O3 -msse4.2 -mno-avx):
current (srli + pblendw):
movdqa ; psrlq ; pmuldq ; psrlq ; pmuldq ; psrlq ; pblendw (7 ops)
shufps + fixup pshufd:
movdqa ; psrlq ; pmuldq ; psrlq ; pmuldq ; shufps ; pshufd (7 ops)
blendps (same shape, different blend):
movdqa ; psrlq ; pmuldq ; psrlq ; pmuldq ; psrlq ; blendps (7 ops)
Three 7-op sequences. Same instruction count, but the shufps + pshufd
variant pins two ops to port-5 (Skylake/Alderlake) on top of the
two pmuldqs, which already use port 5 — the srli + pblendw keeps
the trailing pair on the shift/blend ports (0/1/5) instead. So no win,
and a likely loss under back-to-back issue. Keeping the current
sequence.
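For completeness, the rejected variant spelled out (a sketch; the masks make the required fix-up shuffle explicit):

```cpp
#include <smmintrin.h> // SSE4.1

__m128i mulhi_i32_shufps_sketch(__m128i a, __m128i b)
{
    __m128i even = _mm_mul_epi32(a, b);                                          // products of lanes 0, 2
    __m128i odd  = _mm_mul_epi32(_mm_srli_epi64(a, 32), _mm_srli_epi64(b, 32));  // products of lanes 1, 3
    // shufps can only pick two lanes from each source: [E0hi, E1hi, O0hi, O1hi]
    __m128 gathered = _mm_shuffle_ps(_mm_castsi128_ps(even), _mm_castsi128_ps(odd),
                                     _MM_SHUFFLE(3, 1, 3, 1));
    // a second shuffle is needed to restore the lane order [E0hi, O0hi, E1hi, O1hi]
    return _mm_shuffle_epi32(_mm_castps_si128(gathered), _MM_SHUFFLE(3, 1, 2, 0));
}
```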
| {
| __m128i even = _mm_mul_epu32(self, other);
| __m128i odd = _mm_mul_epu32(_mm_srli_epi64(self, 32), _mm_srli_epi64(other, 32));
| __m128i even_hi = _mm_srli_epi64(even, 32);
same here for blend vs. shuffle. You did use _mm_srli_epi64 here though, so there is probably a reason I don't get ^^!
| {
| int16x8_t lo = vmull_s8(vget_low_s8(lhs), vget_low_s8(rhs));
| int16x8_t hi = vmull_s8(vget_high_s8(lhs), vget_high_s8(rhs));
| return vcombine_s8(vshrn_n_s16(lo, 8), vshrn_n_s16(hi, 8));
What about vuzpq_s8 (and same below)
Done. I am not super familiar with neon/arm so this slipped.
Switched all six byte/half/word mulhi overloads to
vuzpq_*(reinterpret(lo), reinterpret(hi)).val[1], dropping the pair
of vshrn_n_* + vcombine_*.
Codegen on aarch64 (aarch64-linux-gnu-g++-15 -O3):
mulhi int8 : smull v30.8h, ... ; smull v31.8h, ... ; uzp2 v31.16b, ... ; str
mulhi int16: smull v30.4s, ... ; smull v31.4s, ... ; uzp2 v31.8h , ... ; str
mulhi int32: smull v30.2d, ... ; smull v31.2d, ... ; uzp2 v31.4s , ... ; str
Two widening multiplies + a single uzp2 per overload. The high lane
of each widened product lives at every odd index of [lo | hi], so
uzp2 extracts them directly.
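The resulting shape, sketched for the int8 overload (assumed, not a verbatim copy of the PR code):

```cpp
#include <arm_neon.h>

int8x16_t mulhi_s8_neon_sketch(int8x16_t a, int8x16_t b)
{
    // widen each half to 16-bit products
    int16x8_t lo = vmull_s8(vget_low_s8(a), vget_low_s8(b));
    int16x8_t hi = vmull_s8(vget_high_s8(a), vget_high_s8(b));
    // viewed as bytes, the high byte of every product sits at an odd index,
    // so the uzp2 half of vuzpq collects them all in one go
    return vuzpq_s8(vreinterpretq_s8_s16(lo), vreinterpretq_s8_s16(hi)).val[1];
}
```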
Adds three integer-multiplication primitives exposed via the public API:
- mullo(x, y): low half of the lane-wise product (equivalent to x * y)
- mulhi(x, y): high half of the lane-wise product
- mulhilo(x, y): returns {mulhi, mullo} as a pair
Native kernels are provided for:
- NEON (vmull_* + vshrn for 8/16/32-bit; software path for 64-bit)
- SVE (svmulh_x)
- RVV (rvvmulh)
- SSE2 (mulhi_epi16 / mulhi_epu16)
- SSE4.1 (mul_epi32/mul_epu32 + blend for 32-bit; shared 64-bit core)
- AVX2 (mulhi_epi16/epu16, mul_epi32/mul_epu32 + blend; shared 64-bit core)
- AVX-512F (shared 64-bit core)
- AVX-512BW (mulhi_epi16/epu16)
The 64-bit x86 cores share a single implementation in common/xsimd_common_arithmetic.hpp:
mulhi_u64_core and mulhi_i64_core express the ll/lh/hl/hh decomposition with
xsimd batch operators (&, >>, +, -, bitwise_cast) plus an arch-specific
widening-mul functor (_mm*_mul_epu32). This eliminates three copies of the
same 64x64 -> hi software path and unifies the signed-fixup to a single
arithmetic-shift-by-63 pattern (maps to vpsraq on AVX-512, emulated on
SSE4.1/AVX2 via bitwise_rshift).
The generic fallback in common dispatches per-type through mulhi_helper,
using a wider native integer for <=32-bit types and software split-and-
multiply (or __int128 when available) for 64-bit.
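For reference, the ll/lh/hl/hh decomposition in scalar form (a sketch of the algorithm the shared core expresses with batch operators; function names here are illustrative):

```cpp
#include <cstdint>

// Unsigned 64x64 -> high 64: split into 32-bit halves, form the four partial
// products, and propagate the carry out of the low 64 bits.
uint64_t mulhi_u64_sketch(uint64_t a, uint64_t b)
{
    uint64_t a_lo = a & 0xFFFFFFFFu, a_hi = a >> 32;
    uint64_t b_lo = b & 0xFFFFFFFFu, b_hi = b >> 32;
    uint64_t ll = a_lo * b_lo, lh = a_lo * b_hi, hl = a_hi * b_lo, hh = a_hi * b_hi;
    uint64_t carry = (ll >> 32) + (lh & 0xFFFFFFFFu) + (hl & 0xFFFFFFFFu);
    return hh + (lh >> 32) + (hl >> 32) + (carry >> 32);
}

// Signed variant: correct the unsigned result; the (x >> 63) masks are the
// scalar analogue of the arithmetic-shift-by-63 fixup mentioned above.
int64_t mulhi_i64_sketch(int64_t a, int64_t b)
{
    uint64_t u = mulhi_u64_sketch(static_cast<uint64_t>(a), static_cast<uint64_t>(b));
    u -= static_cast<uint64_t>(a >> 63) & static_cast<uint64_t>(b);
    u -= static_cast<uint64_t>(b >> 63) & static_cast<uint64_t>(a);
    return static_cast<int64_t>(u);
}
```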
…vlen=128)

QEMU < 11's RVV TCG emulation is dramatically slower than scalar (QEMU issue #2137). At vlen=128, gcc's RVV codegen for our test_xsimd ends up running long enough under apt-shipped qemu-user-static (8.2.x noble, 9.x plucky, 10.x trixie) to overflow the 6h GHA job timeout while making no observable progress.

Measured locally:

    qemu 8.2.2 (Ubuntu 24.04 apt)    : test_xsimd at vlen=128 times out
    qemu 9.2.1 (Ubuntu 25.04 plucky) : ditto
    qemu 10.0.8 (Debian trixie)      : ditto
    qemu 11.0.0 (Arch) + gcc 15.1    : 367 cases / 5664 asserts in <10 min

Vlens >= 256 stay within the test step budget on apt qemu (smaller emulator slowdown per logical op).

Keep the existing cross-rvv.yml workflow as-is — multi-compiler matrix (gcc-14, clang-17/18), apt qemu-user-static — but drop vector_bits=128 from its matrix and add fail-fast: false plus a 15 min timeout-minutes safety net so a stuck entry doesn't cancel its peers or burn 6h.

Add a sibling workflow cross-rvv-arch.yml that runs the build and test inside archlinux:latest (qemu 11 + gcc 15.1) and covers vector_bits=128/256/512. This restores RVV vlen=128 coverage today without waiting for ubuntu-latest to ship qemu 11.

References:
- QEMU 11.0.0 release notes: https://www.qemu.org/2026/04/22/qemu-11-0-0/
- QEMU RVV slowdowns issue: https://gitlab.com/qemu-project/qemu/-/issues/2137
- Ubuntu RVV vstart bug: https://bugs.launchpad.net/ubuntu/+source/qemu/+bug/2095169
Successor to #1334 (closed). Adds mulhi/mullo/mulhilo for integer batches; CI split for RVV cross-compile to use qemu 11 (arch container) at vlen=128 due to known qemu RVV slowdowns at vlen=128 in qemu < 11 (QEMU issue #2137). All 20 fork CI jobs green on this branch.
Took me forever to find the issue as I am using qemu 11 locally...