
feat: migrate take_primitive_simd to stable AVX2 kernel#3579

Merged
robert3005 merged 9 commits into develop from aduffy/simd-take-primitive
Jul 3, 2025

Conversation

@a10y
Contributor

@a10y a10y commented Jun 19, 2025

An implementation of TakeKernel for PrimitiveArray that uses explicit AVX2 instructions, falling back to the scalar path. On non-x86_64 platforms the portable_simd implementation is still loaded when a nightly compiler is in use.

This is part of the #3546 series of PRs.

Additionally: fixed a soundness issue caused by a lack of bounds checking of indices in the portable_simd impl. This requires a full upfront scan of the indices before running the kernel to avoid out-of-bounds memory access.
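
A minimal sketch of that upfront scan (the `take_checked` name and scalar body are illustrative, not the actual Vortex kernel): all indices are validated before any gather runs, so the hot loop itself needs no per-element checks.

```rust
// Illustrative sketch: validate every index before running the kernel, so the
// (here scalar) gather loop can assume all indices are in bounds.
// `take_checked` is a hypothetical name, not the actual Vortex API.
fn take_checked(values: &[i64], indices: &[u32]) -> Option<Vec<i64>> {
    let len = values.len();
    // Full upfront scan: one out-of-range index rejects the whole batch.
    if indices.iter().any(|&idx| idx as usize >= len) {
        return None;
    }
    // Safe to gather without further bounds checks.
    Some(indices.iter().map(|&idx| values[idx as usize]).collect())
}

fn main() {
    assert_eq!(take_checked(&[10, 20, 30], &[2, 0]), Some(vec![30, 10]));
    assert_eq!(take_checked(&[10, 20, 30], &[3]), None);
    println!("ok");
}
```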

Implementation

The biggest source of added complexity in this PR is the new avx2 module which implements a take kernel for primitive indices/values that uses the AVX2 GATHER operation.

The Intel ISA provides four 256-bit gather intrinsics:

  • _mm256_i32gather_epi32 -> gathers 8x 32-bit values with 32-bit indices
  • _mm256_i32gather_epi64 -> gathers 4x 64-bit values with 32-bit indices
  • _mm256_i64gather_epi32 -> gathers 4x 32-bit values with 64-bit indices
  • _mm256_i64gather_epi64 -> gathers 4x 64-bit values with 64-bit indices

We implement a generic inner loop with a trait parameter GatherFn<I, V>, and allow specialization to insert the proper loop logic for each valid index/value type combination.
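
The generic-inner-loop idea can be sketched roughly like this (a scalar stand-in for the AVX2 specializations; trait and function names here are illustrative, not the PR's exact code):

```rust
// Rough sketch of the GatherFn<I, V> shape: the driver loop is generic, and a
// zero-sized type selects the inner-loop body per index/value combination.
// In the PR the specializations wrap AVX2 gathers; here a scalar stand-in.
trait GatherFn<I, V> {
    /// Gather `values[i]` for one chunk of indices, appending to `out`.
    fn gather(indices: &[I], values: &[V], out: &mut Vec<V>);
}

struct ScalarGather;

impl GatherFn<u32, i64> for ScalarGather {
    fn gather(indices: &[u32], values: &[i64], out: &mut Vec<i64>) {
        out.extend(indices.iter().map(|&i| values[i as usize]));
    }
}

/// Generic driver: chunk the indices and dispatch to the chosen gather impl.
fn take<G: GatherFn<u32, i64>>(indices: &[u32], values: &[i64]) -> Vec<i64> {
    const CHUNK: usize = 8; // e.g. one 256-bit vector of u32 indices
    let mut out = Vec::with_capacity(indices.len());
    for chunk in indices.chunks(CHUNK) {
        G::gather(chunk, values, &mut out);
    }
    out
}

fn main() {
    assert_eq!(take::<ScalarGather>(&[1, 0, 2], &[5, 6, 7]), vec![6, 5, 7]);
    println!("ok");
}
```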

@github-actions github-actions bot added the changelog/feature A new feature label Jun 19, 2025
@robert3005
Contributor

I wouldn't merge this and would instead leave the SIMD impl behind a feature flag. Portable SIMD probably gives us more than just an AVX2 impl?

let offset = chunk_idx * SIMD_WIDTH;

// Load the next 8 indices into a vector
let indices_vec = unsafe { _mm256_loadu_si256(indices.as_ptr().add(offset).cast()) };
Contributor Author

interestingly, tantivy uses the _mm256_lddqu_si256 intrinsic instead; the Intel docs indicate it's very similar to _mm256_loadu_si256, and this StackOverflow answer backs that up:

> There's no reason to ever use _mm256_lddqu_si256, consider it a synonym for _mm256_loadu_si256. lddqu only exists for historical reasons as x86 evolved towards having better unaligned vector load support, and CPUs that support the AVX version run them identically. There's no AVX512 version.

@a10y
Contributor Author

a10y commented Jun 20, 2025

@robert3005 I probably don't have as good intuition about this as you or Alex, but AFAICT, the most beneficial part of the existing portable_simd implementation is the gather operation, which exists on AVX2 but has no equivalent on NEON.

Similarly, from other things I've read, AVX-512 generally executes a 512-bit load as two instructions instead of one, so the speedup is not really 2x, it's something short of that. I think that might be why things like Tantivy that implement direct SIMD support only have AVX2 kernels and don't bother with AVX-512.

@a10y
Contributor Author

a10y commented Jun 20, 2025

I'm also open to just shoving the existing impl behind some nightly-only feature flag, if that's possible. I was just hoping that if this were valuable we could preserve it for everyone, since it is doable.

@robert3005
Contributor

Ok, AVX2 gather is probably the only widespread SIMD implementation of this function, so it might be worth having a stable version

@a10y a10y changed the title feat: rewrite take_primitive_simd from portable_simd -> avx2 feat: migrate take_primitive_simd to stable AVX2 kernel Jun 20, 2025
@joseph-isaacs
Contributor

Are there any benchmarks for this?

@0ax1
Contributor

0ax1 commented Jun 20, 2025

> Are there any benchmarks for this?

This is covered by https://github.com/vortex-data/vortex/blob/develop/encodings/dict/benches/dict_compress.rs

@0ax1
Contributor

0ax1 commented Jun 20, 2025

> @robert3005 I probably don't have as good intuition about this as you or Alex, but AFAICT, the most beneficial part of the existing portable_simd implementation is the gather operation, which exists on AVX2 but has no equivalent on NEON.
>
> Similarly, from other things I've read, AVX512 generally executes a 512-bit load as two instructions instead of one, so the speedup is not really 2x it's something short of that. I think that might be why things like Tantivy which implement direct SIMD support only have avx2 kernels and don't bother with avx512.

Yep, this is only about gather. Would be interesting though to also compare perf on macOS between portable simd and non-SIMD to double check that moving away from portable SIMD doesn't introduce a regression there. So the assumption is there shouldn't be, as there's no gather equivalent for NEON.

@codspeed-hq

codspeed-hq bot commented Jun 20, 2025

CodSpeed Performance Report

Merging #3579 will improve performances by 35.48%

Comparing aduffy/simd-take-primitive (f694309) with develop (8b184c2)

Summary

⚡ 11 improvements
✅ 782 untouched benchmarks

Benchmarks breakdown

| Benchmark | BASE | HEAD | Change |
| --- | --- | --- | --- |
| decode_primitives[f32, (10000, 128)] | 77.2 µs | 57.3 µs | +34.59% |
| decode_primitives[f32, (10000, 2)] | 76.9 µs | 57 µs | +34.93% |
| decode_primitives[f32, (10000, 32)] | 76.7 µs | 57.1 µs | +34.36% |
| decode_primitives[f32, (10000, 4)] | 76.9 µs | 57.7 µs | +33.35% |
| decode_primitives[f32, (10000, 512)] | 82.1 µs | 62.4 µs | +31.59% |
| decode_primitives[f32, (10000, 8)] | 76.8 µs | 56.7 µs | +35.48% |
| decode_primitives[i64, (10000, 128)] | 100.7 µs | 79.7 µs | +26.48% |
| decode_primitives[i64, (10000, 2)] | 100.1 µs | 79.9 µs | +25.31% |
| decode_primitives[i64, (10000, 32)] | 100 µs | 79.6 µs | +25.64% |
| decode_primitives[i64, (10000, 4)] | 100.3 µs | 79.6 µs | +26.13% |
| decode_primitives[i64, (10000, 8)] | 100.6 µs | 79.2 µs | +27.08% |

@robert3005
Contributor

Interesting that i64 improved

@a10y
Contributor Author

a10y commented Jun 20, 2025

That's because in the latest commit I hand-rolled a kernel for u8 indices and i64 values, so now the AVX pathway is actually being hit

[screenshot]

#[cfg(target_arch = "x86_64")]
#[target_feature(enable = "avx2")]
unsafe fn take_u8_i64_avx2(indices: &[u8], values: &[i64]) -> Buffer<i64> {
const SIMD_WIDTH: usize = 4; // 256 bits / 32 bits per element
Contributor Author

should say 64 bits per element

@a10y a10y force-pushed the aduffy/simd-take-primitive branch 9 times, most recently from 5b6cbbd to 4244f55 on June 25, 2025 19:58
@a10y
Copy link
Contributor Author

a10y commented Jun 25, 2025

Alright, I think I've convinced myself that this PR adds value.

One important thing to note: the old portable_simd implementation was unsound, because it failed to bounds-check the indices before issuing the gather instructions, meaning that you could very easily trigger unsafe memory access. So no matter what, bounds checking needs to be added, and that will undoubtedly hurt performance a bit relative to the baseline numbers.

I did two CodSpeed runs:

  1. Run 1: portable_simd Kernel only with bounds check added: https://codspeed.io/vortex-data/vortex/runs/compare/685c2abfbd0c18c2abcc32be..685c56398a5bab413a016f8d
  2. Run 2: Using AVX2 with bounds check added: https://codspeed.io/vortex-data/vortex/runs/compare/685c2abfbd0c18c2abcc32be..685c4aa98a5bab413a016f07

Note that in Run 1, all of the decode_primitives benchmarks regress considerably. In Run 2, a few regress (fewer than in Run 1), but the majority speed up by up to 20% despite the added bounds checking.

};
}

impl_gather!(u8,
Contributor Author

we should implement f32 and f64 as value types since there are relevant gather instructions for them. This PR is probably enough to review as-is, so I'd prefer to do that in a follow-up

/// AVX2 version of GatherFn defined for 32- and 64-bit value types.
enum AVX2Gather {}

unsafe fn identity<T>(input: T) -> T {
Contributor Author

some of the impls don't need an extend operation (when the indices and values are the same size), so we use this; it should get optimized away

Contributor

use std::convert::identity instead?
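
As a sketch of the point above (the `apply_extend` helper is hypothetical, not the PR's code): when index and value widths match, the extend slot can be filled with the stdlib no-op `std::convert::identity`, which the optimizer erases.

```rust
use std::convert::identity;

// Hypothetical sketch of the "extend" slot: a generic pipeline takes an
// extend function; same-width index/value pairs plug in the stdlib no-op.
fn apply_extend<T, U>(x: T, extend: impl Fn(T) -> U) -> U {
    extend(x)
}

fn main() {
    // Same-width path: extend is a no-op and should optimize away.
    assert_eq!(apply_extend(42u32, identity), 42u32);
    // Narrow-index path: zero-extend u32 indices to 64 bits.
    assert_eq!(apply_extend(42u32, u64::from), 42u64);
    println!("ok");
}
```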

@a10y
Contributor Author

a10y commented Jun 25, 2025

Hmm, I'm not seeing the nice improvements anymore. I'd like to merge #3653 first so that we have an apples-to-apples comparison

@a10y a10y force-pushed the aduffy/simd-take-primitive branch from cee456f to bdfd0f1 on June 26, 2025 14:03
@a10y a10y force-pushed the aduffy/simd-take-primitive branch from bdfd0f1 to 4f2e4d6 on June 26, 2025 14:07
Signed-off-by: Andrew Duffy <andrew@a10y.dev>
@a10y a10y force-pushed the aduffy/simd-take-primitive branch from 4f2e4d6 to 03f9aeb on July 1, 2025 14:56
Signed-off-by: Andrew Duffy <andrew@a10y.dev>
@a10y a10y force-pushed the aduffy/simd-take-primitive branch from 03f9aeb to 9336041 on July 1, 2025 15:16
Signed-off-by: Andrew Duffy <andrew@a10y.dev>
@a10y a10y force-pushed the aduffy/simd-take-primitive branch from 48a33be to b9979e0 on July 1, 2025 15:56
@a10y
Contributor Author

a10y commented Jul 1, 2025

Some nice speedups for several of these

[screenshot: benchmark results]

There's a slight regression for f32 (~8%); that can be addressed pretty easily though.

@a10y a10y marked this pull request as ready for review July 1, 2025 16:04
a10y added 3 commits July 2, 2025 12:14
Signed-off-by: Andrew Duffy <andrew@a10y.dev>
Signed-off-by: Andrew Duffy <andrew@a10y.dev>
Signed-off-by: Andrew Duffy <andrew@a10y.dev>
@a10y a10y force-pushed the aduffy/simd-take-primitive branch from 4a1507f to a7fbac5 on July 2, 2025 14:23
Signed-off-by: Andrew Duffy <andrew@a10y.dev>
@a10y a10y force-pushed the aduffy/simd-take-primitive branch 2 times, most recently from 47fdc95 to 34bf1cc on July 2, 2025 15:57
Signed-off-by: Andrew Duffy <andrew@a10y.dev>
@a10y a10y force-pushed the aduffy/simd-take-primitive branch from 34bf1cc to f3c8da4 on July 2, 2025 16:17
@a10y a10y requested a review from robert3005 July 2, 2025 16:24
Contributor

@robert3005 robert3005 left a comment

I think that, apart from the mismatch between AVX2 and portable SIMD on when we can use SIMD, everything else looks good

Comment on lines 40 to 43
if values.ptype() != PType::F16
&& indices.dtype().is_unsigned_int()
&& indices.all_valid()?
&& values.all_valid()?
Contributor

You don't need these, f16 can be reinterpret-cast to u16 and back. We can adapt the logic from the portable_simd kernel that I made
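
A sketch of the reinterpret-cast idea (using the f32/u32 pair, since stable Rust has no `f16` type; function names are illustrative): take only moves bits, so floats can be routed through an integer kernel via `to_bits`/`from_bits`.

```rust
// Illustration of the reinterpret-cast trick with f32/u32 (stable Rust has no
// f16 type): run the integer take over raw bit patterns, then view the
// results as floats again. `take_u32` stands in for the integer kernel.
fn take_u32(values: &[u32], indices: &[usize]) -> Vec<u32> {
    indices.iter().map(|&i| values[i]).collect()
}

fn take_f32_via_bits(values: &[f32], indices: &[usize]) -> Vec<f32> {
    // Reinterpret floats as their bit patterns (a pure bit copy, no rounding).
    let bits: Vec<u32> = values.iter().map(|v| v.to_bits()).collect();
    // Gather with the integer kernel, then reinterpret back.
    take_u32(&bits, indices).into_iter().map(f32::from_bits).collect()
}

fn main() {
    assert_eq!(take_f32_via_bits(&[1.5, -2.25, 3.0], &[2, 0]), vec![3.0, 1.5]);
    println!("ok");
}
```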

Contributor Author

Only 32/64 bit values are eligible for the kernel for now. There is a way to extend it to types narrower than dword but it's complex.

I have updated the kernel to add impls for f32/f64 though and updated the test macro to generate test cases for them

Contributor

Ah I missed that part. Portable simd makes this look very easy

Signed-off-by: Andrew Duffy <andrew@a10y.dev>
@robert3005 robert3005 merged commit 105d6ab into develop Jul 3, 2025
53 of 54 checks passed
@robert3005 robert3005 deleted the aduffy/simd-take-primitive branch July 3, 2025 13:30
blaginin pushed a commit that referenced this pull request Jul 3, 2025
mwlon pushed a commit to mwlon/vortex that referenced this pull request Jul 4, 2025