
feat: migrate take_primitive_simd to stable AVX2 kernel#3579

Merged
robert3005 merged 9 commits into develop from aduffy/simd-take-primitive
Jul 3, 2025

Conversation

@a10y
Contributor

@a10y a10y commented Jun 19, 2025

An implementation of TakeKernel for PrimitiveArray that uses explicit AVX2 instructions, falling back to the scalar path. On non-x86_64 platforms the portable_simd implementation is still loaded when a nightly compiler is in use.

This is part of the #3546 series of PRs.

Additionally: fixed a soundness issue caused by a lack of bounds checking of indices in the portable_simd impl. This requires a full upfront scan of the indices before running the kernel to avoid out-of-bounds memory access.
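
A minimal sketch of that upfront scan (the `take_checked` name and scalar body are illustrative, not the actual Vortex kernel): all indices are validated before any gather runs, so the hot loop itself needs no per-element checks.

```rust
// Illustrative sketch: validate every index before running the kernel, so the
// (here scalar) gather loop can assume all indices are in bounds.
// `take_checked` is a hypothetical name, not the actual Vortex API.
fn take_checked(values: &[i64], indices: &[u32]) -> Option<Vec<i64>> {
    let len = values.len();
    // Full upfront scan: one out-of-range index rejects the whole batch.
    if indices.iter().any(|&idx| idx as usize >= len) {
        return None;
    }
    // Safe to gather without further bounds checks.
    Some(indices.iter().map(|&idx| values[idx as usize]).collect())
}

fn main() {
    assert_eq!(take_checked(&[10, 20, 30], &[2, 0]), Some(vec![30, 10]));
    assert_eq!(take_checked(&[10, 20, 30], &[3]), None);
    println!("ok");
}
```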

Implementation

The biggest source of added complexity in this PR is the new avx2 module which implements a take kernel for primitive indices/values that uses the AVX2 GATHER operation.

The Intel ISA provides four 256-bit gather intrinsics:

  • _mm256_i32gather_epi32 -> gathers 8x 32-bit values with 32-bit indices
  • _mm256_i32gather_epi64 -> gathers 4x 64-bit values with 32-bit indices
  • _mm256_i64gather_epi32 -> gathers 4x 32-bit values with 64-bit indices
  • _mm256_i64gather_epi64 -> gathers 4x 64-bit values with 64-bit indices

We implement a generic inner loop with a trait parameter GatherFn<I, V>, and allow specialization to insert the proper loop logic for each valid index/value type combination.
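
The generic-inner-loop idea can be sketched roughly like this (a scalar stand-in for the AVX2 specializations; trait and function names here are illustrative, not the PR's exact code):

```rust
// Rough sketch of the GatherFn<I, V> shape: the driver loop is generic, and a
// zero-sized type selects the inner-loop body per index/value combination.
// In the PR the specializations wrap AVX2 gathers; here a scalar stand-in.
trait GatherFn<I, V> {
    /// Gather `values[i]` for one chunk of indices, appending to `out`.
    fn gather(indices: &[I], values: &[V], out: &mut Vec<V>);
}

struct ScalarGather;

impl GatherFn<u32, i64> for ScalarGather {
    fn gather(indices: &[u32], values: &[i64], out: &mut Vec<i64>) {
        out.extend(indices.iter().map(|&i| values[i as usize]));
    }
}

/// Generic driver: chunk the indices and dispatch to the chosen gather impl.
fn take<G: GatherFn<u32, i64>>(indices: &[u32], values: &[i64]) -> Vec<i64> {
    const CHUNK: usize = 8; // e.g. one 256-bit vector of u32 indices
    let mut out = Vec::with_capacity(indices.len());
    for chunk in indices.chunks(CHUNK) {
        G::gather(chunk, values, &mut out);
    }
    out
}

fn main() {
    assert_eq!(take::<ScalarGather>(&[1, 0, 2], &[5, 6, 7]), vec![6, 5, 7]);
    println!("ok");
}
```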

@github-actions github-actions bot added the changelog/feature A new feature label Jun 19, 2025
@robert3005
Contributor

I wouldn't merge this and would instead leave the SIMD impl behind a feature flag. Portable SIMD probably gives us more than just an AVX2 impl?

let offset = chunk_idx * SIMD_WIDTH;

// Load the next 8 indices into a vector
let indices_vec = unsafe { _mm256_loadu_si256(indices.as_ptr().add(offset).cast()) };
Contributor Author

interestingly, tantivy uses the _mm256_lddqu_si256 intrinsic instead; the Intel docs indicate it's very similar to _mm256_loadu_si256, and this StackOverflow answer backs that up:

> There's no reason to ever use _mm256_lddqu_si256, consider it a synonym for _mm256_loadu_si256. lddqu only exists for historical reasons as x86 evolved towards having better unaligned vector load support, and CPUs that support the AVX version run them identically. There's no AVX512 version.

@a10y
Contributor Author

a10y commented Jun 20, 2025

@robert3005 I probably don't have as good intuition about this as you or Alex, but AFAICT, the most beneficial part of the existing portable_simd implementation is the gather operation, which exists on AVX2 but has no equivalent on NEON.

Similarly, from other things I've read, AVX-512 generally executes a 512-bit load as two instructions instead of one, so the speedup is not really 2x, it's something short of that. I think that might be why things like Tantivy that implement direct SIMD support only have AVX2 kernels and don't bother with AVX-512.

@a10y
Contributor Author

a10y commented Jun 20, 2025

I'm also open to just shoving the existing impl behind some nightly-only feature flag, if that's possible. I was just hoping that if this were valuable we could preserve it for everyone, since it is doable.

@robert3005
Contributor

Ok, AVX2 gather is probably the only widespread SIMD implementation of this function, so it might be worth having a stable version

@a10y a10y changed the title feat: rewrite take_primitive_simd from portable_simd -> avx2 feat: migrate take_primitive_simd to stable AVX2 kernel Jun 20, 2025
@joseph-isaacs
Contributor

Are there any benchmarks for this?

@0ax1
Contributor

0ax1 commented Jun 20, 2025

> Are there any benchmarks for this?

This is covered by https://github.com/vortex-data/vortex/blob/develop/encodings/dict/benches/dict_compress.rs

@0ax1
Contributor

0ax1 commented Jun 20, 2025

> @robert3005 I probably don't have as good intuition about this as you or Alex, but AFAICT, the most beneficial part of the existing portable_simd implementation is the gather operation, which exists on AVX2 but has no equivalent on NEON.
>
> Similarly, from other things I've read, AVX512 generally executes a 512-bit load as two instructions instead of one, so the speedup is not really 2x it's something short of that. I think that might be why things like Tantivy which implement direct SIMD support only have avx2 kernels and don't bother with avx512.

Yep, this is only about gather. Would be interesting though to also compare perf on macOS between portable simd and non-SIMD to double check that moving away from portable SIMD doesn't introduce a regression there. So the assumption is there shouldn't be, as there's no gather equivalent for NEON.

@codspeed-hq

codspeed-hq bot commented Jun 20, 2025

CodSpeed Performance Report

Merging #3579 will improve performances by 35.48%

Comparing aduffy/simd-take-primitive (f694309) with develop (8b184c2)

Summary

⚡ 11 improvements
✅ 782 untouched benchmarks

Benchmarks breakdown

| Benchmark | BASE | HEAD | Change |
| --- | --- | --- | --- |
| decode_primitives[f32, (10000, 128)] | 77.2 µs | 57.3 µs | +34.59% |
| decode_primitives[f32, (10000, 2)] | 76.9 µs | 57 µs | +34.93% |
| decode_primitives[f32, (10000, 32)] | 76.7 µs | 57.1 µs | +34.36% |
| decode_primitives[f32, (10000, 4)] | 76.9 µs | 57.7 µs | +33.35% |
| decode_primitives[f32, (10000, 512)] | 82.1 µs | 62.4 µs | +31.59% |
| decode_primitives[f32, (10000, 8)] | 76.8 µs | 56.7 µs | +35.48% |
| decode_primitives[i64, (10000, 128)] | 100.7 µs | 79.7 µs | +26.48% |
| decode_primitives[i64, (10000, 2)] | 100.1 µs | 79.9 µs | +25.31% |
| decode_primitives[i64, (10000, 32)] | 100 µs | 79.6 µs | +25.64% |
| decode_primitives[i64, (10000, 4)] | 100.3 µs | 79.6 µs | +26.13% |
| decode_primitives[i64, (10000, 8)] | 100.6 µs | 79.2 µs | +27.08% |

@robert3005
Contributor

Interesting that i64 improved

@a10y
Contributor Author

a10y commented Jun 20, 2025

That's because in the latest commit I hand-rolled a kernel for u8 indices and i64 values, so now the AVX pathway is actually being hit

[screenshot]

#[cfg(target_arch = "x86_64")]
#[target_feature(enable = "avx2")]
unsafe fn take_u8_i64_avx2(indices: &[u8], values: &[i64]) -> Buffer<i64> {
const SIMD_WIDTH: usize = 4; // 256 bits / 32 bits per element
Contributor Author

should say 64 bits per element

@a10y a10y force-pushed the aduffy/simd-take-primitive branch 9 times, most recently from 5b6cbbd to 4244f55 on June 25, 2025 19:58
@a10y
Copy link
Contributor Author

a10y commented Jun 25, 2025

Alright, I think I've convinced myself that this PR adds value.

One important thing to note: the old portable_simd implementation was unsound, because it failed to bounds-check the indices before issuing the gather instructions, meaning that you could very easily trigger unsafe memory access. So no matter what, bounds checking needs to be added, and that will undoubtedly hurt performance a bit relative to the baseline numbers.

I did two CodSpeed runs:

  1. Run 1: portable_simd Kernel only with bounds check added: https://codspeed.io/vortex-data/vortex/runs/compare/685c2abfbd0c18c2abcc32be..685c56398a5bab413a016f8d
  2. Run 2: Using AVX2 with bounds check added: https://codspeed.io/vortex-data/vortex/runs/compare/685c2abfbd0c18c2abcc32be..685c4aa98a5bab413a016f07

Note that in Run 1, all of the decode_primitives benchmarks regress considerably. In Run 2, a few regress (fewer than in Run 1), but the majority speed up by up to 20% despite the added bounds checking.

};
}

impl_gather!(u8,
Contributor Author

we should implement f32 and f64 as value types since there are relevant gather instructions for them. This PR is probably enough to review as-is, so I'd prefer to do that in a follow-up

/// AVX2 version of GatherFn defined for 32- and 64-bit value types.
enum AVX2Gather {}

unsafe fn identity<T>(input: T) -> T {
Contributor Author

some of the impls don't need an extend operation (when the indices and values are the same size), so we use this; it should get optimized away

Contributor

use std::convert::identity instead?
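
As a sketch of the point above (the `apply_extend` helper is hypothetical, not the PR's code): when index and value widths match, the extend slot can be filled with the stdlib no-op `std::convert::identity`, which the optimizer erases.

```rust
use std::convert::identity;

// Hypothetical sketch of the "extend" slot: a generic pipeline takes an
// extend function; same-width index/value pairs plug in the stdlib no-op.
fn apply_extend<T, U>(x: T, extend: impl Fn(T) -> U) -> U {
    extend(x)
}

fn main() {
    // Same-width path: extend is a no-op and should optimize away.
    assert_eq!(apply_extend(42u32, identity), 42u32);
    // Narrow-index path: zero-extend u32 indices to 64 bits.
    assert_eq!(apply_extend(42u32, u64::from), 42u64);
    println!("ok");
}
```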

@a10y
Contributor Author

a10y commented Jun 25, 2025

Hmm, I'm not seeing the nice improvements anymore. I'd like to merge #3653 first so that we have an apples-to-apples comparison

@a10y a10y force-pushed the aduffy/simd-take-primitive branch from cee456f to bdfd0f1 on June 26, 2025 14:03
@a10y a10y force-pushed the aduffy/simd-take-primitive branch from bdfd0f1 to 4f2e4d6 on June 26, 2025 14:07
Signed-off-by: Andrew Duffy <andrew@a10y.dev>
@a10y a10y force-pushed the aduffy/simd-take-primitive branch from 4f2e4d6 to 03f9aeb on July 1, 2025 14:56
Signed-off-by: Andrew Duffy <andrew@a10y.dev>
@a10y a10y force-pushed the aduffy/simd-take-primitive branch from 03f9aeb to 9336041 on July 1, 2025 15:16
Signed-off-by: Andrew Duffy <andrew@a10y.dev>
@a10y a10y force-pushed the aduffy/simd-take-primitive branch from 48a33be to b9979e0 on July 1, 2025 15:56
@a10y
Contributor Author

a10y commented Jul 1, 2025

Some nice speedups for several of these

[screenshot: benchmark results]

There's a slight regression for f32 (~8%); that can be addressed pretty easily though.

@a10y a10y marked this pull request as ready for review July 1, 2025 16:04
a10y added 3 commits July 2, 2025 12:14
Signed-off-by: Andrew Duffy <andrew@a10y.dev>
Signed-off-by: Andrew Duffy <andrew@a10y.dev>
Signed-off-by: Andrew Duffy <andrew@a10y.dev>
@a10y a10y force-pushed the aduffy/simd-take-primitive branch from 4a1507f to a7fbac5 on July 2, 2025 14:23
Signed-off-by: Andrew Duffy <andrew@a10y.dev>
@a10y a10y force-pushed the aduffy/simd-take-primitive branch 2 times, most recently from 47fdc95 to 34bf1cc on July 2, 2025 15:57
Signed-off-by: Andrew Duffy <andrew@a10y.dev>
@a10y a10y force-pushed the aduffy/simd-take-primitive branch from 34bf1cc to f3c8da4 on July 2, 2025 16:17
@a10y a10y requested a review from robert3005 July 2, 2025 16:24
Contributor

@robert3005 robert3005 left a comment

I think that, apart from the mismatch between AVX2 and portable SIMD on when we can use SIMD, everything else looks good

Comment on lines 40 to 43
if values.ptype() != PType::F16
&& indices.dtype().is_unsigned_int()
&& indices.all_valid()?
&& values.all_valid()?
Contributor

You don't need these, f16 can be reinterpret-cast to u16 and back. We can adapt the logic from the portable_simd kernel that I made
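
A sketch of the reinterpret-cast idea (using the f32/u32 pair, since stable Rust has no `f16` type; function names are illustrative): take only moves bits, so floats can be routed through an integer kernel via `to_bits`/`from_bits`.

```rust
// Illustration of the reinterpret-cast trick with f32/u32 (stable Rust has no
// f16 type): run the integer take over raw bit patterns, then view the
// results as floats again. `take_u32` stands in for the integer kernel.
fn take_u32(values: &[u32], indices: &[usize]) -> Vec<u32> {
    indices.iter().map(|&i| values[i]).collect()
}

fn take_f32_via_bits(values: &[f32], indices: &[usize]) -> Vec<f32> {
    // Reinterpret floats as their bit patterns (a pure bit copy, no rounding).
    let bits: Vec<u32> = values.iter().map(|v| v.to_bits()).collect();
    // Gather with the integer kernel, then reinterpret back.
    take_u32(&bits, indices).into_iter().map(f32::from_bits).collect()
}

fn main() {
    assert_eq!(take_f32_via_bits(&[1.5, -2.25, 3.0], &[2, 0]), vec![3.0, 1.5]);
    println!("ok");
}
```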

Contributor Author

Only 32/64 bit values are eligible for the kernel for now. There is a way to extend it to types narrower than dword but it's complex.

I have updated the kernel to add impls for f32/f64 though and updated the test macro to generate test cases for them

Contributor

Ah I missed that part. Portable simd makes this look very easy

Signed-off-by: Andrew Duffy <andrew@a10y.dev>
@robert3005 robert3005 merged commit 105d6ab into develop Jul 3, 2025
53 of 54 checks passed
@robert3005 robert3005 deleted the aduffy/simd-take-primitive branch July 3, 2025 13:30
blaginin pushed a commit that referenced this pull request Jul 3, 2025
mwlon pushed a commit to mwlon/vortex that referenced this pull request Jul 4, 2025