feat: migrate take_primitive_simd to stable AVX2 kernel #3579
robert3005 merged 9 commits into develop
Conversation
I wouldn't merge this and would instead leave the SIMD impl behind a feature flag. Portable SIMD probably gives us more than just an AVX2 impl?
```rust
let offset = chunk_idx * SIMD_WIDTH;

// Load the next 8 indices into a vector
let indices_vec = unsafe { _mm256_loadu_si256(indices.as_ptr().add(offset).cast()) };
```
Interestingly, tantivy uses the _mm256_lddqu_si256 intrinsic instead, which according to the Intel docs is very similar to _mm256_loadu_si256, and indeed this StackOverflow answer backs that up:

> There's no reason to ever use _mm256_lddqu_si256, consider it a synonym for _mm256_loadu_si256. lddqu only exists for historical reasons as x86 evolved towards having better unaligned vector load support, and CPUs that support the AVX version run them identically. There's no AVX512 version.
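For what it's worth, here's a small sketch (mine, not from the PR) that demonstrates the claim: on an AVX2 CPU both intrinsics perform the same unaligned 32-byte load.

```rust
#[cfg(target_arch = "x86_64")]
#[target_feature(enable = "avx2")]
unsafe fn loads_agree(bytes: &[u8; 32]) -> bool {
    use std::arch::x86_64::*;
    let a = _mm256_loadu_si256(bytes.as_ptr().cast());
    let b = _mm256_lddqu_si256(bytes.as_ptr().cast());
    // All 32 byte lanes equal => movemask is all ones (-1 as i32).
    _mm256_movemask_epi8(_mm256_cmpeq_epi8(a, b)) == -1
}
```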
@robert3005 I probably don't have as good an intuition about this as you or Alex, but AFAICT the most beneficial part of the existing portable_simd implementation is the gather operation, which exists on AVX2 but has no equivalent on NEON. Similarly, from other things I've read, AVX-512 generally executes a 512-bit load as two instructions instead of one, so the speedup is not really 2x, it's something short of that. I think that might be why things like Tantivy, which implement direct SIMD support, only have AVX2 kernels and don't bother with AVX-512.
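For context, the gather under discussion looks roughly like this; a minimal sketch (function name and shape are mine, not the PR's), assuming the caller has verified AVX2 support and that all indices are in bounds:

```rust
#[cfg(target_arch = "x86_64")]
#[target_feature(enable = "avx2")]
unsafe fn gather8_i32(values: *const i32, idx: &[i32; 8]) -> [i32; 8] {
    use std::arch::x86_64::*;
    // Load 8 indices, then gather 8 values in a single instruction.
    let offsets = _mm256_loadu_si256(idx.as_ptr().cast());
    // SCALE = 4: offsets are scaled by 4 bytes per i32 element.
    let gathered = _mm256_i32gather_epi32::<4>(values, offsets);
    let mut out = [0i32; 8];
    _mm256_storeu_si256(out.as_mut_ptr().cast(), gathered);
    out
}
```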
I'm also open to just shoving the existing impl behind some nightly-only feature flag, if that's possible. I was just hoping that if this were valuable we could preserve it for everyone, since it is doable.
Ok, AVX2 gather is probably the only widely used SIMD implementation of this function, so it might be worth having a stable version.
Are there any benchmarks for this?
This is covered by https://github.com/vortex-data/vortex/blob/develop/encodings/dict/benches/dict_compress.rs
Yep, this is only about gather. It would be interesting, though, to also compare perf on macOS between portable SIMD and non-SIMD, to double-check that moving away from portable SIMD doesn't introduce a regression there. The assumption is that it shouldn't, since there's no gather equivalent on NEON.
CodSpeed Performance Report
Merging #3579 will improve performance by 35.48%.
Summary
Benchmarks breakdown
Interesting that i64 improved.
```rust
#[cfg(target_arch = "x86_64")]
#[target_feature(enable = "avx2")]
unsafe fn take_u8_i64_avx2(indices: &[u8], values: &[i64]) -> Buffer<i64> {
    const SIMD_WIDTH: usize = 4; // 256 bits / 32 bits per element
```
should say 64 bits per element
Alright, I think I've convinced myself that this PR adds value. One important thing to note: the old portable_simd implementation did no bounds checking on the indices, so the new kernel has to scan them upfront. I did two CodSpeed runs:
Note that in Run 1, all of the decode_primitives benchmarks regress considerably. In Run 2, a few regress (fewer than in Run 1), but the majority speed up by up to 20% despite the added bounds checking.
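For reference, the upfront bounds check described in the PR description could look something like this sketch (function names are hypothetical); the scan has to happen before the kernel runs, because the gather itself is unchecked:

```rust
/// Returns true iff every index is a valid offset into `values_len` elements.
fn indices_in_bounds(indices: &[u32], values_len: usize) -> bool {
    // A single pass; a simple reduction like this auto-vectorizes well.
    indices.iter().all(|&i| (i as usize) < values_len)
}

fn take_i32(indices: &[u32], values: &[i32]) -> Option<Vec<i32>> {
    if !indices_in_bounds(indices, values.len()) {
        return None; // would be an error / scalar fallback in the real kernel
    }
    // ...safe to dispatch to the unchecked SIMD gather loop here...
    Some(indices.iter().map(|&i| values[i as usize]).collect())
}
```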
```rust
    };
}

impl_gather!(u8,
```
We should implement f32 and f64 as value types since there are relevant gather instructions for them. This PR is probably big enough to review already, so I'd prefer to do that in a follow-up.
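Those float gathers do slot into the same loop shape; a hedged sketch (my naming, not the PR's), assuming AVX2 support and pre-validated indices:

```rust
#[cfg(target_arch = "x86_64")]
#[target_feature(enable = "avx2")]
unsafe fn gather8_f32(values: *const f32, idx: &[i32; 8]) -> [f32; 8] {
    use std::arch::x86_64::*;
    let offsets = _mm256_loadu_si256(idx.as_ptr().cast());
    // _mm256_i32gather_ps is the f32 analogue of _mm256_i32gather_epi32.
    let gathered = _mm256_i32gather_ps::<4>(values, offsets); // scale = 4 bytes
    let mut out = [0f32; 8];
    _mm256_storeu_ps(out.as_mut_ptr(), gathered);
    out
}
```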
```rust
/// AVX2 version of GatherFn defined for 32- and 64-bit value types.
enum AVX2Gather {}

unsafe fn identity<T>(input: T) -> T {
```
Some of the impls don't need an extend operation (when the indices and values are the same size), so we use this, and it should get optimized away.
use std::convert::identity instead?
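That swap should work even if the kernel stores the extend step as an unsafe fn pointer, since safe fn items coerce to unsafe fn pointers; a minimal sketch with a hypothetical ExtendFn alias:

```rust
use std::convert::identity;

// Hypothetical: the kernel's extend step stored as an unsafe fn pointer.
type ExtendFn = unsafe fn(u32) -> u32;

fn main() {
    // fn(u32) -> u32 coerces implicitly to unsafe fn(u32) -> u32.
    let extend: ExtendFn = identity::<u32>;
    assert_eq!(unsafe { extend(42) }, 42);
}
```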
Hmm, I'm not seeing the nice improvements anymore. I'd like to merge #3653 first so that we have an apples-to-apples comparison.
robert3005 left a comment
I think that apart from the mismatch between AVX2 and portable SIMD on when we can use SIMD, everything else looks good.
```rust
if values.ptype() != PType::F16
    && indices.dtype().is_unsigned_int()
    && indices.all_valid()?
    && values.all_valid()?
```
You don't need these; f16 can be reinterpret-cast to u16 and back. We can adapt the logic from the portable_simd kernel that I made.
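A sketch of that reinterpret-cast idea, shown with f32 <-> u32 since stable Rust has no native f16 type: take() never inspects value bits, so floats can be viewed as same-width integers, run through the integer kernel, and viewed back.

```rust
fn take_f32_via_bits(indices: &[u32], values: &[f32]) -> Vec<f32> {
    // Reinterpret each f32 as its u32 bit pattern (no numeric conversion).
    let bits: Vec<u32> = values.iter().map(|v| v.to_bits()).collect();
    // Stand-in for the integer gather kernel:
    let gathered: Vec<u32> = indices.iter().map(|&i| bits[i as usize]).collect();
    // View the gathered bits as f32 again.
    gathered.into_iter().map(f32::from_bits).collect()
}
```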
Only 32/64-bit values are eligible for the kernel for now. There is a way to extend it to types narrower than a dword, but it's complex.
I have updated the kernel to add impls for f32/f64, though, and updated the test macro to generate test cases for them.
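The test macro mentioned here plausibly has a shape like this sketch (entirely hypothetical names; a scalar take stands in for the kernel as the oracle):

```rust
macro_rules! take_tests {
    ($($name:ident: $ty:ty),* $(,)?) => {
        $(
            #[test]
            fn $name() {
                let values: Vec<$ty> = (0..64).map(|i| i as $ty).collect();
                let indices: [u32; 4] = [3, 0, 63, 17];
                // Scalar reference implementation standing in for the kernel:
                let taken: Vec<$ty> =
                    indices.iter().map(|&i| values[i as usize]).collect();
                assert_eq!(taken, vec![3 as $ty, 0 as $ty, 63 as $ty, 17 as $ty]);
            }
        )*
    };
}

take_tests!(take_i32: i32, take_i64: i64, take_f32: f32, take_f64: f64);
```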
Ah, I missed that part. Portable SIMD makes this look very easy.
An implementation of TakeKernel for PrimitiveArray that uses explicit AVX2 instructions, falling back to Scalar. For non-x86_64 platforms the `portable_simd` implementation is still loaded if a nightly compiler is being used.

This is part of the #3546 series of PRs.

Additionally: fixed a soundness issue with the lack of bounds checking of indices in the portable_simd impl. This requires us to do a full scan of the indices upfront, before running the kernel, to avoid out-of-bounds memory access.

## Implementation

The biggest source of added complexity in this PR is the new `avx2` module, which implements a take kernel for primitive indices/values using the AVX2 GATHER operation. The Intel ISA provides 4 different 256-bit gather intrinsics:

- `_mm256_i32gather_epi32` -> gathers 8x 32-bit values with 32-bit indices
- `_mm256_i32gather_epi64` -> gathers 4x 64-bit values with 32-bit indices
- `_mm256_i64gather_epi32` -> gathers 4x 32-bit values with 64-bit indices
- `_mm256_i64gather_epi64` -> gathers 4x 64-bit values with 64-bit indices

We implement a generic inner loop with a trait parameter `GatherFn<I, V>`, and allow specialization to insert the proper loop logic for each valid index/value type combination.
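To make the pattern concrete, here is a minimal sketch of that specialization scheme; the trait shape and method names are assumptions, not the PR's exact definitions:

```rust
#[cfg(target_arch = "x86_64")]
mod gather_sketch {
    use std::arch::x86_64::*;

    /// One SIMD-width gather step for index type I and value type V.
    trait GatherFn<I, V> {
        const WIDTH: usize;
        /// Safety: caller guarantees AVX2 support and WIDTH in-bounds
        /// elements readable/writable at each of the three pointers.
        unsafe fn gather(indices: *const I, values: *const V, out: *mut V);
    }

    /// AVX2 specialization; uninhabited because it's used purely at the
    /// type level to select the right loop body.
    enum AVX2Gather {}

    #[target_feature(enable = "avx2")]
    unsafe fn gather8_i32(indices: *const i32, values: *const i32, out: *mut i32) {
        let idx = _mm256_loadu_si256(indices.cast());
        let got = _mm256_i32gather_epi32::<4>(values, idx); // scale = 4 bytes
        _mm256_storeu_si256(out.cast(), got);
    }

    impl GatherFn<i32, i32> for AVX2Gather {
        const WIDTH: usize = 8; // 256 bits / 32 bits per lane

        unsafe fn gather(indices: *const i32, values: *const i32, out: *mut i32) {
            gather8_i32(indices, values, out)
        }
    }
}
```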