perf: improve GenericByteBuilder::append_array to use SIMD for extending the offsets #8388
Conversation
Changing from:

```rust
let mut intermediate = Vec::<T>::with_capacity(offsets.len() - 1);

for &offset in &offsets[1..] {
    intermediate.push(offset + shift)
}
```

to:

```rust
let mut intermediate = vec![T::Offset::zero(); offsets.len() - 1];

for (index, &offset) in offsets[1..].iter().enumerate() {
    intermediate[index] = offset + shift;
}
```

improves the performance of concatenating byte arrays by 8% to 50% on my local machine:

```bash
concat str 1024         time:   [7.2598 µs 7.2772 µs 7.2957 µs]
                        change: [+12.545% +13.070% +13.571%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 6 outliers among 100 measurements (6.00%)
  4 (4.00%) high mild
  2 (2.00%) high severe

concat str nulls 1024   time:   [4.6791 µs 4.6895 µs 4.7010 µs]
                        change: [+23.206% +23.792% +24.425%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 13 outliers among 100 measurements (13.00%)
  5 (5.00%) high mild
  8 (8.00%) high severe

concat 1024 arrays str 4
                        time:   [45.018 µs 45.213 µs 45.442 µs]
                        change: [+6.4195% +8.7377% +11.279%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 13 outliers among 100 measurements (13.00%)
  6 (6.00%) high mild
  7 (7.00%) high severe

concat str 8192 over 100 arrays
                        time:   [3.7561 ms 3.7814 ms 3.8086 ms]
                        change: [+25.394% +26.833% +28.370%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 4 outliers among 100 measurements (4.00%)
  4 (4.00%) high mild

concat str nulls 8192 over 100 arrays
                        time:   [2.3144 ms 2.3269 ms 2.3403 ms]
                        change: [+51.533% +52.826% +54.109%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 8 outliers among 100 measurements (8.00%)
  6 (6.00%) high mild
  2 (2.00%) high severe
```

(Criterion reports "Performance has regressed" here because the comparison apparently runs from the new code back to the old, so the positive percentages are the old code's slowdown relative to the new.)

When looking at the assembly

> Used rustc 1.89.0 and compiler flags `-C opt-level=2 -C target-feature=+avx2 -C codegen-units=1` in [Godbolt](https://godbolt.org/)

you see that for the old code:

```rust
let mut intermediate = Vec::<T>::with_capacity(offsets.len() - 1);

for &offset in &offsets[1..] {
    intermediate.push(offset + shift)
}
```

the assembly for the loop is:

```asm
.LBB3_22:
        mov rbx, qword ptr [r13 + 8*rbp + 8]
        add rbx, r15
        cmp rbp, qword ptr [rsp]
        jne .LBB3_25
        mov rdi, rsp
        lea rsi, [rip + .Lanon.da681cffc384a5add117668a344b291b.6]
        call qword ptr [rip + alloc::raw_vec::RawVec<T,A>::grow_one::ha1b398ade64b0727@GOTPCREL]
        mov r14, qword ptr [rsp + 8]
        jmp .LBB3_25

.LBB3_25:
        mov qword ptr [r14 + 8*rbp], rbx
        inc rbp
        mov qword ptr [rsp + 16], rbp
        add r12, -8
        je .LBB3_9
```

and for the new code:

```rust
let mut intermediate = vec![T::Offset::zero(); offsets.len() - 1];

for (index, &offset) in offsets[1..].iter().enumerate() {
    intermediate[index] = offset + shift;
}
```

the assembly for the loop is:

```asm
.LBB2_21:
        vpaddq ymm1, ymm0, ymmword ptr [r12 + 8*rdx + 8]
        vpaddq ymm2, ymm0, ymmword ptr [r12 + 8*rdx + 40]
        vpaddq ymm3, ymm0, ymmword ptr [r12 + 8*rdx + 72]
        vpaddq ymm4, ymm0, ymmword ptr [r12 + 8*rdx + 104]
        vmovdqu ymmword ptr [rbx + 8*rdx], ymm1
        vmovdqu ymmword ptr [rbx + 8*rdx + 32], ymm2
        vmovdqu ymmword ptr [rbx + 8*rdx + 64], ymm3
        vmovdqu ymmword ptr [rbx + 8*rdx + 96], ymm4
        add rdx, 16
        cmp rax, rdx
        jne .LBB2_21
```

which uses SIMD instructions. The difference is that the `push` loop has to check capacity (and potentially call `grow_one`) on every iteration, which blocks auto-vectorization, while the indexed loop writes into a pre-sized buffer with no per-iteration branch, so LLVM unrolls it 4× into 256-bit `vpaddq`/`vmovdqu` operations.
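For comparison, the same shape can be written without the up-front zeroing: collecting from the slice iterator preallocates the exact length (the iterator's size is known), and the write loop again has no per-element capacity check, so it should vectorize the same way. A minimal sketch, concretized to `i64` with a hypothetical `shifted_offsets` helper (not the code in this PR):

```rust
/// Sketch: `collect` from a slice iterator preallocates the exact
/// length, so the loop body is a plain add-and-store that LLVM can
/// auto-vectorize, with no zero-initialization pass first.
fn shifted_offsets(offsets: &[i64], shift: i64) -> Vec<i64> {
    offsets[1..].iter().map(|&offset| offset + shift).collect()
}
```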
The code that I wrote in Godbolt:

For the old code:

```rust
#[inline(always)]
fn extend_offsets<T: std::ops::Add<Output = T> + Copy + Default>(
    output: &mut Vec<T>,
    offsets: &[T],
    next_offset: T,
) {
    assert_ne!(offsets.len(), 0);

    let shift: T = next_offset + offsets[0];

    let mut intermediate = Vec::<T>::with_capacity(offsets.len() - 1);

    // Make it easier to find the loop in the assembly
    let mut dummy = 0u64;
    unsafe {
        std::arch::asm!(
            "# VECTORIZED_START
            mov {}, 1",
            out(reg) dummy,
            options(nostack)
        );
    }

    for &offset in &offsets[1..] {
        intermediate.push(offset + shift)
    }

    // Make it easier to find the loop in the assembly
    unsafe {
        std::arch::asm!(
            "# VECTORIZED_END
            mov {}, 2",
            out(reg) dummy,
            options(nostack)
        );
    }
    std::hint::black_box(dummy);

    output.extend_from_slice(&intermediate);
}

#[no_mangle]
pub fn extend_offsets_usize(output: &mut Vec<usize>, offsets: &[usize], next_offset: usize) {
    extend_offsets(output, offsets, next_offset);
}
```

And for the new code:

```rust
#[inline(always)]
fn extend_offsets<T: std::ops::Add<Output = T> + Copy + Default>(
    output: &mut Vec<T>,
    offsets: &[T],
    next_offset: T,
) {
    assert_ne!(offsets.len(), 0);

    let shift: T = next_offset + offsets[0];

    let mut intermediate = vec![T::default(); offsets.len() - 1];

    // Make it easier to find the loop in the assembly
    let mut dummy = 0u64;
    unsafe {
        std::arch::asm!(
            "# VECTORIZED_START
            mov {}, 1",
            out(reg) dummy,
            options(nostack)
        );
    }

    for (index, &offset) in offsets[1..].iter().enumerate() {
        intermediate[index] = offset + shift
    }

    // Make it easier to find the loop in the assembly
    unsafe {
        std::arch::asm!(
            "# VECTORIZED_END
            mov {}, 2",
            out(reg) dummy,
            options(nostack)
        );
    }
    std::hint::black_box(dummy);

    output.extend_from_slice(&intermediate);
}

#[no_mangle]
pub fn extend_offsets_usize(output: &mut Vec<usize>, offsets: &[usize], next_offset: usize) {
    extend_offsets(output, offsets, next_offset);
}
```
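To reproduce this outside Godbolt, something like `rustc --crate-type=lib --emit=asm -C opt-level=2 -C target-feature=+avx2 -C codegen-units=1 extend_offsets.rs` (file name arbitrary) should produce an equivalent listing, and the `# VECTORIZED_START` / `# VECTORIZED_END` markers make the loop easy to locate in the generated `.s` file.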
Thanks @rluvaton -- I have scheduled a benchmark run.
🤖: Benchmark completed
Those are some pretty impressive results 👍 thank you @rluvaton
Is it possible that the benchmark is not running with `target-cpu=native`?
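(On x86_64 the default target already includes SSE2, so even without `+avx2` or `target-cpu=native` the new loop should still auto-vectorize, just with 128-bit `paddq` rather than the 256-bit `vpaddq` shown above; a non-native benchmark run would understate rather than erase the gains.)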
Force-pushed from ea5835d to 43e6317
I've updated the code to no longer use an intermediate buffer AND still use SIMD
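The final diff isn't inlined in this thread, but dropping the intermediate buffer while keeping SIMD presumably means extending the destination directly from an exact-size iterator. A hedged sketch of that shape, concretized to `i64` with a hypothetical `extend_shifted` helper:

```rust
/// Sketch: `Vec::extend` specializes on exact-size slice iterators,
/// reserving once up front; the copy loop then auto-vectorizes with
/// no intermediate allocation.
fn extend_shifted(output: &mut Vec<i64>, offsets: &[i64], shift: i64) {
    output.extend(offsets[1..].iter().map(|&offset| offset + shift));
}
```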
Force-pushed from 43e6317 to fd5a012
🤖: Benchmark completed
This is so good. Thank you @rluvaton
Which issue does this PR close?
N/A
Rationale for this change
Just making things faster
What changes are included in this PR?
Explained in the description above
Are these changes tested?
Existing tests
Are there any user-facing changes?
Nope