Conversation

rluvaton
Member

@rluvaton rluvaton commented Sep 18, 2025

Which issue does this PR close?

N/A

Rationale for this change

Just making things faster

What changes are included in this PR?

Explained below

Are these changes tested?

Existing tests

Are there any user-facing changes?

Nope


Changing from:

```rust
let mut intermediate = Vec::with_capacity(offsets.len() - 1);

for &offset in &offsets[1..] {
    intermediate.push(offset + shift)
}

self.offsets_builder.extend_from_slice(&intermediate);
```

to:

```rust
self.offsets_builder.extend(
    offsets[..offsets.len() - 1]
        .iter()
        .map(|&offset| offset + shift),
);
```

When looking at the assembly

> Used rustc 1.89.0 and compiler flags `-C opt-level=2 -C target-feature=+avx2 -C codegen-units=1` in [godbolt](https://godbolt.org/)

you see that for the old code:

```rust
let mut intermediate = Vec::<T>::with_capacity(offsets.len() - 1);
for &offset in &offsets[1..] {
    intermediate.push(offset + shift)
}
```

the assembly for the loop is:

```asm
.LBB3_22:
        mov     rbx, qword ptr [r13 + 8*rbp + 8]
        add     rbx, r15
        cmp     rbp, qword ptr [rsp]
        jne     .LBB3_25
        mov     rdi, rsp
        lea     rsi, [rip + .Lanon.da681cffc384a5add117668a344b291b.6]
        call    qword ptr [rip + alloc::raw_vec::RawVec<T,A>::grow_one::ha1b398ade64b0727@GOTPCREL]
        mov     r14, qword ptr [rsp + 8]
        jmp     .LBB3_25

.LBB3_25:
        mov     qword ptr [r14 + 8*rbp], rbx
        inc     rbp
        mov     qword ptr [rsp + 16], rbp
        add     r12, -8
        je      .LBB3_9
```

and for the new code:

```rust
self.offsets_builder.extend(
    offsets[..offsets.len() - 1]
        .iter()
        .map(|&offset| offset + shift),
);
```

the assembly for the loop is:

```asm
.LBB2_7:
        vpaddq  ymm1, ymm0, ymmword ptr [r14 + 8*r9]
        vpaddq  ymm2, ymm0, ymmword ptr [r14 + 8*r9 + 32]
        vpaddq  ymm3, ymm0, ymmword ptr [r14 + 8*r9 + 64]
        vpaddq  ymm4, ymm0, ymmword ptr [r14 + 8*r9 + 96]
        vmovdqu ymmword ptr [r8 + 8*r9 - 96], ymm1
        vmovdqu ymmword ptr [r8 + 8*r9 - 64], ymm2
        vmovdqu ymmword ptr [r8 + 8*r9 - 32], ymm3
        vmovdqu ymmword ptr [r8 + 8*r9], ymm4
        add     r9, 16
        cmp     rdx, r9
        jne     .LBB2_7
        cmp     rbx, rdx
        je      .LBB2_12
```

which uses SIMD instructions.

The code that I wrote in Godbolt:

For the old code:

```rust
#[inline(always)]
fn extend_offsets<T: std::ops::Add<Output = T> + Copy + Default>(output: &mut Vec<T>, offsets: &[T], next_offset: T) {
    assert_ne!(offsets.len(), 0);
    let shift: T = next_offset + offsets[0];

    let mut intermediate = Vec::<T>::with_capacity(offsets.len() - 1);

    // Make it easier to find the loop in the assembly
    let mut dummy = 0u64;
    unsafe {
        std::arch::asm!(
            "# VECTORIZED_START
             mov {}, 1",
            out(reg) dummy,
            options(nostack)
        );
    }

    for &offset in &offsets[1..] {
        intermediate.push(offset + shift)
    }

    // Make it easier to find the loop in the assembly
    unsafe {
        std::arch::asm!(
            "# VECTORIZED_END
             mov {}, 2",
            out(reg) dummy,
            options(nostack)
        );
    }
    std::hint::black_box(dummy);

    output.extend_from_slice(&intermediate);
}

#[no_mangle]
pub fn extend_offsets_usize(output: &mut Vec<usize>, offsets: &[usize], next_offset: usize) {
  extend_offsets(output, offsets, next_offset);
}
```

And for the new code:

```rust
#[inline(always)]
fn extend_offsets<T: std::ops::Add<Output = T> + Copy + Default>(output: &mut Vec<T>, offsets: &[T], next_offset: T) {
    assert_ne!(offsets.len(), 0);

    let shift: T = next_offset + offsets[0];
    output.extend(offsets[..(offsets.len() - 1)]
        .iter()
        .map(|&offset| offset + shift));
}

#[no_mangle]
pub fn extend_offsets_usize(output: &mut Vec<usize>, offsets: &[usize], next_offset: usize) {
  extend_offsets(output, offsets, next_offset);
}
```

…ing the offsets

Changing from:
```rust
let mut intermediate = Vec::<T>::with_capacity(offsets.len() - 1);
for &offset in &offsets[1..] {
    intermediate.push(offset + shift)
}
```

to:
```rust
let mut intermediate = vec![T::Offset::zero(); offsets.len() - 1];
for (index, &offset) in offsets[1..].iter().enumerate() {
    intermediate[index] = offset + shift;
}
```

improves the performance of concatenating byte arrays by 8% to 50% on my local machine:

```bash
concat str 1024         time:   [7.2598 µs 7.2772 µs 7.2957 µs]
                        change: [+12.545% +13.070% +13.571%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 6 outliers among 100 measurements (6.00%)
  4 (4.00%) high mild
  2 (2.00%) high severe

concat str nulls 1024   time:   [4.6791 µs 4.6895 µs 4.7010 µs]
                        change: [+23.206% +23.792% +24.425%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 13 outliers among 100 measurements (13.00%)
  5 (5.00%) high mild
  8 (8.00%) high severe

concat 1024 arrays str 4
                        time:   [45.018 µs 45.213 µs 45.442 µs]
                        change: [+6.4195% +8.7377% +11.279%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 13 outliers among 100 measurements (13.00%)
  6 (6.00%) high mild
  7 (7.00%) high severe

concat str 8192 over 100 arrays
                        time:   [3.7561 ms 3.7814 ms 3.8086 ms]
                        change: [+25.394% +26.833% +28.370%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 4 outliers among 100 measurements (4.00%)
  4 (4.00%) high mild

concat str nulls 8192 over 100 arrays
                        time:   [2.3144 ms 2.3269 ms 2.3403 ms]
                        change: [+51.533% +52.826% +54.109%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 8 outliers among 100 measurements (8.00%)
  6 (6.00%) high mild
  2 (2.00%) high severe
```

When looking at the assembly
> Used rustc 1.89.0 and compiler flags `-C opt-level=2 -C target-feature=+avx2 -C codegen-units=1` in [godbolt](https://godbolt.org/)

you see that for the old code:
```rust
let mut intermediate = Vec::<T>::with_capacity(offsets.len() - 1);
for &offset in &offsets[1..] {
    intermediate.push(offset + shift)
}
```

the assembly for the loop is:
```asm
.LBB3_22:
        mov     rbx, qword ptr [r13 + 8*rbp + 8]
        add     rbx, r15
        cmp     rbp, qword ptr [rsp]
        jne     .LBB3_25
        mov     rdi, rsp
        lea     rsi, [rip + .Lanon.da681cffc384a5add117668a344b291b.6]
        call    qword ptr [rip + alloc::raw_vec::RawVec<T,A>::grow_one::ha1b398ade64b0727@GOTPCREL]
        mov     r14, qword ptr [rsp + 8]
        jmp     .LBB3_25

.LBB3_25:
        mov     qword ptr [r14 + 8*rbp], rbx
        inc     rbp
        mov     qword ptr [rsp + 16], rbp
        add     r12, -8
        je      .LBB3_9
```

and for the new code:
```rust
let mut intermediate = vec![T::Offset::zero(); offsets.len() - 1];
for (index, &offset) in offsets[1..].iter().enumerate() {
    intermediate[index] = offset + shift;
}
```

the assembly for the loop is:
```asm
.LBB2_21:
        vpaddq  ymm1, ymm0, ymmword ptr [r12 + 8*rdx + 8]
        vpaddq  ymm2, ymm0, ymmword ptr [r12 + 8*rdx + 40]
        vpaddq  ymm3, ymm0, ymmword ptr [r12 + 8*rdx + 72]
        vpaddq  ymm4, ymm0, ymmword ptr [r12 + 8*rdx + 104]
        vmovdqu ymmword ptr [rbx + 8*rdx], ymm1
        vmovdqu ymmword ptr [rbx + 8*rdx + 32], ymm2
        vmovdqu ymmword ptr [rbx + 8*rdx + 64], ymm3
        vmovdqu ymmword ptr [rbx + 8*rdx + 96], ymm4
        add     rdx, 16
        cmp     rax, rdx
        jne     .LBB2_21
```

which uses SIMD instructions.

The code that I wrote in Godbolt:

For the old code:
```rust
#[inline(always)]
fn extend_offsets<T: std::ops::Add<Output = T> + Copy + Default>(output: &mut Vec<T>, offsets: &[T], next_offset: T) {
    assert_ne!(offsets.len(), 0);
    let shift: T = next_offset + offsets[0];

    let mut intermediate = Vec::<T>::with_capacity(offsets.len() - 1);

    // Make it easier to find the loop in the assembly
    let mut dummy = 0u64;
    unsafe {
        std::arch::asm!(
            "# VECTORIZED_START
             mov {}, 1",
            out(reg) dummy,
            options(nostack)
        );
    }

    for &offset in &offsets[1..] {
        intermediate.push(offset + shift)
    }

    // Make it easier to find the loop in the assembly
    unsafe {
        std::arch::asm!(
            "# VECTORIZED_END
             mov {}, 2",
            out(reg) dummy,
            options(nostack)
        );
    }
    std::hint::black_box(dummy);

    output.extend_from_slice(&intermediate);
}

#[no_mangle]
pub fn extend_offsets_usize(output: &mut Vec<usize>, offsets: &[usize], next_offset: usize) {
  extend_offsets(output, offsets, next_offset);
}
```

And for the new code:
```rust
#[inline(always)]
fn extend_offsets<T: std::ops::Add<Output = T> + Copy + Default>(output: &mut Vec<T>, offsets: &[T], next_offset: T) {
    assert_ne!(offsets.len(), 0);
    let shift: T = next_offset + offsets[0];

    let mut intermediate = vec![T::default(); offsets.len() - 1];

    // Make it easier to find the loop in the assembly
    let mut dummy = 0u64;
    unsafe {
        std::arch::asm!(
            "# VECTORIZED_START
             mov {}, 1",
            out(reg) dummy,
            options(nostack)
        );
    }

    for (index, &offset) in offsets[1..].iter().enumerate() {
        intermediate[index] = offset + shift
    }

    // Make it easier to find the loop in the assembly
    unsafe {
        std::arch::asm!(
            "# VECTORIZED_END
             mov {}, 2",
            out(reg) dummy,
            options(nostack)
        );
    }
    std::hint::black_box(dummy);

    output.extend_from_slice(&intermediate);
}

#[no_mangle]
pub fn extend_offsets_usize(output: &mut Vec<usize>, offsets: &[usize], next_offset: usize) {
  extend_offsets(output, offsets, next_offset);
}
```
@github-actions github-actions bot added the arrow Changes to the arrow crate label Sep 18, 2025
@rluvaton rluvaton changed the title to perf: improve GenericByteBuilder::append_array to use SIMD for extending the offsets Sep 18, 2025
@alamb
Contributor

alamb commented Sep 19, 2025

Thanks @rluvaton -- I have scheduled a benchmark run.

@alamb
Contributor

alamb commented Sep 19, 2025

🤖 ./gh_compare_arrow.sh Benchmark Script Running
Linux aal-dev 6.14.0-1014-gcp #15~24.04.1-Ubuntu SMP Fri Jul 25 23:26:08 UTC 2025 x86_64 x86_64 x86_64 GNU/Linux
Comparing make-append-array-more-use-simd (2ee80db) to f4840f6 diff
BENCH_NAME=concatenate_kernel
BENCH_COMMAND=cargo bench --features=arrow,async,test_common,experimental --bench concatenate_kernel
BENCH_FILTER=
BENCH_BRANCH_NAME=make-append-array-more-use-simd
Results will be posted here when complete

@alamb
Contributor

alamb commented Sep 19, 2025

🤖: Benchmark completed

Details

group                                                          main                                   make-append-array-more-use-simd
-----                                                          ----                                   -------------------------------
concat 1024 arrays boolean 4                                   1.01     28.6±0.05µs        ? ?/sec    1.00     28.3±0.03µs        ? ?/sec
concat 1024 arrays i32 4                                       1.02     14.5±0.03µs        ? ?/sec    1.00     14.2±0.12µs        ? ?/sec
concat 1024 arrays str 4                                       1.00     55.9±0.13µs        ? ?/sec    1.07     60.0±0.77µs        ? ?/sec
concat boolean 1024                                            1.00    400.2±0.68ns        ? ?/sec    1.09    437.1±1.02ns        ? ?/sec
concat boolean 8192 over 100 arrays                            1.00     44.2±0.04µs        ? ?/sec    1.16     51.2±0.06µs        ? ?/sec
concat boolean nulls 1024                                      1.00    719.0±0.70ns        ? ?/sec    1.06    759.5±1.08ns        ? ?/sec
concat boolean nulls 8192 over 100 arrays                      1.00     96.3±0.19µs        ? ?/sec    1.14    110.1±0.24µs        ? ?/sec
concat fixed size lists                                        1.00   812.2±33.78µs        ? ?/sec    1.00   808.4±51.64µs        ? ?/sec
concat i32 1024                                                1.03    408.8±0.88ns        ? ?/sec    1.00    397.8±1.02ns        ? ?/sec
concat i32 8192 over 100 arrays                                1.00   239.1±10.18µs        ? ?/sec    1.02    244.2±6.82µs        ? ?/sec
concat i32 nulls 1024                                          1.00    731.4±2.95ns        ? ?/sec    1.01    738.4±1.62ns        ? ?/sec
concat i32 nulls 8192 over 100 arrays                          1.00    307.5±7.52µs        ? ?/sec    1.07   328.3±21.74µs        ? ?/sec
concat str 1024                                                1.14     14.7±0.72µs        ? ?/sec    1.00     12.9±0.53µs        ? ?/sec
concat str 8192 over 100 arrays                                1.01    103.3±0.71ms        ? ?/sec    1.00    102.2±0.97ms        ? ?/sec
concat str nulls 1024                                          1.22      7.2±0.44µs        ? ?/sec    1.00      5.9±0.17µs        ? ?/sec
concat str nulls 8192 over 100 arrays                          1.02     53.5±0.42ms        ? ?/sec    1.00     52.3±0.60ms        ? ?/sec
concat str_dict 1024                                           1.06      3.0±0.01µs        ? ?/sec    1.00      2.8±0.01µs        ? ?/sec
concat str_dict_sparse 1024                                    1.00      6.9±0.02µs        ? ?/sec    1.00      6.9±0.01µs        ? ?/sec
concat struct with int32 and dicts size=1024 count=2           1.04      7.0±0.08µs        ? ?/sec    1.00      6.7±0.04µs        ? ?/sec
concat utf8_view  max_str_len=128 null_density=0               1.00     77.3±0.39µs        ? ?/sec    1.00     77.3±0.29µs        ? ?/sec
concat utf8_view  max_str_len=128 null_density=0.2             1.00     83.7±0.23µs        ? ?/sec    1.00     83.4±0.35µs        ? ?/sec
concat utf8_view  max_str_len=20 null_density=0                1.14     87.8±0.22µs        ? ?/sec    1.00     77.1±0.21µs        ? ?/sec
concat utf8_view  max_str_len=20 null_density=0.2              1.13     94.0±0.13µs        ? ?/sec    1.00     83.6±0.25µs        ? ?/sec
concat utf8_view all_inline max_str_len=12 null_density=0      1.00     45.4±2.50µs        ? ?/sec    1.00     45.3±3.23µs        ? ?/sec
concat utf8_view all_inline max_str_len=12 null_density=0.2    1.00     53.1±2.68µs        ? ?/sec    1.00     53.0±2.99µs        ? ?/sec

@alamb
Contributor

alamb commented Sep 19, 2025

Those are some pretty impressive results 👍 thank you @rluvaton

concat str 1024                                                1.14     14.7±0.72µs        ? ?/sec    1.00     12.9±0.53µs        ? ?/sec
concat str 8192 over 100 arrays                                1.01    103.3±0.71ms        ? ?/sec    1.00    102.2±0.97ms        ? ?/sec
concat str nulls 1024                                          1.22      7.2±0.44µs        ? ?/sec    1.00      5.9±0.17µs        ? ?/sec
concat str nulls 8192 over 100 arrays                          1.02     53.5±0.42ms        ? ?/sec    1.00     52.3±0.60ms        ? ?/sec

@rluvaton
Member Author

Is it possible that the benchmark is not running with `target-cpu=native`?
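One way to check (a hypothetical invocation — whether `gh_compare_arrow.sh` already sets this is exactly the open question): `target-cpu=native` is normally passed via `RUSTFLAGS` when building the benches, e.g.:

```shell
# Build benches with the full instruction-set extensions of the host CPU.
# (Assumption: the CI script does not already set RUSTFLAGS itself.)
export RUSTFLAGS="-C target-cpu=native"
echo "RUSTFLAGS=$RUSTFLAGS"
# cargo bench --features=arrow,async,test_common,experimental --bench concatenate_kernel
```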

@rluvaton rluvaton force-pushed the make-append-array-more-use-simd branch from ea5835d to 43e6317 Compare September 20, 2025 23:41
@rluvaton
Member Author

I've updated the code to no longer have intermediate buffer AND use SIMD

@rluvaton rluvaton force-pushed the make-append-array-more-use-simd branch from 43e6317 to fd5a012 Compare September 21, 2025 00:00
@alamb
Contributor

alamb commented Sep 22, 2025

🤖 ./gh_compare_arrow.sh Benchmark Script Running
Linux aal-dev 6.14.0-1014-gcp #15~24.04.1-Ubuntu SMP Fri Jul 25 23:26:08 UTC 2025 x86_64 x86_64 x86_64 GNU/Linux
Comparing make-append-array-more-use-simd (fd5a012) to f4840f6 diff
BENCH_NAME=concatenate_kernel
BENCH_COMMAND=cargo bench --features=arrow,async,test_common,experimental --bench concatenate_kernel
BENCH_FILTER=
BENCH_BRANCH_NAME=make-append-array-more-use-simd
Results will be posted here when complete

@alamb
Contributor

alamb commented Sep 22, 2025

🤖: Benchmark completed

Details

group                                                          main                                   make-append-array-more-use-simd
-----                                                          ----                                   -------------------------------
concat 1024 arrays boolean 4                                   1.00     27.9±0.08µs        ? ?/sec    1.02     28.4±0.04µs        ? ?/sec
concat 1024 arrays i32 4                                       1.00     13.9±0.02µs        ? ?/sec    1.00     13.9±0.02µs        ? ?/sec
concat 1024 arrays str 4                                       1.54     55.0±0.30µs        ? ?/sec    1.00     35.6±0.30µs        ? ?/sec
concat boolean 1024                                            1.10    434.6±0.32ns        ? ?/sec    1.00    394.6±7.21ns        ? ?/sec
concat boolean 8192 over 100 arrays                            1.15     50.8±0.04µs        ? ?/sec    1.00     44.2±0.12µs        ? ?/sec
concat boolean nulls 1024                                      1.02    738.4±3.33ns        ? ?/sec    1.00    724.8±3.73ns        ? ?/sec
concat boolean nulls 8192 over 100 arrays                      1.14    109.7±0.16µs        ? ?/sec    1.00     96.3±0.13µs        ? ?/sec
concat fixed size lists                                        1.07   763.1±22.77µs        ? ?/sec    1.00   713.7±45.55µs        ? ?/sec
concat i32 1024                                                1.03    400.4±1.12ns        ? ?/sec    1.00    389.5±0.63ns        ? ?/sec
concat i32 8192 over 100 arrays                                1.03    214.0±2.71µs        ? ?/sec    1.00    207.7±6.93µs        ? ?/sec
concat i32 nulls 1024                                          1.01    713.4±3.30ns        ? ?/sec    1.00    704.6±1.82ns        ? ?/sec
concat i32 nulls 8192 over 100 arrays                          1.05    286.1±9.38µs        ? ?/sec    1.00    271.2±9.11µs        ? ?/sec
concat str 1024                                                1.18     14.3±0.88µs        ? ?/sec    1.00     12.1±1.05µs        ? ?/sec
concat str 8192 over 100 arrays                                1.01    104.4±0.78ms        ? ?/sec    1.00    103.4±0.60ms        ? ?/sec
concat str nulls 1024                                          1.32      7.4±0.31µs        ? ?/sec    1.00      5.6±0.44µs        ? ?/sec
concat str nulls 8192 over 100 arrays                          1.03     53.5±0.65ms        ? ?/sec    1.00     51.9±0.60ms        ? ?/sec
concat str_dict 1024                                           1.07      2.9±0.01µs        ? ?/sec    1.00      2.7±0.02µs        ? ?/sec
concat str_dict_sparse 1024                                    1.00      7.0±0.03µs        ? ?/sec    1.03      7.2±0.03µs        ? ?/sec
concat struct with int32 and dicts size=1024 count=2           1.00      6.4±0.02µs        ? ?/sec    1.09      6.9±0.22µs        ? ?/sec
concat utf8_view  max_str_len=128 null_density=0               1.00     77.5±0.48µs        ? ?/sec    1.00     77.5±0.69µs        ? ?/sec
concat utf8_view  max_str_len=128 null_density=0.2             1.01     83.8±0.42µs        ? ?/sec    1.00     83.3±0.79µs        ? ?/sec
concat utf8_view  max_str_len=20 null_density=0                1.15     88.6±0.71µs        ? ?/sec    1.00     77.1±0.33µs        ? ?/sec
concat utf8_view  max_str_len=20 null_density=0.2              1.15     94.8±0.28µs        ? ?/sec    1.00     82.5±0.54µs        ? ?/sec
concat utf8_view all_inline max_str_len=12 null_density=0      1.00     48.2±3.55µs        ? ?/sec    1.01     48.9±4.39µs        ? ?/sec
concat utf8_view all_inline max_str_len=12 null_density=0.2    1.06     54.4±2.97µs        ? ?/sec    1.00     51.4±2.65µs        ? ?/sec

@alamb
Contributor

alamb commented Sep 22, 2025

this is so good. Thank you @rluvaton

@alamb alamb merged commit 13fb041 into apache:main Sep 22, 2025
26 checks passed
@rluvaton rluvaton deleted the make-append-array-more-use-simd branch September 22, 2025 13:18

Labels: arrow (Changes to the arrow crate), performance
