
Conversation

@lyne7-sc (Contributor) commented Dec 29, 2025

Which issue does this PR close?

  • Closes #.

Rationale for this change

This PR improves performance by:

  • Using exact Utf8View byte size (sum of data buffers) instead of row-based approximation.
  • Building results via .concat()/.join(sep) on a pre-allocated Vec<&str> to avoid String reallocations.

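The scalar fast path described above can be sketched as follows (illustrative only; the names and signatures are not the actual DataFusion internals):

```rust
/// Minimal sketch of the scalar path: collect the non-NULL arguments into a
/// pre-allocated Vec of string slices, then build the result in one shot.
fn concat_ws_scalar(sep: &str, args: &[Option<&str>]) -> String {
    // Pre-allocate so pushes never reallocate the Vec.
    let mut values: Vec<&str> = Vec::with_capacity(args.len());
    for arg in args {
        // concat_ws skips NULL arguments.
        if let Some(v) = arg {
            values.push(v);
        }
    }
    // A single join computes the total length up front and allocates the
    // output String exactly once, avoiding repeated push_str reallocations.
    values.join(sep)
}

fn main() {
    assert_eq!(concat_ws_scalar(",", &[Some("a"), None, Some("b")]), "a,b");
}
```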
Benchmark

| Case | Before | After | Change |
|------|--------|-------|--------|
| concat_ws_scalar/8 | 299.18 ns | 233.18 ns | -21.93% |
| concat_ws_scalar/32 | 327.53 ns | 251.44 ns | -23.23% |
| concat_ws_scalar/128 | 405.80 ns | 271.27 ns | -33.15% |
| concat_ws_scalar/4096 | 976.02 ns | 791.33 ns | -18.92% |
| concat_scalar/8 | 248.71 ns | 221.24 ns | -11.05% |
| concat_scalar/32 | 284.26 ns | 240.53 ns | -15.39% |
| concat_scalar/128 | 301.91 ns | 257.61 ns | -14.67% |
| concat_scalar/4096 | 916.68 ns | 805.33 ns | -12.15% |

What changes are included in this PR?

Performance optimization for the scalar path of the `concat` and `concat_ws` functions.

Are these changes tested?

  • Existing unit and integration tests pass.
  • New benchmarks added to verify performance improvement.

Are there any user-facing changes?

No. It's a pure performance optimization.

@github-actions github-actions bot added the functions Changes to functions implementation label Dec 29, 2025
@Omega359 (Contributor)

🤖 ./gh_compare_branch_bench.sh Benchmark Script Running
Linux fedora 6.17.12-300.fc43.x86_64 #1 SMP PREEMPT_DYNAMIC Sat Dec 13 05:06:24 UTC 2025 x86_64 GNU/Linux
Comparing perf/concat (301ab2e) to 83ed192 diff
BENCH_NAME=concat
BENCH_COMMAND=cargo bench --bench concat
BENCH_FILTER=
BENCH_BRANCH_NAME=perf_concat
Results will be posted here when complete


@Omega359 (Contributor)

🤖: Benchmark completed

Details

group                            main                                   perf_concat
-----                            ----                                   -----------
concat function/concat/1024      1.03      9.3±0.85µs        ? ?/sec    1.00      9.0±0.29µs        ? ?/sec
concat function/concat/4096      1.00     35.1±0.71µs        ? ?/sec    1.00     35.1±1.01µs        ? ?/sec
concat function/concat/8192      1.01     69.9±3.74µs        ? ?/sec    1.00     69.6±0.95µs        ? ?/sec
concat function/concat/scalar                                           1.00     32.5±0.13µs        ? ?/sec

@andygrove andygrove added the performance Make DataFusion faster label Dec 29, 2025
@lyne7-sc (Contributor Author)

lyne7-sc commented Jan 1, 2026

Friendly ping @andygrove. All CI checks are green and it's ready for review when you have time.

Comment on lines 210 to 214
data_size += string_array
.data_buffers()
.iter()
.map(|buf| buf.len())
.sum::<usize>();
Contributor
I'm a bit cautious that this could significantly overestimate the size required, depending on the string view passed in 🤔
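The overestimation concern can be illustrated with a toy model (a hypothetical struct, not the arrow-rs `StringViewArray` API): a sliced view array still references the entire shared data buffer, so summing buffer sizes counts bytes that no view points at.

```rust
// Toy model of a string-view array: one shared data buffer plus per-row
// (offset, len) views. NOT the real arrow-rs types.
struct ViewArray {
    buffer: Vec<u8>,            // shared data buffer
    views: Vec<(usize, usize)>, // (offset, len) per row
}

impl ViewArray {
    /// What summing data-buffer sizes measures: the physical buffer.
    fn buffer_estimate(&self) -> usize {
        self.buffer.len()
    }
    /// What iterating the views measures: exact logical bytes, at the cost
    /// of a full pass over the array.
    fn logical_size(&self) -> usize {
        self.views.iter().map(|&(_, len)| len).sum()
    }
}

fn main() {
    // A slice keeping only the first of four rows still holds the full
    // 16-byte buffer, so the buffer-based estimate is 4x too large.
    let sliced = ViewArray {
        buffer: b"aaaabbbbccccdddd".to_vec(),
        views: vec![(0, 4)],
    };
    assert_eq!(sliced.buffer_estimate(), 16); // overestimates
    assert_eq!(sliced.logical_size(), 4);     // exact
}
```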

Contributor Author

> I'm a bit cautious that this could significantly overestimate the size required, depending on the string view passed in 🤔

Agreed, thanks for pointing this out.
I checked the StringViewArray doc, and I think we can use ByteView to derive the logical length of each data slice more accurately.

data_size += string_array
    .views()
    .iter()
    .map(|&v| {
        ByteView::from(v).length as usize
    })
    .sum::<usize>();

If this looks reasonable to you, please let me know and I’ll proceed with this approach.

Contributor

It's a hard tradeoff to be sure as now we have to iterate the whole array 🤔

I would be curious to see what the benchmarks say; I'm not too sure on this myself, would love it if there were an easy way to estimate view size 😅

Contributor Author

> It's a hard tradeoff to be sure as now we have to iterate the whole array 🤔
>
> I would be curious to see what the benchmarks say; I'm not too sure on this myself, would love it if there were an easy way to estimate view size 😅

Hi @Jefffrey, I ran a benchmark comparing the pre-estimation logic (iterating buffers/views) against the current implementation.

group                               concat_main_branch                     concat_perf_data_buffers               concat_perf_views_iter
-----                               ------------------                     ------------------------               ----------------------
concat function/concat_view/1024    1.00    109.7±3.28µs        ? ?/sec    1.14    125.3±3.71µs        ? ?/sec    1.04    113.6±4.48µs        ? ?/sec
concat function/concat_view/4096    1.02   608.1±20.13µs        ? ?/sec    1.00   595.9±15.17µs        ? ?/sec    1.01   603.1±19.92µs        ? ?/sec
concat function/concat_view/8192    1.00  1206.7±52.12µs        ? ?/sec    1.06  1281.5±44.05µs        ? ?/sec    1.06  1277.6±37.06µs        ? ?/sec

The results show that the overhead of iterating outweighs the allocation savings, leading to a slight regression in some cases (see the benchmark table above). Given StringView's design, the default growth strategy seems more efficient here, so I've reverted that part.

We might want to revisit this optimization later if we have a more efficient way to determine the total data view size.

Contributor Author

Additionally, I noticed a similar pre-estimation logic in concat_ws using data_buffers.

Given the results for concat, I suspect concat_ws might also suffer from iteration overhead and overestimation.

What do you think about removing this logic from concat_ws as well to keep the implementation consistent? I'm happy to add more targeted test cases for StringViewArray in concat_ws to ensure we handle these scenarios correctly and efficiently.

Contributor

I don't think there is much noticeable overhead in iterating over `data_buffers`, as I imagine most view arrays wouldn't have that many data buffers; I was more referring to iterating over the views themselves to grab the lengths.

Contributor Author

Understood. That makes sense.
I also ran a benchmark for concat_ws and found that keeping the current implementation is likely the better choice for now as well.
The cargo fmt issues are also fixed. Please let me know if there's anything else!

DataType::LargeUtf8 => {
    let string_array = as_largestring_array(array);

    data_size += string_array.values().len();
Contributor

If we're looking at having more accurate estimates, we could fix these to ensure we look only at the data actually sliced by our string arrays.
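For offset-based arrays (Utf8/LargeUtf8) this suggestion has a cheap exact answer, sketched below (simplified, not the actual DataFusion code): the data covered by a possibly-sliced array spans from its first to its last offset, so no per-row iteration is needed.

```rust
/// Exact data size of a (possibly sliced) offset-based string array.
/// `offsets` has num_rows + 1 entries; the array's string data occupies
/// the byte range [offsets.first(), offsets.last()).
fn sliced_data_size(offsets: &[i64]) -> usize {
    match (offsets.first(), offsets.last()) {
        (Some(&start), Some(&end)) => (end - start) as usize,
        _ => 0,
    }
}

fn main() {
    // Offsets for 3 strings of lengths 2, 3, 1, starting at byte 10
    // (as if this array were sliced out of a larger one).
    let offsets = [10i64, 12, 15, 16];
    assert_eq!(sliced_data_size(&offsets), 6);
}
```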

    result.push_str(s);
}
match scalar.try_as_str() {
    Some(Some(v)) => values.push(v),
Contributor

Wondering: would inserting the separator after each token be faster than calling `join` later?

values.push(sep);
values.push(v);

Contributor Author

Based on a local benchmark, the difference between the two is negligible; if anything, `join` tends to be slightly faster. This may be because the push-separator approach introduces additional branching per element. Since there is no performance gain, perhaps sticking with `join` is better for readability?
