
Conversation

@lyne7-sc (Contributor) commented Dec 29, 2025

Which issue does this PR close?

  • Closes #.

Rationale for this change

This PR improves performance by:

  • Using exact Utf8View byte size (sum of data buffers) instead of row-based approximation.
  • Building results via .concat()/.join(sep) on a pre-allocated Vec<&str> to avoid String reallocations.

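The scalar fast path described above can be sketched as follows (illustrative only; the names and signatures are not the actual DataFusion internals):

```rust
/// Minimal sketch of the scalar path: collect the non-NULL arguments into a
/// pre-allocated Vec of string slices, then build the result in one shot.
fn concat_ws_scalar(sep: &str, args: &[Option<&str>]) -> String {
    // Pre-allocate so pushes never reallocate the Vec.
    let mut values: Vec<&str> = Vec::with_capacity(args.len());
    for arg in args {
        // concat_ws skips NULL arguments.
        if let Some(v) = arg {
            values.push(v);
        }
    }
    // A single join computes the total length up front and allocates the
    // output String exactly once, avoiding repeated push_str reallocations.
    values.join(sep)
}

fn main() {
    assert_eq!(concat_ws_scalar(",", &[Some("a"), None, Some("b")]), "a,b");
}
```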
Benchmark

| Case | Before | After | Change |
|------|--------|-------|--------|
| concat_ws_scalar/8 | 299.18 ns | 233.18 ns | -21.93% |
| concat_ws_scalar/32 | 327.53 ns | 251.44 ns | -23.23% |
| concat_ws_scalar/128 | 405.80 ns | 271.27 ns | -33.15% |
| concat_ws_scalar/4096 | 976.02 ns | 791.33 ns | -18.92% |
| concat_scalar/8 | 248.71 ns | 221.24 ns | -11.05% |
| concat_scalar/32 | 284.26 ns | 240.53 ns | -15.39% |
| concat_scalar/128 | 301.91 ns | 257.61 ns | -14.67% |
| concat_scalar/4096 | 916.68 ns | 805.33 ns | -12.15% |

What changes are included in this PR?

Performance optimization for the scalar path of the `concat` and `concat_ws` functions.

Are these changes tested?

  • Existing unit and integration tests pass.
  • New benchmarks added to verify performance improvement.

Are there any user-facing changes?

No. It's a pure performance optimization.

@github-actions github-actions bot added the functions Changes to functions implementation label Dec 29, 2025
@Omega359 (Contributor)

🤖 ./gh_compare_branch_bench.sh Benchmark Script Running
Linux fedora 6.17.12-300.fc43.x86_64 #1 SMP PREEMPT_DYNAMIC Sat Dec 13 05:06:24 UTC 2025 x86_64 GNU/Linux
Comparing perf/concat (301ab2e) to 83ed192 diff
BENCH_NAME=concat
BENCH_COMMAND=cargo bench --bench concat
BENCH_FILTER=
BENCH_BRANCH_NAME=perf_concat
Results will be posted here when complete


@Omega359 (Contributor)

🤖: Benchmark completed

Details

group                            main                                   perf_concat
-----                            ----                                   -----------
concat function/concat/1024      1.03      9.3±0.85µs        ? ?/sec    1.00      9.0±0.29µs        ? ?/sec
concat function/concat/4096      1.00     35.1±0.71µs        ? ?/sec    1.00     35.1±1.01µs        ? ?/sec
concat function/concat/8192      1.01     69.9±3.74µs        ? ?/sec    1.00     69.6±0.95µs        ? ?/sec
concat function/concat/scalar                                           1.00     32.5±0.13µs        ? ?/sec

@andygrove andygrove added the performance Make DataFusion faster label Dec 29, 2025
@lyne7-sc (Contributor Author)

lyne7-sc commented Jan 1, 2026

Friendly ping @andygrove. All CI checks are green and it's ready for review when you have time.

Comment on lines 210 to 214
data_size += string_array
.data_buffers()
.iter()
.map(|buf| buf.len())
.sum::<usize>();
Contributor
I'm a bit cautious that this could significantly overestimate the size required, depending on the string view passed in 🤔
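The overestimation concern can be illustrated with a toy model (a hypothetical struct, not the arrow-rs `StringViewArray` API): a sliced view array still references the entire shared data buffer, so summing buffer sizes counts bytes that no view points at.

```rust
// Toy model of a string-view array: one shared data buffer plus per-row
// (offset, len) views. NOT the real arrow-rs types.
struct ViewArray {
    buffer: Vec<u8>,            // shared data buffer
    views: Vec<(usize, usize)>, // (offset, len) per row
}

impl ViewArray {
    /// What summing data-buffer sizes measures: the physical buffer.
    fn buffer_estimate(&self) -> usize {
        self.buffer.len()
    }
    /// What iterating the views measures: exact logical bytes, at the cost
    /// of a full pass over the array.
    fn logical_size(&self) -> usize {
        self.views.iter().map(|&(_, len)| len).sum()
    }
}

fn main() {
    // A slice keeping only the first of four rows still holds the full
    // 16-byte buffer, so the buffer-based estimate is 4x too large.
    let sliced = ViewArray {
        buffer: b"aaaabbbbccccdddd".to_vec(),
        views: vec![(0, 4)],
    };
    assert_eq!(sliced.buffer_estimate(), 16); // overestimates
    assert_eq!(sliced.logical_size(), 4);     // exact
}
```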

Contributor Author

> I'm a bit cautious that this could significantly overestimate the size required, depending on the string view passed in 🤔

Agreed, thanks for pointing this out.
I checked the StringViewArray doc, and I think we can use ByteView to derive the logical length of each data slice more accurately.

data_size += string_array
    .views()
    .iter()
    .map(|&v| {
        ByteView::from(v).length as usize
    })
    .sum::<usize>();

If this looks reasonable to you, please let me know and I’ll proceed with this approach.

Contributor

It's a hard tradeoff to be sure as now we have to iterate the whole array 🤔

I would be curious to see what the benchmarks say; I'm not too sure on this myself, would love it if there were an easy way to estimate view size 😅

Contributor Author

> It's a hard tradeoff to be sure as now we have to iterate the whole array 🤔
>
> I would be curious to see what the benchmarks say; I'm not too sure on this myself, would love it if there were an easy way to estimate view size 😅

Hi @Jefffrey, I ran a benchmark comparing the pre-estimation logic (iterating buffers/views) against the current implementation.

group                               concat_main_branch                     concat_perf_data_buffers               concat_perf_views_iter
-----                               ------------------                     ------------------------               ----------------------
concat function/concat_view/1024    1.00    109.7±3.28µs        ? ?/sec    1.14    125.3±3.71µs        ? ?/sec    1.04    113.6±4.48µs        ? ?/sec
concat function/concat_view/4096    1.02   608.1±20.13µs        ? ?/sec    1.00   595.9±15.17µs        ? ?/sec    1.01   603.1±19.92µs        ? ?/sec
concat function/concat_view/8192    1.00  1206.7±52.12µs        ? ?/sec    1.06  1281.5±44.05µs        ? ?/sec    1.06  1277.6±37.06µs        ? ?/sec

The results show that the overhead of iterating outweighs the allocation savings, leading to a slight regression in some cases (see the benchmark table above). Given StringView's design, the default growth strategy seems more efficient here, so I've reverted that part.

We might want to revisit this optimization later if we have a more efficient way to determine the total data view size.

Contributor Author

Additionally, I noticed a similar pre-estimation logic in concat_ws using data_buffers.

Given the results for concat, I suspect concat_ws might also suffer from iteration overhead and overestimation.

What do you think about removing this logic from concat_ws as well to keep the implementation consistent? I'm happy to add more targeted test cases for StringViewArray in concat_ws to ensure we handle these scenarios correctly and efficiently.

Contributor

I don't think there is much noticeable overhead in iterating over `data_buffers`, as I imagine most view arrays wouldn't have that many data buffers; I was more referring to iterating over the views themselves to grab the lengths.

Contributor Author

Understood. That makes sense.
I also ran a benchmark for concat_ws and found that keeping the current implementation is likely the better choice for now as well.
The cargo fmt issues are also fixed. Please let me know if there's anything else!

DataType::LargeUtf8 => {
    let string_array = as_largestring_array(array);

    data_size += string_array.values().len();
Contributor

If we're looking at having more accurate estimates, we could fix these to ensure we look only at the data actually sliced by our string arrays.
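For offset-based arrays (Utf8/LargeUtf8) this suggestion has a cheap exact answer, sketched below (simplified, not the actual DataFusion code): the data covered by a possibly-sliced array spans from its first to its last offset, so no per-row iteration is needed.

```rust
/// Exact data size of a (possibly sliced) offset-based string array.
/// `offsets` has num_rows + 1 entries; the array's string data occupies
/// the byte range [offsets.first(), offsets.last()).
fn sliced_data_size(offsets: &[i64]) -> usize {
    match (offsets.first(), offsets.last()) {
        (Some(&start), Some(&end)) => (end - start) as usize,
        _ => 0,
    }
}

fn main() {
    // Offsets for 3 strings of lengths 2, 3, 1, starting at byte 10
    // (as if this array were sliced out of a larger one).
    let offsets = [10i64, 12, 15, 16];
    assert_eq!(sliced_data_size(&offsets), 6);
}
```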

    result.push_str(s);
}
match scalar.try_as_str() {
    Some(Some(v)) => values.push(v),
Contributor

Wondering: would inserting the separator after each token be faster than calling `join` later?

values.push(sep);
values.push(v);

Contributor Author

Based on a local benchmark, the difference between the two is negligible; if anything, `join` tends to be slightly faster. This may be because the push-separator approach introduces additional branching per element. Since there is no performance gain, perhaps sticking with `join` is better for readability?
