Optimize concat/concat_ws scalar path by pre-allocating memory
#19547
Conversation
🤖: Benchmark completed
Friendly ping @andygrove. All CI checks are green and it's ready for review when you have time.
```rust
data_size += string_array
    .data_buffers()
    .iter()
    .map(|buf| buf.len())
    .sum::<usize>();
```
I'm a bit cautious that this could significantly overestimate the size required, depending on the string view passed in 🤔
> I'm a bit cautious that this could significantly overestimate the size required, depending on the string view passed in 🤔
Agreed, thanks for pointing this out.
I checked the StringViewArray doc, and I think we can use ByteView to derive the logical length of each data slice more accurately.
```rust
data_size += string_array
    .views()
    .iter()
    .map(|&v| ByteView::from(v).length as usize)
    .sum::<usize>();
```

If this looks reasonable to you, please let me know and I'll proceed with this approach.
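For reference, the length decode used above can be sketched without Arrow at all: per the Arrow `Utf8View` layout, each view is a packed 128-bit value whose low 32 bits hold the logical string length (`view_length` below is a hypothetical standalone helper for illustration, not part of the PR):

```rust
// Each StringView element is a packed 128-bit value; per the Arrow
// Utf8View spec, the low 32 bits hold the logical string length.
// `view_length` is a hypothetical helper, not arrow-rs API.
fn view_length(view: u128) -> usize {
    (view & 0xFFFF_FFFF) as usize
}

fn main() {
    // A view for a 5-byte string: length 5 in the low 32 bits; the
    // remaining bits (prefix/buffer-id/offset) are zeroed here.
    let view: u128 = 5;
    assert_eq!(view_length(view), 5);

    // Higher bits do not affect the decoded length.
    let view_with_offset: u128 = (1u128 << 32) | 42;
    assert_eq!(view_length(view_with_offset), 42);
}
```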
It's a hard tradeoff to be sure as now we have to iterate the whole array 🤔
I would be curious to see what the benchmarks say; I'm not too sure on this myself, would love it if there were an easy way to estimate view size 😅
> It's a hard tradeoff to be sure as now we have to iterate the whole array 🤔
> I would be curious to see what the benchmarks say; I'm not too sure on this myself, would love it if there were an easy way to estimate view size 😅
Hi @Jefffrey, I ran a benchmark comparing the pre-estimation logic (iterating buffers/views) against the current implementation.
```
group                              concat_main_branch            concat_perf_data_buffers      concat_perf_views_iter
-----                              ------------------            ------------------------      ----------------------
concat function/concat_view/1024   1.00   109.7±3.28µs  ? ?/sec  1.14   125.3±3.71µs  ? ?/sec  1.04   113.6±4.48µs  ? ?/sec
concat function/concat_view/4096   1.02   608.1±20.13µs ? ?/sec  1.00   595.9±15.17µs ? ?/sec  1.01   603.1±19.92µs ? ?/sec
concat function/concat_view/8192   1.00  1206.7±52.12µs ? ?/sec  1.06  1281.5±44.05µs ? ?/sec  1.06  1277.6±37.06µs ? ?/sec
```
The results showed that the overhead of iteration actually outweighed the allocation savings, leading to a slight regression in some cases (see the benchmark table above). Given StringView's design, the default growth strategy seems more efficient here. So I’ve reverted that part.
We might want to revisit this optimization later if we have a more efficient way to determine the total data view size.
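As a rough illustration of why skipping the size pre-pass costs so little (a standalone sketch, not code from the PR): Rust's `String` grows geometrically, so appending without a capacity hint triggers only a logarithmic number of reallocations.

```rust
// Count how many times the String's backing allocation actually grows
// while appending `chunks` copies of `chunk` with no capacity hint.
fn count_reallocations(chunks: usize, chunk: &str) -> usize {
    let mut s = String::new();
    let mut cap = s.capacity();
    let mut reallocs = 0;
    for _ in 0..chunks {
        s.push_str(chunk);
        if s.capacity() != cap {
            reallocs += 1;
            cap = s.capacity();
        }
    }
    reallocs
}

fn main() {
    // ~80 KB appended 8 bytes at a time: geometric growth keeps the
    // reallocation count small, so pre-sizing saves only a handful of
    // allocations while the view pre-pass costs a full O(n) scan.
    let n = count_reallocations(10_000, "abcdefgh");
    assert!(n < 32);
    println!("reallocations: {n}");
}
```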
Additionally, I noticed similar pre-estimation logic in concat_ws using data_buffers.
Given the results for concat, I suspect concat_ws might also suffer from iteration overhead and overestimation.
What do you think about removing this logic from concat_ws as well to keep the implementation consistent? I'm happy to add more targeted test cases for StringViewArray in concat_ws to ensure we handle these scenarios correctly and efficiently.
I don't think there is too noticeable iteration overhead if iterating over data_buffers as I imagine most view arrays wouldn't have that many data buffers; I was more referring to iterating over the views themselves to grab the lengths.
Understood. That makes sense.
I also ran a benchmark for concat_ws and found that keeping the current implementation is likely the better choice for now as well.
The cargo fmt issues are also fixed. Please let me know if there's anything else!
```rust
DataType::LargeUtf8 => {
    let string_array = as_largestring_array(array);
    // ...
    data_size += string_array.values().len();
```
If we're looking at having more accurate estimates, we could fix these to ensure we look at only the data sliced by our string arrays.
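One way to do that (a sketch assuming an Arrow-style offset buffer, not the PR's actual code): for `Utf8`/`LargeUtf8`, the bytes actually referenced by a possibly-sliced array span from its first to its last offset, which can be much smaller than the full `values().len()`:

```rust
// Hypothetical helper: exact data size of a (possibly sliced) string
// array, derived from its offset buffer rather than the whole values
// buffer. Offsets are i64 as in LargeUtf8.
fn sliced_data_size(offsets: &[i64]) -> usize {
    (offsets[offsets.len() - 1] - offsets[0]) as usize
}

fn main() {
    // A slice covering bytes 10..25 of a larger shared values buffer:
    // the exact size is 15, even if values().len() is far bigger.
    let offsets = [10i64, 14, 20, 25];
    assert_eq!(sliced_data_size(&offsets), 15);
}
```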
```rust
    result.push_str(s);
}
match scalar.try_as_str() {
    Some(Some(v)) => values.push(v),
```
Wondering, would inserting a sep after each token be faster than calling join later?

```rust
values.push(sep);
values.push(v);
```
Based on local benchmarks, the difference between the two is negligible; in general, join even tends to be slightly faster.
This may be because the push-sep approach introduces additional branching.
Since there is no performance gain, perhaps sticking with join would be better for readability?
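The two strategies being compared can be sketched as standalone functions (hypothetical names, not the PR's code); std's `join` can pre-compute the final length from the slice, while the push variant pays an extra branch per element:

```rust
// Variant 1: collect then join; the standard library's join computes
// the exact output length before copying, so it allocates once.
fn concat_ws_join(sep: &str, values: &[&str]) -> String {
    values.join(sep)
}

// Variant 2: push the separator between tokens as we go; one extra
// branch per element and no up-front size computation.
fn concat_ws_push(sep: &str, values: &[&str]) -> String {
    let mut out = String::new();
    for (i, v) in values.iter().enumerate() {
        if i > 0 {
            out.push_str(sep);
        }
        out.push_str(v);
    }
    out
}

fn main() {
    let vals = ["a", "b", "c"];
    assert_eq!(concat_ws_join(",", &vals), concat_ws_push(",", &vals));
    println!("{}", concat_ws_join(",", &vals)); // prints "a,b,c"
}
```

Both produce identical output, which is why the choice comes down to readability once the benchmarks show no measurable difference.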
Which issue does this PR close?

Rationale for this change

This PR improves performance by:
- Using exact `Utf8View` byte size (sum of data buffers) instead of row-based approximation.
- Using `.concat()`/`.join(sep)` on a pre-allocated `Vec<&str>` to avoid `String` reallocations.

Benchmark

What changes are included in this PR?

Performance optimization for the `concat` and `concat_ws` functions' scalar path.

Are these changes tested?

Are there any user-facing changes?

No. It's a pure performance optimization.
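The scalar path described above can be sketched roughly as follows (a simplified illustration assuming plain `Option<&str>` inputs, not DataFusion's actual argument types):

```rust
// concat: gather the scalar strings into a pre-sized Vec<&str>, then
// build the output in a single allocation with concat().
fn concat_scalars(args: &[Option<&str>]) -> String {
    let mut values = Vec::with_capacity(args.len());
    for arg in args {
        if let Some(s) = arg {
            values.push(*s); // nulls are skipped per concat's semantics
        }
    }
    values.concat()
}

fn main() {
    let args = [Some("foo"), None, Some("bar")];
    assert_eq!(concat_scalars(&args), "foobar");
}
```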