-
Notifications
You must be signed in to change notification settings - Fork 1.9k
Perf: Optimize substring_index via single-byte fast path and direct indexing
#19590
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
uzqw
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hi! I noticed some Clippy warnings when running locally:
cargo clippy -p datafusion-functions --all-targets --all-features
| match string.get(string.len().saturating_sub(length)..) { | ||
| Some(substring) => builder.append_value(substring), | ||
| None => builder.append_null(), | ||
| if n > 0 { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this else { if .. } block can be collapsed
|
|
||
| for batch_size in batch_sizes { | ||
| group.bench_function( | ||
| &format!("substr_index_{}_single_delimiter", batch_size), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
variables can be used directly in the format! string
| ); | ||
|
|
||
| group.bench_function( | ||
| &format!("substr_index_{}_long_delimiter", batch_size), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
variables can be used directly in the format! string
|
|
Hi @Jefffrey , could you please help trigger the workflows for this PR? I'd appreciate a review when you have a moment. Thanks! |
|
run benchmarks |
|
🤖 |
|
🤖: Benchmark completed Details
|
|
run benchmark substr_index |
This comment was marked as outdated.
This comment was marked as outdated.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
| match string.get(..length) { | ||
| Some(substring) => builder.append_value(substring), | ||
| None => builder.append_null(), | ||
| let result_idx = if delimiter.len() == 1 { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I double checked that this checks for length in bytes
https://doc.rust-lang.org/std/string/struct.String.html#method.len
| T::Native: OffsetSizeTrait, | ||
| { | ||
| let mut builder = StringBuilder::new(); | ||
| let num_rows = string_array.len(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If you wanted to make this really fast, you could also implement special case code for the common cases where delimiter and count were scalar values -- I suspect you could make it quite fast.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good point! I’ll try adding a special-case implementation for that. Thanks!
|
🤖 |
|
🤖: Benchmark completed Details
|
|
Onwards 🚀 |
Which issue does this PR close?
Rationale for this change
This PR improves the performance of the
substring_indexfunction by optimizingdelimiter search and substring extraction:
.,,), avoiding UTF-8 pattern matching overhead.match_indices/rmatch_indices.What changes are included in this PR?
delimiter.len() == 1using byte-based search.match_indicesandrmatch_indicesfor more efficient positioning.Benchmarks
Are these changes tested?
Are there any user-facing changes?
No.