perf: implement native Rust trim functions for better performance #2988
Conversation
Fixes apache#2977

Implements native Rust implementations for string trimming functions to address the performance regression in issue apache#2977.

Changes:
- Add `trim.rs` with `spark_trim`, `spark_ltrim`, `spark_rtrim`, `spark_btrim`
- Use efficient Arrow array operations
- Include a fast path for strings without whitespace
- Handle both `Utf8` and `LargeUtf8` types
- Add comprehensive unit tests

The implementation avoids JVM overhead and unnecessary allocations that caused the 0.6-0.7x performance shown in benchmarks. Expected to achieve >1.0x performance vs the Spark baseline.

Testing:
- Build successful
- Unit tests pass
- CI will verify benchmark improvements
Pull request overview
This PR introduces native Rust implementations for Spark's string trimming functions (trim, ltrim, rtrim, btrim) to address performance regressions caused by JVM fallbacks. The implementation leverages Arrow array operations and includes a fast-path optimization for strings that don't require trimming.
Key changes:
- Native Rust trim implementations replacing JVM fallbacks for improved performance
- Fast-path optimization that avoids allocations when no trimming is needed
- Support for both `Utf8` and `LargeUtf8` Arrow array types
Reviewed changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated 3 comments.
| File | Description |
|---|---|
| native/spark-expr/src/string_funcs/trim.rs | New file implementing four trim function variants with array and scalar handling |
| native/spark-expr/src/string_funcs/mod.rs | Adds public module declaration for the new trim module |
| native/spark-expr/src/comet_scalar_funcs.rs | Registers the four trim functions in the scalar function registry |
```rust
// Fast path: Check if any strings actually need trimming
// If not, we can return a clone of the original array
let needs_trimming = (0..array.len()).any(|i| {
    if array.is_null(i) {
        false
    } else {
        let s = array.value(i);
        match trim_type {
            TrimType::Left => s.starts_with(|c: char| c.is_whitespace()),
            TrimType::Right => s.ends_with(|c: char| c.is_whitespace()),
            TrimType::Both => {
                s.starts_with(|c: char| c.is_whitespace())
                    || s.ends_with(|c: char| c.is_whitespace())
            }
        }
    }
});

if !needs_trimming {
    // No trimming needed, return a clone of the input
```
**Copilot AI** · Dec 26, 2025
The fast-path optimization iterates through the entire array to check if any string needs trimming, then potentially iterates through it again to perform the trimming. This double iteration may actually hurt performance for arrays where most strings need trimming, which is likely a common case.
For example, if all strings in an array have leading/trailing whitespace (a typical scenario), this optimization will:
- Iterate through all strings to check if trimming is needed
- Then iterate through all strings again to perform the trimming
This doubles the iteration cost. Consider removing this optimization or adding a short-circuit mechanism that stops checking after finding a certain number of strings that need trimming (e.g., if the first N strings need trimming, assume the rest might too).
Suggested change:

```rust
// Fast path: Check if any strings actually need trimming.
// If not, we can return a clone of the original array.
// To avoid a full double scan, we bound the pre-scan: once we've
// seen enough strings that need trimming, we stop checking and go
// straight to building the trimmed array.
const MAX_FAST_PATH_CHECK: usize = 16;
let mut needs_trimming = false;
let mut fully_scanned = true;
for i in 0..array.len() {
    if array.is_null(i) {
        continue;
    }
    let s = array.value(i);
    let string_needs_trimming = match trim_type {
        TrimType::Left => s.starts_with(|c: char| c.is_whitespace()),
        TrimType::Right => s.ends_with(|c: char| c.is_whitespace()),
        TrimType::Both => {
            s.starts_with(|c: char| c.is_whitespace())
                || s.ends_with(|c: char| c.is_whitespace())
        }
    };
    if string_needs_trimming {
        needs_trimming = true;
        if i + 1 >= MAX_FAST_PATH_CHECK {
            // We've already seen enough evidence that trimming is needed.
            // Avoid scanning the rest of the array just for the fast path.
            fully_scanned = false;
            break;
        }
    }
}
if !needs_trimming && fully_scanned {
    // No trimming needed for any element, return a clone of the input.
```
```rust
#[cfg(test)]
mod tests {
    use super::*;
    use arrow::array::StringArray;

    #[test]
    fn test_trim() {
        let input = StringArray::from(vec![
            Some(" hello "),
            Some("world"),
            Some(" spaces "),
            None,
        ]);
        let input_ref: ArrayRef = Arc::new(input);

        let result = trim_array(&input_ref, TrimType::Both).unwrap();
        let result_array = result.as_any().downcast_ref::<StringArray>().unwrap();

        assert_eq!(result_array.value(0), "hello");
        assert_eq!(result_array.value(1), "world");
        assert_eq!(result_array.value(2), "spaces");
        assert!(result_array.is_null(3));
    }

    #[test]
    fn test_ltrim() {
        let input = StringArray::from(vec![Some(" hello "), Some("world ")]);
        let input_ref: ArrayRef = Arc::new(input);

        let result = trim_array(&input_ref, TrimType::Left).unwrap();
        let result_array = result.as_any().downcast_ref::<StringArray>().unwrap();

        assert_eq!(result_array.value(0), "hello ");
        assert_eq!(result_array.value(1), "world ");
    }

    #[test]
    fn test_rtrim() {
        let input = StringArray::from(vec![Some(" hello "), Some(" world")]);
        let input_ref: ArrayRef = Arc::new(input);

        let result = trim_array(&input_ref, TrimType::Right).unwrap();
        let result_array = result.as_any().downcast_ref::<StringArray>().unwrap();

        assert_eq!(result_array.value(0), " hello");
        assert_eq!(result_array.value(1), " world");
    }

    #[test]
    fn test_trim_no_whitespace_fast_path() {
        // Test the fast path where no trimming is needed
        let input = StringArray::from(vec![
            Some("hello"),
            Some("world"),
            Some("no spaces"),
            None,
        ]);
        let input_ref: ArrayRef = Arc::new(input.clone());

        let result = trim_array(&input_ref, TrimType::Both).unwrap();
        let result_array = result.as_any().downcast_ref::<StringArray>().unwrap();

        // Verify values are correct
        assert_eq!(result_array.value(0), "hello");
        assert_eq!(result_array.value(1), "world");
        assert_eq!(result_array.value(2), "no spaces");
        assert!(result_array.is_null(3));
    }

    #[test]
    fn test_ltrim_no_whitespace() {
        // Test ltrim with strings that have no leading whitespace
        let input = StringArray::from(vec![Some("hello "), Some("world")]);
        let input_ref: ArrayRef = Arc::new(input);

        let result = trim_array(&input_ref, TrimType::Left).unwrap();
        let result_array = result.as_any().downcast_ref::<StringArray>().unwrap();

        assert_eq!(result_array.value(0), "hello ");
        assert_eq!(result_array.value(1), "world");
    }

    #[test]
    fn test_rtrim_no_whitespace() {
        // Test rtrim with strings that have no trailing whitespace
        let input = StringArray::from(vec![Some(" hello"), Some("world")]);
        let input_ref: ArrayRef = Arc::new(input);

        let result = trim_array(&input_ref, TrimType::Right).unwrap();
        let result_array = result.as_any().downcast_ref::<StringArray>().unwrap();

        assert_eq!(result_array.value(0), " hello");
        assert_eq!(result_array.value(1), "world");
    }
}
```
**Copilot AI** · Dec 26, 2025
The test coverage is missing several important edge cases:
- Empty strings - test how the empty string `""` is handled (should remain empty)
- Strings with only whitespace - strings like `"   "` should become empty after trim
- Different types of whitespace - tabs, newlines, and other Unicode whitespace characters
- Large string arrays - test performance characteristics with larger data sets
- `LargeUtf8` arrays - only `Utf8` arrays are tested, but the code supports both
- Scalar value paths - the scalar handling branches in `spark_trim`/`ltrim`/`rtrim` are not tested
Consider adding tests for these scenarios to ensure correctness across all supported input types and edge cases.
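The std-library semantics these edge cases would exercise can be sketched in isolation (a standalone illustration using Rust's `str::trim`, which, like this PR's fast-path check, treats any `char::is_whitespace` character as trimmable):

```rust
fn main() {
    // Empty string stays empty after trimming.
    assert_eq!("".trim(), "");
    // A whitespace-only string collapses to empty.
    assert_eq!("   ".trim(), "");
    // Tabs and newlines count as whitespace.
    assert_eq!("\t hello \n".trim(), "hello");
    // So does Unicode whitespace such as U+00A0 (no-break space).
    assert_eq!("\u{00A0}hi\u{00A0}".trim(), "hi");
    println!("edge cases ok");
}
```

Tests along these lines would pin down the behavior before also covering the `LargeUtf8` and scalar paths.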
```rust
pub fn spark_trim(args: &[ColumnarValue]) -> DataFusionResult<ColumnarValue> {
    if args.len() != 1 {
        return Err(datafusion::common::DataFusionError::Execution(
            format!("trim expects 1 argument, got {}", args.len()),
        ));
    }

    match &args[0] {
        ColumnarValue::Array(array) => {
            let result = trim_array(array, TrimType::Both)?;
            Ok(ColumnarValue::Array(result))
        }
        ColumnarValue::Scalar(ScalarValue::Utf8(Some(s))) => {
            Ok(ColumnarValue::Scalar(ScalarValue::Utf8(Some(
                s.trim().to_string(),
            ))))
        }
        ColumnarValue::Scalar(ScalarValue::Utf8(None)) => {
            Ok(ColumnarValue::Scalar(ScalarValue::Utf8(None)))
        }
        ColumnarValue::Scalar(ScalarValue::LargeUtf8(Some(s))) => {
            Ok(ColumnarValue::Scalar(ScalarValue::LargeUtf8(Some(
                s.trim().to_string(),
            ))))
        }
        ColumnarValue::Scalar(ScalarValue::LargeUtf8(None)) => {
            Ok(ColumnarValue::Scalar(ScalarValue::LargeUtf8(None)))
        }
        _ => Err(datafusion::common::DataFusionError::Execution(
            "trim expects string argument".to_string(),
        )),
    }
}
```
**Copilot AI** · Dec 26, 2025
The implementation only handles the single-argument form of TRIM, but Spark's TRIM function supports an optional second parameter to specify custom characters to trim (e.g., TRIM('SL' FROM col) or TRIM(BOTH 'SL' FROM col)).
According to the test in CometStringExpressionSuite.scala line 249, queries like SELECT trim('SL', col) FROM table are expected to work. The current implementation will reject these with an argument count error.
The implementation should support both:
- `trim(string)` - trim whitespace (current implementation)
- `trim(trimChars, string)` - trim specific characters
This applies to all trim variants (trim, btrim, ltrim, rtrim). Consider either:
- Extending this implementation to support both argument counts
- Documenting that this is a partial implementation and updating tests/documentation accordingly
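The two-argument form maps naturally onto `str::trim_matches` with a character-set predicate. A minimal sketch under that assumption (the helper name `trim_custom` is hypothetical, not part of this PR):

```rust
// Hypothetical helper: remove any character contained in `trim_chars`
// from both ends of `s`, mirroring Spark's TRIM('SL' FROM col) semantics.
fn trim_custom(s: &str, trim_chars: &str) -> String {
    s.trim_matches(|c: char| trim_chars.contains(c)).to_string()
}

fn main() {
    // Matches Spark's documented example: TRIM('SL' FROM 'SSparkSQLS') -> 'parkSQ'
    assert_eq!(trim_custom("SSparkSQLS", "SL"), "parkSQ");
    // Characters outside the trim set are left untouched.
    assert_eq!(trim_custom("hello", "SL"), "hello");
}
```

`trim_start_matches` and `trim_end_matches` would cover the `ltrim`/`rtrim` variants the same way.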
@Brijesh-Thakkar, there seem to be no tests for the new functionality. Could you please add some tests on the Spark side to make sure we are not missing out on correctness? (Or mark the PR as WIP/draft if you are still working on it.)

Also, why do you think we need a new
**Codecov Report**

✅ All modified and coverable lines are covered by tests.

```
@@           Coverage Diff            @@
##             main    #2988    +/-  ##
============================================
- Coverage     56.12%   54.90%   -1.23%
- Complexity      976     1337    +361
============================================
  Files           119      167     +48
  Lines         11743    15493   +3750
  Branches       2251     2569    +318
============================================
+ Hits           6591     8506   +1915
- Misses         4012     5766   +1754
- Partials       1140     1221     +81
```

☔ View full report in Codecov by Sentry.
- Use simple Rust string trim methods
- Works for 1-argument case (whitespace trimming)
- Add TODO for 2-argument case (custom chars)
- All tests pass
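A minimal sketch of what this simplified 1-argument approach looks like with the built-in methods (standalone illustration, not the exact PR code):

```rust
// The four variants map directly onto std string methods:
// trim/btrim -> trim, ltrim -> trim_start, rtrim -> trim_end.
fn main() {
    let s = "  hello  ";
    assert_eq!(s.trim(), "hello");         // trim / btrim
    assert_eq!(s.trim_start(), "hello  "); // ltrim
    assert_eq!(s.trim_end(), "  hello");   // rtrim
}
```

Note that the std methods remove all Unicode whitespace, which may be broader than Spark's own trimming of plain spaces; that difference would need verifying against Spark semantics.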
@coderfender Thank you for the feedback! You're absolutely right. I've updated the implementation to use a simpler approach with Rust's built-in string trim methods.

Current implementation:

Regarding 2-argument support:

Next steps:

Let me know if this approach looks better or if you'd prefer I investigate using DataFusion's internal trim functions instead (though I wasn't able to find them exported in the available crates).
@coderfender, also, could you please suggest some ways I could understand the codebase and architecture properly and quickly?

I believe we are using DataFusion's inbuilt function and wonder why we can't optimize that?

I see the trim functions are registered in `comet_scalar_funcs.rs`. My questions:

Could you point me to the right place to look? I want to make sure I'm fixing the actual bottleneck.

@Brijesh-Thakkar yeah, the scalar functions (at least trim) should be in the DataFusion codebase. I would suggest taking a look at it and perhaps optimizing it there?

@coderfender Ah, I understand now! So the trim functions are implemented directly in DataFusion's codebase, not in Comet. So the right approach would be to:

That makes much more sense, right?

Exactly!

Perfect! Thank you for the guidance @coderfender. I'll close this PR.

Closing this PR now.

Thank you! Please tag me in the DF PR so that I can help you there.
Fixes #2977
Rationale for this change
Comet currently falls back to JVM-based implementations for string trimming functions, which leads to a significant performance regression compared to Spark (approximately 0.6–0.7x in benchmarks, as reported in #2977).
This change introduces native Rust implementations for trim-related string expressions, eliminating JVM overhead and unnecessary allocations. The goal is to restore and exceed Spark baseline performance for these operations.
What changes are included in this PR?
- New file `trim.rs` containing native Rust implementations for:
  - `spark_trim`
  - `spark_ltrim`
  - `spark_rtrim`
  - `spark_btrim`
- Support for both `Utf8` and `LargeUtf8` Arrow string array types

The implementation avoids JVM execution paths and reduces allocations that previously caused the observed performance degradation.
How are these changes tested?