Make translate emit Utf8View for Utf8View input#20624
Make translate emit Utf8View for Utf8View input#20624shivaaang wants to merge 2 commits intoapache:mainfrom
Conversation
| if arg_types[0] == DataType::Utf8View { | ||
| Ok(DataType::Utf8View) | ||
| } else { | ||
| utf8_to_str_type(&arg_types[0], "translate") | ||
| } |
There was a problem hiding this comment.
| if arg_types[0] == DataType::Utf8View { | |
| Ok(DataType::Utf8View) | |
| } else { | |
| utf8_to_str_type(&arg_types[0], "translate") | |
| } | |
| Ok(arg_types[0].clone()) |
Simpler
| let arr = string_array.as_string::<i32>(); | ||
| translate_with_map::<i32, _>( | ||
| let builder = | ||
| GenericStringBuilder::<i32>::with_capacity(len, len * 4); |
There was a problem hiding this comment.
Why * 4? Seems it might overestimate, compared to getting the byte size from input array?
There was a problem hiding this comment.
Updated to use arr.value_data().len() at all call sites.
|
run benchmark replace |
|
🤖 |
|
🤖: Benchmark completed Details
|
|
|
||
| /// Helper trait to abstract over different string builder types so `translate` | ||
| /// and `translate_with_map` can produce the correct output array type. | ||
| trait TranslateOutput { |
There was a problem hiding this comment.
We have a lot of PRs making changes to other string functions; I wonder if having this trait specific only to translate is the best move? Can we take a step back and see if there is an easier way for all string UDFs to benefit from common code changes required?
There was a problem hiding this comment.
Actually looks like we have
There was a problem hiding this comment.
Agreed, replaced with Arrow's StringLikeArrayBuilder.
- Simplify return_type() to Ok(arg_types[0].clone()) - Replace len * 4 capacity with arr.value_data().len() - Remove custom TranslateOutput trait, use Arrow's StringLikeArrayBuilder
Which issue does this PR close?
Part of #20585
Rationale for this change
String UDFs should preserve string representation where feasible.
translatepreviously accepted Utf8View input but emitted Utf8, causing an unnecessary type downgrade. This alignstranslatewith the expected behavior of returning the same string type as its primary input.What changes are included in this PR?
translatereturn type inference to emit Utf8View when input is Utf8View, while preserving existing behavior for Utf8 and LargeUtf8.translateandtranslate_with_mapto use explicit string builders (via a localTranslateOutputhelper trait) instead of.collect::<GenericStringArray<T>>(), so the correct output array type is produced for each input type.arrow_typeofoutput for all three string types.Are these changes tested?
Yes. Unit tests and sqllogictests are included.
Are there any user-facing changes?
No.