Make translate emit Utf8View for Utf8View input #20624

Open
shivaaang wants to merge 2 commits into apache:main from shivaaang:translate-utf8view

@shivaaang

Which issue does this PR close?

Part of #20585

Rationale for this change

String UDFs should preserve string representation where feasible. translate previously accepted Utf8View input but emitted Utf8, causing an unnecessary type downgrade. This aligns translate with the expected behavior of returning the same string type as its primary input.

What changes are included in this PR?

  1. Updated translate return type inference to emit Utf8View when input is Utf8View, while preserving existing behavior for Utf8 and LargeUtf8.
  2. Refactored translate and translate_with_map to use explicit string builders (via a local TranslateOutput helper trait) instead of .collect::<GenericStringArray<T>>(), so the correct output array type is produced for each input type.
  3. Added unit tests for Utf8View input (basic, null, non-ASCII) and sqllogictests verifying arrow_typeof output for all three string types.
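The builder-based refactor in item 2 can be illustrated with a minimal, standalone sketch. Everything here is hypothetical and for illustration only: `OutputBuilder`, `OffsetBuilder`, `ViewBuilder`, and `translate_into` are toy stand-ins, not Arrow or DataFusion APIs, and the sketch maps individual `char`s, which may differ from how the real function splits strings. The point is only that one generic loop can fill whichever builder matches the input type, instead of always collecting into a single array type.

```rust
use std::collections::HashMap;

/// Hypothetical stand-in for a string builder; not an Arrow trait.
trait OutputBuilder {
    fn append_value(&mut self, value: &str);
    fn append_null(&mut self);
}

/// Toy offset-based layout (in the spirit of Utf8 / LargeUtf8).
#[derive(Default, Debug)]
struct OffsetBuilder {
    data: String,
    offsets: Vec<usize>,
    is_null: Vec<bool>,
}

impl OutputBuilder for OffsetBuilder {
    fn append_value(&mut self, value: &str) {
        self.data.push_str(value);
        self.offsets.push(self.data.len());
        self.is_null.push(false);
    }
    fn append_null(&mut self) {
        self.offsets.push(self.data.len());
        self.is_null.push(true);
    }
}

/// Toy per-row layout (standing in for Utf8View).
#[derive(Default, Debug)]
struct ViewBuilder {
    rows: Vec<Option<String>>,
}

impl OutputBuilder for ViewBuilder {
    fn append_value(&mut self, value: &str) {
        self.rows.push(Some(value.to_string()));
    }
    fn append_null(&mut self) {
        self.rows.push(None);
    }
}

/// One translate loop, generic over the output builder.
fn translate_into<B: OutputBuilder>(
    input: &[Option<&str>],
    from: &str,
    to: &str,
    builder: &mut B,
) {
    // Map each `from` character to its replacement; `None` means "delete".
    let to_chars: Vec<char> = to.chars().collect();
    let mut map: HashMap<char, Option<char>> = HashMap::new();
    for (i, c) in from.chars().enumerate() {
        // Assumption of this sketch: the first mapping wins on duplicates.
        map.entry(c).or_insert(to_chars.get(i).copied());
    }

    let mut buf = String::new();
    for row in input {
        match row {
            None => builder.append_null(),
            Some(s) => {
                buf.clear();
                for c in s.chars() {
                    match map.get(&c) {
                        Some(Some(r)) => buf.push(*r), // replaced
                        Some(None) => {}               // no counterpart in `to`: dropped
                        None => buf.push(c),           // unmapped: kept as-is
                    }
                }
                builder.append_value(&buf);
            }
        }
    }
}

fn main() {
    let input = [Some("hello"), None, Some("héllo")];

    let mut view = ViewBuilder::default();
    translate_into(&input, "el", "ip", &mut view);
    println!("{:?}", view);

    let mut offsets = OffsetBuilder::default();
    translate_into(&input, "el", "ip", &mut offsets);
    println!("{:?}", offsets);
}
```

The caller picks the builder from the input type, so the loop itself never has to know which layout it is producing.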

Are these changes tested?

Yes. Unit tests and sqllogictests are included.

Are there any user-facing changes?

No.

github-actions bot added the sqllogictest (SQL Logic Tests (.slt)) and functions (Changes to functions implementation) labels on Feb 28, 2026

Comment on lines +97 to +101
if arg_types[0] == DataType::Utf8View {
    Ok(DataType::Utf8View)
} else {
    utf8_to_str_type(&arg_types[0], "translate")
}
Contributor

Suggested change
- if arg_types[0] == DataType::Utf8View {
-     Ok(DataType::Utf8View)
- } else {
-     utf8_to_str_type(&arg_types[0], "translate")
- }
+ Ok(arg_types[0].clone())

Simpler

Author

Good point, simplified.
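
As a rough check on why the one-line form is enough, the sketch below compares the two versions over the three string types mentioned in the PR description. `StrType` is a hypothetical stand-in for Arrow's `DataType`, and `utf8_to_str_type_sketch` encodes an assumption about the existing helper (that it maps a view type to Utf8, which is the downgrade this PR removes). The real helper also accepts types that are not modeled here, so this covers only the string cases.

```rust
/// Hypothetical stand-in for the three Arrow string types discussed here.
#[derive(Debug, Clone, PartialEq)]
enum StrType {
    Utf8,
    LargeUtf8,
    Utf8View,
}

/// Assumed behavior of the existing helper for these three types only:
/// LargeUtf8 is preserved, everything else becomes Utf8.
fn utf8_to_str_type_sketch(t: &StrType) -> StrType {
    match t {
        StrType::LargeUtf8 => StrType::LargeUtf8,
        StrType::Utf8 | StrType::Utf8View => StrType::Utf8,
    }
}

/// The originally proposed form: special-case Utf8View, else defer to the helper.
fn return_type_branching(arg_types: &[StrType]) -> StrType {
    if arg_types[0] == StrType::Utf8View {
        StrType::Utf8View
    } else {
        utf8_to_str_type_sketch(&arg_types[0])
    }
}

/// The suggested form: return the first argument's type unchanged.
fn return_type_simplified(arg_types: &[StrType]) -> StrType {
    arg_types[0].clone()
}

fn main() {
    let all = [StrType::Utf8, StrType::LargeUtf8, StrType::Utf8View];
    for t in &all {
        let args = [t.clone()];
        // Both forms agree for every string type in this sketch.
        assert_eq!(return_type_branching(&args), return_type_simplified(&args));
        println!("{:?} -> {:?}", t, return_type_simplified(&args));
    }
}
```

The equivalence holds only while the first argument is restricted to these string types; the thread does not state whether other types can reach this code path.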

let arr = string_array.as_string::<i32>();
translate_with_map::<i32, _>(
let builder =
GenericStringBuilder::<i32>::with_capacity(len, len * 4);
Contributor

Why * 4? Seems it might overestimate, compared to getting the byte size from input array?

Author

Updated to use arr.value_data().len() at all call sites.
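
To make the two capacity hints concrete, here is a small standalone sketch, assuming `len` in the original call is the row count. `input_byte_len` is a hypothetical helper that sums the bytes of the non-null values, standing in for `arr.value_data().len()`. `rows_times_four` reproduces the earlier `len * 4` heuristic. Neither function is part of Arrow.

```rust
/// Total bytes across all non-null values; a stand-in for the size of the
/// input array's value buffer.
fn input_byte_len(values: &[Option<&str>]) -> usize {
    values.iter().flatten().map(|s| s.len()).sum()
}

/// The earlier heuristic: a fixed four bytes per row, whatever the content.
fn rows_times_four(values: &[Option<&str>]) -> usize {
    values.len() * 4
}

fn main() {
    // Short rows: the fixed heuristic reserves more than the input holds.
    let short = [Some("abc"), None, Some("héllo"), Some("")];
    println!(
        "short rows: bytes = {}, rows * 4 = {}",
        input_byte_len(&short),
        rows_times_four(&short)
    );

    // One long row: the fixed heuristic reserves far less than the input holds.
    let long_value = "a".repeat(128);
    let long = [Some(long_value.as_str())];
    println!(
        "long row: bytes = {}, rows * 4 = {}",
        input_byte_len(&long),
        rows_times_four(&long)
    );
}
```

Either number is only a hint: translate can change the byte length of a value when a character is replaced by one of a different UTF-8 width or is dropped, so the builder may still need to grow or may end up with spare capacity.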

@Jefffrey
Contributor

Jefffrey commented Mar 2, 2026

run benchmark replace

@alamb-ghbot

🤖 ./gh_compare_branch_bench.sh compare_branch_bench.sh Running
Linux aal-dev 6.14.0-1018-gcp #19~24.04.1-Ubuntu SMP Wed Sep 24 23:23:09 UTC 2025 x86_64 x86_64 x86_64 GNU/Linux
Comparing translate-utf8view (2f933d7) to 8df75c3 diff
BENCH_NAME=replace
BENCH_COMMAND=cargo bench --features=parquet --bench replace
BENCH_FILTER=
BENCH_BRANCH_NAME=translate-utf8view
Results will be posted here when complete

@alamb-ghbot

🤖: Benchmark completed

Details

group                                                                     main                                   translate-utf8view
-----                                                                     ----                                   ------------------
replace size=1024/replace_large_string [size=1024, str_len=128]           1.01    164.8±3.47µs        ? ?/sec    1.00    163.5±1.47µs        ? ?/sec
replace size=1024/replace_large_string [size=1024, str_len=32]            1.02    118.1±3.14µs        ? ?/sec    1.00    116.1±1.02µs        ? ?/sec
replace size=1024/replace_string [size=1024, str_len=128]                 1.02    168.6±2.42µs        ? ?/sec    1.00    165.5±1.55µs        ? ?/sec
replace size=1024/replace_string [size=1024, str_len=32]                  1.01    118.3±1.95µs        ? ?/sec    1.00    117.6±0.71µs        ? ?/sec
replace size=1024/replace_string_ascii_single [size=1024, str_len=128]    1.00     94.5±0.24µs        ? ?/sec    1.01     95.7±1.82µs        ? ?/sec
replace size=1024/replace_string_ascii_single [size=1024, str_len=32]     1.01     88.5±2.32µs        ? ?/sec    1.00     87.4±0.58µs        ? ?/sec
replace size=1024/replace_string_view [size=1024, str_len=128]            1.00    169.0±0.89µs        ? ?/sec    1.00    169.3±3.03µs        ? ?/sec
replace size=1024/replace_string_view [size=1024, str_len=32]             1.00    121.7±3.16µs        ? ?/sec    1.01    122.4±1.76µs        ? ?/sec
replace size=4096/replace_large_string [size=4096, str_len=128]           1.00    463.4±3.23µs        ? ?/sec    1.00    464.1±5.75µs        ? ?/sec
replace size=4096/replace_large_string [size=4096, str_len=32]            1.01    287.1±3.68µs        ? ?/sec    1.00    284.5±1.11µs        ? ?/sec
replace size=4096/replace_string [size=4096, str_len=128]                 1.00    475.3±2.00µs        ? ?/sec    1.00    476.5±1.16µs        ? ?/sec
replace size=4096/replace_string [size=4096, str_len=32]                  1.01    289.2±1.60µs        ? ?/sec    1.00    287.6±1.28µs        ? ?/sec
replace size=4096/replace_string_ascii_single [size=4096, str_len=128]    1.00    198.5±0.92µs        ? ?/sec    1.01    199.9±2.76µs        ? ?/sec
replace size=4096/replace_string_ascii_single [size=4096, str_len=32]     1.00    158.0±0.60µs        ? ?/sec    1.02    160.9±1.01µs        ? ?/sec
replace size=4096/replace_string_view [size=4096, str_len=128]            1.00    474.4±0.93µs        ? ?/sec    1.00    474.8±1.60µs        ? ?/sec
replace size=4096/replace_string_view [size=4096, str_len=32]             1.00    296.4±0.98µs        ? ?/sec    1.01    298.4±1.30µs        ? ?/sec


/// Helper trait to abstract over different string builder types so `translate`
/// and `translate_with_map` can produce the correct output array type.
trait TranslateOutput {
Contributor

We have a lot of PRs making changes to other string functions; I wonder if having this trait specific only to translate is the best move? Can we take a step back and see if there is an easier way for all string UDFs to benefit from common code changes required?

Author

Agreed, replaced with Arrow's StringLikeArrayBuilder.

- Simplify return_type() to Ok(arg_types[0].clone())
- Replace len * 4 capacity with arr.value_data().len()
- Remove custom TranslateOutput trait, use Arrow's StringLikeArrayBuilder