Skip to content

Make lower and upper emit Utf8View for Utf8View input#20616

Open
kumarUjjawal wants to merge 3 commits intoapache:mainfrom
kumarUjjawal:refactor/lower_utf8view
Open

Make lower and upper emit Utf8View for Utf8View input#20616
kumarUjjawal wants to merge 3 commits intoapache:mainfrom
kumarUjjawal:refactor/lower_utf8view

Conversation

@kumarUjjawal
Copy link
Contributor

@kumarUjjawal kumarUjjawal commented Feb 28, 2026

Which issue does this PR close?

Rationale for this change

String UDFs should preserve string representation where feasible. lower and upper previously accepted Utf8View input but emitted Utf8, causing an unnecessary type downgrade. This aligns both with the expected behavior of returning the same string type as its primary input.

What changes are included in this PR?

  • Updated lower and upper return type inference to emit Utf8View when input is Utf8View

Are these changes tested?

Yes

Are there any user-facing changes?

@github-actions github-actions bot added sqllogictest SQL Logic Tests (.slt) functions Changes to functions implementation labels Feb 28, 2026
Copy link
Contributor

@neilconway neilconway left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for looking at this!

The PR description has a typo (still talks about repeat).

Personally I find it a bit confusing to have the same / overlapping changes in concurrent PRs (e.g., for upper and lower).

}

#[derive(Debug, Clone, Copy)]
enum Utf8ViewOutput {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why do we need a new enum -- can't we just pass the return data type the caller expects?

if arg_types[0] == DataType::Utf8View {
Ok(DataType::Utf8View)
} else {
utf8_to_str_type(&arg_types[0], "lower")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder if it would be helpful to have a helper here that handles all three string representations and returns the optimal return type. Having a helper for Utf8 vs. LargeUtf8 but handling Utf8View at the call-site seems a bit odd.

} else {
string_builder.append_null();
match utf8view_output {
Utf8ViewOutput::Utf8 => {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm confused why we need to add code to handle the "Utf8View input, Utf8 output" case -- seems like we don't actually want that behavior. If we fixed up upper and lower as part of the same PR, couldn't we just have case_conversion return a value of the same type as its input? That would avoid this redundant code and also get rid of the utf8view_output parameter.

@kumarUjjawal
Copy link
Contributor Author

Thanks for looking at this!

The PR description has a typo (still talks about repeat).

Personally I find it a bit confusing to have the same / overlapping changes in concurrent PRs (e.g., for upper and lower).

Thanks for the feedbak. I agree I should have done both lower and upper in the same pr, that makes much more sense. I'm working on it, will update the pr and close the other one. Initially I had thougth about it I don't remember why I didn't do it then.

@kumarUjjawal kumarUjjawal changed the title Make lower emit Utf8View for Utf8View input Make lower and upper emit Utf8View for Utf8View input Mar 2, 2026
}

#[test]
fn lower_return_type_dictionary_utf8view() -> Result<()> {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For symmetry, should we add an analogous test case for upper?

}
}

pub(crate) fn case_conversion_return_type(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This doesn't seem specific to case conversion; we probably need a more generic facility for finding the return return type for a function that takes a string. But we can handle that later.

name: &str,
) -> Result<DataType> {
match arg_type {
DataType::Utf8View | DataType::BinaryView => Ok(DataType::Utf8View),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I believe BinaryView will be rejected by the type signatures for these functions, so probably clearer if we don't handle it.

LargeUtf8

query T
SELECT arrow_typeof(upper(arrow_cast('foo', 'Utf8View')))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Add SLT tests for dictionary with a Utf8View value type?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

functions Changes to functions implementation sqllogictest SQL Logic Tests (.slt)

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants