Skip to content

Conversation

@lemorage
Copy link

Which issue does this PR close?

Rationale for this change

The split_part function rejected Binary/BinaryView types, causing queries with bytea columns to fail with: "Expected String but received Binary, DataType: Binary".

What changes are included in this PR?

Changed split_part signature from Coercion::new_exact() to Coercion::new_implicit() for both string parameters.

Are these changes tested?

Yes:

  • Added tests for Binary coercion
  • Added test reproducing the original ClickBench q6 scenario

Are there any user-facing changes?

@github-actions github-actions bot added sqllogictest SQL Logic Tests (.slt) functions Changes to functions implementation labels Jan 15, 2026
@lemorage
Copy link
Author

The fix is prepared, I am not sure if this would be a reasonable change tbh tho. Maybe just setting config is fine? set datafusion.execution.parquet.binary_as_string = true. Need to be discussed before merging.

@Jefffrey
Copy link
Contributor

Personally I'm not sure about this approach as it might suggest all other string functions with similar signature should now be fixed to also ensure they coerce from binary 🤔

e.g.

impl AsciiFunc {
pub fn new() -> Self {
Self {
signature: Signature::coercible(
vec![Coercion::new_exact(TypeSignatureClass::Native(
logical_string(),
))],
Volatility::Immutable,
),
}
}
}

impl BTrimFunc {
pub fn new() -> Self {
Self {
signature: Signature::one_of(
vec![
TypeSignature::Coercible(vec![
Coercion::new_exact(TypeSignatureClass::Native(logical_string())),
Coercion::new_exact(TypeSignatureClass::Native(logical_string())),
]),
TypeSignature::Coercible(vec![Coercion::new_exact(
TypeSignatureClass::Native(logical_string()),
)]),
],
Volatility::Immutable,
),
aliases: vec![String::from("trim")],
}
}
}

impl ContainsFunc {
pub fn new() -> Self {
Self {
signature: Signature::coercible(
vec![
Coercion::new_exact(TypeSignatureClass::Native(logical_string())),
Coercion::new_exact(TypeSignatureClass::Native(logical_string())),
],
Volatility::Immutable,
),
}
}
}

@adriangb
Copy link
Contributor

I guess ClickHouse does coerce them? So is this just a case of our semantics != ClickHouse?

@Jefffrey
Copy link
Contributor

I guess ClickHouse does coerce them? So is this just a case of our semantics != ClickHouse?

It's a similar problem to how Spark will coerce strings to ints for their math functions but we don't.

@alamb
Copy link
Contributor

alamb commented Jan 15, 2026

I guess ClickHouse does coerce them? So is this just a case of our semantics != ClickHouse?

The problem is only with the ClickHouse hits_partitioned dataset (which has 100 parquet files) which was written by an ancient version of pyarrow which did not correctly annotate the string columns as STRING. This is the only usecase for this config:

https://datafusion.apache.org/user-guide/configs.html

datafusion.execution.parquet.binary_as_string false (reading) If true, parquet reader will read columns of Binary/LargeBinary with Utf8, and BinaryView with Utf8View. Parquet files generated by some legacy writers do not correctly set the UTF8 flag for strings, causing string columns to be loaded as BLOB instead.

So TLDR I don't think we need to make split_part work with binary data, as proposed in this PR.

Instead I propose:

  1. Change the internal error to a proper error when running split_part on binary data
  2. Update the benchmarks to properly set the binary as string. I made a PR here: Use correct setting for click bench queries in sql_planner benchmark #19835

@alamb
Copy link
Contributor

alamb commented Jan 15, 2026

Thank you for working on this @lemorage

Did you find out what was causing the internal error? I would expect the query would error with a normal error (about not supporting binary columns)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

functions Changes to functions implementation sqllogictest SQL Logic Tests (.slt)

Projects

None yet

Development

Successfully merging this pull request may close these issues.

type coercion error in clickbench50 query

5 participants