-
Notifications
You must be signed in to change notification settings - Fork 1.9k
Add Field to Expr::Cast -- allow logical expressions to express a cast to an extension type
#18136
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
datafusion/expr/src/expr_schema.rs
Outdated
| f.as_ref() | ||
| .clone() | ||
| .with_data_type(data_type.data_type().clone()) | ||
| .with_metadata(f.metadata().clone()) | ||
| // TODO: should nullability be overridden here or derived from the | ||
| // input expression? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am not sure if this type of cast should be able to express nullability or not.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Currently this PR does not consider the nullability of the cast field, because the destination physical expression won't consider it either (and thus the return fields would be out of sync).
| data_type.clone(), | ||
| // TODO: this drops extension metadata associated with the cast | ||
| data_type.data_type().clone(), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't actually need physical expressions to be able to cast things...my vauge plan is to use a logical plan transformation or perhaps optimizer rule to replace casts to extension types with a ScalarUDF call. This should possibly error if there is mismatched metadata between the input and destination (i.e., a physical cast would only ever represent a storage cast, which is usually OK).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I updated this to error at the logical expr -> physical expr stage if there is metadata on the cast field. I think this is better than dropping it (and won't break any existing code because before this PR such a cast could not exist).
099dc05 to
00237be
Compare
|
@alamb If updating DataTypes to FieldRefs is still what we're doing in the logical plan, this PR is ready for review! |
|
This makes sense in general! I only got partially into the review before realizing it conflicts with #19097. Maybe that one is the issue to focus on? |
|
Thank you for reviewing! I didn't see that PR but it looks like it focuses on the physical cast. This PR doesn't really have anything to do with the physical cast (in order to actually execute a cast to an extension type we need a registry of how exactly that should be done, which we don't have yet), although I'm sure there's some merge conflict between them. |
|
Oh right two different layers! But they're both essentially adding I will try to take another look here tomorrow. |
adriangb
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Overall this looks good to me.
Is there an example of actually customizing casting, or an issue tracking getting us there? E.g. in postgres you can do select gen_random_uuid() = 'b2bb3fea-2598-449e-ac64-8ad1754df02d';? As far as I can tell that needs more work but this is a step towards that.
| )?); | ||
| let data_type = cast.arrow_type.as_ref().required("arrow_type")?; | ||
| Ok(Expr::Cast(Cast::new(expr, data_type))) | ||
| let field = Field::new("", data_type, cast.nullable.unwrap_or(true)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It feels like there should be an UnnamedField or DataTypeWithNullabilityAndMetadata. I saw similar patterns in the literal expressions.
| expressions::cast( | ||
| create_physical_expr(expr, input_dfschema, execution_props)?, | ||
| input_schema, | ||
| field.data_type().clone(), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This does end up tying into #19097: I think they'd work well together, we'd just want to pass the field directly here.
| pub data_type: DataType, | ||
| pub field: FieldRef, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we keep data_type as a deprecated field that we populate from field.data_type() for a couple of releases?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We could also implement this by adding a metadata: FieldMetadata field rather than switching the data_type to a FieldRef. We mostly just switched DataType to FieldRef in other places so that's what I did here.
datafusion/expr/src/expr_schema.rs
Outdated
| // This currently propagates the nullability of the input | ||
| // expression as the resulting physical expression does | ||
| // not currently consider the nullability specified here | ||
| f.as_ref() | ||
| .clone() | ||
| .with_data_type(field.data_type().clone()) | ||
| .with_metadata(f.metadata().clone()) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I didn't fully understand this comment / what's going on here. I'm sure it's my fault but could you try rewording the comment, maybe giving f a more meaningful name and replacing _ with _unused_variable_that_has_a_useful_name?
There's an issue to create a registry for |
Field to Expr::cast -- allow logical expressions to express a cast to an extension type
Field to Expr::cast -- allow logical expressions to express a cast to an extension typeField to Expr::Cast -- allow logical expressions to express a cast to an extension type
|
Is this waiting for anything else before we merge it? We made a |
|
I'm personally happy with this...it seems like there is no opposition to the main change ( |
Which issue does this PR close?
I am sorry that I missed the previous PR implementing this ( #18120 ) and I'm also happy to review that one instead of updating this!
Rationale for this change
Other systems that interact with the logical plan (e.g., SQL, Substrait) can express types that are not strictly within the arrow DataType enum.
What changes are included in this PR?
For the Cast and TryCast structs, the destination data type was changed from a DataType to a FieldRef.
Are these changes tested?
Yes.
Are there any user-facing changes?
Yes, any code using
Cast { .. }to create an expression would need to useCast::new()instead (or pass on field metadata if it has it). Existing matches will need to be upated for thedata_type->fieldmember rename.