-
Notifications
You must be signed in to change notification settings - Fork 1.9k
feat: Replace DataType with FieldRef in Cast
#18120
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
1598f94 to
2a06d44
Compare
DataType with FieldRef in Cast
40f0012 to
bd29587
Compare
|
Apologies...I missed this PR! Feel free to lift anything useful from #18136 (I'm just as happy to review this instead). |
6563998 to
7b32f01
Compare
|
(I did clean up #18136 because I had some time on Friday but also feel free to keep going here and I'm happy to review!) |
Hi @paleolimbot , this pr is ready for review |
DataType with FieldRef in CastDataType with FieldRef in Cast
paleolimbot
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice! I have a few high-level concerns but am happy to iterate. Another option is to pursue #18136 and implement the nullability handling here (which I didn't implement) as a follow-on!
- The
DataTypeExt::into_nullable_field_ref()I think will help eliminate the need for theArc::new(Field::new("", ...))changes. - I think that the
cast()andCast::new()signatures should probably remain unchanged to minimize disruption in this PR and downstream - Considering the nullability of the cast field in the logical expression is problematic (highighted by the CI failures here) because all existing casts simply propagate the input expression nullability. I think this is what most people mean by a cast and so we should keep that behaviour. If we were to express a cast with explicit nullability we would have to do something like
Option<bool>(where None means propagate the input, true means explicitly cast to nullable, and false means explicitly cast to non-nullable). All of those would require integration with the physical expression implementing the cast to ensure output schema consistency (which is the reason for the CI failure I believe)
| Expr::Cast(Cast { expr, data_type }) => { | ||
| write!(f, "CAST({expr} AS {data_type})") | ||
| Expr::Cast(Cast { expr, field }) => { | ||
| write!(f, "CAST({expr} AS {})", field.data_type()) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There's now a format_type_and_metadata() helper in datafusion_common that will ensure that casts with metadata are displayed. That display is a bit verbose at the moment when metadata exists but it makes it quite a bit easier to track down bugs where the metadata was dropped.
| let expr = Expr::Cast(Cast { | ||
| expr: Box::new(Expr::Literal(ScalarValue::Float32(Some(1.23)), None)), | ||
| data_type: DataType::Utf8, | ||
| field: Arc::new(Field::new("cast", DataType::Utf8, false)), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
| field: Arc::new(Field::new("cast", DataType::Utf8, false)), | |
| field: DataType::Utf8.into_nullable_field_ref(), |
@alamb kindly added a helper for this (there are many places in this PR that would benefit!)
An aside is that I think we should ensure that these fields are always unnamed for consistency (and that the names of such a field are never used).
| /// Types that can be used to describe the result of a cast expression. | ||
| pub trait IntoCastField { | ||
| fn into_cast_field(self, expr: &Expr) -> FieldRef; | ||
| } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These are somewhat similar to the FieldExt traits that Andrew added to datafusion_common/src/datatypes.rs. I think DataFusion would benefit if it had a way to say "get me a FieldRef that is actually a data type", but I am also not sure this PR is the right place to do that.
| pub fn cast(expr: Expr, data_type: DataType) -> Expr { | ||
| Expr::Cast(Cast::new(Box::new(expr), data_type)) | ||
| pub fn cast<F>(expr: Expr, field: F) -> Expr | ||
| where | ||
| F: IntoCastField, | ||
| { | ||
| let field = field.into_cast_field(&expr); | ||
| Expr::Cast(Cast::new(Box::new(expr), field)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see that a number of references to cast() were updated as cast(Arc::new(Field::new(...)))...I wonder if those can be reverted (in theory IntoCastField should handle the conversion?)
| let (_, input_field) = expr.to_field(schema)?; | ||
| let mut combined_metadata = FieldMetadata::from(input_field.metadata()); | ||
| combined_metadata.extend(FieldMetadata::from(field.metadata())); | ||
| let field = combined_metadata.add_to_field(field.as_ref().clone()); | ||
| Ok(Arc::new(field)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would prefer not to combine metadata here...this would mean that a Cast for an extension type would basically be a "reinterpret cast" not a "logical cast", which is what I believe the Cast is trying to represent in a logical plan. The Alias can currently add metadata in this way which I think is a bit fishy but is at least not attempting to represent a logical operation.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think many of these changes should be reverted (I don't think this file is generated?)
| datafusion_common.ArrowType arrow_type = 2; | ||
| datafusion_common.Field field = 2; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It is also possible to add metadata + nullability here to keep the protobuf backward compatible (not a requirement but is relatively easy to do)
| pub fn new(expr: Box<Expr>, data_type: DataType) -> Self { | ||
| Self { expr, data_type } | ||
| pub fn new(expr: Box<Expr>, field: FieldRef) -> Self { | ||
| Self { expr, field } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This change (and the corresponding change for TryCast::new() seem to have had caused significant disruption...it is probably worth keeping the original signature and adding a new function specifically for casts with metadata and/or nullability to minimize the disruption in DataFusion and downstream.
|
Thank you for your contribution. Unfortunately, this pull request is stale because it has been open 60 days with no activity. Please remove the stale label or comment or this will be closed in 7 days. |
Which issue does this PR close?
Rationale for this change
What changes are included in this PR?
Are these changes tested?
Are there any user-facing changes?