-
Notifications
You must be signed in to change notification settings - Fork 94
feat: handle struct-based UDT literals in core #613
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
133b704 to
e4f36ea
Compare
e4f36ea to
cef9798
Compare
cef9798 to
f3379c1
Compare
|
FYI, this PR makes no attempt to actually validate if the struct representation provided matches the definition in the yaml file. I think that this is the right thing to do, but it turned out to be slightly more complicated than I thought, as it involves threading an |
Support both opaque (google.protobuf.Any) and structured (Literal.Struct) encodings for user-defined type literals per Substrait spec. - Split UserDefinedLiteral into UserDefinedAny and UserDefinedStruct - Move type parameters to interface level for parameterized types - Comprehensive test coverage including roundtrip tests - Throw exception on unhandled struct-based representation in isthmus
1eb4fb4 to
f5b6341
Compare
| extensionCollector.getTypeReference(SimpleExtension.TypeAnchor.of(expr.urn(), expr.name())); | ||
| return lit( | ||
| bldr -> { | ||
| try { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This exception doesn't happen anymore because we don't parse the Any here. Instead, we have a reference to the pre-parsed proto directly.
| public ParameterizedType userDefined( | ||
| int ref, java.util.List<io.substrait.type.Type.Parameter> typeParameters) { | ||
| throw new UnsupportedOperationException( | ||
| "User defined types are not supported in Parameterized Types for now"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is consistent with the above, where we don't yet support ParamerizedType conversion.
| public DerivationExpression userDefined( | ||
| int ref, java.util.List<io.substrait.type.Type.Parameter> typeParameters) { | ||
| throw new UnsupportedOperationException( | ||
| "User defined types are not supported in Derivation Expressions for now"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is also consistent with the above.
core/src/main/java/io/substrait/type/proto/TypeProtoConverter.java
Outdated
Show resolved
Hide resolved
| public RexNode visit(Expression.UserDefinedStructLiteral expr, Context context) | ||
| throws RuntimeException { | ||
| throw new UnsupportedOperationException( | ||
| "UserDefinedStructLiteral representation is not yet supported in Isthmus"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is #612
78488b4 to
e8cb862
Compare
vbarua
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Started taking a look at this and left some comments. Mostly looks reasonable. Want to come back and look at your tests with fresh eyes, and also think about parameterized types with a fresh 🧠.
| return Type.withNullability(nullable()).userDefined(urn(), name()); | ||
| public abstract List<io.substrait.type.Type.Parameter> typeParameters(); | ||
|
|
||
| public abstract List<Literal> fields(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is there a reason to use a List<Literal> here instead of just a Expression.StructLiteral, which would map directly to the protobuf?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sharing this comment because I believe it is relevant.
Basically it is the same issue, which is that the POJO class called StructLiteral is representing a special case of the Literal class.
message Literal { // Both StructLiteral and UserDefinedStructLiteral are representing this _whole_ message
oneof literal_type {
...
Struct struct = 25;
...
UserDefined user_defined = 33;
}
...
message Struct {
// A possibly heterogeneously typed list of literals
repeated Literal fields = 1;
}
message UserDefined {
oneof type_anchor_type {
// points to a type_anchor defined in this plan
uint32 type_reference = 1;
// points to a type_alias_anchor defined in this plan.
uint32 type_alias_reference = 5;
}
// The parameters to be bound to the type class, if the type class is
// parameterizable.
repeated Type.Parameter type_parameters = 3;
// a user-defined literal can be encoded in one of two ways
oneof val {
// the value of the literal, serialized using some type-specific protobuf message
google.protobuf.Any value = 2;
// the value of the literal, serialized using the structure definition in its declaration
Literal.Struct struct = 4;
}
}
}Back to your comment, switching this member variable to be Expression.StructLiteral would amount to embedding one literal proto inside of another. For example, the proto Struct doesn't actually have a nullable field. But the POJO Expression.StructLiteral does have a nullable field because it inherits from Literal.
This doesn't mean that we couldn't replace the member variable as you suggest, but if we did do that, we would be carrying around extra meaningless data, which I think is more confusing ultimately.
| * User-defined literal with value encoded as {@code Literal.Struct}. | ||
| * | ||
| * <p>This encoding uses a structured list of fields to represent the literal value. | ||
| */ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The docstring feels a little inconsistent. You have
literal with value encoded as {@code Literal.Struct}
but in the class the values are encoded as a List<Literal>, and the second part of the docstring is consistent with that.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I guess that this could be made less confusing. The intention was to mean the proto Literal.Struct, no some java class. What do you think about just altering the message to say
User-defined literal with value encoded via the proto message {@code Literal.Struct}.
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I altered it to instead link to the actual proto. Let me know what you think! I also did the same for the any case.
| @Override | ||
| public abstract List<io.substrait.type.Type.Parameter> typeParameters(); | ||
|
|
||
| public abstract com.google.protobuf.Any value(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Release Notes
Capturing the value as an Any instead of a ByteString does feel nicer ✨
| * @see UserDefinedAnyLiteral | ||
| * @see UserDefinedStructLiteral | ||
| */ | ||
| interface UserDefinedLiteral extends Literal { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Release Notes
We should call out that we don't construct UserDefinedLiterals anymore.
| * parameters (like the {@code 10} in {@code VARCHAR<10>}). This interface provides a type-safe | ||
| * representation of all possible parameter kinds. | ||
| */ | ||
| interface Parameter {} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is the function of https://github.com/substrait-io/substrait-java/blob/main/core/src/main/java/io/substrait/function/ParameterizedType.java applicable here? It feels vaguely related.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
🤔 yes that is an interesting point. Looking into it!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So as I understand it, ParameterizedType.java is used for representing abstract types with parameters in yaml files. Where as the Parameter above being introduced is actually a concrete argument passed into the type.
So for example, List<any1> could be a ParameterizedType, whereas List<int32> is a type with parameters [int32].
| @Value.Immutable | ||
| abstract class ParameterBooleanValue implements Parameter { | ||
| public abstract boolean value(); | ||
| } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is there a usecase for having boolean literal type parameters? I can't imagine a usecase where something like MySpecialType<false> is something you would need.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
🤷 just wanted to be consistent with the spec
| @Value.Immutable | ||
| abstract class ParameterEnumValue implements Parameter { | ||
| public abstract String value(); | ||
| } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What would this type be used for?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As discussed offline, these were inspired by this portion of the simple extension schema:
type_param_defs: # an array of compound type parameter definitions
type: array
items:
...
properties:
...
type: # expected metatype for the parameter
type: string
enum:
- dataType
- boolean
- integer
- enumeration
- stringSo while I don't understand the usage of it, I thought it was best to include all of them for consistency with the spec.
| @Value.Immutable | ||
| abstract class ParameterStringValue implements Parameter { | ||
| public abstract String value(); | ||
| } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What would this type be used for?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
fea5921 to
77f0bd7
Compare
3814053 to
4b4f5ee
Compare
b2742da to
3732cf9
Compare
|
FYI, I have a WIP PR for implementing this in Isthmus but I split it into two PRs because the code there is more complicated. This keeps the PRs a bit smaller! |
|
ACTION NEEDED Substrait follows the Conventional Commits The PR title and description are used as the merge commit message. Please update your PR title and description to match the specification. |
core/src/main/java/io/substrait/expression/ExpressionCreator.java
Outdated
Show resolved
Hide resolved
core/src/main/java/io/substrait/expression/ExpressionCreator.java
Outdated
Show resolved
Hide resolved
| @Override | ||
| public abstract List<io.substrait.type.Type.Parameter> typeParameters(); | ||
|
|
||
| public abstract com.google.protobuf.Any value(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
One observation/remark I would have here is that in other places we have been trying to not expose protobuf objects in the Substrait core POJO API like e.g. in AdvancedExtension we provide the empty Optimization and Enhancement interfaces and ask users of the Substrait Java SDK to implement protobuf conversion logic for those instead. At first glance it feels like we would be a little inconsistent if for user defined any literals we expose the protobuf any value directly.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hey Niels, this was a great point. I just pushed a commit addressing this by introducing the interface UserDefinedAnyValue. I'll leave the PR open for a little while to give you the opportunity to provide feedback on that component if you'd like. Otherwise, I'll merge it in later today. Thanks for the review!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actually, after implementing the abstraction pattern similar to AdvancedExtension, I think we should actually defer this change. The abstraction adds an extra layer (UserDefinedAnyValue) without clear benefits. Users still need to work with com.google.protobuf.Any directly to construct values, and the indirection makes the API harder to understand. Since you mentioned this could be revisited later, let's keep it simple for now and use Any directly as the spec defines.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, on that point I also wasn't sure how much additional value the additional layer of abstraction would bring.
| * Explains the Sustrait relation | ||
| * | ||
| * @param plan Subsrait relation | ||
| * @param rel Subsrait relation |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
side note: this fix is also in #642
nielspardon
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
I'm fine with the changes. Just not sure whether we want to revisit exposing the Any proto value directly in a follow-up.
Co-authored-by: Niels Pardon <[email protected]>
This reverts commit 76cd0f0.
Support both opaque (google.protobuf.Any) and
structured (Literal.Struct) encodings for user-defined type literals per Substrait spec.
Closes #611