@benbellick
Member

@benbellick benbellick commented Nov 21, 2025

Support both opaque (google.protobuf.Any) and structured (Literal.Struct) encodings for user-defined type literals per Substrait spec.

  • Split UserDefinedLiteral into UserDefinedAny and UserDefinedStruct
  • Move type parameters to interface level for parameterized types
  • Add first-class POJO representation for type parameters
  • Test coverage including roundtrip tests
  • Throw exception on unhandled struct-based representation in isthmus
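
The split described in the list above can be pictured roughly as follows. This is a minimal, dependency-free sketch and not the substrait-java API: the names `UdtLiteral`, `AnyEncoded`, and `StructEncoded` are hypothetical, a `byte[]` stands in for the `google.protobuf.Any` payload, a `List<Object>` stands in for `Literal.Struct` fields, and type parameters are reduced to strings.

```java
import java.util.List;

/** Illustrative only: simplified stand-ins, not the real substrait-java classes. */
final class UdtLiteralSketch {

  /** Shared surface: both encodings carry the type identity and its type parameters. */
  interface UdtLiteral {
    String urn();

    String name();

    List<String> typeParameters(); // assumption: real parameters are richer than strings
  }

  /** Opaque encoding: an uninterpreted blob (stand-in for google.protobuf.Any). */
  record AnyEncoded(
      String urn, String name, List<String> typeParameters, String typeUrl, byte[] payload)
      implements UdtLiteral {}

  /** Structured encoding: a list of field values (stand-in for Literal.Struct). */
  record StructEncoded(String urn, String name, List<String> typeParameters, List<Object> fields)
      implements UdtLiteral {}

  /** Consumers branch on the encoding and may reject one they do not handle. */
  static String describe(UdtLiteral lit) {
    if (lit instanceof AnyEncoded a) {
      return lit.name() + " as opaque " + a.typeUrl() + " (" + a.payload().length + " bytes)";
    }
    if (lit instanceof StructEncoded s) {
      return lit.name() + " as struct with " + s.fields().size() + " fields";
    }
    throw new UnsupportedOperationException("Unknown encoding: " + lit.getClass());
  }

  public static void main(String[] args) {
    UdtLiteral opaque =
        new AnyEncoded(
            "urn:example:types", "point", List.of(), "type.example/Point", new byte[] {1, 2, 3});
    UdtLiteral structured =
        new StructEncoded("urn:example:types", "point", List.of(), List.<Object>of(1, 2));
    System.out.println(describe(opaque));
    System.out.println(describe(structured));
  }
}
```

Keeping the type parameters on the shared interface mirrors the "interface level" bullet: code that only needs the type identity does not have to know which encoding it was given.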

Closes #611

@benbellick benbellick force-pushed the benbellick/handle-structured-udt2 branch from 133b704 to e4f36ea Compare November 21, 2025 19:08
@benbellick benbellick changed the title from "Benbellick/handle structured udt2" to "handle struct-based UDT literals in core" Nov 21, 2025
@benbellick benbellick force-pushed the benbellick/handle-structured-udt2 branch from e4f36ea to cef9798 Compare November 21, 2025 19:16
@benbellick benbellick force-pushed the benbellick/handle-structured-udt2 branch from cef9798 to f3379c1 Compare November 21, 2025 19:51
@benbellick
Member Author

FYI, this PR makes no attempt to validate whether the struct representation provided matches the definition in the yaml file. I think validating it is the right thing to do, but it turned out to be slightly more complicated than I thought, as it involves threading an ExtensionCollection through the codebase a bit. Thus, I left it for another issue (#614).
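
For context, the kind of check deferred to #614 might look roughly like the sketch below. It is not taken from the PR: `TypeDefinition` and `matches` are hypothetical names, and field types are reduced to strings. In the real codebase the definition would presumably have to be looked up through the ExtensionCollection mentioned above, which is the part that needs threading through.

```java
import java.util.List;

/** Illustrative only: none of these names exist in substrait-java. */
final class StructValidationSketch {

  /** Hypothetical view of one YAML type definition: declared field types, in order. */
  record TypeDefinition(String name, List<String> fieldTypes) {}

  /** True when the literal has the declared number of fields and each type matches by position. */
  static boolean matches(TypeDefinition definition, List<String> literalFieldTypes) {
    if (definition.fieldTypes().size() != literalFieldTypes.size()) {
      return false;
    }
    for (int i = 0; i < literalFieldTypes.size(); i++) {
      if (!definition.fieldTypes().get(i).equals(literalFieldTypes.get(i))) {
        return false;
      }
    }
    return true;
  }

  public static void main(String[] args) {
    TypeDefinition point = new TypeDefinition("point", List.of("i32", "i32"));
    System.out.println(matches(point, List.of("i32", "i32"))); // prints true
    System.out.println(matches(point, List.of("i32", "string"))); // prints false
  }
}
```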

@benbellick benbellick marked this pull request as ready for review November 21, 2025 20:01
@benbellick benbellick requested a review from vbarua November 21, 2025 20:01
Support both opaque (google.protobuf.Any) and structured (Literal.Struct) encodings for user-defined type literals per Substrait spec.

- Split UserDefinedLiteral into UserDefinedAny and UserDefinedStruct
- Move type parameters to interface level for parameterized types
- Comprehensive test coverage including roundtrip tests
- Throw exception on unhandled struct-based representation in isthmus
@benbellick benbellick force-pushed the benbellick/handle-structured-udt2 branch from 1eb4fb4 to f5b6341 Compare November 24, 2025 16:19
    extensionCollector.getTypeReference(SimpleExtension.TypeAnchor.of(expr.urn(), expr.name()));
return lit(
    bldr -> {
      try {
Member Author

This exception doesn't happen anymore because we don't parse the Any here. Instead, we have a reference to the pre-parsed proto directly.

public ParameterizedType userDefined(
    int ref, java.util.List<io.substrait.type.Type.Parameter> typeParameters) {
  throw new UnsupportedOperationException(
      "User defined types are not supported in Parameterized Types for now");
Member Author

This is consistent with the above, where we don't yet support ParameterizedType conversion.

public DerivationExpression userDefined(
    int ref, java.util.List<io.substrait.type.Type.Parameter> typeParameters) {
  throw new UnsupportedOperationException(
      "User defined types are not supported in Derivation Expressions for now");
Member Author

This is also consistent with the above.

public RexNode visit(Expression.UserDefinedStructLiteral expr, Context context)
    throws RuntimeException {
  throw new UnsupportedOperationException(
      "UserDefinedStructLiteral representation is not yet supported in Isthmus");
Member Author

This is #612

@benbellick benbellick force-pushed the benbellick/handle-structured-udt2 branch from 78488b4 to e8cb862 Compare November 25, 2025 19:27
Member

@vbarua vbarua left a comment

Started taking a look at this and left some comments. Mostly looks reasonable. Want to come back and look at your tests with fresh eyes, and also think about parameterized types with a fresh 🧠.

@Override
public abstract List<io.substrait.type.Type.Parameter> typeParameters();

public abstract com.google.protobuf.Any value();
Member

Release Notes
Capturing the value as an Any instead of a ByteString does feel nicer ✨

* @see UserDefinedAnyLiteral
* @see UserDefinedStructLiteral
*/
interface UserDefinedLiteral extends Literal {
Member

Release Notes
We should call out that we don't construct UserDefinedLiterals anymore.

* parameters (like the {@code 10} in {@code VARCHAR<10>}). This interface provides a type-safe
* representation of all possible parameter kinds.
*/
interface Parameter {}
Member

Member Author

🤔 yes that is an interesting point. Looking into it!

Member Author

So as I understand it, ParameterizedType.java is used for representing abstract types with parameters in yaml files, whereas the Parameter being introduced above is actually a concrete argument passed into the type.

So for example, List<any1> could be a ParameterizedType, whereas List<int32> is a type with parameters [int32].
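
The distinction can be sketched as a template with placeholders versus a concrete type whose parameter list holds actual arguments. The names `TypeTemplate`, `ConcreteType`, and `bind` are hypothetical and are not substrait-java classes; parameters are reduced to strings for brevity.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

/** Illustrative only: simplified stand-ins, not the real substrait-java classes. */
final class ParameterSketch {

  /** Abstract signature as it might appear in an extension YAML, e.g. List<any1>. */
  record TypeTemplate(String name, List<String> placeholders) {}

  /** Concrete type as it appears in a plan, e.g. List<int32>. */
  record ConcreteType(String name, List<String> parameters) {}

  /** Substitutes each placeholder with the concrete argument bound to it. */
  static ConcreteType bind(TypeTemplate template, Map<String, String> bindings) {
    List<String> args = new ArrayList<>();
    for (String placeholder : template.placeholders()) {
      String arg = bindings.get(placeholder);
      if (arg == null) {
        throw new IllegalArgumentException("No binding for " + placeholder);
      }
      args.add(arg);
    }
    return new ConcreteType(template.name(), List.copyOf(args));
  }

  public static void main(String[] args) {
    TypeTemplate listOfAny = new TypeTemplate("List", List.of("any1"));
    ConcreteType listOfInt = bind(listOfAny, Map.of("any1", "int32"));
    System.out.println(listOfInt.parameters()); // prints [int32]
  }
}
```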

@benbellick benbellick force-pushed the benbellick/handle-structured-udt2 branch from fea5921 to 77f0bd7 Compare November 26, 2025 18:14
@benbellick benbellick force-pushed the benbellick/handle-structured-udt2 branch from 3814053 to 4b4f5ee Compare November 26, 2025 18:26
@benbellick
Member Author

FYI, I have a WIP PR for implementing this in Isthmus but I split it into two PRs because the code there is more complicated. This keeps the PRs a bit smaller!

@github-actions

github-actions bot commented Jan 5, 2026

ACTION NEEDED

Substrait follows the Conventional Commits specification for release automation.

The PR title and description are used as the merge commit message. Please update your PR title and description to match the specification.

@benbellick benbellick changed the title from "handle struct-based UDT literals in core" to "feat: handle struct-based UDT literals in core" Jan 5, 2026
@Override
public abstract List<io.substrait.type.Type.Parameter> typeParameters();

public abstract com.google.protobuf.Any value();
Member

One observation I would have here is that in other places we have been trying not to expose protobuf objects in the Substrait core POJO API. For example, in AdvancedExtension we provide the empty Optimization and Enhancement interfaces and ask users of the Substrait Java SDK to implement the protobuf conversion logic for those instead. At first glance it feels a little inconsistent to expose the protobuf Any value directly for user-defined Any literals.

Member Author

Hey Niels, this was a great point. I just pushed a commit addressing this by introducing the interface UserDefinedAnyValue. I'll leave the PR open for a little while to give you the opportunity to provide feedback on that component if you'd like. Otherwise, I'll merge it in later today. Thanks for the review!

Member Author

@benbellick benbellick Jan 5, 2026

Actually, after implementing the abstraction pattern similar to AdvancedExtension, I think we should defer the change you are suggesting. The abstraction adds an extra layer (UserDefinedAnyValue) without clear benefits. Users still need to work with com.google.protobuf.Any directly to construct values, and the indirection makes the API harder to understand. Since you mentioned this could be revisited later, let's keep it simple for now and use Any directly as the spec defines.

Member

Yes, on that point I also wasn't sure how much additional value the additional layer of abstraction would bring.
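
The trade-off discussed in this thread can be illustrated with a small sketch. It is not the code from the PR: `FakeAny` is a stand-in for `com.google.protobuf.Any`, and the shape of the `AnyValue` interface is a guess at what a wrapper such as UserDefinedAnyValue could look like.

```java
import java.util.Arrays;

/** Illustrative only: simplified stand-ins, not the real substrait-java classes. */
final class AnyValueApiSketch {

  /** Stand-in for com.google.protobuf.Any: a type URL plus serialized bytes. */
  record FakeAny(String typeUrl, byte[] bytes) {}

  /** Option A (what the thread settled on): the literal exposes the Any-like value directly. */
  record DirectLiteral(FakeAny value) {}

  /** Option B (deferred): an extra interface the user implements to produce the Any. */
  interface AnyValue {
    FakeAny toProto();
  }

  record WrappedLiteral(AnyValue value) {}

  public static void main(String[] args) {
    FakeAny payload = new FakeAny("type.example/Point", new byte[] {1, 2});

    DirectLiteral direct = new DirectLiteral(payload);
    // Under option B the caller still builds the same payload, then wraps it.
    WrappedLiteral wrapped = new WrappedLiteral(() -> payload);

    boolean samePayload =
        Arrays.equals(direct.value().bytes(), wrapped.value().toProto().bytes());
    System.out.println(samePayload); // prints true
  }
}
```

In this sketch the wrapper adds a hop without removing the need to construct the payload, which is the concern raised above.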

* Explains the Sustrait relation
*
* @param plan Subsrait relation
* @param rel Subsrait relation
Member

side note: this fix is also in #642

Member

@nielspardon nielspardon left a comment

LGTM

I'm fine with the changes. Just not sure whether we want to revisit exposing the Any proto value directly in a follow-up.

@vbarua vbarua changed the title from "feat: handle struct-based UDT literals in core" to "feat(core): handle struct-based UDT literals" Jan 13, 2026
@benbellick benbellick merged commit 13309df into main Jan 13, 2026
13 checks passed
@benbellick benbellick deleted the benbellick/handle-structured-udt2 branch January 13, 2026 19:37