Skip to content

Conversation

@wackywendell
Copy link
Contributor

@wackywendell wackywendell commented Sep 15, 2025

Summary

Begins addressing #367 and #342 by adding support for parsing the types in YAML Simple Extension Files into Rustic types - with validity enforced. This includes a string text parser handling built-in types, compound types, named structs, custom types, and validated parameter constraints in the Simple Extension YAML files.

Scope

  • Types-only: no functions or call validation yet.
  • Public API exposes parsed types (ExtensionFile, Registry, CustomType, ConcreteType) and enforces validation of those on creation / read.

Key Changes

  • Type system
    • New BuiltinType, CompoundType, ConcreteType, CustomType with Display/round‑trip support for alias and named‑struct structures.
    • Parameter constraints: data type, integer (with min/max), enum (validated/deduped), boolean, string.
    • Parsing to and from the YAML structures (TryFrom<TypeParamDefsItem>, Parse<RawType>)
  • File/Registry type: abstraction for handling YAML files
  • Context and proto glue
    • Separates out ProtoContext from Context, to distinguish between things needed for Protobuf parsing (ProtoContext)
  • Type expression parser
    • Parses simple, user‑defined (u!Name) and type variables; visits extension references for linkage bookkeeping.
  • Build/CI
    • parse feature includes serde_yaml; include!(extensions.in) is gated behind extensions feature.
    • Aligns actions/checkout to v4, updates Cargo dependency set, and bumps the substrait submodule.

Compatibility Notes

  • New trait bound ProtoContext on proto parsing that previously required only Context.
  • extensions.in now compiled only with features=["extensions"].
  • Minimal, types-only round‑trip implemented; other sections remain empty when converting back to text.

Testing

  • New unit tests cover:
    • Type parsing and round‑trip for alias and named‑struct.
    • Parameter constraint handling including enum validation and integer bounds (with current truncation behavior).
    • Registry creation and type lookup; core registry smoke test behind features=["extensions"].

@wackywendell wackywendell marked this pull request as ready for review October 20, 2025 15:32
Copy link
Member

@benbellick benbellick left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Left a few comments but still haven't gone through most of the PR. Just wanted to flush what I have so far, but I will review more later. Thanks!

Copy link
Member

@benbellick benbellick left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Left a few more comments, but excited to get this in! Sorry that this PR has been sitting so long 😅

Refactored ProtoContext trait (removed), by replacing with a concrete type.
Some other small refactoring simplifications and code cleanup.
Copy link
Member

@benbellick benbellick left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Left a few more comments. Had to run so I didn't get to entirely finish the review but will revisit later. Thanks

/// Sub-second precision digits
precision: i32,
},
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for these wonderful comments 🙏

Add TODO referencing Substrait type proto for unsupported type
parameters not yet implemented.
Use a single type for non-recursive builtin types.
Also change get_simple_extension_urn to just return an option, and update some comments.
Copy link
Member

@benbellick benbellick left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I once again have to run before getting to fully finish everything. I actually feel quite confident in everything except for the types.rs file, which is the one file I just didn't get to finish going through. But everything is looking great!

Thank you for bearing with my (sometimes excessive) comments 😄


#[test]
fn test_user_defined_and_parameters() {
let cases = [
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

awesome tests, these inspire a lot of confidence in the parser.

Copy link
Member

@benbellick benbellick left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Few more comments, thanks!

parsed_type.visit_references(&mut link);
let concrete = ConcreteType::try_from(parsed_type)?;

// Structure representation cannot be nullable
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it necessarily the case that calling parse on RawType is always for parsing structure? If it is, It isn't entirely obvious to me from looking at this code block alone. For example, consider the below excerpt from an extension file:

urn: extension:io.substrait:functions_boolean
scalar_functions:
  -
    name: or
    description: ...
    impls:
      - args:
          - value: boolean?
            name: a
        variadic:
          min: 0
        return: boolean?

would we then later need to call this parse implementation on "boolean?"?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah I misunderstood. The structured representation of a type always means that it is non nullable. Marking this one as resolved. Thanks!

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry, actually unresolving this one. I believe this is incorrect. My read of the schema is that: a string representation can be nullable, but not the object based representation (which is the case below).

Copy link
Contributor Author

@wackywendell wackywendell Dec 15, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, ok, I looked at this more carefully, and I think the check was in the wrong place, and I moved it.

My interpretation was that if a structure is specified, it cannot ever be nullable, whether specified by string or YAML - because the type itself is the non-nullable version, and the nullability should always go outside the type. As in, if I define a type count with structure i64, then count? is the same as i64?; but if I define count: i64?, then... what is count?? Is that a nullable nullable i64?

I may be over-thinking / over-interpreting this, but see the test / its comment in 02227c4.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Repeating the schema definition here:

$defs:
  type:
    oneOf:
      - type: string # any data type
      - type: object # syntactic sugar for a non-nullable named struct

Ah, my interpretation is different. Any function can return any type, so for example we can have a function return STRUCT?<f1:i32?, f2:i64> (a nullable struct). But someone introduced this "structured" representation in yaml files merely as syntactic surgar, so that you can do:

structure: # equivalent to struct<i32?: f2:i64>
  f1: i32?
  f2: i64

But there just isn't syntax available ATM to say whether or not the above struct is nullable. So the convention is that if you optionally choose to use the structured representation, then you just can't have a nullable struct simply because it cannot be expressed. But if you wanted your function to return a nullable struct, that is perfectly valid. You should have to use the string-based representation.

That is my understanding at least 😅

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

After some discussion, I think we have a good plan now:

  1. There is some ambiguity in the spec to clarify; this is now split out in docs: clarify nullable-ness of string representation of UDT structure substrait#920.
  2. In the meantime - I'll remove this check from the code, and put in a comment pointing to this issue; we should clarify the spec before adding enforcement.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done - see latest commit. I removed the relevant test, as it asserted that a nullable type should not be allowed, which it now is. I could have changed it to #[ignore] instead, but it didn't seem worth keeping around.

Replace Parse impls for proto versions with TryFrom, as they don't need a Context.
Applied some good suggested changes from code review.
Use IndexMap for YAML object parsing to keep simple extension struct fields in
declared order and flag duplicates, aligning with Substrait expectations.
Copy link
Member

@benbellick benbellick left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Now just a few more small comments, but I think after this it is good to go! Thanks for your patience 🙇

parsed_type.visit_references(&mut link);
let concrete = ConcreteType::try_from(parsed_type)?;

// Structure representation cannot be nullable
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry, actually unresolving this one. I believe this is incorrect. My read of the schema is that: a string representation can be nullable, but not the object based representation (which is the case below).

Addressing PR feedback by tightening the API and docs.
…arse

Disallow nullable structure representations when parsing SimpleExtensionsTypesItem
so errors surface before conversion, and adjust tests to cover full item parsing.
# Files:
# src/parse/text/simple_extensions/types.rs - reject nullable structure types; add test
# Edit header/body then save/quit to commit
Copy link
Member

@benbellick benbellick left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM!! 🎉 thanks for bearing with all of my reviews :)

//!
//! This module provides registries for Substrait extensions:
//! - **Global Registry**: Immutable, reusable across plans, URI+name based lookup
//! - **Local Registry**: Per-plan, anchor-based, references Global Registry (TODO)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit (and probably opinionated): I wonder if it is worth dropping the line about Local Registry? I could imagine this being a different struct/file in the future. But I'm not bothered by just leaving it in.

@benbellick benbellick merged commit f097b31 into substrait-io:main Dec 18, 2025
12 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants