-
Notifications
You must be signed in to change notification settings - Fork 190
feat: add simple extension for JSON #887
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
benbellick
commented
Nov 11, 2025
- introduce a simple extension for representing JSON (introduce a standard JSON type #879)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
3cb0673 to
b6e2d25
Compare
b6e2d25 to
6c6a85e
Compare
| - args: | ||
| - name: json_value | ||
| value: u!json | ||
| return: string |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can this return null somehow? Also do we want to have some sort of formatting options like indentation and etc.?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These are good points, though I am unsure of the answer. On the one hand, I can see a case for assuming that values saved as JSON are already valid, but on the other hand, this may not actually be the case.
As for formatting options, that is definitely a good idea. I'm not sure what the right approach is to cover many general use cases without more research.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just checking whether you have thought about it. :smile Finding an intersection across popular implementations would be an interesting survey but I think we may end up with some sort of property bag that we define some standard options and the undefined are up to the implementations... Something along that line. 100% agree that no need to add at this point.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
IMO it might make sense to not include this in a first pass. On one hand to_string is probably a very common operation, but also I don't know if there is a standard way to stringify JSON that is consistent across engines. Not to say it's not worth trying to define this, but if the intent is start small and expand, this doesn't feel like a small function to start with.
| types: | ||
| - name: json | ||
| structure: | ||
| content: string |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do you envision extending this into something like variant?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't have a ton of experience with variants, only heard about them for the first time during the last sync. Would it be a better approach to wait to handle variants first?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I can see that the actual systems will map it their own internal representations -- simply use this as a tag. So content is somewhat meaningless.
BTW, what's the difference between to_string(json_value) and json_value.content? Are they the same?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The way I understand it, structure is just a means by which to communicate literals in a plan (besides a custom any representation). So I think that json_value.content isn't a meaningful operation from a substrait perspective, even though it is would be identical to to_string(json_value) in the case of literals present in plans. But if we have a table of type NSTRUCT<index::i32, data::JSON> called table, then we can do:
SELECT to_string(data) FROM tableBut in this literal-less context, json_value.content doesn't mean anything. Let me know if I misunderstood anything :)
EpsilonPrime
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So the purpose of this PR is to create a JSON type and some functions to manipulate it.
But it does not try to tackle ingestion as done here: https://github.com/apache/incubator-gluten/blob/main/gluten-substrait/src/main/resources/substrait/proto/substrait/algebra.proto#L165
| options: | ||
| on_invalid_path: | ||
| description: Controls behavior when the JSONPath expression is syntactically invalid. | ||
| values: [ ERROR, "NULL", UNDEFINED ] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Velox once encountered three different JSON libraries. One actually succeeded despite not checking validity because it only read the part of the JSON that mattered.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Was that the validity of the JSON, or the validity of the JSON path? Though if I think on it, I can imagine both "working". Lazy evaluation of a valid path on malformed JSON that never hits the bad json, and lazy evaluation of an invalid path on JSON that short-circuits the bad parts of the path.
vbarua
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm broadly in favour of introducing this. It doesn't solve the discussion we've had historically about whether we should standardize on json, variant or some combination thereof. That discussion though has usually been in the context of adding JSON as a core type in the protos.
I think that adding JSON support in a YAML extension like this is much lower stakes way to experiment with it. In practical reality, a bunch of systems have a JSON type, it's unlikely to go away anytime soon, and people seem to love putting JSON in their data systems so being able to encode and capture these capabilities should help us bring in more users.
I do suspect that if we go down this path, we will eventually run into portability/compatibility issues around "I want to communicate JSON functions to a system that only has VARIANT" or vice-versa. At that point though, we can hopefully harness the power of a bunch of developers wanting it work for their system / use case to help us figure out a grand unified theory of JSON/VARIANT interop, and it will help to already have a model for JSON operations when we start that.
| - args: | ||
| - name: json_value | ||
| value: u!json | ||
| return: string |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
IMO it might make sense to not include this in a first pass. On one hand to_string is probably a very common operation, but also I don't know if there is a standard way to stringify JSON that is consistent across engines. Not to say it's not worth trying to define this, but if the intent is start small and expand, this doesn't feel like a small function to start with.
| - name: "json_extract" | ||
| description: >- | ||
| Extracts a value from JSON using a JSONPath expression. | ||
| JSONPath expressions should follow RFC 9535 (https://datatracker.ietf.org/doc/html/rfc9535). |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh yay, there's a standard for this, though it appears to be in progress and not final. It does feel like the most neutral path format to pick as I imagine most systems have their own flavours of it, which will likely represent a compatibility challenge much like regex has been.
| on_error: | ||
| description: Controls behavior when input is not valid JSON. | ||
| values: [ ERROR, "NULL" ] | ||
| return: u!json? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Including a ? on the type is redundant when declaring a function with nullability handling of MIRROR (which is the default when it is not declared). The nullability is entirely determined by the nullability of the inputs.
| options: | ||
| on_invalid_path: | ||
| description: Controls behavior when the JSONPath expression is syntactically invalid. | ||
| values: [ ERROR, "NULL", UNDEFINED ] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Was that the validity of the JSON, or the validity of the JSON path? Though if I think on it, I can imagine both "working". Lazy evaluation of a valid path on malformed JSON that never hits the bad json, and lazy evaluation of an invalid path on JSON that short-circuits the bad parts of the path.
| on_path_not_found: | ||
| description: Controls behavior when the path does not exist in the JSON document. | ||
| values: [ ERROR, "NULL" ] | ||
| return: u!json? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The ? is also redundant on here.