Skip to content

Conversation

@benbellick
Copy link
Member

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There are a few more functions I could imagine including, e.g. json_each, json_keys, etc. but I wanted to start with a small set of primitives.

@benbellick benbellick marked this pull request as ready for review November 11, 2025 16:31
@benbellick benbellick marked this pull request as draft November 11, 2025 16:33
@benbellick benbellick marked this pull request as ready for review November 11, 2025 16:41
- args:
- name: json_value
value: u!json
return: string
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can this return null somehow? Also do we want to have some sort of formatting options like indentation and etc.?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These are good points, though I am unsure of the answer. On the one hand, I can see a case for assuming that values saved as JSON are already valid, but on the other hand, this may not actually be the case.

As for formatting options, that is definitely a good idea. I'm not sure what the right approach is to cover many general use cases without more research.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just checking whether you have thought about it. :smile Finding an intersection across popular implementations would be an interesting survey but I think we may end up with some sort of property bag that we define some standard options and the undefined are up to the implementations... Something along that line. 100% agree that no need to add at this point.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IMO it might make sense to not include this in a first pass. On one hand to_string is probably a very common operation, but also I don't know if there is a standard way to stringify JSON that is consistent across engines. Not to say it's not worth trying to define this, but if the intent is start small and expand, this doesn't feel like a small function to start with.

types:
- name: json
structure:
content: string
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you envision extending this into something like variant?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't have a ton of experience with variants, only heard about them for the first time during the last sync. Would it be a better approach to wait to handle variants first?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I can see that the actual systems will map it their own internal representations -- simply use this as a tag. So content is somewhat meaningless.

BTW, what's the difference between to_string(json_value) and json_value.content? Are they the same?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The way I understand it, structure is just a means by which to communicate literals in a plan (besides a custom any representation). So I think that json_value.content isn't a meaningful operation from a substrait perspective, even though it is would be identical to to_string(json_value) in the case of literals present in plans. But if we have a table of type NSTRUCT<index::i32, data::JSON> called table, then we can do:

SELECT to_string(data) FROM table

But in this literal-less context, json_value.content doesn't mean anything. Let me know if I misunderstood anything :)

Copy link
Member

@EpsilonPrime EpsilonPrime left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So the purpose of this PR is to create a JSON type and some functions to manipulate it.

But it does not try to tackle ingestion as done here: https://github.com/apache/incubator-gluten/blob/main/gluten-substrait/src/main/resources/substrait/proto/substrait/algebra.proto#L165

options:
on_invalid_path:
description: Controls behavior when the JSONPath expression is syntactically invalid.
values: [ ERROR, "NULL", UNDEFINED ]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Velox once encountered three different JSON libraries. One actually succeeded despite not checking validity because it only read the part of the JSON that mattered.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Was that the validity of the JSON, or the validity of the JSON path? Though if I think on it, I can imagine both "working". Lazy evaluation of a valid path on malformed JSON that never hits the bad json, and lazy evaluation of an invalid path on JSON that short-circuits the bad parts of the path.

Copy link
Member

@vbarua vbarua left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm broadly in favour of introducing this. It doesn't solve the discussion we've had historically about whether we should standardize on json, variant or some combination thereof. That discussion though has usually been in the context of adding JSON as a core type in the protos.

I think that adding JSON support in a YAML extension like this is much lower stakes way to experiment with it. In practical reality, a bunch of systems have a JSON type, it's unlikely to go away anytime soon, and people seem to love putting JSON in their data systems so being able to encode and capture these capabilities should help us bring in more users.

I do suspect that if we go down this path, we will eventually run into portability/compatibility issues around "I want to communicate JSON functions to a system that only has VARIANT" or vice-versa. At that point though, we can hopefully harness the power of a bunch of developers wanting it work for their system / use case to help us figure out a grand unified theory of JSON/VARIANT interop, and it will help to already have a model for JSON operations when we start that.

- args:
- name: json_value
value: u!json
return: string
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IMO it might make sense to not include this in a first pass. On one hand to_string is probably a very common operation, but also I don't know if there is a standard way to stringify JSON that is consistent across engines. Not to say it's not worth trying to define this, but if the intent is start small and expand, this doesn't feel like a small function to start with.

- name: "json_extract"
description: >-
Extracts a value from JSON using a JSONPath expression.
JSONPath expressions should follow RFC 9535 (https://datatracker.ietf.org/doc/html/rfc9535).
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh yay, there's a standard for this, though it appears to be in progress and not final. It does feel like the most neutral path format to pick as I imagine most systems have their own flavours of it, which will likely represent a compatibility challenge much like regex has been.

on_error:
description: Controls behavior when input is not valid JSON.
values: [ ERROR, "NULL" ]
return: u!json?
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Including a ? on the type is redundant when declaring a function with nullability handling of MIRROR (which is the default when it is not declared). The nullability is entirely determined by the nullability of the inputs.

options:
on_invalid_path:
description: Controls behavior when the JSONPath expression is syntactically invalid.
values: [ ERROR, "NULL", UNDEFINED ]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Was that the validity of the JSON, or the validity of the JSON path? Though if I think on it, I can imagine both "working". Lazy evaluation of a valid path on malformed JSON that never hits the bad json, and lazy evaluation of an invalid path on JSON that short-circuits the bad parts of the path.

on_path_not_found:
description: Controls behavior when the path does not exist in the JSON document.
values: [ ERROR, "NULL" ]
return: u!json?
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The ? is also redundant on here.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants