61 changes: 61 additions & 0 deletions extensions/json.yaml
Member Author

There are a few more functions I could imagine including (e.g. json_each, json_keys), but I wanted to start with a small set of primitives.

@@ -0,0 +1,61 @@
%YAML 1.2
---
urn: extension:io.substrait:json
types:
  - name: json
    structure:
      content: string
Contributor

Do you envision extending this into something like variant?

Member Author

I don't have much experience with variants; I only heard about them for the first time during the last sync. Would it be better to hold off on this until variants are handled?

Contributor

I can see that actual systems will map it to their own internal representations and simply use this as a tag, so content is somewhat meaningless.

BTW, what's the difference between to_string(json_value) and json_value.content? Are they the same?

Member Author

The way I understand it, structure is just a means of communicating literals in a plan (besides a custom any representation). So I think json_value.content isn't a meaningful operation from a Substrait perspective, even though it would be identical to to_string(json_value) for literals present in plans. But if we have a table of type NSTRUCT<index::i32, data::JSON> called table, then we can do:

SELECT to_string(data) FROM table

But in this literal-less context, json_value.content doesn't mean anything. Let me know if I misunderstood anything :)

    description: >-
      A JSON type representing arbitrary JSON values (objects, arrays,
      strings, numbers, booleans, or null).

scalar_functions:
  - name: "parse_json"
    description: >-
      Parse a JSON string into a JSON value.
    impls:
      - args:
          - name: json_string
            value: string
        options:
          on_error:
            description: Controls behavior when input is not valid JSON
            values: [ ERROR, "NULL" ]
        return: u!json?
Member

Including a ? on the type is redundant when declaring a function with nullability handling of MIRROR (which is the default when it is not declared). The nullability is entirely determined by the nullability of the inputs.
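
For readers less familiar with MIRROR, here is a minimal Python sketch of how an engine might evaluate parse_json. It is an illustration only, not Substrait or any engine's API: `JsonValue`, `OnError`, and this `parse_json` are hypothetical names, and `json.loads` stands in for a real parser. Python's `None` plays the role of SQL NULL, so parsed values are wrapped to keep a JSON `null` distinct from it.

```python
import json
from dataclasses import dataclass
from enum import Enum
from typing import Any, Optional


@dataclass(frozen=True)
class JsonValue:
    """Hypothetical wrapper so that JSON null differs from SQL NULL (None)."""
    value: Any


class OnError(Enum):
    """Hypothetical mirror of the on_error option values in the YAML."""
    ERROR = "ERROR"
    NULL = "NULL"


def parse_json(json_string: Optional[str],
               on_error: OnError = OnError.ERROR) -> Optional[JsonValue]:
    # MIRROR-style nullability: a null input produces a null output.
    if json_string is None:
        return None
    try:
        return JsonValue(json.loads(json_string))
    except ValueError:
        # Invalid input: behavior is chosen by the on_error option.
        if on_error is OnError.NULL:
            return None
        raise
```

The sketch has two separate sources of null: a null input, and an invalid input under on_error = NULL. How the second case should interact with the declared nullability is not settled in this thread.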


  - name: "to_string"
    description: "Convert a JSON value to its string representation"
    impls:
      - args:
          - name: json_value
            value: u!json
        return: string
Contributor

Can this return null somehow? Also, do we want to have some sort of formatting options, such as indentation?

Member Author

These are good points, though I am unsure of the answer. On the one hand, I can see a case for assuming that values saved as JSON are already valid, but on the other hand, this may not actually be the case.

As for formatting options, that is definitely a good idea. I'm not sure what the right approach is to cover many general use cases without more research.

Contributor

Just checking whether you have thought about it. :smile: Finding an intersection across popular implementations would be an interesting survey, but I think we may end up with some sort of property bag in which we define a few standard options and leave the rest up to the implementations, or something along those lines. 100% agree that there is no need to add this at this point.

Member

IMO it might make sense not to include this in a first pass. On one hand to_string is probably a very common operation, but I also don't know whether there is a standard way to stringify JSON that is consistent across engines. That's not to say it isn't worth trying to define this, but if the intent is to start small and expand, this doesn't feel like a small function to start with.
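
As a small illustration of why stringification is hard to pin down, here is how a single library (Python's standard `json` module, used purely as an example and not as a proposed reference behavior) can produce several different strings for the same value, depending on options:

```python
import json

doc = {"b": [1, 2], "a": None}

# Same JSON value, four different textual forms.
default = json.dumps(doc)                          # '{"b": [1, 2], "a": null}'
compact = json.dumps(doc, separators=(",", ":"))   # '{"b":[1,2],"a":null}'
sorted_keys = json.dumps(doc, sort_keys=True)      # keys reordered: "a" before "b"
indented = json.dumps(doc, indent=2)               # multi-line, pretty-printed

# All of them parse back to the same value.
round_trips = [json.loads(s) for s in (default, compact, sorted_keys, indented)]
```

Whitespace, key order, and indentation all vary here, and each is a candidate for the kind of formatting option discussed above.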


  - name: "json_extract"
    description: >-
      Extract a value from JSON using a JSONPath expression.
      JSONPath expressions should follow RFC 9535 (https://datatracker.ietf.org/doc/html/rfc9535).
Member

Oh yay, there's a standard for this, though it appears to be in progress and not final. It does feel like the most neutral path format to pick as I imagine most systems have their own flavours of it, which will likely represent a compatibility challenge much like regex has been.

    impls:
      - args:
          - name: json_value
            value: u!json
          - name: path
            value: string
        options:
          on_invalid_path:
            description: Controls behavior when the JSONPath expression is syntactically invalid
            values: [ ERROR, "NULL", UNDEFINED ]
Member

Velox once encountered three different JSON libraries. One actually succeeded despite not checking validity because it only read the part of the JSON that mattered.

Member

Was that the validity of the JSON, or the validity of the JSON path? Though if I think on it, I can imagine both "working": lazy evaluation of a valid path on malformed JSON that never hits the bad JSON, and lazy evaluation of an invalid path on JSON that short-circuits before reaching the bad parts of the path.
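
The first case is easy to reproduce with a reader that stops as soon as it has what it needs. The sketch below uses Python's standard `json` module as a stand-in (it is not one of the libraries Velox encountered): `json.loads` validates the whole input, while `JSONDecoder.raw_decode` reads one value from the front and never looks at the rest.

```python
import json

# A valid JSON object followed by text that makes the document invalid.
text = '{"a": {"b": 1}} trailing garbage'

# Strict reader: the whole input must be a single JSON document.
try:
    json.loads(text)
    strict_ok = True
except ValueError:
    strict_ok = False

# Lenient reader: decode one value from the start and stop there.
value, end = json.JSONDecoder().raw_decode(text)
extracted = value["a"]["b"]   # succeeds even though the document is invalid
unread = text[end:]           # the part the lenient reader never examined
```

Two engines built on these two readers would disagree on whether the same input is an error, which is the compatibility risk the options above try to make explicit.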

          on_path_not_found:
            description: Controls behavior when the path does not exist in the JSON document
            values: [ ERROR, "NULL" ]
        return: u!json?
Member

The ? is also redundant here.
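
To make the two json_extract options concrete, here is a rough Python sketch of an evaluator. It is an assumption-laden toy, not an RFC 9535 implementation: it supports only `$`, `.name`, and `[index]` segments, `OnPathNotFound` and this `json_extract` are hypothetical names, and `None` stands in for SQL NULL (so it cannot be told apart from an extracted JSON null).

```python
import re
from enum import Enum
from typing import Any


class OnPathNotFound(Enum):
    """Hypothetical mirror of the on_path_not_found option values."""
    ERROR = "ERROR"
    NULL = "NULL"


# One path segment: either .name or [index]. Tiny subset of JSONPath.
_SEGMENT = re.compile(r"\.([A-Za-z_][A-Za-z0-9_]*)|\[(\d+)\]")


def json_extract(value: Any, path: str,
                 on_path_not_found: OnPathNotFound = OnPathNotFound.NULL) -> Any:
    if not path.startswith("$"):
        raise ValueError(f"invalid path: {path!r}")
    pos, current = 1, value
    while pos < len(path):
        # Segments are parsed lazily, one at a time, as the walk proceeds.
        match = _SEGMENT.match(path, pos)
        if match is None:
            raise ValueError(f"invalid path: {path!r}")
        name, index = match.group(1), match.group(2)
        try:
            if name is not None:
                if not isinstance(current, dict):
                    raise KeyError(name)
                current = current[name]
            else:
                if not isinstance(current, list):
                    raise IndexError(index)
                current = current[int(index)]
        except (KeyError, IndexError):
            if on_path_not_found is OnPathNotFound.NULL:
                return None
            raise LookupError(f"path not found: {path!r}")
        pos = match.end()
    return current
```

Because this toy parses the path lazily, `json_extract(doc, "$.x.!!bad")` returns null when `x` is missing and never reports the invalid tail. An evaluator that validated the whole path up front would report an invalid path instead, which is the kind of divergence on_invalid_path has to account for.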


  - name: "is_json_valid"
    description: >-
      Returns true if the input string is valid JSON, false otherwise.
      This function does not parse the JSON, only validates syntax.
    impls:
      - args:
          - name: json_string
            value: string
        return: boolean
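
A minimal Python sketch of the intended is_json_valid behavior, under the assumption that "valid" means strict JSON syntax. The function name matches the YAML but the implementation is hypothetical; unlike the description above, this version does build the parsed value and simply discards it, whereas a real validator could check syntax without materializing anything.

```python
import json


def _reject_constant(name: str):
    # Python's json module accepts NaN, Infinity, and -Infinity by default,
    # which strict JSON does not allow, so reject them explicitly.
    raise ValueError(f"non-standard JSON constant: {name}")


def is_json_valid(json_string: str) -> bool:
    try:
        json.loads(json_string, parse_constant=_reject_constant)
    except ValueError:
        return False
    return True
```

Details such as non-standard constants and duplicate keys are exactly where engines can disagree on validity, so the YAML description may need to name the grammar it means.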