Open
Description
Add native Comet support for Spark's from_json function (JsonToStructs expression), which parses JSON strings into structured data types (StructType, ArrayType, or MapType).
Spark Specification
Syntax
from_json(jsonStr, schema [, options])
Arguments
| Argument | Type | Description |
|---|---|---|
| jsonStr | StringType | Column containing JSON strings to parse |
| schema | DataType, StructType, ArrayType, MapType, or DDL String | Target schema defining the structure of the parsed output |
| options | Map[String, String] (optional) | JSON parsing options |
Return Type
Returns a column matching the provided schema type (struct, array, or map).
Key Options
- mode: PERMISSIVE (default) or FAILFAST
- dateFormat: Format for parsing dates (default: yyyy-MM-dd)
- timestampFormat: Format for parsing timestamps
- columnNameOfCorruptRecord: Field to store malformed records (PERMISSIVE mode)
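As a rough illustration of how these options might be normalized before they are serialized for the native side, here is a minimal Python sketch. `JsonOptions` and `parse_options` are hypothetical names, not Spark or Comet APIs. Only the option names and the two defaults come from the list above; treating the mode value case-insensitively is an assumption.

```python
from dataclasses import dataclass
from typing import Mapping, Optional

# Hypothetical container for the options listed above (not a Spark or Comet type).
@dataclass(frozen=True)
class JsonOptions:
    mode: str = "PERMISSIVE"                 # default per the list above
    date_format: str = "yyyy-MM-dd"          # default per the list above
    timestamp_format: Optional[str] = None   # default not stated in this issue
    column_name_of_corrupt_record: Optional[str] = None

SUPPORTED_MODES = {"PERMISSIVE", "FAILFAST"}

def parse_options(options: Mapping[str, str]) -> JsonOptions:
    """Build JsonOptions from a from_json-style options map."""
    # Assumption: mode names are matched case-insensitively.
    mode = options.get("mode", "PERMISSIVE").upper()
    if mode not in SUPPORTED_MODES:
        raise ValueError(f"unsupported mode: {mode}")
    return JsonOptions(
        mode=mode,
        date_format=options.get("dateFormat", "yyyy-MM-dd"),
        timestamp_format=options.get("timestampFormat"),
        column_name_of_corrupt_record=options.get("columnNameOfCorruptRecord"),
    )
```

Rejecting unknown modes up front keeps the native code from having to guess at semantics it does not implement.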
Edge Cases
- Null input returns null output
- Missing fields in JSON are set to null
- Empty string input returns null
- PERMISSIVE mode: malformed JSON returns a row with the parseable fields populated
- FAILFAST mode: throws exception on malformed JSON
- Field names are case-sensitive
- Extra fields in JSON not in schema are ignored
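The edge cases above can be captured in a small reference model that is handy when writing compatibility tests. The sketch below uses Python's standard `json` module and a hypothetical `from_json_model` helper; it is not Spark's or Comet's implementation. It handles only struct schemas given as a list of field names and does no type conversion. One deliberate simplification: the standard-library parser cannot recover partially parsed fields, so in PERMISSIVE mode malformed input yields a row with every field null.

```python
import json
from typing import Any, Dict, List, Optional

def from_json_model(json_str: Optional[str], fields: List[str],
                    mode: str = "PERMISSIVE") -> Optional[Dict[str, Any]]:
    """Toy model of the listed edge cases for a struct schema.

    `fields` holds the schema's field names; value types are not checked.
    """
    if json_str is None or json_str == "":
        return None  # null or empty-string input -> null output
    try:
        parsed = json.loads(json_str)
        if not isinstance(parsed, dict):
            raise ValueError("expected a JSON object")
    except ValueError as err:  # json.JSONDecodeError is a ValueError subclass
        if mode == "FAILFAST":
            raise ValueError(f"malformed JSON: {json_str!r}") from err
        # Simplification: no partial recovery, so every field is null.
        return {name: None for name in fields}
    # Case-sensitive lookup: missing fields become None, extra fields are dropped.
    return {name: parsed.get(name) for name in fields}
```

For example, `from_json_model('{"a":1,"extra":2}', ["a", "b"])` drops `extra` and fills the missing `b` with `None`.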
Examples
-- Basic struct parsing
SELECT from_json('{"name":"Alice","age":30}', 'name STRING, age INT');
-- Result: {Alice, 30}
-- Parsing to MapType
SELECT from_json('{"key1":"value1"}', 'MAP<STRING,STRING>');
-- With options
SELECT from_json('{"a":1}', 'a INT', map('mode', 'FAILFAST'));
Implementation Approach
- Scala Serde: Create CometJsonToStructs in spark/src/main/scala/org/apache/comet/serde/
  - Handle schema serialization to protobuf
  - Serialize options map
  - Consider marking complex options as Unsupported initially
- Protobuf: May need new message type in expr.proto for schema and options
- Rust Implementation: Implement JSON parsing in native/spark-expr/src/
  - Use serde_json for parsing
  - Match Spark's null handling behavior
  - Support PERMISSIVE and FAILFAST modes
  - Use
- Potential Simplifications for Initial Implementation:
  - Start with StructType schemas only (defer ArrayType, MapType)
  - Support common options only (mode, dateFormat, timestampFormat)
  - UTC timezone only initially
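The simplifications above amount to a support check that decides whether a given from_json call can run natively or should fall back to Spark. In Comet this logic would presumably live in the Scala serde; the Python sketch below only illustrates the decision order. The `unsupported_reason` function, its parameters, and the message strings are all hypothetical.

```python
from typing import Mapping, Optional

# Options the initial implementation would accept, per the list above.
SUPPORTED_OPTIONS = {"mode", "dateFormat", "timestampFormat"}

def unsupported_reason(schema_kind: str, options: Mapping[str, str],
                       session_time_zone: str) -> Optional[str]:
    """Return why a from_json call should fall back to Spark, or None if supported."""
    if schema_kind != "struct":
        return f"schema type {schema_kind} is not supported yet"
    extra = sorted(set(options) - SUPPORTED_OPTIONS)
    if extra:
        return f"unsupported options: {', '.join(extra)}"
    if session_time_zone != "UTC":
        return f"time zone {session_time_zone} is not supported yet"
    return None
```

Returning a reason string rather than a boolean makes it straightforward to report why a particular query was not accelerated.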
Difficulty
Large - requires protobuf changes, Rust JSON parsing implementation, and careful compatibility testing for edge cases.
References
- Spark documentation: https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.functions.from_json.html
- Contributor guide: https://datafusion.apache.org/comet/contributor-guide/adding_a_new_expression.html
Note: This issue was generated with AI assistance.