Add support for from_json (JsonToStructs) expression #3203

@andygrove

Description

Add native Comet support for Spark's from_json function (JsonToStructs expression), which parses JSON strings into structured data types (StructType, ArrayType, or MapType).

Spark Specification

Syntax

from_json(jsonStr, schema [, options])

Arguments

Argument | Type | Description
jsonStr | StringType | Column containing JSON strings to parse
schema | DataType, StructType, ArrayType, MapType, or DDL String | Target schema defining the structure of the parsed output
options | Map[String, String] (optional) | JSON parsing options

Return Type

Returns a column matching the provided schema type (struct, array, or map).

Key Options

  • mode: PERMISSIVE (default) or FAILFAST
  • dateFormat: Format for parsing dates (default: yyyy-MM-dd)
  • timestampFormat: Format for parsing timestamps
  • columnNameOfCorruptRecord: Field to store malformed records (PERMISSIVE mode)

Edge Cases

  • Null input returns null output
  • Missing fields in JSON are set to null
  • Empty string input returns null
  • PERMISSIVE mode: malformed JSON returns a row with the parseable fields populated and the remaining fields set to null
  • FAILFAST mode: throws an exception on malformed JSON
  • Field names are case-sensitive
  • Extra fields in JSON not in schema are ignored
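
The edge cases above can be sketched as a small reference model in Rust. This is only an illustration under stated assumptions, not Comet code: the names Row, ParseMode, is_null_result, project_fields and finish_row are hypothetical, the JSON object is assumed to be already tokenized into top-level key/value pairs, and all values are kept as strings for brevity.

```rust
/// One output row: one optional value per schema field (None models SQL NULL).
/// Hypothetical type for illustration only.
type Row = Vec<Option<String>>;

/// The two parse modes named in this issue.
#[derive(Debug, Clone, Copy, PartialEq)]
enum ParseMode {
    Permissive,
    FailFast,
}

/// Null input and empty-string input both produce a null result,
/// so the parser never needs to run for them.
fn is_null_result(input: Option<&str>) -> bool {
    input.map_or(true, |s| s.is_empty())
}

/// Project top-level JSON fields onto the schema's field names.
/// Assumption: `parsed` holds key/value pairs that a real JSON parser
/// has already produced. Matching is case-sensitive, fields missing from
/// the JSON become None, and fields not in the schema are ignored.
/// Duplicate keys resolve to the first occurrence here, which is a
/// simplification and not a statement about Spark's behavior.
fn project_fields(schema: &[&str], parsed: &[(&str, &str)]) -> Row {
    schema
        .iter()
        .map(|name| {
            parsed
                .iter()
                .find(|(key, _)| key == name)
                .map(|(_, value)| value.to_string())
        })
        .collect()
}

/// Decide what to emit once parsing of one record has finished.
/// `partial` holds whatever fields were parsed before any error occurred.
/// FAILFAST surfaces the error; PERMISSIVE keeps the partial row.
fn finish_row(mode: ParseMode, partial: Row, error: Option<String>) -> Result<Row, String> {
    match (mode, error) {
        (ParseMode::FailFast, Some(msg)) => Err(format!("malformed JSON: {}", msg)),
        _ => Ok(partial),
    }
}
```

A compatibility test could compare the native output against a model like this for each edge case, for example checking that project_fields with a schema of name and age leaves age as None when the JSON only contains name.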

Examples

-- Basic struct parsing
SELECT from_json('{"name":"Alice","age":30}', 'name STRING, age INT');
-- Result: {Alice, 30}

-- Parsing to MapType
SELECT from_json('{"key1":"value1"}', 'MAP<STRING,STRING>');

-- With options
SELECT from_json('{"a":1}', 'a INT', map('mode', 'FAILFAST'));

Implementation Approach

  1. Scala Serde: Create CometJsonToStructs in spark/src/main/scala/org/apache/comet/serde/

    • Handle schema serialization to protobuf
    • Serialize options map
    • Consider marking complex options as Unsupported initially
  2. Protobuf: May need a new message type in expr.proto to carry the schema and options

  3. Rust Implementation: Implement JSON parsing in native/spark-expr/src/

    • Use serde_json for parsing
    • Match Spark's null handling behavior
    • Support PERMISSIVE and FAILFAST modes
  4. Potential Simplifications for Initial Implementation:

    • Start with StructType schemas only (defer ArrayType, MapType)
    • Support common options only (mode, dateFormat, timestampFormat)
    • UTC timezone only initially
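
As a rough sketch of the "common options only" simplification, the native side could validate the options map up front and reject anything it does not yet handle, so the caller can fall back to Spark. Everything here is an assumption for illustration: JsonOptions, ParseMode and from_map are hypothetical names, the exact-match handling of option keys is a simplification, and no default is assumed for timestampFormat because the issue does not state one.

```rust
use std::collections::HashMap;

/// Parse modes named in this issue.
#[derive(Debug, Clone, Copy, PartialEq)]
enum ParseMode {
    Permissive,
    FailFast,
}

/// Hypothetical container for the initially supported options.
#[derive(Debug, Clone, PartialEq)]
struct JsonOptions {
    mode: ParseMode,
    date_format: String,
    /// None means "not specified"; no default is assumed here.
    timestamp_format: Option<String>,
}

impl JsonOptions {
    /// Build options from the raw string map. Any option outside the
    /// supported set returns Err, which a caller could treat as a signal
    /// to fall back to Spark instead of running natively.
    fn from_map(options: &HashMap<String, String>) -> Result<Self, String> {
        // Defaults taken from the Key Options list in this issue.
        let mut parsed = JsonOptions {
            mode: ParseMode::Permissive,
            date_format: "yyyy-MM-dd".to_string(),
            timestamp_format: None,
        };
        for (key, value) in options {
            // Assumption: keys are matched exactly as written.
            match key.as_str() {
                "mode" => {
                    // Accept the mode value in any letter case.
                    let upper = value.to_uppercase();
                    parsed.mode = match upper.as_str() {
                        "PERMISSIVE" => ParseMode::Permissive,
                        "FAILFAST" => ParseMode::FailFast,
                        other => return Err(format!("unsupported mode: {}", other)),
                    };
                }
                "dateFormat" => parsed.date_format = value.clone(),
                "timestampFormat" => parsed.timestamp_format = Some(value.clone()),
                other => return Err(format!("unsupported option: {}", other)),
            }
        }
        Ok(parsed)
    }
}
```

With this shape, an options map containing columnNameOfCorruptRecord yields Err, which keeps the initial implementation honest about what it supports rather than silently ignoring the option.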

Difficulty

Large - requires protobuf changes, Rust JSON parsing implementation, and careful compatibility testing for edge cases.

Note: This issue was generated with AI assistance.
