
Commit db13a38

heyihong authored and cloud-fan committed
[SPARK-53578][CONNECT] Simplify data type handling in LiteralValueProtoConverter
### What changes were proposed in this pull request?

This PR simplifies data type handling in the Spark Connect `LiteralValueProtoConverter` by consolidating type information into a single `data_type` field at the root level of the `Literal` message, rather than having separate type fields within nested structures.

**Key changes:**

1. **Protobuf Schema Simplification:**
   - Added a new `data_type` field (field 100) to the root `Expression.Literal` message
   - Removed redundant type fields from nested messages (`Array.data_type`, `Map.data_type`, `Struct.data_type_struct`)

2. **Array Type Handling Enhancement:**
   - Added special handling for `ByteType` arrays to convert them to `Binary` type in the protobuf representation
   - This addresses a specific edge case where byte arrays should be represented as binary data

A sketch of what the consolidated representation looks like on the wire follows this description.

### Why are the changes needed?

The current data type handling in Spark Connect has several issues:

1. **Redundancy and Complexity:** Type information is scattered across multiple fields in nested messages, making the protobuf schema unnecessarily complex and error-prone.
2. **Limited Extensibility:** Without this data_type field, it is difficult to add type information for literal types. For example, it's challenging to include detailed type metadata for types like `String` (with collation information), `YearMonthInterval`, `DayTimeInterval`, and other types that may require additional type-specific attributes.

### Does this PR introduce _any_ user-facing change?

**No.** This is an internal refactoring of the Spark Connect protobuf schema and converter logic.

### How was this patch tested?

`build/sbt "connect/testOnly *LiteralExpressionProtoConverterSuite"`

`SPARK_GENERATE_GOLDEN_FILES=1 build/sbt "connect-client-jvm/testOnly org.apache.spark.sql.PlanGenerationTestSuite"`

`build/sbt "connect/testOnly org.apache.spark.sql.connect.ProtoToParsedPlanTestSuite"`

### Was this patch authored or co-authored using generative AI tooling?

Generated-by: Cursor 1.5.11

Closes #52342 from heyihong/SPARK-53578.

Authored-by: Yihong He <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>
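For orientation, here is a minimal sketch of the consolidated representation — not code from this PR, and it assumes the generated `org.apache.spark.connect.proto` builder API:

```scala
import com.google.protobuf.ByteString
import org.apache.spark.connect.proto

// An array literal: the element values stay in Array.elements, while the type
// now rides once in the root-level data_type (field 100) instead of the
// removed Array.data_type field.
val arrayLit = proto.Expression.Literal
  .newBuilder()
  .setArray(
    proto.Expression.Literal.Array
      .newBuilder()
      .addElements(proto.Expression.Literal.newBuilder().setInteger(1))
      .addElements(proto.Expression.Literal.newBuilder().setInteger(2)))
  .setDataType(
    proto.DataType
      .newBuilder()
      .setArray(
        proto.DataType.Array
          .newBuilder()
          .setElementType(
            proto.DataType.newBuilder().setInteger(proto.DataType.Integer.newBuilder()))
          .setContainsNull(false)))
  .build()

// Per the ByteType special case described above, a byte-array value would be
// emitted as a binary literal rather than an array of byte literals
// (an assumption about the converter's output shape, shown for illustration):
val binaryLit = proto.Expression.Literal
  .newBuilder()
  .setBinary(ByteString.copyFrom(Array[Byte](1, 2, 3)))
  .build()
```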
1 parent a8bb8b0 commit db13a38

File tree

16 files changed: +972 -1072 lines changed

python/pyspark/sql/connect/proto/expressions_pb2.py

Lines changed: 65 additions & 65 deletions
Large diffs are not rendered by default.

python/pyspark/sql/connect/proto/expressions_pb2.pyi

Lines changed: 24 additions & 71 deletions
```diff
@@ -496,7 +496,6 @@ class Expression(google.protobuf.message.Message):
 
             ELEMENT_TYPE_FIELD_NUMBER: builtins.int
             ELEMENTS_FIELD_NUMBER: builtins.int
-            DATA_TYPE_FIELD_NUMBER: builtins.int
             @property
             def element_type(self) -> pyspark.sql.connect.proto.types_pb2.DataType:
                 """(Deprecated) The element type of the array.
@@ -509,42 +508,20 @@ class Expression(google.protobuf.message.Message):
             ) -> google.protobuf.internal.containers.RepeatedCompositeFieldContainer[
                 global___Expression.Literal
             ]:
-                """The literal values that make up the array elements.
-
-                For inferring the data_type.element_type, only the first element needs to
-                contain the type information.
-                """
-            @property
-            def data_type(self) -> pyspark.sql.connect.proto.types_pb2.DataType.Array:
-                """The type of the array. You don't need to set this field if the type information is not needed.
-
-                If the element type can be inferred from the first element of the elements field,
-                then you don't need to set data_type.element_type to save space.
-
-                On the other hand, redundant type information is also acceptable.
-                """
+                """The literal values that make up the array elements."""
             def __init__(
                 self,
                 *,
                 element_type: pyspark.sql.connect.proto.types_pb2.DataType | None = ...,
                 elements: collections.abc.Iterable[global___Expression.Literal] | None = ...,
-                data_type: pyspark.sql.connect.proto.types_pb2.DataType.Array | None = ...,
             ) -> None: ...
             def HasField(
-                self,
-                field_name: typing_extensions.Literal[
-                    "data_type", b"data_type", "element_type", b"element_type"
-                ],
+                self, field_name: typing_extensions.Literal["element_type", b"element_type"]
             ) -> builtins.bool: ...
             def ClearField(
                 self,
                 field_name: typing_extensions.Literal[
-                    "data_type",
-                    b"data_type",
-                    "element_type",
-                    b"element_type",
-                    "elements",
-                    b"elements",
+                    "element_type", b"element_type", "elements", b"elements"
                 ],
             ) -> None: ...
 
@@ -555,7 +532,6 @@ class Expression(google.protobuf.message.Message):
             VALUE_TYPE_FIELD_NUMBER: builtins.int
             KEYS_FIELD_NUMBER: builtins.int
             VALUES_FIELD_NUMBER: builtins.int
-            DATA_TYPE_FIELD_NUMBER: builtins.int
             @property
             def key_type(self) -> pyspark.sql.connect.proto.types_pb2.DataType:
                 """(Deprecated) The key type of the map.
@@ -575,51 +551,31 @@ class Expression(google.protobuf.message.Message):
             ) -> google.protobuf.internal.containers.RepeatedCompositeFieldContainer[
                 global___Expression.Literal
             ]:
-                """The literal keys that make up the map.
-
-                For inferring the data_type.key_type, only the first key needs to
-                contain the type information.
-                """
+                """The literal keys that make up the map."""
             @property
             def values(
                 self,
             ) -> google.protobuf.internal.containers.RepeatedCompositeFieldContainer[
                 global___Expression.Literal
             ]:
-                """The literal values that make up the map.
-
-                For inferring the data_type.value_type, only the first value needs to
-                contain the type information.
-                """
-            @property
-            def data_type(self) -> pyspark.sql.connect.proto.types_pb2.DataType.Map:
-                """The type of the map. You don't need to set this field if the type information is not needed.
-
-                If the key/value types can be inferred from the first element of the keys/values fields,
-                then you don't need to set data_type.key_type/data_type.value_type to save space.
-
-                On the other hand, redundant type information is also acceptable.
-                """
+                """The literal values that make up the map."""
             def __init__(
                 self,
                 *,
                 key_type: pyspark.sql.connect.proto.types_pb2.DataType | None = ...,
                 value_type: pyspark.sql.connect.proto.types_pb2.DataType | None = ...,
                 keys: collections.abc.Iterable[global___Expression.Literal] | None = ...,
                 values: collections.abc.Iterable[global___Expression.Literal] | None = ...,
-                data_type: pyspark.sql.connect.proto.types_pb2.DataType.Map | None = ...,
             ) -> None: ...
             def HasField(
                 self,
                 field_name: typing_extensions.Literal[
-                    "data_type", b"data_type", "key_type", b"key_type", "value_type", b"value_type"
+                    "key_type", b"key_type", "value_type", b"value_type"
                 ],
             ) -> builtins.bool: ...
             def ClearField(
                 self,
                 field_name: typing_extensions.Literal[
-                    "data_type",
-                    b"data_type",
                     "key_type",
                     b"key_type",
                     "keys",
@@ -636,50 +592,33 @@ class Expression(google.protobuf.message.Message):
 
             STRUCT_TYPE_FIELD_NUMBER: builtins.int
             ELEMENTS_FIELD_NUMBER: builtins.int
-            DATA_TYPE_STRUCT_FIELD_NUMBER: builtins.int
             @property
             def struct_type(self) -> pyspark.sql.connect.proto.types_pb2.DataType:
                 """(Deprecated) The type of the struct.
 
                 This field is deprecated since Spark 4.1+ because using DataType as the type of a struct
-                is ambiguous. Use data_type_struct field instead.
+                is ambiguous. Use data_type field instead.
                 """
             @property
             def elements(
                 self,
             ) -> google.protobuf.internal.containers.RepeatedCompositeFieldContainer[
                 global___Expression.Literal
             ]:
-                """(Required) The literal values that make up the struct elements."""
-            @property
-            def data_type_struct(self) -> pyspark.sql.connect.proto.types_pb2.DataType.Struct:
-                """The type of the struct. You don't need to set this field if the type information is not needed.
-
-                Whether data_type_struct.fields.data_type should be set depends on
-                whether each field's type can be inferred from the elements field.
-                """
+                """The literal values that make up the struct elements."""
             def __init__(
                 self,
                 *,
                 struct_type: pyspark.sql.connect.proto.types_pb2.DataType | None = ...,
                 elements: collections.abc.Iterable[global___Expression.Literal] | None = ...,
-                data_type_struct: pyspark.sql.connect.proto.types_pb2.DataType.Struct | None = ...,
             ) -> None: ...
             def HasField(
-                self,
-                field_name: typing_extensions.Literal[
-                    "data_type_struct", b"data_type_struct", "struct_type", b"struct_type"
-                ],
+                self, field_name: typing_extensions.Literal["struct_type", b"struct_type"]
             ) -> builtins.bool: ...
             def ClearField(
                 self,
                 field_name: typing_extensions.Literal[
-                    "data_type_struct",
-                    b"data_type_struct",
-                    "elements",
-                    b"elements",
-                    "struct_type",
-                    b"struct_type",
+                    "elements", b"elements", "struct_type", b"struct_type"
                 ],
             ) -> None: ...
 
@@ -811,6 +750,7 @@ class Expression(google.protobuf.message.Message):
         STRUCT_FIELD_NUMBER: builtins.int
         SPECIALIZED_ARRAY_FIELD_NUMBER: builtins.int
         TIME_FIELD_NUMBER: builtins.int
+        DATA_TYPE_FIELD_NUMBER: builtins.int
         @property
         def null(self) -> pyspark.sql.connect.proto.types_pb2.DataType: ...
         binary: builtins.bytes
@@ -844,6 +784,14 @@ class Expression(google.protobuf.message.Message):
         def specialized_array(self) -> global___Expression.Literal.SpecializedArray: ...
         @property
         def time(self) -> global___Expression.Literal.Time: ...
+        @property
+        def data_type(self) -> pyspark.sql.connect.proto.types_pb2.DataType:
+            """Data type information for the literal.
+            This field is required only in the root literal message for null values or
+            for data types (e.g., array, map, or struct) with non-trivial information.
+            If the data_type field is not set at the root level, the data type will be
+            inferred or retrieved from the deprecated data type fields using best efforts.
+            """
         def __init__(
             self,
             *,
@@ -869,6 +817,7 @@ class Expression(google.protobuf.message.Message):
             struct: global___Expression.Literal.Struct | None = ...,
             specialized_array: global___Expression.Literal.SpecializedArray | None = ...,
             time: global___Expression.Literal.Time | None = ...,
+            data_type: pyspark.sql.connect.proto.types_pb2.DataType | None = ...,
         ) -> None: ...
         def HasField(
             self,
@@ -883,6 +832,8 @@ class Expression(google.protobuf.message.Message):
                 b"byte",
                 "calendar_interval",
                 b"calendar_interval",
+                "data_type",
+                b"data_type",
                 "date",
                 b"date",
                 "day_time_interval",
@@ -934,6 +885,8 @@ class Expression(google.protobuf.message.Message):
                 b"byte",
                 "calendar_interval",
                 b"calendar_interval",
+                "data_type",
+                b"data_type",
                 "date",
                 b"date",
                 "day_time_interval",
```

sql/connect/client/jvm/src/test/scala/org/apache/spark/sql/connect/ColumnNodeToProtoConverterSuite.scala

Lines changed: 4 additions & 10 deletions
```diff
@@ -78,23 +78,17 @@ class ColumnNodeToProtoConverterSuite extends ConnectFunSuite {
     testConversion(
       Literal((12.0, "north", 60.0, "west"), Option(dataType)),
       expr { b =>
-        val builder = b.getLiteralBuilder.getStructBuilder
-        builder
+        b.getLiteralBuilder.getStructBuilder
           .addElements(proto.Expression.Literal.newBuilder().setDouble(12.0).build())
-        builder
           .addElements(proto.Expression.Literal.newBuilder().setString("north").build())
-        builder
           .addElements(proto.Expression.Literal.newBuilder().setDouble(60.0).build())
-        builder
           .addElements(proto.Expression.Literal.newBuilder().setString("west").build())
-        builder.setDataTypeStruct(
+        b.getLiteralBuilder.getDataTypeBuilder.setStruct(
           proto.DataType.Struct
             .newBuilder()
-            .addFields(
-              proto.DataType.StructField.newBuilder().setName("_1").setNullable(true).build())
+            .addFields(structField("_1", ProtoDataTypes.DoubleType))
             .addFields(structField("_2", stringTypeWithCollation))
-            .addFields(
-              proto.DataType.StructField.newBuilder().setName("_3").setNullable(true).build())
+            .addFields(structField("_3", ProtoDataTypes.DoubleType))
             .addFields(structField("_4", stringTypeWithCollation))
             .build())
       })
```
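The `structField(...)` helper used above is defined elsewhere in this suite; a plausible shape, shown only for orientation (an assumption, not the actual helper):

```scala
import org.apache.spark.connect.proto

// Hypothetical reconstruction of the suite's structField helper: it builds a
// DataType.StructField with an explicit field type, which the updated
// expectation needs now that struct types live in the root-level data_type.
def structField(name: String, dataType: proto.DataType): proto.DataType.StructField =
  proto.DataType.StructField
    .newBuilder()
    .setName(name)
    .setDataType(dataType)
    .setNullable(true)
    .build()
```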

sql/connect/common/src/main/protobuf/spark/connect/expressions.proto

Lines changed: 9 additions & 33 deletions
```diff
@@ -207,6 +207,13 @@ message Expression {
       Time time = 26;
     }
 
+    // Data type information for the literal.
+    // This field is required only in the root literal message for null values or
+    // for data types (e.g., array, map, or struct) with non-trivial information.
+    // If the data_type field is not set at the root level, the data type will be
+    // inferred or retrieved from the deprecated data type fields using best efforts.
+    DataType data_type = 100;
+
     message Decimal {
       // the string representation.
       string value = 1;
@@ -230,18 +237,7 @@ message Expression {
       DataType element_type = 1 [deprecated = true];
 
       // The literal values that make up the array elements.
-      //
-      // For inferring the data_type.element_type, only the first element needs to
-      // contain the type information.
       repeated Literal elements = 2;
-
-      // The type of the array. You don't need to set this field if the type information is not needed.
-      //
-      // If the element type can be inferred from the first element of the elements field,
-      // then you don't need to set data_type.element_type to save space.
-      //
-      // On the other hand, redundant type information is also acceptable.
-      DataType.Array data_type = 3;
     }
 
     message Map {
@@ -257,41 +253,21 @@ message Expression {
       DataType value_type = 2 [deprecated = true];
 
       // The literal keys that make up the map.
-      //
-      // For inferring the data_type.key_type, only the first key needs to
-      // contain the type information.
       repeated Literal keys = 3;
 
       // The literal values that make up the map.
-      //
-      // For inferring the data_type.value_type, only the first value needs to
-      // contain the type information.
       repeated Literal values = 4;
-
-      // The type of the map. You don't need to set this field if the type information is not needed.
-      //
-      // If the key/value types can be inferred from the first element of the keys/values fields,
-      // then you don't need to set data_type.key_type/data_type.value_type to save space.
-      //
-      // On the other hand, redundant type information is also acceptable.
-      DataType.Map data_type = 5;
     }
 
     message Struct {
       // (Deprecated) The type of the struct.
       //
       // This field is deprecated since Spark 4.1+ because using DataType as the type of a struct
-      // is ambiguous. Use data_type_struct field instead.
+      // is ambiguous. Use data_type field instead.
      DataType struct_type = 1 [deprecated = true];
 
-      // (Required) The literal values that make up the struct elements.
+      // The literal values that make up the struct elements.
       repeated Literal elements = 2;
-
-      // The type of the struct. You don't need to set this field if the type information is not needed.
-      //
-      // Whether data_type_struct.fields.data_type should be set depends on
-      // whether each field's type can be inferred from the elements field.
-      DataType.Struct data_type_struct = 3;
     }
 
     message SpecializedArray {
```
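The comment on the new field describes a resolution order for consumers of the message. A minimal sketch of that order, assuming the generated Java/Scala proto API (`inferFromDeprecatedFields` is a hypothetical placeholder, not a real converter method):

```scala
import org.apache.spark.connect.proto

// Sketch: how a reader could resolve a literal's type under the new scheme.
def resolveDataType(lit: proto.Expression.Literal): proto.DataType =
  if (lit.hasDataType) {
    // Preferred: the root-level data_type (field 100) added by this commit.
    lit.getDataType
  } else if (lit.hasNull) {
    // Typed nulls already carry a full DataType in the `null` oneof case.
    lit.getNull
  } else {
    // Best-effort fallback over the deprecated nested fields,
    // e.g. Array.element_type or the first element's type.
    inferFromDeprecatedFields(lit) // hypothetical helper
  }
```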
