Commit d96c3e3

jiangxb1987 authored and HyukjinKwon committed
[SPARK-21811][SQL] Fix the inconsistency behavior when finding the widest common type
## What changes were proposed in this pull request?

Currently we find the wider common type by comparing the two types from left to right. This can be a problem when two data types have no common type between themselves, but each can be promoted to StringType. For instance, given a table with the schema `[c1: date, c2: string, c3: int]`, the following succeeds:

```sql
SELECT coalesce(c1, c2, c3) FROM table
```

while the following throws an exception:

```sql
SELECT coalesce(c1, c3, c2) FROM table
```

This is only an issue when the seq of dataTypes contains `StringType` and all the types can do string promotion.

close apache#19033

## How was this patch tested?

Added a test in `TypeCoercionSuite`.

Author: Xingbo Jiang <[email protected]>

Closes apache#21074 from jiangxb1987/typeCoercion.
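The order-dependence described above can be reproduced with a toy model of pairwise widening. This is not Spark's actual implementation; the type names and the promotion rule here are simplified assumptions, kept only to show why a left-to-right fold fails for one argument order but not the other.

```python
def find_wider_type_for_two(a, b):
    """Toy widening rule: equal types widen to themselves, and any type
    can be promoted to "string"; "date" and "int" have no common type."""
    if a == b:
        return a
    if "string" in (a, b):
        return "string"  # string promotion
    return None  # no wider common type

def find_wider_common_type(types):
    """Left-to-right fold over the types, modeling the pre-fix behavior."""
    result = types[0]
    for t in types[1:]:
        if result is None:
            return None  # an earlier pair already failed to widen
        result = find_wider_type_for_two(result, t)
    return result

# (date, string, int): date+string -> string, string+int -> string.
print(find_wider_common_type(["date", "string", "int"]))  # string
# (date, int, string): date+int fails before "string" is ever seen.
print(find_wider_common_type(["date", "int", "string"]))  # None
```

The second call fails only because `date` meets `int` before any `string` has entered the fold, which is exactly the asymmetry this commit removes.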
1 parent 9e10f69 commit d96c3e3

File tree

3 files changed: 34 additions, 5 deletions


docs/sql-programming-guide.md

Lines changed: 1 addition & 1 deletion

```diff
@@ -1810,7 +1810,7 @@ working with timestamps in `pandas_udf`s to get the best performance, see
   - Since Spark 2.4, writing a dataframe with an empty or nested empty schema using any file formats (parquet, orc, json, text, csv etc.) is not allowed. An exception is thrown when attempting to write dataframes with empty schema.
   - Since Spark 2.4, Spark compares a DATE type with a TIMESTAMP type after promotes both sides to TIMESTAMP. To set `false` to `spark.sql.hive.compareDateTimestampInTimestamp` restores the previous behavior. This option will be removed in Spark 3.0.
   - Since Spark 2.4, creating a managed table with nonempty location is not allowed. An exception is thrown when attempting to create a managed table with nonempty location. To set `true` to `spark.sql.allowCreatingManagedTableUsingNonemptyLocation` restores the previous behavior. This option will be removed in Spark 3.0.
-
+  - Since Spark 2.4, the type coercion rules can automatically promote the argument types of the variadic SQL functions (e.g., IN/COALESCE) to the widest common type, no matter how the input arguments order. In prior Spark versions, the promotion could fail in some specific orders (e.g., TimestampType, IntegerType and StringType) and throw an exception.

 ## Upgrading From Spark SQL 2.2 to 2.3

   - Since Spark 2.3, the queries from raw JSON/CSV files are disallowed when the referenced columns only include the internal corrupt record column (named `_corrupt_record` by default). For example, `spark.read.schema(schema).json(file).filter($"_corrupt_record".isNotNull).count()` and `spark.read.schema(schema).json(file).select("_corrupt_record").show()`. Instead, you can cache or save the parsed results and then send the same query. For example, `val df = spark.read.schema(schema).json(file).cache()` and then `df.filter($"_corrupt_record".isNotNull).count()`.
```

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/TypeCoercion.scala

Lines changed: 20 additions & 4 deletions

```diff
@@ -175,11 +175,27 @@ object TypeCoercion {
       })
   }

+  /**
+   * Whether the data type contains StringType.
+   */
+  def hasStringType(dt: DataType): Boolean = dt match {
+    case StringType => true
+    case ArrayType(et, _) => hasStringType(et)
+    // Add StructType if we support string promotion for struct fields in the future.
+    case _ => false
+  }
+
   private def findWiderCommonType(types: Seq[DataType]): Option[DataType] = {
-    types.foldLeft[Option[DataType]](Some(NullType))((r, c) => r match {
-      case Some(d) => findWiderTypeForTwo(d, c)
-      case None => None
-    })
+    // findWiderTypeForTwo doesn't satisfy the associative law, i.e. (a op b) op c may not equal
+    // to a op (b op c). This is only a problem for StringType or nested StringType in ArrayType.
+    // Excluding these types, findWiderTypeForTwo satisfies the associative law. For instance,
+    // (TimestampType, IntegerType, StringType) should have StringType as the wider common type.
+    val (stringTypes, nonStringTypes) = types.partition(hasStringType(_))
+    (stringTypes.distinct ++ nonStringTypes).foldLeft[Option[DataType]](Some(NullType))((r, c) =>
+      r match {
+        case Some(d) => findWiderTypeForTwo(d, c)
+        case _ => None
+      })
   }

   /**
```

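The fix above moves every string-containing type to the front of the sequence before folding, so each non-string type is widened against `StringType` first. Extending the earlier toy model (again a simplified sketch with assumed type names, not Spark's Scala code), the same reordering makes the fold order-insensitive:

```python
def has_string_type(t):
    # Spark's version also recurses into ArrayType element types.
    return t == "string"

def find_wider_type_for_two(a, b):
    """Toy widening rule with a "null" type that widens to anything,
    playing the role of Spark's NullType fold seed."""
    if a == "null":
        return b
    if b == "null":
        return a
    if a == b:
        return a
    if "string" in (a, b):
        return "string"  # string promotion
    return None

def find_wider_common_type(types):
    # Partition string-containing types to the front, deduplicated,
    # mirroring `stringTypes.distinct ++ nonStringTypes` in the patch.
    string_types = [t for t in types if has_string_type(t)]
    non_string_types = [t for t in types if not has_string_type(t)]
    reordered = list(dict.fromkeys(string_types)) + non_string_types

    result = "null"  # fold seed, like Some(NullType)
    for t in reordered:
        result = find_wider_type_for_two(result, t)
        if result is None:
            return None
    return result

# Both argument orders now widen to "string".
print(find_wider_common_type(["date", "string", "int"]))  # string
print(find_wider_common_type(["date", "int", "string"]))  # string
```

With the strings pulled to the front, `date` and `int` never meet each other directly; each is widened against the already-accumulated `string`, which is why the previously failing order now succeeds.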
sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/analysis/TypeCoercionSuite.scala

Lines changed: 13 additions & 0 deletions

```diff
@@ -539,6 +539,9 @@ class TypeCoercionSuite extends AnalysisTest {
     val floatLit = Literal.create(1.0f, FloatType)
     val timestampLit = Literal.create("2017-04-12", TimestampType)
     val decimalLit = Literal(new java.math.BigDecimal("1000000000000000000000"))
+    val tsArrayLit = Literal(Array(new Timestamp(System.currentTimeMillis())))
+    val strArrayLit = Literal(Array("c"))
+    val intArrayLit = Literal(Array(1))

     ruleTest(rule,
       Coalesce(Seq(doubleLit, intLit, floatLit)),
@@ -572,6 +575,16 @@ class TypeCoercionSuite extends AnalysisTest {
       Coalesce(Seq(nullLit, floatNullLit, doubleLit, stringLit)),
       Coalesce(Seq(Cast(nullLit, StringType), Cast(floatNullLit, StringType),
         Cast(doubleLit, StringType), Cast(stringLit, StringType))))
+
+    ruleTest(rule,
+      Coalesce(Seq(timestampLit, intLit, stringLit)),
+      Coalesce(Seq(Cast(timestampLit, StringType), Cast(intLit, StringType),
+        Cast(stringLit, StringType))))
+
+    ruleTest(rule,
+      Coalesce(Seq(tsArrayLit, intArrayLit, strArrayLit)),
+      Coalesce(Seq(Cast(tsArrayLit, ArrayType(StringType)),
+        Cast(intArrayLit, ArrayType(StringType)), Cast(strArrayLit, ArrayType(StringType)))))
   }

   test("CreateArray casts") {
```
