Commit 28774cd

HyukjinKwon authored and cloud-fan committed
[SPARK-28359][SQL][PYTHON][TESTS] Make integrated UDF tests robust by making UDFs (virtually) no-op
## What changes were proposed in this pull request?

The UDFs currently available in `IntegratedUDFTestUtils` are not exactly no-ops. They convert the input column to strings and output strings. This causes issues when we convert and port the tests in SPARK-27921. Integrated UDF test cases share one output file, so every variant should produce the same output. However:

1. Special values are converted into strings differently:

    | Scala      | Python |
    | ---------- | ------ |
    | `null`     | `None` |
    | `Infinity` | `inf`  |
    | `-Infinity`| `-inf` |
    | `NaN`      | `nan`  |

2. Due to float limitations in Python (see https://docs.python.org/3/tutorial/floatingpoint.html), if a float is passed into Python and sent back to the JVM, the value is potentially not exactly the same.

    See apache#25128 and apache#25110

To work around this, this PR changes the current UDFs to be wrapped by casts. The input column is cast to string, the UDF returns the string as is, and the output column is then cast back to the input column's type.

Roughly:

**Before:**

```
JVM (col1) -> (cast to string within Python) Python (string) -> (string) JVM
```

**After:**

```
JVM (cast col1 to string) -> (string) Python (string) -> (cast back to col1's type) JVM
```

In this way, the UDF is virtually a no-op, although there might be some subtleties due to the round trip through the string cast. I believe this is good enough.

Python native functions and Scala native functions will take strings and output strings as is. So there will be no potential test failures due to conversion differences between Python and Scala.
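The before/after flow above can be illustrated without Spark. The following is a minimal sketch in plain Python, not the actual `IntegratedUDFTestUtils` code: `old_python_udf` and `noop_udf` are hypothetical stand-ins, and the Scala-side strings are simply copied from the table in the description. It only compares display forms; a real NULL stays NULL across a cast rather than becoming the text `null`.

```python
# Minimal sketch (no Spark involved); all helper names are hypothetical.

# Special values as Python objects, in the order of the table above.
special_values = [None, float("inf"), float("-inf"), float("nan")]

# Display forms on the Scala/JVM side, copied from the commit description.
scala_strings = ["null", "Infinity", "-Infinity", "NaN"]


def old_python_udf(value):
    """Before: the value reaches Python as is and is stringified there."""
    return str(value)


def noop_udf(text):
    """After: the input is already a string, so it is returned untouched."""
    return text


# Before: Python's own string forms end up in the shared output file.
python_strings = [old_python_udf(v) for v in special_values]

# After: strings produced outside Python pass through unchanged.
after_strings = [noop_udf(s) for s in scala_strings]

print(python_strings)  # ['None', 'inf', '-inf', 'nan']
print(after_strings == scala_strings)  # True
```

Because the stringification no longer happens inside the UDF body, the Python and Scala variants have nothing left to disagree about.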
After this fix, for instance, `udf-aggregates_part1.sql` outputs exactly the same as `aggregates_part1.sql`:

<details><summary>Diff comparing to 'pgSQL/aggregates_part1.sql'</summary>
<p>

```diff
diff --git a/sql/core/src/test/resources/sql-tests/results/pgSQL/aggregates_part1.sql.out b/sql/core/src/test/resources/sql-tests/results/udf/pgSQL/udf-aggregates_part1.sql.out
index 51ca1d5..801735781c7 100644
--- a/sql/core/src/test/resources/sql-tests/results/pgSQL/aggregates_part1.sql.out
+++ b/sql/core/src/test/resources/sql-tests/results/udf/pgSQL/udf-aggregates_part1.sql.out
@@ -3,7 +3,7 @@
 -- !query 0
-SELECT avg(four) AS avg_1 FROM onek
+SELECT avg(udf(four)) AS avg_1 FROM onek
 -- !query 0 schema
 struct<avg_1:double>
 -- !query 0 output
@@ -11,7 +11,7 @@ struct<avg_1:double>
 -- !query 1
-SELECT avg(a) AS avg_32 FROM aggtest WHERE a < 100
+SELECT udf(avg(a)) AS avg_32 FROM aggtest WHERE a < 100
 -- !query 1 schema
 struct<avg_32:double>
 -- !query 1 output
@@ -19,7 +19,7 @@ struct<avg_32:double>
 -- !query 2
-select CAST(avg(b) AS Decimal(10,3)) AS avg_107_943 FROM aggtest
+select CAST(avg(udf(b)) AS Decimal(10,3)) AS avg_107_943 FROM aggtest
 -- !query 2 schema
 struct<avg_107_943:decimal(10,3)>
 -- !query 2 output
@@ -27,7 +27,7 @@ struct<avg_107_943:decimal(10,3)>
 -- !query 3
-SELECT sum(four) AS sum_1500 FROM onek
+SELECT sum(udf(four)) AS sum_1500 FROM onek
 -- !query 3 schema
 struct<sum_1500:bigint>
 -- !query 3 output
@@ -35,7 +35,7 @@ struct<sum_1500:bigint>
 -- !query 4
-SELECT sum(a) AS sum_198 FROM aggtest
+SELECT udf(sum(a)) AS sum_198 FROM aggtest
 -- !query 4 schema
 struct<sum_198:bigint>
 -- !query 4 output
@@ -43,7 +43,7 @@ struct<sum_198:bigint>
 -- !query 5
-SELECT sum(b) AS avg_431_773 FROM aggtest
+SELECT udf(udf(sum(b))) AS avg_431_773 FROM aggtest
 -- !query 5 schema
 struct<avg_431_773:double>
 -- !query 5 output
@@ -51,7 +51,7 @@ struct<avg_431_773:double>
 -- !query 6
-SELECT max(four) AS max_3 FROM onek
+SELECT udf(max(four)) AS max_3 FROM onek
 -- !query 6 schema
 struct<max_3:int>
 -- !query 6 output
@@ -59,7 +59,7 @@ struct<max_3:int>
 -- !query 7
-SELECT max(a) AS max_100 FROM aggtest
+SELECT max(udf(a)) AS max_100 FROM aggtest
 -- !query 7 schema
 struct<max_100:int>
 -- !query 7 output
@@ -67,7 +67,7 @@ struct<max_100:int>
 -- !query 8
-SELECT max(aggtest.b) AS max_324_78 FROM aggtest
+SELECT udf(udf(max(aggtest.b))) AS max_324_78 FROM aggtest
 -- !query 8 schema
 struct<max_324_78:float>
 -- !query 8 output
@@ -75,237 +75,238 @@ struct<max_324_78:float>
 -- !query 9
-SELECT stddev_pop(b) FROM aggtest
+SELECT stddev_pop(udf(b)) FROM aggtest
 -- !query 9 schema
-struct<stddev_pop(CAST(b AS DOUBLE)):double>
+struct<stddev_pop(CAST(CAST(udf(cast(b as string)) AS FLOAT) AS DOUBLE)):double>
 -- !query 9 output
 131.10703231895047

 -- !query 10
-SELECT stddev_samp(b) FROM aggtest
+SELECT udf(stddev_samp(b)) FROM aggtest
 -- !query 10 schema
-struct<stddev_samp(CAST(b AS DOUBLE)):double>
+struct<CAST(udf(cast(stddev_samp(cast(b as double)) as string)) AS DOUBLE):double>
 -- !query 10 output
 151.38936080399804

 -- !query 11
-SELECT var_pop(b) FROM aggtest
+SELECT var_pop(udf(b)) FROM aggtest
 -- !query 11 schema
-struct<var_pop(CAST(b AS DOUBLE)):double>
+struct<var_pop(CAST(CAST(udf(cast(b as string)) AS FLOAT) AS DOUBLE)):double>
 -- !query 11 output
 17189.053923482323

 -- !query 12
-SELECT var_samp(b) FROM aggtest
+SELECT udf(var_samp(b)) FROM aggtest
 -- !query 12 schema
-struct<var_samp(CAST(b AS DOUBLE)):double>
+struct<CAST(udf(cast(var_samp(cast(b as double)) as string)) AS DOUBLE):double>
 -- !query 12 output
 22918.738564643096

 -- !query 13
-SELECT stddev_pop(CAST(b AS Decimal(38,0))) FROM aggtest
+SELECT udf(stddev_pop(CAST(b AS Decimal(38,0)))) FROM aggtest
 -- !query 13 schema
-struct<stddev_pop(CAST(CAST(b AS DECIMAL(38,0)) AS DOUBLE)):double>
+struct<CAST(udf(cast(stddev_pop(cast(cast(b as decimal(38,0)) as double)) as string)) AS DOUBLE):double>
 -- !query 13 output
 131.18117242958306

 -- !query 14
-SELECT stddev_samp(CAST(b AS Decimal(38,0))) FROM aggtest
+SELECT stddev_samp(CAST(udf(b) AS Decimal(38,0))) FROM aggtest
 -- !query 14 schema
-struct<stddev_samp(CAST(CAST(b AS DECIMAL(38,0)) AS DOUBLE)):double>
+struct<stddev_samp(CAST(CAST(CAST(udf(cast(b as string)) AS FLOAT) AS DECIMAL(38,0)) AS DOUBLE)):double>
 -- !query 14 output
 151.47497042966097

 -- !query 15
-SELECT var_pop(CAST(b AS Decimal(38,0))) FROM aggtest
+SELECT udf(var_pop(CAST(b AS Decimal(38,0)))) FROM aggtest
 -- !query 15 schema
-struct<var_pop(CAST(CAST(b AS DECIMAL(38,0)) AS DOUBLE)):double>
+struct<CAST(udf(cast(var_pop(cast(cast(b as decimal(38,0)) as double)) as string)) AS DOUBLE):double>
 -- !query 15 output
 17208.5

 -- !query 16
-SELECT var_samp(CAST(b AS Decimal(38,0))) FROM aggtest
+SELECT var_samp(udf(CAST(b AS Decimal(38,0)))) FROM aggtest
 -- !query 16 schema
-struct<var_samp(CAST(CAST(b AS DECIMAL(38,0)) AS DOUBLE)):double>
+struct<var_samp(CAST(CAST(udf(cast(cast(b as decimal(38,0)) as string)) AS DECIMAL(38,0)) AS DOUBLE)):double>
 -- !query 16 output
 22944.666666666668

 -- !query 17
-SELECT var_pop(1.0), var_samp(2.0)
+SELECT udf(var_pop(1.0)), var_samp(udf(2.0))
 -- !query 17 schema
-struct<var_pop(CAST(1.0 AS DOUBLE)):double,var_samp(CAST(2.0 AS DOUBLE)):double>
+struct<CAST(udf(cast(var_pop(cast(1.0 as double)) as string)) AS DOUBLE):double,var_samp(CAST(CAST(udf(cast(2.0 as string)) AS DECIMAL(2,1)) AS DOUBLE)):double>
 -- !query 17 output
 0.0 NaN

 -- !query 18
-SELECT stddev_pop(CAST(3.0 AS Decimal(38,0))), stddev_samp(CAST(4.0 AS Decimal(38,0)))
+SELECT stddev_pop(udf(CAST(3.0 AS Decimal(38,0)))), stddev_samp(CAST(udf(4.0) AS Decimal(38,0)))
 -- !query 18 schema
-struct<stddev_pop(CAST(CAST(3.0 AS DECIMAL(38,0)) AS DOUBLE)):double,stddev_samp(CAST(CAST(4.0 AS DECIMAL(38,0)) AS DOUBLE)):double>
+struct<stddev_pop(CAST(CAST(udf(cast(cast(3.0 as decimal(38,0)) as string)) AS DECIMAL(38,0)) AS DOUBLE)):double,stddev_samp(CAST(CAST(CAST(udf(cast(4.0 as string)) AS DECIMAL(2,1)) AS DECIMAL(38,0)) AS DOUBLE)):double>
 -- !query 18 output
 0.0 NaN

 -- !query 19
-select sum(CAST(null AS int)) from range(1,4)
+select sum(udf(CAST(null AS int))) from range(1,4)
 -- !query 19 schema
-struct<sum(CAST(NULL AS INT)):bigint>
+struct<sum(CAST(udf(cast(cast(null as int) as string)) AS INT)):bigint>
 -- !query 19 output
 NULL

 -- !query 20
-select sum(CAST(null AS long)) from range(1,4)
+select sum(udf(CAST(null AS long))) from range(1,4)
 -- !query 20 schema
-struct<sum(CAST(NULL AS BIGINT)):bigint>
+struct<sum(CAST(udf(cast(cast(null as bigint) as string)) AS BIGINT)):bigint>
 -- !query 20 output
 NULL

 -- !query 21
-select sum(CAST(null AS Decimal(38,0))) from range(1,4)
+select sum(udf(CAST(null AS Decimal(38,0)))) from range(1,4)
 -- !query 21 schema
-struct<sum(CAST(NULL AS DECIMAL(38,0))):decimal(38,0)>
+struct<sum(CAST(udf(cast(cast(null as decimal(38,0)) as string)) AS DECIMAL(38,0))):decimal(38,0)>
 -- !query 21 output
 NULL

 -- !query 22
-select sum(CAST(null AS DOUBLE)) from range(1,4)
+select sum(udf(CAST(null AS DOUBLE))) from range(1,4)
 -- !query 22 schema
-struct<sum(CAST(NULL AS DOUBLE)):double>
+struct<sum(CAST(udf(cast(cast(null as double) as string)) AS DOUBLE)):double>
 -- !query 22 output
 NULL

 -- !query 23
-select avg(CAST(null AS int)) from range(1,4)
+select avg(udf(CAST(null AS int))) from range(1,4)
 -- !query 23 schema
-struct<avg(CAST(NULL AS INT)):double>
+struct<avg(CAST(udf(cast(cast(null as int) as string)) AS INT)):double>
 -- !query 23 output
 NULL

 -- !query 24
-select avg(CAST(null AS long)) from range(1,4)
+select avg(udf(CAST(null AS long))) from range(1,4)
 -- !query 24 schema
-struct<avg(CAST(NULL AS BIGINT)):double>
+struct<avg(CAST(udf(cast(cast(null as bigint) as string)) AS BIGINT)):double>
 -- !query 24 output
 NULL

 -- !query 25
-select avg(CAST(null AS Decimal(38,0))) from range(1,4)
+select avg(udf(CAST(null AS Decimal(38,0)))) from range(1,4)
 -- !query 25 schema
-struct<avg(CAST(NULL AS DECIMAL(38,0))):decimal(38,4)>
+struct<avg(CAST(udf(cast(cast(null as decimal(38,0)) as string)) AS DECIMAL(38,0))):decimal(38,4)>
 -- !query 25 output
 NULL

 -- !query 26
-select avg(CAST(null AS DOUBLE)) from range(1,4)
+select avg(udf(CAST(null AS DOUBLE))) from range(1,4)
 -- !query 26 schema
-struct<avg(CAST(NULL AS DOUBLE)):double>
+struct<avg(CAST(udf(cast(cast(null as double) as string)) AS DOUBLE)):double>
 -- !query 26 output
 NULL

 -- !query 27
-select sum(CAST('NaN' AS DOUBLE)) from range(1,4)
+select sum(CAST(udf('NaN') AS DOUBLE)) from range(1,4)
 -- !query 27 schema
-struct<sum(CAST(NaN AS DOUBLE)):double>
+struct<sum(CAST(CAST(udf(cast(NaN as string)) AS STRING) AS DOUBLE)):double>
 -- !query 27 output
 NaN

 -- !query 28
-select avg(CAST('NaN' AS DOUBLE)) from range(1,4)
+select avg(CAST(udf('NaN') AS DOUBLE)) from range(1,4)
 -- !query 28 schema
-struct<avg(CAST(NaN AS DOUBLE)):double>
+struct<avg(CAST(CAST(udf(cast(NaN as string)) AS STRING) AS DOUBLE)):double>
 -- !query 28 output
 NaN

 -- !query 30
-SELECT avg(CAST(x AS DOUBLE)), var_pop(CAST(x AS DOUBLE))
+SELECT avg(CAST(udf(x) AS DOUBLE)), var_pop(CAST(udf(x) AS DOUBLE))
 FROM (VALUES ('Infinity'), ('1')) v(x)
 -- !query 30 schema
-struct<avg(CAST(x AS DOUBLE)):double,var_pop(CAST(x AS DOUBLE)):double>
+struct<avg(CAST(CAST(udf(cast(x as string)) AS STRING) AS DOUBLE)):double,var_pop(CAST(CAST(udf(cast(x as string)) AS STRING) AS DOUBLE)):double>
 -- !query 30 output
 Infinity NaN

 -- !query 31
-SELECT avg(CAST(x AS DOUBLE)), var_pop(CAST(x AS DOUBLE))
+SELECT avg(CAST(udf(x) AS DOUBLE)), var_pop(CAST(udf(x) AS DOUBLE))
 FROM (VALUES ('Infinity'), ('Infinity')) v(x)
 -- !query 31 schema
-struct<avg(CAST(x AS DOUBLE)):double,var_pop(CAST(x AS DOUBLE)):double>
+struct<avg(CAST(CAST(udf(cast(x as string)) AS STRING) AS DOUBLE)):double,var_pop(CAST(CAST(udf(cast(x as string)) AS STRING) AS DOUBLE)):double>
 -- !query 31 output
 Infinity NaN

 -- !query 32
-SELECT avg(CAST(x AS DOUBLE)), var_pop(CAST(x AS DOUBLE))
+SELECT avg(CAST(udf(x) AS DOUBLE)), var_pop(CAST(udf(x) AS DOUBLE))
 FROM (VALUES ('-Infinity'), ('Infinity')) v(x)
 -- !query 32 schema
-struct<avg(CAST(x AS DOUBLE)):double,var_pop(CAST(x AS DOUBLE)):double>
+struct<avg(CAST(CAST(udf(cast(x as string)) AS STRING) AS DOUBLE)):double,var_pop(CAST(CAST(udf(cast(x as string)) AS STRING) AS DOUBLE)):double>
 -- !query 32 output
 NaN NaN

 -- !query 33
-SELECT avg(CAST(x AS DOUBLE)), var_pop(CAST(x AS DOUBLE))
+SELECT avg(udf(CAST(x AS DOUBLE))), udf(var_pop(CAST(x AS DOUBLE)))
 FROM (VALUES (100000003), (100000004), (100000006), (100000007)) v(x)
 -- !query 33 schema
-struct<avg(CAST(x AS DOUBLE)):double,var_pop(CAST(x AS DOUBLE)):double>
+struct<avg(CAST(udf(cast(cast(x as double) as string)) AS DOUBLE)):double,CAST(udf(cast(var_pop(cast(x as double)) as string)) AS DOUBLE):double>
 -- !query 33 output
 1.00000005E8 2.5

 -- !query 34
-SELECT avg(CAST(x AS DOUBLE)), var_pop(CAST(x AS DOUBLE))
+SELECT avg(udf(CAST(x AS DOUBLE))), udf(var_pop(CAST(x AS DOUBLE)))
 FROM (VALUES (7000000000005), (7000000000007)) v(x)
 -- !query 34 schema
-struct<avg(CAST(x AS DOUBLE)):double,var_pop(CAST(x AS DOUBLE)):double>
+struct<avg(CAST(udf(cast(cast(x as double) as string)) AS DOUBLE)):double,CAST(udf(cast(var_pop(cast(x as double)) as string)) AS DOUBLE):double>
 -- !query 34 output
 7.000000000006E12 1.0

 -- !query 35
-SELECT covar_pop(b, a), covar_samp(b, a) FROM aggtest
+SELECT udf(covar_pop(b, udf(a))), covar_samp(udf(b), a) FROM aggtest
 -- !query 35 schema
-struct<covar_pop(CAST(b AS DOUBLE), CAST(a AS DOUBLE)):double,covar_samp(CAST(b AS DOUBLE), CAST(a AS DOUBLE)):double>
+struct<CAST(udf(cast(covar_pop(cast(b as double), cast(cast(udf(cast(a as string)) as int) as double)) as string)) AS DOUBLE):double,covar_samp(CAST(CAST(udf(cast(b as string)) AS FLOAT) AS DOUBLE), CAST(a AS DOUBLE)):double>
 -- !query 35 output
 653.6289553875104 871.5052738500139

 -- !query 36
-SELECT corr(b, a) FROM aggtest
+SELECT corr(b, udf(a)) FROM aggtest
 -- !query 36 schema
-struct<corr(CAST(b AS DOUBLE), CAST(a AS DOUBLE)):double>
+struct<corr(CAST(b AS DOUBLE), CAST(CAST(udf(cast(a as string)) AS INT) AS DOUBLE)):double>
 -- !query 36 output
 0.1396345165178734

 -- !query 37
-SELECT count(four) AS cnt_1000 FROM onek
+SELECT count(udf(four)) AS cnt_1000 FROM onek
 -- !query 37 schema
 struct<cnt_1000:bigint>
 -- !query 37 output
@@ -313,7 +314,7 @@ struct<cnt_1000:bigint>
 -- !query 38
-SELECT count(DISTINCT four) AS cnt_4 FROM onek
+SELECT udf(count(DISTINCT four)) AS cnt_4 FROM onek
 -- !query 38 schema
 struct<cnt_4:bigint>
 -- !query 38 output
@@ -321,10 +322,10 @@ struct<cnt_4:bigint>
 -- !query 39
-select ten, count(*), sum(four) from onek
+select ten, udf(count(*)), sum(udf(four)) from onek
 group by ten order by ten
 -- !query 39 schema
-struct<ten:int,count(1):bigint,sum(four):bigint>
+struct<ten:int,CAST(udf(cast(count(1) as string)) AS BIGINT):bigint,sum(CAST(udf(cast(four as string)) AS INT)):bigint>
 -- !query 39 output
 0 100 100
 1 100 200
@@ -339,10 +340,10 @@ struct<ten:int,count(1):bigint,sum(four):bigint>
 -- !query 40
-select ten, count(four), sum(DISTINCT four) from onek
+select ten, count(udf(four)), udf(sum(DISTINCT four)) from onek
 group by ten order by ten
 -- !query 40 schema
-struct<ten:int,count(four):bigint,sum(DISTINCT four):bigint>
+struct<ten:int,count(CAST(udf(cast(four as string)) AS INT)):bigint,CAST(udf(cast(sum(distinct cast(four as bigint)) as string)) AS BIGINT):bigint>
 -- !query 40 output
 0 100 2
 1 100 4
@@ -357,11 +358,11 @@ struct<ten:int,count(four):bigint,sum(DISTINCT four):bigint>
 -- !query 41
-select ten, sum(distinct four) from onek a
+select ten, udf(sum(distinct four)) from onek a
 group by ten
-having exists (select 1 from onek b where sum(distinct a.four) = b.four)
+having exists (select 1 from onek b where udf(sum(distinct a.four)) = b.four)
 -- !query 41 schema
-struct<ten:int,sum(DISTINCT four):bigint>
+struct<ten:int,CAST(udf(cast(sum(distinct cast(four as bigint)) as string)) AS BIGINT):bigint>
 -- !query 41 output
 0 2
 2 2
@@ -374,23 +375,23 @@ struct<ten:int,sum(DISTINCT four):bigint>
 select ten, sum(distinct four) from onek a
 group by ten
 having exists (select 1 from onek b
-               where sum(distinct a.four + b.four) = b.four)
+               where sum(distinct a.four + b.four) = udf(b.four))
 -- !query 42 schema
 struct<>
 -- !query 42 output
 org.apache.spark.sql.AnalysisException
 Aggregate/Window/Generate expressions are not valid in where clause of the query.
-Expression in where clause: [(sum(DISTINCT CAST((outer() + b.`four`) AS BIGINT)) = CAST(b.`four` AS BIGINT))]
+Expression in where clause: [(sum(DISTINCT CAST((outer() + b.`four`) AS BIGINT)) = CAST(CAST(udf(cast(four as string)) AS INT) AS BIGINT))]
 Invalid expressions: [sum(DISTINCT CAST((outer() + b.`four`) AS BIGINT))];

 -- !query 43
 select
-  (select max((select i.unique2 from tenk1 i where i.unique1 = o.unique1)))
+  (select udf(max((select i.unique2 from tenk1 i where i.unique1 = o.unique1))))
 from tenk1 o
 -- !query 43 schema
 struct<>
 -- !query 43 output
 org.apache.spark.sql.AnalysisException
-cannot resolve '`o.unique1`' given input columns: [i.even, i.fivethous, i.four, i.hundred, i.odd, i.string4, i.stringu1, i.stringu2, i.ten, i.tenthous, i.thousand, i.twenty, i.two, i.twothousand, i.unique1, i.unique2]; line 2 pos 63
+cannot resolve '`o.unique1`' given input columns: [i.even, i.fivethous, i.four, i.hundred, i.odd, i.string4, i.stringu1, i.stringu2, i.ten, i.tenthous, i.thousand, i.twenty, i.two, i.twothousand, i.unique1, i.unique2]; line 2 pos 67
```

</p>
</details>

## How was this patch tested?

Manually tested.

Closes apache#25130 from HyukjinKwon/SPARK-28359.

Authored-by: HyukjinKwon <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>
1 parent 66179fa commit 28774cd

17 files changed: +293 -160 lines changed

sql/core/src/main/scala/org/apache/spark/sql/execution/python/UserDefinedPythonFunction.scala

Lines changed: 1 addition & 1 deletion
@@ -32,7 +32,7 @@ case class UserDefinedPythonFunction(
     pythonEvalType: Int,
     udfDeterministic: Boolean) {

-  def builder(e: Seq[Expression]): PythonUDF = {
+  def builder(e: Seq[Expression]): Expression = {
     PythonUDF(name, func, dataType, e, pythonEvalType, udfDeterministic)
   }


sql/core/src/test/resources/sql-tests/inputs/udf/pgSQL/udf-aggregates_part1.sql

Lines changed: 21 additions & 23 deletions
@@ -9,12 +9,10 @@
 -- SET extra_float_digits = 0;

 -- This test file was converted from pgSQL/aggregates_part1.sql.
--- Note that currently registered UDF returns a string. So there are some differences, for instance
--- in string cast within UDF in Scala and Python.

-SELECT CAST(avg(udf(four)) AS decimal(10,3)) AS avg_1 FROM onek;
+SELECT avg(udf(four)) AS avg_1 FROM onek;

-SELECT CAST(udf(avg(a)) AS decimal(10,3)) AS avg_32 FROM aggtest WHERE a < 100;
+SELECT udf(avg(a)) AS avg_32 FROM aggtest WHERE a < 100;

 -- In 7.1, avg(float4) is computed using float8 arithmetic.
 -- Round the result to 3 digits to avoid platform-specific results.
@@ -23,32 +21,32 @@ select CAST(avg(udf(b)) AS Decimal(10,3)) AS avg_107_943 FROM aggtest;
 -- `student` has a column with data type POINT, which is not supported by Spark [SPARK-27766]
 -- SELECT avg(gpa) AS avg_3_4 FROM ONLY student;

-SELECT CAST(sum(udf(four)) AS int) AS sum_1500 FROM onek;
+SELECT sum(udf(four)) AS sum_1500 FROM onek;
 SELECT udf(sum(a)) AS sum_198 FROM aggtest;
-SELECT CAST(udf(udf(sum(b))) AS decimal(10,3)) AS avg_431_773 FROM aggtest;
+SELECT udf(udf(sum(b))) AS avg_431_773 FROM aggtest;
 -- `student` has a column with data type POINT, which is not supported by Spark [SPARK-27766]
 -- SELECT sum(gpa) AS avg_6_8 FROM ONLY student;

 SELECT udf(max(four)) AS max_3 FROM onek;
-SELECT max(CAST(udf(a) AS int)) AS max_100 FROM aggtest;
-SELECT CAST(udf(udf(max(aggtest.b))) AS decimal(10,3)) AS max_324_78 FROM aggtest;
+SELECT max(udf(a)) AS max_100 FROM aggtest;
+SELECT udf(udf(max(aggtest.b))) AS max_324_78 FROM aggtest;
 -- `student` has a column with data type POINT, which is not supported by Spark [SPARK-27766]
 -- SELECT max(student.gpa) AS max_3_7 FROM student;

-SELECT CAST(stddev_pop(udf(b)) AS decimal(10,3)) FROM aggtest;
-SELECT CAST(udf(stddev_samp(b)) AS decimal(10,3)) FROM aggtest;
-SELECT CAST(var_pop(udf(b)) AS decimal(10,3)) FROM aggtest;
-SELECT CAST(udf(var_samp(b)) AS decimal(10,3)) FROM aggtest;
+SELECT stddev_pop(udf(b)) FROM aggtest;
+SELECT udf(stddev_samp(b)) FROM aggtest;
+SELECT var_pop(udf(b)) FROM aggtest;
+SELECT udf(var_samp(b)) FROM aggtest;

-SELECT CAST(udf(stddev_pop(CAST(b AS Decimal(38,0)))) AS decimal(10,3)) FROM aggtest;
-SELECT CAST(stddev_samp(CAST(udf(b) AS Decimal(38,0))) AS decimal(10,3)) FROM aggtest;
-SELECT CAST(udf(var_pop(CAST(b AS Decimal(38,0)))) AS decimal(10,3)) FROM aggtest;
-SELECT CAST(var_samp(udf(CAST(b AS Decimal(38,0)))) AS decimal(10,3)) FROM aggtest;
+SELECT udf(stddev_pop(CAST(b AS Decimal(38,0)))) FROM aggtest;
+SELECT stddev_samp(CAST(udf(b) AS Decimal(38,0))) FROM aggtest;
+SELECT udf(var_pop(CAST(b AS Decimal(38,0)))) FROM aggtest;
+SELECT var_samp(udf(CAST(b AS Decimal(38,0)))) FROM aggtest;

 -- population variance is defined for a single tuple, sample variance
 -- is not
-SELECT CAST(udf(var_pop(1.0)) AS int), var_samp(udf(2.0));
-SELECT CAST(stddev_pop(udf(CAST(3.0 AS Decimal(38,0)))) AS int), stddev_samp(CAST(udf(4.0) AS Decimal(38,0)));
+SELECT udf(var_pop(1.0)), var_samp(udf(2.0));
+SELECT stddev_pop(udf(CAST(3.0 AS Decimal(38,0)))), stddev_samp(CAST(udf(4.0) AS Decimal(38,0)));


 -- verify correct results for null and NaN inputs
@@ -76,9 +74,9 @@ FROM (VALUES ('-Infinity'), ('Infinity')) v(x);


 -- test accuracy with a large input offset
-SELECT CAST(avg(udf(CAST(x AS DOUBLE))) AS int), CAST(udf(var_pop(CAST(x AS DOUBLE))) AS decimal(10,3))
+SELECT avg(udf(CAST(x AS DOUBLE))), udf(var_pop(CAST(x AS DOUBLE)))
 FROM (VALUES (100000003), (100000004), (100000006), (100000007)) v(x);
-SELECT CAST(avg(udf(x)) AS long), CAST(udf(var_pop(CAST(x AS DOUBLE))) AS decimal(10,3))
+SELECT avg(udf(CAST(x AS DOUBLE))), udf(var_pop(CAST(x AS DOUBLE)))
 FROM (VALUES (7000000000005), (7000000000007)) v(x);

 -- SQL2003 binary aggregates [SPARK-23907]
@@ -89,8 +87,8 @@ FROM (VALUES (7000000000005), (7000000000007)) v(x);
 -- SELECT regr_avgx(b, a), regr_avgy(b, a) FROM aggtest;
 -- SELECT regr_r2(b, a) FROM aggtest;
 -- SELECT regr_slope(b, a), regr_intercept(b, a) FROM aggtest;
-SELECT CAST(udf(covar_pop(b, udf(a))) AS decimal(10,3)), CAST(covar_samp(udf(b), a) as decimal(10,3)) FROM aggtest;
-SELECT CAST(corr(b, udf(a)) AS decimal(10,3)) FROM aggtest;
+SELECT udf(covar_pop(b, udf(a))), covar_samp(udf(b), a) FROM aggtest;
+SELECT corr(b, udf(a)) FROM aggtest;


 -- test accum and combine functions directly [SPARK-23907]
@@ -122,7 +120,7 @@ SELECT CAST(corr(b, udf(a)) AS decimal(10,3)) FROM aggtest;
 SELECT count(udf(four)) AS cnt_1000 FROM onek;
 SELECT udf(count(DISTINCT four)) AS cnt_4 FROM onek;

-select ten, udf(count(*)), CAST(sum(udf(four)) AS int) from onek
+select ten, udf(count(*)), sum(udf(four)) from onek
 group by ten order by ten;

 select ten, count(udf(four)), udf(sum(DISTINCT four)) from onek
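Several of the rounding casts removed in this file relate to the float limitation mentioned in the commit message. As a rough illustration in plain Python (standard library only, not Spark code), a 32-bit float becomes a 64-bit Python `float` once it crosses into Python, and stringifying it there exposes digits that the 32-bit value never displayed. The value `324.78` is borrowed from the `max_324_78` alias and is an assumption about the underlying data; `widen_float32` is a hypothetical helper.

```python
import struct


def widen_float32(value):
    """Round a number to 32-bit precision, then read it back as a 64-bit Python float."""
    return struct.unpack("<f", struct.pack("<f", value))[0]


# What Python would see after receiving a 32-bit float holding roughly 324.78.
as_seen_by_python = widen_float32(324.78)

# Old flow: stringify inside Python, so the float32 rounding error shows up.
stringified_in_python = str(as_seen_by_python)

# New flow: the text is produced before entering Python and passes through as is.
stringified_before_python = "324.78"
passed_through = str(stringified_before_python)

print(stringified_in_python)  # more digits than '324.78'
print(passed_through)         # 324.78
```

With the string produced before the value reaches Python, there is no widening step left to perturb the result, which is presumably why the `decimal(10,3)` workarounds could be dropped.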

sql/core/src/test/resources/sql-tests/inputs/udf/pgSQL/udf-aggregates_part2.sql

Lines changed: 0 additions & 2 deletions
@@ -6,8 +6,6 @@
 -- https://github.com/postgres/postgres/blob/REL_12_BETA2/src/test/regress/sql/aggregates.sql#L145-L350
 --
 -- This test file was converted from pgSQL/aggregates_part2.sql.
--- Note that currently registered UDF returns a string. So there are some differences, for instance
--- in string cast within UDF in Scala and Python.

 create temporary view int4_tbl as select * from values
   (0),

sql/core/src/test/resources/sql-tests/inputs/udf/pgSQL/udf-case.sql

Lines changed: 4 additions & 6 deletions
@@ -7,8 +7,6 @@
 -- Test the CASE statement
 --
 -- This test file was converted from pgSQL/case.sql.
--- Note that currently registered UDF returns a string. So there are some differences, for instance
--- in string cast within UDF in Scala and Python.

 CREATE TABLE CASE_TBL (
   i integer,
@@ -38,7 +36,7 @@ INSERT INTO CASE2_TBL VALUES (NULL, -6);

 SELECT '3' AS `One`,
   CASE
-    WHEN CAST(udf(1 < 2) AS boolean) THEN 3
+    WHEN udf(1 < 2) THEN 3
   END AS `Simple WHEN`;

 SELECT '<NULL>' AS `One`,
@@ -60,7 +58,7 @@ SELECT udf('4') AS `One`,

 SELECT udf('6') AS `One`,
   CASE
-    WHEN CAST(udf(1 > 2) AS boolean) THEN 3
+    WHEN udf(1 > 2) THEN 3
     WHEN udf(4) < 5 THEN 6
     ELSE 7
   END AS `Two WHEN with default`;
@@ -70,7 +68,7 @@ SELECT '7' AS `None`,
   END AS `NULL on no matches`;

 -- Constant-expression folding shouldn't evaluate unreachable subexpressions
-SELECT CASE WHEN CAST(udf(1=0) AS boolean) THEN 1/0 WHEN 1=1 THEN 1 ELSE 2/0 END;
+SELECT CASE WHEN udf(1=0) THEN 1/0 WHEN 1=1 THEN 1 ELSE 2/0 END;
 SELECT CASE 1 WHEN 0 THEN 1/udf(0) WHEN 1 THEN 1 ELSE 2/0 END;

 -- [SPARK-27923] PostgreSQL throws an exception but Spark SQL is NULL
@@ -142,7 +140,7 @@ SELECT udf('') AS Five, NULLIF(a.i,b.i) AS `NULLIF(a.i,b.i)`,

 SELECT '' AS `Two`, *
   FROM CASE_TBL a, CASE2_TBL b
-  WHERE CAST(udf(COALESCE(f,b.i) = 2) AS boolean);
+  WHERE udf(COALESCE(f,b.i) = 2);

 -- We don't support update now.
 --
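The explicit `CAST(... AS boolean)` wrappers in this file are no longer needed because the wrapped UDF hands back a value of the input's type rather than a string. A minimal simulation of that idea in plain Python follows; `cast_to_string`, `cast_from_string`, and `wrapped_udf` are hypothetical helpers standing in for the JVM-side casts, and the boolean handling is a deliberate simplification of real cast rules.

```python
def cast_to_string(value):
    """Hypothetical stand-in for the cast applied before the UDF runs."""
    if value is None:
        return None  # NULL stays NULL under a cast
    if isinstance(value, bool):
        return "true" if value else "false"
    return str(value)


def cast_from_string(text, target_type):
    """Hypothetical stand-in for the cast applied to the UDF's output."""
    if text is None:
        return None
    if target_type is bool:
        return text == "true"  # simplified: only 'true' counts as true here
    return target_type(text)


def noop_udf(text):
    """The UDF body itself: returns its string input unchanged."""
    return text


def wrapped_udf(value):
    """Cast in, call the no-op UDF, cast back to the input's type."""
    return cast_from_string(noop_udf(cast_to_string(value)), type(value))


# Rough analogue of: CASE WHEN udf(1 < 2) THEN 3 END
simple_when = 3 if wrapped_udf(1 < 2) else None

# Rough analogue of: CASE WHEN udf(1 > 2) THEN 3 WHEN udf(4) < 5 THEN 6 ELSE 7 END
two_when = 3 if wrapped_udf(1 > 2) else (6 if wrapped_udf(4) < 5 else 7)

print(simple_when, two_when)  # 3 6
```

Since `wrapped_udf` preserves the input type, the result can be used directly as a condition or compared with an integer, mirroring the simplified queries above.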

sql/core/src/test/resources/sql-tests/inputs/udf/udf-having.sql

Lines changed: 0 additions & 2 deletions
@@ -1,6 +1,4 @@
 -- This test file was converted from having.sql.
--- Note that currently registered UDF returns a string. So there are some differences, for instance
--- in string cast within UDF in Scala and Python.

 create temporary view hav as select * from values
   ("one", 1),

sql/core/src/test/resources/sql-tests/inputs/udf/udf-natural-join.sql

Lines changed: 0 additions & 2 deletions
@@ -4,8 +4,6 @@
 --SET spark.sql.autoBroadcastJoinThreshold=-1,spark.sql.join.preferSortMergeJoin=false

 -- This test file was converted from natural-join.sql.
--- Note that currently registered UDF returns a string. So there are some differences, for instance
--- in string cast within UDF in Scala and Python.

 create temporary view nt1 as select * from values
   ("one", 1),
Lines changed: 8 additions & 0 deletions
@@ -0,0 +1,8 @@
+-- This file tests special values such as NaN, Infinity and NULL.
+
+SELECT udf(x) FROM (VALUES (1), (2), (NULL)) v(x);
+SELECT udf(x) FROM (VALUES ('A'), ('B'), (NULL)) v(x);
+SELECT udf(x) FROM (VALUES ('NaN'), ('1'), ('2')) v(x);
+SELECT udf(x) FROM (VALUES ('Infinity'), ('1'), ('2')) v(x);
+SELECT udf(x) FROM (VALUES ('-Infinity'), ('1'), ('2')) v(x);
+SELECT udf(x) FROM (VALUES 0.00000001, 0.00000002, 0.00000003) v(x);
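The commit does not say why the last query uses very small literals. One plausible reading, offered here as an assumption, is that such values are sensitive to where stringification happens. The sketch below uses only the Python standard library (no Spark) to show that the text Python produces for the same small number depends on the type it arrives as, while a string that merely passes through is unaffected.

```python
from decimal import Decimal

# The same small literal, arriving in Python as two different types.
small_float = 0.00000001
small_decimal = Decimal("0.00000001")

# Stringifying inside Python gives type-dependent scientific notation.
float_text = str(small_float)      # '1e-08'
decimal_text = str(small_decimal)  # '1E-8'

# A value that was turned into text before reaching Python is returned verbatim.
original_text = "0.00000001"
passed_through = str(original_text)

# The numeric value survives either way; only the text form differs.
same_value = Decimal(decimal_text) == Decimal(passed_through)

print(float_text, decimal_text, passed_through, same_value)
```

Keeping the text form out of Python's hands avoids having to reason about which of these renderings a given UDF variant would emit.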
