Description
Hi, I was testing my code with the new DBR 17.3 LTS ML runtime and stumbled upon the following issue in summarizeToDF(). I will first describe the issue in general, then how it affects data created by dbldatagen, and finally the workarounds I tried.
I hope someone knows what the issue is and how we can solve this. Thanks!
General issue
Somewhere between pyspark 3.5.0 and 4.0.0, the behaviour of pyspark.sql.DataFrame.union() changed with respect to data types. Notice the following:
```python
df1 = spark_session.createDataFrame([("3", "C"), ("4", "D")], schema="id:string, name:string")
df2 = spark_session.createDataFrame([(1, "A"), (2, "B")], schema="id:int, name:string")
```
In pyspark==3.5.0, the schema of `df1.union(df2)` is `DataFrame[id: string, name: string]`. In pyspark==4.0.0, however, the schema of `df1.union(df2)` is `DataFrame[id: bigint, name: string]`.
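A quick way to see the difference (using the `df1`/`df2` from above) is to inspect the schema of the union directly:

```python
# On pyspark==3.5.0 this prints: struct<id:string,name:string>
# On pyspark==4.0.0 this prints: struct<id:bigint,name:string>
print(df1.union(df2).schema.simpleString())
```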
How does this affect dbldatagen?
In `data_analyzer.DataAnalyzer.summarizeToDF()`, when the `print_len_min` measure is added via `_addMeasureToSummary()`, the schema of `dfDataSummary` changes from all strings to a mix of dtypes (depending, of course, on the input data).
My input data is the following:
```python
test_data = [
    (1, "A", 0.0, 1),
    (2, "B", 2.5, 2),
    (3, "C", 3.0, 3),
    (4, "D", None, 0),
]
df = spark.createDataFrame(test_data, schema="id:int, name:string, value_flt:double, value_int:int")
```
In pyspark 3.5.0:
- Before adding `print_len_min`, the schema of `dfDataSummary` is `DataFrame[measure_: string, summary_: string, id: string, name: string, value_flt: string, value_int: string]`.
- After adding it, the schema is unchanged: `DataFrame[measure_: string, summary_: string, id: string, name: string, value_flt: string, value_int: string]`.
In pyspark 4.0.0:
- Before adding `print_len_min`, the schema of `dfDataSummary` is `DataFrame[measure_: string, summary_: string, id: string, name: string, value_flt: string, value_int: string]`.
- After adding it, the schema becomes `DataFrame[measure_: string, summary_: string, id: bigint, name: bigint, value_flt: bigint, value_int: bigint]` (reproduced in the sketch below).
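I believe the mechanism can be reproduced in isolation by unioning an all-string summary frame with a measure row whose per-column expressions are numeric. This is only a sketch of what I think `_addMeasureToSummary()` effectively does, not the actual dbldatagen internals; the column expressions are my guesses:

```python
from pyspark.sql import functions as F

# All-string summary frame, as built by the earlier measures; the 'dtype'
# row holds values like 'int' and 'string' that are only valid as strings
dfSummary = spark.createDataFrame(
    [("dtype", "", "int", "string", "double", "int")],
    schema="measure_:string, summary_:string, id:string, name:string, value_flt:string, value_int:string",
)

# A measure row with numeric per-column expressions (min of printed lengths)
dfMeasure = df.select(
    F.lit("print_len_min").alias("measure_"),
    F.lit("").alias("summary_"),
    *[F.min(F.length(F.col(c).cast("string"))).alias(c)
      for c in ["id", "name", "value_flt", "value_int"]],
)

# pyspark==3.5.0 resolves the mixed columns to string; pyspark==4.0.0
# resolves them to bigint, which later makes the cast of 'int' fail
print(dfSummary.union(dfMeasure).schema.simpleString())
```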
Resulting issue
When trying to collect() this DataFrame (or performing other operations on it), the following error is raised:

pyspark.errors.exceptions.captured.NumberFormatException: [CAST_INVALID_INPUT] The value 'int' of the type "STRING" cannot be cast to "BIGINT" because it is malformed. Correct the value as per the syntax, or change its target type. Use `try_cast` to tolerate malformed input and return NULL instead.
Because the error also occurs on collect(), I cannot inspect the contents of the DataFrame.
Tried workarounds
- Manually casting or `try_cast`-ing the output columns of `summarizeToDF()` to string dtype yielded the same issue (see the sketch below).
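For reference, the failing workaround looked roughly like this (the `DataAnalyzer` construction is from memory, so treat it as a sketch):

```python
import dbldatagen as dg
from pyspark.sql import functions as F

analyzer = dg.DataAnalyzer(sparkSession=spark, df=df)
dfDataSummary = analyzer.summarizeToDF()

# Casting (or try_cast-ing) every output column back to string still raises
# CAST_INVALID_INPUT: the bigint resolution already happened inside the
# union in summarizeToDF(), so the bad cast sits below this select
dfDataSummary.select(
    [F.col(c).cast("string").alias(c) for c in dfDataSummary.columns]
).collect()
```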