Change in union() in PySpark 4.0+ invalidates dbldatagen #373

@mikeweltevrede

Description

Hi, I was testing my code with the new DBR 17.3 LTS ML runtime and stumbled upon the following issue in summarizeToDF(). I will first describe the issue in general terms, then explain how it affects data created by dbldatagen, and finally list the workarounds I tried.

I hope someone knows what the issue is and how we can solve this. Thanks!

General issue

Somewhere between pyspark 3.5.0 and 4.0.0, the behaviour of pyspark.sql.DataFrame.union() changed with respect to data types. Notice the following:

df1 = spark_session.createDataFrame([("3", "C"), ("4", "D")], schema="id:string, name:string")
df2 = spark_session.createDataFrame([(1, "A"), (2, "B")], schema="id:int, name:string")

In pyspark==3.5.0, the schema of df1.union(df2) is DataFrame[id: string, name: string]. In pyspark==4.0.0, however, it is DataFrame[id: bigint, name: string].
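
As a first check on the behaviour, casting both sides to a common type before the union pins the result schema on both versions. This is a minimal sketch (the variable names are mine); the change may be related to Spark 4.0 enabling ANSI mode by default, which alters the implicit type-coercion rules, though I have not verified that:

from pyspark.sql import functions as F

# Align both sides on string before the union so the new type coercion
# has nothing to widen; df1 and df2 are the frames defined above.
df2_as_str = df2.withColumn("id", F.col("id").cast("string"))
df_union = df1.union(df2_as_str)
df_union.printSchema()  # id remains string on both 3.5.0 and 4.0.0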

How does this affect dbldatagen?

In data_analyzer.DataAnalyzer.summarizeToDF(), when the print_len_min measure is added via _addMeasureToSummary(), the schema of dfDataSummary changes from all strings to a mix of dtypes (depending, of course, on the input data).

My input data is the following:

test_data = [
    (1, "A", 0.0, 1),
    (2, "B", 2.5, 2),
    (3, "C", 3.0, 3),
    (4, "D", None, 0),
]
df = spark.createDataFrame(test_data, schema="id:int, name:string, value_flt:double, value_int:int")
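
To illustrate the mechanism, here is a simplified sketch of what I believe happens inside summarizeToDF(): the existing summary rows are all strings (one row even holds the dtype names such as 'int'), while the new print_len_min row carries integer values, so the union coerces the shared columns. The frame names and literal values below are illustrative stand-ins, not the actual dbldatagen internals:

# String-typed summary rows, including a row that holds the dtype names.
dfSummary = spark.createDataFrame(
    [("schema", "", "int", "string", "double", "int")],
    schema="measure_:string, summary_:string, id:string, name:string, value_flt:string, value_int:string",
)

# A new measure row whose per-column values are integers.
dfMeasure = spark.createDataFrame(
    [("print_len_min", "", 1, 1, 3, 1)],
    schema="measure_:string, summary_:string, id:int, name:int, value_flt:int, value_int:int",
)

combined = dfSummary.union(dfMeasure)
combined.printSchema()  # on 4.0.0: id, name, value_flt, value_int become bigint
combined.collect()      # fails: the literal 'int' cannot be cast to BIGINT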

In Pyspark 3.5.0:

  • Before adding print_len_min, the schema of dfDataSummary is DataFrame[measure_: string, summary_: string, id: string, name: string, value_flt: string, value_int: string].
  • After adding it, the schema is unchanged: DataFrame[measure_: string, summary_: string, id: string, name: string, value_flt: string, value_int: string].

In Pyspark 4.0.0:

  • Before adding print_len_min, the schema of dfDataSummary is DataFrame[measure_: string, summary_: string, id: string, name: string, value_flt: string, value_int: string].
  • After adding it, it becomes DataFrame[measure_: string, summary_: string, id: bigint, name: bigint, value_flt: bigint, value_int: bigint].

Resulting issue

When trying to collect() this DataFrame (or when performing various other operations on it), the following error is raised:

pyspark.errors.exceptions.captured.NumberFormatException: [CAST_INVALID_INPUT] The value 'int' of the type "STRING" cannot be cast to "BIGINT" because it is malformed. Correct the value as per the syntax, or change its target type. Use try_cast to tolerate malformed input and return NULL instead.

Because the error also occurs on collect(), I cannot inspect the contents of the DataFrame.

Tried workarounds

  • Manually casting or try_casting the output columns of summarizeToDF() to the string dtype raised the same error (see the sketch below).
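
For reference, the attempted workaround looked roughly like the sketch below (analyzer construction as in the dbldatagen docs; df is the test frame from above). My reading is that it cannot help: Spark evaluates lazily, so the invalid STRING-to-BIGINT cast is already baked into the union inside summarizeToDF() and fires on execution, before the outer cast is applied:

import dbldatagen as dg
from pyspark.sql import functions as F

analyzer = dg.DataAnalyzer(sparkSession=spark, df=df)
dfDataSummary = analyzer.summarizeToDF()

# Cast every output column back to string (try_cast behaves the same here);
# the plan still contains the bad union, so collect() raises the same
# NumberFormatException.
dfDataSummary = dfDataSummary.select(
    [F.col(c).cast("string").alias(c) for c in dfDataSummary.columns]
)
dfDataSummary.collect()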
