Change in union() in PySpark 4.0+ invalidates dbldatagen #373

@mikeweltevrede

Description

Hi, I was testing my code with the new DBR 17.3 LTS ML runtime and stumbled upon the following issue in summarizeToDF(). I will first describe the issue in general terms, then explain how it affects data created by dbldatagen, and finally list the workarounds I tried.

I hope someone knows what the issue is and how we can solve this. Thanks!

General issue

Somewhere between pyspark 3.5.0 and 4.0.0, the behaviour of pyspark.sql.DataFrame.union() changed with respect to data types. Notice the following:

df1 = spark_session.createDataFrame([("3", "C"), ("4", "D")], schema="id:string, name:string")
df2 = spark_session.createDataFrame([(1, "A"), (2, "B")], schema="id:int, name:string")

In pyspark==3.5.0, the schema of df1.union(df2) is DataFrame[id: string, name: string]. In pyspark==4.0.0, however, it is DataFrame[id: bigint, name: string].
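
As a first check on the behaviour, casting both sides to a common type before the union pins the result schema on both versions. This is a minimal sketch (the variable names are mine); the change may be related to Spark 4.0 enabling ANSI mode by default, which alters the implicit type-coercion rules, though I have not verified that:

from pyspark.sql import functions as F

# Align both sides on string before the union so the new type coercion
# has nothing to widen; df1 and df2 are the frames defined above.
df2_as_str = df2.withColumn("id", F.col("id").cast("string"))
df_union = df1.union(df2_as_str)
df_union.printSchema()  # id remains string on both 3.5.0 and 4.0.0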

How does this affect dbldatagen?

In data_analyzer.DataAnalyzer.summarizeToDF(), when the print_len_min measure is added via _addMeasureToSummary(), the schema of dfDataSummary changes from all strings to a mix of dtypes (depending, of course, on the input data).

My input data is the following:

test_data = [
    (1, "A", 0.0, 1),
    (2, "B", 2.5, 2),
    (3, "C", 3.0, 3),
    (4, "D", None, 0),
]
df = spark.createDataFrame(test_data, schema="id:int, name:string, value_flt:double, value_int:int")
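
To illustrate the mechanism, here is a simplified sketch of what I believe happens inside summarizeToDF(): the existing summary rows are all strings (one row even holds the dtype names such as 'int'), while the new print_len_min row carries integer values, so the union coerces the shared columns. The frame names and literal values below are illustrative stand-ins, not the actual dbldatagen internals:

# String-typed summary rows, including a row that holds the dtype names.
dfSummary = spark.createDataFrame(
    [("schema", "", "int", "string", "double", "int")],
    schema="measure_:string, summary_:string, id:string, name:string, value_flt:string, value_int:string",
)

# A new measure row whose per-column values are integers.
dfMeasure = spark.createDataFrame(
    [("print_len_min", "", 1, 1, 3, 1)],
    schema="measure_:string, summary_:string, id:int, name:int, value_flt:int, value_int:int",
)

combined = dfSummary.union(dfMeasure)
combined.printSchema()  # on 4.0.0: id, name, value_flt, value_int become bigint
combined.collect()      # fails: the literal 'int' cannot be cast to BIGINT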

In Pyspark 3.5.0:

  • Before adding print_len_min, the schema of dfDataSummary is DataFrame[measure_: string, summary_: string, id: string, name: string, value_flt: string, value_int: string].
  • After adding it, the schema is unchanged: DataFrame[measure_: string, summary_: string, id: string, name: string, value_flt: string, value_int: string].

In Pyspark 4.0.0:

  • Before adding print_len_min, the schema of dfDataSummary is DataFrame[measure_: string, summary_: string, id: string, name: string, value_flt: string, value_int: string].
  • After adding it, it becomes DataFrame[measure_: string, summary_: string, id: bigint, name: bigint, value_flt: bigint, value_int: bigint].

Resulting issue

When trying to collect() this DataFrame (or when performing various other operations on it), the following error is raised:

pyspark.errors.exceptions.captured.NumberFormatException: [CAST_INVALID_INPUT] The value 'int' of the type "STRING" cannot be cast to "BIGINT" because it is malformed. Correct the value as per the syntax, or change its target type. Use try_cast to tolerate malformed input and return NULL instead.

Because the error also occurs on collect(), I cannot inspect the contents of the DataFrame.

Tried workarounds

  • Manually casting or try_casting the output columns of summarizeToDF() to the string dtype raised the same error (see the sketch below).
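
For reference, the attempted workaround looked roughly like the sketch below (analyzer construction as in the dbldatagen docs; df is the test frame from above). My reading is that it cannot help: Spark evaluates lazily, so the invalid STRING-to-BIGINT cast is already baked into the union inside summarizeToDF() and fires on execution, before the outer cast is applied:

import dbldatagen as dg
from pyspark.sql import functions as F

analyzer = dg.DataAnalyzer(sparkSession=spark, df=df)
dfDataSummary = analyzer.summarizeToDF()

# Cast every output column back to string (try_cast behaves the same here);
# the plan still contains the bad union, so collect() raises the same
# NumberFormatException.
dfDataSummary = dfDataSummary.select(
    [F.col(c).cast("string").alias(c) for c in dfDataSummary.columns]
)
dfDataSummary.collect()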
