Commit ec944cc
[SPARK-53659][SQL] Infer Variant shredding schema when writing to Parquet
### What changes were proposed in this pull request?
When writing Variant to Parquet, we want the shredding schema to adapt to the data being written on a per-file basis. This PR adds a new output writer that buffers the first few rows before starting the write, then uses the content of those rows to determine a shredding schema, and only then creates the Parquet writer with that schema.
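The buffer-then-write flow described above can be sketched roughly as follows. This is a simplified illustration in Python, not the Scala writer added by this PR: `BufferingWriter`, `ListWriter`, `infer_schema`, `make_writer`, and `buffer_size` are all hypothetical names, and the in-memory `ListWriter` merely stands in for a real Parquet writer.

```python
class ListWriter:
    """Hypothetical stand-in for a Parquet writer: records rows in memory."""

    def __init__(self, schema):
        self.schema = schema
        self.rows = []
        self.closed = False

    def write(self, row):
        self.rows.append(row)

    def close(self):
        self.closed = True


class BufferingWriter:
    """Delays creating the real writer until a sample of rows has been seen."""

    def __init__(self, infer_schema, make_writer, buffer_size=4):
        self._infer_schema = infer_schema  # callable: list of rows -> schema
        self._make_writer = make_writer    # callable: schema -> writer
        self._buffer_size = buffer_size    # assumed sample size, not Spark's
        self._buffer = []
        self._writer = None

    def write(self, row):
        if self._writer is not None:
            # Schema is already fixed for this file: pass rows straight through.
            self._writer.write(row)
            return
        self._buffer.append(row)
        if len(self._buffer) >= self._buffer_size:
            self._start()

    def _start(self):
        # Infer the schema from the buffered rows, then create the real writer
        # and replay the buffer into it.
        schema = self._infer_schema(self._buffer)
        self._writer = self._make_writer(schema)
        for row in self._buffer:
            self._writer.write(row)
        self._buffer = []

    def close(self):
        # A file with fewer rows than the buffer size still needs a writer.
        if self._writer is None:
            self._start()
        self._writer.close()
```

Because the schema is chosen when the writer is created, each output file can end up with a different shredding schema, which matches the per-file adaptation the description asks for.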
The heuristics for determining the shredding schema are currently fairly simple: if a field appears consistently with a consistent type, we create `value` and `typed_value`, and if it appears with an inconsistent type, we only create `value`. We drop fields that occur in less than 10% of sampled rows, and have an upper bound of 300 total fields (counting `value` and `typed_value` separately) to avoid creating excessively wide Parquet schemas, which can cause performance issues.
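As a rough illustration of these heuristics, the sketch below applies them to top-level fields only, with sampled rows modelled as plain Python dicts. It is an assumption-laden simplification rather than the PR's implementation: the function name `infer_shredding_schema` and its parameters are hypothetical, Python type names stand in for Variant types, and nested objects and arrays are ignored.

```python
from collections import Counter, defaultdict


def infer_shredding_schema(rows, min_fraction=0.10, max_fields=300):
    """Map each kept top-level field to the shredded columns it would get."""
    counts = Counter()
    types = defaultdict(set)
    for row in rows:
        for name, value in row.items():
            counts[name] += 1
            types[name].add(type(value).__name__)

    schema = {}
    budget = max_fields
    # Visit the most frequent fields first (ties broken by name) so that the
    # field cap, if reached, cuts off the rarest fields.
    for name, count in sorted(counts.items(), key=lambda kv: (-kv[1], kv[0])):
        if count < min_fraction * len(rows):
            continue  # appears in fewer than 10% of sampled rows: drop it
        if len(types[name]) == 1:
            columns = ["value", "typed_value"]  # consistent type
        else:
            columns = ["value"]                 # inconsistent type
        if len(columns) > budget:
            break  # stop once the total-field budget is exhausted
        schema[name] = columns
        budget -= len(columns)
    return schema
```

In this sketch `value` and `typed_value` each count against the 300-field budget, mirroring the "counting `value` and `typed_value` separately" rule above.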
### Why are the changes needed?
Allows Spark to make use of the [Variant shredding spec](https://github.com/apache/parquet-format/blob/master/VariantShredding.md) without requiring the user to manually set a shredding schema.
### Does this PR introduce _any_ user-facing change?
Only if `spark.sql.variant.inferShreddingSchema` and `spark.sql.variant.writeShredding.enabled` are both set to true. Both are currently false by default.
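For illustration, enabling both flags from PySpark might look like the snippet below. It assumes a Spark build that includes this commit; the output path `/tmp/variant_shredding_demo` and the sample JSON are hypothetical.

```python
from pyspark.sql import SparkSession

# Both flags default to false, so they must be enabled explicitly.
spark = (
    SparkSession.builder
    .config("spark.sql.variant.writeShredding.enabled", "true")
    .config("spark.sql.variant.inferShreddingSchema", "true")
    .getOrCreate()
)

# Write a Variant column to Parquet; with both flags on, the shredding
# schema is expected to be inferred per file from the first rows written.
df = spark.sql("""SELECT parse_json('{"a": 1, "b": "x"}') AS v""")
df.write.mode("overwrite").parquet("/tmp/variant_shredding_demo")
```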
### How was this patch tested?
Unit tests in PR.
### Was this patch authored or co-authored using generative AI tooling?
No.
Closes #52406 from cashmand/variant_shredding_inference.
Lead-authored-by: cashmand <david.cashman@databricks.com>
Co-authored-by: Wenchen Fan <cloud0fan@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
1 parent 65755bd commit ec944cc
File tree
7 files changed, +1204 -1 lines changed
- common/variant/src/main/java/org/apache/spark/types/variant
- sql
  - catalyst/src/main/scala/org/apache/spark/sql/internal
  - core/src
    - main/scala/org/apache/spark/sql/execution/datasources/parquet
    - test/scala/org/apache/spark/sql/execution/datasources/parquet
Lines changed: 5 additions & 0 deletions
@@ -99,6 +99,11 @@ (lines 102-106 added; diff content not captured)
Lines changed: 25 additions & 0 deletions
@@ -5422,6 +5422,31 @@ (lines 5425-5449 added; diff content not captured)