GH-47201: [Python][Parquet] Extending the schema and writing it back does not update Spark schema metadata #47253
base: main
Conversation
Hey @AlenkaF @rok @raulcd!

Thank you for continuing to work on PyArrow @rmnskb !

Hey @AlenkaF !

Hey @rmnskb, thanks for the ping!
I had another look at it today. I also found a part in the documentation that should fit with this use case:

Unfortunately, I wasn't able to make it work with the example from the reported issue. I am still not comfortable adding Spark row group metadata changes in PyArrow, though I agree we should make it easier in PyArrow to handle such cases. Would you be willing to also give a try at the proposed

@pitrou, what do you think about PyArrow updating the Spark row group metadata in cases where PyArrow is used for data manipulation with schema changes in between Spark workloads?
On the face of it, I think this can make sense, especially if the code remains simple, but it should be on the C++ side anyway, not in PyArrow. @wgtmac What do you think?
IMHO, the C++ side does not even know the
Looking at the original PR implementing the Spark schema sanitizer (https://github.com/apache/arrow/pull/1076/files), I would actually agree the proposed fix would fit in PyArrow, if we see it needs to happen on our side. I would still like it to be a bit less hacky, as already mentioned 😊 If I have time, I will try to make the example work with the use of metadata as explained here: https://arrow.apache.org/docs/python/parquet.html#writing-metadata-and-common-metadata-files
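For reference, a minimal sketch of the kind of manual workaround being discussed, assuming the Spark schema is stored under the usual org.apache.spark.sql.parquet.row.metadata schema metadata key (the file paths, column name, and the "long" type string are illustrative, not part of this PR):

```python
import json

import pyarrow as pa
import pyarrow.parquet as pq

# File originally written by Spark (path is illustrative).
table = pq.read_table("data_written_by_spark.parquet")

# Spark keeps its schema as JSON under this schema metadata key.
SPARK_KEY = b"org.apache.spark.sql.parquet.row.metadata"
metadata = dict(table.schema.metadata or {})
spark_schema = json.loads(metadata[SPARK_KEY])

# Extend the Arrow schema, e.g. by appending a new int64 column.
table = table.append_column("new_col", pa.array([1] * len(table), pa.int64()))

# Mirror the change in the Spark schema JSON by hand;
# "long" is Spark's JSON name for a 64-bit integer column.
spark_schema["fields"].append(
    {"name": "new_col", "type": "long", "nullable": True, "metadata": {}}
)
metadata[SPARK_KEY] = json.dumps(spark_schema).encode()

# Re-attach the updated metadata and write the file back.
table = table.replace_schema_metadata(metadata)
pq.write_table(table, "data_for_spark.parquet")
```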
Thank you for the update!
That is great @rmnskb, much appreciated!
@HyukjinKwon, wondering what your take is on this issue, the discussion, and the proposed solution. Any advice/opinion would be most welcome! Also cc @EnricoMi in case this might be interesting.
ack, taking a look now
| "MAP": "map", | ||
| **dict.fromkeys(["DATE32", "DATE64"], "date"), | ||
| "TIMESTAMP": "timestamp", | ||
| "INTERVAL_MONTH_DAY_NANO": "Calendar Interval", # TODO: Correct this |
TBH, I think it's too much to handle Spark-specific cases like this in PyArrow. Spark is even adding more types, such as the variant type and more interval types.
spark_row_metadata["fields"].append({
    "name": field.name,
    "type": _map_spark_to_arrow_types(field.type),
And it seems like we're missing nested types like struct type and array/map.
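For context, Spark's schema JSON describes nested types with structured objects rather than flat type names, which is roughly what a name-only mapping would miss. A sketch of the shape (field names are made up; the layout follows what Spark's StructType.json() produces):

```python
# Roughly how Spark's row metadata JSON encodes nested fields
# (illustrative only; see Spark's StructType.json() output).
nested_fields_example = [
    # array<string>
    {"name": "tags", "nullable": True, "metadata": {},
     "type": {"type": "array", "elementType": "string", "containsNull": True}},
    # map<string, long>
    {"name": "counts", "nullable": True, "metadata": {},
     "type": {"type": "map", "keyType": "string",
              "valueType": "long", "valueContainsNull": True}},
    # struct<lat: double, lon: double>
    {"name": "location", "nullable": True, "metadata": {},
     "type": {"type": "struct", "fields": [
         {"name": "lat", "type": "double", "nullable": True, "metadata": {}},
         {"name": "lon", "type": "double", "nullable": True, "metadata": {}},
     ]}},
]
```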
| "BINARY": "binary", | ||
| "STRING": "string", | ||
| **dict.fromkeys( | ||
| ["DECIMAL" + str(2 ** i) for i in range(5, 9)], "decimal"), |
and decimal precision
),
"MAP": "map",
**dict.fromkeys(["DATE32", "DATE64"], "date"),
"TIMESTAMP": "timestamp",
and timestamp ntz and ltz.
I would rather lean toward documenting it or deprecating
Yeah, I would agree with documenting or even deprecating
Rationale for this change
Please see Issue #47201
What changes are included in this PR?
Add support for updating the Spark-related part of the Parquet file metadata.
Are these changes tested?
Not yet
Are there any user-facing changes?
When users select flavor="spark" while writing the Parquet file from a table, the function will check whether there are any changes to the existing Spark schema in the metadata and will update it accordingly.