Conversation

@rmnskb (Contributor) commented Aug 3, 2025

Rationale for this change

Please see Issue #47201

What changes are included in this PR?

Add support for updating the Spark-related part of the Parquet file metadata.

Are these changes tested?

Not yet

Are there any user-facing changes?

When users select flavor="spark" while writing a Parquet file from a table, the function will check whether the existing Spark schema in the file metadata has changed, and will update it accordingly.
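
For illustration, a minimal sketch of what this would look like from the user's side (the file name is a placeholder, and the metadata update is the behaviour proposed in this PR, not current behaviour):

```python
import pyarrow as pa
import pyarrow.parquet as pq

# A Parquet file originally written by Spark carries the JSON-encoded Spark
# schema in its key-value metadata under
# "org.apache.spark.sql.parquet.row.metadata".
table = pq.read_table("data_written_by_spark.parquet")

# Modify the data with Arrow, e.g. append a column.
table = table.append_column("flag", pa.array([True] * table.num_rows))

# With flavor="spark", the proposed change would detect the existing Spark
# schema in the metadata and update it to include the new column, instead of
# writing the stale schema back unchanged.
pq.write_table(table, "data_written_by_spark.parquet", flavor="spark")
```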

@rmnskb (Contributor, Author) commented Aug 3, 2025

Hey @AlenkaF @rok @raulcd!
I hope you're doing well. Could you please take a look at the changes and tell me whether they make sense and whether I should continue with this? :)
I am on the fence with this one: on the one hand, I don't like the idea of Arrow interfering with another framework's metadata; on the other hand, the use case described in the original issue seems plausible, and I can imagine the reporter is not the only person doing this.
Thank you!

@AlenkaF (Member) commented Aug 26, 2025

Thank you for continuing to work on PyArrow, @rmnskb!
I haven't had a chance to review this in depth yet, but from a quick initial look, I'm not sure these changes should be applied on our side. I’ll need to take a deeper look and will follow up on the issue as soon as possible.

@rmnskb (Contributor, Author) commented Oct 8, 2025

Hey @AlenkaF !
Sorry to ping you like this :D Did you have a chance to review the code change? :)
Coincidentally, I encountered this issue at work while fixing one of our pipelines, and it reminded me of this PR. I gave it some thought, and I think it makes sense to reflect the schema changes when writing a table with Arrow if the file was previously created with Spark.

@AlenkaF (Member) commented Oct 17, 2025

Hey @rmnskb, thanks for the ping!
I haven't had a chance yet, sorry. I will do my best to have a look at it today.

@AlenkaF (Member) commented Oct 17, 2025

I had another look at it today. I also found a part of the documentation that should fit this use case:
https://arrow.apache.org/docs/python/parquet.html#writing-metadata-and-common-metadata-files

Unfortunately I wasn't able to make it work with the example from the reported issue.

I am still not comfortable adding Spark row group metadata changes in PyArrow, though I agree we should make it easier in PyArrow to handle such cases.

Would you be willing to also give the proposed _metadata or _common_metadata files from the linked documentation a try and see whether that is something we could use and potentially simplify?
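
For reference, the documented pattern is roughly the following (an untested sketch with placeholder paths):

```python
import pyarrow as pa
import pyarrow.parquet as pq

table = pa.table({"id": [1, 2, 3], "value": ["a", "b", "c"]})

# Collect per-file metadata while writing the dataset.
metadata_collector = []
pq.write_to_dataset(table, root_path="dataset_root",
                    metadata_collector=metadata_collector)

# _common_metadata holds only the schema; _metadata additionally holds the
# row group metadata of all files written above.
pq.write_metadata(table.schema, "dataset_root/_common_metadata")
pq.write_metadata(table.schema, "dataset_root/_metadata",
                  metadata_collector=metadata_collector)
```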

@pitrou, what do you think about PyArrow updating Spark row group metadata in cases where PyArrow is used for data manipulation, with schema changes, in between Spark workloads?

@pitrou (Member) commented Oct 17, 2025

On the face of it, I think this can make sense, especially if the code remains simple, but it should be on the C++ side anyway, not in PyArrow.

@wgtmac What do you think?

@wgtmac (Member) commented Oct 18, 2025

IMHO, the C++ side does not even know about the flavor here, so it should not take on this responsibility. It looks hacky, but python/pyarrow/parquet/core.py seems to be the right place to fix it.

@AlenkaF (Member) commented Oct 22, 2025

Looking at the original PR implementing the Spark schema sanitizer (https://github.com/apache/arrow/pull/1076/files), I would actually agree that the proposed fix would fit in PyArrow, if we decide it needs to happen on our side. I would still like it to be a bit less hacky, as already mentioned 😊

If I have time, I will try to make the example work with the use of the metadata files as explained here: https://arrow.apache.org/docs/python/parquet.html#writing-metadata-and-common-metadata-files

@rmnskb (Contributor, Author) commented Oct 22, 2025

> Looking at the original PR implementing the Spark schema sanitizer (https://github.com/apache/arrow/pull/1076/files), I would actually agree that the proposed fix would fit in PyArrow, if we decide it needs to happen on our side. I would still like it to be a bit less hacky, as already mentioned 😊
>
> If I have time, I will try to make the example work with the use of the metadata files as explained here: https://arrow.apache.org/docs/python/parquet.html#writing-metadata-and-common-metadata-files

Thank you for the update!
I also took a look at the original issue, and also at the way pandas uses the Parquet writer: it leverages both write_table and write_to_dataset, so we'd have to cover both options.
If we continue with the approach I've proposed, we'd have to inject the updated metadata at write time, and to do that we'd also have to map Arrow data types to Spark ones. I still have to check whether an existing mapping is available, either in this repo or in Spark's, but one way or another I cannot imagine any other way to ensure compatibility between the two frameworks. I will also test how forgiving Spark is when it comes to schema ingestion.
I will most probably come back with a more concrete implementation later this week; I'll keep you updated :)
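
To make the idea a bit more concrete, a very rough sketch of what such a mapping could look like (the helper name and the covered types are purely illustrative, not an exhaustive or final mapping):

```python
import pyarrow as pa

def _arrow_to_spark_type(arrow_type: pa.DataType) -> str:
    """Map an Arrow data type to a Spark SQL type name (illustrative, incomplete)."""
    if pa.types.is_boolean(arrow_type):
        return "boolean"
    if pa.types.is_int32(arrow_type):
        return "integer"
    if pa.types.is_int64(arrow_type):
        return "long"
    if pa.types.is_float64(arrow_type):
        return "double"
    if pa.types.is_string(arrow_type):
        return "string"
    if pa.types.is_date(arrow_type):
        return "date"
    if pa.types.is_timestamp(arrow_type):
        return "timestamp"
    raise NotImplementedError(f"No Spark mapping for {arrow_type}")
```

A field entry in the Spark row metadata would then be built from this, roughly as {"name": field.name, "type": _arrow_to_spark_type(field.type), "nullable": field.nullable, "metadata": {}}, which matches the shape of Spark's JSON schema.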

@AlenkaF (Member) commented Oct 23, 2025

That is great @rmnskb, much appreciated!

@AlenkaF (Member) commented Jan 7, 2026

@HyukjinKwon wondering what is your take on this issue, discussion and the proposed solution. Any advice/opinion would be most welcome! Also cc @EnricoMi in case this might be interesting.

@HyukjinKwon (Member) commented

Ack, taking a look now.

"MAP": "map",
**dict.fromkeys(["DATE32", "DATE64"], "date"),
"TIMESTAMP": "timestamp",
"INTERVAL_MONTH_DAY_NANO": "Calendar Interval", # TODO: Correct this
Review comment (Member):

TBH, I think it's too much to handle Spark-specific cases like this in PyArrow. Spark keeps adding more types, such as the variant type and more interval types.

@github-actions bot added the "awaiting committer review" label and removed the "awaiting review" label on Jan 7, 2026.

spark_row_metadata["fields"].append({
    "name": field.name,
    "type": _map_spark_to_arrow_types(field.type),
Review comment (Member):

And it seems like we're missing nested types such as struct, array, and map.

"BINARY": "binary",
"STRING": "string",
**dict.fromkeys(
    ["DECIMAL" + str(2 ** i) for i in range(5, 9)], "decimal"),
Review comment (Member):

and decimal precision

),
"MAP": "map",
**dict.fromkeys(["DATE32", "DATE64"], "date"),
"TIMESTAMP": "timestamp",
Review comment (Member):

and timestamp ntz and ltz.

@HyukjinKwon (Member) commented Jan 7, 2026

I would rather lean toward documenting it or deprecating flavor == "spark", TBH. FWIW, I also have a PR that fixes a similar case at #48456, but I don't feel strongly about it.

@AlenkaF (Member) commented Jan 7, 2026

Yeah, I would agree with documenting or even deprecating flavor == "spark" if that is an option.
Thanks!!
