Conversation

@rmnskb (Contributor) commented Aug 3, 2025

Rationale for this change

Please see Issue #47201

What changes are included in this PR?

Add support for updating the Spark-related part of the Parquet file metadata.

Are these changes tested?

Not yet

Are there any user-facing changes?

When users select flavor="spark" while writing a Parquet file from a table, the function will check whether the existing Spark schema in the file metadata has changed, and will update it accordingly.
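
For illustration, a minimal sketch of what this would look like from the user's side (the file name is a placeholder, and the metadata update is the behaviour proposed in this PR, not current behaviour):

```python
import pyarrow as pa
import pyarrow.parquet as pq

# A Parquet file originally written by Spark carries the JSON-encoded Spark
# schema in its key-value metadata under
# "org.apache.spark.sql.parquet.row.metadata".
table = pq.read_table("data_written_by_spark.parquet")

# Modify the data with Arrow, e.g. append a column.
table = table.append_column("flag", pa.array([True] * table.num_rows))

# With flavor="spark", the proposed change would detect the existing Spark
# schema in the metadata and update it to include the new column, instead of
# writing the stale schema back unchanged.
pq.write_table(table, "data_written_by_spark.parquet", flavor="spark")
```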

@rmnskb (Contributor, Author) commented Aug 3, 2025

Hey @AlenkaF @rok @raulcd!
I hope you're doing well. Could you please take a look at the changes and tell me whether they make sense and whether I should continue with this? :)
I am on the fence with this one: on the one hand, I don't like the idea of Arrow interfering with another framework's metadata; on the other hand, the use case described in the original issue seems plausible, and I can imagine the reporter is not the only person doing this.
Thank you!

@AlenkaF (Member) commented Aug 26, 2025

Thank you for continuing to work on PyArrow, @rmnskb!
I haven't had a chance to review this in depth yet, but from a quick initial look, I'm not sure these changes should be applied on our side. I’ll need to take a deeper look and will follow up on the issue as soon as possible.

@rmnskb (Contributor, Author) commented Oct 8, 2025

Hey @AlenkaF !
Sorry to ping you like this :D Did you have a chance to review the code change? :)
Coincidentally, I encountered this issue at work while fixing one of our pipelines, and it reminded me of this PR. I gave it some thought, and I think it makes sense to reflect the schema changes when writing a table with Arrow if the file was previously created with Spark.

@AlenkaF (Member) commented Oct 17, 2025

Hey @rmnskb, thanks for the ping!
I haven't had a chance yet, sorry. I will do my best to have a look at it today.

@AlenkaF (Member) commented Oct 17, 2025

I had another look at it today. I also found a part of the documentation that should fit this use case:
https://arrow.apache.org/docs/python/parquet.html#writing-metadata-and-common-metadata-files

Unfortunately I wasn't able to make it work with the example from the reported issue.

I am still not comfortable adding Spark row group metadata changes in PyArrow, though I agree we should make it easier in PyArrow to handle such cases.

Would you be willing to also give the proposed _metadata or _common_metadata files from the linked documentation a try and see whether that is something we could use and potentially simplify?
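
For reference, the documented pattern is roughly the following (an untested sketch with placeholder paths):

```python
import pyarrow as pa
import pyarrow.parquet as pq

table = pa.table({"id": [1, 2, 3], "value": ["a", "b", "c"]})

# Collect per-file metadata while writing the dataset.
metadata_collector = []
pq.write_to_dataset(table, root_path="dataset_root",
                    metadata_collector=metadata_collector)

# _common_metadata holds only the schema; _metadata additionally holds the
# row group metadata of all files written above.
pq.write_metadata(table.schema, "dataset_root/_common_metadata")
pq.write_metadata(table.schema, "dataset_root/_metadata",
                  metadata_collector=metadata_collector)
```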

@pitrou, what do you think about PyArrow updating Spark row group metadata in cases where PyArrow is used for data manipulation, with schema changes, in between Spark workloads?

@pitrou (Member) commented Oct 17, 2025

On the face of it, I think this can make sense, especially if the code remains simple, but it should be on the C++ side anyway, not in PyArrow.

@wgtmac What do you think?

@wgtmac (Member) commented Oct 18, 2025

IMHO, the C++ side does not even know about the flavor here, so it should not take on this responsibility. It looks hacky, but python/pyarrow/parquet/core.py seems to be the right place to fix it.

@AlenkaF (Member) commented Oct 22, 2025

Looking at the original PR implementing the Spark schema sanitizer (https://github.com/apache/arrow/pull/1076/files), I would actually agree that the proposed fix would fit in PyArrow, if we decide it needs to happen on our side. I would still like it to be a bit less hacky, as already mentioned 😊

If I have time, I will try to make the example work with the use of the metadata files as explained here: https://arrow.apache.org/docs/python/parquet.html#writing-metadata-and-common-metadata-files

@rmnskb (Contributor, Author) commented Oct 22, 2025

> Looking at the original PR implementing the Spark schema sanitizer (https://github.com/apache/arrow/pull/1076/files), I would actually agree that the proposed fix would fit in PyArrow, if we decide it needs to happen on our side. I would still like it to be a bit less hacky, as already mentioned 😊
>
> If I have time, I will try to make the example work with the use of the metadata files as explained here: https://arrow.apache.org/docs/python/parquet.html#writing-metadata-and-common-metadata-files

Thank you for the update!
I also took a look at the original issue, and also at the way pandas uses the Parquet writer: it leverages both write_table and write_to_dataset, so we'd have to cover both options.
If we continue with the approach I've proposed, we'd have to inject the updated metadata at write time, and to do that we'd also have to map Arrow data types to Spark ones. I still have to check whether an existing mapping is available, either in this repo or in Spark's, but one way or another I cannot imagine any other way to ensure compatibility between the two frameworks. I will also test how forgiving Spark is when it comes to schema ingestion.
I will most probably come back with a more concrete implementation later this week; I'll keep you updated :)
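
To make the idea a bit more concrete, a very rough sketch of what such a mapping could look like (the helper name and the covered types are purely illustrative, not an exhaustive or final mapping):

```python
import pyarrow as pa

def _arrow_to_spark_type(arrow_type: pa.DataType) -> str:
    """Map an Arrow data type to a Spark SQL type name (illustrative, incomplete)."""
    if pa.types.is_boolean(arrow_type):
        return "boolean"
    if pa.types.is_int32(arrow_type):
        return "integer"
    if pa.types.is_int64(arrow_type):
        return "long"
    if pa.types.is_float64(arrow_type):
        return "double"
    if pa.types.is_string(arrow_type):
        return "string"
    if pa.types.is_date(arrow_type):
        return "date"
    if pa.types.is_timestamp(arrow_type):
        return "timestamp"
    raise NotImplementedError(f"No Spark mapping for {arrow_type}")
```

A field entry in the Spark row metadata would then be built from this, roughly as {"name": field.name, "type": _arrow_to_spark_type(field.type), "nullable": field.nullable, "metadata": {}}, which matches the shape of Spark's JSON schema.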

@AlenkaF (Member) commented Oct 23, 2025

That is great @rmnskb, much appreciated!

@AlenkaF (Member) commented Jan 7, 2026

@HyukjinKwon wondering what is your take on this issue, discussion and the proposed solution. Any advice/opinion would be most welcome! Also cc @EnricoMi in case this might be interesting.

@HyukjinKwon (Member) commented

Ack, taking a look now.

"MAP": "map",
**dict.fromkeys(["DATE32", "DATE64"], "date"),
"TIMESTAMP": "timestamp",
"INTERVAL_MONTH_DAY_NANO": "Calendar Interval", # TODO: Correct this
Review comment (Member):

TBH, I think it's too much to handle Spark-specific cases like this in PyArrow. Spark keeps adding more types, such as the variant type and more interval types.

@github-actions bot added the "awaiting committer review" label and removed the "awaiting review" label on Jan 7, 2026.

spark_row_metadata["fields"].append({
    "name": field.name,
    "type": _map_spark_to_arrow_types(field.type),
Review comment (Member):

And it seems like we're missing nested types such as struct, array, and map.

"BINARY": "binary",
"STRING": "string",
**dict.fromkeys(
    ["DECIMAL" + str(2 ** i) for i in range(5, 9)], "decimal"),
Review comment (Member):

and decimal precision

),
"MAP": "map",
**dict.fromkeys(["DATE32", "DATE64"], "date"),
"TIMESTAMP": "timestamp",
Review comment (Member):

and timestamp ntz and ltz.

@HyukjinKwon (Member) commented Jan 7, 2026

I would rather lean toward documenting it or deprecating flavor == "spark", TBH. FWIW, I also have a PR that fixes a similar case at #48456, but I don't feel strongly about it.

@AlenkaF (Member) commented Jan 7, 2026

Yeah, I would agree with documenting or even deprecating flavor == "spark" if that is an option.
Thanks!!
