17 changes: 17 additions & 0 deletions src/main/thrift/parquet.thrift
@@ -958,6 +958,23 @@ union ColumnCryptoMetaData {
struct ColumnChunk {
/** File where column data is stored. If not set, assumed to be same file as
* metadata. This path is relative to the current file.
*
* As of December 2025, the only known use-case for this field is writing summary
* parquet files (i.e. "_metadata" files). These files consolidate footers from
Contributor

Do we have a link that describes what a summary file is and which implementations support it?

This is what came back from a quick Google search: https://stackoverflow.com/questions/53150801/what-is-the-parquet-summary-file

But I didn't see any mention of this in the format repository: https://github.com/search?q=repo%3Aapache%2Fparquet-format%20summary&type=code

Contributor Author

As far as I can tell, this was never officially part of the Parquet specification.

Contributor Author

I reworded this section.

* multiple parquet files so that readers can fetch all footers efficiently, avoid file
* listing costs, and use statistics to prune files that do not need to be read.
* This is a legacy feature, as modern table formats (e.g. Iceberg, Hudi and Delta Lake)
* are more scalable and serve effectively the same purpose.
Contributor

It seems to me that calling this "legacy" may be too opinionated. Maybe we could tone down the language with something like:

Note that table formats (e.g. Iceberg, Hudi and Delta Lake) offer a
superset of this functionality.

Contributor Author

This is an attempt to summarize this thread: https://lists.apache.org/thread/ootf2kmyg3p01b1bvplpvp4ftd1bt72d

It seems like there are potential correctness issues.

Contributor Author

Added this to the text.

*
* There is no other known usage of this field. Specifically, there are no known
* readers that will read externally stored column data if this field is populated
* within a standard parquet file. Making use of the field for this purpose is currently
* not considered part of the Parquet specification.
*
*
* Any new use of this field must go through the normal Parquet feature
* addition process.
*
**/
1: optional string file_path
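
For readers unfamiliar with the field, here is a minimal sketch of how a summary-file reader might interpret `file_path`. It is not taken from any Parquet implementation. The `ColumnChunk` dataclass and the `resolve_data_path` helper are hypothetical names introduced for illustration, and the sketch assumes that "relative to the current file" means relative to the directory containing the file that holds the metadata.

```python
import posixpath
from dataclasses import dataclass
from typing import Optional


@dataclass
class ColumnChunk:
    """Hypothetical stand-in for the Thrift struct; only file_path is modeled."""
    file_path: Optional[str] = None


def resolve_data_path(metadata_path: str, chunk: ColumnChunk) -> str:
    """Return the path of the file expected to hold the chunk's column data."""
    if chunk.file_path is None:
        # Field not set: the data is assumed to be in the same file as the metadata.
        return metadata_path
    # Field set: treat it as relative to the directory of the metadata file
    # (an assumption about how "relative to the current file" is read).
    base_dir = posixpath.dirname(metadata_path)
    return posixpath.normpath(posixpath.join(base_dir, chunk.file_path))
```

Under these assumptions, a chunk listed in `table/_metadata` with `file_path` set to `part-0.parquet` would point at `table/part-0.parquet`, while a chunk with no `file_path` points back at the file that contains its metadata.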
