
Conversation

@mhaseeb123 (Member) commented Nov 21, 2025

Description

Contributes to #20311. Closes #18890

This PR enables the cuDF parquet readers (chunked and non-chunked) to use pre-constructed datasource(s) and FileMetaData(s), saving the compute time otherwise spent re-reading footers. This is particularly useful for workflows that first read only the file footers, make some decisions based on them, and then read the parquet files without parsing the footers again.
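As a rough illustration, here is a minimal sketch of that footer-first workflow, assuming the helpers introduced in this diff (`make_datasources`, `read_parquet_footers`) and a hypothetical `read_parquet` overload that accepts the pre-built sources and metadata; exact signatures and namespaces are illustrative, not the final API:

```cpp
// Hypothetical usage; signatures and namespaces are illustrative only.
auto const info = cudf::io::source_info{"part-0.parquet"};

// Build datasources once and parse only the footers.
auto sources   = make_datasources(info);         // helper declared in this PR
auto metadatas = read_parquet_footers(sources);  // one FileMetaData per source

// Inspect the footers (schema, row counts, ...) and choose reader options.
auto opts = cudf::io::parquet_reader_options::builder(info).build();

// Read the tables without re-parsing the footers. Note the move semantics:
// both the sources and the metadatas are consumed by the reader.
auto result = cudf::io::read_parquet(opts, std::move(sources), std::move(metadatas));
```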


copy-pr-bot bot commented Nov 21, 2025

Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually.

@github-actions github-actions bot added the "libcudf (Affects libcudf (C++/CUDA) code)" label Nov 21, 2025
@mhaseeb123 mhaseeb123 added the "2 - In Progress", "cuIO", "non-breaking", and "cudf-polars" labels Nov 21, 2025
@mhaseeb123 mhaseeb123 added the "feature request" label Nov 21, 2025
@GPUtester GPUtester moved this to In Progress in cuDF Python Nov 21, 2025
@mhaseeb123 mhaseeb123 changed the title from "Allow parquet readers to use pre-materialized metadatas" to "Allow parquet readers to use pre-existing datasources and metadatas" Nov 21, 2025
@mhaseeb123 mhaseeb123 added the "3 - Ready for Review" label and removed the "2 - In Progress" label Nov 21, 2025
@mhaseeb123 mhaseeb123 marked this pull request as ready for review November 21, 2025 18:25
@mhaseeb123 mhaseeb123 requested a review from a team as a code owner November 21, 2025 18:25

@mhaseeb123 (Member Author):

CC: @JigaoLuo

"deprecated",
# TODO: This is currently in a src file but perhaps should be public
"orc::column_statistics",
# Sphinx doesn't know how to distinguish between the ORC and Parquet

@mhaseeb123 (Member Author):

Only an ORC thing now, since in Parquet it's called Type, consistent with the schema.

*/

//! ORC data type
using cudf::io::orc::TypeKind;

@mhaseeb123 (Member Author):

Ignore for docs build

*
* @return List of FileMetaData objects, one per parquet source
*/
std::vector<parquet::FileMetaData> read_parquet_footers(

@mhaseeb123 (Member Author):

Could rename this to read_parquet_raw_metadata as well.

// Test with zero limit: everything will be read in one chunk
{
auto const [result, num_chunks] = chunked_read(filepath_no_null, 0);
// Separately materialize datasource and metadata and use them to construct the chunked reader

@mhaseeb123 (Member Author):

Just threw the new chunked_reader interface into a few existing tests.

return chunked_read(vpath, output_limit, input_limit);
}

auto chunked_read(std::vector<std::unique_ptr<cudf::io::datasource>>&& sources,

@mhaseeb123 (Member Author):

Same as the chunked_read utility above; this one just uses pre-existing sources and metadatas with the chunked reader.
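For context, a hedged sketch of how a test might call this overload, mirroring the path-based utility above (whether `read_parquet_footers` takes the sources directly is an assumption):

```cpp
// Hypothetical test call; mirrors chunked_read(filepath_no_null, output_limit).
auto sources   = make_datasources(cudf::io::source_info{filepath_no_null});
auto metadatas = read_parquet_footers(sources);  // assumed to accept the sources
auto const [result, num_chunks] =
  chunked_read(std::move(sources), std::move(metadatas), output_limit);
```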

@mhaseeb123 mhaseeb123 added the "4 - Needs Review" label and removed the "3 - Ready for Review" label
* which means the entire file after `offset`)
* @return Constructed vector of datasource objects
*/
std::vector<std::unique_ptr<cudf::io::datasource>> make_datasources(source_info const& info,

@mhaseeb123 (Member Author):

Just declared this helper in the header file. It's already defined in functions.cpp


namespace {

/**

@mhaseeb123 (Member Author):

Just moved this out of the anonymous namespace.

@mhaseeb123 mhaseeb123 changed the title from "Allow parquet readers to use pre-existing datasources and metadatas" to "Allow parquet readers to use existing datasources and metadatas" Nov 24, 2025
{
CUDF_FUNC_RANGE();

auto reader = std::make_unique<detail_parquet::reader>(

@pmattione-nvidia (Contributor) commented Dec 1, 2025:

Do we really want to be moving out of these vectors? What if you want to call read_parquet() twice on the same file? One of the points is to reuse the metadata for multiple calls, right? And if we're just doing copies instead, we should probably be passing in std::spans instead of vectors.
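For illustration, a sketch of the span-based alternative being floated here (a purely hypothetical signature, not the PR's current API), where the reader borrows rather than consumes the metadata:

```cpp
// Hypothetical overload: the caller keeps ownership of the FileMetaData
// objects and can reuse them across multiple read_parquet() calls.
cudf::io::table_with_metadata read_parquet(
  cudf::io::parquet_reader_options const& options,
  std::vector<std::unique_ptr<cudf::io::datasource>>&& sources,
  cudf::host_span<cudf::io::parquet::FileMetaData const> metadatas);
```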

@mhaseeb123 (Member Author):

I am okay with that approach as well. @wence- @JigaoLuo, do any of your use cases involve reading the same file (via datasource and metadata) again and again?

@pmattione-nvidia (Contributor) commented Dec 1, 2025:

@abellina, can you comment on what we need for Spark here? Would we share the parquet metadata between Spark tasks?

@abellina (Contributor) commented Dec 1, 2025:

No, tasks do not share data when reading from parquet, and we should not need to re-read the same file or the same buffer multiple times during normal operation. We can technically call read_parquet multiple times if we OOM on the same host buffer, but that is an error condition and we are likely to recreate the reader from scratch.

Does this change affect chunking at all? That's the one case where I could see an OOM causing us to call into cuDF to re-read a chunk.

@mhaseeb123 (Member Author) commented Dec 1, 2025:

Nope, this is just a new interface to the parquet readers (both chunked and non-chunked) that lets you skip creating the datasources and reading in the metadata, and directly pass in existing ones.

One advantage is that you can save some time when reading the same file with different options (just make sure to create local copies of the metadatas and datasources as needed, since we are using move semantics).
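A sketch of that explicit-copy pattern, assuming `FileMetaData` is copyable, that `opts_a`/`opts_b` are pre-built reader options, and the hypothetical overload shown earlier:

```cpp
// Reading the same file twice with different options: keep a copy of the
// metadata and build fresh datasources, since both are consumed by move.
auto metadatas      = read_parquet_footers(make_datasources(info));
auto metadatas_copy = metadatas;  // explicit deep copy for the second read

auto first  = cudf::io::read_parquet(opts_a, make_datasources(info), std::move(metadatas));
auto second = cudf::io::read_parquet(opts_b, make_datasources(info), std::move(metadatas_copy));
```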

Contributor:

Wouldn't we have to re-read the same file multiple times if we hit the chunked-read memory limit? Wouldn't we want to reuse the datasource and metadata to read each chunk in that case?

@mhaseeb123 (Member Author) commented Dec 2, 2025:

Actually, I think the move semantics here are okay, since we are passing in unique_ptrs to datasources, which get destructed with the reader. I could change the PR to move the datasources but copy the metadatas, but that seems inconsistent. The current state lets the user explicitly create copies of both the datasources and metadata and pass them in as needed, instead of always copying.

Contributor:

FWIW, avoiding the metadata copies can actually impact performance. When I removed an accidental copy in read_parquet_metadata, PDS-H overall got measurably faster (single-digit %).

Contributor:

I am ok with the user explicitly having to opt in to copying things if they need to.

copy-pr-bot bot commented Dec 3, 2025

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

@vuule (Contributor) left a comment:

partial review


// Flatten all columns into a single vector for easier task distribution
std::vector<std::reference_wrapper<ColumnChunk>> all_column_chunks;
all_column_chunks.reserve(row_groups.size() * row_groups.front().columns.size());

Contributor:

Do we ever get here with empty row_groups?

@mhaseeb123 (Member Author) commented Dec 4, 2025:

I don't think so, unless doing it maliciously via the hybrid scan API.
Edit: Actually, we check for non-zero row_groups.size() and row_groups.front().size() in the hybrid scan API too, so I think we can't get here with empty inputs.

@mhaseeb123 (Member Author):

I realized that I had actually removed this check in the PR, but I've added it back now. Thanks for catching this.
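For reference, a sketch of what the restored guard might look like (the exact message and placement are assumptions):

```cpp
// Bail out before dereferencing row_groups.front() on empty input.
CUDF_EXPECTS(not row_groups.empty() and not row_groups.front().columns.empty(),
             "Cannot flatten column chunks of empty row groups");
```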

@JigaoLuo (Contributor) commented Dec 5, 2025

> does any of your use cases involve reading the same file via (datasource and metadata) again and again?

(Sorry for the delay. I missed your message as I was busy writing at the beginning of this month.)

As you may know, I let different threads read different RGs of the same Parquet file. This does not mean the datasource is read multiple times.

However, one rare but valid case is the self‑join, essentially joining a table with itself (e.g., A JOIN A) for filtering purposes. In this case, I still need to fully read table A once to build the hash table, and then read A again to probe the hash table.

There are a couple of reasons why we did not cache A in memory: 1. We do not have any query optimizer. 2. We expect there may be cases where repeated reading is necessary, so we kept the design general.

This is more of a query processing & optimization discussion and slightly off‑topic for a reader of this issue. We could continue the discussion on Slack if that helps.


Labels

4 - Needs Review: Waiting for reviewer to review or respond
cudf-polars: Issues specific to cudf-polars
cuIO: cuIO issue
feature request: New feature or request
libcudf: Affects libcudf (C++/CUDA) code
non-breaking: Non-breaking change

Projects

Status: In Progress

Development

Successfully merging this pull request may close these issues.

[FEA] Parquet metadata caching due to overhead in reader

6 participants