
Conversation

@mhaseeb123 (Member) commented Nov 21, 2025

Description

Contributes to #20311. Closes #18890

This PR enables the cuDF parquet readers (chunked and non-chunked) to use pre-constructed datasource(s) and FileMetaData(s), saving the compute time otherwise spent re-reading footers. This is particularly useful for workflows that first read only the file footers, make some decisions based on them, and then read the parquet files without parsing the footers again.
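As a rough illustration, here is a minimal sketch of that footer-first workflow, assuming the helpers introduced in this diff (`make_datasources`, `read_parquet_footers`) and a hypothetical `read_parquet` overload that accepts the pre-built sources and metadata; exact signatures and namespaces are illustrative, not the final API:

```cpp
// Hypothetical usage; signatures and namespaces are illustrative only.
auto const info = cudf::io::source_info{"part-0.parquet"};

// Build datasources once and parse only the footers.
auto sources   = make_datasources(info);         // helper declared in this PR
auto metadatas = read_parquet_footers(sources);  // one FileMetaData per source

// Inspect the footers (schema, row counts, ...) and choose reader options.
auto opts = cudf::io::parquet_reader_options::builder(info).build();

// Read the tables without re-parsing the footers. Note the move semantics:
// both the sources and the metadatas are consumed by the reader.
auto result = cudf::io::read_parquet(opts, std::move(sources), std::move(metadatas));
```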


copy-pr-bot bot commented Nov 21, 2025

Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually.

@github-actions github-actions bot added the "libcudf (Affects libcudf (C++/CUDA) code)" label Nov 21, 2025
@mhaseeb123 mhaseeb123 added the "2 - In Progress", "cuIO", "non-breaking", and "cudf-polars" labels Nov 21, 2025
@mhaseeb123 mhaseeb123 added the "feature request" label Nov 21, 2025
@GPUtester GPUtester moved this to In Progress in cuDF Python Nov 21, 2025
@mhaseeb123 mhaseeb123 changed the title from "Allow parquet readers to use pre-materialized metadatas" to "Allow parquet readers to use pre-existing datasources and metadatas" Nov 21, 2025
@mhaseeb123 mhaseeb123 added the "3 - Ready for Review" label and removed the "2 - In Progress" label Nov 21, 2025
@mhaseeb123 mhaseeb123 marked this pull request as ready for review November 21, 2025 18:25
@mhaseeb123 mhaseeb123 requested a review from a team as a code owner November 21, 2025 18:25

@mhaseeb123 (Member Author):

CC: @JigaoLuo

"deprecated",
# TODO: This is currently in a src file but perhaps should be public
"orc::column_statistics",
# Sphinx doesn't know how to distinguish between the ORC and Parquet

@mhaseeb123 (Member Author):

Only an ORC thing now, since in Parquet it's called Type, consistent with the schema.

*/

//! ORC data type
using cudf::io::orc::TypeKind;

@mhaseeb123 (Member Author):

Ignore for docs build

*
* @return List of FileMetaData objects, one per parquet source
*/
std::vector<parquet::FileMetaData> read_parquet_footers(

@mhaseeb123 (Member Author):

Could rename this to read_parquet_raw_metadata as well.

// Test with zero limit: everything will be read in one chunk
{
auto const [result, num_chunks] = chunked_read(filepath_no_null, 0);
// Separately materialize datasource and metadata and use them to construct the chunked reader

@mhaseeb123 (Member Author):

Just threw the new chunked_reader interface into a few existing tests.

return chunked_read(vpath, output_limit, input_limit);
}

auto chunked_read(std::vector<std::unique_ptr<cudf::io::datasource>>&& sources,

@mhaseeb123 (Member Author):

Same as the chunked_read utility above; this one just uses pre-existing sources and metadatas with the chunked reader.
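For context, a hedged sketch of how a test might call this overload, mirroring the path-based utility above (whether `read_parquet_footers` takes the sources directly is an assumption):

```cpp
// Hypothetical test call; mirrors chunked_read(filepath_no_null, output_limit).
auto sources   = make_datasources(cudf::io::source_info{filepath_no_null});
auto metadatas = read_parquet_footers(sources);  // assumed to accept the sources
auto const [result, num_chunks] =
  chunked_read(std::move(sources), std::move(metadatas), output_limit);
```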

@mhaseeb123 mhaseeb123 added the "4 - Needs Review" label and removed the "3 - Ready for Review" label
* which means the entire file after `offset`)
* @return Constructed vector of datasource objects
*/
std::vector<std::unique_ptr<cudf::io::datasource>> make_datasources(source_info const& info,

@mhaseeb123 (Member Author):

Just declared this helper in the header file. It's already defined in functions.cpp


namespace {

/**

@mhaseeb123 (Member Author):

Just moved this out of the anonymous namespace.

@mhaseeb123 mhaseeb123 changed the title from "Allow parquet readers to use pre-existing datasources and metadatas" to "Allow parquet readers to use existing datasources and metadatas" Nov 24, 2025
{
CUDF_FUNC_RANGE();

auto reader = std::make_unique<detail_parquet::reader>(

@pmattione-nvidia (Contributor) commented Dec 1, 2025:

Do we really want to be moving out of these vectors? What if you want to call read_parquet() twice on the same file? One of the points is to reuse the metadata for multiple calls, right? And if we're just doing copies instead, we should probably be passing in std::spans instead of vectors.
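For illustration, a sketch of the span-based alternative being floated here (a purely hypothetical signature, not the PR's current API), where the reader borrows rather than consumes the metadata:

```cpp
// Hypothetical overload: the caller keeps ownership of the FileMetaData
// objects and can reuse them across multiple read_parquet() calls.
cudf::io::table_with_metadata read_parquet(
  cudf::io::parquet_reader_options const& options,
  std::vector<std::unique_ptr<cudf::io::datasource>>&& sources,
  cudf::host_span<cudf::io::parquet::FileMetaData const> metadatas);
```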

@mhaseeb123 (Member Author):

I am okay with that approach as well. @wence- @JigaoLuo, do any of your use cases involve reading the same file (via datasource and metadata) again and again?

@pmattione-nvidia (Contributor) commented Dec 1, 2025:

@abellina, can you comment on what we need for Spark here? Would we share the parquet metadata between Spark tasks?

@abellina (Contributor) commented Dec 1, 2025:

No, tasks do not share data when reading from parquet, and we should not need to re-read the same file or the same buffer multiple times during normal operation. We can technically call read_parquet multiple times if we OOM on the same host buffer, but that is an error condition and we are likely to recreate the reader from scratch.

Does this change affect chunking at all? That's the one case where I could see an OOM causing us to call into cuDF to re-read a chunk.

@mhaseeb123 (Member Author) commented Dec 1, 2025:

Nope, this is just a new interface to the parquet readers (both chunked and non-chunked) that lets you skip creating the datasources and reading in the metadata, and directly pass in existing ones.

One advantage is that you can save some time when reading the same file with different options (just make sure to create local copies of the metadatas and datasources as needed, since we are using move semantics).
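A sketch of that explicit-copy pattern, assuming `FileMetaData` is copyable, that `opts_a`/`opts_b` are pre-built reader options, and the hypothetical overload shown earlier:

```cpp
// Reading the same file twice with different options: keep a copy of the
// metadata and build fresh datasources, since both are consumed by move.
auto metadatas      = read_parquet_footers(make_datasources(info));
auto metadatas_copy = metadatas;  // explicit deep copy for the second read

auto first  = cudf::io::read_parquet(opts_a, make_datasources(info), std::move(metadatas));
auto second = cudf::io::read_parquet(opts_b, make_datasources(info), std::move(metadatas_copy));
```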

Contributor:

Wouldn't we have to re-read the same file multiple times if we hit the chunked-read memory limit? Wouldn't we want to reuse the datasource and metadata to read each chunk in that case?

@mhaseeb123 (Member Author) commented Dec 2, 2025:

Actually, I think the move semantics here are okay, since we are passing in unique_ptrs to datasources, which get destructed with the reader. I could change the PR to move the datasources but copy the metadatas, but that seems inconsistent. The current state lets the user explicitly create copies of both the datasources and metadata and pass them in as needed, instead of always copying.

Contributor:

FWIW, avoiding the metadata copies can actually impact performance. When I removed an accidental copy in read_parquet_metadata, PDS-H overall got measurably faster (single-digit %).

Contributor:

I am ok with the user explicitly having to opt in to copying things if they need to.

copy-pr-bot bot commented Dec 3, 2025

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

@vuule (Contributor) left a comment:

partial review


// Flatten all columns into a single vector for easier task distribution
std::vector<std::reference_wrapper<ColumnChunk>> all_column_chunks;
all_column_chunks.reserve(row_groups.size() * row_groups.front().columns.size());

Contributor:

Do we ever get here with empty row_groups?

@mhaseeb123 (Member Author) commented Dec 4, 2025:

I don't think so, unless doing it maliciously via the hybrid scan API.
Edit: Actually, we check for non-zero row_groups.size() and row_groups.front().size() in the hybrid scan API too, so I think we can't get here with empty inputs.

@mhaseeb123 (Member Author):

I realized that I had actually removed this check in the PR, but I've added it back now. Thanks for catching this.
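For reference, a sketch of what the restored guard might look like (the exact message and placement are assumptions):

```cpp
// Bail out before dereferencing row_groups.front() on empty input.
CUDF_EXPECTS(not row_groups.empty() and not row_groups.front().columns.empty(),
             "Cannot flatten column chunks of empty row groups");
```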

@JigaoLuo (Contributor) commented Dec 5, 2025

> does any of your use cases involve reading the same file via (datasource and metadata) again and again?

(Sorry for the delay. I missed your message as I was busy writing at the beginning of this month.)

As you may know, I let different threads read different RGs of the same Parquet file. This does not mean the datasource is read multiple times.

However, one rare but valid case is the self‑join, essentially joining a table with itself (e.g., A JOIN A) for filtering purposes. In this case, I still need to fully read table A once to build the hash table, and then read A again to probe the hash table.

There are a couple of reasons why we did not cache A in memory: 1. We do not have any query optimizer. 2. We expect there may be cases where repeated reading is necessary, so we kept the design general.

This is more of a query processing & optimization discussion and slightly off‑topic for a reader of this issue. We could continue the discussion on Slack if that helps.


Labels

4 - Needs Review: Waiting for reviewer to review or respond
cudf-polars: Issues specific to cudf-polars
cuIO: cuIO issue
feature request: New feature or request
libcudf: Affects libcudf (C++/CUDA) code
non-breaking: Non-breaking change

Projects

Status: In Progress

Development

Successfully merging this pull request may close these issues.

[FEA] Parquet metadata caching due to overhead in reader

6 participants