-
Notifications
You must be signed in to change notification settings - Fork 123
feat: Add file name metadata column to parquet reading. #1512
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
Codecov Report❌ Patch coverage is Additional details and impacted files@@ Coverage Diff @@
## main #1512 +/- ##
==========================================
+ Coverage 84.67% 84.74% +0.07%
==========================================
Files 122 122
Lines 32741 33188 +447
Branches 32741 33188 +447
==========================================
+ Hits 27722 28126 +404
- Misses 3674 3678 +4
- Partials 1345 1384 +39 ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
| pub mod reserved_field_ids { | ||
| /// Reserved field ID for the file name metadata column (`_file`). | ||
| /// This column provides the name of the Parquet file that contains each row. | ||
| pub const FILE_NAME: i64 = 2147483646; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this might not be strictly necessary, but seems reasonable as a step towards iceberg unification.
nicklan
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Seems good to me. One odd thing is that we call it the "filename" column, but it's really the "file path" column. Is that an issue or is this already something common in parquet that I just don't know about
kernel/src/schema/mod.rs
Outdated
| Self::RowIndex => "row_index", | ||
| Self::RowId => "row_id", | ||
| Self::RowCommitVersion => "row_commit_version", | ||
| Self::FileName => "_file", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
at some point we should probably agree on what we want to call these :)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Naming was taken from iceberg spec, but I agree FilePath for the variable name is better lets do that.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
renamed to FilePath, happy to change this for now, just seemed a cheap way to be consistent but not too important, let me know if you prefer "file_path"
🥞 Stacked PR
Use this link to review incremental changes.
This adds a file name metadata column, to be able to determine which file a particular row came from. It also documents RowIndex as a requirement for Parquet readers.
TESTING=Added a new unit test.