Collaborator

@emkornfield emkornfield commented Nov 20, 2025

🥞 Stacked PR

Use this link to review incremental changes.


This adds a file name metadata column, making it possible to determine which file a particular row came from. It also documents RowIndex as a requirement for Parquet readers.

TESTING=Added a new unit test.
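
As a rough illustration of what such a column provides, the sketch below attaches a file path and a row index to every value read from one file. The type `RowWithMetadata` and the function `attach_metadata` are hypothetical names invented for this example; they are not the kernel's API and say nothing about how the PR implements the column.

```rust
// Hypothetical types for illustration only; not the kernel's actual API.

/// One data value plus the two metadata columns mentioned in the description.
#[derive(Debug, Clone, PartialEq)]
pub struct RowWithMetadata {
    pub value: i64,
    /// Zero-based position of the row within its file.
    pub row_index: u64,
    /// Path of the file the row was read from (the `_file` column).
    pub file: String,
}

/// Attach `row_index` and `_file` to every value read from a single file.
pub fn attach_metadata(file_path: &str, values: &[i64]) -> Vec<RowWithMetadata> {
    values
        .iter()
        .enumerate()
        .map(|(i, &value)| RowWithMetadata {
            value,
            row_index: i as u64,
            file: file_path.to_string(),
        })
        .collect()
}
```

Together, the two columns identify a row's origin: the file it lives in and its position inside that file.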


codecov bot commented Nov 20, 2025

Codecov Report

❌ Patch coverage is 82.03125% with 23 lines in your changes missing coverage. Please review.
✅ Project coverage is 84.74%. Comparing base (ac8510e) to head (e8842b7).
⚠️ Report is 8 commits behind head on main.

| Files with missing lines | Patch % | Lines |
|---|---|---|
| kernel/src/engine/default/parquet.rs | 38.46% | 7 Missing and 1 partial ⚠️ |
| kernel/src/engine/arrow_utils.rs | 92.30% | 0 Missing and 7 partials ⚠️ |
| kernel/src/schema/mod.rs | 55.55% | 4 Missing ⚠️ |
| kernel/src/scan/state_info.rs | 0.00% | 3 Missing ⚠️ |
| kernel/src/engine/sync/parquet.rs | 87.50% | 0 Missing and 1 partial ⚠️ |

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #1512      +/-   ##
==========================================
+ Coverage   84.67%   84.74%   +0.07%     
==========================================
  Files         122      122              
  Lines       32741    33188     +447     
  Branches    32741    33188     +447     
==========================================
+ Hits        27722    28126     +404     
- Misses       3674     3678       +4     
- Partials     1345     1384      +39     

pub mod reserved_field_ids {
/// Reserved field ID for the file name metadata column (`_file`).
/// This column provides the name of the Parquet file that contains each row.
pub const FILE_NAME: i64 = 2147483646;
Collaborator Author

This might not be strictly necessary, but it seems reasonable as a step towards Iceberg unification.
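
For context, the constant in the diff is one below `i32::MAX`, which to my knowledge is the ID the Iceberg spec reserves for its `_file` metadata column. A minimal sketch of how a reader might recognise the column by field ID follows; `is_file_name_field` is a hypothetical helper, not something in this PR.

```rust
pub mod reserved_field_ids {
    /// Value copied from the diff under review.
    pub const FILE_NAME: i64 = 2147483646;
}

/// Hypothetical helper: does this field ID refer to the file name column?
pub fn is_file_name_field(field_id: i64) -> bool {
    field_id == reserved_field_ids::FILE_NAME
}
```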

@github-actions github-actions bot added the breaking-change label (Change that require a major version bump) Nov 20, 2025
@emkornfield emkornfield changed the title from "wip" to "feat: Add file name metadata column to parquet reading." Nov 20, 2025
Member

@nicklan nicklan left a comment

Seems good to me. One odd thing is that we call it the "filename" column, but it's really the "file path" column. Is that an issue, or is this already something common in Parquet that I just don't know about?

Self::RowIndex => "row_index",
Self::RowId => "row_id",
Self::RowCommitVersion => "row_commit_version",
Self::FileName => "_file",
Member

at some point we should probably agree on what we want to call these :)

Collaborator Author

The naming was taken from the Iceberg spec, but I agree FilePath is a better variable name; let's do that.

Collaborator Author

Renamed to FilePath. Happy to change this for now; it just seemed a cheap way to be consistent, but it's not too important. Let me know if you prefer "file_path".
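
The outcome of this thread can be sketched as follows: the variant is called `FilePath` while the column name string stays `"_file"`. The enum name `MetadataColumn` and the methods `name` and `from_name` are assumptions made for this example; only the variant names and strings come from the diff and the comments above.

```rust
/// Hypothetical enum mirroring the match arms quoted in the review.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum MetadataColumn {
    RowIndex,
    RowId,
    RowCommitVersion,
    /// Renamed from `FileName`; the column name is unchanged.
    FilePath,
}

impl MetadataColumn {
    /// Column name used for each metadata column.
    pub fn name(&self) -> &'static str {
        match self {
            Self::RowIndex => "row_index",
            Self::RowId => "row_id",
            Self::RowCommitVersion => "row_commit_version",
            Self::FilePath => "_file",
        }
    }

    /// Reverse lookup from a column name to its variant.
    pub fn from_name(name: &str) -> Option<Self> {
        match name {
            "row_index" => Some(Self::RowIndex),
            "row_id" => Some(Self::RowId),
            "row_commit_version" => Some(Self::RowCommitVersion),
            "_file" => Some(Self::FilePath),
            _ => None,
        }
    }
}
```

Switching the column name to "file_path" later would only touch the two string literals, which is why the rename of the variant and the choice of column name can be decided separately.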

Labels

breaking-change Change that require a major version bump

2 participants