Conversation

@parthchandra (Contributor)

Which issue does this PR close?

Part of the changes needed for #2060
Mostly cleans up the native_iceberg_compat APIs so that they do not expose Parquet classes. As a bonus, it provides a utility class that allows ParquetMetadata to be serialized to and deserialized from the Thrift format. This will also be useful for passing ParquetMetadata from the JVM to native code (for all native scan implementations). Currently the native scans end up reading the Parquet metadata again (even though it has already been read on the JVM side), which can be a costly operation on object stores.
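For context, a ParquetMetadata footer can be round-tripped through Parquet's Thrift encoding with parquet-java's existing helpers. The sketch below illustrates the approach only; it is not necessarily what the PR's ParquetMetadataSerializer does.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;

import org.apache.parquet.format.FileMetaData;
import org.apache.parquet.format.Util;
import org.apache.parquet.format.converter.ParquetMetadataConverter;
import org.apache.parquet.hadoop.metadata.ParquetMetadata;

// Sketch: round-trip a ParquetMetadata footer through Parquet's Thrift
// encoding so it can be handed from the JVM to native code as a byte array.
final class ParquetMetadataThriftSketch {
  private final ParquetMetadataConverter converter = new ParquetMetadataConverter();

  byte[] serialize(ParquetMetadata metadata) throws IOException {
    // Convert to the Thrift FileMetaData model, then write its compact encoding.
    FileMetaData thrift = converter.toParquetMetadata(1, metadata);
    ByteArrayOutputStream out = new ByteArrayOutputStream();
    Util.writeFileMetaData(thrift, out);
    return out.toByteArray();
  }

  ParquetMetadata deserialize(byte[] bytes) throws IOException {
    // Read the Thrift encoding back and convert to parquet-java's model.
    FileMetaData thrift = Util.readFileMetaData(new ByteArrayInputStream(bytes));
    return converter.fromParquetMetadata(thrift);
  }
}
```

Because this is the same Thrift encoding used in the file footer itself, the bytes are compact and need no extra mapping layer on either side of the JNI boundary.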

@parthchandra marked this pull request as draft on November 3, 2025 20:56
@codecov-commenter commented Nov 3, 2025

Codecov Report

❌ Patch coverage is 0% with 136 lines in your changes missing coverage. Please review.
✅ Project coverage is 57.14%. Comparing base (f09f8af) to head (d8cd7b7).
⚠️ Report is 671 commits behind head on main.

| Files with missing lines | Patch % | Lines |
|---|---|---|
| ...va/org/apache/comet/parquet/NativeBatchReader.java | 0.00% | 100 Missing ⚠️ |
| ...e/comet/parquet/IcebergCometNativeBatchReader.java | 0.00% | 22 Missing ⚠️ |
| ...pache/comet/parquet/ParquetMetadataSerializer.java | 0.00% | 13 Missing ⚠️ |
| ...org/apache/comet/parquet/AbstractColumnReader.java | 0.00% | 1 Missing ⚠️ |
Additional details and impacted files
@@             Coverage Diff              @@
##               main    #2680      +/-   ##
============================================
+ Coverage     56.12%   57.14%   +1.01%     
- Complexity      976     1386     +410     
============================================
  Files           119      149      +30     
  Lines         11743    13930    +2187     
  Branches       2251     2391     +140     
============================================
+ Hits           6591     7960    +1369     
- Misses         4012     4751     +739     
- Partials       1140     1219      +79     


@parthchandra marked this pull request as ready for review on November 3, 2025 23:56
  hdfs-sys = {version = "0.3", optional = true, features = ["hdfs_3_3"]}
- opendal = { version ="0.54.1", optional = true, features = ["services-hdfs"] }
  uuid = "1.0"
+ opendal = { version ="0.54.0", optional = true, features = ["services-hdfs"] }
Member:

is there a reason for this change? Comet could still choose to use 0.54.1 since it is semver compatible

Contributor Author:

Looks like this happened due to rebasing. Reverted.

@andygrove (Member) left a comment:

LGTM. Thanks @parthchandra

this.dataSchema = dataSchema;
this.isCaseSensitive = isCaseSensitive;
this.useFieldId = useFieldId;
this.ignoreMissingIds = ignoreMissingIds;
Member:

Suggested change
- this.ignoreMissingIds = ignoreMissingIds;
+ this.ignoreMissingIds = ignoreMissingIds;
+ this.useLegacyDateTimestamp = useLegacyDateTimestamp;

Contributor Author:

Done

this.ignoreMissingIds = ignoreMissingIds;
this.partitionSchema = partitionSchema;
this.partitionValues = partitionValues;
this.preInitializedReaders = preInitializedReaders;
Member:

Suggested change
- this.preInitializedReaders = preInitializedReaders;
+ this.preInitializedReaders = preInitializedReaders;
+ this.metrics.clear();
+ if (metrics != null) {
+   this.metrics.putAll(metrics);
+ }

Contributor Author:

Done

filteredSchema = filteredSchema.add(sparkFields[i]);
}
}
sparkSchema = filteredSchema;
Member:

Is it possible that the filtering done here may lead to ArrayIndexOutOfBoundsException at https://github.com/parthchandra/datafusion-comet/blob/d73bcbab9f80836d7229207f309283942501e9ab/common/src/main/java/org/apache/comet/parquet/NativeBatchReader.java#L985 ?
The sparkSchema may now have fewer fields than before, but I see no new logic to protect the .fields()[i] call there.

Contributor Author:

Yes, you're right. This is not entirely correct. Let me fix this.

Contributor Author:

Yup. Fixed to match the fields by name.
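A hypothetical sketch of name-based matching (the helper and its signature are illustrative, not the PR's actual code):

```java
import java.util.Optional;

import org.apache.spark.sql.types.StructField;
import org.apache.spark.sql.types.StructType;

// Hypothetical helper: resolve a Spark field by name rather than by position,
// so a filtered sparkSchema with fewer fields can no longer cause an
// out-of-bounds fields()[i] access.
final class FieldLookup {
  static Optional<StructField> findByName(StructType schema, String name, boolean caseSensitive) {
    for (StructField field : schema.fields()) {
      boolean matches =
          caseSensitive ? field.name().equals(name) : field.name().equalsIgnoreCase(name);
      if (matches) {
        return Optional.of(field);
      }
    }
    return Optional.empty();
  }
}
```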

import org.apache.spark.sql.types.StructType;

/**
* A specialized NativeBatchReader for Iceberg that accepts ParquetMetadata as a JSON string. This
Member:

"accepts ParquetMetadata as a JSON string" - actually it accepts byte[] parquetMetadataBytes at https://github.com/apache/datafusion-comet/pull/2680/files#diff-e57878f6cd8036999500de5719f8f4bbe28e1ed5dcb79a02ad7d7eb206f37473R44, i.e. not a String but bytes.

Contributor Author:

Thank you for catching this. My first version used JSON, but this is more efficient.
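For illustration, a hedged sketch of the bytes-based handoff; the serialize/deserialize method names on ParquetMetadataSerializer are assumptions:

```java
import java.io.IOException;

import org.apache.comet.parquet.ParquetMetadataSerializer;
import org.apache.parquet.hadoop.metadata.ParquetMetadata;

final class FooterHandoffSketch {
  // Hypothetical round trip: the JVM side encodes the footer it has already
  // read; the Iceberg reader decodes the bytes instead of issuing another
  // footer read against the object store. The Thrift bytes match Parquet's
  // on-disk footer encoding, so no JSON mapping layer is needed.
  static ParquetMetadata roundTrip(ParquetMetadata footer) throws IOException {
    ParquetMetadataSerializer serializer = new ParquetMetadataSerializer();
    byte[] parquetMetadataBytes = serializer.serialize(footer); // crosses the JVM/native boundary
    return serializer.deserialize(parquetMetadataBytes);
  }
}
```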

@andygrove changed the title from "chore: various refactoring changes for iceberg" to "chore: various refactoring changes for iceberg [iceberg]" on Nov 6, 2025
@martin-g (Member) commented Nov 7, 2025

@parthchandra You said "Done" but I see no new commits in the PR. Did the push fail?

@parthchandra (Contributor Author)

> @parthchandra You said "Done" but I see no new commits in the PR. Did the push fail?

Oops. I had pushed to the wrong branch :(. Corrected.
