chore: various refactoring changes for iceberg [iceberg] #2680

parthchandra · 2025-11-03T20:56:21Z

Which issue does this PR close?

Part of the changes needed for #2060
Mostly cleans up the native_iceberg_compat APIs so that they do not include Parquet classes. As a bonus, it provides a utility class that allows ParquetMetadata to be serialized to and deserialized from the Thrift format. This will also be useful for passing ParquetMetadata from the JVM to native code (for all native scan implementations). Currently the native scans end up reading the Parquet metadata again (even though it has already been read on the JVM side), which can be a costly operation on object stores.

codecov-commenter · 2025-11-03T21:17:53Z

Codecov Report

❌ Patch coverage is 0% with 136 lines in your changes missing coverage. Please review.
✅ Project coverage is 57.14%. Comparing base (f09f8af) to head (d8cd7b7).
⚠️ Report is 671 commits behind head on main.

Files with missing lines	Patch %	Lines
...va/org/apache/comet/parquet/NativeBatchReader.java	0.00%	100 Missing ⚠️
...e/comet/parquet/IcebergCometNativeBatchReader.java	0.00%	22 Missing ⚠️
...pache/comet/parquet/ParquetMetadataSerializer.java	0.00%	13 Missing ⚠️
...org/apache/comet/parquet/AbstractColumnReader.java	0.00%	1 Missing ⚠️

Additional details and impacted files

@@             Coverage Diff              @@
##               main    #2680      +/-   ##
============================================
+ Coverage     56.12%   57.14%   +1.01%     
- Complexity      976     1386     +410     
============================================
  Files           119      149      +30     
  Lines         11743    13930    +2187     
  Branches       2251     2391     +140     
============================================
+ Hits           6591     7960    +1369     
- Misses         4012     4751     +739     
- Partials       1140     1219      +79


andygrove · 2025-11-04T01:57:08Z

native/core/Cargo.toml

 hdfs-sys = {version = "0.3", optional = true, features = ["hdfs_3_3"]}
-opendal = { version ="0.54.1", optional = true, features = ["services-hdfs"] }
-uuid = "1.0"
+opendal = { version ="0.54.0", optional = true, features = ["services-hdfs"] }


is there a reason for this change? Comet could still choose to use 0.54.1 since it is semver compatible

Looks like this happened due to rebasing. Reverted.
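
For context on the semver point in this thread: Cargo treats a bare version string such as "0.54.0" as a caret requirement, so for a 0.x crate any release with the same minor version and an equal or higher patch level can be selected. The following is a minimal Java sketch of that rule, assuming full three-part versions without pre-release tags. The class and method names are hypothetical and are not part of Comet or Cargo.

```java
public class CaretCompat {
    // Parses "major.minor.patch". Assumption: exactly three numeric parts,
    // no pre-release or build metadata.
    static int[] parse(String version) {
        String[] parts = version.trim().split("\\.");
        if (parts.length != 3) {
            throw new IllegalArgumentException("expected major.minor.patch: " + version);
        }
        return new int[] {
            Integer.parseInt(parts[0]), Integer.parseInt(parts[1]), Integer.parseInt(parts[2])
        };
    }

    // Lexicographic comparison of two parsed versions.
    static int compare(int[] a, int[] b) {
        for (int i = 0; i < 3; i++) {
            if (a[i] != b[i]) {
                return Integer.compare(a[i], b[i]);
            }
        }
        return 0;
    }

    // True if candidate satisfies the caret requirement:
    // candidate >= requirement, and below the next change of the
    // left-most non-zero component of the requirement.
    static boolean satisfies(String requirement, String candidate) {
        int[] req = parse(requirement);
        int[] cand = parse(candidate);
        if (compare(cand, req) < 0) {
            return false;
        }
        int[] upper;
        if (req[0] > 0) {
            upper = new int[] {req[0] + 1, 0, 0};
        } else if (req[1] > 0) {
            upper = new int[] {0, req[1] + 1, 0};
        } else {
            upper = new int[] {0, 0, req[2] + 1};
        }
        return compare(cand, upper) < 0;
    }

    public static void main(String[] args) {
        System.out.println(satisfies("0.54.0", "0.54.1"));
        System.out.println(satisfies("0.54.0", "0.55.0"));
    }
}
```

Under this rule a requirement of "0.54.0" still admits 0.54.1, so lowering the declared version relaxes the lower bound rather than pinning an older release.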

andygrove

LGTM. Thanks @parthchandra

martin-g · 2025-11-04T13:14:12Z

common/src/main/java/org/apache/comet/parquet/IcebergCometNativeBatchReader.java

+    this.dataSchema = dataSchema;
+    this.isCaseSensitive = isCaseSensitive;
+    this.useFieldId = useFieldId;
+    this.ignoreMissingIds = ignoreMissingIds;


Suggested change

-    this.ignoreMissingIds = ignoreMissingIds;
+    this.ignoreMissingIds = ignoreMissingIds;
+    this.useLegacyDateTimestamp = useLegacyDateTimestamp;

martin-g · 2025-11-04T13:19:13Z

common/src/main/java/org/apache/comet/parquet/IcebergCometNativeBatchReader.java

+    this.ignoreMissingIds = ignoreMissingIds;
+    this.partitionSchema = partitionSchema;
+    this.partitionValues = partitionValues;
+    this.preInitializedReaders = preInitializedReaders;


Suggested change

-    this.preInitializedReaders = preInitializedReaders;
+    this.preInitializedReaders = preInitializedReaders;
+    this.metrics.clear();
+    if (metrics != null) {
+      this.metrics.putAll(metrics);
+    }

martin-g · 2025-11-04T13:34:09Z

common/src/main/java/org/apache/comet/parquet/NativeBatchReader.java

+            filteredSchema = filteredSchema.add(sparkFields[i]);
+          }
+        }
+        sparkSchema = filteredSchema;


Is it possible that the filtering done here may lead to an ArrayIndexOutOfBoundsException at https://github.com/parthchandra/datafusion-comet/blob/d73bcbab9f80836d7229207f309283942501e9ab/common/src/main/java/org/apache/comet/parquet/NativeBatchReader.java#L985 ?
Now that sparkSchema may have fewer fields than before, I see no new logic to protect the .fields()[i] call there.

Yes, you're right. This is not entirely correct. Let me fix this.

Yup. Fixed to match the fields by name.
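
The concern in this thread is that positional access stops being safe once a schema has been filtered, while lookup by name stays valid. A minimal Java sketch of the idea, using plain lists of field names as a stand-in for the Spark schema; the class, method names, and the -1 sentinel are hypothetical and do not reflect the actual NativeBatchReader code:

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

public class FieldsByName {
    // Normalizes a field name according to the case-sensitivity setting.
    static String normalize(String name, boolean caseSensitive) {
        return caseSensitive ? name : name.toLowerCase(Locale.ROOT);
    }

    // Maps each field name in the filtered schema to its position there.
    // If two names collide after normalization, the first one wins.
    static Map<String, Integer> indexByName(List<String> filteredFields, boolean caseSensitive) {
        Map<String, Integer> positions = new HashMap<>();
        for (int i = 0; i < filteredFields.size(); i++) {
            positions.putIfAbsent(normalize(filteredFields.get(i), caseSensitive), i);
        }
        return positions;
    }

    // For every field of the original schema, returns its position in the
    // filtered schema, or -1 if the field was filtered out. No index from the
    // original schema is ever used to read the (possibly shorter) filtered one.
    static int[] resolve(
            List<String> originalFields, List<String> filteredFields, boolean caseSensitive) {
        Map<String, Integer> positions = indexByName(filteredFields, caseSensitive);
        int[] result = new int[originalFields.size()];
        for (int i = 0; i < result.length; i++) {
            Integer pos = positions.get(normalize(originalFields.get(i), caseSensitive));
            result[i] = (pos == null) ? -1 : pos;
        }
        return result;
    }

    public static void main(String[] args) {
        List<String> original = Arrays.asList("id", "name", "ts");
        List<String> filtered = Arrays.asList("id", "ts");
        System.out.println(Arrays.toString(resolve(original, filtered, true)));
    }
}
```

A caller would check for the -1 sentinel before touching the filtered schema, which is the guard that index-based access lacks.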

martin-g · 2025-11-04T13:46:34Z

common/src/main/java/org/apache/comet/parquet/IcebergCometNativeBatchReader.java

+import org.apache.spark.sql.types.StructType;
+
+/**
+ * A specialized NativeBatchReader for Iceberg that accepts ParquetMetadata as a JSON string. This


"accepts ParquetMetadata as a JSON string": actually, it accepts byte[] parquetMetadataBytes at https://github.com/apache/datafusion-comet/pull/2680/files#diff-e57878f6cd8036999500de5719f8f4bbe28e1ed5dcb79a02ad7d7eb206f37473R44, i.e. not a String but bytes.

Thank you for catching this. The first version I did used JSON, but this is more efficient.

martin-g · 2025-11-07T06:49:14Z

@parthchandra You said "Done" but I see no new commits in the PR. Did the push fail?

parthchandra · 2025-11-07T18:26:30Z

@parthchandra You said Done but I see no new commits in the PR. Did the push fail ?

Oops. I had pushed to the wrong branch :(. Corrected.

chore: various refactoring changes for iceberg

d73bcba

parthchandra marked this pull request as draft November 3, 2025 20:56

parthchandra marked this pull request as ready for review November 3, 2025 23:56

parthchandra requested a review from andygrove November 3, 2025 23:57

andygrove reviewed Nov 4, 2025


andygrove approved these changes Nov 4, 2025


martin-g reviewed Nov 4, 2025


andygrove changed the title ~~chore: various refactoring changes for iceberg~~ chore: various refactoring changes for iceberg [iceberg] Nov 6, 2025

parthchandra added 2 commits November 5, 2025 20:08

address comments

d12f4f4

revert change to opendal version

d8cd7b7

