Skip to content

Conversation

@mbutrovich
Copy link
Contributor

@mbutrovich mbutrovich commented Dec 18, 2025

Which issue does this PR close?

N/A.

Rationale for this change

For the initial implementation of Iceberg native scan we serialized FileScanTask objects on the JVM side 1:1 to the native side for their contents. This results in a lot of duplicate work, particularly with JSON strings like schema. For hundreds, thousands, or even millions of tasks we will 1) convert the schema to JSON, 2) serialize it to protobuf, 3) deserialize it from protobuf, 4) parse it back to JSON.

This is all fine for getting correctness working for Iceberg native scans, but we want to start optimizing this for production scale.

What changes are included in this PR?

Many IcebergScan protobuf fields are now "pools" of deduplicated values, and IcebergFileScanTask references indices in these pools to extract values from. On the native side, we now cache the extracted values to reduce duplicate JSON parsing.

How are these changes tested?

Existing tests.

@mbutrovich mbutrovich changed the title perf: [iceberg] Deduplicate serialized Iceberg metadata perf: [iceberg] Deduplicate serialized Iceberg native scan metadata Dec 18, 2025
@mbutrovich mbutrovich changed the title perf: [iceberg] Deduplicate serialized Iceberg native scan metadata perf: [iceberg] Deduplicate serialized metadata for Iceberg native scan Dec 18, 2025
@codecov-commenter
Copy link

codecov-commenter commented Dec 18, 2025

Codecov Report

❌ Patch coverage is 94.28571% with 6 lines in your changes missing coverage. Please review.
✅ Project coverage is 59.61%. Comparing base (f09f8af) to head (6d68c96).
⚠️ Report is 775 commits behind head on main.

Files with missing lines Patch % Lines
.../comet/serde/operator/CometIcebergNativeScan.scala 94.28% 5 Missing and 1 partial ⚠️
Additional details and impacted files
@@             Coverage Diff              @@
##               main    #2933      +/-   ##
============================================
+ Coverage     56.12%   59.61%   +3.48%     
- Complexity      976     1379     +403     
============================================
  Files           119      167      +48     
  Lines         11743    15430    +3687     
  Branches       2251     2550     +299     
============================================
+ Hits           6591     9198    +2607     
- Misses         4012     4946     +934     
- Partials       1140     1286     +146     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

@mbutrovich
Copy link
Contributor Author

I ran TPC-H SF10 locally and saw:

IcebergScan: 26 tasks, 8 pools (96.2% avg dedup)
IcebergScan: 2 tasks, 8 pools (50.0% avg dedup)
IcebergScan: 8 tasks, 8 pools (87.5% avg dedup)
IcebergScan: 8 tasks, 8 pools (87.5% avg dedup)
IcebergScan: 1 tasks, 8 pools (0.0% avg dedup)
IcebergScan: 1 tasks, 8 pools (0.0% avg dedup)
IcebergScan: 1 tasks, 8 pools (0.0% avg dedup)
IcebergScan: 1 tasks, 8 pools (0.0% avg dedup)
IcebergScan: 1 tasks, 8 pools (0.0% avg dedup)
IcebergScan: 1 tasks, 8 pools (0.0% avg dedup)
IcebergScan: 8 tasks, 8 pools (87.5% avg dedup)
IcebergScan: 8 tasks, 8 pools (87.5% avg dedup)
IcebergScan: 8 tasks, 8 pools (87.5% avg dedup)
IcebergScan: 8 tasks, 8 pools (87.5% avg dedup)
IcebergScan: 2 tasks, 8 pools (50.0% avg dedup)
IcebergScan: 15 tasks, 8 pools (93.3% avg dedup)
IcebergScan: 26 tasks, 8 pools (96.2% avg dedup)
IcebergScan: 15 tasks, 8 pools (93.3% avg dedup)
IcebergScan: 26 tasks, 8 pools (96.2% avg dedup)
IcebergScan: 2 tasks, 8 pools (50.0% avg dedup)
IcebergScan: 15 tasks, 8 pools (93.3% avg dedup)
IcebergScan: 26 tasks, 8 pools (96.2% avg dedup)
IcebergScan: 1 tasks, 8 pools (0.0% avg dedup)
IcebergScan: 1 tasks, 8 pools (0.0% avg dedup)
IcebergScan: 1 tasks, 8 pools (0.0% avg dedup)
IcebergScan: 26 tasks, 8 pools (96.2% avg dedup)
IcebergScan: 1 tasks, 8 pools (0.0% avg dedup)
IcebergScan: 26 tasks, 8 pools (96.2% avg dedup)
IcebergScan: 15 tasks, 8 pools (93.3% avg dedup)
IcebergScan: 2 tasks, 8 pools (50.0% avg dedup)
IcebergScan: 1 tasks, 8 pools (0.0% avg dedup)
IcebergScan: 1 tasks, 8 pools (0.0% avg dedup)
IcebergScan: 26 tasks, 8 pools (96.2% avg dedup)
IcebergScan: 26 tasks, 8 pools (96.2% avg dedup)
IcebergScan: 2 tasks, 8 pools (50.0% avg dedup)
IcebergScan: 26 tasks, 8 pools (96.2% avg dedup)
IcebergScan: 1 tasks, 8 pools (0.0% avg dedup)
IcebergScan: 15 tasks, 8 pools (93.3% avg dedup)
IcebergScan: 2 tasks, 8 pools (50.0% avg dedup)
IcebergScan: 1 tasks, 8 pools (0.0% avg dedup)
IcebergScan: 1 tasks, 8 pools (0.0% avg dedup)
IcebergScan: 1 tasks, 8 pools (0.0% avg dedup)
IcebergScan: 2 tasks, 8 pools (50.0% avg dedup)
IcebergScan: 26 tasks, 8 pools (96.2% avg dedup)
IcebergScan: 1 tasks, 8 pools (0.0% avg dedup)
IcebergScan: 8 tasks, 8 pools (87.5% avg dedup)
IcebergScan: 15 tasks, 8 pools (93.3% avg dedup)
IcebergScan: 1 tasks, 8 pools (0.0% avg dedup)
IcebergScan: 2 tasks, 8 pools (50.0% avg dedup)
IcebergScan: 15 tasks, 8 pools (93.3% avg dedup)
IcebergScan: 26 tasks, 8 pools (96.2% avg dedup)
IcebergScan: 1 tasks, 8 pools (0.0% avg dedup)
IcebergScan: 8 tasks, 8 pools (87.5% avg dedup)
IcebergScan: 1 tasks, 8 pools (0.0% avg dedup)
IcebergScan: 1 tasks, 8 pools (0.0% avg dedup)
IcebergScan: 8 tasks, 8 pools (87.5% avg dedup)
IcebergScan: 1 tasks, 8 pools (0.0% avg dedup)
IcebergScan: 1 tasks, 8 pools (0.0% avg dedup)
IcebergScan: 8 tasks, 8 pools (87.5% avg dedup)
IcebergScan: 8 tasks, 8 pools (87.5% avg dedup)
IcebergScan: 8 tasks, 8 pools (87.5% avg dedup)
IcebergScan: 8 tasks, 8 pools (87.5% avg dedup)
IcebergScan: 15 tasks, 8 pools (93.3% avg dedup)
IcebergScan: 26 tasks, 8 pools (96.2% avg dedup)
IcebergScan: 2 tasks, 8 pools (50.0% avg dedup)
IcebergScan: 15 tasks, 8 pools (93.3% avg dedup)
IcebergScan: 26 tasks, 8 pools (96.2% avg dedup)
IcebergScan: 2 tasks, 8 pools (50.0% avg dedup)
IcebergScan: 26 tasks, 8 pools (96.2% avg dedup)
IcebergScan: 1 tasks, 8 pools (0.0% avg dedup)
IcebergScan: 26 tasks, 8 pools (96.2% avg dedup)
IcebergScan: 8 tasks, 8 pools (87.5% avg dedup)
IcebergScan: 1 tasks, 8 pools (0.0% avg dedup)
IcebergScan: 2 tasks, 8 pools (50.0% avg dedup)
IcebergScan: 8 tasks, 8 pools (87.5% avg dedup)
IcebergScan: 26 tasks, 8 pools (96.2% avg dedup)
IcebergScan: 2 tasks, 8 pools (50.0% avg dedup)
IcebergScan: 26 tasks, 8 pools (96.2% avg dedup)
IcebergScan: 2 tasks, 8 pools (50.0% avg dedup)
IcebergScan: 15 tasks, 8 pools (93.3% avg dedup)
IcebergScan: 26 tasks, 8 pools (96.2% avg dedup)
IcebergScan: 26 tasks, 8 pools (96.2% avg dedup)
IcebergScan: 26 tasks, 8 pools (96.2% avg dedup)
IcebergScan: 26 tasks, 8 pools (96.2% avg dedup)
IcebergScan: 2 tasks, 8 pools (50.0% avg dedup)
IcebergScan: 1 tasks, 8 pools (0.0% avg dedup)
IcebergScan: 8 tasks, 8 pools (87.5% avg dedup)
IcebergScan: 2 tasks, 8 pools (50.0% avg dedup)
IcebergScan: 26 tasks, 8 pools (96.2% avg dedup)
IcebergScan: 2 tasks, 8 pools (50.0% avg dedup)
IcebergScan: 1 tasks, 8 pools (0.0% avg dedup)
IcebergScan: 1 tasks, 8 pools (0.0% avg dedup)
IcebergScan: 26 tasks, 8 pools (96.2% avg dedup)
IcebergScan: 26 tasks, 8 pools (96.2% avg dedup)
IcebergScan: 26 tasks, 8 pools (96.2% avg dedup)
IcebergScan: 15 tasks, 8 pools (93.3% avg dedup)
IcebergScan: 1 tasks, 8 pools (0.0% avg dedup)
IcebergScan: 2 tasks, 8 pools (50.0% avg dedup)
IcebergScan: 2 tasks, 8 pools (50.0% avg dedup)
IcebergScan: 15 tasks, 8 pools (93.3% avg dedup)

Copy link
Member

@andygrove andygrove left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. Thanks @mbutrovich

// Table metadata file path for FileIO initialization
string metadata_location = 4;

// Deduplication pools - shared data referenced by index from tasks
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍

@mbutrovich mbutrovich self-assigned this Dec 18, 2025
Copy link
Contributor

@comphead comphead left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @mbutrovich

@mbutrovich mbutrovich merged commit 53e4092 into apache:main Dec 18, 2025
283 of 287 checks passed
@mbutrovich mbutrovich deleted the dedup_iceberg_metadata branch December 18, 2025 17:42
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants