perf: [iceberg] Deduplicate serialized metadata for Iceberg native scan #2933
Conversation
Codecov Report
❌ Patch coverage is
@@             Coverage Diff              @@
##               main    #2933      +/-   ##
============================================
+ Coverage     56.12%   59.61%    +3.48%
- Complexity      976     1379     +403
============================================
  Files           119      167      +48
  Lines         11743    15430    +3687
  Branches       2251     2550     +299
============================================
+ Hits           6591     9198    +2607
- Misses         4012     4946     +934
- Partials       1140     1286     +146
I ran TPC-H SF10 locally and saw:
andygrove
left a comment
LGTM. Thanks @mbutrovich
// Table metadata file path for FileIO initialization
string metadata_location = 4;

// Deduplication pools - shared data referenced by index from tasks
👍
comphead
left a comment
Thanks @mbutrovich
Which issue does this PR close?
N/A.
Rationale for this change
For the initial implementation of Iceberg native scan we serialized FileScanTask objects 1:1 from the JVM side to the native side, contents and all. This results in a lot of duplicate work, particularly with JSON strings like the schema. For hundreds, thousands, or even millions of tasks we will 1) convert the schema to JSON, 2) serialize it to protobuf, 3) deserialize it from protobuf, and 4) parse it back from JSON. This was all fine for getting correctness working for Iceberg native scans, but we want to start optimizing this for production scale.
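To make the redundancy concrete, here is a minimal sketch of the per-task layout this PR moves away from. The names are hypothetical stand-ins, not Comet's actual types: the point is that when every serialized task inlines its own copy of the schema JSON, the native side parses identical strings once per task.

```rust
/// Stand-in for the parsed schema type (hypothetical).
struct Schema(String);

fn parse_schema_json(json: &str) -> Schema {
    Schema(json.to_string()) // placeholder for real JSON parsing
}

/// Pre-change shape (hypothetical): each task inlines its own schema JSON.
struct NaiveFileScanTask {
    data_file_path: String,
    schema_json: String, // duplicated across every task that shares a schema
}

/// N tasks with identical schemas cost N identical JSON parses.
fn deserialize_all(tasks: &[NaiveFileScanTask]) -> Vec<Schema> {
    tasks.iter().map(|t| parse_schema_json(&t.schema_json)).collect()
}
```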
What changes are included in this PR?
Many IcebergScan protobuf fields are now "pools" of deduplicated values, and IcebergFileScanTask references indices into these pools to extract values from. On the native side, we now cache the extracted values to reduce duplicate JSON parsing; a sketch of the pooled layout and cache follows the testing note below.
How are these changes tested?
Existing tests.
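Below is a minimal sketch of the pool-plus-index design with a native-side parse cache. The struct and field names here are hypothetical stand-ins; the real definitions live in Comet's protobuf messages and native scan code.

```rust
use std::collections::HashMap;
use std::sync::Arc;

/// Stand-in for the parsed schema type (hypothetical).
struct Schema(String);

fn parse_schema_json(json: &str) -> Schema {
    Schema(json.to_string()) // placeholder for real JSON parsing
}

/// Deduplicated pool of schema JSON strings, serialized once per distinct
/// schema instead of once per task.
struct IcebergScan {
    schema_json_pool: Vec<String>,
    tasks: Vec<IcebergFileScanTask>,
}

/// Tasks carry a small index into the pool rather than an inline copy.
struct IcebergFileScanTask {
    data_file_path: String,
    schema_idx: u32,
}

/// Native-side cache: each pool entry is parsed at most once, however many
/// tasks reference it.
struct SchemaCache {
    parsed: HashMap<u32, Arc<Schema>>,
}

impl SchemaCache {
    fn get_or_parse(&mut self, scan: &IcebergScan, idx: u32) -> Arc<Schema> {
        Arc::clone(self.parsed.entry(idx).or_insert_with(|| {
            Arc::new(parse_schema_json(&scan.schema_json_pool[idx as usize]))
        }))
    }
}
```

With this shape, each distinct schema is serialized and parsed once regardless of task count, and other repeated metadata fields can be pooled the same way.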