docs: Add detailed diagrams to contributor guide for all Parquet scan implementations #2681
base: main
Conversation
@mbutrovich @parthchandra This is low priority, but could you review when you get a chance?
│ 2. For each column:           │
│    - Get ColumnDescriptor     │
│    - Read pages via           │
│      PageReadStore            │
│    - Create CometVector       │
│      from native data         │
│ 3. Return ColumnarBatch       │
└───────────┬───────────────────┘
            │
            │ Uses JNI to access native decoders
            │ (not for page reading, only for
            │  specialized operations if needed)
Between steps 2 and 3 there is a set of JNI calls to decode the individual columns.
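For illustration, a hedged sketch of that call pattern; `decodeColumn` and the handle layout are hypothetical and do not match Comet's actual native bindings:

```java
// Hypothetical sketch of the per-column JNI decode loop described above.
// The method name, handle, and address layout are illustrative only.
public class ColumnDecodeSketch {
  // Assumed JNI binding: decode the next `batchSize` values for one column of
  // the native reader identified by `nativeHandle`, returning the address of
  // the decoded native (Arrow-compatible) data.
  private static native long decodeColumn(long nativeHandle, int columnIndex, int batchSize);

  static long[] decodeAllColumns(long nativeHandle, int numColumns, int batchSize) {
    long[] columnAddresses = new long[numColumns];
    for (int i = 0; i < numColumns; i++) {
      // One JNI call per column; the JVM side then wraps each address in a
      // CometVector before assembling the ColumnarBatch (step 3).
      columnAddresses[i] = decodeColumn(nativeHandle, i, batchSize);
    }
    return columnAddresses;
  }
}
```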
│ via CometBatchIterator        │
│                               │
│ Key operations:               │
│ ├─ next_batch()               │
There is one more JNI call here: ScanExec.get_next makes a call back to CometBatchIterator, which exports the batch back to native (so an FFI call).
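A minimal sketch of that callback shape, assuming the Arrow Java C Data Interface (`org.apache.arrow.c.Data`); the class and method names here are illustrative stand-ins, not Comet's exact CometBatchIterator API:

```java
import org.apache.arrow.c.ArrowArray;
import org.apache.arrow.c.ArrowSchema;
import org.apache.arrow.c.Data;
import org.apache.arrow.memory.BufferAllocator;
import org.apache.arrow.vector.VectorSchemaRoot;

import java.util.Iterator;

// Illustrative stand-in for the JVM-side batch iterator. Native ScanExec.get_next
// calls back into this class over JNI, passing addresses of ArrowArray/ArrowSchema
// structs it allocated; the JVM exports the next batch into them (the FFI call).
class BatchIteratorSketch {
  private final BufferAllocator allocator;
  private final Iterator<VectorSchemaRoot> batches;

  BatchIteratorSketch(BufferAllocator allocator, Iterator<VectorSchemaRoot> batches) {
    this.allocator = allocator;
    this.batches = batches;
  }

  // Returns the row count of the exported batch, or -1 when no batches remain.
  int next(long arrayAddress, long schemaAddress) {
    if (!batches.hasNext()) {
      return -1;
    }
    VectorSchemaRoot root = batches.next();
    ArrowArray array = ArrowArray.wrap(arrayAddress);
    ArrowSchema schema = ArrowSchema.wrap(schemaAddress);
    // Export via the Arrow C Data Interface; the native caller takes ownership
    // of the exported buffers through the release callback.
    Data.exportVectorSchemaRoot(allocator, root, /* dictionaryProvider */ null, array, schema);
    return root.getRowCount();
  }
}
```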
│ Key method:                   │
│ init(AbstractColumnReader[])  │  Iceberg provides column readers
│                               │
│ Purpose:                      │
This is not correct (this is how the current integration of native_comet and Iceberg works). native_iceberg_compat uses the same init_datasource_exec call to create a DataFusion DataSourceExec and wraps the native batch and native columns in corresponding classes on the JVM side (NativeBatchReader and NativeColumnReader).
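A hedged sketch of that flow; the method names and signatures below are assumptions for illustration, not the real NativeBatchReader/NativeColumnReader or init_datasource_exec bindings:

```java
// Illustrative-only sketch of the native_iceberg_compat flow: the JVM asks the
// native side to build a DataFusion DataSourceExec once, then repeatedly pulls
// batches from it and wraps the resulting native Arrow data in JVM-side reader
// classes. All names and signatures here are hypothetical.
class NativeIcebergCompatSketch {
  // Assumed JNI bindings (not Comet's real signatures).
  private static native long initDatasourceExec(byte[] serializedScan);
  private static native int nextBatch(long execHandle, long arrowArrayAddr, long arrowSchemaAddr);

  private long execHandle;

  void init(byte[] serializedScan) {
    // Same entry point as the regular native scan: build the DataSourceExec
    // natively instead of accepting externally supplied column readers.
    execHandle = initDatasourceExec(serializedScan);
  }

  // Pull one batch; the exported native data would be wrapped in classes that
  // play the role of NativeBatchReader / NativeColumnReader on the JVM side.
  int loadNextBatch(long arrowArrayAddr, long arrowSchemaAddr) {
    return nextBatch(execHandle, arrowArrayAddr, arrowSchemaAddr);
  }
}
```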
            │
            ↓
┌───────────────────────────────┐
│ AbstractColumnReader[]        │  Iceberg-managed column readers
For Iceberg, we will use IcebergCometNativeBatchReader, which does not pass in column readers.
The Iceberg integration is not done (and may not be done this way) anyway, so there must be a way for native_iceberg_compat to work without Iceberg.
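Purely for illustration, the difference between the two initialization styles might look like the following pair of entry points; neither signature is taken from the actual Comet or Iceberg code:

```java
// Hypothetical contrast between the two initialization styles discussed above.
abstract class BatchReaderSketch {
  // native_comet-style integration: Iceberg supplies per-column readers.
  abstract void init(AbstractColumnReaderSketch[] icebergColumnReaders);

  // IcebergCometNativeBatchReader-style integration: no column readers are
  // passed in; the native DataSourceExec produces whole batches that are
  // wrapped on the JVM side.
  abstract void init(byte[] serializedScanPlan);
}

// Placeholder type so the sketch is self-contained.
abstract class AbstractColumnReaderSketch {}
```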
Which issue does this PR close?
Addresses PR review feedback in #2674
Rationale for this change
Add more detailed documentation explaining how scans are implemented.
What changes are included in this PR?
How are these changes tested?