-
Notifications
You must be signed in to change notification settings - Fork 3
Copy Hive and Iceberg code to connectors/lakehouse #440
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
932485f
to
165ff8d
Compare
567c3db
to
a4bb14c
Compare
fix decimal avg function precision issue Alchemy-item: (ID = 310) [6020 ] Spark sql avg agg function support decimal commit 1/1 - a464bf5
Alchemy-item: (ID = 311) [oap ] Register merge extract companion agg functions without suffix commit 1/1 - 2de4df7
Alchemy-item: (ID = 313) [11067] Support scan filter for decimal in ORC commit 1/1 - ecf478c
The function toValues removes duplicated values from the vector and return them in a std::vector. It was used to build an InPredicate. It will be needed for building NOT IN filters for Iceberg equality delete read as well, therefore moving it from velox/functions/prestosql/InPred icate.cpp to velox/type/Filter.h. This commit also renames it to deDuplicateValues to make it easier to understand. Alchemy-item: (ID = 441) Iceberg staging hub commit 1/3 - e951481
This commit introduces EqualityDeleteFileReader, which is used to read Iceberg splits with equality delete files. The equality delete files are read to construct domain filters or filter functions, which then would be evaluated in the base file readers. When there is only one equality delete field, and when that field is an Iceberg identifier field, i.e. non-floating point primitive types, the values would be converted to a list as a NOT IN domain filter, with the NULL treated separately. This domain filter would then be pushed to the ColumnReaders to filter our unwanted rows before they are read into Velox vectors. When the equality delete column is a nested column, e.g. a sub-column in a struct, or the key in a map, such column may not be in the base file ScanSpec. We need to add/remove these subfields to/from the SchemaWithId and ScanSpec recursively if they were not in the ScanSpec already. A test is also added for such case. If there are more than one equality delete field, or the field is not an Iceberg identifier field, the values would be converted to a typed expression in the conjunct of disconjunts form. This expression would be evaluated as the remaining filter function after the rows are read into the Velox vectors. Note that this only works for Presto now as the "neq" function is not registered by Spark. See https://github.com/ facebookincubator/issues/12667 Note that this commit only supports integral types. VARCHAR and VARBINARY need to be supported in future commits (see facebookincubator#12664). Co-authored-by: Naveen Kumar Mahadevuni <[email protected]> Alchemy-item: (ID = 441) Iceberg staging hub commit 2/3 - 8cb17fb
165ff8d
to
aed4bff
Compare
8d3644b
to
9d3fe00
Compare
@@ -15,21 +15,6 @@ | |||
name: Linux Build using GCC | |||
|
|||
on: | |||
push: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why remove these files?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
seems introduced by changes in main branch, it's not related with the iceberg patch
path: '${{ env.CCACHE_DIR }}' | ||
key: ccache-centos7-release-default-${{github.sha}} | ||
|
||
# linux-gcc: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why comment these codes instead of drop it?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
it's just for testing now, will remove this
@@ -194,6 +194,11 @@ class HiveConfig { | |||
static constexpr const char* kPreserveFlatMapsInMemorySession = | |||
"preserve_flat_maps_in_memory"; | |||
|
|||
static constexpr const char* kEnableRequestedTypeCheck = |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
requested-type-check-enabled
connectorQueryCtx, | ||
commitStrategy, | ||
hiveConfig_); | ||
if (auto icebergInsertHandle = |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why the base does not contains IcebergInsertTableHandle?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hi Ying,
I think the base branch might not IBM:main, as you can see there are 14 commits now, but I don't think you intended to include all of them. Maybe you can rebase again.
@zhouyuan @jinchengchenghh @PingLiuPing Yes the PR on main branch is not reviewable. I've rebased several times but the commits keep changing. Could you please review #502 instead? It's based on oss-main. |
This PR is the first step of refactoring Velox’s built-in Iceberg support into its own standalone connector under a new connectors/lakehouse/iceberg hierarchy. By decoupling Iceberg from the Hive connector, we lay the groundwork for a true plugin architecture that:
What’s changed
Currently it is only a plain copy of the existing Hive-based implementations. It is not included in the build and will not break any CI. In follow-up work we’ll update their headers/namespaces, add necessary changes to make them build, remove Hive dependencies, clean up the implementations and add more features.