feat(parquet): add schema projection to parquet #159
lidavidm
left a comment
Hmm, not sure what that lint failure is talking about
Me neither :/
I'm guessing one of the GTest or GMock macros expands to something weird.
zhjwpku
left a comment
LGTM, thanks for working on this.
@Fokko @zeroshade Could you help review this? Thanks!
    }
    break;
  case TypeId::kTime:
    if (arrow_type->id() == ::arrow::Type::TIME64) {
Should we also check for ::arrow::TimeUnit::MICRO here?
Good catch! I have added an exhaustive test case to make sure I don't miss any primitive type.
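To illustrate the unit check raised above: Iceberg's `time` type has microsecond precision, so matching on `TIME64` alone would also accept nanosecond data. The sketch below models the idea with stand-in enums and a stand-in `Status` struct (all hypothetical names, so it compiles without the Arrow headers); it is not the code merged in this PR.

```cpp
#include <string>
#include <utility>

// Stand-ins for ::arrow::Type::type and ::arrow::TimeUnit::type (hypothetical).
enum class ArrowTypeId { kTime32, kTime64, kInt64 };
enum class ArrowTimeUnit { kSecond, kMilli, kMicro, kNano };

// Hypothetical flattened view of an Arrow time type.
struct ArrowTimeLike {
  ArrowTypeId id;
  ArrowTimeUnit unit;  // only meaningful for time types
};

// Minimal stand-in for a status object: ok == true means "compatible".
struct Status {
  bool ok;
  std::string message;
};
inline Status Ok() { return Status{true, ""}; }
inline Status Invalid(std::string msg) { return Status{false, std::move(msg)}; }

// Accepts only time64 with microsecond unit; everything else is rejected.
Status ValidateIcebergTime(const ArrowTimeLike& arrow_type) {
  if (arrow_type.id != ArrowTypeId::kTime64) {
    return Invalid("expected a time64 type");
  }
  if (arrow_type.unit != ArrowTimeUnit::kMicro) {
    return Invalid("expected microsecond unit");
  }
  return Ok();
}
```

An exhaustive test in the spirit of the reply would call a check like this once per primitive type, including at least one near-miss (for example `time64` with nanosecond unit) for each.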
    }
    break;
  case TypeId::kDecimal:
    if (arrow_type->id() == ::arrow::Type::DECIMAL128) {
This is still an open PR: apache/arrow#45351
dongxiao1198
left a comment
lgtm
if (arrow_type->id() == ::arrow::Type::FIXED_SIZE_BINARY) {
  const auto& fixed_binary =
      internal::checked_cast<const ::arrow::FixedSizeBinaryType&>(*arrow_type);
  if (fixed_binary.byte_width() == 16) {
    return {};
  }
}
We should probably also allow the Arrow UUID extension type: https://github.com/apache/arrow/blob/main/cpp/src/arrow/extension/uuid.h#L35
You can validate it by checking that arrow_type->id() == ::arrow::Type::EXTENSION and that extension_name() == "arrow.uuid".
Good catch! Fixed.
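For readers following the thread, the combined rule (16-byte fixed-size binary, or an extension type named "arrow.uuid") can be sketched as below. The `ArrowTypeInfo` struct and `ArrowTypeId` enum are hypothetical stand-ins so the example builds without Arrow; the actual fix in this PR may be structured differently.

```cpp
#include <string>

// Stand-in for ::arrow::Type::type (hypothetical).
enum class ArrowTypeId { kFixedSizeBinary, kExtension, kString };

// Hypothetical flattened view of the Arrow type properties used by the check.
struct ArrowTypeInfo {
  ArrowTypeId id;
  int byte_width;              // meaningful for fixed-size binary only
  std::string extension_name;  // meaningful for extension types only
};

// Returns true when the Arrow type can back an Iceberg uuid column:
// either a 16-byte fixed-size binary or the "arrow.uuid" extension type.
bool IsUuidCompatible(const ArrowTypeInfo& type) {
  if (type.id == ArrowTypeId::kFixedSizeBinary) {
    return type.byte_width == 16;
  }
  if (type.id == ArrowTypeId::kExtension) {
    return type.extension_name == "arrow.uuid";
  }
  return false;
}
```

Comparing the extension name rather than only the type id keeps other extension types (which share the same `EXTENSION` id) from being accepted by accident.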
  case TypeId::kString:
    if (arrow_type->id() == ::arrow::Type::STRING) {
      return {};
    }
    break;
  case TypeId::kBinary:
    if (arrow_type->id() == ::arrow::Type::BINARY) {
      return {};
    }
    break;
What about LargeString, LargeBinary, StringView and BinaryView?
I don't think parquet-cpp supports these types yet.
// TODO(gangwu): support v3 unknown type
Status ValidateParquetSchemaEvolution(
    const Type& expected_type, const ::parquet::arrow::SchemaField& parquet_field) {
  const auto& arrow_type = parquet_field.field->type();
I forget offhand whether this will return a DictionaryType for dictionary-encoded columns. If it does, you need to check for the DictionaryType and then switch on its value type.
No, it won't. Reading dictionaries is supported via an option when creating a RecordReader: https://github.com/apache/arrow/blob/2dd3ccda6437f79aa34641bd3197dd7392ae4aec/cpp/src/parquet/column_reader.h#L266
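Per the reply, this unwrapping is not needed on the schema path used here. Purely to illustrate what the suggestion would look like if a field ever did surface as a dictionary type, a minimal sketch with hypothetical stand-in types (`ArrowTypeNode`, `ArrowTypeId`) rather than the real Arrow classes:

```cpp
#include <memory>

// Stand-in for ::arrow::Type::type (hypothetical).
enum class ArrowTypeId { kDictionary, kString, kInt32 };

// Hypothetical type node: dictionary types carry the type of their values.
struct ArrowTypeNode {
  ArrowTypeId id;
  std::shared_ptr<ArrowTypeNode> value_type;  // set only for dictionary types
};

// Returns the type that validation should switch on: the value type for a
// dictionary-encoded field, otherwise the type itself.
const ArrowTypeNode& UnwrapDictionary(const ArrowTypeNode& type) {
  if (type.id == ArrowTypeId::kDictionary && type.value_type != nullptr) {
    return *type.value_type;
  }
  return type;
}
```

The validation switch would then run on the result of `UnwrapDictionary` instead of on the field's declared type.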
    }
    break;
  case TypeId::kList:
    if (arrow_type->id() == ::arrow::Type::LIST) {
What about LargeList and ListView?
ListView is not supported by parquet-cpp yet. I think we should just support simple list and binary type variants in the early versions of iceberg-cpp. Once parquet-cpp has full support, we can leverage them later.
Xuanwo
left a comment
Thank you for working on this, let's move!
