Skip to content

Conversation

@wgtmac
Copy link
Member

@wgtmac wgtmac commented Aug 4, 2025

No description provided.

@wgtmac wgtmac force-pushed the parquet_projection branch from 8968c66 to 80a629b Compare August 4, 2025 15:37
@wgtmac wgtmac force-pushed the parquet_projection branch from 80a629b to 316f42a Compare August 4, 2025 16:16
Copy link
Member

@lidavidm lidavidm left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm, not sure what that lint failure is talking about

@wgtmac
Copy link
Member Author

wgtmac commented Aug 5, 2025

Hmm, not sure what that lint failure is talking about

Me neither :/

@lidavidm
Copy link
Member

lidavidm commented Aug 5, 2025

I'm guessing one of the GTest or GMock macros expands to something weird.

google/googletest#4531

@wgtmac wgtmac force-pushed the parquet_projection branch from b24dd93 to b287a6f Compare August 5, 2025 05:41
@wgtmac wgtmac force-pushed the parquet_projection branch 4 times, most recently from ceb4b1c to 6af2b8f Compare August 5, 2025 09:07
@wgtmac wgtmac force-pushed the parquet_projection branch 2 times, most recently from 1311ee6 to a94e6a0 Compare August 5, 2025 09:48
@wgtmac wgtmac force-pushed the parquet_projection branch 5 times, most recently from 472d2ee to ad10678 Compare August 5, 2025 16:08
@wgtmac wgtmac force-pushed the parquet_projection branch 7 times, most recently from a4531f0 to ca51b68 Compare August 6, 2025 05:50
@wgtmac wgtmac force-pushed the parquet_projection branch 2 times, most recently from c2f5b56 to 2ea602a Compare August 6, 2025 06:42
@wgtmac wgtmac force-pushed the parquet_projection branch from 2ea602a to b42bda5 Compare August 6, 2025 15:10
Copy link
Collaborator

@zhjwpku zhjwpku left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, thanks for working on this.

@wgtmac
Copy link
Member Author

wgtmac commented Aug 7, 2025

@Fokko @zeroshade Could you help review this? Thanks!

}
break;
case TypeId::kTime:
if (arrow_type->id() == ::arrow::Type::TIME64) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we also check for ::arrow::TimeUnit::MICRO here?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch! I have added an exhaustive test case to make sure I don't miss any primitive type.

}
break;
case TypeId::kDecimal:
if (arrow_type->id() == ::arrow::Type::DECIMAL128) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we also accept the other types?

image

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is still an open PR: apache/arrow#45351

@wgtmac wgtmac requested a review from Fokko August 9, 2025 06:34
Copy link
Contributor

@dongxiao1198 dongxiao1198 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm

Comment on lines 154 to 160
if (arrow_type->id() == ::arrow::Type::FIXED_SIZE_BINARY) {
const auto& fixed_binary =
internal::checked_cast<const ::arrow::FixedSizeBinaryType&>(*arrow_type);
if (fixed_binary.byte_width() == 16) {
return {};
}
}
Copy link
Member

@zeroshade zeroshade Aug 11, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should probably also allow https://github.com/apache/arrow/blob/main/cpp/src/arrow/extension/uuid.h#L35

You can validate via arrow_type->id() == ::arrow::Type::EXTENSION and the extension_name() == "arrow.uuid"

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch! Fixed.

Comment on lines +131 to +140
case TypeId::kString:
if (arrow_type->id() == ::arrow::Type::STRING) {
return {};
}
break;
case TypeId::kBinary:
if (arrow_type->id() == ::arrow::Type::BINARY) {
return {};
}
break;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What about LargeString, LargeBinary, StringView and BinaryView?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think parquet-cpp has supported these types.

// TODO(gangwu): support v3 unknown type
Status ValidateParquetSchemaEvolution(
const Type& expected_type, const ::parquet::arrow::SchemaField& parquet_field) {
const auto& arrow_type = parquet_field.field->type();
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I forget offhand if this will return a DictionaryType for dictionary encoded columns, if so then you need to check for the DictionaryType and then switch on the ValueType of it.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No it won't. Reading dictionary is supported via an option to create a RecordReader: https://github.com/apache/arrow/blob/2dd3ccda6437f79aa34641bd3197dd7392ae4aec/cpp/src/parquet/column_reader.h#L266

}
break;
case TypeId::kList:
if (arrow_type->id() == ::arrow::Type::LIST) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What about LargeList and ListView?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ListView is not supported by parquet-cpp yet. I think we should just support simple list and binary type variants in the early versions of iceberg-cpp. Once parquet-cpp has full support, we can leverage them later.

@wgtmac wgtmac force-pushed the parquet_projection branch 2 times, most recently from 89410b3 to b686ccb Compare August 12, 2025 08:07
@wgtmac wgtmac force-pushed the parquet_projection branch from b686ccb to bc9d8be Compare August 12, 2025 09:40
@wgtmac wgtmac changed the title feat(parquet): add schema projection for parquet feat(parquet): add schema projection to parquet Aug 13, 2025
Copy link
Member

@Xuanwo Xuanwo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for working on this, let's move!

@Xuanwo Xuanwo merged commit 7063f2b into apache:main Aug 13, 2025
7 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

8 participants