-
Notifications
You must be signed in to change notification settings - Fork 69
feat: convert iceberg schema to arrow schema #53
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
src/iceberg/arrow_c_data.h
Outdated
| #ifndef ARROW_C_STREAM_INTERFACE | ||
| # define ARROW_C_STREAM_INTERFACE | ||
|
|
||
| struct ArrowArrayStream { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we mention [0] in the \file description?
I have a feeling that this is not related to PR $titile, so maybe another PR for this?
[0] https://arrow.apache.org/docs/format/CStreamInterface.html
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have to add this, otherwise compiling files with headers from nanoarrow will complain missing definition of ArrowArrayStream.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK, then the patch LGTM.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have removed ArrowArrayStream. I think internally we can just include nanoarrow.h so we don't have to add ArrowArrayStream and confuse the downstream.
test/arrow_test.cc
Outdated
| ASSERT_EQ(imported_schema->num_fields(), 1); | ||
|
|
||
| auto field = imported_schema->field(0); | ||
| CheckArrowField(*field, ::arrow::Type::STRUCT, kStructFieldName, /*nullable=*/false, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
technically you should ASSERT_NO_FATAL_FAILURE(CheckArrowField(...))
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You're right. I forgot to add them.
test/test_util.h
Outdated
|
|
||
| #pragma once | ||
|
|
||
| #define EXPECT_OK(value) \ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Note that you can write a GMock matcher using one of the macros GMock provides, then you can do something like ASSERT_THAT(Foo(), IsOk()). A macro might be simpler, though.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good suggestion! Added matchers.h for this.
src/iceberg/schema_internal.cc
Outdated
| std::string_view name = "", int32_t field_id = -1) { | ||
| ArrowBuffer metadata_buffer; | ||
| NANOARROW_RETURN_NOT_OK(ArrowMetadataBuilderInit(&metadata_buffer, nullptr)); | ||
| if (field_id > 0) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
BTW, where in the Iceberg spec does it say that field IDs must be nonzero? (I don't even see a bit width defined for field ID in the spec...)
optional might be a more idiomatic way to write the type but this works (if we can find a reference for the field ID type)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good question! In practice, parquet-cpp rejects negative field_id: https://github.com/apache/arrow/blob/618ef501a21375abfaeee19e393eb64dee83ef0d/cpp/src/parquet/arrow/schema.cc#L248-L278
@Fokko @rdblue Has this been discussed before? Should we restrict field_id to be non-negative from the spec?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm, should field ID 0 be valid then though?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've switched to use std::optional<int32_t> for now.
zhjwpku
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
Xuanwo
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hi, thank @wgtmac for working on this. I have some dumb questions over this PR.
src/iceberg/schema_internal.cc
Outdated
| constexpr const char* kArrowExtensionMetadata = "ARROW:extension:metadata"; | ||
|
|
||
| // Convert an Iceberg type to Arrow schema. Return value is Nanoarrow error code. | ||
| ArrowErrorCode ConvertToArrowSchema(const Type& type, ArrowSchema* schema, bool optional, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ConvertToArrowSchema and ToArrowSchema are a bit confusing to me. Do we need to make the API naming more clear or split them in different namespaces?
By the way, do we have a policy for our API args' order?
input_field_a, out, input_field_b, input_field_c seems not good to me and hard to follow.
Is it widely adopted in cpp community?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That makes sense. Actually they are in different namespaces (the former one is in the anonymous namespace). I have renamed them to ToArrowSchema to be consistent.
W.r.t. the order of arguments, those with default values must be placed at the end. Since they are all internal functions, I have just removed any default argument and put the out param to the end.
Xuanwo
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you, make sense to me
|
Thank you @Xuanwo! |
No description provided.