feat: convert iceberg schema to arrow schema #53

wgtmac · 2025-03-28T07:17:48Z

No description provided.

wgtmac · 2025-03-28T08:19:26Z

zhjwpku · 2025-03-28T11:48:44Z

src/iceberg/arrow_c_data.h

+#ifndef ARROW_C_STREAM_INTERFACE
+#  define ARROW_C_STREAM_INTERFACE
+
+struct ArrowArrayStream {


Should we mention [0] in the \file description?

I have a feeling that this is not related to the PR title, so maybe another PR for this?

[0] https://arrow.apache.org/docs/format/CStreamInterface.html

I have to add this; otherwise, compiling files that include nanoarrow headers will complain about a missing definition of ArrowArrayStream.

OK, then the patch LGTM.

I have removed ArrowArrayStream. I think internally we can just include nanoarrow.h, so we don't have to add ArrowArrayStream and risk confusing downstream users.
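
For readers following along, the guard under discussion follows the pattern in the Arrow C stream interface specification linked as [0]. The sketch below reproduces that struct layout as the spec defines it; the IsReleased helper is hypothetical and only shows the "release is null once released" convention. It is not the code that was in the PR, which was removed in favor of including nanoarrow.h.

```cpp
// Forward declarations; the full definitions belong to the Arrow C data interface.
struct ArrowSchema;
struct ArrowArray;

#ifndef ARROW_C_STREAM_INTERFACE
#  define ARROW_C_STREAM_INTERFACE

// Layout as given in the Arrow C stream interface specification.
struct ArrowArrayStream {
  int (*get_schema)(struct ArrowArrayStream*, struct ArrowSchema* out);
  int (*get_next)(struct ArrowArrayStream*, struct ArrowArray* out);
  const char* (*get_last_error)(struct ArrowArrayStream*);
  void (*release)(struct ArrowArrayStream*);
  void* private_data;
};

#endif  // ARROW_C_STREAM_INTERFACE

// Hypothetical helper (not from the PR): a stream whose release callback is
// null is considered released.
inline bool IsReleased(const ArrowArrayStream& stream) {
  return stream.release == nullptr;
}
```

The include guard lets several headers carry the same definition without redefinition errors, which is why a header that defines the macro first wins.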

lidavidm · 2025-03-28T12:31:41Z

test/arrow_test.cc

+  ASSERT_EQ(imported_schema->num_fields(), 1);
+
+  auto field = imported_schema->field(0);
+  CheckArrowField(*field, ::arrow::Type::STRUCT, kStructFieldName, /*nullable=*/false,


technically you should ASSERT_NO_FATAL_FAILURE(CheckArrowField(...))

You're right. I forgot to add them.

lidavidm · 2025-03-28T12:32:50Z

test/test_util.h

+
+#pragma once
+
+#define EXPECT_OK(value) \


Note that you can write a GMock matcher using one of the macros GMock provides, then you can do something like ASSERT_THAT(Foo(), IsOk()). A macro might be simpler, though.

Good suggestion! Added matchers.h for this.

lidavidm · 2025-03-28T12:35:48Z

src/iceberg/schema_internal.cc

+                                    std::string_view name = "", int32_t field_id = -1) {
+  ArrowBuffer metadata_buffer;
+  NANOARROW_RETURN_NOT_OK(ArrowMetadataBuilderInit(&metadata_buffer, nullptr));
+  if (field_id > 0) {


BTW, where in the Iceberg spec does it say that field IDs must be nonzero? (I don't even see a bit width defined for field ID in the spec...)

std::optional might be a more idiomatic way to write the type, but this works (if we can find a reference for the field ID type)

Good question! In practice, parquet-cpp rejects negative field_id: https://github.com/apache/arrow/blob/618ef501a21375abfaeee19e393eb64dee83ef0d/cpp/src/parquet/arrow/schema.cc#L248-L278

@Fokko @rdblue Has this been discussed before? Should we restrict field_id to be non-negative in the spec?

Hmm, should field ID 0 be valid then though?

I've switched to use std::optional<int32_t> for now.
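
To illustrate the switch, here is a sketch of how std::optional<int32_t> removes the need for a sentinel such as -1, so that field ID 0 no longer has to be treated specially. BuildFieldMetadata is a hypothetical helper that uses a plain map instead of nanoarrow's metadata builder, and the "PARQUET:field_id" key is borrowed from the parquet-cpp convention as an assumption, not confirmed for this PR.

```cpp
#include <cstdint>
#include <map>
#include <optional>
#include <string>

// Hypothetical helper: emits a field-id entry only when an ID was supplied.
// The key name is an assumption taken from the parquet-cpp convention.
std::map<std::string, std::string> BuildFieldMetadata(
    std::optional<int32_t> field_id) {
  std::map<std::string, std::string> metadata;
  if (field_id.has_value()) {
    // No `field_id > 0` check: absence is expressed by std::nullopt, so 0 is
    // written like any other value.
    metadata.emplace("PARQUET:field_id", std::to_string(*field_id));
  }
  return metadata;
}
```

Whether 0 or negative IDs should be rejected is then a separate validation question, which is what the spec discussion above is about.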

zhjwpku

LGTM

wgtmac · 2025-03-31T02:50:37Z

Thanks @lidavidm @zhjwpku! I believe I have addressed all the comments now.

wgtmac · 2025-03-31T06:52:49Z

@Fokko @Xuanwo Could you help review and merge it? Thanks!

Xuanwo

Hi, thanks @wgtmac for working on this. I have some dumb questions about this PR.

Xuanwo · 2025-03-31T09:20:56Z

src/iceberg/schema_internal.cc

+constexpr const char* kArrowExtensionMetadata = "ARROW:extension:metadata";
+
+// Convert an Iceberg type to Arrow schema. Return value is Nanoarrow error code.
+ArrowErrorCode ConvertToArrowSchema(const Type& type, ArrowSchema* schema, bool optional,


ConvertToArrowSchema and ToArrowSchema are a bit confusing to me. Do we need to make the API naming clearer or split them into different namespaces?

By the way, do we have a policy for the order of our API arguments?

input_field_a, out, input_field_b, input_field_c doesn't look good to me and is hard to follow.

Is it widely adopted in the C++ community?

That makes sense. Actually they are in different namespaces (the former one is in the anonymous namespace). I have renamed them to ToArrowSchema to be consistent.

W.r.t. the order of arguments, those with default values must be placed at the end. Since they are all internal functions, I have just removed any default argument and put the out param to the end.
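
A sketch of the resulting shape: an internal overload in the anonymous namespace and a public one with the same name, both taking inputs first, no default arguments, and the output parameter last. Type, Schema, OutSchema, and the int error codes are hypothetical stand-ins, not the PR's actual types or signatures.

```cpp
#include <cstdint>
#include <optional>
#include <string>
#include <string_view>

// Hypothetical stand-ins for the Iceberg type, schema, and Arrow output.
struct Type { std::string format; };
struct Schema { Type root; };
struct OutSchema {
  std::string format;
  std::string name;
  bool nullable = false;
  std::optional<int32_t> field_id;
};

namespace {
// Internal overload: every input is explicit and the out param comes last.
int ToArrowSchema(const Type& type, bool optional, std::string_view name,
                  std::optional<int32_t> field_id, OutSchema* out) {
  if (out == nullptr) return 1;  // non-zero signals an error in this sketch
  out->format = type.format;
  out->name = std::string(name);
  out->nullable = optional;
  out->field_id = field_id;
  return 0;
}
}  // namespace

// Public entry point with the same name; overload resolution picks by arguments.
int ToArrowSchema(const Schema& schema, OutSchema* out) {
  return ToArrowSchema(schema.root, /*optional=*/false, /*name=*/"",
                       /*field_id=*/std::nullopt, out);
}
```

Dropping the defaults costs a few extra arguments at internal call sites but keeps the out parameter in a predictable position.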

Xuanwo

Thank you, makes sense to me.

wgtmac · 2025-03-31T14:01:07Z

Thank you @Xuanwo!

feat: convert iceberg schema to arrow schema

9f7f079

wgtmac force-pushed the to_arrow_schema branch from f0d8e61 to 9f7f079 on March 28, 2025 07:18

zhjwpku reviewed Mar 28, 2025

lidavidm approved these changes Mar 28, 2025

zhjwpku approved these changes Mar 29, 2025

add matchers

711d144

wgtmac force-pushed the to_arrow_schema branch from c956e74 to 711d144 on March 31, 2025 02:49

lidavidm approved these changes Mar 31, 2025

Xuanwo reviewed Mar 31, 2025

rename functions to be consistent

1fb24d5

Xuanwo approved these changes Mar 31, 2025

Xuanwo merged commit faf9cc8 into apache:main Mar 31, 2025
6 checks passed

4 participants