Skip to content

Conversation

@gty404
Copy link
Contributor

@gty404 gty404 commented Mar 30, 2025

  • Implement the API for partition spec/field.
  • add pure virtual classes for transform

[[nodiscard]] int32_t spec_id() const;
/// \brief Get a view of the partition fields.
[[nodiscard]] virtual std::span<const PartitionField> fields() const;
/// \brief Get a field by field ID.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we clarify whether these are source/transform field IDs?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Perhaps we don't need these GetFieldByXXX at this moment?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Perhaps we don't need these GetFieldByXXX at this moment?

+1

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we clarify whether these are source/transform field IDs?

I will change it to transform_fields, and the PartitionField class will define source_field_id/transform_field_id

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Perhaps we don't need these GetFieldByXXX at this moment?

Okay, I'll delete this part of the interface first

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

From the spec, it seems we should keep the fields.
image

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, fields are required

Comment on lines 55 to 56
/// \brief Get the transform type.
[[nodiscard]] TransformType transform_type() const;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Java stores the full transform, not the type; do we need to do the same?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think so. Some transform types required additional parameters like bucket number or truncate length.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I will change it to TransformFunction

#include <cstdint>
#include <memory>

#include "iceberg/arrow_c_data_internal.h"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do not include any internal header in the public header file. To use Arrow C data, we can include arrow_c_data.h.

/// Always produces `null`
kVoid,
/// Used to represent some customized transform that can't be recognized or supported now.
kUnknown,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What about putting kUnknown to the beginning?

public:
explicit TransformFunction(TransformType type);
/// \brief Transform an input array to a new array
[[nodiscard]] virtual expected<std::unique_ptr<ArrowArray>, Error> Transform(const ArrowArray& inputArrowArray) = 0;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
[[nodiscard]] virtual expected<std::unique_ptr<ArrowArray>, Error> Transform(const ArrowArray& inputArrowArray) = 0;
virtual expected<ArrowArray, Error> Transform(const ArrowArray& data) = 0;

expected already has nodiscard attribute in its definition, we don't need it here.

For ArrowArray, usually we don't use std::unique_ptr since we have to explicitly call its release function.

Comment on lines 74 to 75
[[nodiscard]] virtual TransformType transform_type() const;
[[nodiscard]] std::string ToString() const override;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
[[nodiscard]] virtual TransformType transform_type() const;
[[nodiscard]] std::string ToString() const override;
virtual TransformType transform_type() const;
std::string ToString() const override;

I don't think we need to add nodiscard to these functions.

[[nodiscard]] int32_t spec_id() const;
/// \brief Get a view of the partition fields.
[[nodiscard]] virtual std::span<const PartitionField> fields() const;
/// \brief Get a field by field ID.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Perhaps we don't need these GetFieldByXXX at this moment?

class ICEBERG_EXPORT PartitionSpec {
public:
virtual ~PartitionSpec() = default;
PartitionSpec(int32_t spec_id, std::vector<PartitionField> fields);
Copy link
Member

@wgtmac wgtmac Mar 31, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IMO, we need an Schema in the constructor to denote to the partition schema.

Comment on lines 31 to 43
{
EXPECT_EQ("Identity", iceberg::ToString(iceberg::TransformType::kIdentity));
EXPECT_EQ("Bucket", iceberg::ToString(iceberg::TransformType::kBucket));
EXPECT_EQ("Truncate", iceberg::ToString(iceberg::TransformType::kTruncate));
EXPECT_EQ("Year", iceberg::ToString(iceberg::TransformType::kYear));
EXPECT_EQ("Month", iceberg::ToString(iceberg::TransformType::kMonth));
EXPECT_EQ("Day", iceberg::ToString(iceberg::TransformType::kDay));
EXPECT_EQ("Hour", iceberg::ToString(iceberg::TransformType::kHour));
EXPECT_EQ("Void", iceberg::ToString(iceberg::TransformType::kVoid));
EXPECT_EQ("Unknown", iceberg::ToString(iceberg::TransformType::kUnknown));
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
{
EXPECT_EQ("Identity", iceberg::ToString(iceberg::TransformType::kIdentity));
EXPECT_EQ("Bucket", iceberg::ToString(iceberg::TransformType::kBucket));
EXPECT_EQ("Truncate", iceberg::ToString(iceberg::TransformType::kTruncate));
EXPECT_EQ("Year", iceberg::ToString(iceberg::TransformType::kYear));
EXPECT_EQ("Month", iceberg::ToString(iceberg::TransformType::kMonth));
EXPECT_EQ("Day", iceberg::ToString(iceberg::TransformType::kDay));
EXPECT_EQ("Hour", iceberg::ToString(iceberg::TransformType::kHour));
EXPECT_EQ("Void", iceberg::ToString(iceberg::TransformType::kVoid));
EXPECT_EQ("Unknown", iceberg::ToString(iceberg::TransformType::kUnknown));
}
EXPECT_EQ("Identity", iceberg::ToString(iceberg::TransformType::kIdentity));
EXPECT_EQ("Bucket", iceberg::ToString(iceberg::TransformType::kBucket));
EXPECT_EQ("Truncate", iceberg::ToString(iceberg::TransformType::kTruncate));
EXPECT_EQ("Year", iceberg::ToString(iceberg::TransformType::kYear));
EXPECT_EQ("Month", iceberg::ToString(iceberg::TransformType::kMonth));
EXPECT_EQ("Day", iceberg::ToString(iceberg::TransformType::kDay));
EXPECT_EQ("Hour", iceberg::ToString(iceberg::TransformType::kHour));
EXPECT_EQ("Void", iceberg::ToString(iceberg::TransformType::kVoid));
EXPECT_EQ("Unknown", iceberg::ToString(iceberg::TransformType::kUnknown));

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Add the iceberg namespace so we don't have to add iceberg:: everywhere.

EXPECT_EQ("Unknown", std::format("{}", transform));

auto [_, arrow_array] =
iceberg::internal::CreateExampleArrowSchemaAndArrayByNanoarrow();
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I will remove CreateExampleArrowSchemaAndArrayByNanoarrow some day. Perhaps add a TODO here?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Okay, I'll build a simple arrow array to test

explicit IcebergError(const std::string& what) : std::runtime_error(what) {}
};

#define ICEBERG_CHECK_FMT(condition, ...) \
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
#define ICEBERG_CHECK_FMT(condition, ...) \
#define ICEBERG_CHECK(condition, ...) \

/// \param[in] field_id The partition field ID.
/// \param[in] name The partition field name.
/// \param[in] transformType The transform type.
PartitionField(int32_t source_id, int32_t field_id, std::string name, TransformType transformType);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

source_id->source_field_id? keep consistent with impl

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it is ok to be consistent with the spec: https://iceberg.apache.org/spec/#partition-specs


| V1       | V2       | V3       | Field            | JSON representation | Example      |
|----------|----------|----------|------------------|---------------------|--------------|
| required | required | omitted  | **`source-id`**  | `JSON int`          | 1            |
|          |          | required | **`source-ids`** | `JSON list of ints` | `[1,2]`      |
|          | required | required | **`field-id`**   | `JSON int`          | 1000         |
| required | required | required | **`name`**       | `JSON string`       | `id_bucket`  |
| required | required | required | **`transform`**  | `JSON string`       | `bucket[16]` |

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

iceberg-java used Transform, iceberg-rust used Transform, which is an enum type, and TransformFunction serves as the implementation interface for Transform. Which one does iceberg-cpp tend to use? @wgtmac @yingcai-cy

@gty404 gty404 force-pushed the table-metadata branch 4 times, most recently from 58e0b5e to be03f0a Compare April 1, 2025 06:42
Copy link
Member

@wgtmac wgtmac left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1

@gty404 gty404 changed the title Add partition field/partition spec feat: add partition field/partition spec Apr 1, 2025
@wgtmac
Copy link
Member

wgtmac commented Apr 1, 2025

@lidavidm @zhjwpku @yingcai-cy Do you want to take a look?

Copy link
Collaborator

@zhjwpku zhjwpku left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looking good, thanks.

@wgtmac
Copy link
Member

wgtmac commented Apr 1, 2025

@Fokko @Xuanwo Could you help review and merge it?

@Fokko Fokko merged commit 2912352 into apache:main Apr 1, 2025
6 checks passed
@gty404 gty404 deleted the table-metadata branch April 1, 2025 15:14
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

6 participants