-
Notifications
You must be signed in to change notification settings - Fork 69
feat: add partition field/partition spec #54
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
gty404
commented
Mar 30, 2025
- Implement the API for partition spec/field.
- add pure virtual classes for transform
src/iceberg/partition_spec.h
Outdated
| [[nodiscard]] int32_t spec_id() const; | ||
| /// \brief Get a view of the partition fields. | ||
| [[nodiscard]] virtual std::span<const PartitionField> fields() const; | ||
| /// \brief Get a field by field ID. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we clarify whether these are source/transform field IDs?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Perhaps we don't need these GetFieldByXXX at this moment?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Perhaps we don't need these
GetFieldByXXXat this moment?
+1
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we clarify whether these are source/transform field IDs?
I will change it to transform_fields, and the PartitionField class will define source_field_id/transform_field_id
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Perhaps we don't need these
GetFieldByXXXat this moment?
Okay, I'll delete this part of the interface first
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, fields are required
src/iceberg/partition_field.h
Outdated
| /// \brief Get the transform type. | ||
| [[nodiscard]] TransformType transform_type() const; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Java stores the full transform, not the type; do we need to do the same?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think so. Some transform types required additional parameters like bucket number or truncate length.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I will change it to TransformFunction
src/iceberg/transform.h
Outdated
| #include <cstdint> | ||
| #include <memory> | ||
|
|
||
| #include "iceberg/arrow_c_data_internal.h" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do not include any internal header in the public header file. To use Arrow C data, we can include arrow_c_data.h.
src/iceberg/transform.h
Outdated
| /// Always produces `null` | ||
| kVoid, | ||
| /// Used to represent some customized transform that can't be recognized or supported now. | ||
| kUnknown, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What about putting kUnknown to the beginning?
src/iceberg/transform.h
Outdated
| public: | ||
| explicit TransformFunction(TransformType type); | ||
| /// \brief Transform an input array to a new array | ||
| [[nodiscard]] virtual expected<std::unique_ptr<ArrowArray>, Error> Transform(const ArrowArray& inputArrowArray) = 0; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
| [[nodiscard]] virtual expected<std::unique_ptr<ArrowArray>, Error> Transform(const ArrowArray& inputArrowArray) = 0; | |
| virtual expected<ArrowArray, Error> Transform(const ArrowArray& data) = 0; |
expected already has nodiscard attribute in its definition, we don't need it here.
For ArrowArray, usually we don't use std::unique_ptr since we have to explicitly call its release function.
src/iceberg/transform.h
Outdated
| [[nodiscard]] virtual TransformType transform_type() const; | ||
| [[nodiscard]] std::string ToString() const override; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
| [[nodiscard]] virtual TransformType transform_type() const; | |
| [[nodiscard]] std::string ToString() const override; | |
| virtual TransformType transform_type() const; | |
| std::string ToString() const override; |
I don't think we need to add nodiscard to these functions.
src/iceberg/partition_spec.h
Outdated
| [[nodiscard]] int32_t spec_id() const; | ||
| /// \brief Get a view of the partition fields. | ||
| [[nodiscard]] virtual std::span<const PartitionField> fields() const; | ||
| /// \brief Get a field by field ID. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Perhaps we don't need these GetFieldByXXX at this moment?
src/iceberg/partition_spec.h
Outdated
| class ICEBERG_EXPORT PartitionSpec { | ||
| public: | ||
| virtual ~PartitionSpec() = default; | ||
| PartitionSpec(int32_t spec_id, std::vector<PartitionField> fields); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
IMO, we need an Schema in the constructor to denote to the partition schema.
test/transform_test.cc
Outdated
| { | ||
| EXPECT_EQ("Identity", iceberg::ToString(iceberg::TransformType::kIdentity)); | ||
| EXPECT_EQ("Bucket", iceberg::ToString(iceberg::TransformType::kBucket)); | ||
| EXPECT_EQ("Truncate", iceberg::ToString(iceberg::TransformType::kTruncate)); | ||
| EXPECT_EQ("Year", iceberg::ToString(iceberg::TransformType::kYear)); | ||
| EXPECT_EQ("Month", iceberg::ToString(iceberg::TransformType::kMonth)); | ||
| EXPECT_EQ("Day", iceberg::ToString(iceberg::TransformType::kDay)); | ||
| EXPECT_EQ("Hour", iceberg::ToString(iceberg::TransformType::kHour)); | ||
| EXPECT_EQ("Void", iceberg::ToString(iceberg::TransformType::kVoid)); | ||
| EXPECT_EQ("Unknown", iceberg::ToString(iceberg::TransformType::kUnknown)); | ||
| } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
| { | |
| EXPECT_EQ("Identity", iceberg::ToString(iceberg::TransformType::kIdentity)); | |
| EXPECT_EQ("Bucket", iceberg::ToString(iceberg::TransformType::kBucket)); | |
| EXPECT_EQ("Truncate", iceberg::ToString(iceberg::TransformType::kTruncate)); | |
| EXPECT_EQ("Year", iceberg::ToString(iceberg::TransformType::kYear)); | |
| EXPECT_EQ("Month", iceberg::ToString(iceberg::TransformType::kMonth)); | |
| EXPECT_EQ("Day", iceberg::ToString(iceberg::TransformType::kDay)); | |
| EXPECT_EQ("Hour", iceberg::ToString(iceberg::TransformType::kHour)); | |
| EXPECT_EQ("Void", iceberg::ToString(iceberg::TransformType::kVoid)); | |
| EXPECT_EQ("Unknown", iceberg::ToString(iceberg::TransformType::kUnknown)); | |
| } | |
| EXPECT_EQ("Identity", iceberg::ToString(iceberg::TransformType::kIdentity)); | |
| EXPECT_EQ("Bucket", iceberg::ToString(iceberg::TransformType::kBucket)); | |
| EXPECT_EQ("Truncate", iceberg::ToString(iceberg::TransformType::kTruncate)); | |
| EXPECT_EQ("Year", iceberg::ToString(iceberg::TransformType::kYear)); | |
| EXPECT_EQ("Month", iceberg::ToString(iceberg::TransformType::kMonth)); | |
| EXPECT_EQ("Day", iceberg::ToString(iceberg::TransformType::kDay)); | |
| EXPECT_EQ("Hour", iceberg::ToString(iceberg::TransformType::kHour)); | |
| EXPECT_EQ("Void", iceberg::ToString(iceberg::TransformType::kVoid)); | |
| EXPECT_EQ("Unknown", iceberg::ToString(iceberg::TransformType::kUnknown)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Add the iceberg namespace so we don't have to add iceberg:: everywhere.
test/transform_test.cc
Outdated
| EXPECT_EQ("Unknown", std::format("{}", transform)); | ||
|
|
||
| auto [_, arrow_array] = | ||
| iceberg::internal::CreateExampleArrowSchemaAndArrayByNanoarrow(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I will remove CreateExampleArrowSchemaAndArrayByNanoarrow some day. Perhaps add a TODO here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Okay, I'll build a simple arrow array to test
src/iceberg/exception.h
Outdated
| explicit IcebergError(const std::string& what) : std::runtime_error(what) {} | ||
| }; | ||
|
|
||
| #define ICEBERG_CHECK_FMT(condition, ...) \ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
| #define ICEBERG_CHECK_FMT(condition, ...) \ | |
| #define ICEBERG_CHECK(condition, ...) \ |
src/iceberg/partition_field.h
Outdated
| /// \param[in] field_id The partition field ID. | ||
| /// \param[in] name The partition field name. | ||
| /// \param[in] transformType The transform type. | ||
| PartitionField(int32_t source_id, int32_t field_id, std::string name, TransformType transformType); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
source_id->source_field_id? keep consistent with impl
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it is ok to be consistent with the spec: https://iceberg.apache.org/spec/#partition-specs
| V1 | V2 | V3 | Field | JSON representation | Example |
|----------|----------|----------|------------------|---------------------|--------------|
| required | required | omitted | **`source-id`** | `JSON int` | 1 |
| | | required | **`source-ids`** | `JSON list of ints` | `[1,2]` |
| | required | required | **`field-id`** | `JSON int` | 1000 |
| required | required | required | **`name`** | `JSON string` | `id_bucket` |
| required | required | required | **`transform`** | `JSON string` | `bucket[16]` |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
iceberg-java used Transform, iceberg-rust used Transform, which is an enum type, and TransformFunction serves as the implementation interface for Transform. Which one does iceberg-cpp tend to use? @wgtmac @yingcai-cy
58e0b5e to
be03f0a
Compare
wgtmac
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1
|
@lidavidm @zhjwpku @yingcai-cy Do you want to take a look? |
zhjwpku
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looking good, thanks.
