feat: transform function #61
Conversation
67a266d to
bb94c3f
Compare
/// For transforms that require parameters (e.g., Bucket(N)), this holds the arguments
/// as a primitive ArrowArray (e.g., INT32 for num_buckets or width).
/// If the transform does not require parameters, this will be std::nullopt.
std::optional<ArrowArray> params_opt;
Hmm, do the parameters need to be Arrow data? I suppose what we need is a standard for scalars or row-wise data...
Do we actually need TransformSpec though?
I was thinking about the following approach:
struct TransformUtil {
  // TODO: source and result types may be required here
  static expected<ArrowArray, Error> Copy(const ArrowArray& input);
  static expected<ArrowArray, Error> Bucket(const ArrowArray& input, int32_t num_buckets);
  static expected<ArrowArray, Error> Truncate(const ArrowArray& input, int32_t width);
};

class Transform {
 public:
  virtual ~Transform() = default;
  expected<ArrowArray, Error> Apply(const ArrowArray& input) const {
    return transform_function_(input);
  }

 protected:
  Transform(
      TransformType transform_type, std::shared_ptr<Type> source_type,
      std::shared_ptr<Type> result_type,
      std::function<expected<ArrowArray, Error>(const ArrowArray&)> transform_function)
      : Transform(transform_type,
                  std::vector<std::shared_ptr<Type>>{std::move(source_type)},
                  std::move(result_type), std::move(transform_function)) {}
  Transform(
      TransformType transform_type, std::vector<std::shared_ptr<Type>> source_types,
      std::shared_ptr<Type> result_type,
      std::function<expected<ArrowArray, Error>(const ArrowArray&)> transform_function)
      : transform_type_(transform_type),
        source_types_(std::move(source_types)),
        result_type_(std::move(result_type)),
        transform_function_(std::move(transform_function)) {}

 private:
  TransformType transform_type_;
  std::vector<std::shared_ptr<Type>> source_types_;
  std::shared_ptr<Type> result_type_;
  std::function<expected<ArrowArray, Error>(const ArrowArray&)> transform_function_;
};

class IdentityTransform : public Transform {
 public:
  explicit IdentityTransform(std::shared_ptr<Type> source_type)
      : Transform(TransformType::kIdentity, source_type, source_type,
                  [](const ArrowArray& input) -> expected<ArrowArray, Error> {
                    return TransformUtil::Copy(input);
                  }) {}
};

class BucketTransform : public Transform {
 public:
  BucketTransform(std::shared_ptr<Type> source_type, int32_t num_buckets)
      : BucketTransform(std::vector<std::shared_ptr<Type>>{std::move(source_type)},
                        num_buckets) {}
  BucketTransform(std::vector<std::shared_ptr<Type>> source_types, int32_t num_buckets)
      : Transform(TransformType::kBucket, std::move(source_types),
                  std::make_shared<IntType>(),
                  [num_buckets](const ArrowArray& input) -> expected<ArrowArray, Error> {
                    return TransformUtil::Bucket(input, num_buckets);
                  }),
        num_buckets_{num_buckets} {}
  int32_t num_buckets() const { return num_buckets_; }

 private:
  int32_t num_buckets_;
};
We can even overload expected<ArrowArray, Error> operator()(const ArrowArray& input) const if you don't like the name of Apply.
WDYT? @gty404 @lidavidm @yingcai-cy
- Using ArrowArray is necessary because different transforms may take different types and numbers of parameters, which are needed inside the Transform or when creating a Transform instance.
- TransformSpec describes a transform; it can be derived from JSON and is also convenient for future extensions.
The main difference is that, in your approach, the subclasses uniformly provide a lambda function. I am concerned that the function might become too complex, as the same Transform type may need to implement different logic depending on the source type. However, iceberg-java takes a similar approach: it generates specific function objects by binding the source type. @wgtmac
If the argument is that you can't instantiate the full transform during parsing, you could split up the transform's type (aka the operation + parameters) which you should be able to construct immediately from the fully instantiated transform
When creating a TransformFunction, you need to specify the transform type, optional parameters, and the types of one or more source columns. These types need to be retrievable from the context to get the table's schema. My intention was to use TransformSpec as a parsed data block that can be easily passed around and delay the creation of the TransformFunction until the context meets the requirements. This might just be a hypothetical scenario, and currently, there might not be a need to provide an entry to create all TransformFunctions?
Right, which is why I said
you could split up the transform's type (aka the operation + parameters)[...]from the fully instantiated transform
You could even still have something like TransformSpec, I'm just arguing that it might make more sense to directly instantiate a transform-like object (with parameters, without types), just like how a Field separates the (parameterized) Type out.
Right, which is why I said
you could split up the transform's type (aka the operation + parameters)[...]from the fully instantiated transform
You could even still have something like TransformSpec, I'm just arguing that it might make more sense to directly instantiate a transform-like object (with parameters, without types), just like how a Field separates the (parameterized) Type out.
I see. I will provide UnboundTransform, which includes the transform type and parameters. At the same time, TransformFunction will provide a Bind interface to receive UnboundTransform and source type, generating a specific TransformFunction. What do you think?
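To make the proposal concrete, here is a minimal sketch of the unbound/bound split. Everything in it is a stand-in: `TypeId`, the `Result` alias over `std::variant`, and the constructor shapes are assumptions for illustration, not the project's actual `Type`, `Error`, or `expected` definitions, and `Bind` is placed on the unbound object only for brevity.

```cpp
#include <cassert>
#include <cstdint>
#include <memory>
#include <optional>
#include <string>
#include <variant>

// Hypothetical stand-ins; the real project has its own Type, Error and expected<>.
enum class TypeId { kInt, kString };
enum class TransformType { kIdentity, kBucket };
struct Error {
  std::string message;
};

// Minimal result type: holds either a value or an Error.
template <typename T>
using Result = std::variant<T, Error>;

// Bound function: it knows the source type, so it also knows the result type.
class TransformFunction {
 public:
  TransformFunction(TransformType type, TypeId source, TypeId result)
      : type_(type), source_(source), result_(result) {}
  TransformType transform_type() const { return type_; }
  TypeId source_type() const { return source_; }
  TypeId result_type() const { return result_; }

 private:
  TransformType type_;
  TypeId source_;
  TypeId result_;
};

// Unbound transform: operation plus optional parameter, no column types.
class UnboundTransform {
 public:
  explicit UnboundTransform(TransformType type,
                            std::optional<int32_t> param = std::nullopt)
      : type_(type), param_(param) {}

  // Bind to a source type once the table schema is available.
  Result<std::unique_ptr<TransformFunction>> Bind(TypeId source) const {
    switch (type_) {
      case TransformType::kIdentity:
        // Identity keeps the source type as the result type.
        return std::make_unique<TransformFunction>(type_, source, source);
      case TransformType::kBucket:
        if (!param_ || *param_ <= 0) {
          return Error{"bucket requires a positive num_buckets"};
        }
        // Bucket always produces an int, whatever the source type is.
        return std::make_unique<TransformFunction>(type_, source, TypeId::kInt);
    }
    return Error{"unknown transform type"};
  }

 private:
  TransformType type_;
  std::optional<int32_t> param_;
};
```

With this shape, JSON parsing only has to produce an `UnboundTransform`, and the call to `Bind` can be delayed until the schema can supply the source type.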
That sounds good to me. @wgtmac?
SGTM!
b45e466 to
328880b
Compare
/// For transforms that require parameters (e.g., Bucket(N)), this holds the arguments
/// as an int32 value (e.g., INT32 for num_buckets or width).
/// If the transform does not require parameters, this will be empty.
std::vector<std::variant<int32_t>> params;
Why not simply std::variant<std::monostate, int32_t> param?
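For illustration, a single-parameter holder along these lines could look like the sketch below. `TransformParam` and `FormatTransform` are hypothetical names, and the `name[param]` output format is only an assumption made for the example.

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <variant>

// Hypothetical parameter holder: std::monostate means "no parameter".
using TransformParam = std::variant<std::monostate, int32_t>;

// Render a transform name with its optional parameter, e.g. "bucket[16]".
inline std::string FormatTransform(const std::string& name,
                                   const TransformParam& param) {
  // std::get_if returns nullptr when the variant holds a different alternative.
  if (const int32_t* value = std::get_if<int32_t>(&param)) {
    return name + "[" + std::to_string(*value) + "]";
  }
  return name;  // std::monostate: nothing to append
}
```

A default-constructed `TransformParam` holds `std::monostate`, so non-parameterized transforms need no special handling at construction time.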
The only difference between TransformFunction and TransformSpec is this additional param. Should we delete TransformSpec by adding param to TransformFunction?
- Could there be multiple params?
- TransformSpec is just the result after json parsing, and TransformFunction is the specific implementation for the transformation.
Each transform function is unlikely to have more than one param in the near future. Am I correct, @Fokko?
I don't like the name of TransformSpec because it does not have a spec_id or something. What about renaming TransformSpec to Transform? For TransformFunction, we can use Apply or even operator() to resolve the name conflict.
I don't like the name of TransformSpec because it does not have a spec_id or something. What about renaming TransformSpec to Transform? For TransformFunction, we can use Apply or even operator() to resolve the name conflict.
How about UnboundTransform?
I suppose UnboundTransform will be used in the expression
c3ac6c7 to
df894f2
Compare
src/iceberg/type_fwd.h
Outdated
class TableMetadata;
enum class TransformType;
class TransformFunction;
struct TransformSpec;
I don't see this struct elsewhere, do we need this?
It's not needed anymore, I will delete it.
test/transform_test.cc
Outdated
EXPECT_EQ(TransformType::kUnknown, transform.transform_type());
EXPECT_EQ("unknown", transform.ToString());
EXPECT_EQ("unknown", std::format("{}", transform));
IdentityTransform transform{std::make_shared<StringType>()};
not for this PR but we should consider singleton type instances for common types (though maybe sharing a single refcount is actually a net negative)
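As a rough sketch of what a singleton type instance could look like (the `Type` and `StringType` definitions and the `SharedStringType` accessor here are hypothetical stand-ins, not the project's API):

```cpp
#include <cassert>
#include <memory>

// Hypothetical stand-ins for the project's type hierarchy.
struct Type {
  virtual ~Type() = default;
};
struct StringType : Type {};

// Function-local static: initialized once, and the initialization is
// thread-safe since C++11. Every caller receives the same instance.
inline const std::shared_ptr<StringType>& SharedStringType() {
  static const std::shared_ptr<StringType> instance =
      std::make_shared<StringType>();
  return instance;
}
```

Every copy of the returned `shared_ptr` increments the same atomic reference count, which is the contention concern raised above; callers that only read the type could take the `const&` without copying.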
7ad56db to
cb153d4
Compare
cb153d4 to
e6b2fc9
Compare
};

/// \brief A transform function used for partitioning.
class ICEBERG_EXPORT TransformFunction : public util::Formattable {
nit: why not formattable anymore?
Transform is already formattable. I haven't yet seen a need for TransformFunction to be printable, though outputting the source type and result type may become a requirement. I will support this interface in the next PR.
Sounds good to me.
src/iceberg/transform.h
Outdated
/// parameter.
/// \param source_type The source column type to bind to.
/// \return A TransformFunction instance wrapped in `expected`, or an error on failure.
expected<std::unique_ptr<TransformFunction>, Error> Bind(
Can we use forward declaration here and define TransformFunction in the transform_function.h?
My idea is that transform.h contains all the interfaces that transform users depend on, without exposing the implementation details of the transform functions.
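For reference, the forward-declaration layout being discussed could be sketched roughly as follows, collapsed into one file. The file boundaries in the comments, the parameterless `Bind`, and the `Apply(int)` method are simplifications assumed for the example.

```cpp
#include <cassert>
#include <memory>

// --- what a user-facing transform.h could contain (hypothetical layout) ---
class TransformFunction;  // forward declaration only, no definition needed here

class Transform {
 public:
  // Declaring a function that returns unique_ptr<TransformFunction> is fine
  // while TransformFunction is still an incomplete type.
  std::unique_ptr<TransformFunction> Bind() const;
};

// --- what transform_function.h could contain ---
class TransformFunction {
 public:
  int Apply(int value) const { return value; }
};

// --- what transform.cc could contain: it sees the full definition ---
std::unique_ptr<TransformFunction> Transform::Bind() const {
  return std::make_unique<TransformFunction>();
}
```

One caveat with this layout: any code that calls methods on the returned pointer, or lets the `unique_ptr` go out of scope, needs the complete `TransformFunction` definition, so such callers would still include the second header.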
return instance;
}

Transform::Transform(TransformType transform_type) : transform_type_(transform_type) {}
Throw for parameterized transform_type?
Transform::Transform(TransformType transform_type) : transform_type_(transform_type) {}

Transform::Transform(TransformType transform_type, int32_t param)
Throw for non-parameterized transform_type?
If we want to avoid invalid inputs, perhaps we should define separate static functions to create each transform type?
Yes, currently I only added Transform::Identity, I will add the others as well.
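A minimal sketch of the static-factory idea, assuming a private constructor so that a parameterized type can never be created without its parameter. Only `Transform::Identity` is mentioned in this thread; `Bucket`, `Truncate`, and the `param()` accessor are hypothetical names used for illustration.

```cpp
#include <cassert>
#include <cstdint>
#include <optional>

enum class TransformType { kIdentity, kBucket, kTruncate };

class Transform {
 public:
  // One factory per transform type: each signature admits only valid inputs.
  static Transform Identity() {
    return Transform(TransformType::kIdentity, std::nullopt);
  }
  static Transform Bucket(int32_t num_buckets) {
    return Transform(TransformType::kBucket, num_buckets);
  }
  static Transform Truncate(int32_t width) {
    return Transform(TransformType::kTruncate, width);
  }

  TransformType transform_type() const { return transform_type_; }
  std::optional<int32_t> param() const { return param_; }

 private:
  // Private: callers cannot pair a type with a mismatched parameter.
  Transform(TransformType transform_type, std::optional<int32_t> param)
      : transform_type_(transform_type), param_(param) {}

  TransformType transform_type_;
  std::optional<int32_t> param_;
};
```

This removes the need to throw from the constructors for mismatched type and parameter combinations, although range checks on the parameter values themselves would still have to live somewhere.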
src/iceberg/transform.cc
Outdated
TransformType Transform::transform_type() const { return transform_type_; }

expected<std::unique_ptr<TransformFunction>, Error> Transform::Bind(
It seems that we should merge transform_function.h/cc to transform.h/cc. They have strong connections.
I suggest keeping them separate, so that users of Transform only need to depend on transform.h and need not concern themselves with the implementation details of the transform function.
e6b2fc9 to
39148d0
Compare
src/iceberg/transform.h
Outdated
/// parameter.
/// \param source_type The source column type to bind to.
/// \return A TransformFunction instance wrapped in `expected`, or an error on failure.
expected<std::unique_ptr<TransformFunction>, Error> Bind(
Suggested change:
- expected<std::unique_ptr<TransformFunction>, Error> Bind(
+ Result<std::unique_ptr<TransformFunction>> Bind(
nit: we can leave these as-is and fix all together in a separate PR.
3e539be to
5a46de2
Compare
Xuanwo
left a comment
The community has reached a consensus. Let's go!
No description provided.