Conversation

Contributor

@gty404 gty404 commented Apr 8, 2025

No description provided.

@gty404 gty404 force-pushed the identity-transform branch from 67a266d to bb94c3f Compare April 8, 2025 06:14
/// For transforms that require parameters (e.g., Bucket(N)), this holds the arguments
/// as a primitive ArrowArray (e.g., INT32 for num_buckets or width).
/// If the transform does not require parameters, this will be std::nullopt.
std::optional<ArrowArray> params_opt;
Member

Hmm, do the parameters need to be Arrow data? I suppose what we need is a standard for scalars or row-wise data...

Member

Do we actually need TransformSpec though?

Member

@wgtmac wgtmac Apr 8, 2025

I was thinking about the following approach:

struct TransformUtil {
  // TODO: source and result types may be required here
  static expected<ArrowArray, Error> Copy(const ArrowArray& input);
  static expected<ArrowArray, Error> Bucket(const ArrowArray& input, int32_t num_buckets);
  static expected<ArrowArray, Error> Truncate(const ArrowArray& input, int32_t width);
};

class Transform {
 public:
  virtual ~Transform() = default;

  expected<ArrowArray, Error> Apply(const ArrowArray& input) const {
    return transform_function_(input);
  }

 protected:
  Transform(
      TransformType transform_type, std::shared_ptr<Type> source_type,
      std::shared_ptr<Type> result_type,
      std::function<expected<ArrowArray, Error>(const ArrowArray&)> transform_function)
      : Transform(transform_type,
                  std::vector<std::shared_ptr<Type>>{std::move(source_type)},
                  std::move(result_type), std::move(transform_function)) {}

  Transform(
      TransformType transform_type, std::vector<std::shared_ptr<Type>> source_types,
      std::shared_ptr<Type> result_type,
      std::function<expected<ArrowArray, Error>(const ArrowArray&)> transform_function)
      : transform_type_(transform_type),
        source_types_(std::move(source_types)),
        result_type_(std::move(result_type)),
        transform_function_(std::move(transform_function)) {}

 private:
  TransformType transform_type_;
  std::vector<std::shared_ptr<Type>> source_types_;
  std::shared_ptr<Type> result_type_;
  std::function<expected<ArrowArray, Error>(const ArrowArray&)> transform_function_;
};

class IdentityTransform : public Transform {
 public:
  explicit IdentityTransform(std::shared_ptr<Type> source_type)
      : Transform(TransformType::kIdentity, source_type, source_type,
                  [](const ArrowArray& input) -> expected<ArrowArray, Error> {
                    return TransformUtil::Copy(input);
                  }) {}
};

class BucketTransform : public Transform {
 public:
  BucketTransform(std::shared_ptr<Type> source_type, int32_t num_buckets)
      : BucketTransform(std::vector<std::shared_ptr<Type>>{std::move(source_type)},
                        num_buckets) {}

  BucketTransform(std::vector<std::shared_ptr<Type>> source_types, int32_t num_buckets)
      : Transform(TransformType::kBucket, std::move(source_types),
                  std::make_shared<IntType>(),
                  [num_buckets](const ArrowArray& input) -> expected<ArrowArray, Error> {
                    return TransformUtil::Bucket(input, num_buckets);
                  }),
        num_buckets_{num_buckets} {}

  int32_t num_buckets() const { return num_buckets_; }

 private:
  int32_t num_buckets_;
};

We can even overload expected<ArrowArray, Error> operator()(const ArrowArray& input) const if you don't like the name of Apply.

WDYT? @gty404 @lidavidm @yingcai-cy

Contributor Author

  1. Using ArrowArray is necessary because different Transforms may take different types and numbers of parameters, and those parameters are needed inside the Transform or when creating an instance of it.

  2. TransformSpec is used to describe a transform, which can be derived from JSON and is also convenient for future extensions.

@lidavidm

Contributor Author

The main difference is that, in your approach, the subclasses uniformly provide a lambda function. I am concerned that the function might become too complex, since the same Transform type may need different logic for different source types. However, iceberg-java takes a similar approach: it generates specific function objects by binding the source type. @wgtmac
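
To make the concern about per-source-type logic concrete, here is a minimal sketch of choosing the implementation once, at bind time. It is not code from this PR: `Value`, `SourceKind`, `TruncateFn`, and `BindTruncate` are hypothetical names, scalars stand in for `ArrowArray`, and the string case truncates by bytes rather than by code points.

```cpp
#include <cstddef>
#include <cstdint>
#include <functional>
#include <optional>
#include <string>
#include <variant>

// Hypothetical scalar value and source-type tag; the PR operates on ArrowArray.
using Value = std::variant<int32_t, std::string>;
enum class SourceKind { kInt, kString, kBoolean };
using TruncateFn = std::function<Value(const Value&)>;

// Picks the type-specific implementation once, when the source type is bound.
// Returns std::nullopt when truncate cannot be bound to the given source type.
std::optional<TruncateFn> BindTruncate(SourceKind source, int32_t width) {
  if (width <= 0) return std::nullopt;
  switch (source) {
    case SourceKind::kInt:
      return TruncateFn([width](const Value& v) -> Value {
        const int32_t x = std::get<int32_t>(v);
        // Round toward negative infinity to a multiple of width
        // (overflow near INT32_MIN is ignored in this sketch).
        return x - (((x % width) + width) % width);
      });
    case SourceKind::kString:
      return TruncateFn([width](const Value& v) -> Value {
        // Byte-based prefix; a real implementation would count code points.
        return std::get<std::string>(v).substr(0, static_cast<std::size_t>(width));
      });
    default:
      return std::nullopt;  // not supported for this source type in the sketch
  }
}
```

Each lambda stays small because the dispatch on the source type happens in `BindTruncate`, not inside the function that runs per value.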

Contributor Author

> If the argument is that you can't instantiate the full transform during parsing, you could split up the transform's type (aka the operation + parameters) which you should be able to construct immediately from the fully instantiated transform

When creating a TransformFunction, you need to specify the transform type, optional parameters, and the types of one or more source columns. These types need to be retrievable from the context to get the table's schema. My intention was to use TransformSpec as a parsed data block that can be easily passed around and delay the creation of the TransformFunction until the context meets the requirements. This might just be a hypothetical scenario, and currently, there might not be a need to provide an entry to create all TransformFunctions?

Member

Right, which is why I said

> you could split up the transform's type (aka the operation + parameters)[...]from the fully instantiated transform

You could even still have something like TransformSpec, I'm just arguing that it might make more sense to directly instantiate a transform-like object (with parameters, without types), just like how a Field separates the (parameterized) Type out.

Contributor Author

> Right, which is why I said
>
> > you could split up the transform's type (aka the operation + parameters)[...]from the fully instantiated transform
>
> You could even still have something like TransformSpec, I'm just arguing that it might make more sense to directly instantiate a transform-like object (with parameters, without types), just like how a Field separates the (parameterized) Type out.

I see. I will provide UnboundTransform, which includes the transform type and parameters. At the same time, TransformFunction will provide a Bind interface to receive UnboundTransform and source type, generating a specific TransformFunction. What do you think?

Member

That sounds good to me. @wgtmac?

Member

SGTM!

@gty404 gty404 force-pushed the identity-transform branch 3 times, most recently from b45e466 to 328880b Compare April 8, 2025 14:28
/// For transforms that require parameters (e.g., Bucket(N)), this holds the arguments
/// as an int32 value (e.g., INT32 for num_buckets or width).
/// If the transform does not require parameters, this will be empty.
std::vector<std::variant<int32_t>> params;
Member

Why not simply std::variant<std::monostate, int32_t> param?
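
As an illustration of the single-parameter suggestion, the sketch below stores the parameter as `std::variant<std::monostate, int32_t>`, where `std::monostate` means "no parameter". The `TransformDesc` struct and the free function `ToString` are hypothetical names for this example and are not taken from the PR; the `bucket[16]` output follows the textual form used by the Iceberg spec.

```cpp
#include <cstdint>
#include <string>
#include <variant>

// Hypothetical stand-in for the PR's TransformType enum.
enum class TransformType { kIdentity, kBucket, kTruncate };

// Hypothetical transform description; std::monostate means "no parameter".
struct TransformDesc {
  TransformType type;
  std::variant<std::monostate, int32_t> param;
};

// Renders the description as text, e.g. "identity" or "bucket[16]".
std::string ToString(const TransformDesc& t) {
  std::string name;
  switch (t.type) {
    case TransformType::kIdentity: name = "identity"; break;
    case TransformType::kBucket:   name = "bucket";   break;
    case TransformType::kTruncate: name = "truncate"; break;
  }
  // get_if returns nullptr when the variant holds std::monostate.
  if (const int32_t* p = std::get_if<int32_t>(&t.param)) {
    return name + "[" + std::to_string(*p) + "]";
  }
  return name;
}
```

If a transform ever needed a parameter of another type, a new alternative could be added to the variant without changing the callers that only check for `std::monostate`.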

Member

The only difference between TransformFunction and TransformSpec is this additional param. Should we delete TransformSpec by adding param to TransformFunction?

Contributor Author

  1. Could there be multiple params?
  2. TransformSpec is just the result after json parsing, and TransformFunction is the specific implementation for the transformation.

Member

For each transform function, it is unlikely that there will be more than one param in the near future. Am I correct, @Fokko?

Member

I don't like the name of TransformSpec because it does not have a spec_id or something. What about renaming TransformSpec to Transform? For TransformFunction, we can use Apply or even operator() to resolve the name conflict.

Contributor Author

> I don't like the name of TransformSpec because it does not have a spec_id or something. What about renaming TransformSpec to Transform? For TransformFunction, we can use Apply or even operator() to resolve the name conflict.

How about UnboundTransform?

Member

I suppose UnboundTransform will be used in the expression

@gty404 gty404 force-pushed the identity-transform branch 2 times, most recently from c3ac6c7 to df894f2 Compare April 9, 2025 13:27
class TableMetadata;
enum class TransformType;
class TransformFunction;
struct TransformSpec;
Collaborator

I don't see this struct elsewhere, do we need this?

Contributor Author

It's not needed anymore, I will delete it.

EXPECT_EQ(TransformType::kUnknown, transform.transform_type());
EXPECT_EQ("unknown", transform.ToString());
EXPECT_EQ("unknown", std::format("{}", transform));
IdentityTransform transform{std::make_shared<StringType>()};
Member

not for this PR but we should consider singleton type instances for common types (though maybe sharing a single refcount is actually a net negative)
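
One way the singleton idea could look, as a sketch only: a function-local static that hands out the same `std::shared_ptr` on every call. `StringType` here is an empty stand-in and `string_type()` is a hypothetical accessor, not an API from this PR.

```cpp
#include <memory>

// Hypothetical stand-in for the library's StringType.
struct StringType {};

// Returns the same shared instance on every call. Initialization of a
// function-local static is thread-safe since C++11.
inline const std::shared_ptr<StringType>& string_type() {
  static const std::shared_ptr<StringType> instance =
      std::make_shared<StringType>();
  return instance;
}
```

Returning a const reference avoids touching the reference count on lookup, but every copy of the pointer still updates one shared counter, which is the possible downside mentioned in the comment.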

@gty404 gty404 force-pushed the identity-transform branch 2 times, most recently from 7ad56db to cb153d4 Compare April 10, 2025 11:51
Contributor Author

gty404 commented Apr 10, 2025

I made the following modifications based on the previous comment:

  1. The Transform object is used for JSON serialization/deserialization.

  2. Transform::Bind can get the TransformFunction, which is used for data transformation.

Please review in your spare time. Thanks @lidavidm @wgtmac @zhjwpku
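
The two-stage design summarized above can be sketched roughly as follows. This is a simplified illustration under stated assumptions, not the merged code: `TypeId` is a hypothetical stand-in for `std::shared_ptr<Type>`, `IdentityFunction` is a hypothetical subclass name, and `Bind` returns a null pointer on failure where the PR returns `expected<std::unique_ptr<TransformFunction>, Error>`.

```cpp
#include <memory>
#include <string>

enum class TransformType { kIdentity, kUnknown };
enum class TypeId { kInt, kString };  // hypothetical stand-in for the PR's Type

// Bound form: knows the source type and can report the result type.
class TransformFunction {
 public:
  virtual ~TransformFunction() = default;
  TypeId source_type() const { return source_type_; }
  virtual TypeId result_type() const = 0;

 protected:
  explicit TransformFunction(TypeId source_type) : source_type_(source_type) {}

 private:
  TypeId source_type_;
};

// Identity keeps the source type as the result type.
class IdentityFunction : public TransformFunction {
 public:
  explicit IdentityFunction(TypeId source_type) : TransformFunction(source_type) {}
  TypeId result_type() const override { return source_type(); }
};

// Unbound form: cheap to parse, print, and pass around without a schema.
class Transform {
 public:
  explicit Transform(TransformType type) : type_(type) {}

  std::string ToString() const {
    return type_ == TransformType::kIdentity ? "identity" : "unknown";
  }

  // Binds to a source type; returns nullptr if the transform cannot be bound.
  std::unique_ptr<TransformFunction> Bind(TypeId source_type) const {
    if (type_ == TransformType::kIdentity) {
      return std::make_unique<IdentityFunction>(source_type);
    }
    return nullptr;
  }

 private:
  TransformType type_;
};
```

The point of the split is that metadata parsing only needs `Transform`, while the table schema is required only at the moment `Bind` is called.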

@gty404 gty404 force-pushed the identity-transform branch from cb153d4 to e6b2fc9 Compare April 10, 2025 15:21
};

/// \brief A transform function used for partitioning.
class ICEBERG_EXPORT TransformFunction : public util::Formattable {
Member

nit: why not formattable anymore?

Contributor Author

Transform is already formattable. I haven't yet seen a need for TransformFunction to be printable, though outputting the source type and result type may become a requirement. I will support this interface in the next PR.

Member

Sounds good to me.

/// parameter.
/// \param source_type The source column type to bind to.
/// \return A TransformFunction instance wrapped in `expected`, or an error on failure.
expected<std::unique_ptr<TransformFunction>, Error> Bind(
Member

Can we use forward declaration here and define TransformFunction in the transform_function.h?

Contributor Author

My idea is that transform.h contains all the interfaces that transform users depend on, without exposing the implementation details of the transform functions.

return instance;
}

Transform::Transform(TransformType transform_type) : transform_type_(transform_type) {}
Member

Throw for parameterized transform_type?


Transform::Transform(TransformType transform_type) : transform_type_(transform_type) {}

Transform::Transform(TransformType transform_type, int32_t param)
Member

Throw for non-parameterized transform_type?

Member

If we want to avoid invalid inputs, perhaps we should define separate static functions to create each transform type?

Contributor Author

Yes, currently I only added Transform::Identity, I will add the others as well.
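
A minimal sketch of the static-factory idea, assuming a private constructor so that a parameterized transform cannot be created without its parameter and vice versa. Only `Transform::Identity` is confirmed by this thread; `Bucket`, `Truncate`, and the `param()` accessor are hypothetical here.

```cpp
#include <cstdint>
#include <optional>

enum class TransformType { kIdentity, kBucket, kTruncate };

class Transform {
 public:
  // Each factory fixes the transform type and takes exactly the parameter
  // that type requires, so mismatched combinations cannot be expressed.
  static Transform Identity() {
    return Transform(TransformType::kIdentity, std::nullopt);
  }
  static Transform Bucket(int32_t num_buckets) {
    return Transform(TransformType::kBucket, num_buckets);
  }
  static Transform Truncate(int32_t width) {
    return Transform(TransformType::kTruncate, width);
  }

  TransformType transform_type() const { return transform_type_; }
  std::optional<int32_t> param() const { return param_; }

 private:
  // Private: callers must go through the factories above.
  Transform(TransformType transform_type, std::optional<int32_t> param)
      : transform_type_(transform_type), param_(param) {}

  TransformType transform_type_;
  std::optional<int32_t> param_;
};
```

With this shape, the "throw for parameterized or non-parameterized transform_type" checks raised in the two preceding comments become unnecessary, because the invalid calls no longer compile.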


TransformType Transform::transform_type() const { return transform_type_; }

expected<std::unique_ptr<TransformFunction>, Error> Transform::Bind(
Member

It seems that we should merge transform_function.h/cc into transform.h/cc. They are strongly connected.

Contributor Author

I suggest keeping them separate, so that users of Transform only need to depend on transform.h and need not concern themselves with the implementation details of the transform function.

@gty404 gty404 force-pushed the identity-transform branch from e6b2fc9 to 39148d0 Compare April 12, 2025 02:11
/// parameter.
/// \param source_type The source column type to bind to.
/// \return A TransformFunction instance wrapped in `expected`, or an error on failure.
expected<std::unique_ptr<TransformFunction>, Error> Bind(
Member

Suggested change
- expected<std::unique_ptr<TransformFunction>, Error> Bind(
+ Result<std::unique_ptr<TransformFunction>> Bind(

nit: we can leave these as-is and fix all together in a separate PR.

@gty404 gty404 force-pushed the identity-transform branch from 3e539be to 5a46de2 Compare April 13, 2025 01:30
Contributor Author

gty404 commented Apr 14, 2025

@Fokko @Xuanwo Could you help review and merge this? Thanks!

Member

@Xuanwo Xuanwo left a comment

The community has reached a consensus. Let's go!

@Xuanwo Xuanwo merged commit 185515a into apache:main Apr 14, 2025
6 checks passed
@gty404 gty404 deleted the identity-transform branch April 14, 2025 11:31