@crystalxyz
Contributor

@crystalxyz crystalxyz commented Mar 2, 2025

Which issue does this PR close?

Closes #806

Rationale for this change

This PR implements decorators for udf and udaf to make UDF creation easier.

Idea was suggested by: apache/datafusion-site#17 (comment)

What changes are included in this PR?

  • Implemented two decorator methods and exposed them to the users. The internal logic is simple because it serves as a wrapper to the actual udf/udaf methods.
  • Added test cases to validate that the decorator methods are equivalent to the original APIs.
  • Slightly modified tests/test_udf.py to make the is_null test cases more reliable. Previously, none of the input values were null, so the return value was always [False, False, False], which is the same as the default empty vector. This caused confusion during my development because the test didn't fail when I had a wrong implementation. I changed one value to NULL so that the output becomes [False, False, True], which tests the functionality better.
  • In order to use udf as both a function and a decorator, we check whether the first argument is a Callable. If so, it's a function call. If not, it's a decorator call.
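The dispatch described in the last bullet can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: `FakeUDF` is a hypothetical stand-in for the real UDF object, and the type arguments are plain strings rather than pyarrow types so the snippet runs on its own.

```python
from typing import Any, Callable


class FakeUDF:
    """Hypothetical stand-in for the real UDF object; it only records its arguments."""

    def __init__(
        self,
        func: Callable[..., Any],
        input_types: list,
        return_type: Any,
        volatility: str,
    ) -> None:
        self.func = func
        self.input_types = input_types
        self.return_type = return_type
        self.volatility = volatility

    def __call__(self, *args: Any) -> Any:
        return self.func(*args)


def udf(*args: Any, **kwargs: Any):
    """Dispatch on whether the first positional argument is callable."""
    if args and callable(args[0]):
        # Function call: udf(func, input_types, return_type, volatility)
        return FakeUDF(*args, **kwargs)

    # Decorator call: @udf(input_types, return_type, volatility)
    def decorator(func: Callable[..., Any]) -> FakeUDF:
        return FakeUDF(func, *args, **kwargs)

    return decorator


def double_impl(x: int) -> int:
    return x * 2


# Function form: the callable comes first.
double_fn = udf(double_impl, ["int64"], "int64", "immutable")


# Decorator form: no callable, only the type and volatility arguments.
@udf(["int64"], "int64", "immutable")
def double_dec(x: int) -> int:
    return x * 2
```

One consequence of this design is that the decorator form relies on its first argument (here the list of input types) not being callable; if it were, the call would be mistaken for the function form.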

Are there any user-facing changes?

Yes, this PR provides a more straightforward way for users to create UDFs and UDAFs.

Old way to create UDF:

def is_of_interest_impl(
    partkey_arr: pa.Array,
    suppkey_arr: pa.Array,
    returnflag_arr: pa.Array,
) -> pa.Array:
    ...  # Implementation skipped

is_of_interest = udf(
    is_of_interest_impl,
    [pa.int64(), pa.int64(), pa.utf8()],
    pa.bool_(),
    "stable",
)

df_udf_filter = df_lineitem.filter(
    is_of_interest(col("l_partkey"), col("l_suppkey"), col("l_returnflag"))
)

New way to create UDF:

@udf(
    [pa.int64(), pa.int64(), pa.utf8()],
    pa.bool_(),
    "stable")
def is_of_interest(
    partkey_arr: pa.Array,
    suppkey_arr: pa.Array,
    returnflag_arr: pa.Array,
) -> pa.Array:
    ...  # Implementation skipped

df_udf_filter = df_lineitem.filter(
    is_of_interest(col("l_partkey"), col("l_suppkey"), col("l_returnflag"))
)

Old way to create UDAF:

def sum_bias_10_impl() -> Summarize:
    return Summarize(10.0)

sum_bias_10 = udaf(sum_bias_10_impl, pa.float64(), pa.float64(), [pa.float64()], "immutable")
sum_bias_10(...)

New way to create UDAF:

@udaf(pa.float64(), pa.float64(), [pa.float64()], "immutable")
def sum_bias_10() -> Summarize:
    return Summarize(10.0)

sum_bias_10(...)

@crystalxyz crystalxyz changed the title from "Implementation of udf and udaf decorator" to "feat: Implementation of udf and udaf decorator" Mar 2, 2025
@crystalxyz crystalxyz marked this pull request as ready for review March 2, 2025 22:46
Member

@timsaucer timsaucer left a comment

This is a very nice addition and I love that you've already got some good unit tests.

I think from an end user perspective it might be slightly nicer if we could call these just @udf instead of @udf_decorator. I think it can be done.

This isn't my strongest suit, so I got an LLM to generate this code:

import functools

class udf:
    """Acts both as a function and a decorator."""

    def __new__(cls, func_or_value):
        if callable(func_or_value):
            return cls._decorator(func_or_value)  # If used as a decorator
        else:
            return cls._function(func_or_value)  # If used as a function

    @staticmethod
    def _function(value):
        """Original function behavior."""
        return value * 2  # Example behavior

    @staticmethod
    def _decorator(func):
        """Decorator behavior."""
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            print(f"Calling {func.__name__} with {args}, {kwargs}")
            result = func(*args, **kwargs)
            print(f"Result: {result}")
            return result
        return wrapper

Obviously we would need to adapt that some for our use case. What do you think?
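As one possible adaptation (a sketch only, not the code that ended up in this PR), the same `__new__` trick can forward the type and volatility arguments. Here the "UDF" is just a wrapped Python function carrying a hypothetical `signature` attribute, and the types are plain strings, so the example runs without DataFusion installed.

```python
import functools
from typing import Any, Callable


class udf:
    """Hypothetical sketch: one name usable as a function or a decorator factory."""

    def __new__(cls, *args: Any, **kwargs: Any):
        if args and callable(args[0]):
            # Function form: udf(func, input_types, return_type, volatility)
            return cls._function(*args, **kwargs)
        # Decorator form: @udf(input_types, return_type, volatility)
        return cls._decorator(*args, **kwargs)

    @staticmethod
    def _function(func: Callable[..., Any], input_types, return_type, volatility):
        """Wrap func and attach the declared signature (illustrative only)."""

        @functools.wraps(func)
        def wrapper(*call_args: Any) -> Any:
            return func(*call_args)

        wrapper.signature = (tuple(input_types), return_type, volatility)
        return wrapper

    @staticmethod
    def _decorator(input_types, return_type, volatility):
        """Return a decorator that defers to the function form."""

        def decorator(func: Callable[..., Any]):
            return udf._function(func, input_types, return_type, volatility)

        return decorator


def negate_impl(x: int) -> int:
    return -x


negate_fn = udf(negate_impl, ["int64"], "int64", "stable")


@udf(["int64"], "int64", "stable")
def negate(x: int) -> int:
    return -x
```

Because `__new__` returns something that is not an instance of `udf`, Python skips `__init__`, so the class acts purely as a dispatching namespace.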

@crystalxyz
Contributor Author

Thanks for your suggestion! I totally agree that @udf is a better name. I'll experiment with it and provide an update soon!

@crystalxyz
Contributor Author

@timsaucer Hi, I borrowed your suggested idea and managed to get it to work! I also used an LLM a bit to write the documentation. I hope it's not too confusing for users that the APIs for the function call and the decorator call are slightly different (one takes the callable parameter and one does not). Please let me know if you have any feedback on how to improve it!

@timsaucer
Member

This is looking very nice. Would you mind if I do some wordsmithing on the documentation? I'll also run the workflows now.

@crystalxyz
Contributor Author

crystalxyz commented Mar 3, 2025

No problem! I will also fix the linting errors!

@crystalxyz crystalxyz requested a review from timsaucer March 7, 2025 22:45
@timsaucer
Member

Thank you for all of your work on this. It looks like we now have a conflict that needs to be resolved since main was updated. I've started the CI pipeline, but it also looks like at least one ruff failure is happening. Do you have pre-commit set up in your local workspace? It should help to catch some of these before CI does. If it's not catching them, then maybe we need to update our scripts.

Development

Successfully merging this pull request may close these issues.

Add udf / udaf decorators
