Skip to content

Conversation

@deanm0000
Copy link
Contributor

Which issue does this PR close?

Works towards closing #1064

Rationale for this change

To improve ergonomics of the API by adding functions to the Expr class so that they can be chained

What changes are included in this PR?

This includes function definitions in the Expr class for all the single argument functions that take Expr and return Expr

Are there any user-facing changes?

Yes, users can use functions chained from other Exprs for example col("a").abs().tan()

No breaking changes.

@deanm0000
Copy link
Contributor Author

I wrote this script to help produce this PR. I wanted to start with just the functions which have a single Expr input that return an Expr. The script writes all those defs to a file and then I copied that over to the source and let the linter fix the bad formatting that my code created.

from inspect import signature, Parameter
from datafusion import functions as f, Expr
from types import FunctionType
from pathlib import Path

funcs_not_exprs = set(dir(f)) - set(dir(Expr))
funcs = []
for fun in funcs_not_exprs:
    if isinstance(getattr(f, fun), FunctionType):
        funcs.append(fun)

expr_in_out = {
    "one_in_out": [],
    "multi_expr": [],
    "other_ins": [],
    "other_return": [],
    "other": [],
}
for fun in funcs:
    sig = signature(getattr(f, fun))
    params = sig.parameters
    return_annotation = sig.return_annotation
    if return_annotation != "Expr":
        expr_in_out["other_return"].append(fun)
        continue
    all_expr = True
    no_star = True
    for name, param in params.items():
        if param.annotation != "Expr":
            all_expr = False
            break
        if param.kind in (Parameter.VAR_POSITIONAL, Parameter.KEYWORD_ONLY):
            no_star=False
    if len(params) == 1 and all_expr and no_star:
        expr_in_out["one_in_out"].append(fun)
    elif len(params) > 1 and all_expr:
        expr_in_out["multi_expr"].append(fun)
    elif len(params) > 1 and not all_expr:
        expr_in_out["other_ins"].append(fun)
    else:
        expr_in_out["other"].append(fun)

expr_defs = Path("./expr_defs.py")
with expr_defs.open("w") as ff:
    for fun in expr_in_out["one_in_out"]:
        ff.write(f"    def {fun}(self) -> Expr:\n")

        docstring = getattr(f, fun).__doc__
        if docstring is not None:
            ff.write('        """')
            docstring = docstring.strip()
            ff.write(docstring)
            ff.write('\n        """\n')
        ff.write(f"        return F.{fun}(self)\n")

Before I do tests for all of them, I wanted to put this in the world for feedback.

One additional idea that wasn't in the original PR would be to create namespaces to group the category of function so instead of col('a").tan() it'd be col("a").trig.tan(), col("b").list.length(), col("c").str.reverse(), col("d").dt.to_timestamp(). That keeps there from being too many available functions to choose from and it puts similar functions together. That way if someone is working with datetimes then with a datetime namespace, all the functions they look through are for datetimes. Similarly, there wouldn't be datetime functions clogging up the root of Expr. Same for trig, strings, lists, arrays, and anything else that deserves a namespace.

That, of course, requires some more manual effort to categorize the functions.

As another forward thought, when it comes to functions that take extra Expr inputs like levenshtein, I would also put in a convenience check where the function would be

def levenshtein(self, string2: Expr|str) -> Expr:
    if isinstance(string2, str):
        string2=col(string2)
    return F.levenshtein(self, string2)

That would be consistent with polars wrt to literals, so if someone wanted the levenshtein against a literal they'd have to use lit("other_string") rather than just using the "other_string" directly.

For functions that take a number as the second then it'd use that directly, such as:

def pow(self, exponent: Expr | int | float) -> Expr:
    if isinstance(exponent, (int, float)):
        exponent=lit(exponent)
    return f.pow(self, exponent)

@timsaucer
Copy link
Member

I do love this PR. I hadn't looked at it since it's in draft, but I fully endorse.

@timsaucer
Copy link
Member

Let me know when it's ready for review

@deanm0000
Copy link
Contributor Author

I'll mark it ready and just do a second PR for more methods.

@deanm0000 deanm0000 marked this pull request as ready for review March 22, 2025 16:43
@deanm0000
Copy link
Contributor Author

@timsaucer did you have any thoughts on the namespaces because going from this to that would wind up being a breaking change.

@timsaucer
Copy link
Member

At a high level I like some of the aspects of a namespace. It would especially be nice to clean up our documentation using them in functions. I haven't thought through what this would mean for organizing both functions and in Expr. We would at least need to have some helper functions marked as deprecated for a few releases if we did this. It's a really good suggestion, but I just want to make sure we have a consistent story across the repo. I haven't looked into how we could do this in PyO3 or if it would just be in the wrappers.

@timsaucer timsaucer merged commit 5a7f638 into apache:main Apr 27, 2025
@timsaucer
Copy link
Member

I must have missed that CI didn't run on this. We have a circular import and broke CI on main.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants