1064/enhancement/add functions to Expr class #1074

deanm0000 · 2025-03-20T17:54:50Z

Which issue does this PR close?

Works towards closing #1064

Rationale for this change

To improve ergonomics of the API by adding functions to the Expr class so that they can be chained

What changes are included in this PR?

This includes function definitions in the Expr class for all the single argument functions that take Expr and return Expr

Are there any user-facing changes?

Yes, users can use functions chained from other Exprs for example col("a").abs().tan()

No breaking changes.

deanm0000 · 2025-03-20T18:31:30Z

I wrote this script to help produce this PR. I wanted to start with just the functions which have a single Expr input that return an Expr. The script writes all those defs to a file and then I copied that over to the source and let the linter fix the bad formatting that my code created.

from inspect import signature, Parameter
from datafusion import functions as f, Expr
from types import FunctionType
from pathlib import Path

funcs_not_exprs = set(dir(f)) - set(dir(Expr))
funcs = []
for fun in funcs_not_exprs:
    if isinstance(getattr(f, fun), FunctionType):
        funcs.append(fun)

expr_in_out = {
    "one_in_out": [],
    "multi_expr": [],
    "other_ins": [],
    "other_return": [],
    "other": [],
}
for fun in funcs:
    sig = signature(getattr(f, fun))
    params = sig.parameters
    return_annotation = sig.return_annotation
    if return_annotation != "Expr":
        expr_in_out["other_return"].append(fun)
        continue
    all_expr = True
    no_star = True
    for name, param in params.items():
        if param.annotation != "Expr":
            all_expr = False
            break
        if param.kind in (Parameter.VAR_POSITIONAL, Parameter.KEYWORD_ONLY):
            no_star=False
    if len(params) == 1 and all_expr and no_star:
        expr_in_out["one_in_out"].append(fun)
    elif len(params) > 1 and all_expr:
        expr_in_out["multi_expr"].append(fun)
    elif len(params) > 1 and not all_expr:
        expr_in_out["other_ins"].append(fun)
    else:
        expr_in_out["other"].append(fun)

expr_defs = Path("./expr_defs.py")
with expr_defs.open("w") as ff:
    for fun in expr_in_out["one_in_out"]:
        ff.write(f"    def {fun}(self) -> Expr:\n")

        docstring = getattr(f, fun).__doc__
        if docstring is not None:
            ff.write('        """')
            docstring = docstring.strip()
            ff.write(docstring)
            ff.write('\n        """\n')
        ff.write(f"        return F.{fun}(self)\n")

Before I do tests for all of them, I wanted to put this in the world for feedback.

One additional idea that wasn't in the original PR would be to create namespaces to group the category of function so instead of col('a").tan() it'd be col("a").trig.tan(), col("b").list.length(), col("c").str.reverse(), col("d").dt.to_timestamp(). That keeps there from being too many available functions to choose from and it puts similar functions together. That way if someone is working with datetimes then with a datetime namespace, all the functions they look through are for datetimes. Similarly, there wouldn't be datetime functions clogging up the root of Expr. Same for trig, strings, lists, arrays, and anything else that deserves a namespace.

That, of course, requires some more manual effort to categorize the functions.

As another forward thought, when it comes to functions that take extra Expr inputs like levenshtein, I would also put in a convenience check where the function would be

def levenshtein(self, string2: Expr|str) -> Expr:
    if isinstance(string2, str):
        string2=col(string2)
    return F.levenshtein(self, string2)

That would be consistent with polars wrt to literals, so if someone wanted the levenshtein against a literal they'd have to use lit("other_string") rather than just using the "other_string" directly.

For functions that take a number as the second then it'd use that directly, such as:

def pow(self, exponent: Expr | int | float) -> Expr:
    if isinstance(exponent, (int, float)):
        exponent=lit(exponent)
    return f.pow(self, exponent)

timsaucer · 2025-03-22T16:30:30Z

I do love this PR. I hadn't looked at it since it's in draft, but I fully endorse.

timsaucer · 2025-03-22T16:30:55Z

Let me know when it's ready for review

deanm0000 · 2025-03-22T16:42:58Z

I'll mark it ready and just do a second PR for more methods.

deanm0000 · 2025-03-22T16:44:46Z

@timsaucer did you have any thoughts on the namespaces because going from this to that would wind up being a breaking change.

timsaucer · 2025-03-22T17:15:03Z

At a high level I like some of the aspects of a namespace. It would especially be nice to clean up our documentation using them in functions. I haven't thought through what this would mean for organizing both functions and in Expr. We would at least need to have some helper functions marked as deprecated for a few releases if we did this. It's a really good suggestion, but I just want to make sure we have a consistent story across the repo. I haven't looked into how we could do this in PyO3 or if it would just be in the wrappers.

timsaucer · 2025-04-27T15:41:21Z

I must have missed that CI didn't run on this. We have a circular import and broke CI on main.

single_arg

45c1014

deanm0000 mentioned this pull request Mar 20, 2025

Add most functions to the Expr class so that they're chainable. #1064

Open

deanm0000 marked this pull request as ready for review March 22, 2025 16:43

timsaucer approved these changes Apr 27, 2025

View reviewed changes

timsaucer merged commit 5a7f638 into apache:main Apr 27, 2025

timsaucer mentioned this pull request Apr 27, 2025

Fix regression on main with circular import on expr #1116

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

1064/enhancement/add functions to Expr class #1074

1064/enhancement/add functions to Expr class #1074

Uh oh!

deanm0000 commented Mar 20, 2025

Uh oh!

deanm0000 commented Mar 20, 2025

Uh oh!

timsaucer commented Mar 22, 2025

Uh oh!

timsaucer commented Mar 22, 2025

Uh oh!

deanm0000 commented Mar 22, 2025

Uh oh!

deanm0000 commented Mar 22, 2025

Uh oh!

timsaucer commented Mar 22, 2025

Uh oh!

timsaucer commented Apr 27, 2025

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

1064/enhancement/add functions to Expr class #1074

1064/enhancement/add functions to Expr class #1074

Uh oh!

Conversation

deanm0000 commented Mar 20, 2025

Which issue does this PR close?

Rationale for this change

What changes are included in this PR?

Are there any user-facing changes?

Uh oh!

deanm0000 commented Mar 20, 2025

Uh oh!

timsaucer commented Mar 22, 2025

Uh oh!

timsaucer commented Mar 22, 2025

Uh oh!

deanm0000 commented Mar 22, 2025

Uh oh!

deanm0000 commented Mar 22, 2025

Uh oh!

timsaucer commented Mar 22, 2025

Uh oh!

timsaucer commented Apr 27, 2025

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants