-
Notifications
You must be signed in to change notification settings - Fork 168
feat: Support over
expressions more freely, make expressions printable, rewrite internals (travelling pr 🌴 )
#3152
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
4feae2f
to
b4141af
Compare
perf: Use linked list for `ExprMetadata`
# This should be set with extreme care, only in `_expression_parsing.py`, | ||
# and never from within any compliant class. | ||
_opt_metadata: ExprMetadata | None = None | ||
|
||
@property | ||
def _metadata(self) -> ExprMetadata: | ||
assert self._opt_metadata is not None # noqa: S101 | ||
return self._opt_metadata |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Making this a property that always returns ExprMetadata
is a step in the right direction 👍
Have you thought about formalising this invariant more than just using None
and assert
?
For example, if we could write something like:
ExprMetadata.UNSET # patent-pending
Then this sentinel could raise an informative error whenever it is used.
IIRC, you could run into that assertion if you try to construct certain kinds of expressions at the compliant-level, since they don't have the same checking/propagation as the narwhals-level.
For the benefit of typing the attribute/property would always return ExprMetadata
and never None
.
But in the case where we have written something incorrectly, at runtime we would get an error saying:
Hey dude, clippy here, looks like you forgot to do the thing in this way?
Maybe try this instead? 😉
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
sure, i've added a develop-facing error message, thanks (I think I prefer None
to UNSET
though)
Unless there's objections, I'll do a careful read-though of everything and then will plan to ship this by the end of the week |
I've been on holiday I had a bit of time while travelling recently, so I tried rewriting the internals (yes, again)
The general idea is that
narwhals.Expr
stores a list of expressions in._nodes
, rather than passing around a bunch of opaque lambdasWhat this enables
Expressions are pretty-printable
For example:
This is still quite basic, but we could make it more complex by introducing line breaks if it gets too long. Seems like the kind of thing @camriddell might be interested in?
We can do simple expression rewrites
Currently,
(nw.col('a').mean() + 1).over('b')
isn't supported:over
(mean(a) + 1) over (partition by b)
isn't valid syntax, it should bemean(a) over (partition by b) + 1
With this PR, however, it is!
What we do here is, when inserting an
over
node, we push it down before any elementwise operations (such as+
,.abs()
,sum_horizontal
, ...) and apply it to all expressions. There's some more details in the expansion to "how it works"So now, expressions like
(nw.col('a').mean() + 1).over('b')
can be supported for all backendsThis rewrite is extremely simple and cheap, it's just a matter of inserting a node at some position
i
rather than at the end of a list. In general, query optimisation is out of scope for Narwhals. But, given that this enables more of Polars' flexibility for other backends, I think this can be in scope.Per-group broadcasting
Previously,
nw.col('a') - nw.col('a').mean()
would be fine, but(nw.col('a') - nw.col('a').mean()).over('b')
would raise for sql-like backends. Now, it works fine across all backends! Really useful for feature engineeringSimplified internals
depth
,function_name
,scalar_kwargs
CompliantWhen
/CompliantThen
and their complicated interaction with justCompliantNamespace.when_then
In fact, this goes as far as reducing package size by almost 1%. Not that that was the objective with this work, but it's nice to see that it doesn't make the package bigger
What this may open the doors to
nw.Expr.from_json(expr.to_json())
nw.col('a').shift(7).rolling_mean(7).over('store', order_by='date')
df.group_by('a').agg((nw.col('b')-nw.col('c')).mean())
Benchmarks
I'm not seeing any meaningful impact on performance