Commit 16b4527

feat: unify exception for nested over statements, nested aggregations, filtrations on aggregations, document expression metadata better (#2351)
* feat: unify exception for nested ``over`` statements
* rename exprmetadata selector staticmethods
* docs fixups
* simple -> single
* n_closed_windows => has_windows
* docs
* fixup
* simplify
* it can always be done simpler!
* coverage
* update outdated "open window" refs
* update outdated "open window" refs
* document effect of UNCLOSEABLE window
* typos (thanks Dan!)
1 parent e1526a8 commit 16b4527

File tree

10 files changed

+388
-188
lines changed

docs/how_it_works.md

Lines changed: 99 additions & 9 deletions
Original file line numberDiff line numberDiff line change
@@ -272,7 +272,104 @@ print((pn.col("a") + 1).mean())
272272
For simple aggregations, Narwhals can just look at `_depth` and `function_name` and figure out
273273
which (efficient) elementary operation this corresponds to in pandas.
274274

275-
## Broadcasting
275+
## Expression Metadata
276+
277+
Let's try printing out a few expressions to the console to see what they show us:
278+
279+
```python exec="1" result="python" session="metadata" source="above"
280+
import narwhals as nw
281+
282+
print(nw.col("a"))
283+
print(nw.col("a").mean())
284+
print(nw.col("a").mean().over("b"))
285+
```
286+
287+
Note how they tell us something about their metadata. This section is all about
288+
making sense of what that all means, what the rules are, and what it enables.
289+
290+
### Expression kinds
291+
292+
Each Narwhals expression can be of one of the following kinds:
293+
294+
- `LITERAL`: expressions which correspond to literal values, such as the `3` in `nw.col('a')+3`.
295+
- `AGGREGATION`: expressions which reduce a column to a single value (e.g. `nw.col('a').mean()`).
296+
- `TRANSFORM`: expressions which don't change length (e.g. `nw.col('a').abs()`).
297+
- `WINDOW`: like `TRANSFORM`, but the last operation is a (row-order-dependent)
298+
window function (`rolling_*`, `cum_*`, `diff`, `shift`, `is_*_distinct`).
299+
- `FILTRATION`: expressions which change length but don't
300+
aggregate (e.g. `nw.col('a').drop_nulls()`).
301+
302+
For example:
303+
304+
- `nw.col('a')` is not order-dependent, so it's `TRANSFORM`.
305+
- `nw.col('a').abs()` is not order-dependent, so it's a `TRANSFORM`.
306+
- `nw.col('a').cum_sum()`'s last operation is `cum_sum`, so it's `WINDOW`.
307+
- `nw.col('a').cum_sum() + 1`'s last operation is `__add__`, and it preserves
308+
the input dataframe's length, so it's a `TRANSFORM`.
309+
310+
How an expression's kind changes depends on the operation applied to it.
311+
312+
#### Chaining
313+
314+
Say we have `expr.expr_method()`. How does `expr`'s `ExprMetadata` change?
315+
This depends on `expr_method`.
316+
317+
- Element-wise expressions such as `abs`, `alias`, `cast`, `__invert__`, and
318+
many more, preserve the input kind (unless `expr` is a `WINDOW`, in
319+
which case it becomes a `TRANSFORM`. This is because for an expression
320+
to be `WINDOW`, the last expression needs to be the order-dependent one).
321+
- `rolling_*`, `cum_*`, `diff`, `shift`, `ewm_mean`, and `is_*_distinct`
322+
are window functions and result in `WINDOW`.
323+
- `mean`, `std`, `median`, and other aggregations result in `AGGREGATION`,
324+
and can only be applied to `TRANSFORM` and `WINDOW`.
325+
- `drop_nulls` and `filter` result in `FILTRATION`, and can only be applied
326+
to `TRANSFORM` and `WINDOW`.
327+
- `over` always results in `TRANSFORM`. This is a bit more complicated,
328+
so we elaborate on it in the ["You open a window ..."](#you-open-a-window-to-another-window-to-another-window-to-another-window) section.
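
The chaining rules above can be sketched as a small lookup function. This is an illustrative toy model, not the Narwhals implementation: the `chain` function and the method-name sets are hypothetical, the sets list only a few of the methods named above, and any validation beyond what the list states is not modelled.

```python
from enum import Enum, auto


class ExprKind(Enum):
    LITERAL = auto()
    AGGREGATION = auto()
    TRANSFORM = auto()
    WINDOW = auto()
    FILTRATION = auto()


# Hypothetical groupings of method names, for illustration only.
ELEMENTWISE = {"abs", "alias", "cast", "__invert__"}
WINDOW_FUNCTIONS = {"cum_sum", "diff", "shift", "ewm_mean"}
AGGREGATIONS = {"mean", "std", "median"}
FILTRATIONS = {"drop_nulls", "filter"}
LENGTH_PRESERVING = {ExprKind.TRANSFORM, ExprKind.WINDOW}


def chain(kind: ExprKind, method: str) -> ExprKind:
    """Return the kind of `expr.method()` given the kind of `expr`."""
    if method in ELEMENTWISE:
        # Element-wise methods preserve the kind, except that a WINDOW
        # stops being WINDOW once the window function is no longer last.
        return ExprKind.TRANSFORM if kind is ExprKind.WINDOW else kind
    if method in WINDOW_FUNCTIONS:
        return ExprKind.WINDOW
    if method in AGGREGATIONS or method in FILTRATIONS:
        if kind not in LENGTH_PRESERVING:
            msg = f"`{method}` can only be applied to TRANSFORM or WINDOW, got {kind.name}"
            raise ValueError(msg)
        return ExprKind.AGGREGATION if method in AGGREGATIONS else ExprKind.FILTRATION
    if method == "over":
        return ExprKind.TRANSFORM
    msg = f"method {method!r} is not part of this sketch"
    raise NotImplementedError(msg)


# `nw.col('a').cum_sum().abs().mean()`, starting from a TRANSFORM selector.
kind = ExprKind.TRANSFORM
for method in ["cum_sum", "abs", "mean"]:
    kind = chain(kind, method)
print(kind)  # ExprKind.AGGREGATION
```

Under this model, `chain(ExprKind.AGGREGATION, "mean")` raises, which corresponds to the rule that aggregations can only be applied to `TRANSFORM` and `WINDOW`.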
329+
330+
#### Binary operations (e.g. `nw.col('a') + nw.col('b')`)
331+
332+
How do expression kinds change under binary operations? For example,
333+
if we do `expr1 + expr2`, then what can we say about the output kind?
334+
The rules are:
335+
336+
- If both are `LITERAL`, then the output is `LITERAL`.
337+
- If one is a `FILTRATION`, then:
338+
339+
- if the other is `LITERAL` or `AGGREGATION`, then the output is `FILTRATION`.
340+
- else, we raise an error.
341+
342+
- If one is `TRANSFORM` or `WINDOW` and the other is not `FILTRATION`,
343+
then the output is `TRANSFORM`.
344+
- If one is `AGGREGATION` and the other is `LITERAL` or `AGGREGATION`,
345+
the output is `AGGREGATION`.
346+
347+
For n-ary operations such as `nw.sum_horizontal`, the above logic is
348+
extended across inputs. For example, `nw.sum_horizontal(expr1, expr2, expr3)`
349+
is `LITERAL` if all of `expr1`, `expr2`, and `expr3` are.
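
As a rough illustration of the rules above, here is a minimal standalone sketch that combines kinds across any number of inputs. The function name `combine_kinds` is hypothetical and the real `combine_metadata` also handles strings, expansion kinds, and window kinds, none of which are modelled here. Following the `ExprKind` docstring, two `FILTRATION` inputs are treated as an error.

```python
from enum import Enum, auto


class ExprKind(Enum):
    LITERAL = auto()
    AGGREGATION = auto()
    TRANSFORM = auto()
    WINDOW = auto()
    FILTRATION = auto()


def combine_kinds(*kinds: ExprKind) -> ExprKind:
    """Return the output kind of an n-ary operation over `kinds`."""
    if not kinds:
        raise ValueError("at least one input is required")
    present = set(kinds)
    if present == {ExprKind.LITERAL}:
        return ExprKind.LITERAL
    if ExprKind.FILTRATION in present:
        n_filtrations = sum(kind is ExprKind.FILTRATION for kind in kinds)
        others = present - {ExprKind.FILTRATION}
        # A filtration may only be combined with literals and aggregations.
        if n_filtrations > 1 or not others <= {ExprKind.LITERAL, ExprKind.AGGREGATION}:
            raise ValueError("cannot combine FILTRATION with length-preserving inputs")
        return ExprKind.FILTRATION
    if present & {ExprKind.TRANSFORM, ExprKind.WINDOW}:
        return ExprKind.TRANSFORM
    # Only AGGREGATION, possibly mixed with LITERAL, remains.
    return ExprKind.AGGREGATION


print(combine_kinds(ExprKind.WINDOW, ExprKind.AGGREGATION))  # ExprKind.TRANSFORM
print(combine_kinds(ExprKind.AGGREGATION, ExprKind.LITERAL))  # ExprKind.AGGREGATION
```

Working on the set of kinds present, rather than pairwise, is one way to extend the binary rules to n-ary operations such as `nw.sum_horizontal`.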
350+
351+
### "You open a window to another window to another window to another window"
352+
353+
When we print out an expression, in addition to the expression kind,
354+
we also see `window_kind`. There are four window kinds:
355+
356+
- `NONE`: non-order-dependent operations, like `.abs()` or `.mean()`.
357+
- `CLOSEABLE`: expression where the last operation is order-dependent. For
358+
example, `nw.col('a').diff()`.
359+
- `UNCLOSEABLE`: expression where some operation is order-dependent but
360+
the order-dependent operation wasn't the last one. For example,
361+
`nw.col('a').diff().abs()`.
362+
- `CLOSED`: expression contains `over` at some point, and any order-dependent
363+
operation was immediately followed by `over(order_by=...)`.
364+
365+
When working with `DataFrame`s, row order is well-defined, as the dataframes
366+
are assumed to be eager and in-memory. Therefore, it's allowed to work
367+
with all window kinds.
368+
369+
When working with `LazyFrame`s, on the other hand, row order is undefined.
370+
Therefore, window kinds must either be `NONE` or `CLOSED`.
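
To make the window kinds concrete, here is a toy model of how they might evolve as methods are chained, and of the `LazyFrame` restriction. The helper functions are hypothetical and cover only the cases described above; in particular, keeping `CLOSED` unchanged under element-wise operations is an assumption based on the definition of `CLOSED`, and cases the text does not describe are left unmodelled.

```python
from enum import Enum, auto


class WindowKind(Enum):
    NONE = auto()
    CLOSEABLE = auto()
    UNCLOSEABLE = auto()
    CLOSED = auto()


def after_window_function(kind: WindowKind) -> WindowKind:
    """Window kind after an order-dependent method such as `diff` or `cum_sum`."""
    if kind is WindowKind.NONE:
        return WindowKind.CLOSEABLE
    if kind is WindowKind.CLOSED:
        raise NotImplementedError("not modelled in this sketch")
    # An earlier window function is no longer the last operation.
    return WindowKind.UNCLOSEABLE


def after_elementwise(kind: WindowKind) -> WindowKind:
    """Window kind after an element-wise method such as `abs`."""
    if kind is WindowKind.CLOSEABLE:
        return WindowKind.UNCLOSEABLE
    return kind  # assumption: NONE, UNCLOSEABLE and CLOSED are unchanged


def after_over_order_by(kind: WindowKind) -> WindowKind:
    """Window kind after `over(order_by=...)` directly following a window function."""
    if kind is WindowKind.CLOSEABLE:
        return WindowKind.CLOSED
    raise NotImplementedError("not modelled in this sketch")


def allowed_in_lazyframe(kind: WindowKind) -> bool:
    """Row order is undefined for LazyFrames, so open windows are rejected."""
    return kind in {WindowKind.NONE, WindowKind.CLOSED}


# `nw.col('a').diff().abs()`: the window can no longer be closed.
uncloseable = after_elementwise(after_window_function(WindowKind.NONE))
# `nw.col('a').cum_sum().over(order_by='i')`: the window is closed.
closed = after_over_order_by(after_window_function(WindowKind.NONE))

print(uncloseable, allowed_in_lazyframe(uncloseable))  # WindowKind.UNCLOSEABLE False
print(closed, allowed_in_lazyframe(closed))  # WindowKind.CLOSED True
```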
371+
372+
### Broadcasting
276373

277374
When performing comparisons between columns and aggregations or scalars, we operate as if the
278375
aggregation or scalar was broadcasted to the length of the whole column. For example, if we
@@ -282,14 +379,7 @@ with values `[-1, 0, 1]`.
282379

283380
Different libraries do broadcasting differently. SQL-like libraries require an empty window
284381
function for expressions (e.g. `a - sum(a) over ()`), Polars does its own broadcasting of
285-
length-1 Series, and pandas does its own broadcasting of scalars. Narwhals keeps track of
286-
when to trigger a broadcast by tracking the `ExprKind` of each expression. `ExprKind` is an
287-
`Enum` with four variants:
288-
289-
- `TRANSFORM`: expressions which don't change length (e.g. `nw.col('a').abs()`).
290-
- `AGGREGATION`: expressions which reduce a column to a single value (e.g. `nw.col('a').mean()`).
291-
- `CHANGE_LENGTH`: expressions which change length but don't necessarily aggregate (e.g. `nw.col('a').drop_nulls()`).
292-
- `LITERAL`: expressions which correspond to literal values, such as the `3` in `nw.col('a')+3`.
382+
length-1 Series, and pandas does its own broadcasting of scalars.
293383

294384
Narwhals triggers a broadcast in these situations:
295385

narwhals/_expression_parsing.py

Lines changed: 101 additions & 27 deletions
Original file line numberDiff line numberDiff line change
@@ -121,7 +121,7 @@ class ExprKind(Enum):
121121
- LITERAL vs LITERAL -> LITERAL
122122
- FILTRATION vs (LITERAL | AGGREGATION) -> FILTRATION
123123
- FILTRATION vs (FILTRATION | TRANSFORM | WINDOW) -> raise
124-
- (TRANSFORM | WINDOW) vs (LITERAL | AGGREGATION) -> TRANSFORM
124+
- (TRANSFORM | WINDOW) vs (...) -> TRANSFORM
125125
- AGGREGATION vs (LITERAL | AGGREGATION) -> AGGREGATION
126126
"""
127127

@@ -191,30 +191,64 @@ def is_multi_output(
191191
return expansion_kind in {ExpansionKind.MULTI_NAMED, ExpansionKind.MULTI_UNNAMED}
192192

193193

194+
class WindowKind(Enum):
195+
"""Describe what kind of window the expression contains."""
196+
197+
NONE = auto()
198+
"""e.g. `nw.col('a').abs()`, no windows."""
199+
200+
CLOSEABLE = auto()
201+
"""e.g. `nw.col('a').cum_sum()` - can be closed if immediately followed by `over(order_by=...)`."""
202+
203+
UNCLOSEABLE = auto()
204+
"""e.g. `nw.col('a').cum_sum().abs()` - the window function (`cum_sum`) wasn't immediately followed by
205+
`over(order_by=...)`, and so the window is uncloseable.
206+
207+
Uncloseable windows can be used freely in `nw.DataFrame`, but not in `nw.LazyFrame` where
208+
row-order is undefined."""
209+
210+
CLOSED = auto()
211+
"""e.g. `nw.col('a').cum_sum().over(order_by='i')`."""
212+
213+
def is_open(self) -> bool:
214+
return self in {WindowKind.UNCLOSEABLE, WindowKind.CLOSEABLE}
215+
216+
def is_closed(self) -> bool:
217+
return self is WindowKind.CLOSED
218+
219+
def is_uncloseable(self) -> bool:
220+
return self is WindowKind.UNCLOSEABLE
221+
222+
194223
class ExprMetadata:
195-
__slots__ = ("_expansion_kind", "_kind", "_n_open_windows")
224+
__slots__ = ("_expansion_kind", "_kind", "_window_kind")
196225

197226
def __init__(
198-
self, kind: ExprKind, /, *, n_open_windows: int, expansion_kind: ExpansionKind
227+
self,
228+
kind: ExprKind,
229+
/,
230+
*,
231+
window_kind: WindowKind,
232+
expansion_kind: ExpansionKind,
199233
) -> None:
200234
self._kind: ExprKind = kind
201-
self._n_open_windows = n_open_windows
235+
self._window_kind = window_kind
202236
self._expansion_kind = expansion_kind
203237

204238
def __init_subclass__(cls, /, *args: Any, **kwds: Any) -> Never: # pragma: no cover
205239
msg = f"Cannot subclass {cls.__name__!r}"
206240
raise TypeError(msg)
207241

208242
def __repr__(self) -> str:
209-
return f"ExprMetadata(kind: {self._kind}, n_open_windows: {self._n_open_windows}, expansion_kind: {self._expansion_kind})"
243+
return f"ExprMetadata(kind: {self._kind}, window_kind: {self._window_kind}, expansion_kind: {self._expansion_kind})"
210244

211245
@property
212246
def kind(self) -> ExprKind:
213247
return self._kind
214248

215249
@property
216-
def n_open_windows(self) -> int:
217-
return self._n_open_windows
250+
def window_kind(self) -> WindowKind:
251+
return self._window_kind
218252

219253
@property
220254
def expansion_kind(self) -> ExpansionKind:
@@ -223,50 +257,77 @@ def expansion_kind(self) -> ExpansionKind:
223257
def with_kind(self, kind: ExprKind, /) -> ExprMetadata:
224258
"""Change metadata kind, leaving all other attributes the same."""
225259
return ExprMetadata(
226-
kind, n_open_windows=self._n_open_windows, expansion_kind=self._expansion_kind
260+
kind,
261+
window_kind=self._window_kind,
262+
expansion_kind=self._expansion_kind,
227263
)
228264

229-
def with_extra_open_window(self) -> ExprMetadata:
230-
"""Increment `n_open_windows` leaving other attributes the same."""
265+
def with_uncloseable_window(self) -> ExprMetadata:
266+
"""Add uncloseable window, leaving other attributes the same."""
267+
if self._window_kind is WindowKind.CLOSED: # pragma: no cover
268+
msg = "Unreachable code, please report a bug."
269+
raise AssertionError(msg)
231270
return ExprMetadata(
232271
self.kind,
233-
n_open_windows=self._n_open_windows + 1,
272+
window_kind=WindowKind.UNCLOSEABLE,
273+
expansion_kind=self._expansion_kind,
274+
)
275+
276+
def with_kind_and_closeable_window(self, kind: ExprKind, /) -> ExprMetadata:
277+
"""Change metadata kind and add closeable window.
278+
279+
If we already have an uncloseable window, the window stays uncloseable.
280+
"""
281+
if self._window_kind is WindowKind.NONE:
282+
window_kind = WindowKind.CLOSEABLE
283+
elif self._window_kind is WindowKind.CLOSED: # pragma: no cover
284+
msg = "Unreachable code, please report a bug."
285+
raise AssertionError(msg)
286+
else:
287+
window_kind = WindowKind.UNCLOSEABLE
288+
return ExprMetadata(
289+
kind,
290+
window_kind=window_kind,
234291
expansion_kind=self._expansion_kind,
235292
)
236293

237-
def with_kind_and_extra_open_window(self, kind: ExprKind, /) -> ExprMetadata:
238-
"""Change metadata kind and increment `n_open_windows`."""
294+
def with_kind_and_uncloseable_window(self, kind: ExprKind, /) -> ExprMetadata:
295+
"""Change metadata kind and set window kind to uncloseable."""
239296
return ExprMetadata(
240297
kind,
241-
n_open_windows=self._n_open_windows + 1,
298+
window_kind=WindowKind.UNCLOSEABLE,
242299
expansion_kind=self._expansion_kind,
243300
)
244301

245302
@staticmethod
246-
def simple_selector() -> ExprMetadata:
303+
def selector_single() -> ExprMetadata:
247304
# e.g. `nw.col('a')`, `nw.nth(0)`
248305
return ExprMetadata(
249-
ExprKind.TRANSFORM, n_open_windows=0, expansion_kind=ExpansionKind.SINGLE
306+
ExprKind.TRANSFORM,
307+
window_kind=WindowKind.NONE,
308+
expansion_kind=ExpansionKind.SINGLE,
250309
)
251310

252311
@staticmethod
253-
def multi_output_selector_named() -> ExprMetadata:
312+
def selector_multi_named() -> ExprMetadata:
254313
# e.g. `nw.col('a', 'b')`
255314
return ExprMetadata(
256-
ExprKind.TRANSFORM, n_open_windows=0, expansion_kind=ExpansionKind.MULTI_NAMED
315+
ExprKind.TRANSFORM,
316+
window_kind=WindowKind.NONE,
317+
expansion_kind=ExpansionKind.MULTI_NAMED,
257318
)
258319

259320
@staticmethod
260-
def multi_output_selector_unnamed() -> ExprMetadata:
321+
def selector_multi_unnamed() -> ExprMetadata:
261322
# e.g. `nw.all()`
262323
return ExprMetadata(
263324
ExprKind.TRANSFORM,
264-
n_open_windows=0,
325+
window_kind=WindowKind.NONE,
265326
expansion_kind=ExpansionKind.MULTI_UNNAMED,
266327
)
267328

268329

269-
def combine_metadata(
330+
def combine_metadata( # noqa: PLR0915
270331
*args: IntoExpr | object | None,
271332
str_as_lit: bool,
272333
allow_multi_output: bool,
@@ -285,8 +346,10 @@ def combine_metadata(
285346
has_transforms_or_windows = False
286347
has_aggregations = False
287348
has_literals = False
288-
result_n_open_windows = 0
289349
result_expansion_kind = ExpansionKind.SINGLE
350+
has_closeable_windows = False
351+
has_uncloseable_windows = False
352+
has_closed_windows = False
290353

291354
for i, arg in enumerate(args):
292355
if isinstance(arg, str) and not str_as_lit:
@@ -307,8 +370,6 @@ def combine_metadata(
307370
result_expansion_kind = resolve_expansion_kind(
308371
result_expansion_kind, arg._metadata.expansion_kind
309372
)
310-
if arg._metadata.n_open_windows:
311-
result_n_open_windows += 1
312373
kind = arg._metadata.kind
313374
if kind is ExprKind.AGGREGATION:
314375
has_aggregations = True
@@ -322,6 +383,14 @@ def combine_metadata(
322383
msg = "unreachable code"
323384
raise AssertionError(msg)
324385

386+
window_kind = arg._metadata.window_kind
387+
if window_kind is WindowKind.UNCLOSEABLE:
388+
has_uncloseable_windows = True
389+
elif window_kind is WindowKind.CLOSEABLE:
390+
has_closeable_windows = True
391+
elif window_kind is WindowKind.CLOSED:
392+
has_closed_windows = True
393+
325394
if (
326395
has_literals
327396
and not has_aggregations
@@ -342,10 +411,15 @@ def combine_metadata(
342411
else:
343412
result_kind = ExprKind.AGGREGATION
344413

414+
if has_uncloseable_windows or has_closeable_windows:
415+
result_window_kind = WindowKind.UNCLOSEABLE
416+
elif has_closed_windows:
417+
result_window_kind = WindowKind.CLOSED
418+
else:
419+
result_window_kind = WindowKind.NONE
420+
345421
return ExprMetadata(
346-
result_kind,
347-
n_open_windows=result_n_open_windows,
348-
expansion_kind=result_expansion_kind,
422+
result_kind, window_kind=result_window_kind, expansion_kind=result_expansion_kind
349423
)
350424

351425

narwhals/dataframe.py

Lines changed: 1 addition & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -2152,7 +2152,7 @@ def _extract_compliant(self: Self, arg: Any) -> Any:
21522152
plx = self.__narwhals_namespace__()
21532153
return plx.col(arg)
21542154
if isinstance(arg, Expr):
2155-
if arg._metadata.n_open_windows > 0:
2155+
if arg._metadata._window_kind.is_open():
21562156
msg = (
21572157
"Order-dependent expressions are not supported for use in LazyFrame.\n\n"
21582158
"Hints:\n"
