feat(expr-ir): Support over(*partition_by) #3224
Draft: dangotbanned wants to merge 32 commits into oh-nodes from expr-ir/over-and-over-and-over-again
+1,100 −109
Conversation
`expected` is now taken from testing the same selector on `main`
Adopting what polars does is simpler than special-casing
…d-over-and-over-again
Aligning this is not important
Adding `parse_into_selector_ir` will require calling this a lot. I'd rather skip using `re` when a more performant option is available.
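For a rough sense of the kind of difference meant here (my illustration, not the PR's code), a plain `str` method avoids the regex engine entirely for simple prefix checks:

```python
# Illustrative micro-comparison only (not from the PR): for a plain prefix
# test, `str.startswith` sidesteps the regex engine entirely.
import re
import timeit

names = [f"col_{i}" for i in range(1000)]
pattern = re.compile(r"^col_1")

t_re = timeit.timeit(lambda: [n for n in names if pattern.match(n)], number=100)
t_str = timeit.timeit(lambda: [n for n in names if n.startswith("col_1")], number=100)
print(f"re: {t_re:.4f}s  str: {t_str:.4f}s")  # str is typically faster
```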
Still have some translations missing. `by_index` will mean updating `matches_column` to *also* pass in the schema index.
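A hypothetical sketch of what that signature change could look like (`ByIndex` and this `matches_column` are illustrative names, not the actual narwhals code): passing the schema position alongside the name lets a positional selector share the same predicate as name-based ones.

```python
# Hypothetical sketch; `ByIndex` / `matches_column` are illustrative only.
from dataclasses import dataclass


@dataclass
class ByIndex:
    indices: tuple[int, ...]

    def matches_column(self, name: str, index: int) -> bool:
        # `name` is unused for positional selection, but keeping it in the
        # signature lets name-based and index-based selectors share one API.
        return index in self.indices


schema = ("a", "b", "c", "d")
selector = ByIndex((0, 2))
print([n for i, n in enumerate(schema) if selector.matches_column(n, i)])
# -> ['a', 'c']
```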
Supports selector input for partitions
- Already works, but I want to add some optimizations for the single-partition case
- `pc.unique` can be used directly on a lot of `ChunkedArray` types, but `filter` will drop nulls by default, so it needs some care if they are present (see the sketch below)
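A small demonstration of both points using documented pyarrow APIs (the data is made up): `pc.unique` accepts a `ChunkedArray` directly, and `filter` drops rows where the mask is null unless told otherwise.

```python
import pyarrow as pa
import pyarrow.compute as pc

ca = pa.chunked_array([["a", "b"], ["a", "c"]])
print(pc.unique(ca))  # works directly on the ChunkedArray, no flattening needed

values = pa.array([1, 2, 3, 4])
mask = pa.array([True, None, False, True])
print(values.filter(mask))  # [1, 4] -- the null mask entry drops row 2
print(values.filter(mask, null_selection_behavior="emit_null"))  # [1, null, 4]
```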
Avoids the need for a temporary composite key column by using `dictionary_encode` and generating boolean masks based on index position.
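A minimal sketch of that approach, assuming a single partition column (the variable names mirror the reviewed snippet below; the real code would encode the composite of all partition keys):

```python
import pyarrow as pa
import pyarrow.compute as pc

native = pa.table({"key": ["a", "b", "a", "b"], "value": [1, 2, 3, 4]})
# Encode the partition column once: each distinct key becomes an integer code.
arr_dict = pa.array(["a", "b", "a", "b"]).dictionary_encode()
indices = arr_dict.indices  # integer code per row
for idx in range(len(arr_dict.dictionary)):
    # Boolean mask selecting the rows belonging to partition `idx`.
    mask = pc.equal(pa.scalar(idx), indices)
    print(native.filter(mask))
```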
Left a comment in `selectors` about this issue earlier
Comment on lines +198 to +201
```python
for idx in range(len(arr_dict.dictionary)):
    # NOTE: Acero filter doesn't support `null_selection_behavior="emit_null"`
    # Is there any reasonable way to do this in Acero?
    yield native.filter(pc.equal(pa.scalar(idx), indices))
```
Is this for use in `over(partition_by=...)`? If so, just as a heads up, we won't be able to accept a solution which involves looping over partitions in Python.
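For context, one loop-free shape this could take (my sketch, not what the PR does): compute the aggregate once per key with `Table.group_by` and broadcast it back with a join, keeping all per-partition work inside Arrow.

```python
import pyarrow as pa

tbl = pa.table({"key": ["a", "b", "a", "b"], "value": [1, 2, 3, 4]})
# Aggregate once per partition key...
agg = tbl.group_by("key").aggregate([("value", "sum")])  # columns: key, value_sum
# ...then join back so every row carries its partition's aggregate.
print(tbl.join(agg, keys="key"))
```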
Related issues
- Expr IR #2572
- `order_by`, `hashjoin`, `DataFrame.{filter,join}`, `Expr.is_{first,last}_distinct` #3173

Notes
- `pc.unique` preserving order

Tasks
- `by_dtype` naive hash
- `by_dtype` handles bare parametric types
- `DataFrame.partition_by` (`pyarrow`) to support more kinds of `over(*partition_by)`
  - the requirements of `over(*partition_by)` are a superset of what's needed for it
  - `over` is reduced to adapting vector functions to operate on those partitions
- `DataFrame.partition_by`
- `Group_by.__iter__`
- `over(*partition_by)` (non-aggregating)
- `union` or `pyarrow.concat_tables`
- `GroupBy.agg(<BinaryExpr>)`
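On the `union` / `pyarrow.concat_tables` item: a trivial sketch (made-up data) of recombining per-partition results, which is the step that follows applying a vector function to each partition.

```python
import pyarrow as pa

parts = [
    pa.table({"key": ["a", "a"], "value": [1, 3]}),
    pa.table({"key": ["b", "b"], "value": [2, 4]}),
]
# Stitch the per-partition results back into a single table.
print(pa.concat_tables(parts))
```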