feat(expr-ir): Support over(*partition_by) #3224
Draft: dangotbanned wants to merge 32 commits into oh-nodes from expr-ir/over-and-over-and-over-again
+1,100 −109
Conversation
`expected` is now taken from testing the same selector on `main`
Adopting what polars does is simpler than special-casing
…d-over-and-over-again
Aligning this is not important
Adding `parse_into_selector_ir` will require calling this a lot. I'd rather skip using `re` when a more performant option is available.
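For a rough sense of the kind of difference meant here (my illustration, not the PR's code), a plain `str` method avoids the regex engine entirely for simple prefix checks:

```python
# Illustrative micro-comparison only (not from the PR): for a plain prefix
# test, `str.startswith` sidesteps the regex engine entirely.
import re
import timeit

names = [f"col_{i}" for i in range(1000)]
pattern = re.compile(r"^col_1")

t_re = timeit.timeit(lambda: [n for n in names if pattern.match(n)], number=100)
t_str = timeit.timeit(lambda: [n for n in names if n.startswith("col_1")], number=100)
print(f"re: {t_re:.4f}s  str: {t_str:.4f}s")  # str is typically faster
```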
Still have some translations missing. `by_index` will mean updating `matches_column` to *also* pass in the schema index.
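A hypothetical sketch of what that signature change could look like (`ByIndex` and this `matches_column` are illustrative names, not the actual narwhals code): passing the schema position alongside the name lets a positional selector share the same predicate as name-based ones.

```python
# Hypothetical sketch; `ByIndex` / `matches_column` are illustrative only.
from dataclasses import dataclass


@dataclass
class ByIndex:
    indices: tuple[int, ...]

    def matches_column(self, name: str, index: int) -> bool:
        # `name` is unused for positional selection, but keeping it in the
        # signature lets name-based and index-based selectors share one API.
        return index in self.indices


schema = ("a", "b", "c", "d")
selector = ByIndex((0, 2))
print([n for i, n in enumerate(schema) if selector.matches_column(n, i)])
# -> ['a', 'c']
```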
Supports selector input for partitions
- Already works, but I want to add some optimizations for the single-partition case
- `pc.unique` can be used directly on a lot of `ChunkedArray` types, but `filter` will drop nulls by default, so it needs some care if they are present (see the sketch below)
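A small demonstration of both points using documented pyarrow APIs (the data is made up): `pc.unique` accepts a `ChunkedArray` directly, and `filter` drops rows where the mask is null unless told otherwise.

```python
import pyarrow as pa
import pyarrow.compute as pc

ca = pa.chunked_array([["a", "b"], ["a", "c"]])
print(pc.unique(ca))  # works directly on the ChunkedArray, no flattening needed

values = pa.array([1, 2, 3, 4])
mask = pa.array([True, None, False, True])
print(values.filter(mask))  # [1, 4] -- the null mask entry drops row 2
print(values.filter(mask, null_selection_behavior="emit_null"))  # [1, null, 4]
```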
Avoids the need for a temporary composite key column by using `dictionary_encode` and generating boolean masks based on index position.
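A minimal sketch of that approach, assuming a single partition column (the variable names mirror the reviewed snippet below; the real code would encode the composite of all partition keys):

```python
import pyarrow as pa
import pyarrow.compute as pc

native = pa.table({"key": ["a", "b", "a", "b"], "value": [1, 2, 3, 4]})
# Encode the partition column once: each distinct key becomes an integer code.
arr_dict = pa.array(["a", "b", "a", "b"]).dictionary_encode()
indices = arr_dict.indices  # integer code per row
for idx in range(len(arr_dict.dictionary)):
    # Boolean mask selecting the rows belonging to partition `idx`.
    mask = pc.equal(pa.scalar(idx), indices)
    print(native.filter(mask))
```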
Left a comment in `selectors` about this issue earlier
Comment on lines +198 to +201
```python
for idx in range(len(arr_dict.dictionary)):
    # NOTE: Acero filter doesn't support `null_selection_behavior="emit_null"`
    # Is there any reasonable way to do this in Acero?
    yield native.filter(pc.equal(pa.scalar(idx), indices))
```
Is this for use in `over(partition_by=...)`? If so, just as a heads up, we won't be able to accept a solution which involves looping over partitions in Python.
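For context, one loop-free shape this could take (my sketch, not what the PR does): compute the aggregate once per key with `Table.group_by` and broadcast it back with a join, keeping all per-partition work inside Arrow.

```python
import pyarrow as pa

tbl = pa.table({"key": ["a", "b", "a", "b"], "value": [1, 2, 3, 4]})
# Aggregate once per partition key...
agg = tbl.group_by("key").aggregate([("value", "sum")])  # columns: key, value_sum
# ...then join back so every row carries its partition's aggregate.
print(tbl.join(agg, keys="key"))
```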
Related issues
- Expr IR #2572
- `order_by`, `hashjoin`, `DataFrame.{filter,join}`, `Expr.is_{first,last}_distinct` #3173

Notes
- `pc.unique` preserving order

Tasks
- `by_dtype` naive hash
- `by_dtype` handles bare parametric types
- `DataFrame.partition_by` (`pyarrow`) to support more kinds of `over(*partition_by)`
  - the requirements of `over(*partition_by)` are a superset of what's needed for it
  - `over` is reduced to adapting vector functions to operate on those partitions
- `DataFrame.partition_by`
- `Group_by.__iter__`
- `over(*partition_by)` (non-aggregating)
- `union` or `pyarrow.concat_tables`
- `GroupBy.agg(<BinaryExpr>)`
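On the `union` / `pyarrow.concat_tables` item: a trivial sketch (made-up data) of recombining per-partition results, which is the step that follows applying a vector function to each partition.

```python
import pyarrow as pa

parts = [
    pa.table({"key": ["a", "a"], "value": [1, 3]}),
    pa.table({"key": ["b", "b"], "value": [2, 4]}),
]
# Stitch the per-partition results back into a single table.
print(pa.concat_tables(parts))
```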