-
Notifications
You must be signed in to change notification settings - Fork 21
Add topk #374
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Draft
dcherian
wants to merge
69
commits into
main
Choose a base branch
from
topk
base: main
Could not load branches
Branch not found: {{ refName }}
Loading
Could not load tags
Nothing to show
Loading
Are you sure you want to change the base?
Some commits from the old base branch may be removed from the timeline,
and old review comments may become outdated.
Conversation
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
for more information, see https://pre-commit.ci
* main: Revert "Support first, last with datetime, timedelta (#402)" (#404) Support first, last with datetime, timedelta (#402) Bump codecov/codecov-action from 4.5.0 to 4.6.0 (#401) Bump mamba-org/setup-micromamba from 1 to 2 (#400) Revert "[revert] test with Xarray PR branch" (#393) [pre-commit.ci] pre-commit autoupdate (#399) Faster subsetting for cohorts (#397) Fix default int on windows, numpy<2 (#395) Avoid rechunking when preferred_method="blockwise" (#394) Preserve dtype better when specified. (#389) Drop python 3.9, use ruff (#392) silence warning (#390) Expand groupby_reduce property tests (#385) Fix bug with NaNs in `by` and method='blockwise' (#384) Avoid explicit np.nan, np.inf (#383)
* main: Optimize quantile. (#409)
dcherian
commented
Jan 8, 2025
Collaborator
Author
|
Note to myself: this needs some harmonizing of |
This reverts commit 8319f7f.
For groups with fewer than k elements, clamp extraction indices to stay within group boundaries instead of filling entire group with NaN. This allows returning partial results (e.g., [NaN, val1, val2] for a group with 2 values when k=3). Changes: - Compute group start/end positions from offset - Clamp lo_ indices to [group_start, group_end) range - Mark positions outside valid range with badmask - Extract all indices in lo_, not just virtual_index 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <[email protected]>
Add nantopk function that replaces NaN with -inf/+inf before calling topk. This is used as the combine function for topk aggregations to ensure NaN values don't interfere with selecting top k elements across chunks. 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <[email protected]>
Changes to topk aggregation: - Use nantopk as combine function instead of topk - Add _topk_finalize to convert -inf fill values back to NaN - Remove min_count enforcement (groups with <k elements return partial results instead of being entirely masked) This enables topk to work correctly in dask's map-reduce pathway where intermediate results from chunks need to be combined. 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <[email protected]>
Aggregations with new_dims_func (quantile, topk) add an extra dimension to intermediate results. These must use _simple_combine which reduces along DUMMY_AXIS, not _grouped_combine which reduces along the groups axis. This ensures topk works correctly even when reindex=False and labels_are_unknown=True. 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <[email protected]>
topk with reindex=False has issues with group alignment during the combine step in map-reduce. Disable this combination and raise a clear error message. 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <[email protected]>
- Add test_topk_with_nan: Verifies NaNs are excluded for both k > 0 and k < 0 - Add test_topk_fewer_than_k_elements: Tests NaN padding when group size < |k| - Add test_topk_all_nan_group: Tests all-NaN groups return all NaN results - Add test_topk_fill_value_correctness: Tests missing groups get NaN fill value - Remove resolved FIXME comment about fill_value (already handled in _initialize_aggregation) 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <[email protected]>
- Add flox/_version.py to .gitignore (auto-generated by setuptools_scm) - Add temporary files to .gitignore: *.rej, *.py.rej, Untitled.ipynb - Add development/profiling files: mutmut-cache, profile.json, profile.html, mydask.png, test.png, uv.lock, devel/ - Remove all these files from git tracking 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <[email protected]>
🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <[email protected]>
Resolved conflicts by:
- Taking main's code organization (modularized into separate files)
- Adding topk support to the new structure:
- Added topk_new_dims_func import to core.py
- Added "topk" to _is_bool_supported_reduction in lib.py
- Added "topk" to _choose_engine to force flox engine
- Added topk validation for finalize_kwargs["k"]
- Added topk to groupby_reduce docstring
- Added Notes about topk limitations
- Added reindex=False check for topk
- Updated chunk_reduce to handle topk new_dims
- Updated dask.py to handle topk:
- Modified _expand_dims to skip first intermediate for topk
- Added must_use_simple_combine logic for new_dims_func
- Added topk kwargs passing to chunk_reduce
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <[email protected]>
Add type ignore comment for normalize_axis_tuple unpacking. Mypy incorrectly infers the tuple length, but we know it's always a 1-tuple since axis is a single integer in this context. 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <[email protected]>
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
This is pretty ugly:
topkandnantopkyetkkwarg, passed infinalize_kwargsin many places. This isn't an issue for quantile, because that only works blockwise