Skip to content

Conversation

@dcherian
Copy link
Collaborator

@dcherian dcherian commented Jul 28, 2024

This is pretty ugly:

  1. Doesn't handle topk and nantopk yet
  2. Unlike quantile, topk can work as a tree-reduce algorithm. So we have to hack the k kwarg, passed in finalize_kwargs in many places. This isn't an issue for quantile, because that only works blockwise

pre-commit-ci bot and others added 4 commits July 28, 2024 20:16
* main:
  Fix first, last again (#381)
  Fix docs build (#382)
  Optimize for-loop merging of cohorts. (#378)
  Add cohorts snapshot tests with syrupy (#379)
  Stricter tolerance in property tests (#377)
  Better gridlines
  Update containment image with gridlines, higher dpi
@dcherian dcherian mentioned this pull request Aug 7, 2024
1 task
dcherian and others added 15 commits January 6, 2025 17:50
* main:
  Revert "Support first, last with datetime, timedelta (#402)" (#404)
  Support first, last with datetime, timedelta (#402)
  Bump codecov/codecov-action from 4.5.0 to 4.6.0 (#401)
  Bump mamba-org/setup-micromamba from 1 to 2 (#400)
  Revert "[revert] test with Xarray PR branch" (#393)
  [pre-commit.ci] pre-commit autoupdate (#399)
  Faster subsetting for cohorts (#397)
  Fix default int on windows, numpy<2 (#395)
  Avoid rechunking when preferred_method="blockwise" (#394)
  Preserve dtype better when specified. (#389)
  Drop python 3.9, use ruff (#392)
  silence warning (#390)
  Expand groupby_reduce property tests (#385)
  Fix bug with NaNs in `by` and method='blockwise' (#384)
  Avoid explicit np.nan, np.inf (#383)
* main:
  Optimize quantile. (#409)
@dcherian dcherian marked this pull request as draft January 24, 2025 17:49
@dcherian
Copy link
Collaborator Author

Note to myself: this needs some harmonizing of _DUMMY_AXIS is -2 and the fact that the k-axis is 0

dcherian and others added 20 commits March 17, 2025 19:15
* main:
  Fix upstream-dev tests (#421)
  Bump codecov/codecov-action from 5.1.2 to 5.3.1 (#420)
  optimize cohorts yet again (#419)
This reverts commit 8319f7f.
For groups with fewer than k elements, clamp extraction indices to
stay within group boundaries instead of filling entire group with NaN.
This allows returning partial results (e.g., [NaN, val1, val2] for a
group with 2 values when k=3).

Changes:
- Compute group start/end positions from offset
- Clamp lo_ indices to [group_start, group_end) range
- Mark positions outside valid range with badmask
- Extract all indices in lo_, not just virtual_index

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <[email protected]>
Add nantopk function that replaces NaN with -inf/+inf before calling
topk. This is used as the combine function for topk aggregations to
ensure NaN values don't interfere with selecting top k elements across
chunks.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <[email protected]>
Changes to topk aggregation:
- Use nantopk as combine function instead of topk
- Add _topk_finalize to convert -inf fill values back to NaN
- Remove min_count enforcement (groups with <k elements return partial
  results instead of being entirely masked)

This enables topk to work correctly in dask's map-reduce pathway where
intermediate results from chunks need to be combined.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <[email protected]>
Aggregations with new_dims_func (quantile, topk) add an extra dimension
to intermediate results. These must use _simple_combine which reduces
along DUMMY_AXIS, not _grouped_combine which reduces along the groups
axis.

This ensures topk works correctly even when reindex=False and
labels_are_unknown=True.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <[email protected]>
topk with reindex=False has issues with group alignment during the
combine step in map-reduce. Disable this combination and raise a clear
error message.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <[email protected]>
- Add test_topk_with_nan: Verifies NaNs are excluded for both k > 0 and k < 0
- Add test_topk_fewer_than_k_elements: Tests NaN padding when group size < |k|
- Add test_topk_all_nan_group: Tests all-NaN groups return all NaN results
- Add test_topk_fill_value_correctness: Tests missing groups get NaN fill value
- Remove resolved FIXME comment about fill_value (already handled in _initialize_aggregation)

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <[email protected]>
- Add flox/_version.py to .gitignore (auto-generated by setuptools_scm)
- Add temporary files to .gitignore: *.rej, *.py.rej, Untitled.ipynb
- Add development/profiling files: mutmut-cache, profile.json, profile.html, mydask.png, test.png, uv.lock, devel/
- Remove all these files from git tracking

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <[email protected]>
🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <[email protected]>
Resolved conflicts by:
- Taking main's code organization (modularized into separate files)
- Adding topk support to the new structure:
  - Added topk_new_dims_func import to core.py
  - Added "topk" to _is_bool_supported_reduction in lib.py
  - Added "topk" to _choose_engine to force flox engine
  - Added topk validation for finalize_kwargs["k"]
  - Added topk to groupby_reduce docstring
  - Added Notes about topk limitations
  - Added reindex=False check for topk
  - Updated chunk_reduce to handle topk new_dims
  - Updated dask.py to handle topk:
    - Modified _expand_dims to skip first intermediate for topk
    - Added must_use_simple_combine logic for new_dims_func
    - Added topk kwargs passing to chunk_reduce

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <[email protected]>
Add type ignore comment for normalize_axis_tuple unpacking.
Mypy incorrectly infers the tuple length, but we know it's always
a 1-tuple since axis is a single integer in this context.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants