@Illviljan
Contributor

@Illviljan Illviljan commented Dec 6, 2025

The non-flox version reduces chunksizes significantly:

import xarray as xr

x = xr.DataArray([1, 1, 1, 1, 1], name="x").chunk()
grp_idx = xr.DataArray([-1, 0, 0, -1, 1])
with xr.set_options(use_flox=False):
    print(x.groupby(grp_idx).cumsum())
<xarray.DataArray 'x' (dim_0: 5)> Size: 40B
dask.array<getitem, shape=(5,), dtype=int64, chunksize=(2,), chunktype=numpy.ndarray>
Dimensions without coordinates: dim_0

With flox the chunksize is retained:

x = xr.DataArray([1, 1, 1, 1, 1], name="x").chunk()
grp_idx = xr.DataArray([-1, 0, 0, -1, 1])
with xr.set_options(use_flox=True):
    print(x.groupby(grp_idx).cumsum())
<xarray.DataArray 'x' (dim_0: 5)> Size: 40B
dask.array<_finalize_scan, shape=(5,), dtype=int64, chunksize=(5,), chunktype=numpy.ndarray>
Dimensions without coordinates: dim_0
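
For reference, here is a minimal pure-Python sketch of the values a grouped cumulative sum should produce for the example above, whichever backend computes it. The helper name `grouped_cumsum` is hypothetical (it is not an xarray or flox function), and the sketch assumes -1 is treated as an ordinary group label:

```python
from collections import defaultdict


def grouped_cumsum(values, labels):
    """Running sum of `values`, kept separately for each label.

    Hypothetical helper for illustration only; not part of xarray or flox.
    """
    running = defaultdict(int)  # one running total per group label
    out = []
    for value, label in zip(values, labels):
        running[label] += value
        out.append(running[label])
    return out


x = [1, 1, 1, 1, 1]
grp_idx = [-1, 0, 0, -1, 1]  # assumption: -1 is just another group label
result = grouped_cumsum(x, grp_idx)
print(result)  # [1, 1, 2, 2, 1]
```

The result keeps the original element order, which is why a scan (unlike a reduction) can in principle preserve the input chunking.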

Other changes:

  • Changes DataArray.cumsum/Dataset.cumsum/DataTree.cumsum/DataArray.groupby.cumsum/Dataset.groupby.cumsum etc.
  • Coordinates are now retained

Notes
groupby_scan was added in: https://github.com/xarray-contrib/flox/releases/tag/v0.9.9
cumsum was added in: https://github.com/xarray-contrib/flox/releases/tag/v0.10.5

Co-authored-by: Deepak Cherian <[email protected]>
@Illviljan
Contributor Author

@dcherian, this is ready for another review now. Only the tests have changed since the last review.

@dcherian
Contributor

dcherian commented Jan 8, 2026

Are you able to address the extra testing requested in #10987 (comment)?

If you're too busy, we can just merge. This is a good improvement.

@Illviljan Illviljan added the run-benchmark Run the ASV benchmark workflow label Jan 10, 2026
@Illviljan
Contributor Author

Writing down these flox issues before I forget:

import numpy as np
import xarray as xr

ds = xr.Dataset(
    {
        "foo": (
            ("test", "time"),
            [[7, 2, 0, 1, 2, np.nan], [1, 1, 1, 1, 1, 1], [2, 2, 2, 2, 2, 2]],
        )
    },
    coords={
        "time": [0, 1 / 6, 2 / 6, 3 / 6, 4 / 6, 5 / 6],
        "test": ["a", "b", "b"],
        "group_idx": ("time", [0, 0, 1, 1, 2, 2]),
        "group_idx2": ("time", [0, 1, 1, 1, 1, 1]),
    },
)

# Grouping by group_idx (defined along "time") while taking the cumsum along a different dim ("test") fails with flox:
ds.groupby("group_idx").cumsum("test")

# cumsum along multiple dims fails with flox:
ds.groupby("group_idx").cumsum(...)
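
To reason about what the first failing case should return, here is a pure-Python sketch. It assumes the grouped cumsum should equal "split the columns by group, cumsum each block along the other axis, and put the columns back"; that is an assumption about the intended semantics, not a description of what flox or xarray does internally. The helpers `cumsum_rows` and `grouped_cumsum_rows` are hypothetical, and the NaN in `foo` is replaced by 0 to keep the sketch simple:

```python
from itertools import accumulate


def cumsum_rows(rows):
    """Cumulative sum down the first axis ("test") of a 2-D list of lists."""
    summed_columns = [list(accumulate(col)) for col in zip(*rows)]
    return [list(row) for row in zip(*summed_columns)]


def grouped_cumsum_rows(rows, labels):
    """Split columns by label, cumsum each block down the rows, reassemble."""
    out = [[None] * len(labels) for _ in rows]
    for label in set(labels):
        cols = [j for j, lab in enumerate(labels) if lab == label]
        block = [[row[j] for j in cols] for row in rows]
        for i, summed_row in enumerate(cumsum_rows(block)):
            for j, value in zip(cols, summed_row):
                out[i][j] = value
    return out


# Same layout as ds["foo"]: rows are "test", columns are "time".
foo = [[7, 2, 0, 1, 2, 0], [1, 1, 1, 1, 1, 1], [2, 2, 2, 2, 2, 2]]
group_idx = [0, 0, 1, 1, 2, 2]  # labels along "time"

grouped = grouped_cumsum_rows(foo, group_idx)
expected = cumsum_rows(foo)  # plain cumsum along "test", ignoring groups
print(grouped == expected)
```

Under that assumption the groups only partition the columns, so the grouped result coincides with an ungrouped cumsum along "test", which could serve as a reference value in a test for this case.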


Labels

  • run-benchmark: Run the ASV benchmark workflow
  • topic-DataTree: Related to the implementation of a DataTree class
  • topic-groupby
  • topic-performance

Development

Successfully merging this pull request may close these issues.

cumsum drops index coordinates

2 participants