See our [previous blog post](https://xarray.dev/blog/flox) for more.

Two key realizations influenced the development of flox:

1. Array workloads frequently group by a relatively small in-memory array. Quite frequently those arrays have patterns to their values, e.g. `"time.month"` is exactly periodic, `"time.dayofyear"` is approximately periodic (depending on calendar), and `"time.year"` is commonly a monotonically increasing array.
2. Chunk sizes (or "partition sizes") for arrays can be quite small along the core dimension of an operation. This is an important difference between arrays and dataframes!

These two properties are particularly relevant for "climatology" calculations (e.g. `groupby("time.month").mean()`), a common Xarray workload.
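
For intuition, here is a tiny illustration of that structure using plain pandas (purely illustrative, not flox code):

```python
import pandas as pd

# Ten years of monthly timestamps: the "month" labels tile with period 12,
# while the "year" labels increase monotonically.
time = pd.date_range("2000-01-01", freq="MS", periods=120)
print(time.month.values[:24])  # 1..12, 1..12: exactly periodic with period 12
print(time.year.values[:14])   # 2000 twelve times, then 2001: monotonically increasing
```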

## Tree reductions can be catastrophically bad

For a catastrophic example, consider `ds.groupby("time.year").mean()`, or the equivalent `ds.resample(time="Y").mean()`, for a 100-year-long dataset of monthly averages with a chunk size of **1** (or **4**) along the time dimension.
This is a fairly common format for climate model output.
The small chunk size along time is offset by much larger chunk sizes along the other dimensions, commonly horizontal space (`x, y` or `latitude, longitude`).
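
As a concrete setup, such a dataset might look like the following sketch (the variable name and array sizes are invented for illustration):

```python
import dask.array
import pandas as pd
import xarray as xr

# 100 years of monthly means, chunk size 1 along time, large chunks in space.
time = pd.date_range("1900-01-01", freq="MS", periods=100 * 12)
ds = xr.Dataset(
    {
        "tas": (
            ("time", "lat", "lon"),
            dask.array.random.random((time.size, 180, 360), chunks=(1, 180, 360)),
        )
    },
    coords={"time": time},
)

# With a plain tree reduction (no flox tricks), all 100 yearly means
# would be funneled into a single small output chunk.
ds.groupby("time.year").mean()
```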

A naive tree reduction would accumulate all averaged values into a single output chunk of size 100.
Depending on the chunking of the input dataset, this may overload the worker memory and fail catastrophically.
More importantly, there is a lot of wasteful communication: computing on the last year of data is completely independent of computing on the first year of the data, and there is no reason the two values need to reside in the same output chunk.

## Avoiding catastrophe

Thus `flox` quickly grew two new modes of computing the groupby reduction.

First, `method="blockwise"`, which applies the grouped-reduction in a blockwise fashion.
This is great for `resample(time="Y").mean()` where we group by `"time.year"`, which is a monotonically increasing array.
With an appropriate (and usually quite cheap) rechunking, the problem is embarrassingly parallel.
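
Here is a minimal sketch of that mode using flox's core `groupby_reduce` directly (the array, sizes, and labels are made up, and keyword details may vary between flox versions):

```python
import dask.array
import numpy as np
import flox

# 100 years of monthly data with time as the last axis, rechunked so that each
# year (i.e. each output group) lives in exactly one chunk of 12 timesteps.
data = dask.array.random.random((180, 360, 100 * 12), chunks=(180, 360, 12))
years = np.repeat(np.arange(1900, 2000), 12)  # one label per timestep

# Every group is confined to a single chunk, so the reduction can be applied
# blockwise with no cross-chunk communication.
result, groups = flox.groupby_reduce(data, years, func="mean", method="blockwise")
```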

I use set _containment_, or a "normalized intersection", to determine the similarity between the sets of chunks occupied by two different groups (`Q` and `X`).

```
C = |Q ∩ X| / |Q| ≤ 1    (∩ is set intersection)
```

Unlike Jaccard similarity, _containment_ [isn't skewed](http://ekzhu.com/datasketch/lshensemble.html) when one of the sets is much larger than the other.
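
A toy comparison makes the difference clear (the sets below are chosen purely for illustration):

```python
# Q: chunks occupied by a small group; X: chunks occupied by a much larger group.
Q = {1, 2}
X = set(range(100))

jaccard = len(Q & X) / len(Q | X)      # 0.02, dragged down by the size imbalance
containment = len(Q & X) / len(Q)      # 1.0, Q is fully contained in X
print(jaccard, containment)
```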

The steps are as follows:
1. Use `"blockwise"` when every group is contained to one block each.
110111
1. Use `"cohorts"` when every chunk only has a single group, but that group might extend across multiple chunks
111112
1. [and more](https://github.com/xarray-contrib/flox/blob/e6159a657c55fa4aeb31bcbcecb341a4849da9fe/flox/core.py#L408-L426)

1. At this point, we want to merge groups into cohorts when they occupy _approximately_ the same chunks. For each group `i` we can quickly compute containment against all other groups `j` as `C = S.T @ S / number_chunks_per_group` (see the sketch after this list).

1. To choose between `"map-reduce"` and `"cohorts"`, we need a summary measure of the degree to which the labels overlap with each other. We use _sparsity_, the number of non-zero elements in `C` divided by the number of elements in `C`, `C.nnz/C.size`. When sparsity is relatively high, we use `"map-reduce"`; otherwise we use `"cohorts"`.
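
Here is a rough sketch of that computation. It is a schematic reconstruction with made-up sizes, not flox's actual implementation (which works with sparse matrices):

```python
import numpy as np

# 30 years of monthly labels (12 periodic groups), chunk size 4 along time.
labels = np.tile(np.arange(12), 30)
chunks = np.split(labels, np.arange(4, labels.size, 4))

# Bitmask S with S[k, g] = 1 if group g appears in chunk k.
S = np.zeros((len(chunks), 12), dtype=np.int64)
for k, chunk_labels in enumerate(chunks):
    S[k, np.unique(chunk_labels)] = 1

# Containment C[i, j] = |Q_i ∩ Q_j| / |Q_i|, computed for all pairs at once.
number_chunks_per_group = S.sum(axis=0)
C = (S.T @ S) / number_chunks_per_group[:, None]

# Sparsity of C is the summary measure used to pick a strategy.
print(np.count_nonzero(C) / C.size)  # 1/3 here: three non-overlapping cohorts
```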

For more detail, [see the docs](https://flox.readthedocs.io/en/latest/implementation.html#heuristics).

Here is `C` for a range of chunk sizes from 1 to 12, for computing `groupby("time.month")` of a monthly mean dataset; the title on each image is `(chunk size, sparsity)`.
![flox sparsity image](https://flox.readthedocs.io/en/latest/_images/containment.png)

flox will choose:
1. `"blockwise"` for chunk size 1,
2. `"cohorts"` for (2, 3, 4, 6, 12),
3. and `"map-reduce"` for the rest.

Cool, isn't it?!

## What's next?

flox's ability to do cool inferences entirely relies on the input chunking.
Perfect optimization still requires some user-tuned chunking.
A recent Xarray feature makes that a lot easier for time grouping:

```python
from xarray.groupers import TimeResampler

rechunked = ds.chunk(time=TimeResampler("YE"))
```
will rechunk so that a year of data is in a single chunk.
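
Continuing that sketch, the rechunked dataset can then be reduced; assuming a recent pandas/xarray where `"YE"` is the year-end frequency alias, and with flox installed, this grouping can now proceed blockwise:

```python
# Hypothetical continuation of the snippet above.
yearly_means = rechunked.resample(time="YE").mean()
```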

Even so, it would be nice to automatically rechunk to minimize the number of cohorts detected, or to reach a perfectly blockwise application.
A key limitation is that we have lost _context_.
The string `"time.month"` tells me that I am grouping a perfectly periodic array with period 12; similarly
the _string_ `"time.dayofyear"` tells me that I am grouping by a (quasi-)periodic array with period 365, and that group `366` may occur occasionally (depending on calendar).
This context is hard to infer from integer group labels `[1, 2, 3, 4, 5, ..., 1, 2, 3, 4, 5]`.
_[Get in touch](https://github.com/xarray-contrib/flox/issues) if you have ideas for how to do this inference!_

One way to preserve context may be to use Xarray's new Grouper objects, and let them report ["preferred chunks"](https://github.com/pydata/xarray/blob/main/design_notes/grouper_objects.md#the-preferred_chunks-method-) for a particular grouping.
This would allow a downstream system like `flox` or `dask-expr` to take this into account later (or even earlier!) in the pipeline.
