You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
Copy file name to clipboardExpand all lines: src/posts/flox-smart/index.md
+3-3Lines changed: 3 additions & 3 deletions
Display the source diff
Display the rich diff
Original file line number
Diff line number
Diff line change
@@ -16,7 +16,7 @@ summary: 'flox adds heuristics for automatically choosing an appropriate strateg
16
16
17
17
[`flox` implements](https://flox.readthedocs.io/) grouped reductions for chunked array types like [cubed](https://cubed-dev.github.io/cubed/) and [dask](https://docs.dask.org/en/stable/array.html) using tree reductions.
18
18
Tree reductions ([example](https://people.csail.mit.edu/xchen/gpu-programming/Lecture11-reduction.pdf)) are a parallel-friendly way of computing common reduction operations like `sum`, `mean` etc.
19
-
Briefly, one computes the reduction for a subset of the array $N$ chunks at a time in parallel, then combines those results together again $N$ chunks at a time, until we have the final result.
19
+
Briefly, one computes the reduction for a subset of the array N chunks at a time in parallel, then combines those results together again N chunks at a time, until we have the final result.
20
20
21
21
Without flox, Xarray effectively shuffles — sorts the data to extract all values in a single group — and then runs the reduction group-by-group.
22
22
Depending on data layout or "chunking" this shuffle can be quite expensive.
@@ -57,12 +57,12 @@ Second, `method="cohorts"` which is a bit more subtle.
57
57
Consider `groupby("time.month")` for the monthly mean dataset i.e. grouping by an exactly periodic array.
58
58
When the chunk size along the core dimension "time" is a divisor of the period; so either 1, 2, 3, 4, or 6 in this case; groups tend to occur in cohorts ("groups of groups").
59
59
For example, with a chunk size of 4, monthly mean input data for the "cohort" Jan/Feb/Mar/Apr are _always_ in the same chunk, and totally separate from any of the other months.
60
-
Here is a schematic illustration where each month is represented by a different shade of red:
60
+
Here is a schematic illustration where each month is represented by a different shade of red and a single chunk contains 4 months:
This means that we can run the tree reduction for each cohort (three cohorts in total: `JFMA | MJJA | SOND`) independently and expose more parallelism.
63
63
Doing so can significantly reduce compute times and in particular memory required for the computation.
64
64
65
-
If there isn't much separation of groups into cohorts, like when groups are randomly distributed across chunks, then it's hard to do better than the standard `method="map-reduce"`.
65
+
Finally, if there isn't much separation of groups into cohorts, like when groups are randomly distributed across chunks, then it's hard to do better than the standard `method="map-reduce"`.
66
66
67
67
## Choosing a strategy is hard, and harder to teach.
0 commit comments