
Commit 9aa6e88

docs: Add introduction to flox (#206)

* Copy styling from cf-xarray
* Add sphinx-codeautolink
* Add intro docs
* Add codespell
* Add histogram
* cache executed notebooks
* Fix
* IntervalIndex over isbin
* [pre-commit.ci] auto fixes from pre-commit.com hooks; for more information, see https://pre-commit.ci

Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
1 parent ed51c19 commit 9aa6e88

9 files changed: 237 additions, 11 deletions

.pre-commit-config.yaml (8 additions, 0 deletions)

```diff
@@ -46,6 +46,14 @@ repos:
     hooks:
       - id: nbstripout
         args: [--extra-keys=metadata.kernelspec metadata.language_info.version]
+
+  - repo: https://github.com/codespell-project/codespell
+    rev: v2.2.2
+    hooks:
+      - id: codespell
+        additional_dependencies:
+          - tomli
+
   - repo: https://github.com/asottile/pyupgrade
     rev: v3.3.1
     hooks:
```
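With this hook in place, `pre-commit run codespell --all-files` should spell-check the repository; the `tomli` dependency lets codespell read its settings from the `[tool.codespell]` table added in `pyproject.toml` below.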

ci/docs.yml (1 addition, 0 deletions)

```diff
@@ -16,5 +16,6 @@ dependencies:
   - furo
   - ipykernel
   - jupyter
+  - sphinx-codeautolink
   - pip:
     - git+https://github.com/xarray-contrib/flox
```

docs/source/conf.py (26 additions, 1 deletion)

```diff
@@ -39,8 +39,11 @@
     "numpydoc",
     "sphinx.ext.napoleon",
     "myst_nb",
+    "sphinx_codeautolink",
 ]

+codeautolink_concat_default = True
+
 extlinks = {
     "issue": ("https://github.com/xarray-contrib/flox/issues/%s", "GH#%s"),
     "pr": ("https://github.com/xarray-contrib/flox/pull/%s", "PR#%s"),
@@ -60,6 +63,7 @@
 # Myst_nb options
 nb_execution_excludepatterns = ["climatology-hourly.ipynb"]
 nb_execution_raise_on_error = True
+nb_execution_mode = "cache"

 # The version info for the project you're documenting, acts as replacement for
 # |version| and |release|, also used in various other places throughout the
@@ -94,13 +98,34 @@
 # show_authors = False

 # The name of the Pygments (syntax highlighting) style to use.
-pygments_style = "sphinx"
+pygments_style = "igor"


 # -- Options for HTML output ---------------------------------------------------

 html_theme = "furo"

+# Theme options are theme-specific and customize the look and feel of a theme
+# further. For a list of options available for each theme, see the
+# documentation.
+css_vars = {
+    "admonition-font-size": "0.9rem",
+    "font-size--small": "92%",
+    "font-size--small--2": "87.5%",
+}
+html_theme_options = dict(
+    sidebar_hide_name=True,
+    light_css_variables=css_vars,
+    dark_css_variables=css_vars,
+)
+
+html_context = {
+    "github_user": "xarray-contrib",
+    "github_repo": "flox",
+    "github_version": "main",
+    "doc_path": "doc",
+}
+
 # Theme options are theme-specific and customize the look and feel of a theme
 # further. For a list of options available for each theme, see the
 # documentation.
```
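Of the new settings: `nb_execution_mode = "cache"` has myst-nb cache executed notebook outputs (via jupyter-cache) and re-execute a page only when its code content changes, so repeat doc builds stay fast; `codeautolink_concat_default = True` tells sphinx-codeautolink to treat the code blocks on a page as one continuous session when resolving links.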

docs/source/implementation.md (5 additions, 5 deletions)

````diff
@@ -136,7 +136,7 @@ width: 100%
 1. Currently the rechunking is only implemented for 1D arrays (being motivated by time resampling),
    but a nD generalization seems possible.
 1. Only can use the `blockwise` strategy for grouping by `nD` arrays.
-1. Works better when multiple groups are already in a single block; so that the intial
+1. Works better when multiple groups are already in a single block; so that the initial
    rechunking only involves a small amount of communication.

 (method-cohorts)=
@@ -198,8 +198,8 @@ width: 100%

 1. Group labels must be known at graph construction time, so this only works for numpy arrays.
 1. This does require more tasks and a more complicated graph, but the communication overhead can be significantly lower.
-1. The detection of "cohorts" is currrently slow but could be improved.
-1. The extra effort of detecting cohorts and mutiple copying of intermediate blocks may be worthwhile only if the chunk sizes are small
+1. The detection of "cohorts" is currently slow but could be improved.
+1. The extra effort of detecting cohorts and multiple copying of intermediate blocks may be worthwhile only if the chunk sizes are small
    relative to the approximate period of group labels, or small relative to the size of spatially localized groups.

 ### Example : sensitivity to chunking
@@ -211,15 +211,15 @@ Consider our earlier example, `groupby("time.month")` with monthly frequency dat
 `flox` can find these cohorts, below it identifies the cohorts with labels `1,2,3,4`; `5,6,7,8`, and `9,10,11,12`.

 ```python
->>> flox.find_group_cohorts(labels, array.chunks[-1]))
+>>> flox.find_group_cohorts(labels, array.chunks[-1]).values()
 [[[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12]] # 3 cohorts
 ```

 Now consider `chunksize=5`.
 ![cohorts-schematic](/../diagrams/cohorts-month-chunk5.png)

 ```python
->>> flox.core.find_group_cohorts(labels, array.chunks[-1]))
+>>> flox.core.find_group_cohorts(labels, array.chunks[-1]).values()
 [[1], [2, 3], [4, 5], [6], [7, 8], [9, 10], [11], [12]] # 8 cohorts
 ```

````
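For a concrete feel, here is a small self-contained sketch of the example above; `find_group_cohorts` is the function documented in this repo, while the label and chunk values are illustrative:

```python
import dask.array as da
import numpy as np

import flox

# Two years of monthly labels, four labels per chunk, so each block
# holds exactly one of the cohorts {1-4}, {5-8}, {9-12}
labels = np.tile(np.arange(1, 13), 2)
array = da.ones((24,), chunks=4)

# As in the snippet above, .values() lists the detected cohorts
cohorts = flox.core.find_group_cohorts(labels, array.chunks[-1])

# Opt in to the cohorts strategy: each cohort's blocks are reduced separately
result, groups = flox.groupby_reduce(array, labels, func="sum", method="cohorts")
```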

docs/source/index.md (1 addition, 0 deletions)

```diff
@@ -62,6 +62,7 @@ It was motivated by many discussions in the [Pangeo](https://pangeo.io) communit
 .. toctree::
    :maxdepth: 1

+   intro.md
    aggregations.md
    engines.md
    arrays.md
```

docs/source/intro.md (new file, 186 additions)

---
jupytext:
  text_representation:
    format_name: myst
kernelspec:
  display_name: Python 3
  name: python3
---

```{eval-rst}
.. currentmodule:: flox
```

# 10 minutes to flox

## GroupBy single variable

```{code-cell}
import numpy as np
import xarray as xr

from flox.xarray import xarray_reduce

labels = xr.DataArray(
    [1, 2, 3, 1, 2, 3, 0, 0, 0],
    dims="x",
    name="label",
)
labels
```

### With numpy

```{code-cell}
da = xr.DataArray(
    np.ones((9,)), dims="x", name="array"
)
```

Apply the reduction using {py:func}`flox.xarray.xarray_reduce`, specifying the reduction operation in `func`:

```{code-cell}
xarray_reduce(da, labels, func="sum")
```

### With dask

Let's first chunk `da` and `labels`:

```{code-cell}
da_chunked = da.chunk(x=2)
labels_chunked = labels.chunk(x=3)
```

Grouping a dask array by a numpy array is unchanged:

```{code-cell}
xarray_reduce(da_chunked, labels, func="sum")
```

When grouping **by** a dask array, we need to specify the "expected group labels" on the output so we can construct the result DataArray.
Without the `expected_groups` kwarg, an error is raised:

```{code-cell}
---
tags: [raises-exception]
---
xarray_reduce(da_chunked, labels_chunked, func="sum")
```

Now we specify `expected_groups`:

```{code-cell}
dask_result = xarray_reduce(
    da_chunked, labels_chunked, func="sum", expected_groups=[0, 1, 2, 3],
)
dask_result
```

Note that any group labels not present in `expected_groups` will be ignored.
You can also provide `expected_groups` for the pure numpy GroupBy:

```{code-cell}
numpy_result = xarray_reduce(
    da, labels, func="sum", expected_groups=[0, 1, 2, 3],
)
numpy_result
```

The two are identical:

```{code-cell}
numpy_result.identical(dask_result)
```
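For instance, leaving label `3` out of `expected_groups` drops that group from the result entirely; a minimal sketch reusing `da` and `labels` from above:

```python
# Only labels 0, 1, 2 appear in the output; values where label == 3 are ignored
xarray_reduce(da, labels, func="sum", expected_groups=[0, 1, 2])
```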
## Binning by a single variable

For binning, specify the bin edges in `expected_groups` using {py:class}`pandas.IntervalIndex`:

```{code-cell}
import pandas as pd

xarray_reduce(
    da,
    labels,
    func="sum",
    expected_groups=pd.IntervalIndex.from_breaks([0.5, 1.5, 2.5, 6]),
)
```

Similarly for dask inputs:

```{code-cell}
xarray_reduce(
    da_chunked,
    labels_chunked,
    func="sum",
    expected_groups=pd.IntervalIndex.from_breaks([0.5, 1.5, 2.5, 6]),
)
```

For more control over the binning (which edge is closed), pass the appropriate kwarg to {py:class}`pandas.IntervalIndex`:

```{code-cell}
xarray_reduce(
    da_chunked,
    labels_chunked,
    func="sum",
    expected_groups=pd.IntervalIndex.from_breaks([0.5, 1.5, 2.5, 6], closed="left"),
)
```
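Since bin membership follows pandas conventions, `pandas.cut` with the same `IntervalIndex` is a handy way to sanity-check which bin each label should land in; a sketch (this mirrors, rather than calls, flox's internal binning):

```python
bins = pd.IntervalIndex.from_breaks([0.5, 1.5, 2.5, 6], closed="left")
# NaN marks values falling outside every bin; such points drop out of the groupby
pd.cut(labels.data, bins=bins)
```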
## Grouping by multiple variables

```{code-cell}
arr = np.ones((4, 12))
labels1 = np.array(["a", "a", "c", "c", "c", "b", "b", "c", "c", "b", "b", "f"])
labels2 = np.array([1, 2, 2, 1])

da = xr.DataArray(
    arr, dims=("x", "y"), coords={"labels2": ("x", labels2), "labels1": ("y", labels1)}
)
da
```

To group by multiple variables, simply pass them as `*args`:

```{code-cell}
xarray_reduce(da, "labels1", "labels2", func="sum")
```

## Histogramming (Binning by multiple variables)

An unweighted histogram is simply a groupby over multiple variables with `count`:

```{code-cell} python
arr = np.ones((4, 12))
labels1 = np.array(np.linspace(0, 10, 12))
labels2 = np.array([1, 2, 2, 1])

da = xr.DataArray(
    arr, dims=("x", "y"), coords={"labels2": ("x", labels2), "labels1": ("y", labels1)}
)
da
```

Specify bins in `expected_groups`:

```{code-cell} python
xarray_reduce(
    da,
    "labels1",
    "labels2",
    func="count",
    expected_groups=(
        pd.IntervalIndex.from_breaks([-0.5, 4.5, 6.5, 8.9]),  # labels1
        pd.IntervalIndex.from_breaks([0.5, 1.5, 1.9]),  # labels2
    ),
)
```
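A weighted histogram then differs only in summing a weights array instead of counting; a sketch, where the `weights` values are illustrative and the grouping coordinates are passed explicitly:

```python
weights = xr.DataArray(
    np.random.default_rng(0).random((4, 12)), dims=("x", "y")
)
# Same bins as above, but func="sum" accumulates weights per bin
xarray_reduce(
    weights,
    da.labels1,
    da.labels2,
    func="sum",
    expected_groups=(
        pd.IntervalIndex.from_breaks([-0.5, 4.5, 6.5, 8.9]),
        pd.IntervalIndex.from_breaks([0.5, 1.5, 1.9]),
    ),
)
```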
## Resampling

Use the xarray interface, i.e. `da.resample(time="M").mean()`.

Optionally pass [`method="blockwise"`](method-blockwise): `da.resample(time="M").mean(method="blockwise")`.
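A minimal illustration (with flox installed, a recent xarray dispatches this resampling reduction to flox automatically):

```python
time = pd.date_range("2001-01-01", periods=365, freq="D")
ts = xr.DataArray(np.arange(365.0), dims="time", coords={"time": time})

# Monthly means via xarray's resample; flox does the groupby under the hood
ts.resample(time="M").mean()
```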

flox/core.py (3 additions, 3 deletions)

```diff
@@ -258,7 +258,7 @@ def rechunk_for_cohorts(
         Labels at which we always start a new chunk. For
         the example ``labels`` array, this would be `1`.
     chunksize : int, optional
-        nominal chunk size. Chunk size is exceded when the label
+        nominal chunk size. Chunk size is exceeded when the label
         in ``force_new_chunk_at`` is less than ``chunksize//2`` elements away.
         If None, uses median chunksize along axis.

@@ -447,7 +447,7 @@ def factorize_(
     for groupvar, expect in zip(by, expected_groups):
         flat = groupvar.reshape(-1)
         if isinstance(expect, pd.RangeIndex):
-            # idx is a view of the original `by` aray
+            # idx is a view of the original `by` array
             # copy here so we don't have a race condition with the
             # group_idx[nanmask] = nan_sentinel assignment later
             # this is important in shared-memory parallelism with dask
@@ -861,7 +861,7 @@ def _simple_combine(
     2. _expand_dims was used to insert an extra axis DUMMY_AXIS
     3. Here we concatenate along DUMMY_AXIS, and then call the combine function along
        DUMMY_AXIS
-    4. At the final agggregate step, we squeeze out DUMMY_AXIS
+    4. At the final aggregate step, we squeeze out DUMMY_AXIS
     """
     from dask.array.core import deepfirst
     from dask.utils import deepmap
```
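For orientation, a hedged sketch of a call driven by exactly these documented parameters (names follow the docstring; values are illustrative):

```python
import dask.array as da
import numpy as np

import flox

labels = np.tile(np.arange(1, 13), 2)  # repeating monthly labels
array = da.ones((24,), chunks=5)

# Start a new chunk whenever label 1 reappears; the nominal chunksize may be
# exceeded when a forced boundary is fewer than chunksize//2 elements away
rechunked = flox.core.rechunk_for_cohorts(
    array, axis=-1, labels=labels, force_new_chunk_at=1, chunksize=4
)
rechunked.chunks
```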

flox/xarray.py (2 additions, 2 deletions)

```diff
@@ -347,7 +347,7 @@ def wrapper(array, *by, func, skipna, core_dims, **kwargs):
         array, *by = _broadcast_size_one_dims(array, *by, core_dims=core_dims)

         # Handle skipna here because I need to know dtype to make a good default choice.
-        # We cannnot handle this easily for xarray Datasets in xarray_reduce
+        # We cannot handle this easily for xarray Datasets in xarray_reduce
         if skipna and func in ["all", "any", "count"]:
             raise ValueError(f"skipna cannot be truthy for {func} reductions.")

@@ -511,7 +511,7 @@ def rechunk_for_cohorts(
         Labels at which we always start a new chunk. For
         the example ``labels`` array, this would be `1`.
     chunksize : int, optional
-        nominal chunk size. Chunk size is exceded when the label
+        nominal chunk size. Chunk size is exceeded when the label
         in ``force_new_chunk_at`` is less than ``chunksize//2`` elements away.
         If None, uses median chunksize along ``dim``.
```

pyproject.toml (5 additions, 0 deletions)

```diff
@@ -53,3 +53,8 @@ ignore_missing_imports = true

 [tool.pytest.ini_options]
 addopts = "--tb=short"
+
+
+[tool.codespell]
+ignore-words-list = "nd,nax"
+skip = "*.html"
```
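The `ignore-words-list` keeps codespell from flagging intentional identifiers such as `nd` and `nax`, and `skip = "*.html"` excludes generated pages; with the `tomli` hook dependency added above, codespell picks this table up automatically when run via pre-commit.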
