Commit 60b9eb3

docs update
1 parent e2d0155 commit 60b9eb3

3 files changed

+118
-96
lines changed

docs/guide.md

Lines changed: 111 additions & 88 deletions
@@ -666,7 +666,13 @@ You can also use dataset objects of type
666666
[`xarray.Dataset`](https://docs.xarray.dev/en/stable/generated/xarray.Dataset.html)
667667
as slice item. Such objects may originate from opening datasets from some storage, e.g.,
668668
`xarray.open_dataset(slice_store, ...)` or by composing, aggregating, resampling
669-
slice datasets from other datasets and data variables.
669+
slice datasets from other datasets and dataset variables.
670+
671+
!!! note "Datasets are not closed automatically"
672+
If you pass [`xarray.Dataset`](https://docs.xarray.dev/en/stable/generated/xarray.Dataset.html) objects to `zappend` they will not be
673+
automatically closed. This may become an issue if you have many datasets
674+
and each one binds resources such as open file handles. Consider using a
675+
_slice source_ then, see below.
670676

671677
Chunked data arrays of an `xarray.Dataset` are usually instances of
672678
[Dask arrays](https://docs.dask.org/en/stable/array.html), to allow for out-of-core
@@ -676,9 +682,9 @@ for computed slice datasets, especially if the specified target dataset chunking
676682
different from the slice dataset chunking. This may cause Dask graphs to be
677683
computed multiple times if the source chunking overlaps multiple target chunks,
678684
potentially causing large resource overheads while recomputing and/or reloading the same
679-
source chunks multiple times. In such cases it can help to "terminate"
680-
computations for each slice by persisting the computed dataset first and then to
681-
reopen it. This can be specified using the `persist_mem_slice` setting:
685+
source chunks multiple times. In such cases it can help to "terminate" computations for
686+
each slice by persisting the computed dataset first and then to reopen it. This can be
687+
specified using the `persist_mem_slice` setting:
682688

683689
```json
684690
{
@@ -692,61 +698,95 @@ at the cost of additional i/o. It therefore defaults to `false`.
692698

693699
#### Slice Sources
694700

695-
If you need some custom cleanup after a slice has been processed and appended to the
696-
target dataset, you can use instances of `zappend.api.SliceSource` as slice items.
697-
The `SliceSource` methods with special meaning are:
701+
A slice source gives you full control over how a slice dataset is created, loaded,
702+
or generated and how its bound resources, if any, are released. In its simplest form,
703+
a slice source is a plain Python function that can take any arguments and returns
704+
an `xarray.Dataset`:
705+
706+
```python
707+
import xarray as xr
708+
709+
# Slice source argument `path` is just an example.
710+
def get_dataset(path: str) -> xr.Dataset:
711+
# Provide dataset here. No matter how, e.g.:
712+
return xr.open_dataset(path)
713+
```
714+
715+
If you need cleanup code that is executed after the slice dataset has been appended,
716+
you can turn your slice source function into a [context manager](https://docs.python.org/3/library/contextlib.html)
717+
(new in zappend v0.7):
718+
719+
```python
720+
from contextlib import contextmanager
721+
import xarray as xr
722+
723+
# Slice source argument `path` is just an example.
724+
@contextmanager
725+
def get_dataset(path: str) -> xr.Dataset:
726+
# Bind any resources and provide dataset here, e.g.:
727+
dataset = xr.open_dataset(path)
728+
try:
729+
# Yield (not return!) the dataset
730+
yield dataset
731+
finally:
732+
# Cleanup code here, release any bound resources, e.g.:
733+
dataset.close()
734+
```
735+
736+
You can also implement your slice source as a class derived from the abstract
737+
`zappend.api.SliceSource` class. Its interface methods are:
698738

699739
* `get_dataset()`: a zero-argument method that returns the slice dataset of type
700740
`xarray.Dataset`. You must implement this abstract method.
701-
* `close()`: perform any resource cleanup tasks
741+
* `close()`: Optional method. Put your cleanup code here.
702742
(in zappend < v0.7, the `close` method was called `dispose`).
703-
* `__init__()`: optional constructor that receives any arguments passed to the
704-
slice source.
705743

706-
Here is the template code for your own slice source implementation:
707744

708745
```python
709746
import xarray as xr
710747
from zappend.api import SliceSource
711748

712749
class MySliceSource(SliceSource):
713-
# Pass any positional and keyword arguments that you need
714-
# to the constructor. `path` is just an example.
750+
# Slice source argument `path` is just an example.
715751
def __init__(self, path: str):
716752
self.path = path
717-
self.ds = None
753+
self.dataset = None
718754

719755
def get_dataset(self) -> xr.Dataset:
720-
# Write code here that obtains the dataset.
721-
self.ds = xr.open_dataset(self.path)
722-
# You can put any processing here.
723-
return self.ds
756+
# Bind any resources and provide dataset here, e.g.:
757+
self.dataset = xr.open_dataset(self.path)
758+
return self.dataset
724759

725760
def close(self):
726-
# Write code here that performs cleanup.
727-
if self.ds is not None:
728-
self.ds.close()
729-
self.ds = None
761+
# Cleanup code here, release any bound resources, e.g.:
762+
if self.dataset is not None:
763+
self.dataset.close()
730764
```
731765

732-
Instead of providing instances of `SliceSource` as slice items, it is often
733-
easier to pass your `SliceSource` class and let `zappend` pass the slice item as
734-
arguments(s) to your `SliceSource`'s constructor. This can be achieved using the
735-
the `slice_source` configuration setting. If you need to access configuration
736-
settings, it is even required to use the `slice_source` setting.
766+
You may prefer implementing a class because your slice source is complex and you want
767+
to split its logic into separate methods. You may also just prefer classes as a matter
768+
of your personal taste. Another advantage of using a class is that you can pass
769+
instances of it as slice items to the `zappend` function without further configuration.
770+
However, the intended use of a slice source is to configure it by specifying the
771+
`slice_source` setting. In a JSON or YAML configuration file it specifies the fully
772+
qualified name of the slice source function or class:
737773

738774
```json
739775
{
740776
"slice_source": "mymodule.MySliceSource"
741777
}
742778
```
743779

744-
The `slice_source` setting can actually be **any Python function** that returns a
745-
valid slice item as described above such as a file path or URI, or
746-
an `xarray.Dataset`.
780+
If you use the `zappend` function, you can pass the function or class directly:
781+
782+
```python
783+
zappend(["slice-1.nc", "slice-2.nc", "slice-3.nc"],
784+
target_dir="target.zarr",
785+
slice_source=MySliceSource)
786+
```
747787

748-
If a slice source is configured, each slice item passed to `zappend` is passed as
749-
argument to your slice source.
788+
If the slice source setting is used, each slice item passed to `zappend` is passed as
789+
argument(s) to your slice source.
750790

751791
* Slices passed to the `zappend` CLI command become slice source arguments
752792
of type `str`.
@@ -765,88 +805,71 @@ You can also pass extra keyword arguments to your slice source using the
765805
every slice source invocation. Keyword arguments passed as slice items with same name
766806
as in `slice_source_kwargs` take precedence.
767807

768-
In addition, your slice source function or class constructor specified by
769-
`slice_source` may define a 1st positional argument or keyword argument
770-
named `ctx`, which will receive the current processing context of type
771-
`zappend.api.Context`. This can be useful if you need to read configuration
772-
settings.
808+
Slice arguments are passed to your slice source for every slice. If your slice source
809+
has many parameters that stay the same for all slices, you may prefer providing
810+
parameters as settings in the configuration. This can be achieved using the `extra`
811+
setting:
812+
813+
```json
814+
{
815+
"extra": {
816+
"quantiles": [0.1, 0.5, 0.9],
817+
"use_threshold": true,
818+
"filter": "gauss"
819+
}
820+
}
821+
```
822+
823+
To access the settings in `extra` your slice source function or class constructor
824+
must define a special argument named `ctx`. It must be the 1st positional argument or
825+
a keyword argument. The argument `ctx` is the current processing context of type
826+
`zappend.api.Context` that also contains the configuration.
773827

774828
Here is a more advanced example of a slice source that opens datasets from a given
775829
file path and averages the values along the time dimension:
776830

777831
```python
778832
import numpy as np
779833
import xarray as xr
834+
from zappend.api import Context
780835
from zappend.api import SliceSource
781836
from zappend.api import zappend
782837

783-
def get_mean_time(slice_ds: xr.Dataset) -> xr.DataArray:
784-
time = slice_ds.time
785-
t0 = time[0]
786-
dt = time[-1] - t0
787-
return xr.DataArray(np.array([t0 + dt / 2],
788-
dtype=slice_ds.time.dtype),
789-
dims="time")
790-
791-
def get_mean_slice(slice_ds: xr.Dataset) -> xr.Dataset:
792-
mean_slice_ds = slice_ds.mean("time")
793-
# Re-introduce time dimension of size one
794-
mean_slice_ds = mean_slice_ds.expand_dims("time", axis=0)
795-
mean_slice_ds.coords["time"] = get_mean_time(slice_ds)
796-
return mean_slice_ds
797-
798838
class MySliceSource(SliceSource):
799-
def __init__(self, slice_path):
839+
def __init__(self, ctx: Context, slice_path: str):
840+
self.quantiles = ctx.config.extra.get("quantiles", [0.5])
800841
self.slice_path = slice_path
801842
self.ds = None
802843

803844
def get_dataset(self):
804845
self.ds = xr.open_dataset(self.slice_path)
805-
return get_mean_slice(self.ds)
846+
return self.get_agg_slice(self.ds)
806847

807848
def close(self):
808849
if self.ds is not None:
809850
self.ds.close()
810-
self.ds = None
811851

852+
def get_agg_slice(self, slice_ds: xr.Dataset) -> xr.Dataset:
853+
agg_slice_ds = slice_ds.quantile(self.quantiles, dim="time")
854+
# Re-introduce time dimension of size one
855+
agg_slice_ds = agg_slice_ds.expand_dims("time", axis=0)
856+
agg_slice_ds.coords["time"] = self.get_mean_time(slice_ds)
857+
return agg_slice_ds
858+
859+
@classmethod
860+
def get_mean_time(cls, slice_ds: xr.Dataset) -> xr.DataArray:
861+
time = slice_ds.time
862+
t0 = time[0]
863+
dt = time[-1] - t0
864+
return xr.DataArray(np.array([t0 + dt / 2],
865+
dtype=slice_ds.time.dtype),
866+
dims="time")
867+
812868
zappend(["slice-1.nc", "slice-2.nc", "slice-3.nc"],
813869
target_dir="target.zarr",
814870
slice_source=MySliceSource)
815871
```
816872

817-
Since zappend 0.7, a slice source can also be written as a Python
818-
[context manager](https://docs.python.org/3/library/contextlib.html),
819-
which allows you implementing the `get_dataset()` and `close()`
820-
methods in one single function, instead of a class. Here is the above example
821-
written as context manager.
822-
823-
```python
824-
from contextlib import contextmanager
825-
import numpy as np
826-
import xarray as xr
827-
from zappend.api import zappend
828-
829-
# Same as above here
830-
831-
@contextmanager
832-
def get_slice_dataset(slice_path):
833-
# allocate resources here
834-
ds = xr.open_dataset(slice_path)
835-
mean_ds = get_mean_slice(ds)
836-
try:
837-
# yield (!) the slice dataset
838-
# so it can be appended
839-
yield mean_ds
840-
finally:
841-
# after slice dataset has been appended
842-
# release resources here
843-
ds.close()
844-
845-
zappend(["slice-1.nc", "slice-2.nc", "slice-3.nc"],
846-
target_dir="target.zarr",
847-
slice_source=get_slice_dataset)
848-
```
849-
850873
## Profiling
851874

852875
Runtime profiling is very important for understanding program runtime behavior
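
A minimal sketch of the function-based slice source pattern from the `docs/guide.md` changes above, combining the context-manager form with `extra` settings read through `ctx`. It is written to run without zappend or xarray installed: `make_fake_ctx`, `process_slices` and `events` are hypothetical helpers, a plain dict stands in for the `xarray.Dataset`, and the driver loop is only an assumption about how a consumer might use the slice source, not zappend's actual implementation.

```python
from contextlib import contextmanager
from types import SimpleNamespace

# Records the order of operations so the lifecycle can be inspected.
events = []


def make_fake_ctx(extra: dict) -> SimpleNamespace:
    # Hypothetical stand-in for `zappend.api.Context`;
    # it only mimics the `ctx.config.extra` access used in the guide.
    return SimpleNamespace(config=SimpleNamespace(extra=extra))


@contextmanager
def get_dataset(ctx, path: str):
    # Read a setting from `extra`, falling back to a default.
    quantiles = ctx.config.extra.get("quantiles", [0.5])
    # Stand-in for `xr.open_dataset(path)`: a plain dict replaces the dataset.
    dataset = {"path": path, "quantiles": quantiles}
    events.append(f"open {path}")
    try:
        # Yield (not return) the dataset so cleanup can run afterwards.
        yield dataset
    finally:
        # Runs after the consumer is done with the slice, even on errors.
        events.append(f"close {path}")


def process_slices(ctx, slice_items):
    # Hypothetical driver: enters the slice source once per slice item.
    for item in slice_items:
        with get_dataset(ctx, item) as dataset:
            events.append(f"append {dataset['path']} q={dataset['quantiles']}")


ctx = make_fake_ctx({"quantiles": [0.1, 0.5, 0.9]})
process_slices(ctx, ["slice-1.nc", "slice-2.nc"])
```

In this sketch each slice is opened, consumed and closed before the next one is opened, which is the point of the note about datasets that are not closed automatically.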

zappend/config/schema.py

Lines changed: 2 additions & 2 deletions
@@ -752,9 +752,9 @@
752752
extra={
753753
"category": "Misc.",
754754
"description": (
755-
"Arbitrary configuration that is not validated by default."
755+
"Extra settings."
756756
" Intended use is by a `slice_source` that expects an argument"
757-
" named `ctx` and therefore can access the configuration."
757+
" named `ctx` to access the extra settings and other configuration."
758758
),
759759
"type": "object",
760760
"additionalProperties": True,

zappend/slice/source.py

Lines changed: 5 additions & 6 deletions
@@ -72,12 +72,11 @@ def dispose(self):
7272
"""The possible types that can represent a slice dataset."""
7373

7474
SliceCallable = Type[SliceSource] | Callable[[...], SliceItem]
75-
"""This type is either a class derived from `SliceSource`
76-
or a function that returns a `SliceItem`. Both can be invoked
77-
with any number of positional or keyword arguments. The processing
78-
context, if used, is passed either as 1st positional argument or
79-
as keyword argument and must be named `ctx` and must have
80-
type `Context`.
75+
"""This type is either a class derived from `SliceSource` or a function that
76+
returns a `SliceItem`. Both can be invoked with any number of positional or
77+
keyword arguments. The processing context, if used, must be named `ctx` and
78+
must be either the 1st positional argument or a keyword argument. Its type
79+
is `Context`.
8180
"""
8281

8382
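
The `SliceCallable` docstring above states that `ctx`, if used, must be the 1st positional argument or a keyword argument. One way a caller could honor that rule is to inspect the callable's signature. The following is a sketch under that assumption only: `call_slice_source` and the example callables are hypothetical names, and this is not taken from zappend's source code.

```python
import inspect
from typing import Any, Callable


def call_slice_source(slice_source: Callable[..., Any], ctx: Any,
                      *args: Any, **kwargs: Any) -> Any:
    """Invoke `slice_source`, injecting `ctx` only if its signature asks for it."""
    params = list(inspect.signature(slice_source).parameters.values())
    ctx_param = next((p for p in params if p.name == "ctx"), None)
    if ctx_param is None:
        # The slice source does not want the processing context.
        return slice_source(*args, **kwargs)
    positional_kinds = (inspect.Parameter.POSITIONAL_ONLY,
                        inspect.Parameter.POSITIONAL_OR_KEYWORD)
    if params[0] is ctx_param and ctx_param.kind in positional_kinds:
        # `ctx` is the 1st positional argument.
        return slice_source(ctx, *args, **kwargs)
    # Otherwise pass `ctx` as a keyword argument.
    return slice_source(*args, ctx=ctx, **kwargs)


# Hypothetical slice sources covering the three cases.
def open_plain(path):
    return ("no-ctx", path)


def open_with_ctx(ctx, path):
    return (ctx, path)


def open_with_ctx_kwarg(path, *, ctx=None):
    return (ctx, path)


class ClassSource:
    # For a class, `inspect.signature` reports the constructor
    # parameters without `self`.
    def __init__(self, ctx, path):
        self.ctx = ctx
        self.path = path
```

The same helper works for functions and for classes, which matches the docstring's statement that both kinds of `SliceCallable` can be invoked with arbitrary positional or keyword arguments.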
