@@ -534,9 +534,38 @@ passed using the optional `target_storage_options` setting.
534534
535535### Slice Datasets
536536
537- If the slice paths passed to the ` zappend ` tool are given as URIs,
538- additional storage options may be provided for the filesystem given by the
539- URI's protocol. They may be specified using the ` slice_storage_options ` setting.
537+ A _ slice dataset_ is the dataset that is appended for every slice item passed
538+ to ` zappend ` . Slice datasets can be provided in various ways.
539+
540+ * When using the ` zappend ` CLI command, slice datasets are passed as local file path
541+ or URI argument.
542+
543+ * When using the [ ` zappend ` ] ( api.md ) Python function, slice datasets are passed
544+ using the ` slices ` argument, which must be iterable. You can pass a list or tuple
545+ of slice items or provide a generator that provides the slice items.
546+
547+ Each slice item can be a local file path or URI of type ` str ` or ` FileObj ` ,
548+ a dataset of type ` xarray.Dataset ` , or a ` SliceSource ` object explained in more detail
549+ below.
550+
551+ #### Paths and URIs
552+
553+ A slice object of type ` str ` is interpreted as local file path or URI, in the case
554+ the path has a protocol prefix, such as ` s3:// ` as described above.
555+
556+ In the majority of ` zappend ` use cases the slice datasets to be appended to a target
557+ dataset are passed as local file paths or URIs. A slice URI starts with a protocol
558+ prefix, such as ` s3:// ` , or ` memory:// ` . Additional storage options may be required
559+ for the filesystem given by the URI's protocol. They may be specified using the
560+ ` slice_storage_options ` setting.
561+
562+ ``` json
563+ {
564+ "slice_storage_options" : {
565+ "anon" : true
566+ }
567+ }
568+ ```
540569
541570Sometimes, the slice dataset to be processed are not yet available, e.g.,
542571because another process is currently generating them. For such cases, the
@@ -563,82 +592,29 @@ Or use default polling:
563592}
564593```
565594
566-
567- ### Slice Sources
568-
569- A _ slice source_ is an object that provides a slice dataset of type ` xarray.Dataset `
570- for given parameters of any type.
571-
572- The optional ` slice_source ` configuration setting is used to specify a custom
573- slice source. If not specified, ` zappend ` selects the slice source based on the type
574- of a given slice object. These types are described in following subsections.
575-
576- If given, the value of the ` slice_source ` setting is a class derived from
577- ` zappend.api.SliceSource ` , or a function that creates an instance of
578- ` zappend.api.SliceSource ` , or the fully qualified name of the aforementioned.
579- In the case ` slice_source ` is given, the _ slices_ argument passed to the CLI
580- command and Python function become parameters to the specified class constructor
581- or factory function.
582- The individual slice items in the ` SLICES ` arguments of the ` zappend ` CLI
583- command are of type ` str ` , typically interpreted as file paths or URIs.
584- The individual slice items passed in the ` slices ` argument of the
585- ` zappend.api.zappend() ` function can be of any type, but the ` tuple ` , ` list ` ,
586- and ` dict ` types have a special meaning:
587-
588- * ` tuple ` : a pair of the form ` (args, kwargs) ` , where ` args ` is a list
589- or tuple of positional arguments and ` kwargs ` is a dictionary of keyword
590- arguments;
591- * ` list ` : positional arguments only;
592- * ` dict ` : keyword arguments only;
593- * Any other type is interpreted as single positional argument.
594-
595- In addition, your class constructor or factory function specified by ` slice_source `
596- may specify a positional or keyword argument named ` ctx ` , which will receive the
597- current processing context of type ` zappend.api.Context ` .
598-
599- If the ` slice_source ` setting is _ not_ specified, the slice items passed as ` slices `
600- argument to the [ ` zappend ` ] ( api.md ) Python function can be one of the types described
601- in the following subsections.
602-
603- #### Types ` str ` and ` FileObj `
604-
605- A slice object of type ` str ` is interpreted as local file path or URI, in the case
606- the path has a protocol prefix, such as ` s3:// ` .
607-
608595An alternative to providing the slice dataset as path or URI is using the
609596` zappend.api.FileObj ` class, which combines a URI with dedicated filesystem
610597storage options.
611598
612- ``` python
613- from zappend.api import FileObj
614-
615- slice_obj = FileObj(slice_uri, storage_options = dict (... ))
616- ```
617-
618- #### Type ` Dataset `
619-
620- In-memory slice objects can be passed as [ ` xarray.Dataset ` ] ( https://docs.xarray.dev/en/stable/generated/xarray.Dataset.html )
621- objects. Such objects may originate from opening datasets from some storage
622-
623- ``` python
624- import xarray as xr
599+ #### Dataset Objects
625600
626- slice_obj = xr.open_dataset(slice_store, ... )
627- ```
601+ You can also use dataset objects of type
602+ [ ` xarray.Dataset ` ] ( https://docs.xarray.dev/en/stable/generated/xarray.Dataset.html )
603+ as slice item. Such objects may originate from opening datasets from some storage, e.g.,
604+ ` xarray.open_dataset(slice_store, ...) ` or by composing, aggregating, resampling
605+ slice datasets from other datasets and data variables.
628606
629- or by composing, aggregating, resampling slice datasets from other datasets and
630- data variables. To allow for out-of-core computation of large datasets [ Dask arrays] ( https://docs.dask.org/en/stable/array.html )
631- are used by both ` xarray ` and ` zarr ` . As a dask array may represent complex and/or
607+ Chunked data arrays of an ` xarray.Dataset ` are usually instances of
608+ [ Dask arrays] ( https://docs.dask.org/en/stable/array.html ) , to allow for out-of-core
609+ computation of large datasets . As a dask array may represent complex and/or
632610expensive processing graphs, high CPU loads and memory consumption are common issues
633611for computed slice datasets, especially if the specified target dataset chunking is
634612different from the slice dataset chunking. This may cause Dask graphs to be
635613computed multiple times if the source chunking overlaps multiple target chunks,
636614potentially causing large resource overheads while recomputing and/or reloading the same
637- source chunks multiple times.
638-
639- In such cases it can help to "terminate" such computations for each slice by
640- persisting the computed dataset first and then to reopen it. This can be specified
641- using the ` persist_mem_slice ` setting:
615+ source chunks multiple times. In such cases it can help to "terminate" such
616+ computations for each slice by persisting the computed dataset first and then to
617+ reopen it. This can be specified using the ` persist_mem_slice ` setting:
642618
643619``` json
644620{
@@ -650,34 +626,82 @@ If the flag is set, in-memory slices will be persisted to a temporary Zarr befor
650626appending them to the target dataset. It may prevent expensive re-computation of chunks
651627at the cost of additional i/o. It therefore defaults to ` false ` .
652628
653- #### Type ` SliceSource `
629+ #### Slice Sources
630+
631+ If you need some custom cleanup after a slice has been processed and appended to the
632+ target dataset, you can use an instance of ` zappend.api.SliceSource ` as slice item.
633+ A ` SliceSource ` class requires you to implement two methods:
634+
635+ * ` get_dataset() ` to return the slice dataset of type ` xarray.Dataset ` , and
636+ * ` dispose() ` to perform any resource cleanup tasks.
637+
638+ Here is the template code for your own slice source implementation:
639+
640+ ``` python
641+ import xarray as xr
642+ from zappend.api import SliceSource
643+
644+ class MySliceSource (SliceSource ):
645+ # Pass any positional and keyword arguments that you need
646+ # to the constructor. `path` is just an example.
647+ def __init__ (self , path : str ):
648+ self .path = path
649+ self .ds = None
650+
651+ def get_dataset (self ) -> xr.Dataset:
652+ # Write code here that obtains the dataset.
653+ self .ds = xr.open_dataset(self .path)
654+ # You can put any processing here.
655+ return self .ds
654656
655- Often you want to perform some custom cleanup after a slice has been processed and
656- appended to the target dataset. In this case you can write your own
657- ` zappend.api.SliceSource ` by implementing its ` get_dataset() ` and ` dispose() `
658- methods.
657+ def dispose (self ):
658+ # Write code here that performs cleanup.
659+ if self .ds is not None :
660+ self .ds.close()
661+ self .ds = None
662+ ```
659663
660- Slice source instances are supposed to be created by _ slice factories_ , see
661- subsection below.
664+ Instead of providing instances of ` SliceSource ` directly as a slice item, it is often
665+ easier to pass your ` SliceSource ` class and let ` zappend ` pass the slice item as
666+ arguments(s) to your ` SliceSource ` 's constructor. This can be achieved using the
667+ the ` slice_source ` configuration setting.
662668
663- #### Type ` SliceFactory `
669+ ``` json
670+ {
671+ "slice_source" : " mymodule.MySliceSource"
672+ }
673+ ```
664674
665- A slice factory is a 1-argument function that receives a processing context of type
666- ` zappend.api.Context ` and yields a slice dataset object of one of the types
667- described above. Since a slice factory cannot have additional arguments, it is
668- normally defined as a [ closure] ( https://en.wikipedia.org/wiki/Closure_(computer_programming) )
669- to capture slice-specific information.
675+ The ` slice_source ` setting can actually be ** any Python function** that returns a
676+ valid slice item as described above.
670677
671- In the following example, the actual slice dataset is computed from averaging another
672- dataset. A ` SliceSource ` is used to close the datasets after the slice has been
673- processed. Slice factories are created from the custom slice source and the slice paths
674- using the utility function [ to_slice_factories()] [ zappend.slice.factory.to_slice_factories ] :
678+ If a slice source is configured, each slice item passed to ` zappend ` is passed as
679+ argument to your slice source.
680+
681+ * Slices passed to the ` zappend ` CLI command become slice source arguments
682+ of type ` str ` .
683+ * Slice items passed to the ` zappend ` function via the ` slices ` argument can be of
684+ any type, but the ` tuple ` , ` list ` , and ` dict ` types have a special meaning:
685+
686+ - ` tuple ` : a pair of the form ` (args, kwargs) ` , where ` args ` is a list
687+ or tuple of positional arguments and ` kwargs ` is a dictionary of keyword
688+ arguments;
689+ - ` list ` : positional arguments only;
690+ - ` dict ` : keyword arguments only;
691+ - Any other type is interpreted as single positional argument.
692+
693+ In addition, your slice source function or class constructor specified by ` slice_source `
694+ may define a 1st positional argument or keyword argument named ` ctx ` ,
695+ which will receive the current processing context of type ` zappend.api.Context ` .
696+ This can be useful if you need to read configuration settings.
697+
698+ Here is a more advanced example of a slice source that opens datasets from a given
699+ file path and averages the values first along the time dimension:
675700
676701``` python
677702import numpy as np
678703import xarray as xr
679704from zappend.api import SliceSource
680- from zappend.api import to_slice_factories
681705from zappend.api import zappend
682706
683707def get_mean_time (slice_ds : xr.Dataset) -> xr.DataArray:
@@ -690,37 +714,25 @@ def get_mean_time(slice_ds: xr.Dataset) -> xr.DataArray:
690714
691715def get_mean_slice (slice_ds : xr.Dataset) -> xr.Dataset:
692716 mean_slice_ds = slice_ds.mean(" time" )
717+ # Re-introduce time dimension of size one
693718 mean_slice_ds = mean_slice_ds.expand_dims(" time" , axis = 0 )
694719 mean_slice_ds.coords[" time" ] = get_mean_time(slice_ds)
695720 return mean_slice_ds
696721
697722class MySliceSource (SliceSource ):
698- def __init__ (self , ctx , slice_path ):
699- super ().__init__ (ctx)
723+ def __init__ (self , slice_path ):
700724 self .slice_path = slice_path
701725 self .ds = None
702- self .mean_ds = None
703726
704727 def get_dataset (self ):
705728 self .ds = xr.open_dataset(self .slice_path)
706- self .mean_ds = get_mean_slice(self .ds)
707- return self .mean_ds
729+ return get_mean_slice(self .ds)
708730
709731 def dispose (self ):
710732 if self .ds is not None :
711733 self .ds.close()
712734 self .ds = None
713- if self .mean_ds is not None :
714- self .mean_ds.close()
715- self .mean_ds = None
716-
717- zappend(to_slice_factories(MySliceSource, [" slice-1.nc" , " slice-2.nc" , " slice-3.nc" ]),
718- target_dir = " target.zarr" )
719- ```
720-
721- Note, the above example can be simplified by using the ` slice_source ` setting directly:
722735
723- ``` python
724736zappend([" slice-1.nc" , " slice-2.nc" , " slice-3.nc" ],
725737 target_dir = " target.zarr" ,
726738 slice_source = MySliceSource)
0 commit comments