@@ -555,9 +555,38 @@ back. Therefore, a log message with warning level will be emitted if the
555555
556556### Slice Datasets
557557
558- If the slice paths passed to the ` zappend ` tool are given as URIs,
559- additional storage options may be provided for the filesystem given by the
560- URI's protocol. They may be specified using the ` slice_storage_options ` setting.
558+ A _ slice dataset_ is the dataset that is appended for every slice item passed
559+ to ` zappend ` . Slice datasets can be provided in various ways.
560+
561+ * When using the [ zappend CLI command] ( cli.md ) , slice items are passed as
562+ command arguments where they point to slice datasets by local file paths or URIs.
563+
564+ * When using the [ zappend Python function] ( api.md ) , slice items are passed
565+ using the ` slices ` argument, which is a Python iterable. You can pass a list or tuple
566+ of slice items or provide a Python generator that provides the slice items.
567+
568+ Each slice item can be a local file path or URI of type ` str ` or ` FileObj ` ,
569+ a dataset of type ` xarray.Dataset ` , or a ` SliceSource ` object explained in more detail
570+ below.
571+
572+ #### Paths and URIs
573+
574+ A slice item of type ` str ` is interpreted as local file path or URI, in the case
575+ the path has a protocol prefix, such as ` s3:// ` as described above.
576+
577+ In the majority of ` zappend ` use cases the slice datasets to be appended to a target
578+ dataset are passed as local file paths or URIs. A slice URI starts with a protocol
579+ prefix, such as ` s3:// ` , or ` memory:// ` . Additional storage options may be required
580+ for the filesystem given by the URI's protocol. They may be specified using the
581+ ` slice_storage_options ` setting.
582+
583+ ``` json
584+ {
585+ "slice_storage_options" : {
586+ "anon" : true
587+ }
588+ }
589+ ```
561590
562591Sometimes, the slice dataset to be processed are not yet available, e.g.,
563592because another process is currently generating them. For such cases, the
@@ -584,82 +613,29 @@ Or use default polling:
584613}
585614```
586615
587-
588- ### Slice Sources
589-
590- A _ slice source_ is an object that provides a slice dataset of type ` xarray.Dataset `
591- for given parameters of any type.
592-
593- The optional ` slice_source ` configuration setting is used to specify a custom
594- slice source. If not specified, ` zappend ` selects the slice source based on the type
595- of a given slice object. These types are described in following subsections.
596-
597- If given, the value of the ` slice_source ` setting is a class derived from
598- ` zappend.api.SliceSource ` , or a function that creates an instance of
599- ` zappend.api.SliceSource ` , or the fully qualified name of the aforementioned.
600- In the case ` slice_source ` is given, the _ slices_ argument passed to the CLI
601- command and Python function become parameters to the specified class constructor
602- or factory function.
603- The individual slice items in the ` SLICES ` arguments of the ` zappend ` CLI
604- command are of type ` str ` , typically interpreted as file paths or URIs.
605- The individual slice items passed in the ` slices ` argument of the
606- ` zappend.api.zappend() ` function can be of any type, but the ` tuple ` , ` list ` ,
607- and ` dict ` types have a special meaning:
608-
609- * ` tuple ` : a pair of the form ` (args, kwargs) ` , where ` args ` is a list
610- or tuple of positional arguments and ` kwargs ` is a dictionary of keyword
611- arguments;
612- * ` list ` : positional arguments only;
613- * ` dict ` : keyword arguments only;
614- * Any other type is interpreted as single positional argument.
615-
616- In addition, your class constructor or factory function specified by ` slice_source `
617- may specify a positional or keyword argument named ` ctx ` , which will receive the
618- current processing context of type ` zappend.api.Context ` .
619-
620- If the ` slice_source ` setting is _ not_ specified, the slice items passed as ` slices `
621- argument to the [ ` zappend ` ] ( api.md ) Python function can be one of the types described
622- in the following subsections.
623-
624- #### Types ` str ` and ` FileObj `
625-
626- A slice object of type ` str ` is interpreted as local file path or URI, in the case
627- the path has a protocol prefix, such as ` s3:// ` .
628-
629616An alternative to providing the slice dataset as path or URI is using the
630617` zappend.api.FileObj ` class, which combines a URI with dedicated filesystem
631618storage options.
632619
633- ``` python
634- from zappend.api import FileObj
635-
636- slice_obj = FileObj(slice_uri, storage_options = dict (... ))
637- ```
638-
639- #### Type ` Dataset `
640-
641- In-memory slice objects can be passed as [ ` xarray.Dataset ` ] ( https://docs.xarray.dev/en/stable/generated/xarray.Dataset.html )
642- objects. Such objects may originate from opening datasets from some storage
643-
644- ``` python
645- import xarray as xr
620+ #### Dataset Objects
646621
647- slice_obj = xr.open_dataset(slice_store, ... )
648- ```
622+ You can also use dataset objects of type
623+ [ ` xarray.Dataset ` ] ( https://docs.xarray.dev/en/stable/generated/xarray.Dataset.html )
624+ as slice item. Such objects may originate from opening datasets from some storage, e.g.,
625+ ` xarray.open_dataset(slice_store, ...) ` or by composing, aggregating, resampling
626+ slice datasets from other datasets and data variables.
649627
650- or by composing, aggregating, resampling slice datasets from other datasets and
651- data variables. To allow for out-of-core computation of large datasets [ Dask arrays] ( https://docs.dask.org/en/stable/array.html )
652- are used by both ` xarray ` and ` zarr ` . As a dask array may represent complex and/or
628+ Chunked data arrays of an ` xarray.Dataset ` are usually instances of
629+ [ Dask arrays] ( https://docs.dask.org/en/stable/array.html ) , to allow for out-of-core
630+ computation of large datasets . As a dask array may represent complex and/or
653631expensive processing graphs, high CPU loads and memory consumption are common issues
654632for computed slice datasets, especially if the specified target dataset chunking is
655633different from the slice dataset chunking. This may cause Dask graphs to be
656634computed multiple times if the source chunking overlaps multiple target chunks,
657635potentially causing large resource overheads while recomputing and/or reloading the same
658- source chunks multiple times.
659-
660- In such cases it can help to "terminate" such computations for each slice by
661- persisting the computed dataset first and then to reopen it. This can be specified
662- using the ` persist_mem_slice ` setting:
636+ source chunks multiple times. In such cases it can help to "terminate" such
637+ computations for each slice by persisting the computed dataset first and then to
638+ reopen it. This can be specified using the ` persist_mem_slice ` setting:
663639
664640``` json
665641{
@@ -671,34 +647,82 @@ If the flag is set, in-memory slices will be persisted to a temporary Zarr befor
671647appending them to the target dataset. It may prevent expensive re-computation of chunks
672648at the cost of additional i/o. It therefore defaults to ` false ` .
673649
674- #### Type ` SliceSource `
650+ #### Slice Sources
651+
652+ If you need some custom cleanup after a slice has been processed and appended to the
653+ target dataset, you can use an instance of ` zappend.api.SliceSource ` as slice item.
654+ A ` SliceSource ` class requires you to implement two methods:
675655
676- Often you want to perform some custom cleanup after a slice has been processed and
677- appended to the target dataset. In this case you can write your own
678- ` zappend.api.SliceSource ` by implementing its ` get_dataset() ` and ` dispose() `
679- methods.
656+ * ` get_dataset() ` to return the slice dataset of type ` xarray.Dataset ` , and
657+ * ` dispose() ` to perform any resource cleanup tasks.
680658
681- Slice source instances are supposed to be created by _ slice factories_ , see
682- subsection below.
659+ Here is the template code for your own slice source implementation:
683660
684- #### Type ` SliceFactory `
661+ ``` python
662+ import xarray as xr
663+ from zappend.api import SliceSource
685664
686- A slice factory is a 1-argument function that receives a processing context of type
687- ` zappend.api.Context ` and yields a slice dataset object of one of the types
688- described above. Since a slice factory cannot have additional arguments, it is
689- normally defined as a [ closure] ( https://en.wikipedia.org/wiki/Closure_(computer_programming) )
690- to capture slice-specific information.
665+ class MySliceSource (SliceSource ):
666+ # Pass any positional and keyword arguments that you need
667+ # to the constructor. `path` is just an example.
668+ def __init__ (self , path : str ):
669+ self .path = path
670+ self .ds = None
671+
672+ def get_dataset (self ) -> xr.Dataset:
673+ # Write code here that obtains the dataset.
674+ self .ds = xr.open_dataset(self .path)
675+ # You can put any processing here.
676+ return self .ds
677+
678+ def dispose (self ):
679+ # Write code here that performs cleanup.
680+ if self .ds is not None :
681+ self .ds.close()
682+ self .ds = None
683+ ```
691684
692- In the following example, the actual slice dataset is computed from averaging another
693- dataset. A ` SliceSource ` is used to close the datasets after the slice has been
694- processed. Slice factories are created from the custom slice source and the slice paths
695- using the utility function [ to_slice_factories()] [ zappend.slice.factory.to_slice_factories ] :
685+ Instead of providing instances of ` SliceSource ` directly as a slice item, it is often
686+ easier to pass your ` SliceSource ` class and let ` zappend ` pass the slice item as
687+ arguments(s) to your ` SliceSource ` 's constructor. This can be achieved using the
688+ the ` slice_source ` configuration setting.
689+
690+ ``` json
691+ {
692+ "slice_source" : " mymodule.MySliceSource"
693+ }
694+ ```
695+
696+ The ` slice_source ` setting can actually be ** any Python function** that returns a
697+ valid slice item as described above.
698+
699+ If a slice source is configured, each slice item passed to ` zappend ` is passed as
700+ argument to your slice source.
701+
702+ * Slices passed to the ` zappend ` CLI command become slice source arguments
703+ of type ` str ` .
704+ * Slice items passed to the ` zappend ` function via the ` slices ` argument can be of
705+ any type, but the ` tuple ` , ` list ` , and ` dict ` types have a special meaning:
706+
707+ - ` tuple ` : a pair of the form ` (args, kwargs) ` , where ` args ` is a list
708+ or tuple of positional arguments and ` kwargs ` is a dictionary of keyword
709+ arguments;
710+ - ` list ` : positional arguments only;
711+ - ` dict ` : keyword arguments only;
712+ - Any other type is interpreted as single positional argument.
713+
714+ In addition, your slice source function or class constructor specified by ` slice_source `
715+ may define a 1st positional argument or keyword argument named ` ctx ` ,
716+ which will receive the current processing context of type ` zappend.api.Context ` .
717+ This can be useful if you need to read configuration settings.
718+
719+ Here is a more advanced example of a slice source that opens datasets from a given
720+ file path and averages the values first along the time dimension:
696721
697722``` python
698723import numpy as np
699724import xarray as xr
700725from zappend.api import SliceSource
701- from zappend.api import to_slice_factories
702726from zappend.api import zappend
703727
704728def get_mean_time (slice_ds : xr.Dataset) -> xr.DataArray:
@@ -711,37 +735,25 @@ def get_mean_time(slice_ds: xr.Dataset) -> xr.DataArray:
711735
712736def get_mean_slice (slice_ds : xr.Dataset) -> xr.Dataset:
713737 mean_slice_ds = slice_ds.mean(" time" )
738+ # Re-introduce time dimension of size one
714739 mean_slice_ds = mean_slice_ds.expand_dims(" time" , axis = 0 )
715740 mean_slice_ds.coords[" time" ] = get_mean_time(slice_ds)
716741 return mean_slice_ds
717742
718743class MySliceSource (SliceSource ):
719- def __init__ (self , ctx , slice_path ):
720- super ().__init__ (ctx)
744+ def __init__ (self , slice_path ):
721745 self .slice_path = slice_path
722746 self .ds = None
723- self .mean_ds = None
724747
725748 def get_dataset (self ):
726749 self .ds = xr.open_dataset(self .slice_path)
727- self .mean_ds = get_mean_slice(self .ds)
728- return self .mean_ds
750+ return get_mean_slice(self .ds)
729751
730752 def dispose (self ):
731753 if self .ds is not None :
732754 self .ds.close()
733755 self .ds = None
734- if self .mean_ds is not None :
735- self .mean_ds.close()
736- self .mean_ds = None
737-
738- zappend(to_slice_factories(MySliceSource, [" slice-1.nc" , " slice-2.nc" , " slice-3.nc" ]),
739- target_dir = " target.zarr" )
740- ```
741-
742- Note, the above example can be simplified by using the ` slice_source ` setting directly:
743756
744- ``` python
745757zappend([" slice-1.nc" , " slice-2.nc" , " slice-3.nc" ],
746758 target_dir = " target.zarr" ,
747759 slice_source = MySliceSource)
0 commit comments