@@ -666,7 +666,13 @@ You can also use dataset objects of type
666666[ ` xarray.Dataset ` ] ( https://docs.xarray.dev/en/stable/generated/xarray.Dataset.html )
667667as slice item. Such objects may originate from opening datasets from some storage, e.g.,
668668` xarray.open_dataset(slice_store, ...) ` or by composing, aggregating, resampling
669- slice datasets from other datasets and data variables.
669+ slice datasets from other datasets and dataset variables.
670+
671+ !!! note "Datasets are not closed automatically"
672+ If you pass [ ` xarray.Dataset ` ] ( https://docs.xarray.dev/en/stable/generated/xarray.Dataset.html ) objects to ` zappend ` they will not be
673+ automatically closed. This may become be issue, if you have many datasets
674+ and each one binds resources such as open file handles. Consider using a
675+ _ slice source_ then, see below.
670676
671677Chunked data arrays of an ` xarray.Dataset ` are usually instances of
672678[ Dask arrays] ( https://docs.dask.org/en/stable/array.html ) , to allow for out-of-core
@@ -676,9 +682,9 @@ for computed slice datasets, especially if the specified target dataset chunking
676682different from the slice dataset chunking. This may cause Dask graphs to be
677683computed multiple times if the source chunking overlaps multiple target chunks,
678684potentially causing large resource overheads while recomputing and/or reloading the same
679- source chunks multiple times. In such cases it can help to "terminate"
680- computations for each slice by persisting the computed dataset first and then to
681- reopen it. This can be specified using the ` persist_mem_slice ` setting:
685+ source chunks multiple times. In such cases it can help to "terminate" computations for
686+ each slice by persisting the computed dataset first and then to reopen it. This can be
687+ specified using the ` persist_mem_slice ` setting:
682688
683689``` json
684690{
@@ -692,61 +698,95 @@ at the cost of additional i/o. It therefore defaults to `false`.
692698
693699#### Slice Sources
694700
695- If you need some custom cleanup after a slice has been processed and appended to the
696- target dataset, you can use instances of ` zappend.api.SliceSource ` as slice items.
697- The ` SliceSource ` methods with special meaning are:
701+ A slice source gives you full control about how a slice dataset is created, loaded,
702+ or generated and how its bound resources, if any, are released. In its simplest form,
703+ a slice source is a plain Python function that can take any arguments and returns
704+ an ` xarray.Dataset ` :
705+
706+ ``` python
707+ import xarray as xr
708+
709+ # Slice source argument `path` is just an example.
710+ def get_dataset (path : str ) -> xr.Dataset:
711+ # Provide dataset here. No matter how, e.g.:
712+ return xr.open_dataset(path)
713+ ```
714+
715+ If you need cleanup code that is executed after the slice dataset has been appended,
716+ you can turn your slice source function into a [ context manager] ( https://docs.python.org/3/library/contextlib.html )
717+ (new in zappend v0.7):
718+
719+ ``` python
720+ from contextlib import contextmanager
721+ import xarray as xr
722+
723+ # Slice source argument `path` is just an example.
724+ @contextmanager
725+ def get_dataset (path : str ) -> xr.Dataset:
726+ # Bind any resources and provide dataset here, e.g.:
727+ dataset = xr.open_dataset(path)
728+ try :
729+ # Yield (not return!) the dataset
730+ yield dataset
731+ finally :
732+ # Cleanup code here, release any bound resources, e.g.:
733+ dataset.close()
734+ ```
735+
736+ You can also implement your slice source as a class derived from the abstract
737+ ` zappend.api.SliceSource ` class. Its interface methods are:
698738
699739* ` get_dataset() ` : a zero-argument method that returns the slice dataset of type
700740 ` xarray.Dataset ` . You must implement this abstract method.
701- * ` close() ` : perform any resource cleanup tasks
741+ * ` close() ` : Optional method. Put your cleanup code here.
702742 (in zappend < v0.7, the ` close ` method was called ` dispose ` ).
703- * ` __init__() ` : optional constructor that receives any arguments passed to the
704- slice source.
705743
706- Here is the template code for your own slice source implementation:
707744
708745``` python
709746import xarray as xr
710747from zappend.api import SliceSource
711748
712749class MySliceSource (SliceSource ):
713- # Pass any positional and keyword arguments that you need
714- # to the constructor. `path` is just an example.
750+ # Slice source argument `path` is just an example.
715751 def __init__ (self , path : str ):
716752 self .path = path
717- self .ds = None
753+ self .dataset = None
718754
719755 def get_dataset (self ) -> xr.Dataset:
720- # Write code here that obtains the dataset.
721- self .ds = xr.open_dataset(self .path)
722- # You can put any processing here.
723- return self .ds
756+ # Bind any resources and provide dataset here, e.g.:
757+ self .dataset = xr.open_dataset(self .path)
758+ return self .dataset
724759
725760 def close (self ):
726- # Write code here that performs cleanup.
727- if self .ds is not None :
728- self .ds.close()
729- self .ds = None
761+ # Cleanup code here, release any bound resources, e.g.:
762+ if self .dataset is not None :
763+ self .dataset.close()
730764```
731765
732- Instead of providing instances of ` SliceSource ` as slice items, it is often
733- easier to pass your ` SliceSource ` class and let ` zappend ` pass the slice item as
734- arguments(s) to your ` SliceSource ` 's constructor. This can be achieved using the
735- the ` slice_source ` configuration setting. If you need to access configuration
736- settings, it is even required to use the ` slice_source ` setting.
766+ You may prefer implementing a class because your slice source is complex and you want
767+ to split its logic into separate methods. You may also just prefer classes as a matter
768+ of your personal taste. Another advantage of using a class is that you can pass
769+ instances of it as slice items to the ` zappend ` function without further configuration.
770+ However, the intended use of a slice source is to configure it by specifying the
771+ ` slice_source ` setting. In a JSON or YAML configuration file it specifies the fully
772+ qualified name of the slice source function or class:
737773
738774``` json
739775{
740776 "slice_source" : " mymodule.MySliceSource"
741777}
742778```
743779
744- The ` slice_source ` setting can actually be ** any Python function** that returns a
745- valid slice item as described above such as a file path or URI, or
746- an ` xarray.Dataset ` .
780+ If you use the ` zappend ` function, you can pass the function or class directly:
781+
782+ ``` python
783+ zappend([" slice-1.nc" , " slice-2.nc" , " slice-3.nc" ],
784+ target_dir = " target.zarr" ,
785+ slice_source = MySliceSource)
786+ ```
747787
748- If a slice source is configured , each slice item passed to ` zappend ` is passed as
749- argument to your slice source.
788+ If the slice source setting is used , each slice item passed to ` zappend ` is passed as
789+ argument(s) to your slice source.
750790
751791* Slices passed to the ` zappend ` CLI command become slice source arguments
752792 of type ` str ` .
@@ -765,88 +805,71 @@ You can also pass extra keyword arguments to your slice source using the
765805every slice source invocation. Keyword arguments passed as slice items with same name
766806as in ` slice_source_kwargs ` take precedence.
767807
768- In addition, your slice source function or class constructor specified by
769- ` slice_source ` may define a 1st positional argument or keyword argument
770- named ` ctx ` , which will receive the current processing context of type
771- ` zappend.api.Context ` . This can be useful if you need to read configuration
772- settings.
808+ Slice arguments are passed to your slice source for every slice. If your slice source
809+ has many parameters that stay the same for all slices you may prefer providing
810+ parameters as settings in the configuration. This can be achieved using the ` extra `
811+ setting:
812+
813+ ``` json
814+ {
815+ "extra" : {
816+ "quantiles" : [0.1 , 0.5 , 0.9 ],
817+ "use_threshold" : true ,
818+ "filter" : " gauss"
819+ }
820+ }
821+ ```
822+
823+ To access the settings in ` extra ` your slice source function or class constructor
824+ must define a special argument named ` ctx ` . It must be a 1st positional argument or
825+ a keyword argument. The argument ` ctx ` is the current processing context of type
826+ ` zappend.api.Context ` that also contains the configuration.
773827
774828Here is a more advanced example of a slice source that opens datasets from a given
775829file path and averages the values along the time dimension:
776830
777831``` python
778832import numpy as np
779833import xarray as xr
834+ from zappend.api import Context
780835from zappend.api import SliceSource
781836from zappend.api import zappend
782837
783- def get_mean_time (slice_ds : xr.Dataset) -> xr.DataArray:
784- time = slice_ds.time
785- t0 = time[0 ]
786- dt = time[- 1 ] - t0
787- return xr.DataArray(np.array([t0 + dt / 2 ],
788- dtype = slice_ds.time.dtype),
789- dims = " time" )
790-
791- def get_mean_slice (slice_ds : xr.Dataset) -> xr.Dataset:
792- mean_slice_ds = slice_ds.mean(" time" )
793- # Re-introduce time dimension of size one
794- mean_slice_ds = mean_slice_ds.expand_dims(" time" , axis = 0 )
795- mean_slice_ds.coords[" time" ] = get_mean_time(slice_ds)
796- return mean_slice_ds
797-
798838class MySliceSource (SliceSource ):
799- def __init__ (self , slice_path ):
839+ def __init__ (self , ctx : Context, slice_path : str ):
840+ self .quantiles = ctx.config.extra.get(" quantiles" , [0.5 ])
800841 self .slice_path = slice_path
801842 self .ds = None
802843
803844 def get_dataset (self ):
804845 self .ds = xr.open_dataset(self .slice_path)
805- return get_mean_slice (self .ds)
846+ return self .get_agg_slice (self .ds)
806847
807848 def close (self ):
808849 if self .ds is not None :
809850 self .ds.close()
810- self .ds = None
811851
852+ def get_agg_slice (self , slice_ds : xr.Dataset) -> xr.Dataset:
853+ agg_slice_ds = slice_ds.quantile(self .quantiles, dim = " time" )
854+ # Re-introduce time dimension of size one
855+ agg_slice_ds = agg_slice_ds.expand_dims(" time" , axis = 0 )
856+ agg_slice_ds.coords[" time" ] = self .get_mean_time(slice_ds)
857+ return agg_slice_ds
858+
859+ @ classmethod
860+ def get_mean_time (cls , slice_ds : xr.Dataset) -> xr.DataArray:
861+ time = slice_ds.time
862+ t0 = time[0 ]
863+ dt = time[- 1 ] - t0
864+ return xr.DataArray(np.array([t0 + dt / 2 ],
865+ dtype = slice_ds.time.dtype),
866+ dims = " time" )
867+
812868zappend([" slice-1.nc" , " slice-2.nc" , " slice-3.nc" ],
813869 target_dir = " target.zarr" ,
814870 slice_source = MySliceSource)
815871```
816872
817- Since zappend 0.7, a slice source can also be written as a Python
818- [ context manager] ( https://docs.python.org/3/library/contextlib.html ) ,
819- which allows you implementing the ` get_dataset() ` and ` close() `
820- methods in one single function, instead of a class. Here is the above example
821- written as context manager.
822-
823- ``` python
824- from contextlib import contextmanager
825- import numpy as np
826- import xarray as xr
827- from zappend.api import zappend
828-
829- # Same as above here
830-
831- @contextmanager
832- def get_slice_dataset (slice_path ):
833- # allocate resources here
834- ds = xr.open_dataset(slice_path)
835- mean_ds = get_mean_slice(ds)
836- try :
837- # yield (!) the slice dataset
838- # so it can be appended
839- yield mean_ds
840- finally :
841- # after slice dataset has been appended
842- # release resources here
843- ds.close()
844-
845- zappend([" slice-1.nc" , " slice-2.nc" , " slice-3.nc" ],
846- target_dir = " target.zarr" ,
847- slice_source = get_slice_dataset)
848- ```
849-
850873## Profiling
851874
852875Runtime profiling is very important for understanding program runtime behavior
0 commit comments