@@ -666,7 +666,13 @@ You can also use dataset objects of type
666666[ ` xarray.Dataset ` ] ( https://docs.xarray.dev/en/stable/generated/xarray.Dataset.html )
667667as slice item. Such objects may originate from opening datasets from some storage, e.g.,
668668` xarray.open_dataset(slice_store, ...) ` or by composing, aggregating, resampling
669- slice datasets from other datasets and data variables.
669+ slice datasets from other datasets and dataset variables.
670+
671+ !!! note "Datasets are not closed automatically"
672+ If you pass [ ` xarray.Dataset ` ] ( https://docs.xarray.dev/en/stable/generated/xarray.Dataset.html ) objects to ` zappend ` they will not be
673+ automatically closed. This may become be issue, if you have many datasets
674+ and each one binds resources such as open file handles. Consider using a
675+ _ slice source_ then, see below.
670676
671677Chunked data arrays of an ` xarray.Dataset ` are usually instances of
672678[ Dask arrays] ( https://docs.dask.org/en/stable/array.html ) , to allow for out-of-core
@@ -676,9 +682,9 @@ for computed slice datasets, especially if the specified target dataset chunking
676682different from the slice dataset chunking. This may cause Dask graphs to be
677683computed multiple times if the source chunking overlaps multiple target chunks,
678684potentially causing large resource overheads while recomputing and/or reloading the same
679- source chunks multiple times. In such cases it can help to "terminate"
680- computations for each slice by persisting the computed dataset first and then to
681- reopen it. This can be specified using the ` persist_mem_slice ` setting:
685+ source chunks multiple times. In such cases it can help to "terminate" computations for
686+ each slice by persisting the computed dataset first and then to reopen it. This can be
687+ specified using the ` persist_mem_slice ` setting:
682688
683689``` json
684690{
@@ -692,61 +698,96 @@ at the cost of additional i/o. It therefore defaults to `false`.
692698
693699#### Slice Sources
694700
695- If you need some custom cleanup after a slice has been processed and appended to the
696- target dataset, you can use instances of ` zappend.api.SliceSource ` as slice items.
697- The ` SliceSource ` methods with special meaning are:
701+ A slice source gives you full control about how a slice dataset is created, loaded,
702+ or generated and how its bound resources, if any, are released. In its simplest form,
703+ a slice source is a plain Python function that can take any arguments and returns
704+ an ` xarray.Dataset ` :
705+
706+ ``` python
707+ import xarray as xr
708+
709+ # Slice source argument `path` is just an example.
710+ def get_dataset (path : str ) -> xr.Dataset:
711+ # Provide dataset here. No matter how, e.g.:
712+ return xr.open_dataset(path)
713+ ```
714+
715+ If you need cleanup code that is executed after the slice dataset has been appended,
716+ you can turn your slice source function into a
717+ [ context manager] ( https://docs.python.org/3/library/contextlib.html )
718+ (new in zappend v0.7):
719+
720+ ``` python
721+ from contextlib import contextmanager
722+ import xarray as xr
723+
724+ # Slice source argument `path` is just an example.
725+ @contextmanager
726+ def get_dataset (path : str ) -> xr.Dataset:
727+ # Bind any resources and provide dataset here, e.g.:
728+ dataset = xr.open_dataset(path)
729+ try :
730+ # Yield (not return!) the dataset
731+ yield dataset
732+ finally :
733+ # Cleanup code here, release any bound resources, e.g.:
734+ dataset.close()
735+ ```
736+
737+ You can also implement your slice source as a class derived from the abstract
738+ ` zappend.api.SliceSource ` class. Its interface methods are:
698739
699740* ` get_dataset() ` : a zero-argument method that returns the slice dataset of type
700741 ` xarray.Dataset ` . You must implement this abstract method.
701- * ` close() ` : perform any resource cleanup tasks
742+ * ` close() ` : Optional method. Put your cleanup code here.
702743 (in zappend < v0.7, the ` close ` method was called ` dispose ` ).
703- * ` __init__() ` : optional constructor that receives any arguments passed to the
704- slice source.
705744
706- Here is the template code for your own slice source implementation:
707745
708746``` python
709747import xarray as xr
710748from zappend.api import SliceSource
711749
712750class MySliceSource (SliceSource ):
713- # Pass any positional and keyword arguments that you need
714- # to the constructor. `path` is just an example.
751+ # Slice source argument `path` is just an example.
715752 def __init__ (self , path : str ):
716753 self .path = path
717- self .ds = None
754+ self .dataset = None
718755
719756 def get_dataset (self ) -> xr.Dataset:
720- # Write code here that obtains the dataset.
721- self .ds = xr.open_dataset(self .path)
722- # You can put any processing here.
723- return self .ds
757+ # Bind any resources and provide dataset here, e.g.:
758+ self .dataset = xr.open_dataset(self .path)
759+ return self .dataset
724760
725761 def close (self ):
726- # Write code here that performs cleanup.
727- if self .ds is not None :
728- self .ds.close()
729- self .ds = None
762+ # Cleanup code here, release any bound resources, e.g.:
763+ if self .dataset is not None :
764+ self .dataset.close()
730765```
731766
732- Instead of providing instances of ` SliceSource ` as slice items, it is often
733- easier to pass your ` SliceSource ` class and let ` zappend ` pass the slice item as
734- arguments(s) to your ` SliceSource ` 's constructor. This can be achieved using the
735- the ` slice_source ` configuration setting. If you need to access configuration
736- settings, it is even required to use the ` slice_source ` setting.
767+ You may prefer implementing a class because your slice source is complex and you want
768+ to split its logic into separate methods. You may also just prefer classes as a matter
769+ of your personal taste. Another advantage of using a class is that you can pass
770+ instances of it as slice items to the ` zappend ` function without further configuration.
771+ However, the intended use of a slice source is to configure it by specifying the
772+ ` slice_source ` setting. In a JSON or YAML configuration file it specifies the fully
773+ qualified name of the slice source function or class:
737774
738775``` json
739776{
740777 "slice_source" : " mymodule.MySliceSource"
741778}
742779```
743780
744- The ` slice_source ` setting can actually be ** any Python function** that returns a
745- valid slice item as described above such as a file path or URI, or
746- an ` xarray.Dataset ` .
781+ If you use the ` zappend ` function, you can pass the function or class directly:
782+
783+ ``` python
784+ zappend([" slice-1.nc" , " slice-2.nc" , " slice-3.nc" ],
785+ target_dir = " target.zarr" ,
786+ slice_source = MySliceSource)
787+ ```
747788
748- If a slice source is configured , each slice item passed to ` zappend ` is passed as
749- argument to your slice source.
789+ If the slice source setting is used , each slice item passed to ` zappend ` is passed as
790+ argument(s) to your slice source.
750791
751792* Slices passed to the ` zappend ` CLI command become slice source arguments
752793 of type ` str ` .
@@ -762,90 +803,73 @@ argument to your slice source.
762803
763804You can also pass extra keyword arguments to your slice source using the
764805` slice_source_kwargs ` setting. Keyword arguments passed as slice items take
765- precedence, that is, they overwrite arguments passed by ` slice_source_kawrgs ` .
806+ precedence, that is, they overwrite arguments passed by ` slice_source_kwargs ` .
807+
808+ If your slice source has many parameters that stay the same for all slices you may
809+ prefer providing parameters as configuration settings, rather than function or class
810+ arguments. This can be achieved using the ` extra ` setting:
766811
767- In addition, your slice source function or class constructor specified by
768- ` slice_source ` may define a 1st positional argument or keyword argument
769- named ` ctx ` , which will receive the current processing context of type
770- ` zappend.api.Context ` . This can be useful if you need to read configuration
771- settings.
812+ ``` json
813+ {
814+ "extra" : {
815+ "quantiles" : [0.1 , 0.5 , 0.9 ],
816+ "use_threshold" : true ,
817+ "filter" : " gauss"
818+ }
819+ }
820+ ```
821+
822+ To access the settings in ` extra ` your slice source function or class constructor
823+ must define a special argument named ` ctx ` . It must be a 1ˢᵗ positional argument or
824+ a keyword argument. The argument ` ctx ` is the current processing context of type
825+ ` zappend.api.Context ` that also contains the configuration. The settings in ` extra `
826+ can be accessed using the dictionary returned from ` ctx.config.extra ` .
772827
773828Here is a more advanced example of a slice source that opens datasets from a given
774829file path and averages the values along the time dimension:
775830
776831``` python
777832import numpy as np
778833import xarray as xr
834+ from zappend.api import Context
779835from zappend.api import SliceSource
780836from zappend.api import zappend
781837
782- def get_mean_time (slice_ds : xr.Dataset) -> xr.DataArray:
783- time = slice_ds.time
784- t0 = time[0 ]
785- dt = time[- 1 ] - t0
786- return xr.DataArray(np.array([t0 + dt / 2 ],
787- dtype = slice_ds.time.dtype),
788- dims = " time" )
789-
790- def get_mean_slice (slice_ds : xr.Dataset) -> xr.Dataset:
791- mean_slice_ds = slice_ds.mean(" time" )
792- # Re-introduce time dimension of size one
793- mean_slice_ds = mean_slice_ds.expand_dims(" time" , axis = 0 )
794- mean_slice_ds.coords[" time" ] = get_mean_time(slice_ds)
795- return mean_slice_ds
796-
797838class MySliceSource (SliceSource ):
798- def __init__ (self , slice_path ):
839+ def __init__ (self , ctx : Context, slice_path : str ):
840+ self .quantiles = ctx.config.extra.get(" quantiles" , [0.5 ])
799841 self .slice_path = slice_path
800842 self .ds = None
801843
802844 def get_dataset (self ):
803845 self .ds = xr.open_dataset(self .slice_path)
804- return get_mean_slice (self .ds)
846+ return self .get_agg_slice (self .ds)
805847
806848 def close (self ):
807849 if self .ds is not None :
808850 self .ds.close()
809- self .ds = None
810851
852+ def get_agg_slice (self , slice_ds : xr.Dataset) -> xr.Dataset:
853+ agg_slice_ds = slice_ds.quantile(self .quantiles, dim = " time" )
854+ # Re-introduce time dimension of size one
855+ agg_slice_ds = agg_slice_ds.expand_dims(" time" , axis = 0 )
856+ agg_slice_ds.coords[" time" ] = self .get_mean_time(slice_ds)
857+ return agg_slice_ds
858+
859+ @ classmethod
860+ def get_mean_time (cls , slice_ds : xr.Dataset) -> xr.DataArray:
861+ time = slice_ds.time
862+ t0 = time[0 ]
863+ dt = time[- 1 ] - t0
864+ return xr.DataArray(np.array([t0 + dt / 2 ],
865+ dtype = slice_ds.time.dtype),
866+ dims = " time" )
867+
811868zappend([" slice-1.nc" , " slice-2.nc" , " slice-3.nc" ],
812869 target_dir = " target.zarr" ,
813870 slice_source = MySliceSource)
814871```
815872
816- Since zappend 0.7, a slice source can also be written as a Python
817- [ context manager] ( https://docs.python.org/3/library/contextlib.html ) ,
818- which allows you implementing the ` get_dataset() ` and ` close() `
819- methods in one single function, instead of a class. Here is the above example
820- written as context manager.
821-
822- ``` python
823- from contextlib import contextmanager
824- import numpy as np
825- import xarray as xr
826- from zappend.api import zappend
827-
828- # Same as above here
829-
830- @contextmanager
831- def get_slice_dataset (slice_path ):
832- # allocate resources here
833- ds = xr.open_dataset(slice_path)
834- mean_ds = get_mean_slice(ds)
835- try :
836- # yield (!) the slice dataset
837- # so it can be appended
838- yield mean_ds
839- finally :
840- # after slice dataset has been appended
841- # release resources here
842- ds.close()
843-
844- zappend([" slice-1.nc" , " slice-2.nc" , " slice-3.nc" ],
845- target_dir = " target.zarr" ,
846- slice_source = get_slice_dataset)
847- ```
848-
849873## Profiling
850874
851875Runtime profiling is very important for understanding program runtime behavior
0 commit comments