From a48c6c0999fda79069566582909bb95f319e8e2a Mon Sep 17 00:00:00 2001
From: Tonio Fincke
Date: Thu, 22 Jul 2021 17:10:17 +0200
Subject: [PATCH 1/7] created cube convention document

---
 docs/source/cubeconv.md | 180 ++++++++++++++++++++++++++++++++++++++++
 docs/source/cubespec.md | 157 ----------------------------------------
 2 files changed, 180 insertions(+), 157 deletions(-)
 create mode 100644 docs/source/cubeconv.md
 delete mode 100644 docs/source/cubespec.md

diff --git a/docs/source/cubeconv.md b/docs/source/cubeconv.md
new file mode 100644
index 000000000..4f68f4617
--- /dev/null
+++ b/docs/source/cubeconv.md
@@ -0,0 +1,180 @@
+# xcube Dataset Convention
+
+This document describes a convention for *xcube datasets*, which are data cubes
+in the xcube sense. Any dataset can be considered a data cube as long as at
+least a subset of its data variables are cube-like, i.e., meet the requirements
+listed in this document.
+
+The key words “MUST”, “MUST NOT”, “REQUIRED”, “SHALL”, “SHALL NOT”,
+“SHOULD”, “SHOULD NOT”, “RECOMMENDED”, “MAY”, and “OPTIONAL” in this
+document are to be interpreted as described in
+[RFC 2119](https://www.ietf.org/rfc/rfc2119.txt).
+
+## Document Status
+
+This is the latest version, which is still in development.
+
+Version: 1.0, draft
+
+Updated: 21.07.2021
+
+
+## Motivation
+
+For many users of Earth observation data, common operations such as
+multivariate co-registration, extraction, comparison, and analysis of
+different data sources are difficult, because data is provided in various
+formats and at different spatio-temporal resolutions.
+
+## High-level requirements
+
+xcube datasets
+
+* SHALL be time series of gridded, geo-spatial, geo-physical variables.
+* SHALL use a common, equidistant, global or regional geo-spatial grid.
+* SHALL be easy to read, write, process, generate.
+* SHALL conform to the requirements of analysis ready data (ARD).
+* SHALL be compatible with existing tools and APIs.
+* SHALL conform to standards or common practices and follow a common
+  data model.
+* SHALL be formatted as self-contained datasets.
+* SHALL be "cloud ready", in the sense that subsets of the data can be
+  accessed by individual URIs.
+
+ARD links:
+
+* http://ceos.org/ard/
+* https://www.usgs.gov/core-science-systems/nli/landsat/us-landsat-analysis-ready-data
+* https://medium.com/planet-stories/analysis-ready-data-defined-5694f6f48815
+
+
+## xcube Dataset Schemas
+
+### Basic Schema
+
+* Attributes:
+  * SHALL be [CF](http://cfconventions.org/) >= 1.7
+  * SHOULD adhere to
+    [Attribute Convention for Data Discovery](http://wiki.esipfed.org/index.php/Attribute_Convention_for_Data_Discovery)
+* Dimensions:
+  * SHALL all be greater than zero.
+  * SHALL include two spatial dimensions
+  * SHOULD include a dimension `time`
+  * SHOULD include a dimension `bnds` of size 2 that may be used by bounding
+    coordinate variables
+* Coordinate Variables:
+  * SHALL contain labels for a dimension
+  * SHOULD be 1-dimensional
+  * MAY be 2-dimensional if, e.g., they are bounds coordinate variables (see
+    below) or they carry `latitude`/`longitude` values in case of a generic,
+    non-geographic grid (see Generic Schema below)
+  * 1-dimensional coordinate variables SHOULD be named like the dimension they
+    describe
+  * For each dimension of a data variable, a coordinate variable MUST exist
+* Temporal coordinate variables:
+  * SHALL provide time coordinates for a given time index.
+  * MAY be non-equidistant or equidistant.
+  * SHOULD be named `time`
+  * One variable value SHALL provide observation or average time of
+    *cell centers*.
+  * Attributes:
+    * Temporal coordinate variables MUST have the attributes `units` and
+      `standard_name`, and MAY have any others.
+    * `standard_name` MUST be `"time"`, `units` MUST have format
+      `"<deltatime> since <datetime>"`, where `<datetime>` must be given in
+      ISO format. `calendar` MAY be given; if not, `"gregorian"` is
+      assumed.
+* Spatial coordinate variables:
+  * SHALL provide spatial coordinates for a given spatial index.
+  * SHALL be equidistant in either angular or metric units
+  * Different spatial coordinate variables MAY have different spatial
+    resolutions
+* Bounds coordinate variables:
+  * SHOULD be included for any spatial or temporal coordinate variable
+  * SHALL consist of two dimensions: the one of the respective coordinate
+    variable and another one of length 2, which SHOULD be named `bnds`
+  * SHOULD be named `<dim>_bnds`
+  * `<dim>_bnds[<dim>, 0]` SHALL provide the *lower cell boundary*,
+    `<dim>_bnds[<dim>, 1]` SHALL provide the *upper cell boundary*
+* Data variables:
+  * MAY have any dimensionality, including no dimensions at all.
+  * SHALL have the spatial dimensions at the innermost position in case it has
+    spatial dimensions (e.g., [..., y, x])
+  * SHALL have its time dimension at the outermost position in case it has a
+    time dimension (e.g., [time, ...])
+  * MAY have extra dimensions, e.g. `layer` (of the atmosphere) or
+    `band` (of a spectrum). These extra dimensions MUST be positioned between
+    the time and the spatial coordinates
+  * SHALL provide *cube cells* with the dimensions as index.
+  * SHOULD specify the `units` metadata attribute.
+  * SHOULD specify metadata attributes that are used to identify
+    missing values, namely `_FillValue` and / or `valid_min`,
+    `valid_max`, see notes in CF conventions on these attributes.
+  * MAY specify metadata attributes that can be used to visualise the
+    data:
+    * `color_bar_name`: Name of a predefined colour mapping.
+      The colour bar is applied between a minimum and a maximum value.
+    * `color_value_min`, `color_value_max`: Minimum and maximum value
+      for applying the colour bar. If not provided, minimum and maximum
+      default to `valid_min`, `valid_max`. If neither are provided,
+      minimum and maximum default to `0` and `1`.
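+
+The following example is informative rather than normative. It sketches how a
+minimal dataset conforming to the Basic Schema might be constructed with
+[xarray](http://xarray.pydata.org/); the variable name `chl`, the grid, and
+all values are arbitrary placeholders:
+
+```python
+import numpy as np
+import pandas as pd
+import xarray as xr
+
+time = pd.date_range("2021-01-01", periods=3, freq="D")
+lat = 50.0 + 0.1 * (np.arange(30) + 0.5)  # equidistant cell centers
+lon = 10.0 + 0.1 * (np.arange(20) + 0.5)
+
+dataset = xr.Dataset(
+    data_vars={
+        # time dimension outermost, spatial dimensions innermost
+        "chl": (("time", "lat", "lon"),
+                np.zeros((time.size, lat.size, lon.size)),
+                {"units": "mg m-3"}),
+    },
+    coords={
+        "time": ("time", time),
+        "lat": ("lat", lat,
+                {"standard_name": "latitude", "units": "degrees_north"}),
+        "lon": ("lon", lon,
+                {"standard_name": "longitude", "units": "degrees_east"}),
+        # bounds coordinate variable for `lat`; `lon_bnds` and `time_bnds`
+        # would be defined analogously
+        "lat_bnds": (("lat", "bnds"),
+                     np.stack([lat - 0.05, lat + 0.05], axis=-1)),
+    },
+    attrs={"Conventions": "CF-1.7"},
+)
+```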
+
+### WGS84 Schema (extends Basic)
+
+* Dimensions:
+  * SHALL include two spatial dimensions, which SHOULD be named `lat` and `lon`
+* Spatial coordinate variables:
+  * SHALL use WGS84 (EPSG:4326) CRS.
+  * One entry of the variable describing the latitude SHALL provide the
+    observation or average latitude of *cell centers*. It SHOULD have the
+    attributes `standard_name="latitude"` and `units="degrees_north"`.
+  * One entry of the variable describing the longitude SHALL provide the
+    observation or average longitude of *cell centers*. It SHOULD have the
+    attributes `standard_name="longitude"` and `units="degrees_east"`.
+
+### Generic Schema (extends Basic)
+
+* Dimensions:
+  * SHALL include two spatial dimensions, which SHOULD be named `y` and `x`
+* Spatial coordinate variables:
+  * MAY use any spatial grid and CRS.
+  * SHOULD have attributes `standard_name`, `units`
+  * MAY have `lat[y,x]`: latitude of *cell centers*.
+    * Attributes: `standard_name="latitude"`, `units="degrees_north"`.
+  * MAY have `lon[y,x]`: longitude of *cell centers*.
+    * Attributes: `standard_name="longitude"`, `units="degrees_east"`.
+* Grid Mapping variable:
+  * SHALL be included in case the CRS is not WGS84.
+  * SHALL NOT carry any data, therefore it MAY be of any type
+  * SHOULD be named `crs`
+  * MUST have attributes that describe a CF Grid Mapping v1.8 (see
+    http://cfconventions.org/Data/cf-conventions/cf-conventions-1.8/cf-conventions.html#grid-mappings-and-projections
+    ). This means that there MUST be one of the following:
+    * an attribute `crs_wkt` that describes a CRS in WKT format
+    * an attribute `spatial_ref` (e.g., an EPSG code)
+    * an attribute `grid_mapping_name`. If this is given, more attributes
+      MAY be required, depending on the grid mapping.
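+
+The following sketch is informative, not part of the convention. It shows one
+way to attach such a grid mapping variable with
+[pyproj](https://pyproj4.github.io/pyproj/), assuming a dataset `dataset`
+whose spatial coordinates are given in the (arbitrarily chosen) EPSG:3035
+CRS:
+
+```python
+import pyproj
+import xarray as xr
+
+# CF grid mapping attributes, including `crs_wkt` and `grid_mapping_name`
+cf_attrs = pyproj.CRS.from_epsg(3035).to_cf()
+
+# a dataless variable named `crs` that carries the grid mapping attributes
+dataset = dataset.assign(crs=xr.DataArray(0, attrs=cf_attrs))
+
+# per the CF conventions, data variables refer to the grid mapping by name
+for var_name in dataset.data_vars:
+    if var_name != "crs":
+        dataset[var_name].attrs["grid_mapping"] = "crs"
+```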
+
+
+## xcube EO Processing Levels
+
+This section provides an attempt to characterize xcube datasets
+generated from Earth Observation (EO) data according to their
+processing levels as they are commonly used in EO data processing.
+
+### Level-1C and Level-2C
+
+* Generated from Level-1A, -1B, -2A, -2B EO data.
+* Spatially resampled to common grid
+  * Typically resampled at original resolution.
+  * May be down-sampled: aggregation/integration.
+  * May be upsampled: interpolation.
+* No temporal aggregation/integration.
+* Temporally non-equidistant.
+
+### Level-3
+
+* Generated from Level-2C or -3 by temporal aggregation.
+* No spatial processing.
+* Temporally equidistant.
+* Temporally integrated/aggregated.
diff --git a/docs/source/cubespec.md b/docs/source/cubespec.md
deleted file mode 100644
index a5a69c486..000000000
--- a/docs/source/cubespec.md
+++ /dev/null
@@ -1,157 +0,0 @@
-# xcube Dataset Specification
-
-This document provides a technical specification of the protocol and
-format for *xcube datasets*, data cubes in the xcube sense.
-
-The key words “MUST”, “MUST NOT”, “REQUIRED”, “SHALL”, “SHALL NOT”,
-“SHOULD”, “SHOULD NOT”, “RECOMMENDED”, “MAY”, and “OPTIONAL” in this
-document are to be interpreted as described in
-[RFC 2119](https://www.ietf.org/rfc/rfc2119.txt).
-
-## Document Status
-
-This is the latest version, which is still in development.
-
-Version: 1.0, draft
-
-Updated: 31.05.2018
-
-
-## Motivation
-
-For many users of Earth observation data, multivariate coregistration,
-extraction, comparison, and analysis of different data sources is
-difficult, while data is provided in various formats and at different
-spatio-temporal resolutions.
-
-## High-level requirements
-
-xcube datasets
-
-* SHALL be time series of gridded, geo-spatial, geo-physical variables.
-* SHALL use a common, equidistant, global or regional geo-spatial grid.
-* SHALL shall be easy to read, write, process, generate.
-* SHALL conform to the requirements of analysis ready data (ARD).
-* SHALL be compatible with existing tools and APIs.
-* SHALL conform to standards or common practices and follow a common
-  data model.
-* SHALL be formatted as self-contained datasets.
-* SHALL be "cloud ready", in the sense that subsets of the data can be
-  accessed by individual URIs.
-
-ARD links:
-
-* http://ceos.org/ard/
-* https://landsat.usgs.gov/ard
-* https://medium.com/planet-stories/analysis-ready-data-defined-5694f6f48815
-
-
-## xcube Dataset Schemas
-
-### Basic Schema
-
-* Attributes metadata convention
-  * SHALL be [CF](http://cfconventions.org/) >= 1.7
-  * SHOULD adhere to
-    [Attribute Convention for Data Discovery](http://wiki.esipfed.org/index.php/Attribute_Convention_for_Data_Discovery)
-* Dimensions:
-  * SHALL be at least `time`, `bnds`, and MAY be any others.
-  * SHALL all be greater than zero, but `bnds` must always be two.
-* Temporal coordinate variables:
-  * SHALL provide time coordinates for given time index.
-  * MAY be non-equidistant or equidistant.
-  * `time[time]` SHALL provide observation or average time of
-    *cell centers*.
-  * `time_bnds[time, bnds]` SHALL provide observation or integration
-    time of *cell boundaries*.
-  * Attributes:
-    * Temporal coordinate variables MUST have `units`, `standard_name`,
-      and any others.
-    * `standard_name` MUST be `"time"`, `units` MUST have format
-      `"<deltatime> since <datetime>"`, where `datetime` must have
-      ISO-format. `calendar` may be given, if not, `"gregorian"` is
-      assumed.
-* Spatial coordinate variables
-  * SHALL provide spatial coordinates for given spatial index.
-  * SHALL be equidistant in either angular or metric units
-* Cube variables:
-  * SHALL provide *cube cells* with the dimensions as index.
-  * SHALL have shape
-    * `[time, ..., lat, lon]` (see WGS84 schema) or
-    * `[time, ..., y, x]` (see Generic schema)
-  * MAY have extra dimensions, e.g. `layer` (of the atmosphere),
-    `band` (of a spectrum).
-  * SHALL specify the `units` metadata attribute.
-  * SHOULD specify metadata attributes that are used to identify
-    missing values, namely `_FillValue` and / or `valid_min`,
-    `valid_max`, see notes in CF conventions on these attributes.
-  * MAY specify metadata attributes that can be used to visualise the
-    data:
-    * `color_bar_name`: Name of a predefined colour mapping.
-      The colour bar is applied between a minimum and a maximum value.
-    * `color_value_min`, `color_value_max`: Minimum and maximum value
-      for applying the colour bar. If not provided, minimum and maximum
-      default to `valid_min`, `valid_max`. If neither are provided,
-      minimum and maximum default to `0` and `1`.
-
-### WGS84 Schema (extends Basic)
-
-* Dimensions:
-  * SHALL be at least `time`, `lat`, `lon`, `bnds`, and MAY be any
-    others.
-* Spatial coordinate variables:
-  * SHALL use WGS84 (EPSG:4326) CRS.
-  * SHALL have `lat[lat]` that provides observation or average latitude
-    of *cell centers*
-    with attributes: `standard_name="latitude"` `units="degrees_north"`.
-  * SHALL have `lon[lon]` that provides observation or average longitude
-    of *cell centers* with attributes: `standard_name="longitude"` and
-    `units="degrees_east"`.
-  * SHOULD HAVE `lat_bnds[lat, bnds]`, `lon_bnds[lon, bnds]`: provide
-    geodetic observation or integration coordinates of
-    *cell boundaries*.
-* Cube variables:
-  * SHALL have shape `[time, ..., lat, lon]`.
-
-### Generic Schema (extends Basic)
-
-* Dimensions: `time`, `y`, `x`, `bnds`, and any others.
-  * SHALL be at least `time`, `y`, `x`, `bnds`, and MAY be any others.
-* Spatial coordinate variables:
-  * Any spatial grid and CRS.
-  * `y[y]`, `x[x]`: provide spatial observation or average coordinates
-    of *cell centers*.
-    * Attributes: `standard_name`, `units`, other units describe the
-      CRS / projections, see CF.
-  * `y_bnds[y, bnds]`, `x_bnds[x, bnds]`: provide spatial observation
-    or integration coordinates of *cell boundaries*.
-  * MAY have `lat[y,x]`: latitude of *cell centers*.
-    * Attributes: `standard_name="latitude"`, `units="degrees_north"`.
-  * `lon[y,x]`: longitude of *cell centers*.
-    * Attributes: `standard_name="longitude"`, `units="degrees_east"`.
-* Cube variables:
-  * MUST have shape `[time, ..., y, x]`.
-
-
-## xcube EO Processing Levels
-
-This section provides an attempt to characterize xcube datasets
-generated from Earth Observation (EO) data according to their
-processing levels as they are commonly used in EO data processing.
-
-### Level-1C and Level-2C
-
-* Generated from Level-1A, -1B, -2A, -2B EO data.
-* Spatially resampled to common grid
-  * Typically resampled at original resolution.
-  * May be down-sampled: aggregation/integration.
-  * May be upsampled: interpolation.
-* No temporal aggregation/integration.
-* Temporally non-equidistant.
-
-### Level-3
-
-* Generated from Level-2C or -3 by temporal aggregation.
-* No spatial processing.
-* Temporally equidistant.
-* Temporally integrated/aggregated.

From b9de2acf7033abe6b22ff56eb0d1358adcede8ba Mon Sep 17 00:00:00 2001
From: Tonio Fincke
Date: Fri, 23 Jul 2021 11:47:55 +0200
Subject: [PATCH 2/7] formatting

---
 docs/source/cubeconv.md | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/docs/source/cubeconv.md b/docs/source/cubeconv.md
index 4f68f4617..c8adae265 100644
--- a/docs/source/cubeconv.md
+++ b/docs/source/cubeconv.md
@@ -98,9 +98,9 @@ ARD links:
 * Data variables:
   * MAY have any dimensionality, including no dimensions at all.
   * SHALL have the spatial dimensions at the innermost position in case it has
-    spatial dimensions (e.g., [..., y, x])
+    spatial dimensions (e.g., `[..., y, x]`)
   * SHALL have its time dimension at the outermost position in case it has a
-    time dimension (e.g., [time, ...])
+    time dimension (e.g., `[time, ...]`)
   * MAY have extra dimensions, e.g. `layer` (of the atmosphere) or
     `band` (of a spectrum). These extra dimensions MUST be positioned between
     the time and the spatial coordinates

From f1e6a15f159d910d35995a4292522c34f8eafa61 Mon Sep 17 00:00:00 2001
From: Tonio Fincke
Date: Tue, 27 Jul 2021 15:13:03 +0200
Subject: [PATCH 3/7] updated link

---
 docs/source/devguide.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/docs/source/devguide.md b/docs/source/devguide.md
index 3bd0b43f1..226c24315 100644
--- a/docs/source/devguide.md
+++ b/docs/source/devguide.md
@@ -181,7 +181,7 @@ Create new module in `xcube.core` and add your functions.
 For any functions added make sure naming is in line with other API.
 Add clear doc-string to the new API. Use Sphinx RST format.
 
-Decide if your API methods requires [xcube datasets](./cubespec.md) as
+Decide if your API methods requires [xcube datasets](./cubeconv.md) as
 inputs, if so, name the primary dataset argument `cube` and add a
 keyword parameter `cube_asserted: bool = False`.
 Otherwise name the primary dataset argument `dataset`.

From cb6b381782c7e13ad31b83b4a122ebcacafcaf0d Mon Sep 17 00:00:00 2001
From: Tonio Fincke
Date: Tue, 27 Jul 2021 15:14:33 +0200
Subject: [PATCH 4/7] added option to only normalize non-spatial properties

---
 xcube/core/normalize.py | 15 ++++++++++-----
 1 file changed, 10 insertions(+), 5 deletions(-)

diff --git a/xcube/core/normalize.py b/xcube/core/normalize.py
index 82f4e6ceb..96634a879 100644
--- a/xcube/core/normalize.py
+++ b/xcube/core/normalize.py
@@ -49,7 +49,9 @@ def cubify_dataset(ds: xr.Dataset) -> xr.Dataset:
 
 
 def normalize_dataset(ds: xr.Dataset,
-                      reverse_decreasing_lat: bool = False
+                      *,
+                      reverse_decreasing_lat: bool = False,
+                      do_not_normalize_spatial_dims: bool = False
                       ) -> xr.Dataset:
     """
    Normalize the geo- and time-coding upon opening the given dataset w.r.t. a common
@@ -74,12 +76,15 @@ def normalize_dataset(ds: xr.Dataset,
        are increasing
+    :param do_not_normalize_spatial_dims: If True, normalization steps that
+        affect the spatial dimensions are skipped
    :return: The normalized dataset, or the original dataset, if it is already "normal".
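+
+    Example (a sketch; assumes a dataset ``ds`` whose latitudes are stored
+    in decreasing order)::
+
+        normalized = normalize_dataset(ds, reverse_decreasing_lat=True)
+        # only normalize properties not related to the spatial dimensions
+        partly_normalized = normalize_dataset(
+            ds, do_not_normalize_spatial_dims=True)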
""" - ds = _normalize_zonal_lat_lon(ds) + if not do_not_normalize_spatial_dims: + ds = _normalize_zonal_lat_lon(ds) ds = normalize_coord_vars(ds) - ds = _normalize_lat_lon(ds) - ds = _normalize_lat_lon_2d(ds) + if not do_not_normalize_spatial_dims: + ds = _normalize_lat_lon(ds) + ds = _normalize_lat_lon_2d(ds) ds = _normalize_dim_order(ds) - ds = _normalize_lon_360(ds) + if not do_not_normalize_spatial_dims: + ds = _normalize_lon_360(ds) if reverse_decreasing_lat: ds = _reverse_decreasing_lat(ds) ds = normalize_missing_time(ds) From 9773c92b0b563b9a0555f26c9e39b64a7785d71a Mon Sep 17 00:00:00 2001 From: Tonio Fincke Date: Tue, 27 Jul 2021 15:27:55 +0200 Subject: [PATCH 5/7] added split and merge --- test/core/test_treatascube.py | 74 ++++++++++++++++++++++++++ xcube/core/treatascube.py | 99 +++++++++++++++++++++++++++++++++++ 2 files changed, 173 insertions(+) create mode 100644 test/core/test_treatascube.py create mode 100644 xcube/core/treatascube.py diff --git a/test/core/test_treatascube.py b/test/core/test_treatascube.py new file mode 100644 index 000000000..27222f2a8 --- /dev/null +++ b/test/core/test_treatascube.py @@ -0,0 +1,74 @@ +from xcube.core.treatascube import merge_cube +from xcube.core.treatascube import split_cube +from xcube.core.treatascube import verify_cube_subset +from xcube.core.new import new_cube +from xcube.core.verify import assert_cube + +import numpy as np +import xarray as xr +import unittest + + +class VerifyCubSubsetTest(unittest.TestCase): + + def test_all_well(self): + cube = new_cube(variables=dict(x=1, y=2)) + try: + verify_cube_subset(cube) + except ValueError as ve: + self.fail(f'No value error expected: {ve}') + + def test_no_vars(self): + cube = new_cube(variables=None) + with self.assertRaises(ValueError) as ve: + verify_cube_subset(cube) + self.assertEqual('Not at least one data variable ' + 'has spatial dimensions.', + f'{ve.exception}') + + def test_no_grid_mapping(self): + cube = new_cube(variables=dict(x=1, y=2)) + cube = cube.drop_dims('lat') + with self.assertRaises(ValueError) as ve: + verify_cube_subset(cube) + self.assertEqual('cannot find any grid mapping in dataset', + f'{ve.exception}') + + def test_no_time_info(self): + cube = new_cube(drop_bounds=True, variables=dict(x=1, y=2)) + cube = cube.drop_vars('time') + with self.assertRaises(ValueError) as ve: + verify_cube_subset(cube) + self.assertEqual('Dataset has no temporal information.', + f'{ve.exception}') + + +class SplitAndMergeTest(unittest.TestCase): + + def test_split(self): + cube = new_cube(variables=dict(x=1, y=2)) + splitcube, removed_data_vars = split_cube(cube) + self.assertEqual(dict(), removed_data_vars) + self.assertEqual(cube.data_vars.keys(), splitcube.data_vars.keys()) + + def test_split_remove_vars_and_merge(self): + cube = new_cube(variables=dict(x=1, y=2)) + non_cube_dims = {} + non_cube_dims['no_spatial_dims'] = \ + xr.DataArray([0.1, 0.2, 0.3, 0.4, 0.5], + dims=('time')) + non_cube_dims['no_dims'] = np.array(b'', dtype='|S1') + cube = cube.assign(non_cube_dims) + + with self.assertRaises(ValueError): + assert_cube(cube) + + splitcube, removed_data_vars = split_cube(cube) + self.assertEqual(non_cube_dims.keys(), removed_data_vars.keys()) + self.assertEqual(['x', 'y'], list(splitcube.data_vars.keys())) + + assert_cube(splitcube) + + merged_cube = merge_cube(splitcube, removed_data_vars) + self.assertEqual(['x', 'y', 'no_spatial_dims', 'no_dims'], + list(merged_cube.data_vars.keys())) diff --git a/xcube/core/treatascube.py b/xcube/core/treatascube.py new file mode 
diff --git a/xcube/core/treatascube.py b/xcube/core/treatascube.py
new file mode 100644
index 000000000..3e70a14fa
--- /dev/null
+++ b/xcube/core/treatascube.py
@@ -0,0 +1,99 @@
+# The MIT License (MIT)
+# Copyright (c) 2021 by the xcube development team and contributors
+#
+# Permission is hereby granted, free of charge, to any person obtaining a copy
+# of this software and associated documentation files (the "Software"), to deal
+# in the Software without restriction, including without limitation the rights
+# to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
+# copies of the Software, and to permit persons to whom the Software is
+# furnished to do so, subject to the following conditions:
+#
+# The above copyright notice and this permission notice shall be included in all
+# copies or substantial portions of the Software.
+#
+# THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+# IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+# FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
+# AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+# LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
+# OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+# SOFTWARE.
+
+from typing import Mapping, Tuple
+
+import xarray as xr
+from xcube.core.gridmapping import GridMapping
+from xcube.core.normalize import normalize_dataset
+from xcube.core.timecoord import get_time_range_from_data
+from xcube.core.verify import assert_cube
+
+
+def verify_cube_subset(dataset: xr.Dataset):
+    """
+    Verifies that the dataset fulfils the minimum requirements for a dataset
+    that either is a cube or can be converted into one. In order to do so,
+    the dataset
+
+    * must have two spatial dimensions
+    * must have at least one data variable that uses the spatial dimensions
+    * must have either a temporal dimension or temporal information in its
+      attributes
+
+    :param dataset: The dataset to be validated.
+    :raise: ValueError, if dataset contains no subset that is a valid xcube
+        dataset.
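+
+    Example (a sketch): a dataset freshly created with
+    :func:`xcube.core.new.new_cube` passes this check::
+
+        from xcube.core.new import new_cube
+
+        verify_cube_subset(new_cube(variables=dict(chl=0.5)))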
+    """
+    grid_mapping = GridMapping.from_dataset(dataset)
+    # if a grid mapping exists, the dataset contains spatial dimensions;
+    # if no grid mapping exists, a ValueError is raised
+    at_least_one_valid_var = False
+    for data_var in dataset.data_vars.values():
+        if grid_mapping.xy_dim_names[0] in data_var.dims and \
+                grid_mapping.xy_dim_names[1] in data_var.dims:
+            at_least_one_valid_var = True
+            break
+    if not at_least_one_valid_var:
+        raise ValueError('No data variable has spatial dimensions.')
+    start_time, end_time = get_time_range_from_data(dataset)
+    if start_time is None and end_time is None:
+        raise ValueError('Dataset has no temporal information.')
+
+
+def split_cube(dataset: xr.Dataset) \
+        -> Tuple[xr.Dataset, Mapping[str, xr.DataArray]]:
+    """
+    Creates a subset of a dataset that meets all hard requirements of a cube.
+    To this end, all data variables that do not include spatial dimensions
+    will be removed and returned in a mapping from variable name to data
+    array.
+
+    :param dataset: The dataset from which the subset shall be built
+    :raise: ValueError, if dataset contains no subset that is a valid xcube
+        dataset.
+    :return: a tuple, consisting of (a) a subset of the input dataset that has
+        been normalized to conform to strict cube requirements and (b) a
+        mapping of the names of removed data variables to these data variables
+    """
+    verify_cube_subset(dataset)
+
+    non_cube_data_vars = dict()
+    grid_mapping = GridMapping.from_dataset(dataset)
+
+    for data_var_name, data_var in dataset.data_vars.items():
+        if grid_mapping.xy_dim_names[0] not in data_var.dims \
+                and grid_mapping.xy_dim_names[1] not in data_var.dims:
+            non_cube_data_vars[data_var_name] = data_var
+    dataset = dataset.drop_vars(list(non_cube_data_vars.keys()))
+    dataset = normalize_dataset(dataset, do_not_normalize_spatial_dims=True)
+    return dataset, non_cube_data_vars
+
+
+def merge_cube(dataset: xr.Dataset,
+               data_vars: Mapping[str, xr.DataArray]) -> xr.Dataset:
+    """
+    Merges data_vars into a dataset.
+
+    :param dataset: The dataset into which the data variables shall be merged
+    :param data_vars: The data variables that shall be merged into the dataset
+    :raise: ValueError, if dataset is not a valid xcube dataset
+    :return: The dataset with the given data variables merged in
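+
+    Example (a sketch of the intended round trip with :func:`split_cube`)::
+
+        cube, non_cube_data_vars = split_cube(dataset)
+        # ... operate on the strict cube ...
+        merged = merge_cube(cube, non_cube_data_vars)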
""" if vectorize is not None: @@ -147,16 +145,18 @@ def cube_func(*input_vars: np.ndarray, # receives variables as vectors (with extra dim) raise NotImplementedError('vectorize is not supported yet') - if not cube_asserted: - for cube in input_cubes: - assert_cube(cube) + # TODO resample all input cubes to WGS84 + split_input_cubes = [] + for cube in input_cubes: + cube, _ = split_cube(cube) + assert_cube(cube) + split_input_cubes.append(cube) + input_cubes = tuple(split_input_cubes) # Check compatibility of inputs if input_cubes: input_cube_schema = CubeSchema.new(input_cubes[0]) for cube in input_cubes: - if not cube_asserted: - assert_cube(cube) if cube != input_cubes[0]: # noinspection PyUnusedLocal other_schema = CubeSchema.new(cube) diff --git a/xcube/core/extract.py b/xcube/core/extract.py index 3018cdbc2..71b9a3749 100644 --- a/xcube/core/extract.py +++ b/xcube/core/extract.py @@ -6,7 +6,7 @@ import pandas as pd import xarray as xr -from xcube.core.verify import assert_cube +from xcube.core.treatascube import split_cube DEFAULT_INDEX_NAME_PATTERN = '{name}_index' DEFAULT_REF_NAME_PATTERN = '{name}_ref' @@ -41,8 +41,7 @@ def get_cube_values_for_points( index_name_pattern: str = DEFAULT_INDEX_NAME_PATTERN, include_refs: bool = False, ref_name_pattern: str = DEFAULT_REF_NAME_PATTERN, - method: str = DEFAULT_INTERP_POINT_METHOD, - cube_asserted: bool = False + method: str = DEFAULT_INTERP_POINT_METHOD ) -> xr.Dataset: """ Extract values from *cube* variables at given @@ -77,8 +76,7 @@ def get_cube_values_for_points( :return: A new data frame whose columns are values from *cube* variables at given *points*. """ - if not cube_asserted: - assert_cube(cube) + cube, other_vars = split_cube(cube) index_dtype = np.int64 \ if method == POINT_INTERP_METHOD_NEAREST else np.float64 @@ -87,8 +85,7 @@ def get_cube_values_for_points( cube, points, index_name_pattern=index_name_pattern, - index_dtype=index_dtype, - cube_asserted=True + index_dtype=index_dtype ) cube_values = get_cube_values_for_indexes( @@ -98,8 +95,7 @@ def get_cube_values_for_points( include_bounds, data_var_names=var_names, index_name_pattern=index_name_pattern, - method=method, - cube_asserted=True + method=method ) if include_indexes: @@ -131,8 +127,7 @@ def get_cube_values_for_indexes( include_bounds: bool = False, data_var_names: Sequence[str] = None, index_name_pattern: str = DEFAULT_INDEX_NAME_PATTERN, - method: str = DEFAULT_INTERP_POINT_METHOD, - cube_asserted: bool = False + method: str = DEFAULT_INTERP_POINT_METHOD ) -> xr.Dataset: """ Get values from the *cube* at given *indexes*. @@ -155,8 +150,7 @@ def get_cube_values_for_indexes( :return: A new data frame whose columns are values from *cube* variables at given *indexes*. """ - if not cube_asserted: - assert_cube(cube) + cube, other_vars = split_cube(cube) if method not in {POINT_INTERP_METHOD_NEAREST, POINT_INTERP_METHOD_LINEAR}: raise ValueError(f"invalid method {method!r}") @@ -263,8 +257,7 @@ def get_cube_point_indexes( points: PointsLike, dim_name_mapping: Mapping[str, str] = None, index_name_pattern: str = DEFAULT_INDEX_NAME_PATTERN, - index_dtype=np.float64, - cube_asserted: bool = False + index_dtype=np.float64 ) -> xr.Dataset: """ Get indexes of given point coordinates *points* into the given *dataset*. @@ -288,8 +281,7 @@ def get_cube_point_indexes( it is expected to be a valid cube. :return: A dataset containing the index columns. 
""" - if not cube_asserted: - assert_cube(cube) + cube, _ = split_cube(cube) dim_name_mapping = dim_name_mapping if dim_name_mapping is not None else {} dim_names = _get_cube_data_var_dims(cube) diff --git a/xcube/core/mldataset.py b/xcube/core/mldataset.py index 9b973e8ee..420836137 100644 --- a/xcube/core/mldataset.py +++ b/xcube/core/mldataset.py @@ -18,7 +18,6 @@ from xcube.core.dsio import parse_s3_fs_and_root from xcube.core.dsio import write_cube from xcube.core.geom import get_dataset_bounds -from xcube.core.verify import assert_cube from xcube.util.perf import measure_time from xcube.util.tilegrid import TileGrid @@ -287,7 +286,7 @@ def _get_dataset_lazily(self, index: int, parameters: Dict[str, Any]) -> xr.Data base_dir = os.path.dirname(self._dir_path) level_path = os.path.join(base_dir, level_path) with measure_time(tag=f"opened local dataset {level_path} for level {index}"): - return assert_cube(xr.open_zarr(level_path, **parameters), name=level_path) + return xr.open_zarr(level_path, **parameters) def _get_tile_grid_lazily(self): """ @@ -386,7 +385,7 @@ def _get_dataset_lazily(self, index: int, parameters: Dict[str, Any]) -> xr.Data store = zarr.LRUStoreCache(store, max_size=max_size) with measure_time(tag=f"opened remote dataset {level_path} for level {index}"): consolidated = self._s3_file_system.exists(f'{level_path}/.zmetadata') - return assert_cube(xr.open_zarr(store, consolidated=consolidated, **parameters), name=level_path) + return xr.open_zarr(store, consolidated=consolidated, **parameters) def _get_tile_grid_lazily(self): """ @@ -510,7 +509,7 @@ def _get_dataset_lazily(self, index: int, parameters: Dict[str, Any]) -> xr.Data raise self._exception_type(f"Failed to compute in-memory dataset {self.ds_id!r} at level {index} " f"from function {self._callable_name!r}: " f"expected an xarray.Dataset but got {type(computed_value)}") - return assert_cube(computed_value, name=self.ds_id) + return computed_value def get_dataset_tile_grid(dataset: xr.Dataset, num_levels: int = None) -> TileGrid: @@ -663,7 +662,7 @@ def open_ml_dataset_from_object_storage(path: str, store = zarr.LRUStoreCache(store, max_size=chunk_cache_capacity) with measure_time(tag=f"opened remote zarr dataset {path}"): consolidated = s3.exists(f'{root}/.zmetadata') - ds = assert_cube(xr.open_zarr(store, consolidated=consolidated, **kwargs)) + ds = xr.open_zarr(store, consolidated=consolidated, **kwargs) return BaseMultiLevelDataset(ds, ds_id=ds_id) elif data_format == FORMAT_NAME_LEVELS: with measure_time(tag=f"opened remote levels dataset {path}"): @@ -686,11 +685,11 @@ def open_ml_dataset_from_local_fs(path: str, if data_format == FORMAT_NAME_NETCDF4: with measure_time(tag=f"opened local NetCDF dataset {path}"): - ds = assert_cube(xr.open_dataset(path, **kwargs)) + ds = xr.open_dataset(path, **kwargs) return BaseMultiLevelDataset(ds, ds_id=ds_id) elif data_format == FORMAT_NAME_ZARR: with measure_time(tag=f"opened local zarr dataset {path}"): - ds = assert_cube(xr.open_zarr(path, **kwargs)) + ds = xr.open_zarr(path, **kwargs) return BaseMultiLevelDataset(ds, ds_id=ds_id) elif data_format == FORMAT_NAME_LEVELS: with measure_time(tag=f"opened local levels dataset {path}"): diff --git a/xcube/core/resampling/temporal.py b/xcube/core/resampling/temporal.py index 454d4caea..8b405312e 100644 --- a/xcube/core/resampling/temporal.py +++ b/xcube/core/resampling/temporal.py @@ -26,7 +26,8 @@ from xcube.core.schema import CubeSchema from xcube.core.select import select_variables_subset -from xcube.core.verify import 
-from xcube.core.verify import assert_cube
+from xcube.core.treatascube import merge_cube
+from xcube.core.treatascube import split_cube
 
 
 def resample_in_time(dataset: xr.Dataset,
@@ -38,8 +39,7 @@
                      interp_kind=None,
                      time_chunk_size=None,
                      var_names: Sequence[str] = None,
-                     metadata: Dict[str, Any] = None,
-                     cube_asserted: bool = False) -> xr.Dataset:
+                     metadata: Dict[str, Any] = None) -> xr.Dataset:
     """
     Resample a dataset in the time dimension.
 
@@ -84,8 +84,7 @@ def resample_in_time(dataset: xr.Dataset,
        otherwise it is expected to be a valid cube.
     :return: A new xcube dataset resampled in time.
     """
-    if not cube_asserted:
-        assert_cube(dataset)
+    dataset, other_data_vars = split_cube(dataset)
 
     if frequency == 'all':
         time_gap = np.array(dataset.time[-1]) - np.array(dataset.time[0])
@@ -152,6 +151,9 @@ def resample_in_time(dataset: xr.Dataset,
     if isinstance(time_chunk_size, int) and time_chunk_size >= 0:
         chunk_sizes['time'] = time_chunk_size
 
+    # TODO consider cases where a data var in other_data_vars has time dimension
+    resampled_cube = merge_cube(resampled_cube, other_data_vars)
+
     return resampled_cube.chunk(chunk_sizes)
 
 
diff --git a/xcube/core/timeseries.py b/xcube/core/timeseries.py
index 882d482af..548f3b3a9 100644
--- a/xcube/core/timeseries.py
+++ b/xcube/core/timeseries.py
@@ -29,7 +29,7 @@
 
 from xcube.core.geom import mask_dataset_by_geometry, convert_geometry, GeometryLike, get_dataset_geometry
 from xcube.core.select import select_variables_subset
-from xcube.core.verify import assert_cube
+from xcube.core.treatascube import split_cube
 
 Date = Union[np.datetime64, str]
 
@@ -61,8 +61,7 @@ def get_time_series(cube: xr.Dataset,
                     agg_methods: Union[str, Sequence[str], AbstractSet[str]] = AGG_MEAN,
                     include_count: bool = False,
                     include_stdev: bool = False,
-                    use_groupby: bool = False,
-                    cube_asserted: bool = False) -> Optional[xr.Dataset]:
+                    use_groupby: bool = False) -> Optional[xr.Dataset]:
     """
     Get a time series dataset from a data *cube*.
 
@@ -97,8 +96,7 @@ def get_time_series(cube: xr.Dataset,
 
     :return: A new dataset with time-series for each variable.
     """
-    if not cube_asserted:
-        assert_cube(cube)
+    cube, other_data_vars = split_cube(cube)
 
     geometry = convert_geometry(geometry)
 
diff --git a/xcube/core/vars2dim.py b/xcube/core/vars2dim.py
index 7e3178c11..c90240e90 100644
--- a/xcube/core/vars2dim.py
+++ b/xcube/core/vars2dim.py
@@ -21,25 +21,33 @@
 
 import xarray as xr
 
-from xcube.core.verify import assert_cube
+from xcube.core.treatascube import merge_cube
+from xcube.core.treatascube import split_cube
 
 
 def vars_to_dim(cube: xr.Dataset,
                 dim_name: str = 'var',
                 var_name='data',
-                cube_asserted: bool = False):
+                consider_cube_data_vars_only: bool = False):
     """
     Convert data variables into a dimension.
 
     :param cube: The xcube dataset.
-    :param dim_name: The name of the new dimension and coordinate variable. Defaults to 'var'.
-    :param var_name: The name of the new, single data variable. Defaults to 'data'.
-    :param cube_asserted: If False, *cube* will be verified, otherwise it is expected to be a valid cube.
-    :return: A new xcube dataset with data variables turned into a new dimension.
+    :param dim_name: The name of the new dimension and coordinate variable.
+        Defaults to 'var'.
+    :param var_name: The name of the new, single data variable.
+        Defaults to 'data'.
+    :param consider_cube_data_vars_only: If True, only the data variables
+        that carry spatial dimensions are turned into the new dimension;
+        all other data variables are merged back into the result unchanged.
+    :return: A new xcube dataset with data variables turned into a new
+        dimension.
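+
+    Example (a sketch; assumes a cube with data variables ``chl`` and
+    ``tsm``)::
+
+        stacked = vars_to_dim(cube, dim_name='var', var_name='data')
+        # 'stacked' has a single variable 'data' with a new dimension
+        # 'var', whose labels are 'chl' and 'tsm'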
     """
-    if not cube_asserted:
-        assert_cube(cube)
+    other_data_vars = {}
+    if consider_cube_data_vars_only:
+        cube, other_data_vars = split_cube(cube)
 
     if var_name == dim_name:
         raise ValueError("var_name must be different from dim_name")
@@ -48,8 +56,14 @@ def vars_to_dim(cube: xr.Dataset,
     if not data_var_names:
         raise ValueError("cube must not be empty")
 
-    da = xr.concat([cube[data_var_name] for data_var_name in data_var_names], dim_name)
+    da = xr.concat([cube[data_var_name] for data_var_name in data_var_names],
+                   dim_name)
     new_coord_var = xr.DataArray(data_var_names, dims=[dim_name])
     da = da.assign_coords(**{dim_name: new_coord_var})
 
-    return xr.Dataset(dict(**{var_name: da}))
+    dataset = xr.Dataset(dict(**{var_name: da}))
+
+    if consider_cube_data_vars_only:
+        dataset = merge_cube(dataset, other_data_vars)
+
+    return dataset

From be8bbf9d488c7010feb1fe7910d1a3af637cba94 Mon Sep 17 00:00:00 2001
From: Tonio Fincke
Date: Fri, 30 Jul 2021 15:54:14 +0200
Subject: [PATCH 7/7] updated cubeconv.md

---
 docs/source/cubeconv.md | 25 +++++++++++++------------
 1 file changed, 13 insertions(+), 12 deletions(-)

diff --git a/docs/source/cubeconv.md b/docs/source/cubeconv.md
index c8adae265..925a49204 100644
--- a/docs/source/cubeconv.md
+++ b/docs/source/cubeconv.md
@@ -58,8 +58,7 @@ ARD links:
     [Attribute Convention for Data Discovery](http://wiki.esipfed.org/index.php/Attribute_Convention_for_Data_Discovery)
 * Dimensions:
   * SHALL all be greater than zero.
-  * SHALL include two spatial dimensions
-  * SHOULD include a dimension `time`
+  * SHALL include one temporal and two spatial dimensions
   * SHOULD include a dimension `bnds` of size 2 that may be used by bounding
     coordinate variables
 * Coordinate Variables:
@@ -70,10 +69,10 @@ ARD links:
     describe
   * For each dimension of a data variable, a coordinate variable MUST exist
-* Temporal coordinate variables:
+* Temporal coordinate variable:
   * SHALL provide time coordinates for a given time index.
   * MAY be non-equidistant or equidistant.
-  * SHOULD be named `time`
+  * SHALL be named `time`
   * One variable value SHALL provide observation or average time of
     *cell centers*.
   * Attributes:
@@ -95,12 +94,10 @@ ARD links:
   * SHOULD be named `<dim>_bnds`
   * `<dim>_bnds[<dim>, 0]` SHALL provide the *lower cell boundary*,
     `<dim>_bnds[<dim>, 1]` SHALL provide the *upper cell boundary*
-* Data variables:
-  * MAY have any dimensionality, including no dimensions at all.
-  * SHALL have the spatial dimensions at the innermost position in case it has
-    spatial dimensions (e.g., `[..., y, x]`)
-  * SHALL have its time dimension at the outermost position in case it has a
-    time dimension (e.g., `[time, ...]`)
+* Cube Data variables:
+  * SHALL have its time dimension at the outermost position and the
+    spatial dimensions at the innermost positions, in this order:
+    `[time, ..., y, x]`, where `y` and `x` denote the spatial dimensions
   * MAY have extra dimensions, e.g. `layer` (of the atmosphere) or
     `band` (of a spectrum). These extra dimensions MUST be positioned between
     the time and the spatial coordinates
@@ -117,6 +114,10 @@ ARD links:
       for applying the colour bar. If not provided, minimum and maximum
       default to `valid_min`, `valid_max`. If neither are provided,
       minimum and maximum default to `0` and `1`.
+* Non-Cube Data variables:
+  * Consist of all data variables that are not cube data variables as
+    described above
+  * MAY have any dimensionality, including no dimensions at all
 
 ### WGS84 Schema (extends Basic)
 
@@ -134,7 +135,7 @@ ARD links:
 
 ### Generic Schema (extends Basic)
 
 * Dimensions:
-  * SHALL include two spatial dimensions, which SHOULD be named `y` and `x`
+  * SHALL include two spatial dimensions; the names `y` and `x` are
+    RECOMMENDED
 * Spatial coordinate variables:
   * MAY use any spatial grid and CRS.
   * SHOULD have attributes `standard_name`, `units`
@@ -143,7 +144,7 @@ ARD links:
   * MAY have `lon[y,x]`: longitude of *cell centers*.
     * Attributes: `standard_name="longitude"`, `units="degrees_east"`.
 * Grid Mapping variable:
-  * SHALL be included in case the CRS is not WGS84.
+  * MUST be included in case the CRS is not WGS84.
   * SHALL NOT carry any data, therefore it MAY be of any type
   * SHOULD be named `crs`
   * MUST have attributes that describe a CF Grid Mapping v1.8 (see