8 changes: 4 additions & 4 deletions standard/template/sections/clause_0_front_material.adoc
@@ -11,11 +11,11 @@ This Standard has been developed in collaboration with contributors from Earth o
[abstract]
== Abstract

The GeoZarr Unified Data Model and Encoding Standard specifies a conceptual and implementation framework for representing multidimensional, geospatial datasets using the Zarr format. This Standard builds upon the Unidata Common Data Model (CDM) and the Climate and Forecast (CF) Conventions, and introduces interoperable constructs for tiling, georeferencing, and metadata integration.
Zarr provides efficient chunked storage for n-dimensional arrays but does not provide the semantic constructs required for geospatial and scientific data workflows. The GeoZarr Unified Data Model and Encoding Standard addresses this gap by adding essential concepts (coordinate systems, grid mappings, temporal semantics, and CF-compliant metadata) on top of Zarr's storage foundation.

The model defines core elements—dimensions, coordinate variables, data variables, attributes—and optional extensions for multi-resolution overviews, affine geotransforms, and STAC metadata. Encoding guidance is provided for Zarr Version 2 and Zarr Version 3, including chunking, group hierarchy, and metadata conventions.
The Standard builds upon proven concepts from the Common Data Model (CDM) and Climate and Forecast (CF) Conventions to define core elements—dimensions, coordinate variables, data variables, and attributes—along with extensions for multi-resolution overviews, affine geotransforms, and STAC metadata. This layered approach ensures applications can work with semantically rich geospatial data while leveraging Zarr's cloud-optimized storage capabilities.

GeoZarr aims to bridge scientific and geospatial communities by enabling round-trip transformations with formats such as NetCDF and GeoTIFF, and supporting compatibility with tools in the scientific Python and geospatial ecosystems. This Standard enables scalable, standards-compliant, and semantically rich data structures for cloud-native Earth observation applications.
By providing a standardized framework for geospatial semantics, GeoZarr enables scientific and geospatial applications to fully utilize cloud-native storage architectures while maintaining the rich metadata and coordinate referencing required for Earth observation workflows. The result is a modern, scalable approach to storing and accessing geospatial data that meets the needs of both data providers and consumers.

I like this introduction.

== Submitters

@@ -29,4 +29,4 @@ All questions regarding this submission should be directed to the editor or the
|Brianna Pagán _(editor)_ | DevSeed
|Ryan Abernathey| EarthMover
| TBD | TBD
|===
|===
24 changes: 22 additions & 2 deletions standard/template/sections/clause_1_scope.adoc
@@ -2,6 +2,26 @@

The GeoZarr Unified Data Model and Encoding Standard defines a conceptual and implementation framework for representing and encoding geospatial and scientific datasets using the Zarr format. The scope of this Standard includes the definition of a format-agnostic unified data model, the specification of its encoding into Zarr Version 2 and Version 3, and the establishment of extension points to support interoperability with external metadata and tiling standards.

This Standard addresses the needs of Earth observation, environmental monitoring, and geospatial analysis applications that require efficient, scalable access to multidimensional datasets. It enables the harmonisation of existing data models, such as the Unidata Common Data Model (CDM) and the Climate and Forecast (CF) Conventions, with operational encoding formats suitable for cloud-native storage and analysis.
These capabilities are necessary because Zarr does not provide semantic constructs for geospatial data interpretation. Applications need to understand not just array shapes and values, but coordinate meanings, projection parameters, and scientific metadata. GeoZarr fills this gap without compromising Zarr's performance characteristics.

Typical use cases include the storage, transformation, discovery, and processing of raster and gridded data, data cubes with temporal or vertical dimensions, and catalogue-enabled datasets integrated with metadata standards such as STAC and OGC Tile Matrix Sets.
=== Why GeoZarr Exists

I think we may be missing an important clarification to justify the purpose of GeoZarr. Conventions for geospatial data in Zarr already exist, as implemented in Xarray, NCZarr, and GDAL; they primarily translate aspects of the CF/NetCDF data model into Zarr encoding.

However:

  1. The CF/NetCDF data model itself may lack certain capabilities, such as support for multiscale overviews and affine transforms.
  2. The current encoding conventions to Zarr – for example, mapping all NetCDF attributes into Zarr string attributes – may not be optimal and could be revisited.


Zarr, by design, is a low-level container for storing n-dimensional arrays and metadata. While this simplicity is a strength for performance and interoperability, it means Zarr lacks higher-level concepts that geospatial applications require:

* *Coordinate Systems:* No native way to associate spatial or temporal meaning with array dimensions
* *Grid Mappings:* No standard mechanism for projection and coordinate reference system metadata
* *Semantic Metadata:* No conventions for units, standard names, or scientific attributes
* *Variable Relationships:* No formal distinction between coordinate variables and data variables

These concepts are essential for geospatial workflows but must be layered on top of Zarr's array storage. GeoZarr provides this semantic layer through proven standards (Common Data Model and CF conventions) while preserving Zarr's cloud-native advantages.
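
As a rough illustration of this layering, the sketch below places CF-style attributes next to the (abridged) array metadata that Zarr itself records. The attribute values, the `crs` grid-mapping name, and the `missing_semantics` helper are assumptions made for this example and are not defined by this Standard.

```python
# Abridged array metadata as Zarr v2 records it: layout only, no geospatial meaning.
zarr_array_metadata = {
    "zarr_format": 2,
    "shape": [12, 180, 360],
    "chunks": [1, 180, 360],
    "dtype": "<f4",
}

# CF-style attributes that supply the meaning (illustrative values).
cf_attributes = {
    "standard_name": "air_temperature",
    "units": "K",
    "grid_mapping": "crs",  # name of a hypothetical grid-mapping variable
}

# Hypothetical checklist drawn from the bullet list above.
SEMANTIC_KEYS = ("standard_name", "units", "grid_mapping")


def missing_semantics(attrs):
    """Return the semantic keys that are absent from an attribute dict."""
    return [key for key in SEMANTIC_KEYS if key not in attrs]


print(missing_semantics({}))             # every key on the checklist is absent
print(missing_semantics(cf_attributes))  # nothing is absent
```

The point of the sketch is only that the two dictionaries answer different questions: the first tells a reader how to fetch bytes, the second tells an application what those bytes mean.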

=== Use Cases and Applications

This Standard addresses the needs of Earth observation, environmental monitoring, and geospatial analysis applications that require efficient, scalable access to multidimensional datasets. It enables the harmonisation of existing data models with operational encoding formats suitable for cloud-native storage and analysis.

Typical use cases include:
* Storage and processing of raster and gridded data
* Management of data cubes with temporal or vertical dimensions
* Integration with catalogue systems through standardized metadata
* Multi-resolution tiling for efficient visualization and analysis
* Cloud-optimized access to large geospatial datasets
11 changes: 7 additions & 4 deletions standard/template/sections/clause_4_terms_and_definitions.adoc
@@ -2,6 +2,9 @@

=== Terms and definitions

The GeoZarr specification inherits https://zarr-specs.readthedocs.io/en/latest/v3/core/index.html#concepts-and-terminology[concepts and terminology from the Zarr core specification].
The following terms add GeoZarr-specific meaning to the existing Zarr terminology.

==== array

A multidimensional, regularly spaced collection of values (e.g., raster data or gridded measurements), typically indexed by dimensions such as time, latitude, longitude, or spectral band.
@@ -22,17 +25,17 @@ An array containing the primary geospatial or scientific measurements of interes

An index axis along which arrays are organised. Dimensions provide a naming and ordering scheme for accessing data in multidimensional arrays (e.g., `time`, `x`, `y`, `band`).

==== group
==== dataset

A container for datasets, variables, dimensions, and metadata in Zarr. Groups may be nested to represent a logical hierarchy (e.g., for resolutions or collections).
A group that contains one or more data variables along with their associated coordinate variables, having a consistent relationship between these components. A dataset represents a coherent set of related data arrays and follows the unified data model.

Can we use "Unified Data Model" in capitals wherever it is a formal reference to the Clause 7 definition?

Does it say what the group is first? A group is probably still a container for datasets that can be nested.

Author

We inherit the group terminology from Zarr. The section starts with:

GeoZarr specification inherits https://zarr-specs.readthedocs.io/en/latest/v3/core/index.html#concepts-and-terminology[concepts and terminology from the Zarr core specification].
The following terms adds Geozarr specificity to the existing Zarr terminology

I would like to avoid repeating the Zarr terminology, to limit maintenance if it evolves.

Author

capitals solved


==== metadata

Structured information describing the content, context, and semantics of datasets, variables, and attributes. GeoZarr metadata includes CF attributes, geotransform definitions, and links to STAC metadata where applicable.

==== multiscale dataset
==== multiscale group

A dataset that includes multiple representations of the same data variable at varying spatial resolutions. Each resolution level is associated with a tile matrix from an OGC Tile Matrix Set.
A group that contains two or more child groups representing the same data at different resolutions, where each child group is a <<term-dataset,dataset>>. The multiscale group includes metadata describing the relationship between resolution levels.

==== tile matrix set

60 changes: 28 additions & 32 deletions standard/template/sections/clause_7_unified_data_model.adoc
@@ -21,7 +21,7 @@ This clause specifies the logical composition of the unified model, the external

=== Foundational Model and Standards Reuse

The unified data model described in this Standard is derived from established community specifications to maximise interoperability and to enable the reuse of mature tools and practices. The model is grounded in the Unidata Common Data Model (CDM) and the Climate and Forecast (CF) Conventions, which together provide a robust framework for representing scientific and geospatial datasets.
GeoZarr adopts established data model concepts because Zarr itself provides only array storage without semantic interpretation. The Unidata Common Data Model (CDM) provides the conceptual framework for understanding dimensions, variables, and attributes, while CF Conventions provide standardized metadata semantics. This reuse ensures compatibility with existing scientific software while avoiding reinvention of proven concepts.

==== Common Data Model (CDM)

@@ -87,11 +87,11 @@ To enable discovery of resources within the hierarchical structure of the data m

A STAC extension consists of embedding or referencing STAC Collection and Item metadata within the data model:

* Each dataset resource MAY reference a corresponding STAC `Collection` or `Item` using an identifier or embedded object.
* Each store resource MAY reference a corresponding STAC `Collection` or `Item` using an identifier or embedded object.
* STAC properties such as `datetime`, `bbox`, and `eo:bands` MAY be included in the metadata to enable spatial, temporal, and spectral filtering.
* The structure is compatible with external STAC APIs and metadata harvesting systems.

STAC integration is non-intrusive and modular. It does not impose changes on the internal organisation of datasets and MAY be adopted incrementally by implementations requiring catalogue-based discovery capabilities.
STAC integration is non-intrusive and modular. It does not impose changes on the internal organisation of the store and MAY be adopted incrementally by implementations requiring catalogue-based discovery capabilities.
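
A minimal sketch of how such a reference could support spatial filtering, assuming the STAC Item is embedded under a hypothetical `stac` attribute key (the text above does not name the key). The `bbox` and `properties.datetime` fields follow the STAC Item layout; the `matches_area` helper and all values are illustrative.

```python
def bbox_intersects(a, b):
    """True when two [west, south, east, north] boxes overlap or touch."""
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]


# Root-level attributes with an embedded STAC Item (hypothetical "stac" key).
store_attributes = {
    "title": "Example store",
    "stac": {
        "type": "Feature",
        "id": "example-item",
        "bbox": [5.0, 45.0, 10.0, 50.0],
        "properties": {"datetime": "2024-06-01T00:00:00Z"},
    },
}


def matches_area(attrs, query_bbox):
    """Filter on the embedded Item's bbox; stores without one never match."""
    item = attrs.get("stac")
    return bool(item) and bbox_intersects(item["bbox"], query_bbox)


print(matches_area(store_attributes, [0.0, 40.0, 6.0, 46.0]))    # overlapping query
print(matches_area(store_attributes, [20.0, 0.0, 30.0, 10.0]))   # disjoint query
```

The sketch ignores boxes that cross the antimeridian. Because the STAC object sits alongside the other attributes, removing it would leave the rest of the store untouched, which is the non-intrusive property described above.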


==== Modularity and Interoperability
@@ -101,22 +101,22 @@ Each extension point is specified independently. Implementations may advertise s

=== Unified Model Structure

This clause defines the structural organisation of datasets conforming to the unified data model (UDM). It consolidates the foundational elements and optional extensions into a coherent architecture suitable for Zarr encoding, while remaining format-agnostic. The model establishes a modular and extensible framework that supports structured representation of multidimensional, geospatially-referenced resources.
This clause defines the structural organisation of stores conforming to the unified data model (UDM). It consolidates the foundational elements and optional extensions into a coherent architecture suitable for Zarr encoding, while remaining format-agnostic. The model establishes a modular and extensible framework that supports structured representation of multidimensional, geospatially-referenced resources.

The model represents datasets as abstract compositions of dimensions, coordinate variables, data variables, and associated metadata. This abstraction ensures that applications and services can reason about the content and semantics of a dataset without reliance on storage layout or specific serialisation.

==== Dataset Structure
==== Store Structure

A dataset conforming to the Unified Data Model (UDM) is structured as a hierarchy rooted at a top-level dataset entity. This design enables modularity and facilitates the representation of complex, multi-resolution, or thematically partitioned data collections.
A store conforming to the Unified Data Model (UDM) is structured as a hierarchy rooted at a top-level group. This design enables modularity and facilitates the representation of complex, multi-resolution, or thematically partitioned data collections.

Each dataset node comprises the following core components, aligned with the Unidata Common Data Model (CDM) and Climate and Forecast (CF) Conventions:
Each <<term-dataset, dataset>> comprises the following core components, aligned with the Unidata Common Data Model (CDM) and Climate and Forecast (CF) Conventions:

- **Dimensions** – Named, integer-valued axes defining the extent of data variables. Examples include `time`, `x`, `y`, and `band`.
- **Coordinate Variables** – Arrays that supply coordinate values along dimensions, providing spatial, temporal, or contextual referencing. These may be scalar or higher-dimensional, depending on the referencing scheme.
- **Data Variables** – Multidimensional arrays representing physical measurements or derived products. Defined over one or more dimensions, these variables are associated with coordinate variables and annotated with metadata.
- **Attributes** – Key-value pairs attached to variables or dataset components. Attributes convey semantic information such as units, standard names, and geospatial metadata.

The hierarchy is implemented through **groups**, which function as containers for variables, dimensions, and metadata. Groups may define local context while inheriting attributes from parent nodes. This supports the logical subdivision of datasets by theme, resolution, or processing stage, and enhances the clarity and reusability of complex geospatial structures.
A Zarr hierarchy is a tree structure, where each node in the tree is either a group or an array. Group nodes may have children but array nodes may not. This supports the logical subdivision by theme, resolution, or processing stage, and enhances the clarity and reusability of complex geospatial structures.

The diagram below represents the structural layer of the unified data model, derived from the Unidata Common Data Model, which serves as the foundational framework for all overlying model layers.

@@ -129,18 +129,17 @@ The diagram below represents the structural layer of the unified data model, der
....
@startuml CDM_DAL_Object_Model

class Dataset {
class Store {
+ String location
+ open()
+ close()
}

class Group {
+ String name
+ List<Group> subgroups
+ List<Variable> variables
+ List<Dimension> dimensions
+ List<Attribute> attributes
}

class Dataset {
}

class Dimension {
@@ -152,9 +151,6 @@ class Dimension {

class Variable {
+ String name
+ DataType dataType
+ List<Dimension> shape
+ List<Attribute> attributes
+ read()
}

@@ -169,19 +165,20 @@ class Attribute {
+ List<String> values
}

Dataset --> Group : rootGroup
Group --> Group : contains >
Group --> Variable : contains >
Group --> Dimension : defines >
Group --> Attribute : has >
Variable --> Dimension : uses >
Variable --> DataType : has >
Variable --> Attribute : has >
Store "1" --> "*" Group : rootGroup
Group "1" --> "*" Group : contains
Dataset -up-|> Group
Dataset --> "*" Variable : contains
Dataset --> "*" Dimension : defines
Group --> "*" Attribute : has
Variable --> "*" Dimension : uses
Variable --> "1" DataType : has
Variable --> "*" Attribute : has
@enduml
....
//endif::never-shown[]

Note that, conceptually, node within this hierarchy might be treated as a self-contained dataset.
Note that, conceptually, any node within this hierarchy may be treated as a self-contained store.
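
The relationships in the diagram can be mirrored in a few Python dataclasses. This is a sketch for illustration only: the field names loosely follow the diagram, while the `dataset_paths` helper and the example group names (`r10m`, `aux`, `mask`) are hypothetical and not part of the model.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Variable:
    name: str
    dimensions: List[str] = field(default_factory=list)
    attributes: Dict[str, object] = field(default_factory=dict)


@dataclass
class Group:
    name: str
    subgroups: List["Group"] = field(default_factory=list)
    attributes: Dict[str, object] = field(default_factory=dict)


@dataclass
class Dataset(Group):
    """A group that also holds variables (Dataset specialises Group)."""
    variables: List[Variable] = field(default_factory=list)


def dataset_paths(group, path="/"):
    """Walk the hierarchy and return the path of every dataset node."""
    found = [path] if isinstance(group, Dataset) else []
    for child in group.subgroups:
        child_path = path.rstrip("/") + "/" + child.name
        found.extend(dataset_paths(child, child_path))
    return found


root = Group(
    name="",
    subgroups=[
        Dataset(name="r10m", variables=[Variable("b04", ["y", "x"])]),
        Group(name="aux", subgroups=[Dataset(name="mask")]),
    ],
)
print(dataset_paths(root))
```

Calling `dataset_paths` on any subgroup instead of `root` works the same way, which reflects the note above that a node can be treated as a self-contained store.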

==== Coordinate Referencing

@@ -196,7 +193,7 @@ The model accommodates both standard CF-compatible definitions and extended refe

Metadata may be declared at various levels within the model structure:

- **Global Metadata** – Attributes describing the dataset as a whole, including elements such as `title`, `summary`, and `license`.
- **Global Metadata** – Attributes describing the store as a whole, including elements such as `title`, `summary`, and `license`.
- **Variable Metadata** – Attributes associated with individual data or coordinate variables, conveying descriptive or semantic information.
- **Extension Metadata** – Structured metadata linked to optional model extensions (e.g., multiscale tiling, catalogue references, geotransform properties).

@@ -218,15 +215,15 @@ Overviews enable:

===== Conceptual Structure

An *Overviews* construct is defined as a *hierarchical set of multiscale representations* of one or more data variables. It comprises the following components:
A <<term-multiscale-group,multiscale group>> contains child groups representing the data at different resolutions, where each child group is a <<term-dataset, dataset>> following the unified data model. It comprises the following components:

[horizontal]
*Base Variable*:: The original, highest-resolution variable to which the overview hierarchy is anchored. It is defined using the standard `DataVariable` structure in the model.
*Overview Levels*:: A sequence of variables representing the same logical quantity as the base variable, but sampled at coarser spatial resolutions.
*Base Dataset*:: The original, highest-resolution dataset to which the multiscale hierarchy is anchored.
*Zoom Level Datasets*:: A sequence of datasets representing the same data as the base dataset, but sampled at coarser spatial resolutions.
*Zoom Level Identifier*:: A unique identifier associated with each level, ordered from finest (e.g. `"0"`) to coarsest resolution (e.g. `"N"`).
*Tile Grid Definition*:: A mapping that associates each zoom level with a spatial tiling layout, defined in alignment with a `TileMatrixSet`.
*Spatial Alignment*:: Each overview variable MUST be spatially aligned with the base variable using a consistent coordinate reference system and compatible axis orientation.
*Resampling Method*:: A declared method indicating the technique used to derive coarser levels from the base variable (e.g. `nearest`, `average`, `cubic`).
*Spatial Alignment*:: Each zoom-level dataset MUST be spatially aligned with the base dataset using a consistent coordinate reference system and compatible axis orientation.
*Resampling Method*:: A declared method indicating the technique used to derive coarser levels from the base dataset (e.g. `nearest`, `average`, `cubic`).
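
To make the zoom-level ordering concrete, the following sketch derives the array shape of each level from a base dataset. It assumes a constant decimation factor of 2 between levels and rounds up at the edges; both choices, and the `zoom_level_shapes` helper itself, are assumptions for illustration, since the actual layout is governed by the referenced `TileMatrixSet`.

```python
import math


def zoom_level_shapes(base_shape, levels, factor=2):
    """Map zoom level identifiers ("0" = finest) to (height, width) shapes."""
    if levels < 2:
        # A multiscale group holds two or more resolution levels.
        raise ValueError("a multiscale group needs at least two levels")
    shapes = {}
    height, width = base_shape
    for level in range(levels):
        shapes[str(level)] = (height, width)
        # Each coarser level covers the same extent with fewer samples.
        height = max(1, math.ceil(height / factor))
        width = max(1, math.ceil(width / factor))
    return shapes


print(zoom_level_shapes((1000, 600), levels=3))
```

Only the shapes are computed here; producing the values of a coarser level would additionally apply the declared resampling method (e.g. `average`) to the base dataset.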

===== Model Components

@@ -351,4 +348,3 @@ The unified data model facilitates interoperability with tools and libraries acr
- *Cloud-native infrastructure*: support for parallel access, chunked storage, and hierarchical grouping compatible with object storage.

Tooling support is expected to grow via standard-conformant implementations, easing adoption across domains and infrastructures.

7 changes: 3 additions & 4 deletions standard/template/sections/clause_9_zarr_encoding_core.adoc
@@ -1,7 +1,7 @@

=== Hierarchical Structure

A dataset conforming to the unified data model is represented as a hierarchical structure of groups, variables (arrays), dimensions, and metadata. The dataset is rooted in a *top-level group*, which may contain:
A store conforming to the unified data model is structured as a hierarchy of groups, variables (arrays), dimensions, and metadata. Following Zarr conventions, this hierarchy is rooted in a group, which may contain:

- Arrays representing coordinate or data variables
- Child groups for modular organisation, including logical sub-collections or resolution levels
@@ -14,7 +14,7 @@ Each group adheres to a consistent structure, allowing recursive composition. Th
|===
|Model Element |Zarr v2 Encoding |Zarr v3 Encoding

|Root Dataset | Directory with `.zgroup` and `.zattrs` | Directory with `zarr.json`, with `node_type: group`
|Root Group | Directory with `.zgroup` and `.zattrs` | Directory with `zarr.json`, with `node_type: group`

|Child Group | Subdirectory with `.zgroup` and `.zattrs` | Subdirectory with `zarr.json`, with `node_type: group`

@@ -115,7 +115,7 @@ Example:

=== Global Metadata

Metadata associated with the dataset as a whole is stored at the root group level.
Metadata associated with the store is stored at the root group level.
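
The sketch below shows where root-level attributes end up on disk under each Zarr version, using only the Python standard library rather than a Zarr implementation. The helper names and the attribute values are hypothetical; the file names and the `node_type: group` field follow the encoding table earlier in this clause.

```python
import json
import tempfile
from pathlib import Path


def write_root_group(root: Path, attrs: dict, zarr_format: int) -> None:
    """Write root group metadata by hand (illustrative, not a Zarr library)."""
    root.mkdir(parents=True, exist_ok=True)
    if zarr_format == 2:
        # Zarr v2: group marker and attributes live in separate files.
        (root / ".zgroup").write_text(json.dumps({"zarr_format": 2}))
        (root / ".zattrs").write_text(json.dumps(attrs))
    elif zarr_format == 3:
        # Zarr v3: a single zarr.json carries node type and attributes.
        doc = {"zarr_format": 3, "node_type": "group", "attributes": attrs}
        (root / "zarr.json").write_text(json.dumps(doc))
    else:
        raise ValueError(f"unsupported zarr_format: {zarr_format}")


def read_root_attributes(root: Path) -> dict:
    """Read the global attributes back, whichever version wrote them."""
    v3_doc = root / "zarr.json"
    if v3_doc.exists():
        return json.loads(v3_doc.read_text())["attributes"]
    return json.loads((root / ".zattrs").read_text())


global_attrs = {"title": "Example store", "license": "CC-BY-4.0"}
with tempfile.TemporaryDirectory() as tmp:
    for fmt in (2, 3):
        store_root = Path(tmp) / f"example_v{fmt}.zarr"
        write_root_group(store_root, global_attrs, fmt)
        print(fmt, sorted(p.name for p in store_root.iterdir()))
```

In practice a Zarr library would write these files; the manual version is shown only to make the two layouts visible side by side.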


[cols="1,2,2"]
@@ -157,4 +157,3 @@ In all cases:

- Attribute names are case-sensitive and encoded as UTF-8 strings
- Values shall conform to JSON-compatible types (string, number, boolean, array)
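
A possible pre-write check for the second rule is sketched below. It reads the list of types strictly (string, number, boolean, array), so nested objects and `None` are rejected; that strictness, the rejection of non-finite floats, and both helper names are assumptions of this example rather than requirements stated here.

```python
import math


def is_json_compatible(value):
    """Check a value against the listed types: string, number, boolean, array."""
    if isinstance(value, (str, bool, int)):
        return True
    if isinstance(value, float):
        # NaN and infinity have no standard JSON representation.
        return math.isfinite(value)
    if isinstance(value, list):
        return all(is_json_compatible(item) for item in value)
    return False


def invalid_attribute_names(attrs):
    """Return the sorted names of attributes whose values would be rejected."""
    return sorted(
        name for name, value in attrs.items() if not is_json_compatible(value)
    )


candidate = {
    "title": "Example",
    "scale_factor": 0.5,
    "valid": True,
    "bbox": [5.0, 45.0, 10.0, 50.0],
    "nodata": float("nan"),
    "raw": b"\x00\x01",
}
print(invalid_attribute_names(candidate))
```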
