|
2 | 2 |
|
3 | 3 | !!! abstract "Summary" |
4 | 4 |
|
5 | | - **This will include a user guide for the OceanDataCatalog API to explore and access ocean data stored in the JASMIN Object Store.** |
| 5 | + **This is the User Guide for the OceanDataCatalog API to explore and access ocean data stored in the JASMIN Object Store.** |
6 | 6 |
|
7 | 7 | --- |
| 8 | + |
| 9 | +## What is the OceanDataCatalog? |
| 10 | + |
| 11 | +**OceanDataCatalog** is a Python API which allows users to: |
| 12 | + |
| 13 | +* Interface with National Oceanography Centre Spatio-Temporal Asset Catalogs ([**STAC**](https://stacspec.org/en)) to expose collections of publicly available ocean model outputs stored in the JASMIN Object Store. |
| 14 | +* Search catalogs by collection of ocean model outputs, standard variable names or platform (type of grid on which outputs are stored). |
| 15 | +* Subset & open Analysis-Ready Cloud Optimised ([**ARCO**](https://doi.org/10.1109/MCSE.2021.3059437)) datasets as lazy [**xarray**](https://docs.xarray.dev/en/stable/user-guide/data-structures.html) Datasets. |
| 16 | + |
| 17 | +## What is STAC? |
| 18 | + |
| 19 | +Spatio-Temporal Asset Catalogs (**STAC**) provide a standardized way to describe geospatial and temporal data so that it can be easily discovered & shared across many different platforms. |
| 20 | + |
| 21 | +A STAC catalog organises datasets as a **Collection** of **Items** — each representing a geospatial asset (e.g., a model output file or satellite image) — and describes their spatial and temporal extent through structured metadata. |
| 22 | + |
| 23 | +STAC is intentionally simple and extensible: it builds on widely used web standards (JSON and GeoJSON) and can describe geospatial assets stored in diverse formats, including large, cloud-optimized Zarr stores & Icechunk repositories. |
| 24 | + |
| 25 | +Behind the **OceanDataCatalog** API, STAC catalogs are used to describe publicly available ocean model outputs produced by the National Oceanography Centre. |
| 26 | + |
| 27 | +### STAC Basics: |
| 28 | + |
| 29 | +📁 **Catalog** — Container storing STAC **Collections** or other **Catalogs** - provides high-level metadata about its contents. |
| 30 | + |
| 31 | +🗂️ **Collection** — Group of related **Items** that share common metadata, such as a modelling activity or model configuration. |
| 32 | + |
| 33 | +📄 **Item** — Single spatio-temporal record within a Collection, typically representing one dataset instance (e.g., a model output file / dataset). Each **Item** includes geometry, timestamps, and links to a data **Asset**. |
| 34 | + |
| 35 | +🧩 **Asset** — Actual data or file associated with an **Item**, such as a Zarr Store, NetCDF file, or Icechunk repository. **Assets** include URLs and media types that determine how data can be accessed. |
| 36 | + |
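Because STAC objects are plain JSON, the relationship between an **Item** and its **Assets** can be sketched with nothing more than Python dictionaries. The example below is a minimal illustration, not an entry from the NOC catalog: the Item ID, geometry, dates, asset URL and media type string are all placeholders chosen for this sketch, and `asset_hrefs` is a hypothetical helper.

```python
import json

# Hypothetical STAC Item for illustration only. The ID, geometry, dates,
# asset URL and media type are placeholders, not real NOC catalog entries.
item = {
    "type": "Feature",
    "stac_version": "1.0.0",
    "id": "example-item",
    "geometry": {
        "type": "Polygon",
        "coordinates": [
            [[-180, -90], [180, -90], [180, 90], [-180, 90], [-180, -90]]
        ],
    },
    "bbox": [-180, -90, 180, 90],
    "properties": {
        # A null datetime is paired with an explicit start/end range.
        "datetime": None,
        "start_datetime": "2000-01-01T00:00:00Z",
        "end_datetime": "2000-12-31T00:00:00Z",
    },
    "links": [],
    "assets": {
        "data": {
            "href": "https://example.org/data/example.nc",
            "type": "application/x-netcdf",  # illustrative media type
        }
    },
}


def asset_hrefs(stac_item):
    """Return {asset key: href} for every Asset attached to an Item."""
    return {key: asset["href"] for key, asset in stac_item["assets"].items()}


# STAC objects are plain JSON, so they survive a round trip through json.
decoded = json.loads(json.dumps(item))
print(asset_hrefs(decoded))
```

In practice these records are read and written by STAC tooling rather than by hand; the sketch is only meant to show which parts of an **Item** carry the geometry, the timestamps and the links to data.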
| 37 | +## NOC Ocean Modelling STAC |
| 38 | + |
| 39 | +National Oceanography Centre model outputs are organised in the `noc-model-stac` **Catalog**, which serves as the highest-level STAC object in our hierarchy. |
| 40 | + |
| 41 | +``` |
| 42 | +Catalog: noc-model-stac |
| 43 | +| |
| 44 | +└── Collection: noc-npd-era5 |
| 45 | + | |
| 46 | + ├── Catalog: npd-eorca1-era5v1 |
| 47 | + | ├── Catalog: gn |
| 48 | + | └── Catalog: tn |
| 49 | + | |
| 50 | + ├── Catalog: npd-eorca025-era5v1 |
| 51 | + | ├── Catalog: gn |
| 52 | + | └── Catalog: tn |
| 53 | + | |
| 54 | + └── Catalog: npd-eorca12-era5v1 |
| 55 | + ├── Catalog: gn |
| 56 | + └── Catalog: tn |
| 57 | +
|
| 58 | +``` |
| 59 | + |
| 60 | +The `noc-model-stac` **Catalog** is comprised of STAC **Collections** which group **Items** belonging to the same modelling activity. In the example above, we have included the NOC Near-Present Day simulations produced using ERA-5 atmospheric forcing in the `noc-npd-era5` **Collection**. |
| 61 | + |
| 62 | +The `noc-npd-era5` **Collection** in turn comprises one **Catalog** per model configuration (`npd-eorca1-era5v1`, `npd-eorca025-era5v1` and `npd-eorca12-era5v1`). Each of these contains two **Catalogs** used to differentiate between ocean model outputs stored on their native global model grid `gn` and diagnostics stored along transects of the native model grid `tn`. |
| 63 | + |
| 64 | +Inside each of the `gn` and `tn` **Catalogs** are STAC **Items** corresponding to ocean model output datasets. These are named according to both the location on the native NEMO model grid where variables are stored and the temporal frequency at which they are output by the NEMO ocean model (see table below). |
| 65 | + |
| 66 | +| Example | Grid | Frequency | |
| 67 | +| ----------- | ------------------ | --------------| |
| 68 | +| `T1y` | **T** (scalar) | Annual Means | |
| 69 | +| `U1m` | **U** (vector) | Monthly Means | |
| 70 | +| `I5d` | **I** (sea ice) | 5-day Means | |
| 71 | +| `W1d` | **W** (vector) | Daily Means | |
| 72 | + |
| 73 | + |
| 74 | +To improve the accessibility of NOC ocean model assets, each **Item** is given a unique path-like identifier describing its relationship within the wider `noc-model-stac` **Catalog**. |
| 75 | + |
| 76 | +For example, `noc-npd-era5/npd-eorca1-era5v1/gn/T1y` identifies the **Item** containing annual-mean scalar variables (e.g., conservative temperature) for the eORCA1-ERA5v1 (1-degree) simulation contained in the NOC Near-Present Day ERA-5 **Collection**. |
| 77 | + |
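The structure of these identifiers can be made explicit with a few lines of standard-library Python. This is a minimal sketch under the assumption that every identifier has exactly four `/`-separated parts, as in the example above; `ItemPath`, `parse_item_id` and `describe_item` are hypothetical helpers written for this guide and are not part of the `OceanDataCatalog` API.

```python
from typing import NamedTuple, Optional, Tuple

# Frequency suffixes taken from the naming table above.
FREQUENCIES = {
    "1y": "Annual Means",
    "1m": "Monthly Means",
    "5d": "5-day Means",
    "1d": "Daily Means",
}


class ItemPath(NamedTuple):
    collection: str     # e.g. noc-npd-era5
    configuration: str  # e.g. npd-eorca1-era5v1
    platform: str       # gn or tn
    item: str           # e.g. T1y


def parse_item_id(item_id: str) -> ItemPath:
    """Split a path-like Item identifier into its four components."""
    parts = item_id.split("/")
    if len(parts) != 4:
        raise ValueError(f"expected 4 '/'-separated parts, got {len(parts)}")
    return ItemPath(*parts)


def describe_item(item: str) -> Optional[Tuple[str, str]]:
    """Return (grid letter, frequency label) for names such as 'T1y'.

    Names that do not follow the <grid><frequency> pattern return None.
    """
    grid, freq = item[:1], item[1:]
    if freq not in FREQUENCIES:
        return None
    return grid, FREQUENCIES[freq]


path = parse_item_id("noc-npd-era5/npd-eorca1-era5v1/gn/T1y")
print(path.platform, describe_item(path.item))
```

Returning `None` for names that do not match the table keeps the helper from guessing at Items that use a different naming pattern.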
| 78 | +## How To... |
| 79 | + |
| 80 | +**A Quickstart Guide to Common Operations using OceanDataCatalog** |
| 81 | + |
| 82 | +Below, we briefly introduce some of the most common `OceanDataCatalog` operations in a concise how-to guide (inspired by the excellent documentation of [**Icechunk**](https://icechunk.io/en/latest/howto/)). |
| 83 | + |
| 84 | +### Create a new OceanDataCatalog instance |
| 85 | + |
| 86 | +We can assign a new instance of the **OceanDataCatalog** to the object `catalog` using: |
| 87 | + |
| 88 | +```python |
| 89 | +catalog = OceanDataCatalog(catalog_name="noc-model-stac") |
| 90 | +``` |
| 91 | + |
| 92 | +Here, we use the `noc-model-stac` (default) **Catalog**. |
| 93 | + |
| 94 | +### Explore Available Collections |
| 95 | + |
| 96 | +We can return a list of available **Collections** contained in the root **Catalog** (`noc-model-stac`) using: |
| 97 | + |
| 98 | +```python |
| 99 | +catalog.available_collections |
| 100 | +``` |
| 101 | + |
| 102 | +### Searching the OceanDataCatalog |
| 103 | + |
| 104 | +We search for **Items** contained in the root **Catalog** using: |
| 105 | + |
| 106 | +```python |
| 107 | +catalog.search(collection='noc-npd-jra55', standard_name='sea_surface_salinity') |
| 108 | +``` |
| 109 | + |
| 110 | +In the example above, we confine our search to the `noc-npd-jra55` collection before searching for any **Item** which includes a variable with the standard name `sea_surface_salinity`. |
| 111 | + |
| 112 | +Users can search the root **Catalog** using any combination of the following parameters: |
| 113 | + |
| 114 | +* `collection` : Activity **Collection** name (e.g., `noc-npd-era5`). |
| 115 | + |
| 116 | +* `platform` : Platform **Catalog** name (e.g., `gn`). |
| 117 | + |
| 118 | +* `variable_name` : Variable name contained in **Item** **Asset** (e.g., `tos_con`). |
| 119 | + |
| 120 | +* `standard_name` : Standard variable name contained in **Item** **Asset** (e.g., `sea_surface_temperature`). |
| 121 | + |
| 122 | +* `item_name` : Substring to filter Item IDs (e.g., `domain`). |
| 123 | + |
| 124 | +**Important:** Once a search has been performed on the root **Catalog**, the `.Collection` and `.Items` attributes are populated according to the results of the last query performed. In the example above, the `catalog.Collection` attribute would return the `noc-npd-jra55` STAC **Collection** and the `catalog.Items` attribute would return a list of STAC **Items** meeting the specified criteria. |
| 125 | + |
| 126 | +### Opening a dataset using the OceanDataCatalog |
| 127 | + |
| 128 | +Once we have searched the `noc-model-stac` **Catalog** and found the unique identifier of the **Item** we would like to explore further, we can open its associated **Asset** as a lazy `xarray.Dataset` using the `.open_dataset()` method: |
| 129 | + |
| 130 | +```python |
| 131 | +catalog.open_dataset(id="noc-npd-era5/npd-eorca1-era5v1/gn/T1m", |
| 132 | + variable_names=["tos_con", "sos_abs"], |
| 133 | + start_datetime='2004-01', |
| 134 | + end_datetime='2008-12', |
| 135 | + bbox=(-65, 45, 10, 65) |
| 136 | + ) |
| 137 | +``` |
| 138 | + |
| 139 | +In the example above, we open the monthly-mean sea surface temperature `tos_con` and sea surface salinity `sos_abs` variables from the eORCA1-ERA5v1 NOC Near-Present Day simulation, subsetting the data to the period 2004-2008 and to a geographical bounding box with limits (-65°E to 10°E) & (45°N to 65°N). |
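The `bbox` tuple in the example reads as (minimum longitude, minimum latitude, maximum longitude, maximum latitude). That ordering is inferred from the limits quoted above, not from the library's source, so treat it as an assumption. The sketch below uses hypothetical helpers (`BBox`, `parse_bbox`, `contains`) to check a tuple before passing it on; whether `.open_dataset()` performs similar validation itself is not stated here.

```python
from typing import NamedTuple, Sequence


class BBox(NamedTuple):
    min_lon: float
    min_lat: float
    max_lon: float
    max_lat: float


def parse_bbox(bbox: Sequence[float]) -> BBox:
    """Validate a (min_lon, min_lat, max_lon, max_lat) tuple.

    Assumption: boxes do not cross the antimeridian, so min_lon <= max_lon.
    """
    if len(bbox) != 4:
        raise ValueError("bbox must have exactly four values")
    min_lon, min_lat, max_lon, max_lat = bbox
    if not -90 <= min_lat <= max_lat <= 90:
        raise ValueError("latitudes must satisfy -90 <= min_lat <= max_lat <= 90")
    if min_lon > max_lon:
        raise ValueError("min_lon must not exceed max_lon")
    return BBox(min_lon, min_lat, max_lon, max_lat)


def contains(box: BBox, lon: float, lat: float) -> bool:
    """Return True if the point (lon, lat) lies inside the box."""
    return box.min_lon <= lon <= box.max_lon and box.min_lat <= lat <= box.max_lat


box = parse_bbox((-65, 45, 10, 65))
print(contains(box, -30, 55), contains(box, 20, 55))
```

A common mistake is to swap the latitude and longitude positions; with the values from the example, `parse_bbox((45, -65, 65, 10))` would still pass the range checks, so it is worth confirming the order by testing a known point with `contains`.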