Draft
Changes from 13 commits
Commits
19 commits
c129966
Add recognised intake-esm datastores on NCI systems to config_develop…
charles-turner-1 Feb 4, 2025
b1b76fb
Skeleton
charles-turner-1 Feb 5, 2025
dd73d1d
Playing around
charles-turner-1 Feb 5, 2025
ed1676b
Almost at a working IntakeDataset.load()
charles-turner-1 Feb 12, 2025
fa1ea2e
Working intake-esm implementation - probably still some kinks to iron…
charles-turner-1 Feb 25, 2025
648f119
Working with multiple catalogues per project
charles-turner-1 Mar 12, 2025
2b91fec
Cleanup - mypy & ruff errors
charles-turner-1 Mar 13, 2025
c7b8ffb
Remove WIP
charles-turner-1 Mar 13, 2025
31b35cb
Update dependencies & dev environment
charles-turner-1 Mar 13, 2025
a8532a5
Pre-commit modifications
charles-turner-1 Mar 13, 2025
7e56959
Merge branch 'main' into intake-esm
charles-turner-1 Mar 13, 2025
568cb8d
Fixed most of codacy (mypy-strict?) gripes
charles-turner-1 Mar 13, 2025
91fee56
Fix typo
charles-turner-1 Mar 13, 2025
9d894b9
Beginning to work on Bouwe's comments (WIP)
charles-turner-1 Apr 2, 2025
59d0d02
Updates - restructured esmvalcore/data/intake following Bouwe's sugge…
charles-turner-1 Apr 3, 2025
2050081
Reorder imports (ruff maybe?)
charles-turner-1 May 6, 2025
59e4205
Add `_read_facets` to intake configuration: see https://github.com/in…
charles-turner-1 May 12, 2025
2527059
Add `merge_intake_seach_history` function (see https://github.com/int…
charles-turner-1 May 13, 2025
4641965
Merge branch 'main' into intake-esm
charles-turner-1 May 13, 2025
2 changes: 2 additions & 0 deletions environment.yml
@@ -18,6 +18,8 @@ dependencies:
- fire
- geopy
- humanfriendly
- intake >=2.0.0
- intake-esm >=2025.2.3
- iris >=3.11 # 3.11 first to support Numpy 2 and Python 3.13
- iris-esmf-regrid >=0.11.0
- iris-grib >=0.20.0 # github.com/ESMValGroup/ESMValCore/issues/2535
74 changes: 74 additions & 0 deletions esmvalcore/config-developer.yml
@@ -38,6 +38,34 @@ CMIP6:
SYNDA: '{activity}/{institute}/{dataset}/{exp}/{ensemble}/{mip}/{short_name}/{grid}/{version}'
NCI: '{activity}/{institute}/{dataset}/{exp}/{ensemble}/{mip}/{short_name}/{grid}/{version}'
input_file: '{short_name}_{mip}_{dataset}_{exp}_{ensemble}_{grid}*.nc'
catalogs:
Review comment (Member):
The plan was to not further extend config-developer, but rather move this to the new configuration that lives in ~/.config/esmvaltool. See #2371 for an example of what we thought the configuration should look like.

NCI:
- file:
/g/data/fs38/catalog/v2/esm/catalog.json
facets:
activity: activity_id
dataset: source_id
ensemble: member_id
exp: experiment_id
grid: grid_label
institute: institution_id
mip: table_id
short_name: variable_id
version: version
frequency: frequency
- file:
/g/data/oi10/catalog/v2/esm/catalog.json
facets:
activity: activity_id
dataset: source_id
ensemble: member_id
exp: experiment_id
grid: grid_label
institute: institution_id
mip: table_id
short_name: variable_id
version: version
frequency: frequency
output_file: '{project}_{dataset}_{mip}_{exp}_{ensemble}_{short_name}_{grid}'
cmor_type: 'CMIP6'
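
The `facets` blocks in the catalog entries above map recipe facet names to catalog column names. A minimal sketch of how such a mapping could be turned into catalog search terms is shown here. The `build_query` helper and the example recipe facets are hypothetical and not part of this PR; the mapping itself is copied from the NCI CMIP6 entries.

```python
# Recipe facet -> catalog column, copied from the NCI CMIP6 catalog entries.
facet_mapping = {
    "activity": "activity_id",
    "dataset": "source_id",
    "ensemble": "member_id",
    "exp": "experiment_id",
    "grid": "grid_label",
    "institute": "institution_id",
    "mip": "table_id",
    "short_name": "variable_id",
    "version": "version",
    "frequency": "frequency",
}

# Hypothetical facets, as they might be given for a dataset in a recipe.
recipe_facets = {
    "project": "CMIP6",
    "dataset": "ACCESS-ESM1-5",
    "exp": "historical",
    "mip": "Omon",
    "short_name": "tos",
}


def build_query(facets, mapping):
    """Translate recipe facets into catalog search terms (hypothetical helper).

    Facets that are not set in the recipe are left out of the query, and
    facets without a mapping (such as 'project') are ignored.
    """
    return {
        column: facets[facet]
        for facet, column in mapping.items()
        if facets.get(facet) is not None
    }


query = build_query(recipe_facets, facet_mapping)
print(query)
```

Only facets that are both set and mapped end up in the query, so the same recipe facets can be searched against catalogs with different column names.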

@@ -56,6 +84,36 @@ CMIP5:
SMHI: '{dataset}/{ensemble}/{exp}/{frequency}'
SYNDA: '{institute}/{dataset}/{exp}/{frequency}/{modeling_realm}/{mip}/{ensemble}/{version}'
input_file: '{short_name}_{mip}_{dataset}_{exp}_{ensemble}*.nc'
catalogs:
NCI:
- file:
/g/data/rr3/catalog/v2/esm/catalog.json
facets:
# mapping from recipe facets to intake-esm catalog facets
# TODO: Fix these when Gadi is back up
activity: activity_id
dataset: source_id
ensemble: ensemble
exp: experiment
grid: grid_label
institute: institution_id
mip: table_id
short_name: variable
version: version
- file:
/g/data/al33/catalog/v2/esm/catalog.json
facets:
# mapping from recipe facets to intake-esm catalog facets
# TODO: Fix these when Gadi is back up
activity: activity_id
dataset: source_id
ensemble: ensemble
exp: experiment
institute: institute
mip: table
short_name: variable
version: version
timerange: time_range
output_file: '{project}_{dataset}_{mip}_{exp}_{ensemble}_{short_name}'

CMIP3:
@@ -156,6 +214,22 @@ CORDEX:
ESGF: '{project.lower}/output/{domain}/{institute}/{driver}/{exp}/{ensemble}/{dataset}/{rcm_version}/{frequency}/{short_name}/{version}'
SYNDA: '{domain}/{institute}/{driver}/{exp}/{ensemble}/{dataset}/{rcm_version}/{frequency}/{short_name}/{version}'
input_file: '{short_name}_{domain}_{driver}_{exp}_{ensemble}_{institute}-{dataset}_{rcm_version}_{mip}*.nc'
catalogs:
NCI:
files:
- /g/data/oi10/catalog/v2/esm/catalog.json
facets:
# mapping from recipe facets to intake-esm catalog facets
# TODO: Fix these when Gadi is back up
Review comment (Member):

You could also test on DKRZ Levante, where the intake catalogs are located at /pool/data/Catalogs/dkrz_cmip6_disk.json

activity: activity_id
dataset: source_id
ensemble: member_id
exp: experiment_id
grid: grid_label
institute: institution_id
mip: table_id
short_name: variable_id
version: version
output_file: '{project}_{institute}_{dataset}_{rcm_version}_{driver}_{domain}_{mip}_{exp}_{ensemble}_{short_name}'
cmor_type: 'CMIP5'
cmor_path: 'cordex'
5 changes: 5 additions & 0 deletions esmvalcore/intake/__init__.py
@@ -0,0 +1,5 @@
"""Find files using an intake-esm catalog and load them."""

from ._dataset import IntakeDataset, load_catalogs

__all__ = ["IntakeDataset", "load_catalogs"]

164 changes: 164 additions & 0 deletions esmvalcore/intake/_dataset.py
@@ -0,0 +1,164 @@
"""Import datasets using Intake-ESM."""

import logging
from numbers import Number
from pathlib import Path
from typing import Any, Sequence

# import isodate
import intake
import intake_esm

from esmvalcore.config import CFG
from esmvalcore.config._config import get_project_config
from esmvalcore.dataset import Dataset, File
from esmvalcore.local import LocalFile

__all__ = ["IntakeDataset", "load_catalogs", "clear_catalog_cache"]

logger = logging.getLogger(__name__)

_CACHE: dict[Path, intake_esm.core.esm_datastore] = {}


def clear_catalog_cache():
[Codacy notice, esmvalcore/intake/_dataset.py#L24] Function is missing a return type annotation. Use "-> None" if function does not return a value. (no-untyped-def)
"""Clear the catalog cache."""
_CACHE.clear()


def load_catalogs(
project: str, drs: dict
[Codacy notice, esmvalcore/intake/_dataset.py#L30] Missing type parameters for generic type "dict". (type-arg)
) -> tuple[list[intake_esm.core.esm_datastore], list[dict[str, str]]]:
"""Load all intake-esm catalogs for a project and their associated facet mappings.
Parameters
----------
project : str
The project name, e.g. 'CMIP6'.
drs : dict
The DRS configuration. Can be obtained from the global configuration drs
field, e.g. CFG['drs'].
Returns
-------
list[intake_esm.core.esm_datastore]
The catalogs configured for the project at the selected site.
list[dict[str, str]]
The facet mappings, one per catalog - each a dictionary mapping
ESMValCore dataset facet names to the fields in the intake-esm datastore.
"""
catalog_info: dict[str, Any] = get_project_config(project).get(
[Codacy notice, esmvalcore/intake/_dataset.py#L50] Call to untyped function "get_project_config" in typed context. (no-untyped-call)
"catalogs", {}
)
site = drs.get(project, "default")
if site not in catalog_info:
return [], []

catalog_urls = [
Path(catalog.get("file")).expanduser()
for catalog in catalog_info[site]
]
facet_list = [catalog.get("facets") for catalog in catalog_info[site]]

for catalog_url in catalog_urls:
if catalog_url not in _CACHE:
logger.info(
"Loading intake-esm catalog (this may take some time): %s",
catalog_url,
)
_CACHE[catalog_url] = intake.open_esm_datastore(catalog_url)
logger.info("Successfully loaded catalog %s", catalog_url)

return ([_CACHE[cat_url] for cat_url in catalog_urls], facet_list)


class IntakeDataset(Dataset):

Review comment (Member, @bouweandela, Mar 21, 2025):
I'm having some reservations about subclassing the Dataset class for this purpose:

  • A typical use case for many of our users will be that they have most data available from a central catalog that is managed by a central administrator, but want to augment that with the ability to download some files themselves. In that case, it is really useful to have the ability to deduplicate (e.g. pick the latest version of a file). I'm not sure if this can be achieved by subclassing the Dataset object.
  • We will likely want to add support for other catalogs as well, e.g. intake-esgf, xcube, and STAC. If we need a new Dataset class for each of these, it may become confusing to users.
  • How will this work from the recipe?

As an alternative, would it be an option to load the available data sources from the configuration / Dataset.session and then make the Dataset.files method loop over the available sources and deduplicate input files?
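
The alternative suggested in this comment (collect files from every available source, then deduplicate) could look roughly like the sketch below. Everything here is hypothetical: `FoundFile` and `deduplicate` are not ESMValCore classes or functions, the file names are made up, and versions are compared as plain strings, which assumes a fixed-width form such as `vYYYYMMDD`.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FoundFile:
    """Hypothetical record for a file reported by some data source."""

    name: str
    version: str
    source: str


def deduplicate(files):
    """Keep one entry per file name, preferring the highest version string."""
    latest = {}
    for found in files:
        current = latest.get(found.name)
        if current is None or found.version > current.version:
            latest[found.name] = found
    return [latest[name] for name in sorted(latest)]


# Files from a centrally managed catalog and from a user's own downloads.
catalog_files = [
    FoundFile("tos_Omon_ACCESS-ESM1-5_historical_r1i1p1f1_gn.nc", "v20191115", "catalog"),
]
downloaded_files = [
    FoundFile("tos_Omon_ACCESS-ESM1-5_historical_r1i1p1f1_gn.nc", "v20210101", "download"),
    FoundFile("pr_Amon_ACCESS-ESM1-5_historical_r1i1p1f1_gn.nc", "v20191115", "download"),
]

unique_files = deduplicate(catalog_files + downloaded_files)
for found in unique_files:
    print(found.name, found.version, found.source)
```

Because deduplication happens after all sources have been queried, adding another catalog type would only require another producer of file records, not another Dataset subclass.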

"""Load data using Intake-ESM."""

def __init__(self, **facets):
[Codacy notice, esmvalcore/intake/_dataset.py#L78] Function is missing a type annotation. (no-untyped-def)
project = facets["project"]
self.catalog, self._facets = load_catalogs(project, CFG["drs"])
self._unmapped_facets = {}
super().__init__(**facets)

@property
def files(self) -> Sequence[File]:
[Codacy notice, esmvalcore/intake/_dataset.py#L85] Missing docstring in public method (D102)
if self._files is None:
self._files = self._find_files(self.facets, CFG["drs"])
return self._files

@files.setter
def files(self, value: Sequence[File]):
[Codacy notice, esmvalcore/intake/_dataset.py#L91] Function is missing a return type annotation. (no-untyped-def)
"""Manually set the files for the dataset."""
self._files = value

@property
def filenames(self) -> Sequence[str]:
"""String representation of the filenames in the dataset."""
[Codacy notice, esmvalcore/intake/_dataset.py#L97] First line should be in imperative mood; try rephrasing (found 'String') (D401)
return [str(f) for f in self.files]

def _find_files( # type: ignore[override]
[Codacy notice, esmvalcore/intake/_dataset.py#L100] Number of parameters was 1 in 'Dataset._find_files' and is now 3 in overriding 'IntakeDataset._find_files' method (arguments-differ)
self,
facet_map: dict[str, str | Sequence[str] | Number],
drs: dict[str, Any],
) -> Sequence[File]:
"""Find files for variable in all intake-esm catalogs associated with a project.
[Codacy notice, esmvalcore/intake/_dataset.py#L105] Missing argument descriptions in the docstring (argument(s) facet_map are missing descriptions in '_find_files' docstring) (D417)
As a side effect, sets the _unmapped_facets attribute - this is used to
cache facets which are not in the datastore.
Parameters
----------
facet_map : dict
A dict mapping the ESMValCore facet names used to initialise the
IntakeDataset object to their values. For example,
```
ACCESS_ESM1_5 = IntakeDataset(
short_name='tos',
project='CMIP6',
)
```
would result in a facet_map of {'short_name': 'tos', 'project': 'CMIP6'}.
drs : dict
The DRS configuration. Can be obtained from the global configuration drs
field, e.g. CFG['drs'].
"""
if not isinstance(facet_map["project"], str):
raise TypeError(
"The project facet must be a string for Intake Datasets."
)

catalogs, facets_list = load_catalogs(facet_map["project"], drs)
if not catalogs:
return []

files = []

for catalog, facets in zip(catalogs, facets_list, strict=False):
query = {val: facet_map.get(key) for key, val in facets.items()}
query = {key: val for key, val in query.items() if val is not None}

unmapped = {
key: val for key, val in facet_map.items() if key not in facets
}
unmapped.pop("project", None)

self._unmapped_facets = unmapped

selection = catalog.search(**query)

# Select latest version
if "version" in facets and "version" not in facet_map:
latest_version = max(
selection.unique().version
) # These are strings - need to double check the sorting here.
facet_map["version"] = latest_version
query = {
facets["version"]: latest_version,
}
selection = selection.search(**query)

files += [LocalFile(f) for f in selection.unique().path]

self.augment_facets()
return files
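
The inline comment on the `max(...)` call above notes that versions are strings and that the sorting needs double-checking. The sketch below shows when plain string comparison is safe and when it is not. `version_key` is a hypothetical helper, not part of this PR, and the version labels are made-up examples.

```python
import re


def version_key(version: str) -> tuple[int, str]:
    """Sort key that compares versions like 'v20200101' or 'v10' numerically.

    Labels that are not 'v' followed by digits (an assumption about what a
    catalog might contain) sort below all numeric versions.
    """
    match = re.fullmatch(r"v?(\d+)", version)
    if match is None:
        return (-1, version)
    return (int(match.group(1)), version)


# Fixed-width date versions: string order and numeric order agree.
date_versions = ["v20191115", "v20200101", "v20190308"]
print(max(date_versions), max(date_versions, key=version_key))

# Variable-width versions: string order picks the wrong entry.
short_versions = ["v2", "v10", "v1"]
print(max(short_versions), max(short_versions, key=version_key))
```

For catalogs that only contain `vYYYYMMDD` versions, the plain `max` used in the PR gives the same answer as the numeric key; the difference only appears when version labels vary in length.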

2 changes: 2 additions & 0 deletions pyproject.toml
@@ -45,6 +45,8 @@ dependencies = [
"fire",
"geopy",
"humanfriendly",
"intake>=2.0.0",
"intake-esm>=2025.2.3",
"iris-grib>=0.20.0", # github.com/ESMValGroup/ESMValCore/issues/2535
"isodate>=0.7.0",
"jinja2",