Commit 346b33b

authored
Merge pull request #227 from Climate-REF/pmp-reference
2 parents f00a5db + 85f7f9e commit 346b33b

31 files changed

Lines changed: 483 additions & 264 deletions

.github/workflows/ci-integration.yaml

Lines changed: 1 addition & 1 deletion
@@ -41,7 +41,7 @@ jobs:
       - name: Run tests
         run: |
           make fetch-test-data
-          uv run python scripts/fetch-ilamb-data.py ilamb.txt
+          uv run ref datasets fetch-data ilamb
           make test
       # Upload the scratch and results directories as artifacts
       - name: Upload scratch artifacts

Makefile

Lines changed: 5 additions & 5 deletions
@@ -94,7 +94,7 @@ test-metrics-esmvaltool: ## run the tests

 .PHONY: test-metrics-ilamb
 test-metrics-ilamb: ## run the tests
-	uv run --package cmip_ref_metrics_ilamb python ./scripts/fetch-ilamb-data.py test.txt
+	uv run ref datasets fetch-data --registry ilamb-test
 	uv run --package cmip_ref_metrics_ilamb \
 		pytest packages/ref-metrics-ilamb \
 		-r a -v --doctest-modules --cov=packages/ref-metrics-ilamb/src --cov-report=term --cov-append
@@ -170,13 +170,13 @@ virtual-environment: ## update virtual environment, create a new one if it does
 .PHONY: fetch-test-data
 fetch-test-data: ## Download any data needed by the test suite
 	uv run ref datasets fetch-sample-data
-	uv run python ./scripts/fetch-ilamb-data.py test.txt
+	uv run ref datasets fetch-data --registry ilamb-test

 .PHONY: fetch-ref-data
 fetch-ref-data: ## Download reference data needed by providers and (temporarily) not in obs4mips
-	uv run python ./scripts/fetch-ilamb-data.py ilamb.txt
-	uv run python ./scripts/fetch-ilamb-data.py iomb.txt
+	uv run ref datasets fetch-data --registry ilamb
+	uv run ref datasets fetch-data --registry iomb

 .PHONY: update-sample-data-registry
 update-sample-data-registry: ## Update the sample data registry
-	curl --output packages/ref/src/cmip_ref/datasets/sample_data.txt https://raw.githubusercontent.com/Climate-REF/ref-sample-data/refs/heads/main/registry.txt
+	curl --output packages/ref/src/cmip_ref/dataset_registry/sample_data.txt https://raw.githubusercontent.com/Climate-REF/ref-sample-data/refs/heads/main/registry.txt

README.md

Lines changed: 6 additions & 7 deletions
@@ -73,23 +73,22 @@ The REF is designed to enable Modelling Centers to quickly evaluate their data a
 The data under test here may not be published to ESGF yet,
 but the REF can still be used to evaluate it.

-For the tutorials and test suite,
-we maintain a set of test data that can be used to evaluate the REF.
-These datasets can be fetched using
+The REF requires some reference data to be available to run the metrics. Some of the reference datasets needed by the REF are not available on ESGF yet. The following commands download the reference datasets needed by the REF into a local directory (`datasets/obs4ref`), along with some sample CMIP6 datasets that are used in our test suite:

 ```bash
-ref datasets fetch-sample-data
+ref datasets fetch-data --registry obs4ref --output-dir datasets/obs4ref
+ref datasets fetch-data --registry sample-data --output-dir datasets/sample-data
 ```

 These datasets can then be ingested into the REF and the metrics solved using:

 ```bash
-uv run ref datasets ingest --source-type cmip6 ./tests/test-data/sample-data/CMIP6/
-uv run ref datasets ingest --source-type obs4mips ./tests/test-data/sample-data/obs4MIPs/
+uv run ref datasets ingest --source-type cmip6 datasets/sample-data/CMIP6/
+uv run ref datasets ingest --source-type obs4mips datasets/obs4ref
 ref solve
 ```

-The executed metrics can then be viewed using the `ref executions list-groups` command.
+The executed metrics can then be viewed using the `ref executions list-groups` and `ref executions inspect` commands.

 ### As a devops engineer

changelog/227.feature.md

Lines changed: 3 additions & 0 deletions
@@ -0,0 +1,3 @@
+Enabled metric providers to register registries of datasets for download.
+This unifies the fetching of datasets across the REF via the `ref datasets fetch-data` CLI command.
+Added registries for the datasets that haven't been published to obs4MIPs yet (`obs4REF`) as well as PMP annual cycle datasets.
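
As background for the registries mentioned in this changelog entry: a Pooch registry file lists one file per line as a relative path followed by a checksum, optionally followed by a per-file URL, and lines starting with `#` are comments. The sketch below parses that layout using only the standard library. It illustrates the file format and is not pooch's own parser; `parse_registry` and the example entries are hypothetical.

```python
def parse_registry(text: str) -> dict[str, str]:
    """Parse pooch-style registry text into a {file name: checksum} mapping."""
    entries: dict[str, str] = {}
    for line_number, raw_line in enumerate(text.splitlines(), start=1):
        line = raw_line.strip()
        if not line or line.startswith("#"):
            # Skip blank lines and comments
            continue
        parts = line.split()
        if len(parts) not in (2, 3):
            raise ValueError(f"Line {line_number}: expected 2 or 3 columns, got {len(parts)}")
        file_name, checksum = parts[0], parts[1]
        # An optional third column (a per-file URL) is ignored in this simplified sketch
        entries[file_name] = checksum.lower()
    return entries


# Hypothetical registry contents; the names and hashes are made up
EXAMPLE_REGISTRY = """\
# Example entries
obs/tas_example.nc sha256:0123ABCD
obs/pr_example.nc md5:89ef
"""

registry = parse_registry(EXAMPLE_REGISTRY)
for name, checksum in registry.items():
    print(name, checksum)
```

This simplified version does not handle file names that contain spaces.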
Lines changed: 149 additions & 0 deletions
@@ -0,0 +1,149 @@
+"""
+Data registries for non-published reference data
+
+These data are placeholders until they have been added to obs4MIPs.
+The AR7 FT REF requires that reference datasets are openly licensed before they are included
+in any published data catalogs.
+"""
+
+import importlib.resources
+import os
+import pathlib
+import shutil
+
+import pooch
+from loguru import logger
+
+
+def fetch_all_files(registry: pooch.Pooch, output_dir: pathlib.Path | None, symlink: bool = False) -> None:
+    """
+    Fetch all files associated with a pooch registry and write them to an output directory.
+
+    Pooch fetches, caches and validates the downloaded files.
+    Subsequent calls to this function will not refetch any previously downloaded files.
+
+    Parameters
+    ----------
+    registry
+        Pooch registry containing a set of files that should be fetched.
+    output_dir
+        The root directory to write the files to.
+
+        The directory will be created if it doesn't exist,
+        and files that already exist there are left untouched.
+
+        If no directory is provided, the files will be fetched from the remote server,
+        but not copied anywhere.
+    symlink
+        If True, symlink all files to this directory.
+        Otherwise, perform a copy.
+    """
+    if output_dir:
+        output_dir.mkdir(parents=True, exist_ok=True)
+
+    for key in registry.registry.keys():
+        fetch_file = registry.fetch(key)
+
+        if output_dir is None:
+            # Just warm the cache and move on to the next file
+            continue
+
+        linked_file = output_dir / key
+        linked_file.parent.mkdir(parents=True, exist_ok=True)
+        if not linked_file.exists():  # pragma: no cover
+            if symlink:
+                logger.info(f"Linking {key} to {linked_file}")
+
+                os.symlink(fetch_file, linked_file)
+            else:
+                logger.info(f"Copying {key} to {linked_file}")
+                shutil.copy(fetch_file, linked_file)
+        else:
+            logger.info(f"File {linked_file} already exists. Skipping.")
+
+
+class DatasetRegistryManager:
+    """
+    A collection of reference dataset registries
+
+    The REF requires reference datasets
+    in addition to the obs4MIPs data which can be downloaded via ESGF.
+    Each provider may have different sets of reference data that are needed.
+    These provider-specific datasets are either not yet available in obs4MIPs,
+    or are post-processed from obs4MIPs.
+
+    A dataset registry consists of a file that contains a list of files and checksums,
+    in combination with a base URL that is used to fetch the files.
+    [Pooch](https://www.fatiando.org/pooch/latest/) is used within the `DatasetRegistryManager`
+    to manage the caching, downloading and validation of the files.
+
+    All datasets that are registered here are expected to be openly licensed and freely available.
+    """
+
+    def __init__(self) -> None:
+        self._registries: dict[str, pooch.Pooch] = {}
+
+    def __getitem__(self, item: str) -> pooch.Pooch:
+        """
+        Get a registry by name
+        """
+        return self._registries[item]
+
+    def keys(self) -> list[str]:
+        """
+        Get the list of registry names
+        """
+        return list(self._registries.keys())
+
+    def register(  # noqa: PLR0913
+        self,
+        name: str,
+        base_url: str,
+        package: str,
+        resource: str,
+        cache_name: str | None = None,
+        version: str | None = None,
+    ) -> None:
+        """
+        Register a new dataset registry
+
+        This will create a new Pooch registry and add it to the list of registries.
+        This is typically used by a provider to register a new collection of datasets at runtime.
+
+        Parameters
+        ----------
+        name
+            Name of the registry
+
+            This is used to identify the registry
+        base_url
+            Common URL prefix for the files
+        package
+            Name of the package containing the registry resource.
+        resource
+            Name of the resource in the package that contains a list of files and checksums.
+
+            This must be formatted in a way that is expected by pooch.
+        version
+            The version of the data.
+
+            Changing the version will invalidate the cache and force a re-download of the data.
+        cache_name
+            Name to use to generate the cache directory.
+
+            This defaults to `"ref"` if not provided.
+        """
+        if cache_name is None:
+            cache_name = "ref"
+
+        registry = pooch.create(
+            path=pooch.os_cache(cache_name),
+            base_url=base_url,
+            version=version,
+            env="REF_METRICS_DATA_DIR",
+        )
+        registry.load_registry(str(importlib.resources.files(package) / resource))
+        self._registries[name] = registry
+
+
+dataset_registry_manager = DatasetRegistryManager()
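
The copy-or-symlink step in `fetch_all_files` can be tried without a network connection. The sketch below is a minimal stand-in that assumes the files are already cached locally: a plain dictionary replaces the pooch registry, and `materialise_files` is a hypothetical helper name that does not appear in the commit. It shows the behaviour on a repeated call, where existing files are skipped.

```python
import os
import pathlib
import shutil
import tempfile


def materialise_files(
    cached_files: dict[str, pathlib.Path],
    output_dir: pathlib.Path,
    symlink: bool = False,
) -> list[str]:
    """Copy or symlink cached files into output_dir and return the keys that were written."""
    written = []
    for key, cached_path in cached_files.items():
        target = output_dir / key
        target.parent.mkdir(parents=True, exist_ok=True)
        if target.exists():
            # Existing files are left untouched, as in the skip branch of fetch_all_files
            continue
        if symlink:
            os.symlink(cached_path, target)
        else:
            shutil.copy(cached_path, target)
        written.append(key)
    return written


with tempfile.TemporaryDirectory() as tmp:
    root = pathlib.Path(tmp)
    cache_dir = root / "cache"
    cache_dir.mkdir()
    cached = cache_dir / "example.nc"
    cached.write_text("placeholder contents")  # stand-in for a downloaded file

    files = {"obs/example.nc": cached}
    first_run = materialise_files(files, root / "out")
    second_run = materialise_files(files, root / "out")
    copied_text = (root / "out" / "obs" / "example.nc").read_text()

print(first_run, second_run)
```

Skipping existing targets keeps repeated fetches cheap, but it also means a stale copy in the output directory is not refreshed unless it is removed first.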

packages/ref-core/src/cmip_ref_core/dataset_registry/__init__.py

Lines changed: 0 additions & 94 deletions
This file was deleted.

packages/ref-core/src/cmip_ref_core/dataset_registry/pmp_reference.txt

Lines changed: 0 additions & 2 deletions
This file was deleted.

packages/ref-core/src/cmip_ref_core/logging.py

Lines changed: 1 addition & 0 deletions
@@ -54,6 +54,7 @@ def capture_logging() -> None:
     logger.disable("matplotlib.ticker")
     logger.disable("matplotlib.font_manager")
     logger.disable("pyproj.transformer")
+    logger.disable("pint.facets.plain.registry")


 def add_log_handler(**kwargs: Any) -> None:

packages/ref-core/tests/unit/pycmec/test_controlled_vocabulary.py

Lines changed: 6 additions & 0 deletions
@@ -16,6 +16,12 @@ def test_load_from_file(datadir):
     assert len(cv.dimensions)


+def test_load_from_file_failed(tmp_path):
+    (tmp_path / "cv_sample.yaml").touch()
+    with pytest.raises(AttributeError, match="'NoneType' object has no attribute 'keys'"):
+        CV.load_from_file(tmp_path / "cv_sample.yaml")
+
+
 def test_validate(cv, cmec_metric):
     cv.validate_metrics(cmec_metric)
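
The new test expects an empty file to produce an `AttributeError`. One plausible explanation, stated here as an assumption and not taken from the `CV.load_from_file` source, is that the YAML parser returns `None` for an empty document and the loader then calls `.keys()` on that result. The sketch below reproduces this failure mode with a hypothetical `load_dimensions` helper and no YAML dependency.

```python
def load_dimensions(parsed_document):
    """Hypothetical loader step that expects a mapping parsed from a YAML file."""
    return sorted(parsed_document.keys())


# Stand-in for the parse result of an empty file (an assumption: YAML parsers such as
# PyYAML's safe_load return None for an empty document)
empty_document = None

try:
    load_dimensions(empty_document)
except AttributeError as exc:
    error_message = str(exc)

print(error_message)
```

If that assumption holds, an explicit check for an empty document could raise a more descriptive error, although the commit itself only records the current behaviour.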
