Summary
A persistent `ref.toml` (on a shared volume, shared between workers, CLI, API) should be durable across image upgrades. Today two separate things can take every process down after an otherwise-routine upgrade:
- Fields that reference packaged data files (currently: `dimensions_cv`) get serialized as absolute venv paths that break when the Python version or venv layout changes.
- Fields that reference named modules (currently: `[[diagnostic_providers]]`) hard-crash the whole process if any entry no longer resolves.
Both were hit during an upgrade from `climate-ref-frontend:v0.2.x` (py3.11) to `v0.3.0` (py3.13). The full trail is in Climate-REF/ref-app#29.
1. Packaged data should behave like `ignore_datasets_file`, not a venv path
`ignore_datasets_file` already solves this cleanly (`packages/climate-ref/src/climate_ref/config.py::_get_default_ignore_datasets_file`): the default materializes a cached copy under `REF_CONFIGURATION/cache/climate_ref/`, i.e. on the persistent volume, not in the venv. The path written to `ref.toml` stays valid across image changes, and the loader re-materializes if the cached file is missing.
`dimensions_cv` uses a different pattern (`DimensionsCV._dimensions_cv_factory`): it resolves `importlib.resources.files("climate_ref_core.pycmec") / "cv_cmip7_aft.yaml"` to an absolute path (e.g. `/app/.venv/lib/python3.11/site-packages/...`), and that path gets persisted into `ref.toml`. When the image moves to py3.13 the file is still present at the new venv path, but the stale py3.11 path in `ref.toml` takes precedence over the default, so the service fails at startup with `FileNotFoundError`.
Generalize the `ignore_datasets_file` approach to any packaged data file referenced from `ref.toml`. Options:
- Introduce a helper (e.g. `_get_default_packaged_file(pkg, filename) -> Path`) that materializes the package resource into `REF_CONFIGURATION/cache//` on first read, the way `_get_default_ignore_datasets_file` does for the grey list. `dimensions_cv` (and any future sibling) uses this instead of `importlib.resources` directly.
- Or don't serialize packaged-data fields to `ref.toml` at all when the value equals the packaged default; let the loader re-resolve from `importlib.resources` every start.
- Or store a sentinel / `packaged:/` URI rather than an absolute filesystem path; loader expands it at load time.
Any of these fixes the symptom. The first matches the existing grey-list pattern and keeps a stable on-disk artefact under `REF_CONFIGURATION`, which is the shape the deployment already expects.
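A minimal sketch of the first option, to make the proposal concrete. It is not the existing `_get_default_ignore_datasets_file` implementation: the function name, the explicit `config_dir` argument (standing in for `REF_CONFIGURATION`), and the `cache/<pkg>/<filename>` layout are all assumptions for illustration.

```python
import importlib.resources
import shutil
from pathlib import Path


def get_default_packaged_file(pkg: str, filename: str, config_dir: Path) -> Path:
    """Return a stable on-disk copy of a packaged resource.

    Hypothetical helper: `config_dir` stands in for REF_CONFIGURATION and the
    cache layout below is an assumption, not the project's actual layout.
    """
    target = config_dir / "cache" / pkg / filename
    if not target.exists():
        # (Re-)materialize the file: covers first use and a wiped cache.
        target.parent.mkdir(parents=True, exist_ok=True)
        resource = importlib.resources.files(pkg) / filename
        # as_file yields a real filesystem path even if the package is zipped.
        with importlib.resources.as_file(resource) as src:
            shutil.copyfile(src, target)
    return target
```

The returned path depends only on `config_dir`, the package name, and the filename, so a value persisted to `ref.toml` would not embed the Python version or venv location. This sketch never refreshes an existing copy when the packaged file changes between releases; whether it should is a design decision left open here.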
2. Unknown providers should degrade, not crash
`climate_ref_core.providers.import_provider` raises `InvalidProviderException` on `ModuleNotFoundError`, and `ProviderRegistry.build_from_config` propagates it, so one stale entry in `[[diagnostic_providers]]` kills every process (API, workers, CLI) at boot.
Trigger: `ref.toml` still carried `provider = "climate_ref_example:provider"` after the image stopped shipping that package. Nothing worked until the entry was manually `sed`'d out.
Fix options (pick at least one):
- In `build_from_config` (or the config loader), `try/except ModuleNotFoundError` per provider: log a warning, skip the entry, continue. A missing provider should not take down the whole stack.
- Add a `ref config providers prune` command (and run it implicitly from `ref providers setup`) that removes entries whose module doesn't resolve, so `ref.toml` self-heals after upgrades that drop a provider.
- Discover providers by entry-point / distribution name rather than dotted module path, so `ref.toml` doesn't have to redeclare what's installed.
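The first option could look roughly like the sketch below. `InvalidProviderException` and `import_provider` here are simplified stand-ins written for this example, not the real `climate_ref_core` code, and `load_providers` is a hypothetical name for the per-entry loop inside `build_from_config`.

```python
import importlib
import logging

logger = logging.getLogger(__name__)


class InvalidProviderException(Exception):
    """Stand-in for the exception of the same name in climate_ref_core."""


def import_provider(fqn: str):
    """Stand-in importer: resolve a "module:attribute" string to an object."""
    module_name, _, attr = fqn.partition(":")
    try:
        module = importlib.import_module(module_name)
    except ModuleNotFoundError as exc:
        raise InvalidProviderException(f"module {module_name!r} not found") from exc
    return getattr(module, attr)


def load_providers(entries: list[str]) -> list:
    """Import every configured provider, skipping entries that do not resolve."""
    providers = []
    for fqn in entries:
        try:
            providers.append(import_provider(fqn))
        except InvalidProviderException as exc:
            # One stale entry is logged and skipped instead of aborting startup.
            logger.warning("Skipping provider %r: %s", fqn, exc)
    return providers
```

With this shape, a config listing two installed providers and one removed package yields the two installed providers plus one warning. Only the unresolved-module case is tolerated; other import-time failures still propagate, which seems the safer default.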
Acceptance
- Path durability: a `ref.toml` written by one Python / image loads cleanly under a different Python / image without operator intervention for any field that currently references package data. Covered by a round-trip test that simulates the py3.11 → py3.13 move for `dimensions_cv`.
- Provider resilience: a `ref.toml` with one unknown provider and two valid ones starts successfully; the two valid providers register, the unknown one is logged and skipped. Covered by a unit test against `build_from_config`.
- Existing behavior unchanged when everything listed is installed.
Context
Workarounds applied during this deploy (listed for context; they are not part of the fix):
```
sed -i '/^dimensions_cv = /d' /ref/test-lazy/ref.toml
# and, by hand: remove the [[diagnostic_providers]] block for climate_ref_example
```
Related: Climate-REF/ref-app#29, Climate-REF/ref-app#31.