ref.toml doesn't survive image/python upgrades: bake packaged-data paths and unknown providers #627

@lewisjared

Summary

A persistent `ref.toml` (on a volume shared between the workers, CLI, and API) should be durable across image upgrades. Today two separate problems can take every process down after an otherwise routine upgrade:

  1. Fields that reference packaged data files (currently: `dimensions_cv`) get serialized as absolute venv paths that break when the Python version or venv layout changes.
  2. Fields that reference named modules (currently: `[[diagnostic_providers]]`) hard-crash the whole process if any entry no longer resolves.

Both were hit during an upgrade from `climate-ref-frontend:v0.2.x` (py3.11) to `v0.3.0` (py3.13). The full trail is in Climate-REF/ref-app#29.

1. Packaged data should behave like `ignore_datasets_file`, not a venv path

`ignore_datasets_file` already solves this cleanly (`packages/climate-ref/src/climate_ref/config.py::_get_default_ignore_datasets_file`): the default materializes a cached copy under `REF_CONFIGURATION/cache/climate_ref/`, i.e. on the persistent volume, not in the venv. The path written to `ref.toml` stays valid across image changes, and the loader re-materializes if the cached file is missing.

`dimensions_cv` uses a different pattern (`DimensionsCV._dimensions_cv_factory`): it resolves `importlib.resources.files("climate_ref_core.pycmec") / "cv_cmip7_aft.yaml"`, gets the resolved absolute path (e.g. `/app/.venv/lib/python3.11/site-packages/...`), and that path gets persisted into `ref.toml`. When the image moves to py3.13 the file is still present at the new venv path, but the stale py3.11 path in `ref.toml` wins and the service fails at startup with `FileNotFoundError`.
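To illustrate why the persisted value is tied to one interpreter, the sketch below resolves a packaged file through `importlib.resources` and prints the absolute path. The standard-library `json` package is a stand-in chosen only so the snippet runs anywhere; it is an assumption for illustration, not the call the factory actually makes.

```python
import importlib.resources
from pathlib import Path

# Stand-in for files("climate_ref_core.pycmec") / "cv_cmip7_aft.yaml".
# "json" / "decoder.py" is used purely so this runs without climate-ref installed.
resource = importlib.resources.files("json") / "decoder.py"
resolved = Path(str(resource)).resolve()

# The absolute path embeds the interpreter's install location (in a venv this
# typically includes a pythonX.Y directory), so persisting it pins that layout.
print(resolved)
```

Writing a path like this into `ref.toml` is what makes the file sensitive to a Python version bump.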

Generalize the `ignore_datasets_file` approach to any packaged data file referenced from `ref.toml`. Options:

  • Introduce a helper (e.g. `_get_default_packaged_file(pkg, filename) -> Path`) that materializes the package resource into `REF_CONFIGURATION/cache//` on first read, the way `_get_default_ignore_datasets_file` does for the grey list. `dimensions_cv` (and any future sibling) uses this instead of `importlib.resources` directly.
  • Or don't serialize packaged-data fields to `ref.toml` at all when the value equals the packaged default; let the loader re-resolve from `importlib.resources` every start.
  • Or store a sentinel / `packaged:/` URI rather than an absolute filesystem path; loader expands it at load time.

Any of these fixes the symptom. The first matches the existing grey-list pattern and keeps a stable on-disk artefact under `REF_CONFIGURATION`, which is the shape the deployment already expects.
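A minimal sketch of the first option, under stated assumptions: the helper name `get_default_packaged_file`, the explicit `cache_root` argument, and the `cache_root/<pkg>/<filename>` layout are all hypothetical, and the real `_get_default_ignore_datasets_file` may behave differently.

```python
import importlib.resources
import shutil
from pathlib import Path


def get_default_packaged_file(pkg: str, filename: str, cache_root: Path) -> Path:
    """Copy a packaged resource into a persistent cache and return the cached path.

    Hypothetical helper: the cache layout (cache_root / pkg / filename) is an
    assumption made for this sketch.
    """
    target = cache_root / pkg / filename
    if not target.exists():
        # First read, or the cached copy was deleted: (re-)materialize it.
        target.parent.mkdir(parents=True, exist_ok=True)
        resource = importlib.resources.files(pkg) / filename
        # as_file() yields a real filesystem path even for zipped packages.
        with importlib.resources.as_file(resource) as src:
            shutil.copyfile(src, target)
    return target
```

The returned path lives under the cache root rather than the venv, so it stays valid when the interpreter changes. Note that this sketch never refreshes an existing cached copy; whether an upgrade should overwrite it is a separate design decision.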

2. Unknown providers should degrade, not crash

`climate_ref_core.providers.import_provider` raises `InvalidProviderException` on `ModuleNotFoundError`, and `ProviderRegistry.build_from_config` propagates it, so one stale entry in `[[diagnostic_providers]]` kills every process (API, workers, CLI) at boot.

Trigger: `ref.toml` still carried `provider = "climate_ref_example:provider"` after the image stopped shipping that package. Nothing worked until the entry was manually `sed`'d out.

Fix options (pick at least one):

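  • In `build_from_config` (or the config loader), wrap each provider import in its own `try`/`except`, catching the `InvalidProviderException` that `import_provider` raises on `ModuleNotFoundError`: log a warning, skip the entry, and continue. A missing provider should not take down the whole stack.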
  • Add a `ref config providers prune` command (and run it implicitly from `ref providers setup`) that removes entries whose module doesn't resolve, so `ref.toml` self-heals after upgrades that drop a provider.
  • Discover providers by entry-point / distribution name rather than dotted module path, so `ref.toml` doesn't have to redeclare what's installed.
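The first option could look roughly like the sketch below. `load_providers` is a hypothetical stand-in, not the real `build_from_config`, and it assumes entries use the `module:attribute` form seen in `climate_ref_example:provider`.

```python
import importlib
import logging

logger = logging.getLogger(__name__)


def load_providers(entries: list[str]) -> dict[str, object]:
    """Import each 'module:attribute' entry, skipping modules that are not installed.

    Hypothetical sketch; the real registry and exception types may differ.
    """
    loaded: dict[str, object] = {}
    for entry in entries:
        module_name, _, attr = entry.partition(":")
        try:
            module = importlib.import_module(module_name)
        except ModuleNotFoundError as exc:
            # One stale entry is logged and skipped instead of aborting startup.
            logger.warning("Skipping unknown provider %r: %s", entry, exc)
            continue
        loaded[entry] = getattr(module, attr)
    return loaded
```

Called with one unknown entry between two valid ones, this returns only the two valid ones and logs a warning for the third. A real implementation would probably also check `exc.name` so that a broken import inside an installed provider is not mistaken for a missing package.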

Acceptance

  • Path durability: a `ref.toml` written by one Python / image loads cleanly under a different Python / image without operator intervention for any field that currently references package data. Covered by a round-trip test that simulates the py3.11 → py3.13 move for `dimensions_cv`.
  • Provider resilience: a `ref.toml` with one unknown provider and two valid ones starts successfully; the two valid providers register, the unknown one is logged and skipped. Covered by a unit test against `build_from_config`.
  • Existing behavior unchanged when everything listed is installed.

Context

Workarounds applied during this deploy (recorded for context; they are not part of the fix):

```
sed -i '/^dimensions_cv = /d' /ref/test-lazy/ref.toml
```

and manually removing the `[[diagnostic_providers]]` block for `climate_ref_example`.

Related: Climate-REF/ref-app#29, Climate-REF/ref-app#31.
