SimCat Prototype #1417

Draft
pgierz wants to merge 62 commits into esm-tools-plus/simcat/main from esm-tools-plus/simcat/prototype

Conversation

pgierz (Member) commented Jan 13, 2026

Prototype of the Simulation Catalog work. This is meant to be only exploratory at the moment.

Closes #1416

pgierz linked an issue Jan 13, 2026 that may be closed by this pull request
siligam and others added 20 commits January 19, 2026 11:08
- Add browser/ directory with three-tier STAC Browser customisation
  (config overrides, net-new src additions, reference patches)
- Document stacbrowser2 as a proper GitHub fork (esm-tools/stac-browser)
  with two-remote setup (origin=fork, upstream=radiantearth)
- Add scan/context.py and fix the collection assignment design hole:
  resolve_context() must run before item creation; NULL collection
  silently breaks API navigation
- Add integration/ module (esm_tools.py, config.py) for add_files()
  bridge and finished_config.yaml loading
- Fix DuckDB schema: add catalogs table, collection_item_props table,
  experiment column + index, fix DOUBLE[] syntax
- Fix SQL WHERE bug in federation example (wrap in CTE)
- Add Collection Search section: two search modes (items/collections),
  conformsTo requirements, CQL2 filter parsing, item-derived property
  index, and differing response shapes
- Document PythonCodeBox searchType prop and collection codegen mode
  (template_collections.py, requests.get vs pystac-client)
- Add CORS middleware, fix cp -r for theme/, update dependencies
- Update Phase Plan: uncheck all, add Phase 5 (Hardening)
- Update Collaboration Notes to past tense

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…Tools output

Adds the full esm_catalog package (Phase 1 complete):

Scanning
- scan/netcdf.py: xarray-based NetCDF scanner (CF dims, bbox, datetime)
- scan/grib.py: eccodes structure scan + per-hypercube cfgrib opening;
  handles extension-less ECHAM output via magic-byte detection in detect.py;
  0-360° longitude normalisation for ECHAM Gaussian grids;
  .codes file enrichment for ECHAM paramId → shortName/units mapping;
  numeric (decode_times=False) timestamp handling for pre-1900 dates
- scan/detect.py: format dispatch by extension with magic-byte fallback
- scan/context.py: collection context resolution (ESM-Tools config or path)

STAC
- stac/item.py, stac/collection.py: STAC 1.0 Item and Collection builders
- stac/extensions/: datacube, contacts, hpc, cf extension support

Storage
- storage/duckdb.py: DuckDB-backed catalog; SET TimeZone='UTC' to avoid
  historical LMT offsets on TIMESTAMPTZ readback for pre-1900 data
- storage/export.py: Parquet and GeoJSON export for batch workflows

Integration & CLI
- integration/esm_tools.py: add_files() bridge; resolves symlinks and
  skips zero-byte files to prevent duplicate items from ESM-Tools output
- integration/config.py: finished_config.yaml loader
- cli.py: scan, scan-batch, merge-parquet, serve commands; directory
  walker deduplicates by resolved real path

Documentation & Tests
- CLI.md: command reference with examples
- ARCHITECTURE.md: Phase 1 marked complete; pytest + user docs required
  per phase
- tests/: 137 passing tests across hpc, scan, stac, storage, integration

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
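
The "dispatch by extension with magic-byte fallback" behaviour described for scan/detect.py in the commit above can be sketched as follows. This is a minimal illustration, not the code in the PR: the function name detect_format, the extension table, and the return labels are assumptions. The byte signatures themselves are the standard ones (GRIB messages start with "GRIB", classic NetCDF with "CDF", and NetCDF-4 files use the HDF5 signature).

```python
from pathlib import Path
from typing import Optional

# Leading bytes ("magic numbers") of the formats the scanner cares about.
_MAGIC = (
    (b"GRIB", "grib"),                 # GRIB edition 1 and 2 messages
    (b"CDF", "netcdf"),                # NetCDF classic / 64-bit offset
    (b"\x89HDF\r\n\x1a\n", "netcdf"),  # HDF5 container used by NetCDF-4
)

# Hypothetical extension table; the real module may recognise other suffixes.
_EXTENSIONS = {
    ".nc": "netcdf",
    ".nc4": "netcdf",
    ".grb": "grib",
    ".grib": "grib",
    ".grb2": "grib",
}


def detect_format(path: Path) -> Optional[str]:
    """Return "netcdf", "grib", or None for an unrecognised file."""
    by_extension = _EXTENSIONS.get(path.suffix.lower())
    if by_extension is not None:
        return by_extension
    # Fallback for extension-less output such as ECHAM streams:
    # inspect the first bytes of the file instead of its name.
    with open(path, "rb") as handle:
        head = handle.read(8)
    for magic, fmt in _MAGIC:
        if head.startswith(magic):
            return fmt
    return None
```

A file named like basic-001_185001.01_echam has no known suffix, so it would be classified from its first bytes. Note that a plain HDF5 file that is not NetCDF-4 would also be reported as "netcdf" by this sketch.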
- integration/config.py: add find_finished_configs(), get_outdata_files(),
  and extract_stac_metadata() helpers for working with finished_config.yaml
  files (date-range-suffixed, without .yaml extension)
- scan/context.py: fix _find_component_for_path() to check
  experiment_outdata_dir (the key used in real finished_config files;
  outdata_dir is None in practice)
- tests/test_integration.py: 33 tests covering all new helpers and the
  context resolution bug fix (159 total tests passing)
- docs/esm_tools_integration.md: integration guide covering live tidy-phase
  usage, batch scan, add_files() API, finished_config.yaml keys, and all
  three config helper functions
- ARCHITECTURE.md: mark Phase 2 as complete

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
The file_operations_tidy YAML (written by ESM-Tools during tidy) is now
the preferred source for catalog construction because it carries MD5
checksums computed at tidy time — no extra I/O needed.

- integration/config.py: add find_file_operations_log() and
  get_outdata_from_file_operations(); only the outdata category is
  returned (log, restart_out, unknown are excluded)
- integration/esm_tools.py: add checksums kwarg to add_files() that
  stores file:checksum in the STAC asset and adds the file extension URL;
  add add_run() implementing the priority chain:
  file_operations_tidy → finished_config outdata_targets
- tests/test_integration.py: 177 tests passing; new classes cover
  find_file_operations_log, get_outdata_from_file_operations,
  add_files checksum injection, and add_run priority chain

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
siligam and others added 30 commits March 11, 2026 16:43
The root cause of both issues was that stac-fastapi's BaseSearchPostRequest
drops unknown fields, so the 'filter'/'filter-lang' sent by STAC Browser
were silently discarded.

Changes:
- _parse_cql2_json(): parse CQL2-JSON expressions (=, !=, <, <=, >, >=,
  LIKE, AND) into filter_props dicts; OR/NOT silently ignored (AND-only DB)
- FilteredSearchPostRequest: subclass of BaseSearchPostRequest that captures
  the 'filter' and 'filter-lang' fields from the POST body
- post_search(): apply CQL2 filter from FilteredSearchPostRequest.filter
- all_collections(): parse ?filter=<cql2-json> query param so "Search for
  Collections → Additional filters" in STAC Browser also works
- create_app(): passes search_post_request_model=FilteredSearchPostRequest
- 6 new tests covering equality, no-match, AND, OGC rel, queryables schema,
  and CQL2 conformance classes

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
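
As a rough illustration of the AND-only parsing described in the commit above, the sketch below flattens a CQL2-JSON tree into a dict keyed by property name. The function name and the exact shape of the returned dict are assumptions; the operator list is taken from the commit message, and OR/NOT nodes are ignored just as the message states.

```python
from typing import Any, Dict, Tuple

# Comparison operators listed in the commit message (lower-cased).
_COMPARISONS = {"=", "!=", "<", "<=", ">", ">=", "like"}


def parse_cql2_json(node: Any) -> Dict[str, Tuple[str, Any]]:
    """Flatten an AND-only CQL2-JSON tree into {property: (op, value)}."""
    if not isinstance(node, dict):
        return {}
    op = str(node.get("op", "")).lower()
    args = node.get("args", [])
    if op == "and":
        merged: Dict[str, Tuple[str, Any]] = {}
        for child in args:
            # A later constraint on the same property overwrites an earlier one.
            merged.update(parse_cql2_json(child))
        return merged
    if op in _COMPARISONS and len(args) == 2 and isinstance(args[0], dict):
        prop = args[0].get("property")
        if prop is not None:
            return {prop: (op, args[1])}
    # OR, NOT, and anything unrecognised contribute no constraint here.
    return {}
```

For example, {"op": "=", "args": [{"property": "variable"}, "ssh"]} yields {"variable": ("=", "ssh")}.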
The summaries field on scanned collections is empty, so enum lists for
variable/experiment/component were never populated. Replace the summaries
lookup with direct DISTINCT queries on the items table using DuckDB's
json_extract_string(), giving STAC Browser proper dropdown pickers.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
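
The DISTINCT lookup described above might be expressed along these lines. This sketch only builds the SQL text; json_extract_string() is DuckDB's JSON function as named in the commit, while the table name items, the column name properties, and the helper name are assumptions about the schema.

```python
# Properties the commit message says are offered as dropdown pickers.
_ENUM_PROPERTIES = ("variable", "experiment", "component")


def distinct_values_sql(prop: str, table: str = "items", column: str = "properties") -> str:
    """Build a DuckDB query listing the distinct values of one item property."""
    if prop not in _ENUM_PROPERTIES:
        # Only whitelisted names are interpolated into the SQL text.
        raise ValueError(f"unsupported queryable: {prop!r}")
    expr = f"json_extract_string({column}, '$.{prop}')"
    return (
        f"SELECT DISTINCT {expr} AS value FROM {table} "
        f"WHERE {expr} IS NOT NULL ORDER BY value"
    )
```

The resulting rows would then be placed in the "enum" list of the corresponding queryable.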
Items stored in DuckDB have fragment-only collection links (#collection-id).
STAC Browser requires absolute self/root/collection links on items to render
item detail cards. Added _inject_item_links() and wired it into all
item-returning paths: item_collection, get_item, get_search, post_search, _run_search.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
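
A serve-time link injection of the kind described above could look roughly like this. It is a sketch under assumptions: the function name, the link media types, and the URL layout /collections/{collection}/items/{id} (the usual STAC API convention) are not confirmed by the commit.

```python
import copy
from typing import Any, Dict

# Link relations that are regenerated at serve time.
_MANAGED_RELS = {"self", "root", "collection"}


def inject_item_links(item: Dict[str, Any], base_url: str) -> Dict[str, Any]:
    """Return a copy of the item with absolute self/root/collection links."""
    out = copy.deepcopy(item)  # leave the stored item untouched
    base = base_url.rstrip("/")
    collection_id = out.get("collection")
    # Drop stored fragment-only links such as "#collection-id".
    links = [link for link in out.get("links", []) if link.get("rel") not in _MANAGED_RELS]
    links.append({"rel": "root", "href": f"{base}/", "type": "application/json"})
    if collection_id:
        collection_href = f"{base}/collections/{collection_id}"
        links.append({"rel": "collection", "href": collection_href, "type": "application/json"})
        links.append({
            "rel": "self",
            "href": f"{collection_href}/items/{out['id']}",
            "type": "application/geo+json",
        })
    out["links"] = links
    return out
```

Because the links are rebuilt on every response, the catalog itself does not need to know the URL it is served from.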
…arch results

STAC Browser shows no item count and disabled First/Previous/Next buttons
because the ItemCollection response lacked the numberMatched/numberReturned
top-level fields and had no pagination links.

- _make_item_collection: add numberMatched, numberReturned, and first/prev/next links
- _run_search: accept offset parameter, thread it to db.search_items and _make_item_collection
- post_search: extract offset from token field, preserve filter/collections in next link body
- get_search: extract token from **kwargs for GET-based pagination
- item_collection: pass offset/path so collection-level items are paginated correctly
- FilteredSearchPostRequest: add token field for pagination token (offset)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
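
The response shape described above can be sketched as follows. The function and parameter names are assumptions, and for brevity the links are GET-style URLs carrying token and limit as query parameters; the POST /search variant in the commit carries the filter and collections in the next link body instead.

```python
from typing import Any, Dict, List
from urllib.parse import urlencode


def make_item_collection(
    items: List[Dict[str, Any]],
    number_matched: int,
    offset: int,
    limit: int,
    search_url: str,
) -> Dict[str, Any]:
    """Wrap one page of items with counts and first/prev/next links."""

    def page_link(rel: str, token: int) -> Dict[str, str]:
        # The pagination token is simply the row offset of the page.
        query = urlencode({"token": token, "limit": limit})
        return {"rel": rel, "href": f"{search_url}?{query}", "type": "application/geo+json"}

    links = [page_link("first", 0)]
    if offset > 0:
        links.append(page_link("prev", max(offset - limit, 0)))
    if offset + len(items) < number_matched:
        links.append(page_link("next", offset + limit))
    return {
        "type": "FeatureCollection",
        "features": items,
        "numberMatched": number_matched,
        "numberReturned": len(items),
        "links": links,
    }
```

The "next" link is omitted on the last page and "prev" on the first, which is what lets a client enable or disable its paging buttons.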
STAC Browser sends datetime values as CQL2 literal objects
{"timestamp": "2026-03-11T18:45:11Z"} rather than plain strings.
_parse_cql2_json was passing the dict through unchanged, causing DuckDB
to fail with "Unimplemented type for cast (STRUCT -> TIMESTAMPTZ)".

Added _cql2_value() helper that extracts the inner string from
{"timestamp": "..."} and {"date": "..."} objects before passing
the value to search_items.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
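
The unwrapping step described above is small enough to show in full. This is a sketch of what such a helper might look like; only the two literal keys named in the commit message are handled, and the function name is written without the leading underscore to mark it as illustrative.

```python
from typing import Any


def cql2_value(value: Any) -> Any:
    """Unwrap CQL2 temporal literals; pass every other value through."""
    if isinstance(value, dict):
        for key in ("timestamp", "date"):
            inner = value.get(key)
            if isinstance(inner, str):
                return inner
    return value
```

With this in place, {"timestamp": "2026-03-11T18:45:11Z"} reaches the database layer as the plain string "2026-03-11T18:45:11Z", which can be cast to TIMESTAMPTZ.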
…ions and items

Collections stored in DuckDB have a fragment-only parent link (e.g.
href: "#basic-001") which fails STAC IRI-reference format validation and
renders the STAC Browser "Up" button non-functional at collection level.
Items lacked a parent link entirely, also failing validation.

- _inject_collection_links: strip parent links and replace with {base_url}/
  so "Up" from collection navigates to the landing page
- _inject_item_links: add parent link pointing to {base_url}/collections/{cid}
  so "Up" from item navigates to the parent collection and validation passes

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…tension

ECHAM items were missing cube:dimensions because scan_grib() returned an
empty dimensions dict ("GRIB dims are implicit per-hypercube"). This caused
datacube extension validation to fail in STAC Browser.

Added _extract_dimensions_grib() which iterates over all open xr.Dataset
hypercubes before they are closed and extracts:
  - hybrid/level/lev/depth: spatial z-axis with level number extent
  - latitude/lat: spatial y-axis with coordinate extent
  - longitude/lon: spatial x-axis (normalised to -180/180)
  - values: reduced Gaussian grid spatial dimension (index extent)
  - time/valid_time: temporal with ISO extent
  - other dims: ordinal with coordinate or index extent

Tested on basic-001_185001.01_echam: produces hybrid(1-47), values(0-4159),
latitude(-88.6/+88.6), longitude(-178/+180) — covering all dimensions
referenced by cube:variables.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
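
The dimension classification listed above can be illustrated without xarray by working on plain coordinate lists. The function name and the name-to-type mapping below are assumptions based on the bullet list in the commit; the longitude wrap keeps +180 as-is so that a 0-360 grid maps into the range (-180, 180].

```python
from typing import Any, Dict, Sequence

_VERTICAL = {"hybrid", "level", "lev", "depth"}
_LATITUDE = {"latitude", "lat"}
_LONGITUDE = {"longitude", "lon"}
_TEMPORAL = {"time", "valid_time"}


def describe_dimension(name: str, values: Sequence[Any]) -> Dict[str, Any]:
    """Build one cube:dimensions entry from a dimension name and its coordinates."""
    if name in _LONGITUDE:
        # Normalise 0-360 longitudes to -180/180 before taking the extent.
        wrapped = [v - 360.0 if v > 180.0 else v for v in values]
        return {"type": "spatial", "axis": "x", "extent": [min(wrapped), max(wrapped)]}
    if name in _LATITUDE:
        return {"type": "spatial", "axis": "y", "extent": [min(values), max(values)]}
    if name in _VERTICAL:
        return {"type": "spatial", "axis": "z", "extent": [min(values), max(values)]}
    if name in _TEMPORAL:
        # ISO 8601 strings in one consistent format sort chronologically.
        return {"type": "temporal", "extent": [min(values), max(values)]}
    # Anything else (including the reduced-grid "values" index) is ordinal here.
    return {"type": "ordinal", "extent": [min(values), max(values)]}
```

In the real scanner the coordinate values would come from each open hypercube dataset rather than from literal lists.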
…ons URL

The HPC extension schema URL (https://esm-tools.github.io/stac-hpc-extension/
v0.1.0/schema.json) has never been published, causing STAC Browser to report
"Schema not found" and mark items as Invalid.

- app.py: add GET /stac-extensions/hpc/v0.1.0/schema.json endpoint that
  serves a JSON Schema covering all hpc:* properties and asset fields
- client.py (_inject_item_links): rewrite the GitHub URL in stac_extensions
  to the local endpoint at serve time, so STAC Browser can fetch and
  validate against it

Also adds file:// prefix to bare asset hrefs (local filesystem paths)
so they satisfy the iri-reference format required by STAC JSON schema.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
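
Both serve-time rewrites described above are simple string transformations. The sketch below assumes hypothetical helper names; the two schema locations are the ones quoted in the commit message. Paths containing spaces or other reserved characters would additionally need percent-encoding, which this sketch does not attempt.

```python
from typing import List

HPC_SCHEMA_REMOTE = "https://esm-tools.github.io/stac-hpc-extension/v0.1.0/schema.json"
HPC_SCHEMA_LOCAL_PATH = "/stac-extensions/hpc/v0.1.0/schema.json"


def normalise_href(href: str) -> str:
    """Prefix bare absolute filesystem paths with file://; leave URIs alone."""
    if "://" in href:
        return href  # already has a scheme (https://, file://, s3://, ...)
    if href.startswith("/"):
        return "file://" + href
    return href  # relative references and fragments are left untouched


def rewrite_extensions(stac_extensions: List[str], base_url: str) -> List[str]:
    """Point the unpublished HPC schema URL at the locally served copy."""
    local = base_url.rstrip("/") + HPC_SCHEMA_LOCAL_PATH
    return [local if url == HPC_SCHEMA_REMOTE else url for url in stac_extensions]
```

Rewriting at serve time means stored items keep the canonical extension URL and need no rescan once the schema is published.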
Introduce CollectionContextError(ValueError) so that the normal
outcome of scanning non-outdata paths (work/, restart/, input/, etc.)
is caught separately in the scan loop and logged at DEBUG level instead
of ERROR.  Any other (unexpected) ValueError is still logged at ERROR.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
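
The exception split described above relies on catching the subclass before its parent. In this sketch only CollectionContextError(ValueError) comes from the commit; the logger name, resolve_context() body, and scan_paths() are simplified stand-ins invented for illustration.

```python
import logging
from typing import Dict, Iterable

logger = logging.getLogger("esm_catalog.scan")  # hypothetical logger name


class CollectionContextError(ValueError):
    """Raised when a path does not belong to any catalogued collection."""


def resolve_context(path: str) -> str:
    """Stand-in resolver: only paths under an outdata/ directory qualify."""
    if "/outdata/" not in path:
        raise CollectionContextError(f"not an outdata path: {path}")
    return path.split("/outdata/")[1].split("/")[0]


def scan_paths(paths: Iterable[str]) -> Dict[str, str]:
    """Resolve each path, skipping expected misses quietly."""
    resolved: Dict[str, str] = {}
    for path in paths:
        try:
            resolved[path] = resolve_context(path)
        except CollectionContextError as exc:
            # Expected for work/, restart/, input/ and similar directories.
            logger.debug("skipping %s: %s", path, exc)
        except ValueError as exc:
            # Any other ValueError is unexpected and stays visible.
            logger.error("failed to scan %s: %s", path, exc)
    return resolved
```

The order of the except clauses matters: if ValueError were listed first it would also catch CollectionContextError and the DEBUG branch would never run.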
ECHAM accumulation files (_accw, _co2) store all parameters under
paramId=0, causing cfgrib to collapse them into a single variable
named "unknown".  When this occurs and a .codes table is available,
expand the single entry into one variable per codes table parameter
(they all share the same grid and dimensions).

This gives item IDs like "runoff.echam.185001.xxx" instead of
"unknown.echam.185001.xxx" and populates cube:variables with the
complete variable list.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
stac-fastapi does not forward unknown query parameters through the
item_collection method signature, so ?token=N was silently ignored
and every page returned the same first N items.

Fix by reading token (and limit) directly from request.query_params
when the method-level token is None, matching the pattern already
used for filter extraction in all_collections.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
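
The fallback described above amounts to preferring the method-level arguments and otherwise reading the raw query string. The sketch below takes any mapping in place of request.query_params so it runs without a web framework; the helper names and the default limit of 10 are assumptions.

```python
from typing import Any, Mapping, Optional, Tuple


def _to_int(raw: Any, fallback: int) -> int:
    """Parse an integer, returning the fallback for None or malformed input."""
    try:
        return int(raw)
    except (TypeError, ValueError):
        return fallback


def resolve_paging(
    token: Optional[str],
    limit: Optional[int],
    query_params: Mapping[str, str],
    default_limit: int = 10,
) -> Tuple[int, int]:
    """Return (offset, limit), falling back to the raw query parameters."""
    raw_token = token if token is not None else query_params.get("token")
    raw_limit = limit if limit is not None else query_params.get("limit")
    offset = max(_to_int(raw_token, 0), 0)
    page_size = max(_to_int(raw_limit, default_limit), 1)
    return offset, page_size
```

In a handler this would be called as resolve_paging(token, limit, request.query_params), so that ?token=N reaches the database query even when the framework does not pass it through the method signature.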
Phase 3 additions:
- CQL2-JSON filter for POST /search and GET /collections
- GET /queryables with live enum lists from catalog
- GET /stac-extensions/hpc/v0.1.0/schema.json (local schema + URL rewrite)
- POST /format stub for OGC format-negotiation probe
- Absolute link injection for collections and items (fixes STAC validation)
- Asset href file:// normalisation
- Pagination: numberMatched/numberReturned + first/prev/next links for POST /search
- Pagination: token/limit read from request.query_params in item_collection
- CQL2 temporal literal unwrapping (_cql2_value)

Phase 5: mark ECHAM GRIB support as complete:
- _extract_dimensions_grib() for cube:dimensions
- paramId=0 expansion via codes table for _accw/_co2 files
- CollectionContextError at DEBUG level for expected path-skip

Open questions: document ECHAM GRIB remaining gap (residual unknowns in
mixed hypercubes where some paramIds are not in standard eccodes tables).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
GRIB files (e.g. ECHAM _echam/_accw/_co2) contain many variables in a
single file.  Previously only the primary variable was searchable; now
all variables are indexed.

Changes:
- stac/item.py: populate properties.variables JSON array with all
  variable names when a file has more than one variable
- storage/duckdb.py: handle variables field in search_items() via
  list_contains() on the JSON array; index each variable name
  separately in collection_item_props for collection-level search
- api/app.py: add variables queryable to /queryables, populated from
  all items' properties.variables arrays (487 distinct values vs 68
  for the primary variable)

Semantics:
  variable == 'rsdscs'   → items where rsdscs is the primary variable
  variables == 'rsdscs'  → items containing rsdscs (233 matches for
                           _echam GRIB files that bundle rsdscs with
                           st, svo, sd, lsp, q, ...)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
STAC Browser uses properties.title as the item card heading, falling
back to the item ID if no title is set.  Without a title, cross-
collection search results show only item IDs (e.g. "st.echam.185001")
with no indication of which collection each item belongs to.

Inject title = "{collection} · {variable}" (e.g. "basic-001-echam · st")
at serve time in _inject_item_links() — no catalog rescan required.
The item ID is still shown below the title in STAC Browser.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Replace the title-based collection label (which made all items with
the same variable look identical) with a keywords injection.  STAC
Browser v3 renders properties.keywords as colored chip badges on item
cards — the item ID remains the primary heading and the collection
name appears as a small label alongside the Grib2/NetCDF badge.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
STAC Browser's SearchFilter.vue fetches queryables for the Collections
tab from a rel=queryables link embedded in the GET /collections response
body — not from the landing page queryables link used by the Items tab.
Without this link the CQL2 filter builder was silently absent even though
the API correctly declared the collection-search#filter conformance class.

Fix POST /format to accept a raw Request instead of dict|None so that
plain-text CQL2 bodies no longer trigger a 422/400 validation error;
the endpoint now always returns 200.

- api/client.py: inject OGC queryables link into GET /collections links[]
- api/app.py: cql2_format accepts Request, returns {} unconditionally
- tests/test_api.py: assert GET /collections contains queryables rel link
- ARCHITECTURE.md: document both fixes and the two-path queryables loading
  behaviour in STAC Browser

249 tests passing.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Add a rel="queryables" link pointing to /collections/{id}/queryables in
every collection object returned by _inject_collection_links.

Without this link, STAC Browser's SearchFilter.vue (type="Items") calls
getQueryablesLink() on the collection STAC object and gets null, so
loadQueryables() is never called, queryables stays empty, and
showAdditionalFilters remains false — the "Additional Filters" CQL2
section is invisible on the "Show Filters" panel for collection pages.

With the link present, STAC Browser loads queryables correctly and the
Additional Filters section appears in the Show Filters panel exactly
as it does in the Search for Items tab.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Adds GET /collections/{id}/queryables so STAC Browser can load the
CQL2 schema when browsing a specific collection's items and show the
"Additional Filters" section under "Show Filters".

The endpoint returns the same JSON Schema structure as the global
/queryables but with enum values scoped to items in the requested
collection only.  Returns 404 for unknown collection IDs.

_inject_collection_links already injects a rel=queryables link
pointing to this new URL, so STAC Browser picks it up automatically.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…ting

Two additions to enable "Additional Filters" in the collection items view:

1. GET /collections/{id}/queryables
   New endpoint returning queryable properties scoped to a single
   collection. STAC Browser fetches this when opening the filter panel
   on a collection page; without it the CQL2 builder section was absent.
   Returns the same schema as GET /queryables but with enum values
   filtered to items belonging to that collection only.

2. CQL2-JSON filter support in GET /collections/{id}/items
   The item_collection() handler now reads ?filter and ?filter-lang
   query params and passes them through _parse_cql2_json(), the same
   path used by POST /search and GET /collections. This makes the
   "Submit" button in the collection filter panel actually apply the
   CQL2 expression to the item results.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
STAC Browser sends GET requests with ?filter-lang=cql2-text&filter=...
(e.g. variable = 'ssh') when the Additional Filters CQL2 builder is used
on a collection's items page. The API only handled cql2-json, so filters
were silently ignored and results were unfiltered.

Adds _parse_cql2_text() — a regex-based parser for simple equality and
comparison expressions — ported from the fesom_stac2 reference impl.
Adds _parse_cql2_filter() dispatcher that routes to the correct parser
based on filter-lang, defaulting to cql2-text when the parameter is absent.

Updates item_collection() and all_collections() to use the dispatcher so
both cql2-text and cql2-json are accepted in GET filter requests.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
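
A regex-based parser for simple cql2-text comparisons, as described above, might look like the following. This is not the fesom_stac2 port itself: the pattern, the function name, and the returned dict shape are assumptions, and the naive split on AND would mis-handle a string literal that itself contains " AND ".

```python
import re
from typing import Any, Dict, Tuple

# One comparison: <property> <operator> <'string' | number>
_TERM = re.compile(
    r"^\s*(?P<prop>[A-Za-z_][\w:.]*)\s*"
    r"(?P<op><=|>=|!=|<>|=|<|>)\s*"
    r"(?P<val>'(?:[^']|'')*'|-?\d+(?:\.\d+)?)\s*$"
)


def parse_cql2_text(expr: str) -> Dict[str, Tuple[str, Any]]:
    """Parse AND-joined comparisons into {property: (op, value)}."""
    props: Dict[str, Tuple[str, Any]] = {}
    for term in re.split(r"\s+AND\s+", expr, flags=re.IGNORECASE):
        match = _TERM.match(term)
        if match is None:
            continue  # unsupported syntax is skipped rather than raising
        raw = match.group("val")
        if raw.startswith("'"):
            # Strip the quotes and undo CQL2's doubled-quote escaping.
            value: Any = raw[1:-1].replace("''", "'")
        else:
            value = float(raw) if "." in raw else int(raw)
        props[match.group("prop")] = (match.group("op"), value)
    return props
```

The request variable = 'ssh' from the example above would produce {"variable": ("=", "ssh")}, the same shape a cql2-json parser would hand to the search layer.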
The "Negate filter" checkbox in STAC Browser wraps the entire CQL2
expression in NOT(). Both parsers previously ignored NOT silently:
- cql2-text: NOT was lost in text serialization (bug in stac-browser)
- cql2-json: NOT node returned empty dict (no-op)

Fixes both parsers to invert all comparison operators when a NOT wrapper
is encountered:  = → !=,  < → >=,  <= → >,  > → <=,  >= → <

Also adds _CQL2_OP_INVERT at module level (shared by both parsers) and
handles nested NOT by toggling the negate flag recursively.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
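
The operator inversion described above can be sketched for the simplest case, a NOT wrapped around a single comparison. The table mirrors the mapping in the commit message plus the reverse direction for !=. The function name is hypothetical, and the sketch deliberately does not cover NOT around an AND group, where strict Boolean logic (De Morgan) would also turn the AND into an OR.

```python
from typing import Any, Dict, Tuple

# Shared inversion table: each operator maps to its logical complement.
_CQL2_OP_INVERT = {
    "=": "!=",
    "!=": "=",
    "<": ">=",
    "<=": ">",
    ">": "<=",
    ">=": "<",
}


def parse_comparison(node: Dict[str, Any], negate: bool = False) -> Tuple[str, str, Any]:
    """Return (property, operator, value), honouring enclosing NOT nodes."""
    op = str(node["op"]).lower()
    if op == "not":
        # Each NOT toggles the flag, so NOT(NOT(x)) is the same as x.
        return parse_comparison(node["args"][0], not negate)
    if op not in _CQL2_OP_INVERT:
        raise ValueError(f"unsupported operator: {op!r}")
    prop = node["args"][0]["property"]
    value = node["args"][1]
    return prop, (_CQL2_OP_INVERT[op] if negate else op), value
```

Threading a boolean flag through the recursion avoids building a separate negated tree and handles arbitrarily nested NOT wrappers.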
When the user adds two filter rows for the same property (e.g. Variable =
ssh AND Variable = sst), the parsers previously overwrote the first value
with the second via dict.update(). The stored filter_props then had only
the last value, so the result showed sst items instead of 0 items.

Fix: both _parse_cql2_json and _parse_cql2_text now collect multiple
(op, val) tuples in a list when the same key appears in two AND branches.
search_items and _collection_matches iterate over the list and emit one
SQL condition per entry, so all constraints are ANDed correctly at the
DB level (variable = 'ssh' AND variable = 'sst' → 0 rows, as expected).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
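
The accumulation rule described above (keep a single tuple until a second constraint on the same key arrives, then switch to a list) can be sketched in memory as follows. The helper names are hypothetical, and matching is done on plain dicts rather than in SQL to keep the example self-contained.

```python
import operator
from typing import Any, Dict

_OPS = {
    "=": operator.eq,
    "!=": operator.ne,
    "<": operator.lt,
    "<=": operator.le,
    ">": operator.gt,
    ">=": operator.ge,
}


def add_constraint(filter_props: Dict[str, Any], key: str, op: str, value: Any) -> None:
    """Record a constraint without overwriting earlier ones on the same key."""
    existing = filter_props.get(key)
    if existing is None:
        filter_props[key] = (op, value)
    elif isinstance(existing, list):
        existing.append((op, value))
    else:
        filter_props[key] = [existing, (op, value)]


def matches(properties: Dict[str, Any], filter_props: Dict[str, Any]) -> bool:
    """True only if every recorded constraint holds (AND semantics)."""
    for key, spec in filter_props.items():
        if key not in properties:
            return False
        constraints = spec if isinstance(spec, list) else [spec]
        if not all(_OPS[op](properties[key], value) for op, value in constraints):
            return False
    return True
```

With both Variable = ssh and Variable = sst recorded, no item can satisfy the filter, which reproduces the expected empty result instead of silently returning the sst items.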
Add support for OR-connected filter conditions in both CQL2-text and
CQL2-JSON formats. The 'Match any filters (or)' toggle in STAC Browser
now correctly returns items matching any of the selected values instead
of all items (no-op).

- _parse_cql2_json: handle 'or' op by collecting same-field equality
  values as plain lists ['v1', 'v2'] (distinct from AND tuple lists)
- _parse_cql2_text: detect top-level OR keyword and collect plain value
  lists from each OR branch
- search_items (duckdb): detect plain-value lists and emit IN clause
  vs AND conditions (tuple lists)
- _collection_matches: use any() for OR list matching

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
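
The distinction drawn above between plain value lists (OR) and tuple lists (AND) maps naturally onto two SQL shapes. This sketch assumes a hypothetical helper that receives an already-validated column expression and returns a parameterised condition; it is not the search_items() code from the PR.

```python
from typing import Any, List, Tuple, Union

_SQL_OPS = {"=", "!=", "<", "<=", ">", ">="}

Spec = Union[Tuple[str, Any], List[Any]]


def sql_condition(column: str, spec: Spec) -> Tuple[str, List[Any]]:
    """Translate one filter_props entry into (sql_fragment, parameters)."""
    if isinstance(spec, tuple):
        spec = [spec]  # a single (op, value) constraint
    if not spec:
        raise ValueError("empty constraint list")
    if all(isinstance(entry, tuple) for entry in spec):
        # Tuple list: every (op, value) must hold, so join with AND.
        parts, params = [], []
        for op, value in spec:
            if op not in _SQL_OPS:
                raise ValueError(f"unsupported operator: {op!r}")
            parts.append(f"{column} {op} ?")
            params.append(value)
        return " AND ".join(parts), params
    # Plain value list: any value may match, so emit an IN clause.
    placeholders = ", ".join("?" for _ in spec)
    return f"{column} IN ({placeholders})", list(spec)
```

Using the element type to tell the two cases apart keeps filter_props a plain dict, at the cost of making the convention implicit; a small dataclass would be a more explicit alternative.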
- Add per-collection queryables endpoint to API endpoints table
- Expand CQL2 filter parsing section with filter_props dict format,
  all supported combinations (AND/OR/NOT), and how each is stored
  and queried (IN clause, operator inversion, tuple vs plain lists)
- Update STAC Browser section: replace "no fork needed" note with
  description of the fork changes (CqlNot.toText fix, Item.vue badge)
  and activation requirements for each "Additional Filters" feature
- Update Phase 3 checklist with all completed filter work:
  per-collection queryables, cql2-text parser, NOT/AND/OR support,
  collection badge injection, CLI tests

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Design proposal for a persistent STAC data portal where researchers
publish a catalog with a single command, no port management required:

- New section: Data Portal & Self-Registration covering architecture,
  hot-reload mechanism, registry.json format, complete lifecycle,
  and open design questions
- Phase 6 checklist: register/deregister CLI, registry-aware client,
  Apptainer portal image, department port conventions

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Design note for Paul covering paleodatetime integration:
- Why TIMESTAMPTZ fails for geological timescales
- paleo_year integer property approach and naming rationale
- Three candidate sources at scan time (finished_config.yaml,
  CLI flag, NetCDF time coordinate auto-detection)
- CQL2 filter examples and STAC Browser queryables integration
- Change surface table — filter machinery requires no changes
- Open question on config file placement for Paul to answer

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
… note

Documents the decision to not add upath as a dependency yet, with
rationale and zero-cost preparation steps:
- What UPath solves (transport layer) vs what hpc/ solves (accessibility)
- Three concrete use cases where it would help (asset hrefs, federation, scanning)
- Why tape/HSM and Lustre performance are out of scope for upath
- Interface guidelines: os.PathLike type hints, full URI storage, keep
  storage detection separate from path handling
- Trigger conditions for when to actually add the dependency

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
ARCHITECTURE.md:
- New section documenting VirtualiZarr as a future enhancement
- Explains STAC (discovery) + VirtualiZarr (access) layering
- Integration points table with trigger conditions
- Current HPC limitations (Lustre/tape scope)

docs/virtualizarr_workflow.md (new):
- Workflow 1: in-memory virtual cube from STAC query results
- Workflow 2: persist manifest as Kerchunk JSON for fast re-opening
- Workflow 3: persist to Icechunk for versioned, append-friendly storage
- Workflow 4: multi-variable multi-collection query and grouping
- HPC tips: tape state checking, file:// prefix stripping, parallel
  manifest creation with joblib, NetCDF3/GRIB parser notes
- Future: Kerchunk manifest as a second STAC asset

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
The per-department port-per-group model was a carry-over from the old
"one API process per DuckDB" thinking. Once the catalog list is dynamic
(registry.json read per request), a single STAC API process serves all
catalogs — splitting by department rebuilds the old model unnecessarily.

- Default: one API, one shared registry.json, one port
- Per-department processes documented as opt-in for access-control /
  admin-autonomy cases only
- Removed "cross-department search" open question (no longer needed)
- Phase 6 checklist updated to match

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

Development

Successfully merging this pull request may close these issues.

Catalog Example Runs for AWI-ESM

3 participants