SimCat Prototype by pgierz · Pull Request #1417 · esm-tools/esm_tools

pgierz · 2026-01-13T09:02:02Z

Prototype of the Simulation Catalog work. This is meant to be only exploratory at the moment.

- Add browser/ directory with three-tier STAC Browser customisation (config overrides, net-new src additions, reference patches)
- Document stacbrowser2 as a proper GitHub fork (esm-tools/stac-browser) with two-remote setup (origin=fork, upstream=radiantearth)
- Add scan/context.py and fix the collection assignment design hole: resolve_context() must run before item creation; NULL collection silently breaks API navigation
- Add integration/ module (esm_tools.py, config.py) for add_files() bridge and finished_config.yaml loading
- Fix DuckDB schema: add catalogs table, collection_item_props table, experiment column + index, fix DOUBLE[] syntax
- Fix SQL WHERE bug in federation example (wrap in CTE)
- Add Collection Search section: two search modes (items/collections), conformsTo requirements, CQL2 filter parsing, item-derived property index, and differing response shapes
- Document PythonCodeBox searchType prop and collection codegen mode (template_collections.py, requests.get vs pystac-client)
- Add CORS middleware, fix cp -r for theme/, update dependencies
- Update Phase Plan: uncheck all, add Phase 5 (Hardening)
- Update Collaboration Notes to past tense

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

…Tools output

Adds the full esm_catalog package (Phase 1 complete):

Scanning
- scan/netcdf.py: xarray-based NetCDF scanner (CF dims, bbox, datetime)
- scan/grib.py: eccodes structure scan + per-hypercube cfgrib opening; handles extension-less ECHAM output via magic-byte detection in detect.py; 0-360° longitude normalisation for ECHAM Gaussian grids; .codes file enrichment for ECHAM paramId → shortName/units mapping; numeric (decode_times=False) timestamp handling for pre-1900 dates
- scan/detect.py: format dispatch by extension with magic-byte fallback
- scan/context.py: collection context resolution (ESM-Tools config or path)

STAC
- stac/item.py, stac/collection.py: STAC 1.0 Item and Collection builders
- stac/extensions/: datacube, contacts, hpc, cf extension support

Storage
- storage/duckdb.py: DuckDB-backed catalog; SET TimeZone='UTC' to avoid historical LMT offsets on TIMESTAMPTZ readback for pre-1900 data
- storage/export.py: Parquet and GeoJSON export for batch workflows

Integration & CLI
- integration/esm_tools.py: add_files() bridge; resolves symlinks and skips zero-byte files to prevent duplicate items from ESM-Tools output
- integration/config.py: finished_config.yaml loader
- cli.py: scan, scan-batch, merge-parquet, serve commands; directory walker deduplicates by resolved real path

Documentation & Tests
- CLI.md: command reference with examples
- ARCHITECTURE.md: Phase 1 marked complete; pytest + user docs required per phase
- tests/: 137 passing tests across hpc, scan, stac, storage, integration

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
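The extension-then-magic-byte dispatch described for scan/detect.py could look roughly like the sketch below. It is not the PR's code: the function name `detect_format`, the extension table, and the returned labels are assumptions made for illustration. The byte signatures themselves are standard (GRIB messages start with ASCII `GRIB`, NetCDF classic files with `CDF`, and NetCDF-4 files carry the 8-byte HDF5 signature).

```python
from pathlib import Path

# Illustrative tables; the real detect.py may use different names and labels.
_EXTENSIONS = {".nc": "netcdf", ".nc4": "netcdf",
               ".grb": "grib", ".grib": "grib", ".grb2": "grib"}
_MAGIC = (
    (b"GRIB", "grib"),                   # start of a GRIB message
    (b"CDF", "netcdf"),                  # NetCDF classic / 64-bit offset
    (b"\x89HDF\r\n\x1a\n", "netcdf"),    # HDF5 container used by NetCDF-4
)


def detect_format(path):
    """Return 'grib', 'netcdf', or None, trying the extension first."""
    path = Path(path)
    fmt = _EXTENSIONS.get(path.suffix.lower())
    if fmt is not None:
        return fmt
    # Fall back to the leading bytes for files without a known extension.
    with path.open("rb") as fh:
        head = fh.read(8)
    for magic, fmt in _MAGIC:
        if head.startswith(magic):
            return fmt
    return None
```

A file named like `basic-001_185001.01_echam` has no recognised suffix, so under these assumptions it is classified by its first bytes.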

- integration/config.py: add find_finished_configs(), get_outdata_files(), and extract_stac_metadata() helpers for working with finished_config.yaml files (date-range-suffixed, without .yaml extension)
- scan/context.py: fix _find_component_for_path() to check experiment_outdata_dir (the key used in real finished_config files; outdata_dir is None in practice)
- tests/test_integration.py: 33 tests covering all new helpers and the context resolution bug fix (159 total tests passing)
- docs/esm_tools_integration.md: integration guide covering live tidy-phase usage, batch scan, add_files() API, finished_config.yaml keys, and all three config helper functions
- ARCHITECTURE.md: mark Phase 2 as complete

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

The file_operations_tidy YAML (written by ESM-Tools during tidy) is now the preferred source for catalog construction because it carries MD5 checksums computed at tidy time, so no extra I/O is needed.

- integration/config.py: add find_file_operations_log() and get_outdata_from_file_operations(); only the outdata category is returned (log, restart_out, unknown are excluded)
- integration/esm_tools.py: add checksums kwarg to add_files() that stores file:checksum in the STAC asset and adds the file extension URL; add add_run() implementing the priority chain: file_operations_tidy → finished_config outdata_targets
- tests/test_integration.py: 177 tests passing; new classes cover find_file_operations_log, get_outdata_from_file_operations, add_files checksum injection, and add_run priority chain

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

The root cause of both issues was that stac-fastapi's BaseSearchPostRequest drops unknown fields, so the 'filter'/'filter-lang' sent by STAC Browser were silently discarded.

Changes:
- _parse_cql2_json(): parse CQL2-JSON expressions (=, !=, <, <=, >, >=, LIKE, AND) into filter_props dicts; OR/NOT silently ignored (AND-only DB)
- FilteredSearchPostRequest: subclass of BaseSearchPostRequest that captures the 'filter' and 'filter-lang' fields from the POST body
- post_search(): apply CQL2 filter from FilteredSearchPostRequest.filter
- all_collections(): parse ?filter=<cql2-json> query param so "Search for Collections → Additional filters" in STAC Browser also works
- create_app(): passes search_post_request_model=FilteredSearchPostRequest
- 6 new tests covering equality, no-match, AND, OGC rel, queryables schema, and CQL2 conformance classes

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
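As a rough illustration of what flattening a CQL2-JSON expression into a filter_props dict involves, here is a minimal AND-only sketch. The function name `parse_cql2_json` and the `(op, value)` storage format are assumptions; the commit does not spell out the exact dict layout. OR and NOT nodes are ignored here, matching the behaviour described for this commit.

```python
_COMPARISONS = {"=", "!=", "<", "<=", ">", ">=", "like"}


def parse_cql2_json(node):
    """Flatten a CQL2-JSON node into {property: (op, value)} (AND-only)."""
    op = str(node.get("op", "")).lower()
    args = node.get("args", [])
    if op == "and":
        props = {}
        for child in args:
            # A repeated property overwrites the earlier entry at this stage.
            props.update(parse_cql2_json(child))
        return props
    is_comparison = (
        op in _COMPARISONS
        and len(args) == 2
        and isinstance(args[0], dict)
        and "property" in args[0]
    )
    if is_comparison:
        return {args[0]["property"]: (op, args[1])}
    return {}  # OR, NOT and unrecognised nodes contribute nothing
```

For example, `{"op": "=", "args": [{"property": "variable"}, "ssh"]}` becomes `{"variable": ("=", "ssh")}` under these assumptions.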

The summaries field on scanned collections is empty, so enum lists for variable/experiment/component were never populated. Replace the summaries lookup with direct DISTINCT queries on the items table using DuckDB's json_extract_string(), giving STAC Browser proper dropdown pickers. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

Items stored in DuckDB have fragment-only collection links (#collection-id). STAC Browser requires absolute self/root/collection links on items to render item detail cards. Added _inject_item_links() and wired it into all item-returning paths: item_collection, get_item, get_search, post_search, _run_search.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

…arch results

STAC Browser shows no item count and disabled First/Previous/Next buttons because the ItemCollection response lacked the numberMatched/numberReturned top-level fields and had no pagination links.

- _make_item_collection: add numberMatched, numberReturned, and first/prev/next links
- _run_search: accept offset parameter, thread it to db.search_items and _make_item_collection
- post_search: extract offset from token field, preserve filter/collections in next link body
- get_search: extract token from **kwargs for GET-based pagination
- item_collection: pass offset/path so collection-level items are paginated correctly
- FilteredSearchPostRequest: add token field for pagination token (offset)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
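The offset-as-token pagination described above can be sketched as follows. This is a hedged illustration for the GET case only: `make_item_collection`, its parameters, and the `?limit=…&token=…` URL layout are assumptions, not the PR's `_make_item_collection` signature.

```python
def make_item_collection(items, number_matched, offset, limit, base_href):
    """Wrap one page of items with counts and first/prev/next links."""

    def link(rel, token):
        # The token is simply the offset of the first item on the target page.
        return {"rel": rel, "href": f"{base_href}?limit={limit}&token={token}"}

    links = [link("first", 0)]
    if offset > 0:
        links.append(link("prev", max(offset - limit, 0)))
    if offset + len(items) < number_matched:
        links.append(link("next", offset + limit))
    return {
        "type": "FeatureCollection",
        "features": items,
        "numberMatched": number_matched,
        "numberReturned": len(items),
        "links": links,
    }
```

Omitting the `next` link on the last page and the `prev` link on the first is what lets a client such as STAC Browser enable or disable its paging buttons.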

STAC Browser sends datetime values as CQL2 literal objects {"timestamp": "2026-03-11T18:45:11Z"} rather than plain strings. _parse_cql2_json was passing the dict through unchanged, causing DuckDB to fail with "Unimplemented type for cast (STRUCT -> TIMESTAMPTZ)". Added _cql2_value() helper that extracts the inner string from {"timestamp": "..."} and {"date": "..."} objects before passing the value to search_items. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
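A minimal sketch of the unwrapping step, assuming the helper only needs to handle the two literal shapes named above (the name `cql2_value` mirrors the commit's `_cql2_value`, but the body is a guess at its behaviour):

```python
def cql2_value(value):
    """Return the inner string of a CQL2 temporal literal, else the value."""
    if isinstance(value, dict):
        for key in ("timestamp", "date"):
            if key in value:
                return value[key]
    return value
```

Plain strings and numbers pass through unchanged, so the helper can be applied to every comparison operand without checking its type first.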

…ions and items

Collections stored in DuckDB have a fragment-only parent link (e.g. href: "#basic-001") which fails STAC IRI-reference format validation and renders the STAC Browser "Up" button non-functional at collection level. Items lacked a parent link entirely, also failing validation.

- _inject_collection_links: strip parent links and replace with {base_url}/ so "Up" from collection navigates to the landing page
- _inject_item_links: add parent link pointing to {base_url}/collections/{cid} so "Up" from item navigates to the parent collection and validation passes

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
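Serve-time link injection for items might look like the sketch below. The function name `inject_item_links` and the exact set of regenerated rels are assumptions based on the commit descriptions; the stored item is left untouched and a copy with absolute links is returned.

```python
def inject_item_links(item, base_url):
    """Return a copy of a STAC item with absolute navigation links."""
    base = base_url.rstrip("/")
    collection_href = f"{base}/collections/{item['collection']}"
    generated = {
        "self": f"{collection_href}/items/{item['id']}",
        "root": f"{base}/",
        "parent": collection_href,      # "Up" from an item goes to its collection
        "collection": collection_href,
    }
    # Drop stored links (e.g. fragment-only "#collection-id") for rels we regenerate.
    kept = [l for l in item.get("links", []) if l.get("rel") not in generated]
    out = dict(item)
    out["links"] = kept + [{"rel": rel, "href": href} for rel, href in generated.items()]
    return out
```

Because the links are derived from `base_url` on every request, the catalog does not need to be rescanned when the API is served from a different host or port.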

…tension

ECHAM items were missing cube:dimensions because scan_grib() returned an empty dimensions dict ("GRIB dims are implicit per-hypercube"). This caused datacube extension validation to fail in STAC Browser.

Added _extract_dimensions_grib() which iterates over all open xr.Dataset hypercubes before they are closed and extracts:
- hybrid/level/lev/depth: spatial z-axis with level number extent
- latitude/lat: spatial y-axis with coordinate extent
- longitude/lon: spatial x-axis (normalised to -180/180)
- values: reduced Gaussian grid spatial dimension (index extent)
- time/valid_time: temporal with ISO extent
- other dims: ordinal with coordinate or index extent

Tested on basic-001_185001.01_echam: produces hybrid(1-47), values(0-4159), latitude(-88.6/+88.6), longitude(-178/+180), covering all dimensions referenced by cube:variables.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

…ons URL

The HPC extension schema URL (https://esm-tools.github.io/stac-hpc-extension/v0.1.0/schema.json) has never been published, causing STAC Browser to report "Schema not found" and mark items as Invalid.

- app.py: add GET /stac-extensions/hpc/v0.1.0/schema.json endpoint that serves a JSON Schema covering all hpc:* properties and asset fields
- client.py (_inject_item_links): rewrite the github URL in stac_extensions to the local endpoint at serve time, so STAC Browser can fetch and validate against it

Also adds file:// prefix to bare asset hrefs (local filesystem paths) so they satisfy the iri-reference format required by STAC JSON schema.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

Introduce CollectionContextError(ValueError) so that the normal outcome of scanning non-outdata paths (work/, restart/, input/, etc.) is caught separately in the scan loop and logged at DEBUG level instead of ERROR. A subsequent ValueError (unexpected) is still logged at ERROR. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
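The two-level handling can be sketched as below. Only `CollectionContextError(ValueError)` comes from the commit; the logger name, `resolve_context` stand-in, and `scan_paths` loop are hypothetical and exist only to make the example runnable.

```python
import logging

logger = logging.getLogger("esm_catalog.scan")  # logger name is illustrative


class CollectionContextError(ValueError):
    """A path that does not belong to any catalogued collection."""


def resolve_context(path):
    # Hypothetical stand-in: treat anything outside outdata/ as out of scope.
    if "/outdata/" not in path:
        raise CollectionContextError(f"not an outdata path: {path}")
    return "outdata"


def scan_paths(paths):
    scanned = []
    for path in paths:
        try:
            resolve_context(path)
        except CollectionContextError as exc:
            # Expected for work/, restart/, input/ and similar directories.
            logger.debug("skipping %s: %s", path, exc)
            continue
        except ValueError as exc:
            # Any other ValueError is unexpected and stays visible.
            logger.error("failed to scan %s: %s", path, exc)
            continue
        scanned.append(path)
    return scanned
```

The subclass handler has to come first: since `CollectionContextError` is itself a `ValueError`, reversing the two `except` clauses would route every expected skip to the ERROR branch.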

ECHAM accumulation files (_accw, _co2) store all parameters under paramId=0, causing cfgrib to collapse them into a single variable named "unknown". When this occurs and a .codes table is available, expand the single entry into one variable per codes table parameter (they all share the same grid and dimensions). This gives item IDs like "runoff.echam.185001.xxx" instead of "unknown.echam.185001.xxx" and populates cube:variables with the complete variable list. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

stac-fastapi does not forward unknown query parameters through the item_collection method signature, so ?token=N was silently ignored and every page returned the same first N items. Fix by reading token (and limit) directly from request.query_params when the method-level token is None, matching the pattern already used for filter extraction in all_collections. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

Phase 3 additions:
- CQL2-JSON filter for POST /search and GET /collections
- GET /queryables with live enum lists from catalog
- GET /stac-extensions/hpc/v0.1.0/schema.json (local schema + URL rewrite)
- POST /format stub for OGC format-negotiation probe
- Absolute link injection for collections and items (fixes STAC validation)
- Asset href file:// normalisation
- Pagination: numberMatched/numberReturned + first/prev/next links for POST /search
- Pagination: token/limit read from request.query_params in item_collection
- CQL2 temporal literal unwrapping (_cql2_value)

Phase 5: mark ECHAM GRIB support as complete:
- _extract_dimensions_grib() for cube:dimensions
- paramId=0 expansion via codes table for _accw/_co2 files
- CollectionContextError at DEBUG level for expected path-skip

Open questions: document ECHAM GRIB remaining gap (residual unknowns in mixed hypercubes where some paramIds are not in standard eccodes tables).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

GRIB files (e.g. ECHAM _echam/_accw/_co2) contain many variables in a single file. Previously only the primary variable was searchable; now all variables are indexed.

Changes:
- stac/item.py: populate properties.variables JSON array with all variable names when a file has more than one variable
- storage/duckdb.py: handle variables field in search_items() via list_contains() on the JSON array; index each variable name separately in collection_item_props for collection-level search
- api/app.py: add variables queryable to /queryables, populated from all items' properties.variables arrays (487 distinct values vs 68 for the primary variable)

Semantics:
- variable == 'rsdscs' → items where rsdscs is the primary variable
- variables == 'rsdscs' → items containing rsdscs (233 matches for _echam GRIB files that bundle rsdscs with st, svo, sd, lsp, q, ...)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
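The difference between the two queryables can be shown with a small in-memory matcher. This is only an illustration of the semantics; the real implementation runs in DuckDB, and the `item_matches` helper below is hypothetical.

```python
def item_matches(item, filter_props):
    """Equality match where 'variables' means membership in the array."""
    props = item["properties"]
    for key, wanted in filter_props.items():
        if key == "variables":
            if wanted not in props.get("variables", []):
                return False
        elif props.get(key) != wanted:
            return False
    return True


# An _echam-style item: 'st' is the primary variable, the rest are bundled.
item = {"properties": {"variable": "st", "variables": ["st", "svo", "rsdscs"]}}
primary_hit = item_matches(item, {"variable": "rsdscs"})     # False
contained_hit = item_matches(item, {"variables": "rsdscs"})  # True
```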

STAC Browser uses properties.title as the item card heading, falling back to the item ID if no title is set. Without a title, cross-collection search results show only item IDs (e.g. "st.echam.185001") with no indication of which collection each item belongs to.

Inject title = "{collection} · {variable}" (e.g. "basic-001-echam · st") at serve time in _inject_item_links(); no catalog rescan required. The item ID is still shown below the title in STAC Browser.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

Replace the title-based collection label (which made all items with the same variable look identical) with a keywords injection. STAC Browser v3 renders properties.keywords as colored chip badges on item cards — the item ID remains the primary heading and the collection name appears as a small label alongside the Grib2/NetCDF badge. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

STAC Browser's SearchFilter.vue fetches queryables for the Collections tab from a rel=queryables link embedded in the GET /collections response body, not from the landing page queryables link used by the Items tab. Without this link the CQL2 filter builder was silently absent even though the API correctly declared the collection-search#filter conformance class.

Fix POST /format to accept a raw Request instead of dict|None so that plain-text CQL2 bodies no longer trigger a 422/400 validation error; the endpoint now always returns 200.

- api/client.py: inject OGC queryables link into GET /collections links[]
- api/app.py: cql2_format accepts Request, returns {} unconditionally
- tests/test_api.py: assert GET /collections contains queryables rel link
- ARCHITECTURE.md: document both fixes and the two-path queryables loading behaviour in STAC Browser

249 tests passing.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

Add a rel="queryables" link pointing to /collections/{id}/queryables in every collection object returned by _inject_collection_links. Without this link, STAC Browser's SearchFilter.vue (type="Items") calls getQueryablesLink() on the collection STAC object and gets null, so loadQueryables() is never called, queryables stays empty, and showAdditionalFilters remains false — the "Additional Filters" CQL2 section is invisible on the "Show Filters" panel for collection pages. With the link present, STAC Browser loads queryables correctly and the Additional Filters section appears in the Show Filters panel exactly as it does in the Search for Items tab. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

Adds GET /collections/{id}/queryables so STAC Browser can load the CQL2 schema when browsing a specific collection's items and show the "Additional Filters" section under "Show Filters". The endpoint returns the same JSON Schema structure as the global /queryables but with enum values scoped to items in the requested collection only. Returns 404 for unknown collection IDs. The collection link object already injects a rel=queryables link pointing to this new URL, so STAC Browser picks it up automatically. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

…ting

Two additions to enable "Additional Filters" in the collection items view:

1. GET /collections/{id}/queryables
   New endpoint returning queryable properties scoped to a single collection. STAC Browser fetches this when opening the filter panel on a collection page; without it the CQL2 builder section was absent. Returns the same schema as GET /queryables but with enum values filtered to items belonging to that collection only.

2. CQL2-JSON filter support in GET /collections/{id}/items
   The item_collection() handler now reads ?filter and ?filter-lang query params and passes them through _parse_cql2_json(), the same path used by POST /search and GET /collections. This makes the "Submit" button in the collection filter panel actually apply the CQL2 expression to the item results.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

STAC Browser sends GET requests with ?filter-lang=cql2-text&filter=... (e.g. variable = 'ssh') when the Additional Filters CQL2 builder is used on a collection's items page. The API only handled cql2-json, so filters were silently ignored and results were unfiltered.

- Adds _parse_cql2_text(), a regex-based parser for simple equality and comparison expressions, ported from the fesom_stac2 reference impl.
- Adds _parse_cql2_filter() dispatcher that routes to the correct parser based on filter-lang, defaulting to cql2-text when the filter-lang parameter is absent.
- Updates item_collection() and all_collections() to use the dispatcher so both cql2-text and cql2-json are accepted in GET filter requests.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
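A regex-based parser for simple cql2-text comparisons could be sketched like this. It is not the ported fesom_stac2 code: the function name, the accepted property-name characters, and the `(op, value)` output format are assumptions. It handles only AND-joined comparisons against quoted strings or numbers.

```python
import re

_COMPARISON = re.compile(
    r"""^\s*(?P<prop>[A-Za-z_][\w:.\-]*)\s*
        (?P<op><>|<=|>=|=|<|>)\s*
        (?:'(?P<str>(?:[^']|'')*)'|(?P<num>-?\d+(?:\.\d+)?))\s*$""",
    re.VERBOSE,
)


def parse_cql2_text(text):
    """Parse "a = 'x' AND b >= 3" into {"a": ("=", "x"), "b": (">=", 3)}."""
    props = {}
    # Naive split: a string literal containing " AND " would be cut in two.
    for clause in re.split(r"\s+AND\s+", text.strip(), flags=re.IGNORECASE):
        match = _COMPARISON.match(clause)
        if match is None:
            continue  # unsupported clause; skipped in this sketch
        if match.group("str") is not None:
            value = match.group("str").replace("''", "'")
        else:
            num = match.group("num")
            value = float(num) if "." in num else int(num)
        # CQL2 text spells "not equal" as <>; normalise it to != here.
        op = "!=" if match.group("op") == "<>" else match.group("op")
        props[match.group("prop")] = (op, value)
    return props
```

A dispatcher would then call this function or the JSON parser depending on the value of the `filter-lang` query parameter.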

The "Negate filter" checkbox in STAC Browser wraps the entire CQL2 expression in NOT(). Both parsers previously ignored NOT silently:
- cql2-text: NOT was lost in text serialization (bug in stac-browser)
- cql2-json: NOT node returned empty dict (no-op)

Fixes both parsers to invert all comparison operators when a NOT wrapper is encountered:
= → !=, < → >=, <= → >, > → <=, >= → <

Also adds _CQL2_OP_INVERT at module level (shared by both parsers) and handles nested NOT by toggling the negate flag recursively.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
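The operator inversion and the recursive negate flag can be sketched for a single comparison as follows. The table mirrors the mapping listed in the commit (with `!=` → `=` added as an assumption); `parse_comparison` is a hypothetical helper, not the PR's parser, and it deliberately leaves AND/OR handling out.

```python
CQL2_OP_INVERT = {
    "=": "!=", "!=": "=",
    "<": ">=", "<=": ">",
    ">": "<=", ">=": "<",
}


def parse_comparison(node, negate=False):
    """Return (property, op, value), inverting op under an odd number of NOTs."""
    op = node["op"].lower()
    args = node["args"]
    if op == "not":
        # Each NOT wrapper toggles the flag, so NOT(NOT(x)) is x again.
        return parse_comparison(args[0], not negate)
    if negate:
        op = CQL2_OP_INVERT[op]
    return (args[0]["property"], op, args[1])
```

With this, `NOT(variable = 'ssh')` is stored as `variable != 'ssh'`, which an AND-only query layer can execute without any notion of negation.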

When the user adds two filter rows for the same property (e.g. Variable = ssh AND Variable = sst), the parsers previously overwrote the first value with the second via dict.update(). The stored filter_props then had only the last value, so the result showed sst items instead of 0 items. Fix: both _parse_cql2_json and _parse_cql2_text now collect multiple (op, val) tuples in a list when the same key appears in two AND branches. search_items and _collection_matches iterate over the list and emit one SQL condition per entry, so all constraints are ANDed correctly at the DB level (variable = 'ssh' AND variable = 'sst' → 0 rows, as expected). Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

Add support for OR-connected filter conditions in both CQL2-text and CQL2-JSON formats. The 'Match any filters (or)' toggle in STAC Browser now correctly returns items matching any of the selected values instead of all items (no-op).

- _parse_cql2_json: handle 'or' op by collecting same-field equality values as plain lists ['v1', 'v2'] (distinct from AND tuple lists)
- _parse_cql2_text: detect top-level OR keyword and collect plain value lists from each OR branch
- search_items (duckdb): detect plain-value lists and emit IN clause vs AND conditions (tuple lists)
- _collection_matches: use any() for OR list matching

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
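The distinction between AND tuple lists and OR plain-value lists can be illustrated with a small SQL builder. This is a sketch under assumptions: the `build_where` name, the `properties` column, and the use of `?` placeholders are illustrative, while `json_extract_string()` is the DuckDB function mentioned earlier in this PR.

```python
import re

_ALLOWED_OPS = {"=", "!=", "<", "<=", ">", ">="}
_SAFE_KEY = re.compile(r"^[A-Za-z_][\w:]*$")


def build_where(filter_props):
    """Turn filter_props into (where_sql, params) with one condition per entry."""
    clauses, params = [], []
    for key, spec in filter_props.items():
        if not _SAFE_KEY.match(key):
            raise ValueError(f"unsupported property name: {key!r}")
        column = f"json_extract_string(properties, '$.{key}')"
        if isinstance(spec, list) and spec and not isinstance(spec[0], tuple):
            # Plain value list: OR semantics, expressed as an IN clause.
            placeholders = ", ".join("?" for _ in spec)
            clauses.append(f"{column} IN ({placeholders})")
            params.extend(spec)
            continue
        # A single (op, value) tuple or a list of them: AND semantics.
        conditions = spec if isinstance(spec, list) else [spec]
        for op, value in conditions:
            if op not in _ALLOWED_OPS:
                raise ValueError(f"unsupported operator: {op!r}")
            clauses.append(f"{column} {op} ?")
            params.append(value)
    return " AND ".join(clauses), params
```

Two tuples for the same key produce two ANDed conditions (which can legitimately match zero rows), whereas a plain list for the same key produces a single IN clause.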

- Add per-collection queryables endpoint to API endpoints table
- Expand CQL2 filter parsing section with filter_props dict format, all supported combinations (AND/OR/NOT), and how each is stored and queried (IN clause, operator inversion, tuple vs plain lists)
- Update STAC Browser section: replace "no fork needed" note with description of the fork changes (CqlNot.toText fix, Item.vue badge) and activation requirements for each "Additional Filters" feature
- Update Phase 3 checklist with all completed filter work: per-collection queryables, cql2-text parser, NOT/AND/OR support, collection badge injection, CLI tests

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

Design proposal for a persistent STAC data portal where researchers publish a catalog with a single command, no port management required:

- New section: Data Portal & Self-Registration covering architecture, hot-reload mechanism, registry.json format, complete lifecycle, and open design questions
- Phase 6 checklist: register/deregister CLI, registry-aware client, Apptainer portal image, department port conventions

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

Design note for Paul covering paleodatetime integration:

- Why TIMESTAMPTZ fails for geological timescales
- paleo_year integer property approach and naming rationale
- Three candidate sources at scan time (finished_config.yaml, CLI flag, NetCDF time coordinate auto-detection)
- CQL2 filter examples and STAC Browser queryables integration
- Change surface table: filter machinery requires no changes
- Open question on config file placement for Paul to answer

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

… note

Documents the decision to not add upath as a dependency yet, with rationale and zero-cost preparation steps:

- What UPath solves (transport layer) vs what hpc/ solves (accessibility)
- Three concrete use cases where it would help (asset hrefs, federation, scanning)
- Why tape/HSM and Lustre performance are out of scope for upath
- Interface guidelines: os.PathLike type hints, full URI storage, keep storage detection separate from path handling
- Trigger conditions for when to actually add the dependency

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

ARCHITECTURE.md:
- New section documenting VirtualiZarr as a future enhancement
- Explains STAC (discovery) + VirtualiZarr (access) layering
- Integration points table with trigger conditions
- Current HPC limitations (Lustre/tape scope)

docs/virtualizarr_workflow.md (new):
- Workflow 1: in-memory virtual cube from STAC query results
- Workflow 2: persist manifest as Kerchunk JSON for fast re-opening
- Workflow 3: persist to Icechunk for versioned, append-friendly storage
- Workflow 4: multi-variable multi-collection query and grouping
- HPC tips: tape state checking, file:// prefix stripping, parallel manifest creation with joblib, NetCDF3/GRIB parser notes
- Future: Kerchunk manifest as a second STAC asset

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

The per-department port-per-group model was a carry-over from the old "one API process per DuckDB" thinking. Once the catalog list is dynamic (registry.json read per request), a single STAC API process serves all catalogs; splitting by department rebuilds the old model unnecessarily.

- Default: one API, one shared registry.json, one port
- Per-department processes documented as opt-in for access-control / admin-autonomy cases only
- Removed "cross-department search" open question (no longer needed)
- Phase 6 checklist updated to match

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

wip: first example notebook, just define some paths

ec6301c

pgierz linked an issue Jan 13, 2026 that may be closed by this pull request

Catalog Example Runs for AWI-ESM #1416

Open

pgierz added 2 commits January 13, 2026 11:12

basic plot from hard-coded paths

852e607

scatter instead of tripcolor

41d0bf4

pgierz added ESM-Tools+ WP SimCat labels Jan 13, 2026

pgierz requested a review from mandresm January 13, 2026 12:25

pgierz assigned siligam Jan 13, 2026

prototype: stac prototypes

69a4ea9

mandresm added this to the ESM-Tools+ Metadata Collection milestone Jan 19, 2026

siligam and others added 20 commits January 19, 2026 11:08

add comment

df8ca51

wip

8c2fde2

...

5bd1af6

some more

42c7dd6

...

43401e1

...

8e0d503

with echam metadata

86d0143

...

5168b82

...

15fd48f

notebook is getting closer

0ce5a24

...

b135eac

allows for appendig

9952448

with joblib

fcc3fed

...

cbcc94c

OK, sometimes you still need to code on your own

bb6b090

wip: some things for pavan

d76d766

siligam and others added 30 commits March 11, 2026 16:43

test(cli): add 31 CLI tests covering all four commands

291253c

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

pgierz wants to merge 62 commits into esm-tools-plus/simcat/main from esm-tools-plus/simcat/prototype
