- Add browser/ directory with three-tier STAC Browser customisation (config overrides, net-new src additions, reference patches)
- Document stacbrowser2 as a proper GitHub fork (esm-tools/stac-browser) with two-remote setup (origin=fork, upstream=radiantearth)
- Add scan/context.py and fix the collection assignment design hole: resolve_context() must run before item creation; NULL collection silently breaks API navigation
- Add integration/ module (esm_tools.py, config.py) for add_files() bridge and finished_config.yaml loading
- Fix DuckDB schema: add catalogs table, collection_item_props table, experiment column + index, fix DOUBLE[] syntax
- Fix SQL WHERE bug in federation example (wrap in CTE)
- Add Collection Search section: two search modes (items/collections), conformsTo requirements, CQL2 filter parsing, item-derived property index, and differing response shapes
- Document PythonCodeBox searchType prop and collection codegen mode (template_collections.py, requests.get vs pystac-client)
- Add CORS middleware, fix cp -r for theme/, update dependencies
- Update Phase Plan: uncheck all, add Phase 5 (Hardening)
- Update Collaboration Notes to past tense
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…Tools output
Adds the full esm_catalog package (Phase 1 complete):
Scanning
- scan/netcdf.py: xarray-based NetCDF scanner (CF dims, bbox, datetime)
- scan/grib.py: eccodes structure scan + per-hypercube cfgrib opening; handles extension-less ECHAM output via magic-byte detection in detect.py; 0-360° longitude normalisation for ECHAM Gaussian grids; .codes file enrichment for ECHAM paramId → shortName/units mapping; numeric (decode_times=False) timestamp handling for pre-1900 dates
- scan/detect.py: format dispatch by extension with magic-byte fallback
- scan/context.py: collection context resolution (ESM-Tools config or path)
STAC
- stac/item.py, stac/collection.py: STAC 1.0 Item and Collection builders
- stac/extensions/: datacube, contacts, hpc, cf extension support
Storage
- storage/duckdb.py: DuckDB-backed catalog; SET TimeZone='UTC' to avoid historical LMT offsets on TIMESTAMPTZ readback for pre-1900 data
- storage/export.py: Parquet and GeoJSON export for batch workflows
Integration & CLI
- integration/esm_tools.py: add_files() bridge; resolves symlinks and skips zero-byte files to prevent duplicate items from ESM-Tools output
- integration/config.py: finished_config.yaml loader
- cli.py: scan, scan-batch, merge-parquet, serve commands; directory walker deduplicates by resolved real path
Documentation & Tests
- CLI.md: command reference with examples
- ARCHITECTURE.md: Phase 1 marked complete; pytest + user docs required per phase
- tests/: 137 passing tests across hpc, scan, stac, storage, integration
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
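The extension-then-magic-byte dispatch described for scan/detect.py can be sketched roughly as follows. This is an illustration only: the function name `detect_format`, the extension table, and the return values are assumptions, not the package's actual API. It also checks only the first bytes of the file, whereas real GRIB files may carry padding before the first message.

```python
from pathlib import Path
from typing import Optional, Union

# Leading bytes that identify each container format.
_MAGIC = (
    (b"GRIB", "grib"),                 # GRIB edition 1 and 2 messages
    (b"CDF", "netcdf"),                # NetCDF classic and 64-bit offset
    (b"\x89HDF\r\n\x1a\n", "netcdf"),  # NetCDF-4, stored as HDF5
)

# Hypothetical extension table; the real one may list other suffixes.
_EXTENSIONS = {".nc": "netcdf", ".nc4": "netcdf", ".grb": "grib", ".grib": "grib"}


def detect_format(path: Union[str, Path]) -> Optional[str]:
    """Return "netcdf", "grib", or None if the file is not recognised."""
    path = Path(path)
    by_extension = _EXTENSIONS.get(path.suffix.lower())
    if by_extension is not None:
        return by_extension
    # Fallback for extension-less names such as "basic-001_185001.01_echam".
    with path.open("rb") as handle:
        head = handle.read(8)
    for magic, fmt in _MAGIC:
        if head.startswith(magic):
            return fmt
    return None
```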
- integration/config.py: add find_finished_configs(), get_outdata_files(), and extract_stac_metadata() helpers for working with finished_config.yaml files (date-range-suffixed, without .yaml extension)
- scan/context.py: fix _find_component_for_path() to check experiment_outdata_dir (the key used in real finished_config files; outdata_dir is None in practice)
- tests/test_integration.py: 33 tests covering all new helpers and the context resolution bug fix (159 total tests passing)
- docs/esm_tools_integration.md: integration guide covering live tidy-phase usage, batch scan, add_files() API, finished_config.yaml keys, and all three config helper functions
- ARCHITECTURE.md: mark Phase 2 as complete
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
The file_operations_tidy YAML (written by ESM-Tools during tidy) is now the preferred source for catalog construction because it carries MD5 checksums computed at tidy time, so no extra I/O is needed.
- integration/config.py: add find_file_operations_log() and get_outdata_from_file_operations(); only the outdata category is returned (log, restart_out, unknown are excluded)
- integration/esm_tools.py: add checksums kwarg to add_files() that stores file:checksum in the STAC asset and adds the file extension URL; add add_run() implementing the priority chain: file_operations_tidy → finished_config outdata_targets
- tests/test_integration.py: 177 tests passing; new classes cover find_file_operations_log, get_outdata_from_file_operations, add_files checksum injection, and add_run priority chain
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
The root cause of both issues was that stac-fastapi's BaseSearchPostRequest drops unknown fields, so the 'filter'/'filter-lang' sent by STAC Browser were silently discarded.
Changes:
- _parse_cql2_json(): parse CQL2-JSON expressions (=, !=, <, <=, >, >=, LIKE, AND) into filter_props dicts; OR/NOT silently ignored (AND-only DB)
- FilteredSearchPostRequest: subclass of BaseSearchPostRequest that captures the 'filter' and 'filter-lang' fields from the POST body
- post_search(): apply CQL2 filter from FilteredSearchPostRequest.filter
- all_collections(): parse ?filter=<cql2-json> query param so "Search for Collections → Additional filters" in STAC Browser also works
- create_app(): passes search_post_request_model=FilteredSearchPostRequest
- 6 new tests covering equality, no-match, AND, OGC rel, queryables schema, and CQL2 conformance classes
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
The summaries field on scanned collections is empty, so enum lists for variable/experiment/component were never populated. Replace the summaries lookup with direct DISTINCT queries on the items table using DuckDB's json_extract_string(), giving STAC Browser proper dropdown pickers. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Items stored in DuckDB have fragment-only collection links (#collection-id). STAC Browser requires absolute self/root/collection links on items to render item detail cards. Added _inject_item_links() and wired it into all item-returning paths: item_collection, get_item, get_search, post_search, _run_search. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…arch results
STAC Browser shows no item count and disabled First/Previous/Next buttons because the ItemCollection response lacked the numberMatched/numberReturned top-level fields and had no pagination links.
- _make_item_collection: add numberMatched, numberReturned, and first/prev/next links
- _run_search: accept offset parameter, thread it to db.search_items and _make_item_collection
- post_search: extract offset from token field, preserve filter/collections in next link body
- get_search: extract token from **kwargs for GET-based pagination
- item_collection: pass offset/path so collection-level items are paginated correctly
- FilteredSearchPostRequest: add token field for pagination token (offset)
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
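As an illustration of the offset-as-token pagination described above, here is a minimal sketch for the GET case. The function name, signature, and link layout are assumptions made for this example; the actual _make_item_collection may differ, and the POST variant would carry the token in a link body instead of the query string.

```python
from typing import Any, Dict, List


def make_item_collection(
    items: List[Dict[str, Any]],
    number_matched: int,
    base_url: str,
    path: str = "/search",
    offset: int = 0,
    limit: int = 10,
) -> Dict[str, Any]:
    """Wrap one page of items with counts and first/prev/next links."""

    def link(rel: str, token: int) -> Dict[str, str]:
        # The pagination token is simply the row offset of the target page.
        return {
            "rel": rel,
            "type": "application/geo+json",
            "href": f"{base_url.rstrip('/')}{path}?token={token}&limit={limit}",
        }

    links = []
    if offset > 0:
        links.append(link("first", 0))
        links.append(link("prev", max(offset - limit, 0)))
    if offset + len(items) < number_matched:
        links.append(link("next", offset + len(items)))

    return {
        "type": "FeatureCollection",
        "features": items,
        "numberMatched": number_matched,
        "numberReturned": len(items),
        "links": links,
    }
```

Omitting the next link on the last page (and first/prev on the first page) is one way to let a client disable the matching buttons.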
STAC Browser sends datetime values as CQL2 literal objects
{"timestamp": "2026-03-11T18:45:11Z"} rather than plain strings.
_parse_cql2_json was passing the dict through unchanged, causing DuckDB
to fail with "Unimplemented type for cast (STRUCT -> TIMESTAMPTZ)".
Added _cql2_value() helper that extracts the inner string from
{"timestamp": "..."} and {"date": "..."} objects before passing
the value to search_items.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
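The unwrapping step can be pictured with a few lines of Python. This is a sketch of the idea rather than the actual helper; the name `cql2_value` is a stand-in for _cql2_value.

```python
from typing import Any


def cql2_value(value: Any) -> Any:
    """Unwrap CQL2 temporal literals; pass every other value through."""
    if isinstance(value, dict):
        for key in ("timestamp", "date"):
            if key in value:
                return value[key]
    return value


# The dict form sent by STAC Browser becomes a plain ISO string.
print(cql2_value({"timestamp": "2026-03-11T18:45:11Z"}))
```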
…ions and items
Collections stored in DuckDB have a fragment-only parent link (e.g.
href: "#basic-001") which fails STAC IRI-reference format validation and
renders the STAC Browser "Up" button non-functional at collection level.
Items lacked a parent link entirely, also failing validation.
- _inject_collection_links: strip parent links and replace with {base_url}/
so "Up" from collection navigates to the landing page
- _inject_item_links: add parent link pointing to {base_url}/collections/{cid}
so "Up" from item navigates to the parent collection and validation passes
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
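A minimal sketch of the parent-link handling described above, assuming collections and items are plain dicts and that items carry their collection ID in a "collection" field. The function names mirror the commit message, but the bodies are illustrative and cover only the parent link.

```python
from typing import Any, Dict


def inject_collection_links(collection: Dict[str, Any], base_url: str) -> Dict[str, Any]:
    """Replace any stored parent link with an absolute link to the landing page."""
    base = base_url.rstrip("/")
    links = [l for l in collection.get("links", []) if l.get("rel") != "parent"]
    links.append({"rel": "parent", "type": "application/json", "href": f"{base}/"})
    return {**collection, "links": links}


def inject_item_links(item: Dict[str, Any], base_url: str) -> Dict[str, Any]:
    """Add an absolute parent link pointing at the item's collection."""
    base = base_url.rstrip("/")
    links = [l for l in item.get("links", []) if l.get("rel") != "parent"]
    links.append(
        {
            "rel": "parent",
            "type": "application/json",
            "href": f"{base}/collections/{item['collection']}",
        }
    )
    return {**item, "links": links}
```

Both functions return a new dict, so the stored record is left untouched and the links are rebuilt at serve time.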
…tension
ECHAM items were missing cube:dimensions because scan_grib() returned an
empty dimensions dict ("GRIB dims are implicit per-hypercube"). This caused
datacube extension validation to fail in STAC Browser.
Added _extract_dimensions_grib() which iterates over all open xr.Dataset
hypercubes before they are closed and extracts:
- hybrid/level/lev/depth: spatial z-axis with level number extent
- latitude/lat: spatial y-axis with coordinate extent
- longitude/lon: spatial x-axis (normalised to -180/180)
- values: reduced Gaussian grid spatial dimension (index extent)
- time/valid_time: temporal with ISO extent
- other dims: ordinal with coordinate or index extent
Tested on basic-001_185001.01_echam: produces hybrid(1-47), values(0-4159),
latitude(-88.6/+88.6), longitude(-178/+180) — covering all dimensions
referenced by cube:variables.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…ons URL
The HPC extension schema URL (https://esm-tools.github.io/stac-hpc-extension/v0.1.0/schema.json) has never been published, causing STAC Browser to report "Schema not found" and mark items as Invalid.
- app.py: add GET /stac-extensions/hpc/v0.1.0/schema.json endpoint that serves a JSON Schema covering all hpc:* properties and asset fields
- client.py (_inject_item_links): rewrite the github URL in stac_extensions to the local endpoint at serve time, so STAC Browser can fetch and validate against it
Also adds file:// prefix to bare asset hrefs (local filesystem paths) so they satisfy the iri-reference format required by STAC JSON schema.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
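The two serve-time rewrites can be sketched as below. The function name `normalise_item` is hypothetical, the remote URL is the one quoted above with the line-wrap space removed, and a "bare" href is assumed to mean an absolute POSIX path.

```python
import copy
from typing import Any, Dict

HPC_SCHEMA_REMOTE = "https://esm-tools.github.io/stac-hpc-extension/v0.1.0/schema.json"
HPC_SCHEMA_LOCAL_PATH = "/stac-extensions/hpc/v0.1.0/schema.json"


def normalise_item(item: Dict[str, Any], base_url: str) -> Dict[str, Any]:
    """Return a copy with the HPC schema URL and bare asset hrefs rewritten."""
    out = copy.deepcopy(item)
    local = base_url.rstrip("/") + HPC_SCHEMA_LOCAL_PATH
    # Point the unpublished extension URL at the locally served schema.
    out["stac_extensions"] = [
        local if url == HPC_SCHEMA_REMOTE else url
        for url in out.get("stac_extensions", [])
    ]
    # Turn bare filesystem paths into file:// IRIs.
    for asset in out.get("assets", {}).values():
        href = asset.get("href", "")
        if href.startswith("/"):
            asset["href"] = "file://" + href
    return out
```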
Introduce CollectionContextError(ValueError) so that the normal outcome of scanning non-outdata paths (work/, restart/, input/, etc.) is caught separately in the scan loop and logged at DEBUG level instead of ERROR. Any other (unexpected) ValueError is still logged at ERROR. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
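The catch order is what makes this work: the subclass handler has to come before the generic ValueError handler. A minimal sketch follows, in which `resolve_context` is a toy stand-in and the logger name is an assumption.

```python
import logging
from typing import List, Tuple

logger = logging.getLogger("esm_catalog.scan")  # hypothetical logger name


class CollectionContextError(ValueError):
    """Raised when a path does not belong to any catalogued collection."""


def resolve_context(path: str) -> str:
    # Toy stand-in: only paths under an outdata/ directory have a context.
    if "/outdata/" not in path:
        raise CollectionContextError(f"no collection context for {path}")
    return "outdata"


def scan_paths(paths: List[str]) -> List[Tuple[str, str]]:
    scanned = []
    for path in paths:
        try:
            context = resolve_context(path)
        except CollectionContextError as exc:
            # Expected for work/, restart/, input/, etc.
            logger.debug("skipping %s: %s", path, exc)
            continue
        except ValueError as exc:
            # Anything else is unexpected and stays visible.
            logger.error("failed to scan %s: %s", path, exc)
            continue
        scanned.append((path, context))
    return scanned
```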
ECHAM accumulation files (_accw, _co2) store all parameters under paramId=0, causing cfgrib to collapse them into a single variable named "unknown". When this occurs and a .codes table is available, expand the single entry into one variable per codes table parameter (they all share the same grid and dimensions). This gives item IDs like "runoff.echam.185001.xxx" instead of "unknown.echam.185001.xxx" and populates cube:variables with the complete variable list. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
stac-fastapi does not forward unknown query parameters through the item_collection method signature, so ?token=N was silently ignored and every page returned the same first N items. Fix by reading token (and limit) directly from request.query_params when the method-level token is None, matching the pattern already used for filter extraction in all_collections. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
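The fallback can be sketched independently of the web framework by treating the query parameters as a plain string mapping. The helper name `read_paging` and the default limit of 10 are assumptions for illustration.

```python
from typing import Mapping, Optional, Tuple


def read_paging(
    query_params: Mapping[str, str],
    token: Optional[str] = None,
    limit: Optional[int] = None,
    default_limit: int = 10,
) -> Tuple[int, int]:
    """Return (offset, limit), falling back to raw query parameters."""
    if token is None:
        token = query_params.get("token")
    raw_limit = limit if limit is not None else query_params.get("limit")
    try:
        offset = max(int(token), 0)
    except (TypeError, ValueError):
        offset = 0  # missing or malformed token: start at the first page
    try:
        page_size = max(int(raw_limit), 1)
    except (TypeError, ValueError):
        page_size = default_limit
    return offset, page_size
```

In a handler this would be called with `request.query_params`, which behaves like a string mapping.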
Phase 3 additions:
- CQL2-JSON filter for POST /search and GET /collections
- GET /queryables with live enum lists from catalog
- GET /stac-extensions/hpc/v0.1.0/schema.json (local schema + URL rewrite)
- POST /format stub for OGC format-negotiation probe
- Absolute link injection for collections and items (fixes STAC validation)
- Asset href file:// normalisation
- Pagination: numberMatched/numberReturned + first/prev/next links for POST /search
- Pagination: token/limit read from request.query_params in item_collection
- CQL2 temporal literal unwrapping (_cql2_value)
Phase 5: mark ECHAM GRIB support as complete:
- _extract_dimensions_grib() for cube:dimensions
- paramId=0 expansion via codes table for _accw/_co2 files
- CollectionContextError at DEBUG level for expected path-skip
Open questions: document ECHAM GRIB remaining gap (residual unknowns in mixed hypercubes where some paramIds are not in standard eccodes tables).
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
GRIB files (e.g. ECHAM _echam/_accw/_co2) contain many variables in a
single file. Previously only the primary variable was searchable; now
all variables are indexed.
Changes:
- stac/item.py: populate properties.variables JSON array with all
variable names when a file has more than one variable
- storage/duckdb.py: handle variables field in search_items() via
list_contains() on the JSON array; index each variable name
separately in collection_item_props for collection-level search
- api/app.py: add variables queryable to /queryables, populated from
all items' properties.variables arrays (487 distinct values vs 68
for the primary variable)
Semantics:
variable == 'rsdscs' → items where rsdscs is the primary variable
variables == 'rsdscs' → items containing rsdscs (233 matches for
_echam GRIB files that bundle rsdscs with
st, svo, sd, lsp, q, ...)
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
STAC Browser uses properties.title as the item card heading, falling
back to the item ID if no title is set. Without a title, cross-
collection search results show only item IDs (e.g. "st.echam.185001")
with no indication of which collection each item belongs to.
Inject title = "{collection} · {variable}" (e.g. "basic-001-echam · st")
at serve time in _inject_item_links() — no catalog rescan required.
The item ID is still shown below the title in STAC Browser.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Replace the title-based collection label (which made all items with the same variable look identical) with a keywords injection. STAC Browser v3 renders properties.keywords as colored chip badges on item cards — the item ID remains the primary heading and the collection name appears as a small label alongside the Grib2/NetCDF badge. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
STAC Browser's SearchFilter.vue fetches queryables for the Collections
tab from a rel=queryables link embedded in the GET /collections response
body — not from the landing page queryables link used by the Items tab.
Without this link the CQL2 filter builder was silently absent even though
the API correctly declared the collection-search#filter conformance class.
Fix POST /format to accept a raw Request instead of dict|None so that
plain-text CQL2 bodies no longer trigger a 422/400 validation error;
the endpoint now always returns 200.
- api/client.py: inject OGC queryables link into GET /collections links[]
- api/app.py: cql2_format accepts Request, returns {} unconditionally
- tests/test_api.py: assert GET /collections contains queryables rel link
- ARCHITECTURE.md: document both fixes and the two-path queryables loading
behaviour in STAC Browser
249 tests passing.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Add a rel="queryables" link pointing to /collections/{id}/queryables in
every collection object returned by _inject_collection_links.
Without this link, STAC Browser's SearchFilter.vue (type="Items") calls
getQueryablesLink() on the collection STAC object and gets null, so
loadQueryables() is never called, queryables stays empty, and
showAdditionalFilters remains false — the "Additional Filters" CQL2
section is invisible on the "Show Filters" panel for collection pages.
With the link present, STAC Browser loads queryables correctly and the
Additional Filters section appears in the Show Filters panel exactly
as it does in the Search for Items tab.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Adds GET /collections/{id}/queryables so STAC Browser can load the
CQL2 schema when browsing a specific collection's items and show the
"Additional Filters" section under "Show Filters".
The endpoint returns the same JSON Schema structure as the global
/queryables but with enum values scoped to items in the requested
collection only. Returns 404 for unknown collection IDs.
The collection link object already injects a rel=queryables link
pointing to this new URL, so STAC Browser picks it up automatically.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…ting
Two additions to enable "Additional Filters" in the collection items view:
1. GET /collections/{id}/queryables
New endpoint returning queryable properties scoped to a single
collection. STAC Browser fetches this when opening the filter panel
on a collection page; without it the CQL2 builder section was absent.
Returns the same schema as GET /queryables but with enum values
filtered to items belonging to that collection only.
2. CQL2-JSON filter support in GET /collections/{id}/items
The item_collection() handler now reads ?filter and ?filter-lang
query params and passes them through _parse_cql2_json(), the same
path used by POST /search and GET /collections. This makes the
"Submit" button in the collection filter panel actually apply the
CQL2 expression to the item results.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
STAC Browser sends GET requests with ?filter-lang=cql2-text&filter=... (e.g. variable = 'ssh') when the Additional Filters CQL2 builder is used on a collection's items page. The API only handled cql2-json, so filters were silently ignored and results were unfiltered.
Adds _parse_cql2_text(), a regex-based parser for simple equality and comparison expressions, ported from the fesom_stac2 reference impl.
Adds _parse_cql2_filter() dispatcher that routes to the correct parser based on filter-lang, defaulting to cql2-text when the parameter is absent.
Updates item_collection() and all_collections() to use the dispatcher so both cql2-text and cql2-json are accepted in GET filter requests.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
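The following is a rough sketch of a regex-based cql2-text parser plus a filter-lang dispatcher. It is not the ported fesom_stac2 code: the function names, the regular expression, and the returned dict shape are assumptions, and only a single comparison is handled. Unlike the behaviour described above, unsupported input raises instead of being ignored.

```python
import json
import re
from typing import Any, Dict, List, Optional, Tuple

FilterProps = Dict[str, List[Tuple[str, Any]]]

# property, comparison operator, then a quoted string or a number.
_COMPARISON = re.compile(
    r"^\s*(?P<prop>[A-Za-z_][\w:.]*)\s*"
    r"(?P<op><=|>=|<>|=|<|>)\s*"
    r"(?P<val>'(?:[^']|'')*'|-?\d+(?:\.\d+)?)\s*$"
)


def parse_cql2_text(expression: str) -> FilterProps:
    match = _COMPARISON.match(expression)
    if match is None:
        raise ValueError(f"unsupported cql2-text expression: {expression!r}")
    raw = match.group("val")
    if raw.startswith("'"):
        value: Any = raw[1:-1].replace("''", "'")  # unescape doubled quotes
    elif "." in raw:
        value = float(raw)
    else:
        value = int(raw)
    return {match.group("prop"): [(match.group("op"), value)]}


def parse_cql2_json(node: Dict[str, Any]) -> FilterProps:
    # Single comparison only: {"op": "=", "args": [{"property": p}, v]}
    prop, value = node["args"][0]["property"], node["args"][1]
    return {prop: [(node["op"], value)]}


def parse_cql2_filter(filter_value: str, filter_lang: Optional[str] = None) -> FilterProps:
    # cql2-text is the default when filter-lang is not supplied.
    if (filter_lang or "cql2-text") == "cql2-json":
        return parse_cql2_json(json.loads(filter_value))
    return parse_cql2_text(filter_value)
```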
The "Negate filter" checkbox in STAC Browser wraps the entire CQL2 expression in NOT(). Both parsers previously ignored NOT silently:
- cql2-text: NOT was lost in text serialization (bug in stac-browser)
- cql2-json: NOT node returned empty dict (no-op)
Fixes both parsers to invert all comparison operators when a NOT wrapper is encountered:
= → !=, < → >=, <= → >, > → <=, >= → <
Also adds _CQL2_OP_INVERT at module level (shared by both parsers) and handles nested NOT by toggling the negate flag recursively.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
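Operator inversion with a recursively toggled negate flag can be sketched as follows for CQL2-JSON nodes. The table mirrors the mapping listed above; the `!=` → `=` entry and the function name `parse_comparison` are assumptions, and the sketch covers only a NOT wrapper around a single comparison.

```python
from typing import Any, Dict, Tuple

_CQL2_OP_INVERT = {
    "=": "!=",
    "!=": "=",  # assumed reverse mapping, not listed in the commit message
    "<": ">=",
    "<=": ">",
    ">": "<=",
    ">=": "<",
}


def parse_comparison(node: Dict[str, Any], negate: bool = False) -> Tuple[str, str, Any]:
    """Return (property, operator, value), honouring NOT wrappers."""
    op = node["op"].lower()
    if op == "not":
        # Each NOT level flips the flag, so NOT(NOT(x)) is x again.
        return parse_comparison(node["args"][0], not negate)
    prop, value = node["args"][0]["property"], node["args"][1]
    if negate:
        op = _CQL2_OP_INVERT[op]
    return prop, op, value
```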
When the user adds two filter rows for the same property (e.g. Variable = ssh AND Variable = sst), the parsers previously overwrote the first value with the second via dict.update(). The stored filter_props then had only the last value, so the result showed sst items instead of 0 items. Fix: both _parse_cql2_json and _parse_cql2_text now collect multiple (op, val) tuples in a list when the same key appears in two AND branches. search_items and _collection_matches iterate over the list and emit one SQL condition per entry, so all constraints are ANDed correctly at the DB level (variable = 'ssh' AND variable = 'sst' → 0 rows, as expected). Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Add support for OR-connected filter conditions in both CQL2-text and CQL2-JSON formats. The 'Match any filters (or)' toggle in STAC Browser now correctly returns items matching any of the selected values instead of all items (no-op).
- _parse_cql2_json: handle 'or' op by collecting same-field equality values as plain lists ['v1', 'v2'] (distinct from AND tuple lists)
- _parse_cql2_text: detect top-level OR keyword and collect plain value lists from each OR branch
- search_items (duckdb): detect plain-value lists and emit IN clause vs AND conditions (tuple lists)
- _collection_matches: use any() for OR list matching
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
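To make the tuple-list versus plain-list distinction concrete, here is a sketch of how a filter_props dict might be turned into a parameterised WHERE fragment. The function name, the column whitelist, and the use of bare column names are assumptions; the real search_items works against the DuckDB schema and may build its SQL differently.

```python
from typing import Any, Dict, List, Set, Tuple

_ALLOWED_OPS = {"=", "!=", "<", "<=", ">", ">="}


def build_conditions(filter_props: Dict[str, Any], columns: Set[str]) -> Tuple[str, List[Any]]:
    """Return (where_fragment, params) using ? placeholders."""
    clauses: List[str] = []
    params: List[Any] = []
    for key, spec in filter_props.items():
        if key not in columns:
            raise KeyError(f"unknown queryable: {key}")
        if isinstance(spec, list) and spec and all(isinstance(s, tuple) for s in spec):
            # AND semantics: one condition per (op, value) tuple.
            for op, value in spec:
                if op not in _ALLOWED_OPS:
                    raise ValueError(f"unsupported operator: {op}")
                clauses.append(f"{key} {op} ?")
                params.append(value)
        elif isinstance(spec, list):
            # OR semantics: plain values collapse into one IN clause.
            if not spec:
                clauses.append("1 = 0")  # empty list can match nothing
                continue
            placeholders = ", ".join("?" for _ in spec)
            clauses.append(f"{key} IN ({placeholders})")
            params.extend(spec)
        else:
            clauses.append(f"{key} = ?")
            params.append(spec)
    return " AND ".join(clauses), params
```

With this shape, `{"variable": [("=", "ssh"), ("=", "sst")]}` yields two ANDed conditions that no row can satisfy, while `{"variable": ["ssh", "sst"]}` yields a single IN clause.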
- Add per-collection queryables endpoint to API endpoints table
- Expand CQL2 filter parsing section with filter_props dict format, all supported combinations (AND/OR/NOT), and how each is stored and queried (IN clause, operator inversion, tuple vs plain lists)
- Update STAC Browser section: replace "no fork needed" note with description of the fork changes (CqlNot.toText fix, Item.vue badge) and activation requirements for each "Additional Filters" feature
- Update Phase 3 checklist with all completed filter work: per-collection queryables, cql2-text parser, NOT/AND/OR support, collection badge injection, CLI tests
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Design proposal for a persistent STAC data portal where researchers publish a catalog with a single command, no port management required:
- New section: Data Portal & Self-Registration covering architecture, hot-reload mechanism, registry.json format, complete lifecycle, and open design questions
- Phase 6 checklist: register/deregister CLI, registry-aware client, Apptainer portal image, department port conventions
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Design note for Paul covering paleodatetime integration:
- Why TIMESTAMPTZ fails for geological timescales
- paleo_year integer property approach and naming rationale
- Three candidate sources at scan time (finished_config.yaml, CLI flag, NetCDF time coordinate auto-detection)
- CQL2 filter examples and STAC Browser queryables integration
- Change surface table (filter machinery requires no changes)
- Open question on config file placement for Paul to answer
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
… note
Documents the decision to not add upath as a dependency yet, with rationale and zero-cost preparation steps:
- What UPath solves (transport layer) vs what hpc/ solves (accessibility)
- Three concrete use cases where it would help (asset hrefs, federation, scanning)
- Why tape/HSM and Lustre performance are out of scope for upath
- Interface guidelines: os.PathLike type hints, full URI storage, keep storage detection separate from path handling
- Trigger conditions for when to actually add the dependency
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
ARCHITECTURE.md:
- New section documenting VirtualiZarr as a future enhancement
- Explains STAC (discovery) + VirtualiZarr (access) layering
- Integration points table with trigger conditions
- Current HPC limitations (Lustre/tape scope)
docs/virtualizarr_workflow.md (new):
- Workflow 1: in-memory virtual cube from STAC query results
- Workflow 2: persist manifest as Kerchunk JSON for fast re-opening
- Workflow 3: persist to Icechunk for versioned, append-friendly storage
- Workflow 4: multi-variable multi-collection query and grouping
- HPC tips: tape state checking, file:// prefix stripping, parallel manifest creation with joblib, NetCDF3/GRIB parser notes
- Future: Kerchunk manifest as a second STAC asset
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
The per-department port-per-group model was a carry-over from the old "one API process per DuckDB" thinking. Once the catalog list is dynamic (registry.json read per request), a single STAC API process serves all catalogs; splitting by department rebuilds the old model unnecessarily.
- Default: one API, one shared registry.json, one port
- Per-department processes documented as opt-in for access-control / admin-autonomy cases only
- Removed "cross-department search" open question (no longer needed)
- Phase 6 checklist updated to match
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Prototype of the Simulation Catalog work. This is meant to be only exploratory at the moment.
Closes #1416