spiraldb · mprammer · May 29, 2026 · May 27, 2026
diff --git a/.agents/README.md b/.agents/README.md
@@ -8,7 +8,7 @@ Everything an AI coding agent needs to operate inside this repo. The `.claude
 |---|---|
 | [`settings.json`](settings.json) | **Tracked** allow-list of safe, read-only Bash / git / WebFetch / pipeline commands. A fresh-clone agent gets these pre-approved so it doesn't burn turns on permission prompts. Side-effecting stages (build / fetch / extract / convert / tighten_variant) are intentionally *not* pre-approved here. |
 | `settings.local.json` | **Gitignored** per-machine override — additional permissions specific to the local agent's session. Don't commit. |
-| [`skills/`](skills/) | 16 invokable skills following the [Agent Skills](https://agentskills.io) standard — wrappers around pipeline entrypoints (`/raincloud-build`, `/raincloud-fetch`, `/raincloud-status`, `/raincloud-validate-manifest`, `/raincloud-list-datasets`, …) and procedural playbooks (`/raincloud-add-dataset`, `/raincloud-add-handler`, `/raincloud-debug-build`, …). See [`skills/README.md`](skills/README.md). |
+| [`skills/`](skills/) | 21 invokable skills following the [Agent Skills](https://agentskills.io) standard — wrappers around pipeline entrypoints (`/raincloud-build`, `/raincloud-fetch`, `/raincloud-status`, `/raincloud-validate-manifest`, `/raincloud-list-datasets`, `/raincloud-load`, `/raincloud-publish`, …) and procedural playbooks (`/raincloud-add-dataset`, `/raincloud-add-handler`, `/raincloud-debug-build`, …). See [`skills/README.md`](skills/README.md). |
 | [`context/`](context/) | Symlinks back to the repo-root canonical docs (`AGENTS.md`, `SKILLS.md`, `README.md`, `sources.schema.md`) so each `SKILL.md` can pull authoritative guidance via a stable relative path without copying. |
 | `scheduled_tasks.lock` | Gitignored — agent-runtime state. |
 

diff --git a/.agents/skills/README.md b/.agents/skills/README.md
@@ -20,6 +20,10 @@ Wrappers around `python -m scripts.pipeline.<module>`. Side-effecting ones set `
 | `/raincloud-status` | `scripts.pipeline.status` | Per-slug filesystem state (raw / workdir / parquet / vortex / variant-pending). *(read-only, model-invocable.)* |
 | `/raincloud-validate-manifest` | `scripts.pipeline.validate_manifest` | Static checks for `sources.json` — JSON Schema + handler-registry / slug-uniqueness / fetch-auth cross-checks. *(read-only, model-invocable.)* |
 | `/raincloud-list-datasets` | `scripts.pipeline.list_datasets` | Filter/list slugs by handler / license / fetch-type / reader / vortex / tag / showcase / size / regex. *(read-only, model-invocable.)* |
+| `/raincloud-discover` | `scripts.pipeline.list_datasets` | Find "interesting" datasets via the discoverability flags — tag / showcase / size / trait / view. *(read-only, model-invocable.)* |
+| `/raincloud-profile` | `scripts.pipeline.profile` | Compute per-column statistics → `outputs/v1/<slug>/profile.json` (opt-in; feeds the TUI detail pane + `list_datasets --inspect`). *(writes `profile.json`; model-invocable.)* |
+| `/raincloud-load` | `raincloud.load` (loader API) | Load a prepared dataset (cache → mirror → local build) as a lazy `Dataset`; inspect metadata or materialize. *(`disable-model-invocation: true`.)* |
+| `/raincloud-publish` | `scripts.pipeline.publish` | Sync built `outputs/v1/...` artefacts to a mirror, gated on the snapshot sha256. *(side-effecting — `disable-model-invocation: true`.)* |
 
 ## Procedural playbooks (model-invocable)
 

diff --git a/.agents/skills/raincloud-add-dataset/SKILL.md b/.agents/skills/raincloud-add-dataset/SKILL.md
@@ -13,7 +13,7 @@ Steps:
    - The license — must permit redistribution-of-derivatives. Check SPDX ID and `source_url`.
    - Approximate row count (used for `expect.rows`; can be `null` on first build).
 
-2. **Append a `DatasetSpec` to `sources.json`** using the Python load-edit-dump pattern from [AGENTS.md](../../context/AGENTS.md#safe-ways-to-edit-sourcesjson) — never `sed`. Start from [`examples/minimal_spec.json`](../../../examples/minimal_spec.json) (every field present with placeholder values) rather than typing one from scratch. Minimal direct-HTTP shape:
+2. **Append a `DatasetSpec` to `sources.json`** using the Python load-edit-dump pattern from [AGENTS.md](../../context/AGENTS.md#safe-ways-to-edit-sourcesjson) — never `sed`. Start from [`templates/minimal_spec.json`](../../../templates/minimal_spec.json) (every field present with placeholder values) rather than typing one from scratch. Minimal direct-HTTP shape:
 
    ```jsonc
    {
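
For reference, the load-edit-dump pattern that step 2 points to could look roughly like the sketch below. It assumes, hypothetically, that `sources.json` is a top-level JSON array of spec objects keyed by `slug`; the real layout may differ, and `append_spec` plus the demo fields are illustrative names only. `AGENTS.md` remains the authoritative description.

```python
import json
import os
import tempfile
from pathlib import Path


def append_spec(manifest_path: Path, spec: dict) -> None:
    """Append one spec to a JSON manifest via load -> edit -> dump."""
    # ASSUMPTION: the manifest is a top-level JSON array of objects with a "slug" key.
    specs = json.loads(manifest_path.read_text(encoding="utf-8"))
    if any(existing["slug"] == spec["slug"] for existing in specs):
        raise ValueError(f"duplicate slug: {spec['slug']}")
    specs.append(spec)

    # Write to a sibling temp file, then rename over the original,
    # so a crash cannot leave a half-written manifest behind.
    fd, tmp_name = tempfile.mkstemp(dir=manifest_path.parent, suffix=".tmp")
    with os.fdopen(fd, "w", encoding="utf-8") as fh:
        json.dump(specs, fh, indent=2)
        fh.write("\n")
    os.replace(tmp_name, manifest_path)


# Demo against a throwaway file, not the real sources.json.
demo_manifest = Path(tempfile.mkdtemp()) / "sources.json"
demo_manifest.write_text('[{"slug": "existing"}]\n', encoding="utf-8")
append_spec(demo_manifest, {"slug": "new_dataset", "license": "CC0-1.0"})
print([s["slug"] for s in json.loads(demo_manifest.read_text(encoding="utf-8"))])
```

Parsing and re-serializing the whole file keeps the manifest valid JSON at every step, which is the property a line-oriented `sed` edit cannot guarantee.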

diff --git a/.agents/skills/raincloud-add-handler/SKILL.md b/.agents/skills/raincloud-add-handler/SKILL.md
@@ -20,7 +20,7 @@ Steps:
 
    - `parsed` contains one `(path, table)` tuple per parsed file; `table` is `None` when `parse.reader = "custom"`.
    - Return `[(output_slug, table), ...]` — one tuple per output parquet. Multi-output handlers emit several slugs from one source (see `glove_split`, `osm_pbf_split`, `stack_exchange_split`).
-   - **Streaming handlers** (write direct to parquet, bypass the write stage) return `[]`. Copy [`examples/streaming_handler.py.tmpl`](../../../examples/streaming_handler.py.tmpl) as the starting point — it has the `outputs_root()` / `duckdb_connect()` / cleanup wiring already shaped — and study `factbook_variant_parse`, `wikipedia_variant_parse`, `lichess_pgn_parse` for upstream-shape variations.
+   - **Streaming handlers** (write direct to parquet, bypass the write stage) return `[]`. Copy [`templates/streaming_handler.py.tmpl`](../../../templates/streaming_handler.py.tmpl) as the starting point — it has the `outputs_root()` / `duckdb_connect()` / cleanup wiring already shaped — and study `factbook_variant_parse`, `wikipedia_variant_parse`, `lichess_pgn_parse` for upstream-shape variations.
 
 2. **Register** in `scripts/pipeline/handlers/__init__.py`:
 

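For reference, the handler contract described above (a list of `(path, table)` tuples in, a list of `(output_slug, table)` tuples out) can be sketched as below. The function name `split_by_kind`, its single-argument signature, the `kind` column, and the use of plain lists of dicts in place of real tables are all assumptions made for illustration; the actual handlers in `scripts/pipeline/handlers/` may take more arguments and operate on a different table type.

```python
from pathlib import Path
from typing import Any, Optional

# Stand-in for the real table type (an assumption; kept to stdlib so the sketch runs anywhere).
Table = list[dict[str, Any]]


def split_by_kind(parsed: list[tuple[Path, Optional[Table]]]) -> list[tuple[str, Table]]:
    """Hypothetical multi-output handler: one output slug per distinct 'kind' value."""
    buckets: dict[str, Table] = {}
    for path, table in parsed:
        if table is None:
            # With a custom reader the table is None and the handler must read `path` itself.
            raise ValueError(f"{path} was not parsed; expected a table")
        for row in table:
            buckets.setdefault(row["kind"], []).append(row)
    # One (output_slug, table) tuple per output parquet.
    return [(f"demo_{kind}", rows) for kind, rows in sorted(buckets.items())]


outputs = split_by_kind([
    (Path("a.csv"), [{"kind": "posts", "id": 1}, {"kind": "users", "id": 2}]),
    (Path("b.csv"), [{"kind": "posts", "id": 3}]),
])
print([(slug, len(rows)) for slug, rows in outputs])
```

A streaming handler would follow the same input shape but write its own parquet files and return an empty list.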
diff --git a/.agents/skills/raincloud-load/SKILL.md b/.agents/skills/raincloud-load/SKILL.md
@@ -0,0 +1,73 @@
+---
+name: raincloud-load
+description: Load a Raincloud dataset via the lightweight Python loader API (`raincloud.load(slug)`). Use when the user wants to read a prepared parquet/vortex artifact, inspect a slug's metadata, or pull a dataset for downstream analysis from cache → mirror → local build.
+argument-hint: <slug> [--format vortex|parquet] [--materialize to_arrow|scan|to_pandas]
+disable-model-invocation: true
+allowed-tools: Bash(python -c *), Bash(python -m raincloud *), Bash(python examples/use_loader.py *)
+---
+
+The Raincloud loader is a separate, lightweight Python package (`raincloud`) for reading **already-prepared** artifacts. Resolution order is `local cache → mirror → local build`; nothing is fetched until you call an accessor.
+
+## Most common shape
+
+```bash
+python -c "
+import raincloud
+ds = raincloud.load('$ARGUMENTS')   # default format='vortex' with parquet fallback
+print('rows   :', ds.num_rows)
+print('cols   :', ds.column_names[:5])
+print('source :', ds.info.get('source_url'))
+print('path   :', ds.path())        # triggers cache/mirror/build resolution
+"
+```
+
+## Materialization
+
+| Accessor | Returns | Notes |
+|---|---|---|
+| `ds.path()` | `pathlib.Path` to the on-disk artifact | First call resolves; subsequent calls are cache hits. |
+| `ds.to_arrow()` | `pyarrow.Table` | Materializes the whole table; expensive on multi-GB slugs. |
+| `ds.to_vortex()` | `vortex.VortexFile` (lazy) | Vortex-native handle. |
+| `ds.scan()` | `duckdb.DuckDBPyRelation` | Requires `raincloud[duckdb]`. Always reads the parquet sibling — if the slug was loaded as vortex, you'll see a `[raincloud]` stderr note before the parquet is resolved. |
+| `ds.to_pandas()` | `pandas.DataFrame` | Requires `raincloud[pandas]`. |
+| `ds.schema` | `pyarrow.Schema` | Footer-only read for parquet; opens the file for vortex. |
+
+## Config via env vars
+
+| Env var | Effect |
+|---|---|
+| `RAINCLOUD_CACHE` | Local artifact cache dir (default `~/.cache/raincloud`). |
+| `RAINCLOUD_MIRROR` | Mirror URL (`s3://…`, `https://…`, `file://…`). Tried before the local build. |
+| `RAINCLOUD_OFFLINE=1` | Block mirror + build; raise `OfflineMiss` on cache miss. |
+| `RAINCLOUD_SNAPSHOT` | Override `docs/v1/snapshot.json` (catalog). |
+| `RAINCLOUD_MANIFEST` | Override `sources.json`. |
+
+## Drift semantics
+
+When a slug has a `sha256` pinned in the snapshot and the mirror or local build produces bytes that disagree, the loader prints a `[raincloud] WARN: <slug> from <mirror|build> sha256 drifted ...` line to stderr and adopts the new bytes anyway. Upstream content drifts, so a mismatch is expected rather than fatal. Catch and escalate `raincloud.ChecksumMismatch` only if you want a hard gate (e.g. `_cache.adopt(..., strict=True)`, as used by `python -m scripts.pipeline.publish`).
+
+## Errors (all subclass `raincloud.RaincloudError`)
+
+`UnknownSlug`, `FormatUnavailable`, `ArtifactNotFound`, `OfflineMiss`, `BuildToolingMissing`, `MissingDependency`, `ChecksumMismatch`.
+
+## Worked example
+
+```bash
+python examples/use_loader.py --slug $ARGUMENTS              # metadata only
+python examples/use_loader.py --slug $ARGUMENTS --materialize  # full path
+```
+
+[`examples/use_loader.py`](../../../examples/use_loader.py) walks the full API end-to-end against the packaged catalog with no network.
+
+## Install tiers
+
+```bash
+pip install raincloud                # lightweight loader
+pip install 'raincloud[duckdb]'      # adds .scan()
+pip install 'raincloud[pandas]'      # adds .to_pandas()
+pip install 'raincloud[s3]'          # s3:// mirror transport
+pip install 'raincloud[http]'        # https:// mirror transport
+pip install 'raincloud[build]'       # adds local-build fallback (heavyweight)
+```
+
+Context: [AGENTS.md "The loader package"](../../../AGENTS.md), [`examples/use_loader.py`](../../../examples/use_loader.py).
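For reference, the lazy `local cache → mirror → local build` resolution that this skill describes can be modelled as in the sketch below. It is a minimal stand-in, not the `raincloud` package's code: `LazyArtifact`, the two callables, and the `resolved_from` attribute are hypothetical names, and the local `OfflineMiss` class only mirrors the name of the loader's error.

```python
import tempfile
from pathlib import Path
from typing import Callable, Optional


class OfflineMiss(Exception):
    """Stand-in for the loader error of the same name."""


class LazyArtifact:
    """Resolve an artifact on first access: cache, then mirror, then local build."""

    def __init__(
        self,
        cache_path: Path,
        fetch_from_mirror: Callable[[Path], bool],
        build_locally: Callable[[Path], bool],
        offline: bool = False,
    ) -> None:
        # Nothing is fetched here; work is deferred until path() is called.
        self._cache_path = cache_path
        self._sources = [("mirror", fetch_from_mirror), ("build", build_locally)]
        self._offline = offline
        self.resolved_from: Optional[str] = None

    def path(self) -> Path:
        if self._cache_path.exists():
            self.resolved_from = "cache"
            return self._cache_path
        if self._offline:
            raise OfflineMiss(f"{self._cache_path} is not cached and offline mode is on")
        for name, populate in self._sources:
            # Each source writes into the cache location and reports success.
            if populate(self._cache_path):
                self.resolved_from = name
                return self._cache_path
        raise FileNotFoundError(str(self._cache_path))


def mirror_miss(dest: Path) -> bool:
    return False  # pretend the mirror does not have the artifact


def fake_build(dest: Path) -> bool:
    dest.write_bytes(b"demo")
    return True


cache_file = Path(tempfile.mkdtemp()) / "demo.parquet"
artifact = LazyArtifact(cache_file, mirror_miss, fake_build)
artifact.path()
first = artifact.resolved_from
artifact.path()
second = artifact.resolved_from
print(first, second)
```

The second call lands on the cache because the first call populated it, which matches the "subsequent calls are cache hits" note in the materialization table.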
diff --git a/.agents/skills/raincloud-publish/SKILL.md b/.agents/skills/raincloud-publish/SKILL.md
@@ -0,0 +1,56 @@
+---
+name: raincloud-publish
+description: Push locally-built Raincloud artifacts to a mirror (`scripts.pipeline.publish`). Use when the maintainer wants to upload one or more slugs' parquet/vortex bytes to a configured mirror after a successful build, gated on the snapshot's recorded sha256.
+argument-hint: <slug>... | --all  --mirror <s3://… | https://… | file://…>  [--dry-run]
+disable-model-invocation: true
+allowed-tools: Bash(python -m scripts.pipeline.publish *)
+---
+
+Sync locally-built artifacts to a mirror so downstream `raincloud.load()` users can fetch them. The CLI verifies each artifact's sha256 against `docs/v1/snapshot.json` before upload — mismatches block the push (publish is the one place an integrity gate is correct: corrupted bytes should never reach a shared mirror).
+
+## Most common shape
+
+```bash
+python -m scripts.pipeline.publish $ARGUMENTS
+```
+
+Selection (one required):
+- `<slug>...` — positional dataset slugs (any number).
+- `--all` — every slug whose parquet OR vortex artifact exists locally under `outputs/v{n}/<slug>/`.
+
+Required:
+- `--mirror <url>` — destination root. Examples:
+  - `s3://my-bucket/raincloud` (needs `pip install 'raincloud[s3]'`)
+  - `https://artifacts.example.com/raincloud` (needs `raincloud[http]`)
+  - `file:///mnt/shared/raincloud-mirror` (built-in)
+- `RAINCLOUD_MIRROR=<url>` env var works as an alternative to the flag.
+
+Modifiers:
+- `--dry-run` — print the upload plan (paths + keys) without writing. Always preview large publishes this way first.
+
+## Before publishing
+
+1. **Build the slug locally first** (`/raincloud-build <slug>` or `python -m scripts.pipeline.build <slug>`). Publish does not build; it only pushes what's already in `outputs/v{n}/<slug>/`.
+2. **Regenerate the snapshot** (`/raincloud-docs` or `python -m scripts.pipeline.docs`). The publish gate compares local bytes against `parquet_sha256` / `vortex_sha256` in `docs/v1/snapshot.json` — a stale snapshot causes false `PublishMismatch` failures.
+3. **Verify the mirror URL** with a dry-run first.
+
+## What gets uploaded
+
+For each slug × format pair (`parquet`, `vortex`) where a local artifact exists:
+- Key: `v1/<slug>/<fmt>/<slug>.<ext>`
+- Body: the raw artifact bytes
+- Gate: `sha256(local) == snapshot[<fmt>_sha256]`. A mismatch raises `PublishMismatch`. An unpinned slug (its `sha256` is `null`) uploads without verification, so regenerate the snapshot first.
+
+## Failure modes
+
+| Error | Meaning | Fix |
+|---|---|---|
+| `PublishMismatch: ...` | Local bytes diverge from the snapshot's recorded sha. | Re-run `/raincloud-docs` to refresh the snapshot, OR confirm the local artifact is correct and commit the new snapshot. |
+| `FileNotFoundError: outputs/v1/<slug>/...` | Slug isn't built locally. | Run `/raincloud-build <slug>` first. |
+| `ImportError: Install s3fs ...` | `s3://` mirror without `[s3]` extra. | `pip install 'raincloud[s3]'`. |
+
+## After publishing
+
+Downstream users can now `raincloud.load(<slug>)` and the loader will pull from the configured mirror (cache → mirror → build).
+
+Context: [AGENTS.md "The loader package"](../../../AGENTS.md), [`scripts/pipeline/publish.py`](../../../scripts/pipeline/publish.py).
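For reference, the upload gate and key layout described under "What gets uploaded" can be sketched as below. This is an illustration under assumptions, not the code in `scripts/pipeline/publish.py`: `check_gate`, `mirror_key`, and the dict-shaped snapshot entry are hypothetical, and the local `PublishMismatch` class only mirrors the name of the CLI's error.

```python
import hashlib
from typing import Optional


class PublishMismatch(Exception):
    """Stand-in for the publish error of the same name."""


def mirror_key(slug: str, fmt: str, ext: str) -> str:
    # Key layout as documented: v1/<slug>/<fmt>/<slug>.<ext>
    return f"v1/{slug}/{fmt}/{slug}.{ext}"


def check_gate(slug: str, fmt: str, body: bytes, snapshot_entry: dict) -> str:
    """Return 'verified' or 'unpinned'; raise PublishMismatch when hashes disagree."""
    # ASSUMPTION: the snapshot entry is a dict with "<fmt>_sha256" keys.
    pinned: Optional[str] = snapshot_entry.get(f"{fmt}_sha256")
    if pinned is None:
        # Unpinned slugs upload without verification.
        return "unpinned"
    actual = hashlib.sha256(body).hexdigest()
    if actual != pinned:
        raise PublishMismatch(
            f"{slug} [{fmt}]: local {actual[:12]} != snapshot {pinned[:12]}"
        )
    return "verified"


body = b"demo artifact bytes"
entry = {"parquet_sha256": hashlib.sha256(body).hexdigest(), "vortex_sha256": None}
print(mirror_key("demo", "parquet", "parquet"), check_gate("demo", "parquet", body, entry))
print(mirror_key("demo", "vortex", "vortex"), check_gate("demo", "vortex", body, entry))
```

The `unpinned` branch is the reason to regenerate the snapshot before publishing: without a recorded hash there is nothing for the gate to compare against.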
diff --git a/.github/workflows/ci.yml b/.github/workflows/ci.yml
@@ -6,6 +6,12 @@ on:
   pull_request:
     branches: [develop]
 
+# Minimal token scope: every job is read-only (checkout + deps + lint/tests);
+# none posts comments, uploads artifacts, or writes to the repo. Without this,
+# the GITHUB_TOKEN falls back to the repo/org default, which can be broad write scope (flagged by CodeQL).
+permissions:
+  contents: read
+
 # Cancel in-progress runs when a new commit lands on the same branch.
 concurrency:
   group: ci-${{ github.ref }}
@@ -26,11 +32,14 @@ jobs:
         run: uv python install 3.11
 
       # `--extra dev` brings in pytest + ruff. `--extra tui` brings in
-      # textual so the test_browse suite runs instead of skipping. Skip
-      # kaggle/huggingface — those deps are only needed at fetch time and
-      # neither test path imports them.
+      # textual so the test_browse suite runs instead of skipping. `--extra
+      # build` brings in the heavy pipeline deps (duckdb, osmium, pyreadstat,
+      # zstandard, jsonschema, …) — required because validate_manifest +
+      # pytest collection import the handler registry, which transitively
+      # pulls those. Skip kaggle/huggingface — those deps are only needed at
+      # fetch time and neither test path imports them.
       - name: Install dependencies
-        run: uv sync --extra dev --extra tui
+        run: uv sync --extra dev --extra tui --extra build
 
       - name: Lint (ruff)
         run: uv run ruff check
@@ -40,3 +49,38 @@ jobs:
 
       - name: Test (pytest)
         run: uv run pytest -q
+
+  wheel:
+    runs-on: ubuntu-latest
+    steps:
+      - uses: actions/checkout@v4
+      - name: Install uv
+        uses: astral-sh/setup-uv@v5
+        with:
+          enable-cache: true
+      - name: Pin Python
+        run: uv python install 3.11
+      - name: Install dev deps (for pytest + vortex base dep used by fixtures)
+        run: uv sync --extra dev
+      # Scope collection to test_wheel.py: this env intentionally omits the
+      # [build] extra, but pytest imports every collected module before -m
+      # filtering, and test_manifest/test_profile import jsonschema (a [build]
+      # dep) at module top. All wheel-marked tests live in test_wheel.py.
+      - name: Run wheel tests
+        run: uv run pytest --run-wheel -m wheel -v tests/test_wheel.py
+
+  realbuild:
+    runs-on: ubuntu-latest
+    continue-on-error: true   # non-blocking: upstream flakiness never turns the build red
+    steps:
+      - uses: actions/checkout@v4
+      - name: Install uv
+        uses: astral-sh/setup-uv@v5
+        with:
+          enable-cache: true
+      - name: Pin Python
+        run: uv python install 3.11
+      - name: Install dev + build deps
+        run: uv sync --extra dev --extra build
+      - name: Run real-build network tests
+        run: uv run pytest --run-network -m network -v