Merged
33 commits
418a055
feat(loader): package skeleton, exceptions, lightweight packaging
mprammer May 27, 2026
3a7d57e
feat(loader): catalog module reading snapshot + manifest
mprammer May 27, 2026
aee0652
feat(loader): checksum-gated local cache with atomic adopt
mprammer May 27, 2026
1a6c0be
feat(loader): fsspec transport-only fetch
mprammer May 27, 2026
04abeed
feat(loader): cache->mirror->build resolver
mprammer May 27, 2026
f9a0189
feat(loader): lazy Dataset handle and load()/load_dataset
mprammer May 27, 2026
10ea912
feat(docs): carry parquet/vortex sha256 in snapshot for the loader
mprammer May 27, 2026
8b1f255
chore(loader): add SPDX headers to new loader files
mprammer May 27, 2026
11225fb
feat(publish): mirror-sync CLI gated on snapshot sha256
mprammer May 27, 2026
a8175a0
test(loader): hermetic file:// mirror e2e; document loader usage
mprammer May 27, 2026
3588686
docs(loader): document loader + publish, layered install, 0.2.0 chang…
mprammer May 27, 2026
12a6c66
fix(loader): clean BuildToolingMissing on loader-only install; doc + …
mprammer May 27, 2026
5683929
docs(loader): install from GitHub, not PyPI (name squatted)
mprammer May 27, 2026
48beac7
feat(pipeline): configurable data_root() for wheel builds
mprammer May 27, 2026
734f91f
feat(pipeline): root fetch/extract/build/status on data_root() helpers
mprammer May 27, 2026
ff3f68b
fix(pipeline): use display_path for logs so wheel builds don't ValueE…
mprammer May 28, 2026
4f8b06b
fix(handlers): use display_path for output logs (wheel-build safe)
mprammer May 28, 2026
11ef84e
fix(handlers): root scratch workdir on workdir_root() (wheel-safe)
mprammer May 28, 2026
86da060
feat(pipeline): ship + resolve sources.schema.json for wheel installs
mprammer May 28, 2026
5d23066
docs(pipeline): document data-area env vars; note wheel build-fallback
mprammer May 28, 2026
cded8a8
fix(loader): expose formats from manifest+snapshot union (seamless fa…
mprammer May 28, 2026
956a559
test: --run-wheel/--run-network gates + markers; CI lint-and-test --e…
mprammer May 28, 2026
da9bd6d
test(wheel): session-built wheel + venv helpers + smoke test
mprammer May 28, 2026
f9262a1
test(wheel): install-tier regression guard (base lightweight, [build]…
mprammer May 28, 2026
1f7b551
test(wheel): loader API matrix against installed wheel (happy + error…
mprammer May 28, 2026
7773567
test(wheel): build-proof via load('synth') end-to-end in [build] venv
mprammer May 28, 2026
68dcbab
test(pipeline): always-on hermetic e2e build via build.run_one + hand…
mprammer May 28, 2026
45bd111
test(pipeline): real network build via load(slug) across 3 tiny HTTP …
mprammer May 28, 2026
0b15738
ci: add wheel (blocking) + realbuild (non-blocking) jobs
mprammer May 28, 2026
886f6f0
test(pipeline): pin RAINCLOUD_HOME hermeticity in real-build test
mprammer May 28, 2026
5418f35
feat(loader): runnable examples, load/publish skills, integrity + doc…
mprammer May 29, 2026
882a931
ci: scope wheel job to test_wheel.py + restrict GITHUB_TOKEN to conte…
mprammer May 29, 2026
dd37ce3
docs(changelog): stamp 0.2.0 release date
mprammer May 29, 2026
2 changes: 1 addition & 1 deletion .agents/README.md
@@ -8,7 +8,7 @@ Everything an AI coding agent needs to operate inside this repo. The `.claude
 |---|---|
 | [`settings.json`](settings.json) | **Tracked** allow-list of safe, read-only Bash / git / WebFetch / pipeline commands. A fresh-clone agent gets these pre-approved so it doesn't burn turns on permission prompts. Side-effecting stages (build / fetch / extract / convert / tighten_variant) are intentionally *not* pre-approved here. |
 | `settings.local.json` | **Gitignored** per-machine override — additional permissions specific to the local agent's session. Don't commit. |
-| [`skills/`](skills/) | 16 invokable skills following the [Agent Skills](https://agentskills.io) standard — wrappers around pipeline entrypoints (`/raincloud-build`, `/raincloud-fetch`, `/raincloud-status`, `/raincloud-validate-manifest`, `/raincloud-list-datasets`, …) and procedural playbooks (`/raincloud-add-dataset`, `/raincloud-add-handler`, `/raincloud-debug-build`, …). See [`skills/README.md`](skills/README.md). |
+| [`skills/`](skills/) | 21 invokable skills following the [Agent Skills](https://agentskills.io) standard — wrappers around pipeline entrypoints (`/raincloud-build`, `/raincloud-fetch`, `/raincloud-status`, `/raincloud-validate-manifest`, `/raincloud-list-datasets`, `/raincloud-load`, `/raincloud-publish`, …) and procedural playbooks (`/raincloud-add-dataset`, `/raincloud-add-handler`, `/raincloud-debug-build`, …). See [`skills/README.md`](skills/README.md). |
 | [`context/`](context/) | Symlinks back to the repo-root canonical docs (`AGENTS.md`, `SKILLS.md`, `README.md`, `sources.schema.md`) so each `SKILL.md` can pull authoritative guidance via a stable relative path without copying. |
 | `scheduled_tasks.lock` | Gitignored — agent-runtime state. |

4 changes: 4 additions & 0 deletions .agents/skills/README.md
@@ -20,6 +20,10 @@ Wrappers around `python -m scripts.pipeline.<module>`. Side-effecting ones set `
 | `/raincloud-status` | `scripts.pipeline.status` | Per-slug filesystem state (raw / workdir / parquet / vortex / variant-pending). *(read-only, model-invocable.)* |
 | `/raincloud-validate-manifest` | `scripts.pipeline.validate_manifest` | Static checks for `sources.json` — JSON Schema + handler-registry / slug-uniqueness / fetch-auth cross-checks. *(read-only, model-invocable.)* |
 | `/raincloud-list-datasets` | `scripts.pipeline.list_datasets` | Filter/list slugs by handler / license / fetch-type / reader / vortex / tag / showcase / size / regex. *(read-only, model-invocable.)* |
+| `/raincloud-discover` | `scripts.pipeline.list_datasets` | Find "interesting" datasets via the discoverability flags — tag / showcase / size / trait / view. *(read-only, model-invocable.)* |
+| `/raincloud-profile` | `scripts.pipeline.profile` | Compute per-column statistics → `outputs/v1/<slug>/profile.json` (opt-in; feeds the TUI detail pane + `list_datasets --inspect`). *(writes `profile.json`; model-invocable.)* |
+| `/raincloud-load` | `raincloud.load` (loader API) | Load a prepared dataset (cache → mirror → local build) as a lazy `Dataset`; inspect metadata or materialize. *(`disable-model-invocation: true`.)* |
+| `/raincloud-publish` | `scripts.pipeline.publish` | Sync built `outputs/v1/...` artefacts to a mirror, gated on the snapshot sha256. *(side-effecting — `disable-model-invocation: true`.)* |

 ## Procedural playbooks (model-invocable)

2 changes: 1 addition & 1 deletion .agents/skills/raincloud-add-dataset/SKILL.md
@@ -13,7 +13,7 @@ Steps:
 - The license — must permit redistribution-of-derivatives. Check SPDX ID and `source_url`.
 - Approximate row count (used for `expect.rows`; can be `null` on first build).

-2. **Append a `DatasetSpec` to `sources.json`** using the Python load-edit-dump pattern from [AGENTS.md](../../context/AGENTS.md#safe-ways-to-edit-sourcesjson) — never `sed`. Start from [`examples/minimal_spec.json`](../../../examples/minimal_spec.json) (every field present with placeholder values) rather than typing one from scratch. Minimal direct-HTTP shape:
+2. **Append a `DatasetSpec` to `sources.json`** using the Python load-edit-dump pattern from [AGENTS.md](../../context/AGENTS.md#safe-ways-to-edit-sourcesjson) — never `sed`. Start from [`templates/minimal_spec.json`](../../../templates/minimal_spec.json) (every field present with placeholder values) rather than typing one from scratch. Minimal direct-HTTP shape:

```jsonc
{
2 changes: 1 addition & 1 deletion .agents/skills/raincloud-add-handler/SKILL.md
@@ -20,7 +20,7 @@ Steps:

 - `parsed` contains one `(path, table)` tuple per parsed file; `table` is `None` when `parse.reader = "custom"`.
 - Return `[(output_slug, table), ...]` — one tuple per output parquet. Multi-output handlers emit several slugs from one source (see `glove_split`, `osm_pbf_split`, `stack_exchange_split`).
-- **Streaming handlers** (write direct to parquet, bypass the write stage) return `[]`. Copy [`examples/streaming_handler.py.tmpl`](../../../examples/streaming_handler.py.tmpl) as the starting point — it has the `outputs_root()` / `duckdb_connect()` / cleanup wiring already shaped — and study `factbook_variant_parse`, `wikipedia_variant_parse`, `lichess_pgn_parse` for upstream-shape variations.
+- **Streaming handlers** (write direct to parquet, bypass the write stage) return `[]`. Copy [`templates/streaming_handler.py.tmpl`](../../../templates/streaming_handler.py.tmpl) as the starting point — it has the `outputs_root()` / `duckdb_connect()` / cleanup wiring already shaped — and study `factbook_variant_parse`, `wikipedia_variant_parse`, `lichess_pgn_parse` for upstream-shape variations.

 2. **Register** in `scripts/pipeline/handlers/__init__.py`:

73 changes: 73 additions & 0 deletions .agents/skills/raincloud-load/SKILL.md
@@ -0,0 +1,73 @@
---
name: raincloud-load
description: Load a Raincloud dataset via the lightweight Python loader API (`raincloud.load(slug)`). Use when the user wants to read a prepared parquet/vortex artifact, inspect a slug's metadata, or pull a dataset for downstream analysis from cache → mirror → local build.
argument-hint: <slug> [--format vortex|parquet] [--materialize to_arrow|scan|to_pandas]
disable-model-invocation: true
allowed-tools: Bash(python -c *), Bash(python -m raincloud *), Bash(python examples/use_loader.py *)
---

The Raincloud loader is a separate, lightweight Python package (`raincloud`) for reading **already-prepared** artifacts. Resolution order is `local cache → mirror → local build`; nothing is fetched until you call an accessor.
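
The resolution order and the lazy behaviour described above can be pictured with a small sketch. This is not the loader's actual code: `LazyArtifact`, `from_cache`, `from_mirror`, and `from_build` are hypothetical names used only for illustration, and each step is assumed to return a path or `None` on a miss.

```python
from pathlib import Path
from typing import Callable, Optional

# Hypothetical type: a resolution step returns a path, or None on a miss.
Step = Callable[[str], Optional[Path]]


class LazyArtifact:
    """Illustrative lazy handle: nothing is resolved until path() is called."""

    def __init__(self, slug: str, steps: list[tuple[str, Step]]):
        self.slug = slug
        self._steps = steps                      # tried in order: cache, mirror, build
        self._path: Optional[Path] = None
        self.resolved_from: Optional[str] = None

    def path(self) -> Path:
        if self._path is None:                   # first call resolves; later calls reuse it
            for name, step in self._steps:
                found = step(self.slug)
                if found is not None:
                    self._path, self.resolved_from = found, name
                    break
            else:
                raise LookupError(f"no step could produce {self.slug!r}")
        return self._path


calls: list[str] = []                            # records which steps actually ran


def from_cache(slug: str) -> Optional[Path]:
    calls.append("cache")
    return None                                  # simulate a cache miss


def from_mirror(slug: str) -> Optional[Path]:
    calls.append("mirror")
    return Path("/tmp") / f"{slug}.parquet"      # simulate a mirror hit


def from_build(slug: str) -> Optional[Path]:
    calls.append("build")
    return Path("/tmp") / f"{slug}.parquet"


handle = LazyArtifact(
    "synth",
    [("cache", from_cache), ("mirror", from_mirror), ("build", from_build)],
)
```

Constructing `handle` runs no step. The first `handle.path()` call walks the steps in order until one succeeds, and later calls reuse the stored path.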

## Most common shape

```bash
python -c "
import raincloud
ds = raincloud.load('$ARGUMENTS') # default format='vortex' with parquet fallback
print('rows :', ds.num_rows)
print('cols :', ds.column_names[:5])
print('source :', ds.info.get('source_url'))
print('path :', ds.path()) # triggers cache/mirror/build resolution
"
```

## Materialization

| Accessor | Returns | Notes |
|---|---|---|
| `ds.path()` | `pathlib.Path` to the on-disk artifact | First call resolves; subsequent calls are cache hits. |
| `ds.to_arrow()` | `pyarrow.Table` | Materializes the whole table; expensive on multi-GB slugs. |
| `ds.to_vortex()` | `vortex.VortexFile` (lazy) | Vortex-native handle. |
| `ds.scan()` | `duckdb.DuckDBPyRelation` | Requires `raincloud[duckdb]`. Always reads the parquet sibling — if the slug was loaded as vortex, you'll see a `[raincloud]` stderr note before the parquet is resolved. |
| `ds.to_pandas()` | `pandas.DataFrame` | Requires `raincloud[pandas]`. |
| `ds.schema` | `pyarrow.Schema` | Footer-only read for parquet; opens the file for vortex. |

## Config via env vars

| Env var | Effect |
|---|---|
| `RAINCLOUD_CACHE` | Local artifact cache dir (default `~/.cache/raincloud`). |
| `RAINCLOUD_MIRROR` | Mirror URL (`s3://…`, `https://…`, `file://…`). Tried before the local build. |
| `RAINCLOUD_OFFLINE=1` | Block mirror + build; raise `OfflineMiss` on cache miss. |
| `RAINCLOUD_SNAPSHOT` | Override `docs/v1/snapshot.json` (catalog). |
| `RAINCLOUD_MANIFEST` | Override `sources.json`. |
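
As a rough illustration of how these variables could be read in one place, here is a minimal sketch. `LoaderConfig` and `read_config` are hypothetical names, not part of the `raincloud` package, and treating only the literal value `1` as "offline" is an assumption drawn from the table above.

```python
import os
from dataclasses import dataclass
from pathlib import Path
from typing import Mapping, Optional


@dataclass(frozen=True)
class LoaderConfig:
    """Hypothetical container; the real loader may organize this differently."""

    cache_dir: Path
    mirror: Optional[str]
    offline: bool
    snapshot: Optional[str]
    manifest: Optional[str]


def read_config(env: Mapping[str, str] = os.environ) -> LoaderConfig:
    # Default cache dir comes from the table; "~" is expanded when possible.
    cache = os.path.expanduser(env.get("RAINCLOUD_CACHE", "~/.cache/raincloud"))
    return LoaderConfig(
        cache_dir=Path(cache),
        mirror=env.get("RAINCLOUD_MIRROR") or None,
        # Assumption: only the literal "1" switches offline mode on.
        offline=env.get("RAINCLOUD_OFFLINE") == "1",
        snapshot=env.get("RAINCLOUD_SNAPSHOT") or None,
        manifest=env.get("RAINCLOUD_MANIFEST") or None,
    )
```

Passing a plain dict instead of `os.environ` keeps the function easy to exercise in tests without touching the process environment.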

## Drift semantics

When a slug has a `sha256` pinned in the snapshot and the mirror or local build produces bytes that disagree, the loader prints a `[raincloud] WARN: <slug> from <mirror|build> sha256 drifted ...` warning to stderr and adopts the new bytes anyway. Upstream content drifts, so a mismatch is not treated as fatal by default. Catch and escalate via `raincloud.ChecksumMismatch` only if you want a hard gate (e.g. `_cache.adopt(..., strict=True)`, which is what `python -m scripts.pipeline.publish` uses).
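
The warn-by-default, raise-when-strict behaviour can be sketched as follows. This is an illustration under assumptions, not the loader's implementation: `check_drift` and `sha256_of` are hypothetical helpers, `ChecksumMismatch` is a local stand-in for `raincloud.ChecksumMismatch` so the snippet runs on its own, and treating an unpinned slug as "nothing to compare" is assumed.

```python
import hashlib
import sys
from pathlib import Path
from typing import Optional


class ChecksumMismatch(Exception):
    """Local stand-in for raincloud.ChecksumMismatch."""


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so large artifacts are not read into memory at once."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def check_drift(
    slug: str,
    path: Path,
    pinned: Optional[str],
    origin: str,
    strict: bool = False,
) -> bool:
    """Return True when the bytes match the pin, or when no pin exists."""
    if pinned is None:          # assumption: nothing pinned means nothing to verify
        return True
    actual = sha256_of(path)
    if actual == pinned:
        return True
    if strict:                  # hard gate: refuse the bytes
        raise ChecksumMismatch(f"{slug}: expected {pinned}, got {actual}")
    # Soft path: warn on stderr and let the caller adopt the bytes anyway.
    print(f"[raincloud] WARN: {slug} from {origin} sha256 drifted", file=sys.stderr)
    return False
```

The design choice worth noting is that the same comparison serves both modes; only the reaction to a mismatch differs.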

## Errors (all subclass `raincloud.RaincloudError`)

`UnknownSlug`, `FormatUnavailable`, `ArtifactNotFound`, `OfflineMiss`, `BuildToolingMissing`, `MissingDependency`, `ChecksumMismatch`.

## Worked example

```bash
python examples/use_loader.py --slug $ARGUMENTS # metadata only
python examples/use_loader.py --slug $ARGUMENTS --materialize # full path
```

[`examples/use_loader.py`](../../../examples/use_loader.py) walks the full API end-to-end against the packaged catalog with no network.

## Install tiers

```bash
pip install raincloud # lightweight loader
pip install 'raincloud[duckdb]' # adds .scan()
pip install 'raincloud[pandas]' # adds .to_pandas()
pip install 'raincloud[s3]' # s3:// mirror transport
pip install 'raincloud[http]' # https:// mirror transport
pip install 'raincloud[build]' # adds local-build fallback (heavyweight)
```

Context: [AGENTS.md "The loader package"](../../../AGENTS.md), [`examples/use_loader.py`](../../../examples/use_loader.py).
56 changes: 56 additions & 0 deletions .agents/skills/raincloud-publish/SKILL.md
@@ -0,0 +1,56 @@
---
name: raincloud-publish
description: Push locally-built Raincloud artifacts to a mirror (`scripts.pipeline.publish`). Use when the maintainer wants to upload one or more slugs' parquet/vortex bytes to a configured mirror after a successful build, gated on the snapshot's recorded sha256.
argument-hint: <slug>... | --all --mirror <s3://… | https://… | file://…> [--dry-run]
disable-model-invocation: true
allowed-tools: Bash(python -m scripts.pipeline.publish *)
---

Sync locally-built artifacts to a mirror so downstream `raincloud.load()` users can fetch them. The CLI verifies each artifact's sha256 against `docs/v1/snapshot.json` before upload — mismatches block the push (publish is the one place an integrity gate is correct: corrupted bytes should never reach a shared mirror).

## Most common shape

```bash
python -m scripts.pipeline.publish $ARGUMENTS
```

Selection (one required):
- `<slug>...` — positional dataset slugs (any number).
- `--all` — every slug whose parquet OR vortex artifact exists locally under `outputs/v{n}/<slug>/`.

Required:
- `--mirror <url>` — destination root. Examples:
  - `s3://my-bucket/raincloud` (needs `pip install raincloud[s3]`)
  - `https://artifacts.example.com/raincloud` (needs `raincloud[http]`)
  - `file:///mnt/shared/raincloud-mirror` (built-in)
- `RAINCLOUD_MIRROR=<url>` env var works as an alternative to the flag.
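
The relationship between mirror URL scheme and install extra listed above can be checked up front with a few lines of standard-library code. `required_extra` and `EXTRA_FOR_SCHEME` are hypothetical names for this sketch, and rejecting any scheme not shown in the examples is an assumption, not documented CLI behaviour.

```python
from typing import Optional
from urllib.parse import urlparse

# Scheme -> optional extra, taken from the examples above (None means built-in).
EXTRA_FOR_SCHEME = {"s3": "s3", "https": "http", "file": None}


def required_extra(mirror_url: str) -> Optional[str]:
    """Return the extra needed for a mirror URL, or None if none is needed."""
    scheme = urlparse(mirror_url).scheme
    if scheme not in EXTRA_FOR_SCHEME:
        # Assumption: schemes outside the documented examples are rejected.
        raise ValueError(f"unsupported mirror scheme: {scheme!r}")
    return EXTRA_FOR_SCHEME[scheme]
```

For instance, `required_extra("s3://my-bucket/raincloud")` yields `"s3"`, which maps to `pip install 'raincloud[s3]'`.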

Modifiers:
- `--dry-run` — print the upload plan (paths + keys) without writing. Always preview large publishes this way first.

## Before publishing

1. **Build the slug locally first** (`/raincloud-build <slug>` or `python -m scripts.pipeline.build <slug>`). Publish does not build; it only pushes what's already in `outputs/v{n}/<slug>/`.
2. **Regenerate the snapshot** (`/raincloud-docs` or `python -m scripts.pipeline.docs`). The publish gate compares local bytes against `parquet_sha256` / `vortex_sha256` in `docs/v1/snapshot.json` — a stale snapshot causes false `PublishMismatch` failures.
3. **Verify the mirror URL** with a dry-run first.

## What gets uploaded

For each slug × format pair (`parquet`, `vortex`) where a local artifact exists:
- Key: `v1/<slug>/<fmt>/<slug>.<ext>`
- Body: the raw artifact bytes
- Gate: `sha256(local) == snapshot[<fmt>_sha256]` (raises `PublishMismatch` on a mismatch; an unpinned slug, whose `sha256` is `null`, uploads without verification, so regenerate the snapshot first).
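
A minimal sketch of the key layout and gate listed above, under stated assumptions: `mirror_key` and `gate` are hypothetical helpers, `PublishMismatch` is a local stand-in for the pipeline's error, the snapshot entry is modelled as a plain dict, and the file extensions are assumed to equal the format names.

```python
import hashlib
from typing import Optional

# Assumption: the on-disk extension equals the format name.
EXT = {"parquet": "parquet", "vortex": "vortex"}


class PublishMismatch(Exception):
    """Local stand-in for the pipeline's PublishMismatch error."""


def mirror_key(slug: str, fmt: str, version: str = "v1") -> str:
    """Build the object key: v1/<slug>/<fmt>/<slug>.<ext>."""
    return f"{version}/{slug}/{fmt}/{slug}.{EXT[fmt]}"


def gate(slug: str, fmt: str, body: bytes, snapshot_entry: dict) -> str:
    """Return the upload key, or raise if the bytes disagree with the pin."""
    pinned: Optional[str] = snapshot_entry.get(f"{fmt}_sha256")
    # An unpinned format (None) skips verification, as described above.
    if pinned is not None and hashlib.sha256(body).hexdigest() != pinned:
        raise PublishMismatch(f"{slug} ({fmt}): local bytes differ from snapshot")
    return mirror_key(slug, fmt)
```

Because an unpinned entry passes straight through, regenerating the snapshot before publishing is what makes the gate meaningful.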

## Failure modes

| Error | Meaning | Fix |
|---|---|---|
| `PublishMismatch: ...` | Local bytes diverge from the snapshot's recorded sha. | Re-run `/raincloud-docs` to refresh the snapshot, OR confirm the local artifact is correct and commit the new snapshot. |
| `FileNotFoundError: outputs/v1/<slug>/...` | Slug isn't built locally. | Run `/raincloud-build <slug>` first. |
| `ImportError: Install s3fs ...` | `s3://` mirror without `[s3]` extra. | `pip install 'raincloud[s3]'`. |

## After publishing

Downstream users can now `raincloud.load(<slug>)` and the loader will pull from the configured mirror (cache → mirror → build).

Context: [AGENTS.md "The loader package"](../../../AGENTS.md), [`scripts/pipeline/publish.py`](../../../scripts/pipeline/publish.py).
52 changes: 48 additions & 4 deletions .github/workflows/ci.yml
@@ -6,6 +6,12 @@ on:
   pull_request:
     branches: [develop]

+# Minimal token scope: every job is read-only (checkout + deps + lint/tests);
+# none posts comments, uploads artifacts, or writes to the repo. Without this,
+# the GITHUB_TOKEN defaults to broad write scope (flagged by CodeQL).
+permissions:
+  contents: read
+
 # Cancel in-progress runs when a new commit lands on the same branch.
 concurrency:
   group: ci-${{ github.ref }}
@@ -26,11 +32,14 @@ jobs:
         run: uv python install 3.11

       # `--extra dev` brings in pytest + ruff. `--extra tui` brings in
-      # textual so the test_browse suite runs instead of skipping. Skip
-      # kaggle/huggingface — those deps are only needed at fetch time and
-      # neither test path imports them.
+      # textual so the test_browse suite runs instead of skipping. `--extra
+      # build` brings in the heavy pipeline deps (duckdb, osmium, pyreadstat,
+      # zstandard, jsonschema, …) — required because validate_manifest +
+      # pytest collection import the handler registry, which transitively
+      # pulls those. Skip kaggle/huggingface — those deps are only needed at
+      # fetch time and neither test path imports them.
       - name: Install dependencies
-        run: uv sync --extra dev --extra tui
+        run: uv sync --extra dev --extra tui --extra build

       - name: Lint (ruff)
         run: uv run ruff check
@@ -40,3 +49,38 @@ jobs:

       - name: Test (pytest)
         run: uv run pytest -q
+
+  wheel:
+    runs-on: ubuntu-latest
+    steps:
+      - uses: actions/checkout@v4
+      - name: Install uv
+        uses: astral-sh/setup-uv@v5
+        with:
+          enable-cache: true
+      - name: Pin Python
+        run: uv python install 3.11
+      - name: Install dev deps (for pytest + vortex base dep used by fixtures)
+        run: uv sync --extra dev
+      # Scope collection to test_wheel.py: this env intentionally omits the
+      # [build] extra, but pytest imports every collected module before -m
+      # filtering, and test_manifest/test_profile import jsonschema (a [build]
+      # dep) at module top. All wheel-marked tests live in test_wheel.py.
+      - name: Run wheel tests
+        run: uv run pytest --run-wheel -m wheel -v tests/test_wheel.py
+
+  realbuild:
+    runs-on: ubuntu-latest
+    continue-on-error: true # non-blocking: upstream flakiness never reds the build
+    steps:
+      - uses: actions/checkout@v4
+      - name: Install uv
+        uses: astral-sh/setup-uv@v5
+        with:
+          enable-cache: true
+      - name: Pin Python
+        run: uv python install 3.11
+      - name: Install dev + build deps
+        run: uv sync --extra dev --extra build
+      - name: Run real-build network tests
+        run: uv run pytest --run-network -m network -v