Add lightweight raincloud loader, publish CLI, and wheel-installable build pipeline #15
Merged
33 commits
418a055 feat(loader): package skeleton, exceptions, lightweight packaging (mprammer)
3a7d57e feat(loader): catalog module reading snapshot + manifest (mprammer)
aee0652 feat(loader): checksum-gated local cache with atomic adopt (mprammer)
1a6c0be feat(loader): fsspec transport-only fetch (mprammer)
04abeed feat(loader): cache->mirror->build resolver (mprammer)
f9a0189 feat(loader): lazy Dataset handle and load()/load_dataset (mprammer)
10ea912 feat(docs): carry parquet/vortex sha256 in snapshot for the loader (mprammer)
8b1f255 chore(loader): add SPDX headers to new loader files (mprammer)
11225fb feat(publish): mirror-sync CLI gated on snapshot sha256 (mprammer)
a8175a0 test(loader): hermetic file:// mirror e2e; document loader usage (mprammer)
3588686 docs(loader): document loader + publish, layered install, 0.2.0 chang… (mprammer)
12a6c66 fix(loader): clean BuildToolingMissing on loader-only install; doc + … (mprammer)
5683929 docs(loader): install from GitHub, not PyPI (name squatted) (mprammer)
48beac7 feat(pipeline): configurable data_root() for wheel builds (mprammer)
734f91f feat(pipeline): root fetch/extract/build/status on data_root() helpers (mprammer)
ff3f68b fix(pipeline): use display_path for logs so wheel builds don't ValueE… (mprammer)
4f8b06b fix(handlers): use display_path for output logs (wheel-build safe) (mprammer)
11ef84e fix(handlers): root scratch workdir on workdir_root() (wheel-safe) (mprammer)
86da060 feat(pipeline): ship + resolve sources.schema.json for wheel installs (mprammer)
5d23066 docs(pipeline): document data-area env vars; note wheel build-fallback (mprammer)
cded8a8 fix(loader): expose formats from manifest+snapshot union (seamless fa… (mprammer)
956a559 test: --run-wheel/--run-network gates + markers; CI lint-and-test --e… (mprammer)
da9bd6d test(wheel): session-built wheel + venv helpers + smoke test (mprammer)
f9262a1 test(wheel): install-tier regression guard (base lightweight, [build]… (mprammer)
1f7b551 test(wheel): loader API matrix against installed wheel (happy + error… (mprammer)
7773567 test(wheel): build-proof via load('synth') end-to-end in [build] venv (mprammer)
68dcbab test(pipeline): always-on hermetic e2e build via build.run_one + hand… (mprammer)
45bd111 test(pipeline): real network build via load(slug) across 3 tiny HTTP … (mprammer)
0b15738 ci: add wheel (blocking) + realbuild (non-blocking) jobs (mprammer)
886f6f0 test(pipeline): pin RAINCLOUD_HOME hermeticity in real-build test (mprammer)
5418f35 feat(loader): runnable examples, load/publish skills, integrity + doc… (mprammer)
882a931 ci: scope wheel job to test_wheel.py + restrict GITHUB_TOKEN to conte… (mprammer)
dd37ce3 docs(changelog): stamp 0.2.0 release date (mprammer)
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,73 @@ | ||
| --- | ||
| name: raincloud-load | ||
| description: Load a Raincloud dataset via the lightweight Python loader API (`raincloud.load(slug)`). Use when the user wants to read a prepared parquet/vortex artifact, inspect a slug's metadata, or pull a dataset for downstream analysis from cache → mirror → local build. | ||
| argument-hint: <slug> [--format vortex|parquet] [--materialize to_arrow|scan|to_pandas] | ||
| disable-model-invocation: true | ||
| allowed-tools: Bash(python -c *), Bash(python -m raincloud *), Bash(python examples/use_loader.py *) | ||
| --- | ||
|
|
||
| The Raincloud loader is a separate, lightweight Python package (`raincloud`) for reading **already-prepared** artifacts. Resolution order is `local cache → mirror → local build`; nothing is fetched until you call an accessor. | ||
|
|
||
| ## Most common shape | ||
|
|
||
| ```bash | ||
| python -c " | ||
| import raincloud | ||
| ds = raincloud.load('$ARGUMENTS') # default format='vortex' with parquet fallback | ||
| print('rows :', ds.num_rows) | ||
| print('cols :', ds.column_names[:5]) | ||
| print('source :', ds.info.get('source_url')) | ||
| print('path :', ds.path()) # triggers cache/mirror/build resolution | ||
| " | ||
| ``` | ||
|
|
||
| ## Materialization | ||
|
|
||
| | Accessor | Returns | Notes | | ||
| |---|---|---| | ||
| | `ds.path()` | `pathlib.Path` to the on-disk artifact | First call resolves; subsequent calls are cache hits. | | ||
| | `ds.to_arrow()` | `pyarrow.Table` | Materializes the whole table; expensive on multi-GB slugs. | | ||
| | `ds.to_vortex()` | `vortex.VortexFile` (lazy) | Vortex-native handle. | | ||
| | `ds.scan()` | `duckdb.DuckDBPyRelation` | Requires `raincloud[duckdb]`. Always reads the parquet sibling — if the slug was loaded as vortex, you'll see a `[raincloud]` stderr note before the parquet is resolved. | | ||
| | `ds.to_pandas()` | `pandas.DataFrame` | Requires `raincloud[pandas]`. | | ||
| | `ds.schema` | `pyarrow.Schema` | Footer-only read for parquet; opens the file for vortex. | | ||
|
|
||
| ## Config via env vars | ||
|
|
||
| | Env var | Effect | | ||
| |---|---| | ||
| | `RAINCLOUD_CACHE` | Local artifact cache dir (default `~/.cache/raincloud`). | | ||
| | `RAINCLOUD_MIRROR` | Mirror URL (`s3://…`, `https://…`, `file://…`). Tried before the local build. | | ||
| | `RAINCLOUD_OFFLINE=1` | Block mirror + build; raise `OfflineMiss` on cache miss. | | ||
| | `RAINCLOUD_SNAPSHOT` | Override `docs/v1/snapshot.json` (catalog). | | ||
| | `RAINCLOUD_MANIFEST` | Override `sources.json`. | | ||
|
|
||
| ## Drift semantics | ||
|
|
||
| When a slug has a `sha256` pinned in the snapshot and the mirror or local build produces bytes that disagree, the loader prints a `[raincloud] WARN: <slug> from <mirror|build> sha256 drifted ...` warning to stderr and adopts the new bytes anyway. Upstream content drifts; that is expected, not an error. Escalate to `raincloud.ChecksumMismatch` only if you want a hard gate (e.g. `_cache.adopt(..., strict=True)`, which `python -m scripts.pipeline.publish` uses). | ||
|
|
||
| ## Errors (all subclass `raincloud.RaincloudError`) | ||
|
|
||
| `UnknownSlug`, `FormatUnavailable`, `ArtifactNotFound`, `OfflineMiss`, `BuildToolingMissing`, `MissingDependency`, `ChecksumMismatch`. | ||
|
|
||
| ## Worked example | ||
|
|
||
| ```bash | ||
| python examples/use_loader.py --slug $ARGUMENTS # metadata only | ||
| python examples/use_loader.py --slug $ARGUMENTS --materialize # full path | ||
| ``` | ||
|
|
||
| [`examples/use_loader.py`](../../../examples/use_loader.py) walks the full API end-to-end against the packaged catalog with no network. | ||
|
|
||
| ## Install tiers | ||
|
|
||
| ```bash | ||
| pip install raincloud # lightweight loader | ||
| pip install 'raincloud[duckdb]' # adds .scan() | ||
| pip install 'raincloud[pandas]' # adds .to_pandas() | ||
| pip install 'raincloud[s3]' # s3:// mirror transport | ||
| pip install 'raincloud[http]' # https:// mirror transport | ||
| pip install 'raincloud[build]' # adds local-build fallback (heavyweight) | ||
| ``` | ||
|
|
||
| Context: [AGENTS.md "The loader package"](../../../AGENTS.md), [`examples/use_loader.py`](../../../examples/use_loader.py). |
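
As a reading aid for the "Drift semantics" section of the skill file above, here is a minimal, self-contained sketch of a warn-by-default, raise-when-strict checksum check. It is not the loader's actual code: `sha256_of`, `check_drift`, and the local `ChecksumMismatch` class are hypothetical stand-ins, and only the `[raincloud] WARN: <slug> from <mirror|build> sha256 drifted` prefix is taken from the skill text (the remainder of the real message is elided there).

```python
import hashlib
import sys
from pathlib import Path
from typing import Optional


class ChecksumMismatch(Exception):
    """Local stand-in for raincloud.ChecksumMismatch (not the real class)."""


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so large artifacts are never fully in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def check_drift(slug: str, origin: str, path: Path,
                pinned: Optional[str], strict: bool = False) -> bool:
    """Return True when the bytes match the pin, or when nothing is pinned.

    On drift: raise in strict mode; otherwise warn on stderr and return
    False so the caller can still adopt the new bytes.
    """
    if pinned is None:
        return True  # unpinned slug: nothing to compare against
    actual = sha256_of(path)
    if actual == pinned:
        return True
    if strict:
        raise ChecksumMismatch(f"{slug}: expected {pinned}, got {actual}")
    # Prefix follows the skill text; the rest of the real message is elided there.
    print(f"[raincloud] WARN: {slug} from {origin} sha256 drifted", file=sys.stderr)
    return False
```

Under these assumptions, a loader-style caller would ignore the `False` return and keep the file, while a publish-style caller would pass `strict=True` and let the exception stop the run.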
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,56 @@ | ||
| --- | ||
| name: raincloud-publish | ||
| description: Push locally-built Raincloud artifacts to a mirror (`scripts.pipeline.publish`). Use when the maintainer wants to upload one or more slugs' parquet/vortex bytes to a configured mirror after a successful build, gated on the snapshot's recorded sha256. | ||
| argument-hint: <slug>... | --all --mirror <s3://… | https://… | file://…> [--dry-run] | ||
| disable-model-invocation: true | ||
| allowed-tools: Bash(python -m scripts.pipeline.publish *) | ||
| --- | ||
|
|
||
| Sync locally-built artifacts to a mirror so downstream `raincloud.load()` users can fetch them. The CLI verifies each artifact's sha256 against `docs/v1/snapshot.json` before upload — mismatches block the push (publish is the one place an integrity gate is correct: corrupted bytes should never reach a shared mirror). | ||
|
|
||
| ## Most common shape | ||
|
|
||
| ```bash | ||
| python -m scripts.pipeline.publish $ARGUMENTS | ||
| ``` | ||
|
|
||
| Selection (one required): | ||
| - `<slug>...` — positional dataset slugs (any number). | ||
| - `--all` — every slug whose parquet OR vortex artifact exists locally under `outputs/v{n}/<slug>/`. | ||
|
|
||
| Required: | ||
| - `--mirror <url>` — destination root. Examples: | ||
| - `s3://my-bucket/raincloud` (needs `pip install 'raincloud[s3]'`) | ||
| - `https://artifacts.example.com/raincloud` (needs `raincloud[http]`) | ||
| - `file:///mnt/shared/raincloud-mirror` (built-in) | ||
| - `RAINCLOUD_MIRROR=<url>` env var works as an alternative to the flag. | ||
|
|
||
| Modifiers: | ||
| - `--dry-run` — print the upload plan (paths + keys) without writing. Always preview large publishes this way first. | ||
|
|
||
| ## Before publishing | ||
|
|
||
| 1. **Build the slug locally first** (`/raincloud-build <slug>` or `python -m scripts.pipeline.build <slug>`). Publish does not build; it only pushes what's already in `outputs/v{n}/<slug>/`. | ||
| 2. **Regenerate the snapshot** (`/raincloud-docs` or `python -m scripts.pipeline.docs`). The publish gate compares local bytes against `parquet_sha256` / `vortex_sha256` in `docs/v1/snapshot.json` — a stale snapshot causes false `PublishMismatch` failures. | ||
| 3. **Verify the mirror URL** with a dry-run first. | ||
|
|
||
| ## What gets uploaded | ||
|
|
||
| For each slug × format pair (`parquet`, `vortex`) where a local artifact exists: | ||
| - Key: `v1/<slug>/<fmt>/<slug>.<ext>` | ||
| - Body: the raw artifact bytes | ||
| - Gate: `sha256(local) == snapshot[<fmt>_sha256]` (raises `PublishMismatch` on mismatch; an unpinned slug, whose `sha256` is `null`, uploads without verification, so regenerate the snapshot first). | ||
|
|
||
| ## Failure modes | ||
|
|
||
| | Error | Meaning | Fix | | ||
| |---|---|---| | ||
| | `PublishMismatch: ...` | Local bytes diverge from the snapshot's recorded sha. | Re-run `/raincloud-docs` to refresh the snapshot, OR confirm the local artifact is correct and commit the new snapshot. | | ||
| | `FileNotFoundError: outputs/v1/<slug>/...` | Slug isn't built locally. | Run `/raincloud-build <slug>` first. | | ||
| | `ImportError: Install s3fs ...` | `s3://` mirror without `[s3]` extra. | `pip install 'raincloud[s3]'`. | | ||
|
|
||
| ## After publishing | ||
|
|
||
| Downstream users can now `raincloud.load(<slug>)` and the loader will pull from the configured mirror (cache → mirror → build). | ||
|
|
||
| Context: [AGENTS.md "The loader package"](../../../AGENTS.md), [`scripts/pipeline/publish.py`](../../../scripts/pipeline/publish.py). |
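
The "What gets uploaded" rules in the skill file above can be sketched in a few lines. This is an illustration under stated assumptions, not the code in `scripts/pipeline/publish.py`: `mirror_key`, `gate`, and the local `PublishMismatch` class are hypothetical, the snapshot entry is assumed to be a flat mapping carrying `parquet_sha256` / `vortex_sha256`, and the file extension is left as a parameter because the skill text only writes `<ext>`.

```python
import hashlib
from typing import Mapping, Optional


class PublishMismatch(Exception):
    """Local stand-in for the publish CLI's PublishMismatch error."""


def mirror_key(slug: str, fmt: str, ext: str) -> str:
    """Build the mirror key in the documented v1/<slug>/<fmt>/<slug>.<ext> shape."""
    return f"v1/{slug}/{fmt}/{slug}.{ext}"


def gate(slug: str, fmt: str, body: bytes,
         entry: Mapping[str, Optional[str]]) -> bool:
    """Apply the sha256 gate to one slug/format pair.

    Returns True when the bytes were verified against the pin, False when
    the slug is unpinned (null sha256, so it would upload unverified), and
    raises PublishMismatch when the pin and the local bytes disagree.
    """
    pinned = entry.get(f"{fmt}_sha256")  # assumed snapshot field layout
    if pinned is None:
        return False
    actual = hashlib.sha256(body).hexdigest()
    if actual != pinned:
        raise PublishMismatch(f"{slug} [{fmt}]: snapshot {pinned}, local {actual}")
    return True
```

Returning `False` for the unpinned case, instead of silently passing, lets a dry-run style caller flag slugs that would be uploaded without verification.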