
Commit 3609b52

feat: add as_collection re-entry path and align lifecycle docs
- add rasteret.as_collection() as a lightweight wrapper for read-ready pyarrow Table/Dataset inputs without rebuild/enrichment/persist
- enforce strict upfront schema validation for read contract columns and band metadata structs, plus memory-size warning for large in-memory tables
- infer data_source from dataset before Collection construction (no post-construction mutation)
- add/expand API surface tests for as_collection error paths and large-table warning
- update Major TOM on-the-fly example to use as_collection() for enrichment round-trip and keep year/month partition columns
- clarify build/load/as_collection semantics across docstrings and docs (home, getting started, tutorials, collection management, build-from-parquet, enriched workflows, reference, contributing)

Signed-off-by: print-sid8 <sidsub94@gmail.com>
1 parent 25d778b commit 3609b52

File tree

13 files changed: +365 −63 lines changed

README.md

Lines changed: 5 additions & 10 deletions

````diff
@@ -25,22 +25,17 @@ Rasteret parses those headers **once**, caches them in Parquet, and its
 own reader fetches pixels concurrently with no GDAL in the path.
 **Up to 20x faster** on cold starts.
 
-Because the index is Parquet, it's not just a cache - it's a table you
-work with. Filter by cloud cover or date range, join with your own labels
-or AOI polygons, add train/val/test splits as columns, query with DuckDB
-or PyArrow. When you need pixels, Rasteret fetches them on demand from the
-same table.
-
 - **Easy** - three lines from STAC search or Parquet file to a TorchGeo-compatible dataset
 - **Zero downloads** - work with terabytes of imagery while storing only megabytes of metadata
 - **No STAC at training time** - query once at setup; zero API calls during training
 - **Reproducible** - same Parquet index = same records = same results
 - **Native dtypes** - uint16 stays uint16 in tensors; xarray promotes only when NaN fill requires it
-- **Your dataset is a table** - filter, enrich, version, and share a few MB Parquet file. The selection logic lives next to the data references.
+- **Shareable cache** - a few MB index can capture scene selection, band metadata, and split assignments
 
-Rasteret integrates with TorchGeo by returning a standard `GeoDataset`.
-Your samplers, DataLoader, xarray workflows, and analysis tools stay the
-same - Rasteret handles the async tile I/O underneath.
+Rasteret is an **opt-in accelerator** that integrates with TorchGeo by
+returning a standard `GeoDataset`. Your samplers, DataLoader, xarray
+workflows, and analysis tools stay the same - Rasteret handles the async
+tile I/O underneath.
 
 ---
````

docs/contributing.md

Lines changed: 14 additions & 9 deletions

````diff
@@ -28,17 +28,18 @@ export ASYNC_TIFF_FIXTURES=/path/to/async-tiff/fixtures/image-tiff
 
 ## Architecture overview
 
-Rasteret has three layers. Every user interaction flows through them
+Rasteret has four layers. Every user interaction flows through them
 top-to-bottom:
 
 ```text
-BUILD                       QUERY                      READ
-─────                       ─────                      ────
-build_from_stac()           Collection.subset()        COGReader
-build_from_table()          Collection.where()         RasterAccessor
-rasteret collections build  Collection.select_split()  header_parser
-
-ingest/                     core/collection.py         fetch/cog.py
+BUILD                       QUERY                      READ            RE-ENTRY
+─────                       ─────                      ────            ────────
+build()                     Collection.subset()        COGReader       load()
+build_from_stac()           Collection.where()         RasterAccessor  as_collection()
+build_from_table()          Collection.select_split()  header_parser
+rasteret collections build
+
+ingest/                     core/collection.py         fetch/cog.py    __init__.py
 stac_indexer.py             fetch/header_parser.py
 parquet_record_table.py     core/raster_accessor.py
 normalize.py                core/execution.py
@@ -55,6 +56,10 @@ No network access, no pixel reads.
 **READ** uses cached COG metadata from the Parquet index to fetch only the
 exact tiles needed. This is where the up to 20x speedup comes from.
 
+**RE-ENTRY** reuses already-ingested data. `load()` reopens persisted
+artifacts; `as_collection()` wraps read-ready Arrow tables/datasets without
+rebuilding.
+
 ## Correctness contract
 
 Rasteret's user-visible correctness guarantees are documented in
@@ -232,7 +237,7 @@ these catalogs with client-side bbox/date filtering.
 ## Public API discipline
 
 - Keep the top-level `rasteret` surface small and intentional
-  (`build`, `build_from_stac`, `build_from_table`, `load`, `register`,
+  (`build`, `build_from_stac`, `build_from_table`, `load`, `as_collection`, `register`,
   `register_local`, `create_backend`, `version`, `Collection`,
   `CloudConfig`, `BandRegistry`, `DatasetDescriptor`, `DatasetRegistry`).
- New user-facing APIs need a docstring and a smoke test.
````

docs/getting-started/index.md

Lines changed: 3 additions & 0 deletions

````diff
@@ -231,10 +231,13 @@ That's it for the basics. Two calls: `build()` to index, then read pixels.
 | Custom STAC API not in the catalog | `rasteret.build_from_stac(stac_api="...", ...)` |
 | Existing Parquet with COG URLs ([Source Cooperative](https://source.coop), STAC GeoParquet, custom) | `rasteret.build_from_table("s3://...parquet", ...)` |
 | Raw local/S3 COG files (no STAC/Parquet index yet) | First create a Parquet record table (`id`, `datetime`, `geometry`, `assets`), then `build_from_table(..., enrich_cog=True)` |
+| You already have a read-ready Arrow table from an existing Collection | `rasteret.as_collection(table, data_source=collection.data_source)` |
 | Someone shared a Collection with you | `rasteret.load("path/to/collection/")` |
 
 **Sharing**: `collection.export("path/")` writes a portable copy. Your teammate runs `rasteret.load("path/")`.
 
+`build*` functions ingest/normalize external data, `as_collection()` re-wraps read-ready in-memory Arrow objects, and `load()` reopens persisted artifacts.
+
 ---
 
 ## Going further
````

docs/how-to/build-from-parquet.md

Lines changed: 4 additions & 0 deletions

````diff
@@ -9,6 +9,10 @@ Rasteret validates the schema, derives per-record bounding boxes from the
 GeoParquet `geometry` column, and produces a standard Collection backed by
 Arrow.
 
+Use this path for **first-time ingest** of external Parquet. If your table
+already came from a Rasteret Collection and you only appended columns, use
+`rasteret.as_collection(...)` to re-wrap it without rebuilding.
+
 ---
 
 ## Supported sources
````
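The commit message mentions strict upfront schema validation for read-contract columns plus a memory-size warning for large in-memory tables, but the diff never shows that logic. Below is a minimal stdlib sketch of such a check. Only the required column names are grounded in the docs (the record-table contract of `id`, `datetime`, `geometry`, `assets`); the function name and the 1 GiB threshold are illustrative assumptions, not rasteret's actual implementation.

```python
import warnings

# Required read-contract columns per the getting-started record-table docs.
REQUIRED_READ_COLUMNS = {"id", "datetime", "geometry", "assets"}

def check_read_contract(column_names, in_memory_bytes, warn_bytes=1 << 30):
    """Fail fast on missing read-contract columns; warn on very large tables.

    Hypothetical sketch of the kind of validation as_collection() performs.
    """
    missing = REQUIRED_READ_COLUMNS - set(column_names)
    if missing:
        raise ValueError(f"table is missing read-contract columns: {sorted(missing)}")
    if in_memory_bytes > warn_bytes:
        warnings.warn(
            f"table holds {in_memory_bytes / 2**30:.1f} GiB in memory; "
            "consider export() plus load() for a lazy on-disk dataset"
        )

# A table with all contract columns and a modest footprint passes silently.
check_read_contract(["id", "datetime", "geometry", "assets", "label"], 10_000_000)
```

Failing fast here is what lets `as_collection()` stay a thin wrapper: errors surface at wrap time, not on the first pixel read.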

docs/how-to/collection-management.md

Lines changed: 8 additions & 0 deletions

````diff
@@ -105,6 +105,14 @@ To make a local Collection appear in `rasteret datasets list`, see
 
 ## Python API
 
+### Choose the right entry point
+
+| Goal | Use | What it does |
+|---|---|---|
+| Ingest external STAC/Parquet into Rasteret schema | `build()`, `build_from_stac()`, `build_from_table()` | Normalizes schema, can enrich COG metadata, can persist cache |
+| Reopen an existing Collection artifact | `load(path)` | Opens an already-materialized Parquet collection |
+| Re-wrap an in-memory read-ready Arrow object | `as_collection(table)` | Validates required columns and wraps without rebuild |
+
 ### Build from STAC
 
 See [`build_from_stac()`](../reference/rasteret.md) API reference.
````
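The entry-point table this diff adds condenses to a small lookup. A sketch only: the goal keys below are paraphrases of the table rows, not rasteret identifiers.

```python
# Paraphrase of the added entry-point table; keys are descriptive labels,
# not part of the rasteret API.
ENTRY_POINTS = {
    "ingest external stac/parquet": "build() / build_from_stac() / build_from_table()",
    "reopen persisted collection": "load(path)",
    "re-wrap in-memory arrow object": "as_collection(table)",
}

def pick_entry_point(goal: str) -> str:
    """Map a lifecycle goal to the documented entry point."""
    return ENTRY_POINTS[goal]
```

The rule of thumb the table encodes: rebuild/enrich only happens on the `build*` path; the other two reuse already-ingested data.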

docs/how-to/enriched-parquet-workflows.md

Lines changed: 14 additions & 8 deletions

````diff
@@ -21,14 +21,13 @@ from pathlib import Path
 
 import numpy as np
 import pyarrow as pa
-import pyarrow.parquet as pq
 import shapely
 from shapely.geometry import Polygon
 
 import rasteret
 
 # Build collection from STAC
-collection = rasteret.build_from_stac(
+base = rasteret.build_from_stac(
     name="bangalore",
     stac_api="https://earth-search.aws.element84.com/v1",
     collection="sentinel-2-l2a",
@@ -38,7 +37,7 @@ collection = rasteret.build_from_stac(
 )
 
 # Get the Arrow table
-enriched = collection.dataset.to_table()
+enriched = base.dataset.to_table()
 n = enriched.num_rows
 ```
 
@@ -75,8 +74,15 @@ enriched = enriched.append_column("label", pa.array(labels, type=pa.int32()))
 ### Save
 
 ```python
-pq.write_table(enriched, "./experiment_v1.parquet")
-collection = rasteret.load("./experiment_v1.parquet")
+# Re-wrap in-memory (no rebuild), then persist if needed
+collection = rasteret.as_collection(
+    enriched,
+    name="experiment-v1",
+    data_source=base.data_source,
+)
+collection.export("./experiment_v1")
+# Reload from disk so later sections use the persisted, lazy dataset path.
+collection = rasteret.load("./experiment_v1")
 ```
 
 The enriched Parquet now contains scene metadata, COG tile cache, AOI
@@ -94,7 +100,7 @@ import duckdb
 
 import rasteret
 
-collection = rasteret.load("./experiment_v1.parquet")
+collection = rasteret.load("./experiment_v1")
 enriched = collection.dataset.to_table()
 
 con = duckdb.connect()
@@ -150,7 +156,7 @@ import pyarrow.compute as pc
 
 import rasteret
 
-collection = rasteret.load("./experiment_v1.parquet")
+collection = rasteret.load("./experiment_v1")
 enriched = collection.dataset.to_table()
 
 # Filter: train split, low cloud
@@ -180,7 +186,7 @@ import geopandas as gpd
 
 import rasteret
 
-collection = rasteret.load("./experiment_v1.parquet")
+collection = rasteret.load("./experiment_v1")
 enriched = collection.dataset.to_table()
 
 gdf = gpd.GeoDataFrame(
````

docs/index.md

Lines changed: 5 additions & 21 deletions

````diff
@@ -17,13 +17,11 @@
 !!! success "What Rasteret does"
 
     Parse headers **once**, cache in Parquet, read pixels concurrently
-    with no GDAL in the path. Because the index is Parquet, it's also
-    the table you work with - filter, join, enrich, and query with
-    standard tools before you ever fetch a pixel.
+    with no GDAL in the path.
 
 ```text
-STAC API / GeoParquet --> Collection (Parquet)      --> Tile-level byte reads
-       (once)             (queryable, enrichable)       (no GDAL, no headers)
+STAC API / GeoParquet --> Parquet Index   --> Tile-level byte reads
+       (once)             (queryable)         (no GDAL, no headers)
 ```
 
 ---
@@ -58,21 +56,6 @@
     Same Parquet index = same records = same results.
     Share a few MB file and collaborators skip re-indexing.
 
-- :material-table-edit:{ .lg .middle } **Your dataset is a table**
-
-    ---
-
-    Filter, join, enrich with DuckDB or PyArrow. Add splits,
-    labels, and quality flags as columns. The index is the dataset.
-
-- :material-swap-horizontal:{ .lg .middle } **Any Parquet with COG URLs**
-
-    ---
-
-    `build_from_table()` turns existing GeoParquet into a
-    Collection. Source Cooperative exports, STAC GeoParquet,
-    custom catalogs - if it has URLs, Rasteret can read it.
-
 </div>
 
 ---
@@ -99,7 +82,8 @@ One PR adds a dataset and every user gets access on the next release. No proprie
 [Design Decisions](explanation/design-decisions.md) for the thinking behind it.
 
 Pick any ID and pass it to `build()`. For datasets not in the catalog, use
-`build_from_stac()` or `build_from_table()`.
+`build_from_stac()` or `build_from_table()`. Reopen persisted collections with
+`load()`, or re-wrap read-ready Arrow tables with `as_collection()`.
 
 ---
 
````
docs/reference/rasteret.md

Lines changed: 10 additions & 0 deletions

````diff
@@ -7,10 +7,19 @@ Most users need only a few of these:
 - **`build()`** - build a Collection from the [catalog](../how-to/dataset-catalog.md) by ID.
 - **`build_from_stac()`** / **`build_from_table()`** - full-control builders for STAC APIs or existing Parquet.
 - **`load()`** - reload a previously built Collection from Parquet.
+- **`as_collection()`** - wrap a read-ready Arrow table/dataset as a Collection (no enrich/persist).
 - **`register()`** - add a custom catalog entry to the in-memory registry.
 - **`register_local()`** - register a local Collection as a catalog entry (persists to `~/.rasteret/datasets.local.json`).
 - **`create_backend()`** - create an authenticated I/O backend for [multi-cloud reads](../how-to/custom-cloud-provider.md).
 
+Entrypoint semantics:
+
+| Function | Use when | Rebuild/enrich? |
+|---|---|---|
+| `build()` / `build_from_stac()` / `build_from_table()` | Ingesting external sources into Rasteret schema | Yes |
+| `load()` | Reopening an existing persisted Collection | No |
+| `as_collection()` | Re-wrapping a read-ready Arrow table/dataset in memory | No |
+
 See [Getting Started](../getting-started/index.md) for usage examples.
 
 ::: rasteret
@@ -19,6 +28,7 @@ See [Getting Started](../getting-started/index.md) for usage examples.
         - build
         - build_from_stac
         - build_from_table
+        - as_collection
         - create_backend
         - load
         - register
````

docs/tutorials/index.md

Lines changed: 4 additions & 1 deletion

````diff
@@ -16,10 +16,13 @@ Recommended sequence (matches sidebar order and notebook “Next” links):
 | [Configuring Custom Collections](04_custom_cloud_and_bands.ipynb) | Cloud configs, band mappings, and storage backends for non-built-in datasets |
 | [TorchGeo Benchmark](05_torchgeo_comparison.ipynb) | Side-by-side performance comparison with native TorchGeo |
 
-!!! tip "Which build function?"
+!!! tip "Which entry point?"
     **Quickstart** and **TorchGeo Integration** use `build()`, which looks up
     STAC API details from the [dataset catalog](../how-to/dataset-catalog.md).
     **Building from Parquet** uses `build_from_table()` for existing Parquet files.
+    For in-memory Arrow tables that are already read-ready, use
+    `as_collection()`.
+    For previously exported collections, use `load()`.
     For custom STAC APIs not in the catalog, use `build_from_stac()`; see
     [Collection Management](../how-to/collection-management.md).
 
````
examples/major_tom_on_the_fly_collection.py

Lines changed: 6 additions & 7 deletions

````diff
@@ -179,6 +179,8 @@ def enrich_major_tom_columns(base: "rasteret.Collection", grid_km: int) -> pa.Ta
         "bbox_miny",
         "bbox_maxx",
         "bbox_maxy",
+        "year",
+        "month",
         "s2:product_uri",
         "collection",
     ]:
@@ -364,17 +366,14 @@ def main() -> None:
     print(f"base_rows={base.dataset.count_rows()}")
 
     enriched_table = enrich_major_tom_columns(base, args.grid_km)
-    enriched = rasteret.build_from_table(
+    enriched = rasteret.as_collection(
         enriched_table,
         name=args.name,
-        data_source="sentinel-2-l2a",
-        workspace_dir=args.workspace,
-        enrich_cog=False,
-        max_concurrent=args.max_concurrent,
-        force=True,
     )
+    collection_path = args.workspace / f"{args.name}_records"
+    enriched.export(collection_path)
     print(f"enriched_rows={enriched.dataset.count_rows()}")
-    print(f"collection_path={args.workspace / f'{args.name}_records'}")
+    print(f"collection_path={collection_path}")
 
     if args.metadata_path:
         report_metadata_overlap(enriched, args.metadata_path)
````
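The commit keeps `year`/`month` partition columns through the enrichment round-trip. A minimal stdlib sketch of how such partition values can be derived from ISO-8601 datetime strings; the helper name is an assumption for illustration, not code from this example.

```python
from datetime import datetime

def partition_columns(iso_datetimes):
    """Derive year/month partition values from ISO-8601 datetime strings.

    Handles the trailing 'Z' UTC designator, which fromisoformat() rejected
    before Python 3.11, by rewriting it to an explicit +00:00 offset.
    """
    parsed = [datetime.fromisoformat(d.replace("Z", "+00:00")) for d in iso_datetimes]
    return [p.year for p in parsed], [p.month for p in parsed]

years, months = partition_columns(["2024-03-07T10:15:00Z", "2023-12-01T00:00:00Z"])
# years == [2024, 2023], months == [3, 12]
```

Keeping these as plain columns (rather than recomputing them) is what lets the re-wrapped table remain partition-aware after `as_collection()` without any rebuild.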
