
Commit 3609b52

feat: add as_collection re-entry path and align lifecycle docs
- add rasteret.as_collection() as a lightweight wrapper for read-ready pyarrow Table/Dataset inputs without rebuild/enrichment/persist
- enforce strict upfront schema validation for read contract columns and band metadata structs, plus memory-size warning for large in-memory tables
- infer data_source from dataset before Collection construction (no post-construction mutation)
- add/expand API surface tests for as_collection error paths and large-table warning
- update Major TOM on-the-fly example to use as_collection() for enrichment round-trip and keep year/month partition columns
- clarify build/load/as_collection semantics across docstrings and docs (home, getting started, tutorials, collection management, build-from-parquet, enriched workflows, reference, contributing)

Signed-off-by: print-sid8 <sidsub94@gmail.com>
1 parent 25d778b commit 3609b52

File tree

13 files changed: +365 −63 lines changed

README.md

Lines changed: 5 additions & 10 deletions

````diff
@@ -25,22 +25,17 @@ Rasteret parses those headers **once**, caches them in Parquet, and its
 own reader fetches pixels concurrently with no GDAL in the path.
 **Up to 20x faster** on cold starts.
 
-Because the index is Parquet, it's not just a cache - it's a table you
-work with. Filter by cloud cover or date range, join with your own labels
-or AOI polygons, add train/val/test splits as columns, query with DuckDB
-or PyArrow. When you need pixels, Rasteret fetches them on demand from the
-same table.
-
 - **Easy** - three lines from STAC search or Parquet file to a TorchGeo-compatible dataset
 - **Zero downloads** - work with terabytes of imagery while storing only megabytes of metadata
 - **No STAC at training time** - query once at setup; zero API calls during training
 - **Reproducible** - same Parquet index = same records = same results
 - **Native dtypes** - uint16 stays uint16 in tensors; xarray promotes only when NaN fill requires it
-- **Your dataset is a table** - filter, enrich, version, and share a few MB Parquet file. The selection logic lives next to the data references.
+- **Shareable cache** - a few MB index can capture scene selection, band metadata, and split assignments
 
-Rasteret integrates with TorchGeo by returning a standard `GeoDataset`.
-Your samplers, DataLoader, xarray workflows, and analysis tools stay the
-same - Rasteret handles the async tile I/O underneath.
+Rasteret is an **opt-in accelerator** that integrates with TorchGeo by
+returning a standard `GeoDataset`. Your samplers, DataLoader, xarray
+workflows, and analysis tools stay the same - Rasteret handles the async
+tile I/O underneath.
 
 ---
````

docs/contributing.md

Lines changed: 14 additions & 9 deletions

````diff
@@ -28,17 +28,18 @@ export ASYNC_TIFF_FIXTURES=/path/to/async-tiff/fixtures/image-tiff
 
 ## Architecture overview
 
-Rasteret has three layers. Every user interaction flows through them
+Rasteret has four layers. Every user interaction flows through them
 top-to-bottom:
 
 ```text
-BUILD                       QUERY                      READ
-─────                       ─────                      ────
-build_from_stac()           Collection.subset()        COGReader
-build_from_table()          Collection.where()         RasterAccessor
-rasteret collections build  Collection.select_split()  header_parser
-
-ingest/                     core/collection.py         fetch/cog.py
+BUILD                       QUERY                      READ            RE-ENTRY
+─────                       ─────                      ────            ────────
+build()                     Collection.subset()        COGReader       load()
+build_from_stac()           Collection.where()         RasterAccessor  as_collection()
+build_from_table()          Collection.select_split()  header_parser
+rasteret collections build
+
+ingest/                     core/collection.py         fetch/cog.py    __init__.py
 stac_indexer.py             fetch/header_parser.py
 parquet_record_table.py     core/raster_accessor.py
 normalize.py                core/execution.py
@@ -55,6 +56,10 @@ No network access, no pixel reads.
 **READ** uses cached COG metadata from the Parquet index to fetch only the
 exact tiles needed. This is where the up to 20x speedup comes from.
 
+**RE-ENTRY** reuses already-ingested data. `load()` reopens persisted
+artifacts; `as_collection()` wraps read-ready Arrow tables/datasets without
+rebuilding.
+
 ## Correctness contract
 
 Rasteret's user-visible correctness guarantees are documented in
@@ -232,7 +237,7 @@ these catalogs with client-side bbox/date filtering.
 ## Public API discipline
 
 - Keep the top-level `rasteret` surface small and intentional
-  (`build`, `build_from_stac`, `build_from_table`, `load`, `register`,
+  (`build`, `build_from_stac`, `build_from_table`, `load`, `as_collection`, `register`,
   `register_local`, `create_backend`, `version`, `Collection`,
   `CloudConfig`, `BandRegistry`, `DatasetDescriptor`, `DatasetRegistry`).
- New user-facing APIs need a docstring and a smoke test.
````

docs/getting-started/index.md

Lines changed: 3 additions & 0 deletions

````diff
@@ -231,10 +231,13 @@ That's it for the basics. Two calls: `build()` to index, then read pixels.
 | Custom STAC API not in the catalog | `rasteret.build_from_stac(stac_api="...", ...)` |
 | Existing Parquet with COG URLs ([Source Cooperative](https://source.coop), STAC GeoParquet, custom) | `rasteret.build_from_table("s3://...parquet", ...)` |
 | Raw local/S3 COG files (no STAC/Parquet index yet) | First create a Parquet record table (`id`, `datetime`, `geometry`, `assets`), then `build_from_table(..., enrich_cog=True)` |
+| You already have a read-ready Arrow table from an existing Collection | `rasteret.as_collection(table, data_source=collection.data_source)` |
 | Someone shared a Collection with you | `rasteret.load("path/to/collection/")` |
 
 **Sharing**: `collection.export("path/")` writes a portable copy. Your teammate runs `rasteret.load("path/")`.
 
+`build*` functions ingest/normalize external data, `as_collection()` re-wraps read-ready in-memory Arrow objects, and `load()` reopens persisted artifacts.
+
 ---
 
 ## Going further
````

docs/how-to/build-from-parquet.md

Lines changed: 4 additions & 0 deletions

````diff
@@ -9,6 +9,10 @@ Rasteret validates the schema, derives per-record bounding boxes from the
 GeoParquet `geometry` column, and produces a standard Collection backed by
 Arrow.
 
+Use this path for **first-time ingest** of external Parquet. If your table
+already came from a Rasteret Collection and you only appended columns, use
+`rasteret.as_collection(...)` to re-wrap it without rebuilding.
+
 ---
 
 ## Supported sources
````
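The commit message mentions strict upfront schema validation for read-contract columns plus a memory-size warning for large in-memory tables, but the diff never shows that logic. Below is a minimal stdlib sketch of such a check. Only the required column names are grounded in the docs (the record-table contract of `id`, `datetime`, `geometry`, `assets`); the function name and the 1 GiB threshold are illustrative assumptions, not rasteret's actual implementation.

```python
import warnings

# Required read-contract columns per the getting-started record-table docs.
REQUIRED_READ_COLUMNS = {"id", "datetime", "geometry", "assets"}

def check_read_contract(column_names, in_memory_bytes, warn_bytes=1 << 30):
    """Fail fast on missing read-contract columns; warn on very large tables.

    Hypothetical sketch of the kind of validation as_collection() performs.
    """
    missing = REQUIRED_READ_COLUMNS - set(column_names)
    if missing:
        raise ValueError(f"table is missing read-contract columns: {sorted(missing)}")
    if in_memory_bytes > warn_bytes:
        warnings.warn(
            f"table holds {in_memory_bytes / 2**30:.1f} GiB in memory; "
            "consider export() plus load() for a lazy on-disk dataset"
        )

# A table with all contract columns and a modest footprint passes silently.
check_read_contract(["id", "datetime", "geometry", "assets", "label"], 10_000_000)
```

Failing fast here is what lets `as_collection()` stay a thin wrapper: errors surface at wrap time, not on the first pixel read.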

docs/how-to/collection-management.md

Lines changed: 8 additions & 0 deletions

````diff
@@ -105,6 +105,14 @@ To make a local Collection appear in `rasteret datasets list`, see
 
 ## Python API
 
+### Choose the right entry point
+
+| Goal | Use | What it does |
+|---|---|---|
+| Ingest external STAC/Parquet into Rasteret schema | `build()`, `build_from_stac()`, `build_from_table()` | Normalizes schema, can enrich COG metadata, can persist cache |
+| Reopen an existing Collection artifact | `load(path)` | Opens an already-materialized Parquet collection |
+| Re-wrap an in-memory read-ready Arrow object | `as_collection(table)` | Validates required columns and wraps without rebuild |
+
 ### Build from STAC
 
 See [`build_from_stac()`](../reference/rasteret.md) API reference.
````
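The entry-point table this diff adds condenses to a small lookup. A sketch only: the goal keys below are paraphrases of the table rows, not rasteret identifiers.

```python
# Paraphrase of the added entry-point table; keys are descriptive labels,
# not part of the rasteret API.
ENTRY_POINTS = {
    "ingest external stac/parquet": "build() / build_from_stac() / build_from_table()",
    "reopen persisted collection": "load(path)",
    "re-wrap in-memory arrow object": "as_collection(table)",
}

def pick_entry_point(goal: str) -> str:
    """Map a lifecycle goal to the documented entry point."""
    return ENTRY_POINTS[goal]
```

The rule of thumb the table encodes: rebuild/enrich only happens on the `build*` path; the other two reuse already-ingested data.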

docs/how-to/enriched-parquet-workflows.md

Lines changed: 14 additions & 8 deletions

````diff
@@ -21,14 +21,13 @@ from pathlib import Path
 
 import numpy as np
 import pyarrow as pa
-import pyarrow.parquet as pq
 import shapely
 from shapely.geometry import Polygon
 
 import rasteret
 
 # Build collection from STAC
-collection = rasteret.build_from_stac(
+base = rasteret.build_from_stac(
     name="bangalore",
     stac_api="https://earth-search.aws.element84.com/v1",
     collection="sentinel-2-l2a",
@@ -38,7 +37,7 @@ collection = rasteret.build_from_stac(
 )
 
 # Get the Arrow table
-enriched = collection.dataset.to_table()
+enriched = base.dataset.to_table()
 n = enriched.num_rows
 ```
 
@@ -75,8 +74,15 @@ enriched = enriched.append_column("label", pa.array(labels, type=pa.int32()))
 ### Save
 
 ```python
-pq.write_table(enriched, "./experiment_v1.parquet")
-collection = rasteret.load("./experiment_v1.parquet")
+# Re-wrap in-memory (no rebuild), then persist if needed
+collection = rasteret.as_collection(
+    enriched,
+    name="experiment-v1",
+    data_source=base.data_source,
+)
+collection.export("./experiment_v1")
+# Reload from disk so later sections use the persisted, lazy dataset path.
+collection = rasteret.load("./experiment_v1")
 ```
 
 The enriched Parquet now contains scene metadata, COG tile cache, AOI
@@ -94,7 +100,7 @@ import duckdb
 
 import rasteret
 
-collection = rasteret.load("./experiment_v1.parquet")
+collection = rasteret.load("./experiment_v1")
 enriched = collection.dataset.to_table()
 
 con = duckdb.connect()
@@ -150,7 +156,7 @@ import pyarrow.compute as pc
 
 import rasteret
 
-collection = rasteret.load("./experiment_v1.parquet")
+collection = rasteret.load("./experiment_v1")
 enriched = collection.dataset.to_table()
 
 # Filter: train split, low cloud
@@ -180,7 +186,7 @@ import geopandas as gpd
 
 import rasteret
 
-collection = rasteret.load("./experiment_v1.parquet")
+collection = rasteret.load("./experiment_v1")
 enriched = collection.dataset.to_table()
 
 gdf = gpd.GeoDataFrame(
````

docs/index.md

Lines changed: 5 additions & 21 deletions

````diff
@@ -17,13 +17,11 @@
 !!! success "What Rasteret does"
 
     Parse headers **once**, cache in Parquet, read pixels concurrently
-    with no GDAL in the path. Because the index is Parquet, it's also
-    the table you work with - filter, join, enrich, and query with
-    standard tools before you ever fetch a pixel.
+    with no GDAL in the path.
 
 ```text
-STAC API / GeoParquet --> Collection (Parquet)      --> Tile-level byte reads
-       (once)             (queryable, enrichable)       (no GDAL, no headers)
+STAC API / GeoParquet --> Parquet Index   --> Tile-level byte reads
+       (once)             (queryable)         (no GDAL, no headers)
 ```
 
 ---
@@ -58,21 +56,6 @@
     Same Parquet index = same records = same results.
     Share a few MB file and collaborators skip re-indexing.
 
-- :material-table-edit:{ .lg .middle } **Your dataset is a table**
-
-    ---
-
-    Filter, join, enrich with DuckDB or PyArrow. Add splits,
-    labels, and quality flags as columns. The index is the dataset.
-
-- :material-swap-horizontal:{ .lg .middle } **Any Parquet with COG URLs**
-
-    ---
-
-    `build_from_table()` turns existing GeoParquet into a
-    Collection. Source Cooperative exports, STAC GeoParquet,
-    custom catalogs - if it has URLs, Rasteret can read it.
-
 </div>
 
 ---
@@ -99,7 +82,8 @@ One PR adds a dataset and every user gets access on the next release. No proprie
 [Design Decisions](explanation/design-decisions.md) for the thinking behind it.
 
 Pick any ID and pass it to `build()`. For datasets not in the catalog, use
-`build_from_stac()` or `build_from_table()`.
+`build_from_stac()` or `build_from_table()`. Reopen persisted collections with
+`load()`, or re-wrap read-ready Arrow tables with `as_collection()`.
 
 ---
 
````
docs/reference/rasteret.md

Lines changed: 10 additions & 0 deletions

````diff
@@ -7,10 +7,19 @@ Most users need only a few of these:
 - **`build()`** - build a Collection from the [catalog](../how-to/dataset-catalog.md) by ID.
 - **`build_from_stac()`** / **`build_from_table()`** - full-control builders for STAC APIs or existing Parquet.
 - **`load()`** - reload a previously built Collection from Parquet.
+- **`as_collection()`** - wrap a read-ready Arrow table/dataset as a Collection (no enrich/persist).
 - **`register()`** - add a custom catalog entry to the in-memory registry.
 - **`register_local()`** - register a local Collection as a catalog entry (persists to `~/.rasteret/datasets.local.json`).
 - **`create_backend()`** - create an authenticated I/O backend for [multi-cloud reads](../how-to/custom-cloud-provider.md).
 
+Entrypoint semantics:
+
+| Function | Use when | Rebuild/enrich? |
+|---|---|---|
+| `build()` / `build_from_stac()` / `build_from_table()` | Ingesting external sources into Rasteret schema | Yes |
+| `load()` | Reopening an existing persisted Collection | No |
+| `as_collection()` | Re-wrapping a read-ready Arrow table/dataset in memory | No |
+
 See [Getting Started](../getting-started/index.md) for usage examples.
 
 ::: rasteret
@@ -19,6 +28,7 @@ See [Getting Started](../getting-started/index.md) for usage examples.
         - build
         - build_from_stac
         - build_from_table
+        - as_collection
         - create_backend
         - load
         - register
````

docs/tutorials/index.md

Lines changed: 4 additions & 1 deletion

````diff
@@ -16,10 +16,13 @@ Recommended sequence (matches sidebar order and notebook “Next” links):
 | [Configuring Custom Collections](04_custom_cloud_and_bands.ipynb) | Cloud configs, band mappings, and storage backends for non-built-in datasets |
 | [TorchGeo Benchmark](05_torchgeo_comparison.ipynb) | Side-by-side performance comparison with native TorchGeo |
 
-!!! tip "Which build function?"
+!!! tip "Which entry point?"
     **Quickstart** and **TorchGeo Integration** use `build()`, which looks up
     STAC API details from the [dataset catalog](../how-to/dataset-catalog.md).
     **Building from Parquet** uses `build_from_table()` for existing Parquet files.
+    For in-memory Arrow tables that are already read-ready, use
+    `as_collection()`.
+    For previously exported collections, use `load()`.
     For custom STAC APIs not in the catalog, use `build_from_stac()`; see
     [Collection Management](../how-to/collection-management.md).
 
````
examples/major_tom_on_the_fly_collection.py

Lines changed: 6 additions & 7 deletions

````diff
@@ -179,6 +179,8 @@ def enrich_major_tom_columns(base: "rasteret.Collection", grid_km: int) -> pa.Ta
         "bbox_miny",
         "bbox_maxx",
         "bbox_maxy",
+        "year",
+        "month",
         "s2:product_uri",
         "collection",
     ]:
@@ -364,17 +366,14 @@ def main() -> None:
     print(f"base_rows={base.dataset.count_rows()}")
 
     enriched_table = enrich_major_tom_columns(base, args.grid_km)
-    enriched = rasteret.build_from_table(
+    enriched = rasteret.as_collection(
         enriched_table,
         name=args.name,
-        data_source="sentinel-2-l2a",
-        workspace_dir=args.workspace,
-        enrich_cog=False,
-        max_concurrent=args.max_concurrent,
-        force=True,
     )
+    collection_path = args.workspace / f"{args.name}_records"
+    enriched.export(collection_path)
     print(f"enriched_rows={enriched.dataset.count_rows()}")
-    print(f"collection_path={args.workspace / f'{args.name}_records'}")
+    print(f"collection_path={collection_path}")
 
     if args.metadata_path:
         report_metadata_overlap(enriched, args.metadata_path)
````
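The commit keeps `year`/`month` partition columns through the enrichment round-trip. A minimal stdlib sketch of how such partition values can be derived from ISO-8601 datetime strings; the helper name is an assumption for illustration, not code from this example.

```python
from datetime import datetime

def partition_columns(iso_datetimes):
    """Derive year/month partition values from ISO-8601 datetime strings.

    Handles the trailing 'Z' UTC designator, which fromisoformat() rejected
    before Python 3.11, by rewriting it to an explicit +00:00 offset.
    """
    parsed = [datetime.fromisoformat(d.replace("Z", "+00:00")) for d in iso_datetimes]
    return [p.year for p in parsed], [p.month for p in parsed]

years, months = partition_columns(["2024-03-07T10:15:00Z", "2023-12-01T00:00:00Z"])
# years == [2024, 2023], months == [3, 12]
```

Keeping these as plain columns (rather than recomputing them) is what lets the re-wrapped table remain partition-aware after `as_collection()` without any rebuild.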
