Commit cf9c7d5

docs: clarify that catalog entries and build paths are not STAC-only
README, dataset-catalog how-to, contributing guides, and interop docs all implied the catalog and build() are STAC-only. In reality, DatasetDescriptor supports STAC API, static catalog, or GeoParquet URI (AEF is a built-in GeoParquet-only entry). Updated framing to present both source types equally and restructured the dataset contribution prerequisites to show STAC and GeoParquet verification paths side by side. Also fixed stale test filename references from the earlier consolidation.
1 parent ba66e59 commit cf9c7d5

6 files changed, +112 -75 lines changed

CONTRIBUTING.md

Lines changed: 5 additions & 4 deletions
@@ -32,10 +32,11 @@ uv run pytest -q

 ## Adding a dataset to the catalog

-Add a `DatasetDescriptor` to `src/rasteret/catalog.py`. Before opening a PR,
-verify the [prerequisites](https://terrafloww.github.io/rasteret/how-to/dataset-catalog/#prerequisites-for-contributing-a-built-in-dataset):
-STAC access works, band map points to parseable COGs, `build()` succeeds
-end-to-end, and license metadata is sourced from the authoritative STAC API.
+Add a `DatasetDescriptor` to `src/rasteret/catalog.py`. Entries can point
+to a STAC API, a static STAC catalog, or a GeoParquet file. Before opening
+a PR, verify the [prerequisites](https://terrafloww.github.io/rasteret/how-to/dataset-catalog/#prerequisites-for-contributing-a-built-in-dataset):
+data source is reachable, band map points to parseable COGs, `build()`
+succeeds end-to-end, and license metadata is verified from the provider.

 ## Full contributor guide

README.md

Lines changed: 10 additions & 9 deletions
@@ -71,7 +71,7 @@ See [Getting Started](https://terrafloww.github.io/rasteret/getting-started/) fo

 ## Built-in datasets

-Rasteret ships with a growing catalog of datasets, no STAC URLs to memorize:
+Rasteret ships with a growing catalog of datasets. Pick an ID and go:

 ```
 $ rasteret datasets list
@@ -90,11 +90,12 @@ pc/usda-cdl USDA Cropland Data Layer conus
 aef/v1-annual AlphaEarth Foundation Embeddings (Annual) global CC-BY-4.0 none
 ```

-Each entry includes license metadata sourced from the authoritative STAC API,
-and a `commercial_use` flag for quick filtering.
+Each entry includes license metadata and a `commercial_use` flag for quick
+filtering.

-The catalog is open and community-driven. Each dataset entry is ~20 lines of
-Python: One PR adds a dataset; every user gets access on the next release.
+The catalog is open and community-driven. Each entry is ~20 lines of
+Python pointing to a STAC API or a GeoParquet file. One PR adds a dataset,
+every user gets access on the next release.

 Pick any ID and pass it to `build()`. Don't see your dataset? Use
 `build_from_stac()` for any STAC API, `build_from_table()` for existing
@@ -118,9 +119,9 @@ collection = rasteret.build(
 )
 ```

-`build()` picks the dataset from the catalog, queries the STAC API, parses
-COG headers, and caches everything as Parquet. The next run loads in
-milliseconds.
+`build()` picks the dataset from the catalog (backed by a STAC API or a
+GeoParquet file, depending on the entry), parses COG headers, and caches
+everything as Parquet. The next run loads in milliseconds.

 ### Inspect and filter

@@ -130,7 +131,7 @@ collection.bands # ['B01', 'B02', ..., 'B12', 'SCL']
 len(collection) # 47


-# Filter in memory no network calls
+# Filter in memory, no network calls
 filtered = collection.subset(cloud_cover_lt=15, date_range=("2024-03-01", "2024-06-01"))
 ```

docs/contributing.md

Lines changed: 10 additions & 9 deletions
@@ -202,27 +202,28 @@ tile reads.
 ## Adding a dataset to the catalog

 The built-in catalog lives in `src/rasteret/catalog.py`. Each entry is a
-`DatasetDescriptor`: roughly 20 lines of Python declaring a STAC URL,
-band map, license info, and coverage hints.
+`DatasetDescriptor`: roughly 20 lines of Python declaring a data source
+(STAC API, static STAC catalog, or GeoParquet URI), band map, license
+info, and coverage hints.

 Before adding a dataset, work through the
 [prerequisites checklist](how-to/dataset-catalog.md#prerequisites-for-contributing-a-built-in-dataset).
 The short version:

-1. **STAC access works**: `pystac_client` for STAC APIs, `pystac` for
-static catalogs (`catalog.json` on S3).
+1. **Data source is reachable**: STAC API, static catalog, or GeoParquet
+file. Verify you can query it or read it with PyArrow.
 2. **Band map has at least one working COG**: Rasteret parses COG headers
 during `build()`. If no asset can be parsed, Rasteret can't index or read
 the dataset.
-3. **End-to-end `build()` succeeds**: run `rasteret.build()` with
-`query={"max_items": 2}` and verify `count_rows() > 0`.
+3. **End-to-end `build()` succeeds**: run `rasteret.build()` with a small
+scope and verify `len(col) > 0`.
 4. **License is verified from the authoritative source**: pull `license`,
-`license_url`, and `commercial_use` from the STAC collection metadata.
-Do not guess.
+`license_url`, and `commercial_use` from the data provider. Do not guess.
 5. **Descriptor includes required metadata**: include `id`, `name`, `description`,
 `stac_api` (or `geoparquet_uri`), `band_map`, `license`, `license_url`,
 `spatial_coverage`, `temporal_range`. For static catalogs, set
-`static_catalog=True`.
+`static_catalog=True`. For GeoParquet sources, include `column_map` if
+columns need renaming.

 For static STAC catalogs (no `/search` endpoint), set `static_catalog=True`
 on the descriptor. Rasteret uses `pystac.Catalog.from_file()` to traverse

docs/explanation/compatibility-matrix.md

Lines changed: 1 addition & 1 deletion
@@ -27,7 +27,7 @@ points to the tests that cover them.
 |---|---|---|---|
 | `get_xarray()` | rasterio windowed read | Sentinel-2 uint16, AEF int8 | `test_execution.py` |
 | `get_gdf()` | rasterio windowed read | Sentinel-2 uint16 | `test_execution.py` |
-| `to_torchgeo_dataset()` | Pure TorchGeo GeoDataset | Sentinel-2 uint16 | `test_torchgeo_network_usage.py` |
+| `to_torchgeo_dataset()` | Pure TorchGeo GeoDataset | Sentinel-2 uint16 | `test_torchgeo_network.py` |

 ## Running live smoke locally

docs/explanation/interop.md

Lines changed: 7 additions & 7 deletions
@@ -40,7 +40,7 @@ full contract that samplers and dataset composition rely on:
 | `crs` | Set from the collection's EPSG code via `CRS.from_epsg()` |
 | `res` | Derived from the first record's COG metadata transform |
 | Samplers | Works with `RandomGeoSampler`, `GridGeoSampler`, and any sampler that reads `bounds`, `index`, and `res` |
-| Dataset composition | Works with `IntersectionDataset` and `UnionDataset` the index is designed so `reset_index()` does not conflict |
+| Dataset composition | Works with `IntersectionDataset` and `UnionDataset`; the index is designed so `reset_index()` does not conflict |

 Rasteret replaces the I/O backend (async obstore instead of rasterio/GDAL)
 but speaks the same interface. Nothing downstream of the dataset object
@@ -54,11 +54,11 @@ constructor parameters are Rasteret-specific.

 | Feature | What it does | Interop impact |
 |---|---|---|
-| `label_field` | Adds `sample["label"]` from a metadata column | None extra key, ignored by TorchGeo trainers |
-| `time_series=True` | Stacks all spatially overlapping records into `[T, C, H, W]` | None standard tensor shape, works with TorchGeo transforms |
-| `target_crs=` | Reprojects scenes from different CRS zones on the fly | None result has uniform CRS, transparent to samplers |
-| `cloud_config=` | Configures authenticated cloud reads (requester-pays, signed URLs) | None constructor-level, transparent to samplers |
-| `allow_resample=True` | Resamples bands with different native resolutions onto a common grid | None output tensor has uniform resolution |
+| `label_field` | Adds `sample["label"]` from a metadata column | None: extra key, ignored by TorchGeo trainers |
+| `time_series=True` | Stacks all spatially overlapping records into `[T, C, H, W]` | None: standard tensor shape, works with TorchGeo transforms |
+| `target_crs=` | Reprojects scenes from different CRS zones on the fly | None: result has uniform CRS, transparent to samplers |
+| `cloud_config=` | Configures authenticated cloud reads (requester-pays, signed URLs) | None: constructor-level, transparent to samplers |
+| `allow_resample=True` | Resamples bands with different native resolutions onto a common grid | None: output tensor has uniform resolution |

 #### Behavior details

@@ -183,7 +183,7 @@ comparison uses `rasterio.merge.merge` as the oracle, matching what TorchGeo's
 own `_merge_or_stack` calls. Coverage spans 12 datasets including Sentinel-2,
 Landsat, NAIP, Copernicus DEM, ESA WorldCover, and AEF (south-up). See
 `test_dataset_pixel_comparison.py` (requires `--network`), plus
-`test_public_network_smoke.py`, `test_torchgeo_network_usage.py`, and
+`test_public_network_smoke.py`, `test_torchgeo_network.py`, and
 `test_network_smoke.py`.

 If you encounter edge cases where output differs from rasterio, please

docs/how-to/dataset-catalog.md

Lines changed: 79 additions & 45 deletions
@@ -1,20 +1,19 @@
 # Dataset Catalog

 Rasteret ships with a built-in **dataset catalog**: a registry of known
-datasets so you can build a Collection without remembering STAC API URLs or
-endpoint details. Most users only need the `build()` function shown below;
-the later sections cover browsing, local registration, and advanced
-customisation.
+datasets so you can build a Collection by ID without remembering endpoints
+or file paths. Each entry points to a STAC API, a GeoParquet file, or both.
+Most users only need the `build()` function shown below; the later sections
+cover browsing, local registration, and advanced customisation.

 The built-in catalog includes 12 datasets: Sentinel-2, Landsat,
 NAIP, Copernicus DEM, ESRI Land Cover, ESA WorldCover, USDA CDL,
 ALOS DEM, NASADEM, and AlphaEarth Foundation embeddings. Run
 `rasteret datasets list` to see them all.

-Each catalog entry includes **license metadata** sourced from the
-authoritative STAC API: a license identifier, a URL to the full license
-text, and a `commercial_use` flag so you can quickly tell whether the
-data can be used commercially.
+Each catalog entry includes **license metadata**: a license identifier, a
+URL to the full license text, and a `commercial_use` flag so you can quickly
+tell whether the data can be used commercially.

 In short:

@@ -55,7 +54,7 @@ Many entries include an **Example bbox** and **Example time**. These are known-g
 ```python
 import rasteret

-# Build from a catalog entry (STAC-backed datasets require bbox + date_range)
+# STAC-backed entries need bbox + date_range; GeoParquet entries may not
 collection = rasteret.build(
     "earthsearch/sentinel-2-l2a",
     name="bangalore",
@@ -105,9 +104,9 @@ If you want explicit control over STAC endpoints and collection IDs, use

 TorchGeo ships per-dataset Python classes that download data locally
 and read from disk via rasterio/GDAL. Rasteret's catalog points at
-cloud-hosted data (STAC APIs, GeoParquet): no downloads, no custom
-code per dataset. You get a standard `Collection` from any catalog
-entry, then read pixels on demand from the cloud.
+cloud-hosted data (STAC APIs or GeoParquet files): no downloads, no
+custom code per dataset. You get a standard `Collection` from any
+catalog entry, then read pixels on demand from the cloud.

 ---

@@ -182,7 +181,8 @@ There are two supported patterns:
 )
 ```

-The `band_map` maps user-facing band codes to STAC asset keys.
+The `band_map` maps user-facing band codes to asset keys (STAC
+asset keys, or column-derived keys for GeoParquet sources).
 It is auto-registered so downstream code resolves band names
 without users needing to touch `BandRegistry` directly.

@@ -202,39 +202,69 @@ requirements. Every built-in entry must actually work with Rasteret's
 pipeline. Listing a dataset that can't be ingested is worse than not
 listing it at all.

-### 1. STAC access works
+A catalog entry can point to a **STAC API**, a **static STAC catalog**,
+or a **GeoParquet file** (like the built-in AEF embeddings dataset).
+The verification steps differ slightly depending on the source type.

-The dataset must be reachable via either a **STAC API** (with a `/search`
-endpoint) or a **static STAC catalog** (`catalog.json` on S3). Verify
-with:
+### 1. Data source is reachable

-```python
-# STAC API
-import pystac_client
-client = pystac_client.Client.open("<stac_api_url>")
-col = client.get_collection("<collection_id>")
+=== "STAC API / static catalog"

-# Static catalog
-import pystac
-cat = pystac.Catalog.from_file("<catalog_json_url>")
-items = list(cat.get_all_items()) # should return items
-```
+    The dataset must be reachable via either a **STAC API** (with a `/search`
+    endpoint) or a **static STAC catalog** (`catalog.json` on S3). Verify
+    with:
+
+    ```python
+    # STAC API
+    import pystac_client
+    client = pystac_client.Client.open("<stac_api_url>")
+    col = client.get_collection("<collection_id>")
+
+    # Static catalog
+    import pystac
+    cat = pystac.Catalog.from_file("<catalog_json_url>")
+    items = list(cat.get_all_items()) # should return items
+    ```
+
+=== "GeoParquet"
+
+    The Parquet file must be readable by PyArrow and contain the four
+    required columns (`id`, `datetime`, `geometry`, `assets`), or columns
+    that can be mapped to them via `column_map`. Verify with:
+
+    ```python
+    import pyarrow.parquet as pq
+    table = pq.read_table("<parquet_uri>")
+    print(table.schema)
+    # Confirm id, datetime, geometry, and asset URL columns exist
+    ```

-### 2. Band map has at least one working COG asset
+### 2. Assets point to tiled GeoTIFFs (COGs)

-The `band_map` must map at least one band code to a STAC asset key that
-points to a Cloud-Optimized GeoTIFF (COG). Rasteret parses COG headers
+The `band_map` must map at least one band code to an asset key that
+points to a Cloud-Optimized GeoTIFF. Rasteret parses COG headers
 during `build()`. If no assets can be parsed, Rasteret can't index or read
 the dataset.

-Check a sample item's asset keys:
+=== "STAC"

-```python
-item = items[0]
-for key, asset in item.assets.items():
-    print(f"{key}: {asset.media_type}")
-# Look for "image/tiff" or "application=geotiff" entries
-```
+    Check a sample item's asset keys:
+
+    ```python
+    item = items[0]
+    for key, asset in item.assets.items():
+        print(f"{key}: {asset.media_type}")
+    # Look for "image/tiff" or "application=geotiff" entries
+    ```
+
+=== "GeoParquet"
+
+    Check that the `assets` column (or `href_column`) contains URLs
+    pointing to `.tif` / `.tiff` files:
+
+    ```python
+    print(table.column("assets")[0]) # or your href column
+    ```

 ### 3. End-to-end `build()` succeeds

@@ -245,29 +275,30 @@ import rasteret
 col = rasteret.build(
     "<dataset_id>",
     name="smoke-test",
-    query={"max_items": 2},
+    query={"max_items": 2}, # STAC only; ignored for GeoParquet
     force=True,
 )
-print(col.dataset.count_rows()) # should be > 0
+print(len(col)) # should be > 0
 ```

-For STAC API datasets (non-static catalogs), `bbox` and `date_range` are required.
+For STAC API datasets (non-static catalogs), `bbox` and `date_range` are
+required. GeoParquet-backed entries may not need them (the Parquet file
+is the complete record set).

 ### 4. License is verified from the authoritative source

-Pull the license from the STAC API or catalog metadata. Do not guess:
+Pull the license from the data provider. Do not guess:

 ```python
 # STAC API
 col = client.get_collection("<collection_id>")
 print(col.license) # "CC-BY-4.0", "proprietary", etc.
 license_links = [l.href for l in col.links if l.rel == "license"]
-
-# Static catalog - check item-level properties
-item = items[0]
-print(item.properties.get("license"))
 ```

+For GeoParquet-only datasets, check the data provider's website or
+repository for license terms.
+
 Set `commercial_use=False` when the license prohibits it (e.g.
 `CC-BY-NC-4.0`).

@@ -276,6 +307,9 @@ Set `commercial_use=False` when the license prohibits it (e.g.
 Include at minimum: `id`, `name`, `description`, `stac_api` (or
 `geoparquet_uri`), `band_map`, `license`, `license_url`, `spatial_coverage`,
 `temporal_range`. For static catalogs, set `static_catalog=True`.
+For GeoParquet sources, include `column_map` if the source uses
+non-standard column names, and `href_column` if asset URLs live in a
+single column rather than a struct.

 ---

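Note on the GeoParquet verification path: the new prerequisite text asks contributors to confirm the four required columns (`id`, `datetime`, `geometry`, `assets`) and to check that asset URLs point to `.tif` / `.tiff` files. A minimal sketch of those two checks follows, using only the standard library. The helper names `missing_required_columns` and `looks_like_geotiff` are hypothetical and not part of Rasteret, and the direction of `column_map` (source column name to canonical name) is an assumption, since the diff does not state it.

```python
from urllib.parse import urlparse

# Canonical column names, taken from the GeoParquet prerequisite text in this diff.
REQUIRED_COLUMNS = ("id", "datetime", "geometry", "assets")


def missing_required_columns(source_columns, column_map=None):
    """Return the required columns still absent after renaming.

    Assumption: column_map maps a source column name to its canonical name.
    """
    column_map = column_map or {}
    renamed = {column_map.get(name, name) for name in source_columns}
    return [name for name in REQUIRED_COLUMNS if name not in renamed]


def looks_like_geotiff(href):
    """True if the URL path ends in .tif or .tiff, ignoring any query string."""
    path = urlparse(href).path.lower()
    return path.endswith((".tif", ".tiff"))


if __name__ == "__main__":
    # Hypothetical schema in which the ID column has a non-standard name.
    columns = ["scene_id", "datetime", "geometry", "assets"]
    print(missing_required_columns(columns))
    print(missing_required_columns(columns, {"scene_id": "id"}))
    print(looks_like_geotiff("https://example.com/data/tile.tif?sig=abc"))
```

In practice the column list would come from `table.schema.names` on the PyArrow table read in step 1. A file extension check is only a first filter: it does not prove the file is tiled or cloud-optimized, which is why the end-to-end `build()` step is still required.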