 # Dataset Catalog
 
 Rasteret ships with a built-in **dataset catalog**: a registry of known
-datasets so you can build a Collection without remembering STAC API URLs or
-endpoint details. Most users only need the `build()` function shown below;
-the later sections cover browsing, local registration, and advanced
-customisation.
+datasets so you can build a Collection by ID without remembering endpoints
+or file paths. Each entry points to a STAC API, a GeoParquet file, or both.
+Most users only need the `build()` function shown below; the later sections
+cover browsing, local registration, and advanced customisation.
 
 The built-in catalog includes 12 datasets: Sentinel-2, Landsat,
 NAIP, Copernicus DEM, ESRI Land Cover, ESA WorldCover, USDA CDL,
 ALOS DEM, NASADEM, and AlphaEarth Foundation embeddings. Run
 `rasteret datasets list` to see them all.
 
-Each catalog entry includes **license metadata** sourced from the
-authoritative STAC API: a license identifier, a URL to the full license
-text, and a `commercial_use` flag so you can quickly tell whether the
-data can be used commercially.
+Each catalog entry includes **license metadata**: a license identifier, a
+URL to the full license text, and a `commercial_use` flag so you can quickly
+tell whether the data can be used commercially.
 
 In short:
 
@@ -55,7 +54,7 @@ Many entries include an **Example bbox** and **Example time**. These are known-g
 ```python
 import rasteret
 
-# Build from a catalog entry (STAC-backed datasets require bbox + date_range)
+# STAC-backed entries need bbox + date_range; GeoParquet entries may not
 collection = rasteret.build(
     "earthsearch/sentinel-2-l2a",
     name="bangalore",
@@ -105,9 +104,9 @@ If you want explicit control over STAC endpoints and collection IDs, use
 
     TorchGeo ships per-dataset Python classes that download data locally
     and read from disk via rasterio/GDAL. Rasteret's catalog points at
-    cloud-hosted data (STAC APIs, GeoParquet): no downloads, no custom
-    code per dataset. You get a standard `Collection` from any catalog
-    entry, then read pixels on demand from the cloud.
+    cloud-hosted data (STAC APIs or GeoParquet files): no downloads, no
+    custom code per dataset. You get a standard `Collection` from any
+    catalog entry, then read pixels on demand from the cloud.
 
 ---
 
@@ -182,7 +181,8 @@ There are two supported patterns:
     )
     ```
 
-    The `band_map` maps user-facing band codes to STAC asset keys.
+    The `band_map` maps user-facing band codes to asset keys (STAC
+    asset keys, or column-derived keys for GeoParquet sources).
     It is auto-registered so downstream code resolves band names
     without users needing to touch `BandRegistry` directly.
 
@@ -202,39 +202,69 @@ requirements. Every built-in entry must actually work with Rasteret's
 pipeline. Listing a dataset that can't be ingested is worse than not
 listing it at all.
 
-### 1. STAC access works
+A catalog entry can point to a **STAC API**, a **static STAC catalog**,
+or a **GeoParquet file** (like the built-in AEF embeddings dataset).
+The verification steps differ slightly depending on the source type.
 
-The dataset must be reachable via either a **STAC API** (with a `/search`
-endpoint) or a **static STAC catalog** (`catalog.json` on S3). Verify
-with:
+### 1. Data source is reachable
 
-```python
-# STAC API
-import pystac_client
-client = pystac_client.Client.open("<stac_api_url>")
-col = client.get_collection("<collection_id>")
+=== "STAC API / static catalog"
 
-# Static catalog
-import pystac
-cat = pystac.Catalog.from_file("<catalog_json_url>")
-items = list(cat.get_all_items())  # should return items
-```
+    The dataset must be reachable via either a **STAC API** (with a `/search`
+    endpoint) or a **static STAC catalog** (`catalog.json` on S3). Verify
+    with:
+
+    ```python
+    # STAC API
+    import pystac_client
+    client = pystac_client.Client.open("<stac_api_url>")
+    col = client.get_collection("<collection_id>")
+
+    # Static catalog
+    import pystac
+    cat = pystac.Catalog.from_file("<catalog_json_url>")
+    items = list(cat.get_all_items())  # should return items
+    ```
+
+=== "GeoParquet"
+
+    The Parquet file must be readable by PyArrow and contain the four
+    required columns (`id`, `datetime`, `geometry`, `assets`), or columns
+    that can be mapped to them via `column_map`. Verify with:
+
+    ```python
+    import pyarrow.parquet as pq
+    table = pq.read_table("<parquet_uri>")
+    print(table.schema)
+    # Confirm id, datetime, geometry, and asset URL columns exist
+    ```
 
-### 2. Band map has at least one working COG asset
+### 2. Assets point to tiled GeoTIFFs (COGs)
 
-The `band_map` must map at least one band code to a STAC asset key that
-points to a Cloud-Optimized GeoTIFF (COG). Rasteret parses COG headers
+The `band_map` must map at least one band code to an asset key that
+points to a Cloud-Optimized GeoTIFF. Rasteret parses COG headers
 during `build()`. If no assets can be parsed, Rasteret can't index or read
 the dataset.
 
-Check a sample item's asset keys:
+=== "STAC"
 
-```python
-item = items[0]
-for key, asset in item.assets.items():
-    print(f"{key}: {asset.media_type}")
-# Look for "image/tiff" or "application=geotiff" entries
-```
+    Check a sample item's asset keys:
+
+    ```python
+    item = items[0]
+    for key, asset in item.assets.items():
+        print(f"{key}: {asset.media_type}")
+    # Look for "image/tiff" or "application=geotiff" entries
+    ```
+
+=== "GeoParquet"
+
+    Check that the `assets` column (or `href_column`) contains URLs
+    pointing to `.tif` / `.tiff` files:
+
+    ```python
+    print(table.column("assets")[0])  # or your href column
+    ```
 
 ### 3. End-to-end `build()` succeeds
 
@@ -245,29 +275,30 @@ import rasteret
 col = rasteret.build(
     "<dataset_id>",
     name="smoke-test",
-    query={"max_items": 2},
+    query={"max_items": 2},  # STAC only; ignored for GeoParquet
     force=True,
 )
-print(col.dataset.count_rows())  # should be > 0
+print(len(col))  # should be > 0
 ```
 
-For STAC API datasets (non-static catalogs), `bbox` and `date_range` are required.
+For STAC API datasets (non-static catalogs), `bbox` and `date_range` are
+required. GeoParquet-backed entries may not need them (the Parquet file
+is the complete record set).
 
 ### 4. License is verified from the authoritative source
 
-Pull the license from the STAC API or catalog metadata. Do not guess:
+Pull the license from the data provider. Do not guess:
 
 ```python
 # STAC API
 col = client.get_collection("<collection_id>")
 print(col.license)  # "CC-BY-4.0", "proprietary", etc.
 license_links = [l.href for l in col.links if l.rel == "license"]
-
-# Static catalog - check item-level properties
-item = items[0]
-print(item.properties.get("license"))
 ```
 
+For GeoParquet-only datasets, check the data provider's website or
+repository for license terms.
+
 Set `commercial_use=False` when the license prohibits it (e.g.
 `CC-BY-NC-4.0`).
 
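The `commercial_use` rule above could be pre-screened with a small helper before a human reads the license text. The sketch below is a heuristic only, not part of Rasteret: `suggest_commercial_use` and `KNOWN_COMMERCIAL_OK` are hypothetical names, and the allow-list is an assumption for illustration. Anything the helper cannot classify returns `None`, in keeping with the "do not guess" rule.

```python
from typing import Optional

# Assumed allow-list for illustration; extend only after reading the license.
KNOWN_COMMERCIAL_OK = {"CC0-1.0", "CC-BY-4.0", "CC-BY-SA-4.0"}


def suggest_commercial_use(license_id: str) -> Optional[bool]:
    """Suggest a commercial_use flag from a license identifier.

    Returns False for NonCommercial ("NC") licenses, True for identifiers
    in the allow-list, and None when a manual check is still needed.
    """
    normalized = license_id.strip().upper()
    if "NC" in normalized.split("-"):
        return False
    if normalized in KNOWN_COMMERCIAL_OK:
        return True
    return None  # unknown or custom terms: verify with the data provider


for lic in ("CC-BY-NC-4.0", "CC-BY-4.0", "proprietary"):
    print(lic, "->", suggest_commercial_use(lic))
```

Returning `None` rather than a default keeps unverified licenses visibly unresolved until someone checks the authoritative source.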
@@ -276,6 +307,9 @@ Set `commercial_use=False` when the license prohibits it (e.g.
 Include at minimum: `id`, `name`, `description`, `stac_api` (or
 `geoparquet_uri`), `band_map`, `license`, `license_url`, `spatial_coverage`,
 `temporal_range`. For static catalogs, set `static_catalog=True`.
+For GeoParquet sources, include `column_map` if the source uses
+non-standard column names, and `href_column` if asset URLs live in a
+single column rather than a struct.
 
 ---
 
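The minimum-field checklist above can be checked mechanically before opening a PR. The sketch below assumes, purely for illustration, that a candidate entry is written as a plain dictionary. `missing_fields` is a hypothetical helper, every value in `example_entry` is a placeholder, and Rasteret's real descriptor type may look different.

```python
# Field names come from the checklist above; the dict representation and
# the helper itself are assumptions for illustration.
REQUIRED_FIELDS = (
    "id", "name", "description", "band_map",
    "license", "license_url", "spatial_coverage", "temporal_range",
)
SOURCE_FIELDS = ("stac_api", "geoparquet_uri")  # at least one must be set


def missing_fields(entry):
    """Return the names of required fields that are absent or empty."""
    missing = [field for field in REQUIRED_FIELDS if not entry.get(field)]
    if not any(entry.get(field) for field in SOURCE_FIELDS):
        missing.append("stac_api or geoparquet_uri")
    return missing


# Placeholder values only; not a real catalog entry.
example_entry = {
    "id": "example/my-dataset",
    "name": "My Dataset",
    "description": "Placeholder entry for illustration.",
    "stac_api": "https://example.com/stac",
    "band_map": {"B04": "red"},
    "license": "CC-BY-4.0",
    "license_url": "https://example.com/license",
    "spatial_coverage": "global",
    "temporal_range": ("2020-01-01", None),
}

print(missing_fields(example_entry))  # prints: []
```

An empty list means the entry has the minimum metadata. It says nothing about whether the source is reachable or the assets are COGs, which the verification steps above still need to confirm.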