2 | 2 |
3 | 3 | This page explains how Rasteret's components work together. |
4 | 4 |
| 5 | +## Category framing |
| 6 | + |
| 7 | +Rasteret uses an **index-first geospatial retrieval** architecture: |
| 8 | + |
| 9 | +- **Control plane (tables/Parquet)**: dataset discovery, filtering, train/val/test splits, patch metadata, and cached COG metadata. |
| 10 | +- **Data plane (COG object storage)**: on-demand TIFF byte reads from source GeoTIFF/COG assets. |
| 11 | + |
| 12 | +This split keeps metadata workflows table-native and avoids copying pixel payloads into Parquet.
| 13 | + |
| 14 | +### Architecture |
| 15 | + |
| 16 | +```text |
| 17 | +┌──────────────────────────────────────────────────────────────────────────────┐ |
| 18 | +│ Rasteret control plane (Collection lifecycle + index schema) │ |
| 19 | +│ build* / load / as_collection / export / subset / where │ |
| 20 | +│ outputs: Parquet-backed collection rows (scene metadata + COG header cache) │ |
| 21 | +└───────────────────────────────┬──────────────────────────────────────────────┘ |
| 22 | + │ Parquet rows + geometry + user columns |
| 23 | + v |
| 24 | +┌──────────────────────────────────────────────────────────────────────────────┐ |
| 25 | +│ Arrow ecosystem (optional compute + enrichment) │ |
| 26 | +│ DuckDB / Polars / GeoPandas / pandas / PyArrow │ |
| 27 | +│ operations: add split/label/patch/AOI columns, filter, joins, aggregations │ |
| 28 | +└───────────────────────────────┬──────────────────────────────────────────────┘ |
| 29 | + │ filtered rows + geometry column |
| 30 | + v |
| 31 | +┌──────────────────────────────────────────────────────────────────────────────┐ |
| 32 | +│ Rasteret IO engine (custom byte range fetches) │ |
| 33 | +│ get_numpy() / get_xarray() / to_torchgeo_dataset() │ |
| 34 | +│ consumes filtered rows + geometry + cached tile metadata │ |
| 35 | +└───────────────────────────────┬──────────────────────────────────────────────┘ |
| 36 | + │ async byte-range tile requests |
| 37 | + v |
| 38 | +┌──────────────────────────────────────────────────────────────────────────────┐ |
| 39 | +│ Data plane: object storage with GeoTIFF/COG files │ |
| 40 | +│ S3 / GCS / Azure / *.tif │ |
| 41 | +└──────────────────────────────────────────────────────────────────────────────┘ |
| 42 | +``` |
| 43 | + |
5 | 44 | ## Data flow |
6 | 45 |
7 | 46 | ```text |
@@ -33,7 +72,7 @@ Collection (Arrow dataset wrapper) |
33 | 72 | +--> iterate_rasters() --> async RasterAccessor stream |
34 | 73 | | |
35 | 74 | v |
36 | | -obstore (auto-routes to S3Store / AzureStore / GCSStore / HTTPStore) |
| 75 | +custom IO engine (async byte-range reads, tile decode, geometry mask) — obstore as HTTP transport (auto-routes to S3Store / AzureStore / GCSStore / HTTPStore) |
37 | 76 | ``` |
38 | 77 |
39 | 78 | `DatasetRegistry` is the in-code dataset catalog. It powers `build()` and the |
@@ -89,8 +128,9 @@ Rasteret's speedup over sequential approaches. |
89 | 128 | tile-level byte-range reads. It: |
90 | 129 |
91 | 130 | 1. Merges nearby byte ranges to minimize HTTP round-trips. |
92 | | -2. Delegates to `obstore` for all remote reads, with automatic URL routing |
93 | | - to native cloud stores (S3Store, AzureStore, GCSStore, or HTTPStore). |
| 131 | +2. Issues async byte-range requests via `obstore` (the HTTP transport layer), with
| 132 | + automatic URL routing to native cloud stores (S3Store, AzureStore, GCSStore,
| 133 | + or HTTPStore).
94 | 134 | 3. Decompresses tiles (deflate, LZW, zstd, LERC) in a thread pool. |
95 | 135 | (TIFF JPEG is currently rejected with a hard error until implemented.) |
96 | 136 |
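Step 1 of the list above (merging nearby byte ranges) can be illustrated with a minimal sketch. This is not Rasteret's actual implementation: the function name `merge_byte_ranges`, the `max_gap` parameter, and the half-open `(start, end)` representation are all assumptions made for illustration. The idea is that tiles whose byte ranges sit close together in the file are fetched with one request, at the cost of reading a few unused bytes in between.

```python
from typing import List, Tuple

# Hypothetical representation: (start, end) byte offsets, end exclusive.
ByteRange = Tuple[int, int]


def merge_byte_ranges(ranges: List[ByteRange], max_gap: int = 0) -> List[ByteRange]:
    """Coalesce ranges separated by at most `max_gap` bytes.

    Fewer, larger ranges mean fewer HTTP round-trips; `max_gap` bounds
    how many unneeded bytes a merged request may pull in per join.
    """
    merged: List[ByteRange] = []
    for start, end in sorted(ranges):
        if merged and start - merged[-1][1] <= max_gap:
            # Close enough to (or overlapping) the previous range: extend it.
            prev_start, prev_end = merged[-1]
            merged[-1] = (prev_start, max(prev_end, end))
        else:
            merged.append((start, end))
    return merged


# Four tile ranges; three are near each other, one is far away.
tiles = [(0, 100), (120, 200), (5000, 5100), (150, 260)]
merged = merge_byte_ranges(tiles, max_gap=64)
print(merged)
```

With a 64-byte gap tolerance, the three nearby ranges collapse into one request and the distant tile keeps its own. A larger `max_gap` trades bandwidth for fewer requests; a suitable value depends on request latency and throughput of the store.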