Commit 431eeb0

docs: clarify Rasteret's custom IO layer and obstore's transport role
obstore is the HTTP transport for multi-cloud URL routing (S3/GCS/Azure), not the source of read performance. Performance comes from the index-first approach: pre-cached tile offsets in Parquet, no header round-trips, and asyncio concurrency across scenes and bands.

Updated: design-decisions, architecture, benchmark, custom-cloud-provider, changelog, notebooks/README, COGReader docstring.

Signed-off-by: print-sid8 sidsub94@gmail.com
1 parent 974c029 commit 431eeb0

12 files changed: +186 -60 lines changed

README.md

Lines changed: 2 additions & 2 deletions
Original file line number | Diff line number | Diff line change
@@ -25,7 +25,7 @@ Rasteret parses those headers **once**, caches them in Parquet, and its
2525
own reader fetches pixels concurrently with no GDAL in the path.
2626
**Up to 20x faster** on cold starts.
2727

28-
We call this pattern **index-first geospatial image retrieval**:
28+
Rasteret calls this pattern **index-first geospatial retrieval**:
2929

3030
- **Control plane**: a queryable Parquet index (scene metadata, COG header metadata, user columns like splits/labels)
3131
- **Data plane**: on-demand tile reads from the original GeoTIFF/COG objects
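
The control-plane / data-plane split can be illustrated with a minimal sketch, assuming the index caches per-band tile offsets and byte counts (modeled on the TIFF `TileOffsets` and `TileByteCounts` tags, row-major tile order, single plane). `CachedHeader` and `tile_ranges_for_window` are hypothetical names used for illustration, not Rasteret's API. With the offsets already in the index, the byte ranges for a pixel window can be computed locally, with no header request:

```python
from dataclasses import dataclass
from math import ceil


@dataclass(frozen=True)
class CachedHeader:
    """Hypothetical per-band header record, as it might be cached in the index."""

    image_width: int
    image_height: int
    tile_width: int
    tile_height: int
    tile_offsets: list[int]      # byte offset of each tile, row-major order
    tile_byte_counts: list[int]  # compressed size of each tile in bytes


def tile_ranges_for_window(header, col_off, row_off, width, height):
    """Return (start, end) byte ranges, end exclusive, of tiles touching a pixel window."""
    tiles_across = ceil(header.image_width / header.tile_width)
    first_col = col_off // header.tile_width
    last_col = (col_off + width - 1) // header.tile_width
    first_row = row_off // header.tile_height
    last_row = (row_off + height - 1) // header.tile_height

    ranges = []
    for row in range(first_row, last_row + 1):
        for col in range(first_col, last_col + 1):
            idx = row * tiles_across + col  # row-major tile index
            start = header.tile_offsets[idx]
            ranges.append((start, start + header.tile_byte_counts[idx]))
    return ranges
```

The resulting ranges are what the data plane would then request from the original GeoTIFF/COG object.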
@@ -326,7 +326,7 @@ the full GeoDataset contract:
326326
- Works with `RandomGeoSampler`, `GridGeoSampler`, and any custom sampler
327327
- Works with `IntersectionDataset` and `UnionDataset` for dataset composition
328328

329-
Rasteret replaces the I/O backend (async obstore instead of rasterio/GDAL) but
329+
Rasteret replaces the I/O backend (custom IO instead of rasterio/GDAL) but
330330
speaks the same interface. Your samplers, DataLoader, transforms, and training
331331
loop do not change.
332332

docs/changelog.md

Lines changed: 5 additions & 4 deletions
@@ -38,9 +38,10 @@
3838
Cooperative exports, STAC GeoParquet, custom catalogs). No STAC API
3939
required for the table path. Optional `enrich_cog=True` parses COG
4040
headers for accelerated reads.
41-
- **Multi-cloud obstore backend**: S3, Azure Blob, and GCS routing via URL
42-
auto-detection, with automatic fallback to anonymous access.
43-
- **`create_backend()`** for authenticated reads with obstore credential
41+
- **Multi-cloud support**: S3, Azure Blob, and GCS routing via URL
42+
auto-detection, with automatic fallback to anonymous access. obstore replaces
43+
direct aiohttp as the HTTP transport, adding unified multi-cloud routing.
44+
- **`create_backend()`** for authenticated reads with credential
4445
providers (e.g., Planetary Computer SAS tokens).
4546
- **TorchGeo adapter**: `collection.to_torchgeo_dataset()` returns a
4647
`GeoDataset` backed by Rasteret's async COG reader. Supports
@@ -101,7 +102,7 @@
101102
### Other changes
102103

103104
- Arrow-native geometry internals (GeoArrow replaces Shapely in hot paths).
104-
- obstore as base dependency (Rust-native async HTTP).
105+
- obstore as base dependency (multi-cloud support).
105106
- CLI: `rasteret collections build|list|info|delete|import`,
106107
`rasteret datasets list|info|build|register-local|export-local|unregister-local`.
107108
- TorchGeo `time_series=True` uses spatial-only intersection, matching

docs/contributing.md

Lines changed: 1 addition & 1 deletion
@@ -86,7 +86,7 @@ src/rasteret/
8686
│ └── utils.py Sync/async bridging (run_sync), CRS transforms, data source inference
8787
8888
├── fetch/
89-
│ ├── cog.py [COGReader](reference/fetch/cog.md): HTTP range reads, tile decompression, obstore backend
89+
│ ├── cog.py [COGReader](reference/fetch/cog.md): HTTP range reads, tile decompression, obstore as HTTP transport
9090
│ └── header_parser.py TIFF/COG header parsing: IFD extraction, tile offset discovery
9191
9292
├── ingest/

docs/explanation/architecture.md

Lines changed: 43 additions & 3 deletions
@@ -2,6 +2,45 @@
22

33
This page explains how Rasteret's components work together.
44

5+
## Category framing
6+
7+
Rasteret uses an **index-first geospatial retrieval** architecture:
8+
9+
- **Control plane (tables/Parquet)**: dataset discovery, filtering, train/val/test splits, patch metadata, and cached COG metadata.
10+
- **Data plane (COG object storage)**: on-demand TIFF byte reads from source GeoTIFF/COG assets.
11+
12+
This split keeps metadata workflows table-native while avoiding payload-in-Parquet duplication.
13+
14+
### Architecture
15+
16+
```text
17+
┌──────────────────────────────────────────────────────────────────────────────┐
18+
│ Rasteret control plane (Collection lifecycle + index schema) │
19+
│ build* / load / as_collection / export / subset / where │
20+
│ outputs: Parquet-backed collection rows (scene metadata + COG header cache) │
21+
└───────────────────────────────┬──────────────────────────────────────────────┘
22+
│ Parquet rows + geometry + user columns
23+
v
24+
┌──────────────────────────────────────────────────────────────────────────────┐
25+
│ Arrow ecosystem (optional compute + enrichment) │
26+
│ DuckDB / Polars / GeoPandas / pandas / PyArrow │
27+
│ operations: add split/label/patch/AOI columns, filter, joins, aggregations │
28+
└───────────────────────────────┬──────────────────────────────────────────────┘
29+
│ filtered rows + geometry column
30+
v
31+
┌──────────────────────────────────────────────────────────────────────────────┐
32+
│ Rasteret IO engine (custom byte range fetches) │
33+
│ get_numpy() / get_xarray() / to_torchgeo_dataset() │
34+
│ consumes filtered rows + geometry + cached tile metadata │
35+
└───────────────────────────────┬──────────────────────────────────────────────┘
36+
│ async byte-range tile requests
37+
v
38+
┌──────────────────────────────────────────────────────────────────────────────┐
39+
│ Data plane: object storage with GeoTIFF/COG files │
40+
│ S3 / GCS / Azure / *.tif │
41+
└──────────────────────────────────────────────────────────────────────────────┘
42+
```
43+
544
## Data flow
645

746
```text
@@ -33,7 +72,7 @@ Collection (Arrow dataset wrapper)
3372
+--> iterate_rasters() --> async RasterAccessor stream
3473
|
3574
v
36-
obstore (auto-routes to S3Store / AzureStore / GCSStore / HTTPStore)
75+
custom IO engine (async byte-range reads, tile decode, geometry mask) — obstore as HTTP transport (auto-routes to S3Store / AzureStore / GCSStore / HTTPStore)
3776
```
3877

3978
`DatasetRegistry` is the in-code dataset catalog. It powers `build()` and the
@@ -89,8 +128,9 @@ Rasteret's speedup over sequential approaches.
89128
tile-level byte-range reads. It:
90129

91130
1. Merges nearby byte ranges to minimize HTTP round-trips.
92-
2. Delegates to `obstore` for all remote reads, with automatic URL routing
93-
to native cloud stores (S3Store, AzureStore, GCSStore, or HTTPStore).
131+
2. Issues async byte-range requests via obstore (HTTP transport layer), with
132+
automatic URL routing to native cloud stores (S3Store, AzureStore, GCSStore,
133+
or HTTPStore).
94134
3. Decompresses tiles (deflate, LZW, zstd, LERC) in a thread pool.
95135
(TIFF JPEG is currently rejected with a hard error until implemented.)
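
Step 1 above (merging nearby byte ranges) can be sketched as follows. This is an illustrative version under stated assumptions, not the COGReader implementation: the function name `merge_byte_ranges` and the `max_gap` parameter are hypothetical, and ranges are `(start, end)` pairs with an exclusive end.

```python
def merge_byte_ranges(ranges, max_gap=0):
    """Coalesce (start, end) ranges whose gap is at most max_gap bytes.

    Fetching a few unused bytes between two tiles is often cheaper than
    paying for a second HTTP round-trip, so small gaps are absorbed.
    """
    if not ranges:
        return []

    ordered = sorted(ranges)
    merged = [ordered[0]]
    for start, end in ordered[1:]:
        last_start, last_end = merged[-1]
        if start - last_end <= max_gap:
            # Close enough (or overlapping): extend the previous range.
            merged[-1] = (last_start, max(last_end, end))
        else:
            merged.append((start, end))
    return merged
```

After the merged ranges are fetched, each original tile is recovered by slicing the merged buffer at `tile_start - merged_start`.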
96136

docs/explanation/benchmark.md

Lines changed: 1 addition & 1 deletion
@@ -45,7 +45,7 @@ before pixel reads begin.
4545
| Backend | Transport | Install |
4646
|---------|-----------|---------|
4747
| **TorchGeo** | GDAL `/vsicurl/` (sequential) | `uv pip install torchgeo` |
48-
| **Rasteret** | Rust-native HTTP via obstore (concurrent) | `uv pip install rasteret` |
48+
| **Rasteret** | Custom async IO engine (concurrent byte-range reads, obstore transport) | `uv pip install rasteret` |
4949

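The "concurrent byte-range reads" entry in the table can be pictured with a small sketch. It assumes deflate-compressed tiles and uses an in-memory `bytes` object in place of a remote file, so `fetch_range` and `read_tiles` are hypothetical stand-ins rather than Rasteret functions. Range requests run concurrently on the event loop, and decompression is pushed to worker threads:

```python
import asyncio
import zlib


async def fetch_range(blob: bytes, start: int, end: int) -> bytes:
    """Stand-in for a remote byte-range request (no network involved)."""
    await asyncio.sleep(0)  # yield to the event loop, as real I/O would
    return blob[start:end]


async def read_tiles(blob, ranges, max_concurrency=8):
    """Fetch all ranges concurrently and decompress each tile in a thread."""
    semaphore = asyncio.Semaphore(max_concurrency)

    async def read_one(start, end):
        async with semaphore:  # bound the number of in-flight requests
            raw = await fetch_range(blob, start, end)
        # Decompression is CPU work, so keep it off the event loop.
        return await asyncio.to_thread(zlib.decompress, raw)

    # gather() preserves the order of the input ranges.
    return await asyncio.gather(*(read_one(s, e) for s, e in ranges))
```

A sequential reader would await each range in turn; here the total wait is closer to the slowest single request than to the sum of all of them.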
5050
## Cold start (first run)
5151

docs/explanation/design-decisions.md

Lines changed: 9 additions & 11 deletions
@@ -84,17 +84,16 @@ When does promotion happen?
8484
In both cases, the promotion is explicit at the call site (not an implicit
8585
side effect of masking).
8686

87-
## obstore as the HTTP backend
87+
## obstore as the URL routing backend
8888

89-
Rasteret uses [obstore](https://github.com/developmentseed/obstore) for all
90-
remote byte-range reads. obstore wraps the
91-
[`object_store`](https://docs.rs/object_store/) Rust crate. It is Rust-native,
92-
multi-cloud, and does not depend on GDAL.
89+
Rasteret uses [obstore](https://github.com/developmentseed/obstore) as the
90+
HTTP transport layer for multi-cloud URL routing. Rasteret's byte-range read
91+
logic, tile decode, and concurrency are custom; obstore replaced direct aiohttp
92+
to gain unified S3/GCS/Azure routing without adding per-cloud clients.
9393

9494
Why obstore as a hard dependency rather than optional?
9595

96-
- **Multi-cloud + auth**: Rasteret previously used a Python HTTP client
97-
(aiohttp) for range reads. obstore gives a single, well-tested interface for
96+
- **Multi-cloud + auth**: Rasteret's byte-range reads use its own custom IO; obstore adds a single, well-tested interface for
9897
S3/GCS/Azure/HTTPS plus credential providers (requester-pays, SAS signing,
9998
Earthdata, etc.).
10099
- **Single code path**: one backend means fewer branches to test and maintain.
@@ -103,13 +102,12 @@ Why obstore as a hard dependency rather than optional?
103102
`storage.googleapis.com`, AzureStore for `*.blob.core.windows.net`, and
104103
HTTPStore for everything else. Authenticated reads use obstore credential
105104
providers via `create_backend()`.
106-
- **Well-maintained lineage**: obstore is from Development Seed and it wraps
107-
the same Rust crate [`object_store`](https://docs.rs/object_store/) that Databricks, InfluxDB, and many in Arrow
108-
ecosystem depend on.
105+
- **Well-maintained lineage**: obstore wraps the same Rust crate [`object_store`](https://docs.rs/object_store/)
106+
that Databricks, InfluxDB, and many projects in the Arrow ecosystem depend on.
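
The URL routing described in the list above can be sketched as a plain host/scheme check. This is a hedged illustration, not Rasteret's or obstore's routing code: `store_kind_for_url` is a hypothetical helper that only returns the store name, and the S3 rule (`s3://` scheme or `*.amazonaws.com` host) and the `gs://` scheme are assumptions, since only the GCS host, the Azure host pattern, and the HTTP fallback are spelled out above.

```python
from urllib.parse import urlparse


def store_kind_for_url(url: str) -> str:
    """Pick a store type name from a URL (illustrative rules only)."""
    parsed = urlparse(url)
    host = (parsed.hostname or "").lower()

    # Assumption: S3 is detected by scheme or an amazonaws.com host.
    if parsed.scheme == "s3" or host.endswith(".amazonaws.com"):
        return "S3Store"
    # storage.googleapis.com is named in the docs; gs:// is an assumption.
    if parsed.scheme == "gs" or host == "storage.googleapis.com":
        return "GCSStore"
    if host.endswith(".blob.core.windows.net"):
        return "AzureStore"
    # Everything else falls back to plain HTTP(S).
    return "HTTPStore"
```

Keeping the decision in one function is what makes the single code path practical: callers pass a URL and never branch on the cloud provider themselves.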
109107

110108
## Decoupled index and read layers
111109

112-
The Parquet index is the stable contract. The read layer ([COGReader](../reference/fetch/cog.md), obstore)
110+
The Parquet index is the stable contract. The read layer ([COGReader](../reference/fetch/cog.md))
113111
is swappable via the [`StorageBackend`](../reference/cloud.md) protocol. Concretely:
114112

115113
- Custom backends (e.g. a pre-configured `S3Store`) can be plugged in without

docs/explanation/interop.md

Lines changed: 3 additions & 3 deletions
@@ -46,7 +46,7 @@ full contract that samplers and dataset composition rely on:
4646
| Samplers | Works with `RandomGeoSampler`, `GridGeoSampler`, and any sampler that reads `bounds`, `index`, and `res` |
4747
| Dataset composition | Works with `IntersectionDataset` and `UnionDataset`; the index is designed so `reset_index()` does not conflict |
4848

49-
Rasteret replaces the I/O backend (async obstore instead of rasterio/GDAL)
49+
Rasteret replaces the I/O backend (custom IO instead of rasterio/GDAL)
5050
but speaks the same interface. Nothing downstream of the dataset object
5151
needs to change.
5252

@@ -93,7 +93,7 @@ See [TorchGeo Integration](../tutorials/02_torchgeo_09_accelerator.ipynb) and
9393

9494
### xarray / GeoPandas / NumPy
9595

96-
Rasteret handles the I/O (async byte-range reads via obstore), then hands
96+
Rasteret handles the I/O (async byte-range reads), then hands
9797
off to standard xarray, GeoPandas, or NumPy outputs:
9898

9999
- [`Collection.get_xarray(...)`](../reference/core/collection.md) returns an `xr.Dataset`
@@ -137,7 +137,7 @@ Rasteret uses rasterio for geometry masking (`rasterio.features.geometry_mask`),
137137
multi-CRS reprojection (`rasterio.warp.reproject`), and TorchGeo query-grid
138138
placement (`rasterio.merge.merge` via `rio_semantics.py`). CRS transforms and
139139
coordinate operations use pyproj directly. Tile reads go through Rasteret's
140-
own async pipeline backed by obstore. No GDAL in the tile-read path.
140+
own async IO. No GDAL in the tile-read path.
141141

142142
CRS encoding in xarray output uses pyproj's CF conventions (`CRS.to_cf()`,
143143
`CRS.to_wkt()`, `CRS.to_json()`), not rioxarray.

docs/how-to/custom-cloud-provider.md

Lines changed: 7 additions & 8 deletions
@@ -3,9 +3,8 @@
33
## Do you need this page?
44

55
Most datasets work **without any authentication or configuration**. Rasteret's
6-
obstore backend routes URLs to native cloud stores (S3, Azure Blob, GCS)
7-
automatically, and public data like Sentinel-2 on Earth Search is read
8-
anonymously.
6+
IO layer automatically routes URLs to native cloud stores (S3, Azure Blob, GCS),
7+
and public data like Sentinel-2 on Earth Search is read anonymously.
98

109
| Data source | Auth needed | What to do |
1110
|---|---|---|
@@ -82,7 +81,7 @@ ds = collection.get_xarray(
8281
)
8382
```
8483

85-
Rasteret auto-creates an obstore backend from the config at read time.
84+
Rasteret auto-creates a backend from the config at read time.
8685
For requester-pays buckets, AWS credentials are resolved automatically
8786
from environment variables or `~/.aws/credentials` via boto3.
8887

@@ -107,10 +106,10 @@ For catalog datasets, the `data_source` is the catalog ID (e.g.
107106
should use with `CloudConfig.get_config(...)` if you want to inspect or
108107
override built-in behavior.
109108

110-
## Multi-cloud obstore backend
109+
## Multi-cloud URL routing (via obstore)
111110

112-
Rasteret uses obstore for all remote reads and natively routes URLs to the
113-
correct cloud store:
111+
Rasteret's IO layer uses obstore as the HTTP transport and natively routes
112+
URLs to the correct cloud store:
114113

115114
| URL pattern | Store type |
116115
|---|---|
@@ -128,7 +127,7 @@ This happens automatically -- no configuration needed for public data.
128127

129128
Use `create_backend()` when your data source provides its own credential
130129
mechanism (Planetary Computer SAS tokens, Earthdata-style temporary S3 credentials).
131-
This passes the credential provider directly to the obstore native store
130+
This passes the credential provider to the underlying cloud store
132131
(S3Store, AzureStore, GCSStore):
133132

134133
### Planetary Computer

docs/reference/integrations/torchgeo.md

Lines changed: 1 addition & 1 deletion
@@ -3,7 +3,7 @@
33
TorchGeo `GeoDataset` adapter for Rasteret collections.
44

55
`RasteretGeoDataset` is a standard TorchGeo `GeoDataset` subclass. It
6-
replaces the I/O backend (async obstore instead of rasterio/GDAL) while
6+
replaces the I/O backend (custom IO instead of rasterio/GDAL) while
77
honoring the full GeoDataset contract: `index`, `crs`, `res`,
88
`__getitem__(GeoSlice) -> Sample`. Compatible with all TorchGeo samplers,
99
collation helpers (`stack_samples`, `concat_samples`), transforms, and
