Commit 7cb750e

docs: clarify Rasteret's custom IO layer and obstore's transport role
obstore is the HTTP transport for multi-cloud URL routing (S3/GCS/Azure), not the source of read performance. Performance comes from the index-first approach: pre-cached tile offsets in Parquet, no header round-trips, and asyncio concurrency across scenes and bands.

Updated: design-decisions, architecture, benchmark, custom-cloud-provider, changelog, notebooks/README, COGReader docstring.

Signed-off-by: print-sid8 <sidsub94@gmail.com>
1 parent 147ed46 commit 7cb750e
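
The commit message attributes read performance to three things: tile offsets pre-cached in the Parquet index, no header round-trips, and asyncio concurrency across scenes and bands. The following is a minimal sketch of that pattern, not Rasteret's actual implementation. The `TileRef` record, the `fetch_range` helper, the `INDEX` list, the URLs, and the in-memory `FAKE_STORE` are all hypothetical stand-ins; a real reader would issue HTTP range requests through a transport such as obstore.

```python
from __future__ import annotations

import asyncio
from dataclasses import dataclass

# Hypothetical stand-in for remote COG files: URL -> raw bytes.
FAKE_STORE = {
    "s3://bucket/scene_a_B4.tif": bytes(range(256)),
    "s3://bucket/scene_a_B5.tif": bytes(reversed(range(256))),
    "s3://bucket/scene_b_B4.tif": bytes([7]) * 256,
}


@dataclass(frozen=True)
class TileRef:
    """One row of a hypothetical index: where a tile lives inside a file."""

    url: str
    offset: int  # byte offset of the tile, cached when the index was built
    length: int  # byte length of the tile


# Hypothetical pre-built index. Because the offsets are already known,
# no request is spent on reading TIFF headers before fetching a tile.
INDEX = [
    TileRef("s3://bucket/scene_a_B4.tif", offset=16, length=4),
    TileRef("s3://bucket/scene_a_B5.tif", offset=0, length=4),
    TileRef("s3://bucket/scene_b_B4.tif", offset=100, length=2),
]


async def fetch_range(ref: TileRef) -> bytes:
    """Simulate one ranged GET; a real transport would do network IO here."""
    await asyncio.sleep(0.01)  # placeholder for network latency
    data = FAKE_STORE[ref.url]
    return data[ref.offset : ref.offset + ref.length]


async def read_tiles(index: list[TileRef]) -> list[bytes]:
    """Issue all ranged reads concurrently, across scenes and bands."""
    return await asyncio.gather(*(fetch_range(ref) for ref in index))


tiles = asyncio.run(read_tiles(INDEX))
```

In this simulation the waits overlap, so total latency is close to that of the slowest single request, not the sum of all requests.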

1 file changed: +28 -13 lines changed

README.md

Lines changed: 28 additions & 13 deletions
@@ -32,12 +32,12 @@ Rasteret calls this pattern **index-first geospatial retrieval**:
 
 This keeps metadata and experiment logic in tables while leaving imagery bytes in source COGs.
 
-Key Features -
+Key Features -
 - **Easy** - three lines from STAC search or Parquet file to a TorchGeo-compatible dataset
-- **20x faster, saves cloud LISTs and GETs** - Our custom IO gets chunks of images fast, and costs no overhead a Collection is built
+- **20x faster, saves cloud LISTs and GETs** - Our custom IO gets chunks of images fast, and costs no overhead a Collection is built
 - **Zero data downloads** - work with terabytes of imagery while storing only megabytes of metadata.
 - **No STAC at training time** - query once at setup; zero API calls during training with Collection you can extend.
-- **Reproducible** - same Parquet index = same records = same results
+- **Reproducible** - same Parquet index = same records = same results
 - **Native dtypes** - In our IO image chunks of uint16 stays uint16 in tensors; only xarray conversion promotes to float32 to fill NaNs
 - **Shareable cache** - enrich our Collection with your ML splits, patch geometries, custom data points for ML, and share it, don't write folders of image chips!
 
@@ -82,8 +82,11 @@ See [Getting Started](https://terrafloww.github.io/rasteret/getting-started/) fo
 
 ## Built-in datasets
 
-Rasteret ships with a growing catalog of datasets. Pick an ID and go:
+Rasteret ships with a growing catalog of datasets.
+Each entry includes license metadata and a `commercial_use` flag for quick
+filtering.
 
+Pick an ID, pass it to `build()` and go:
 ```
 $ rasteret datasets list
 ID Name Coverage License Auth
@@ -101,17 +104,19 @@ pc/esa-worldcover ESA WorldCover global
 pc/usda-cdl USDA Cropland Data Layer conus proprietary(free) required
 ```
 
-Each entry includes license metadata and a `commercial_use` flag for quick
-filtering.
 
-The catalog is open and community-driven. Each entry is ~20 lines of
-Python pointing to a STAC API or a GeoParquet file. One PR adds a dataset,
-every user gets access on the next release.
 
-Pick any ID and pass it to `build()`. Don't see your dataset? Use
-`build_from_stac()` for any STAC API, `build_from_table()` for existing
-Parquet, or [add it to the catalog](https://terrafloww.github.io/rasteret/how-to/dataset-catalog/#add-your-own-catalog-entries-advanced)
-so everyone benefits.
+## Use your own datasets
+- Use `build_from_stac()` for any STAC API
+- Use `build_from_table()` for Parquets that have TIFF URLs in them (eg., SourceCoop AlphaEarth index parquet)
+
+You can also build collections using CLI `rasteret collections build` read more details [here](https://terrafloww.github.io/rasteret/how-to/collection-management/)
+
+[Here's a guide to add a dataset to rasteret's catalog](https://terrafloww.github.io/rasteret/how-to/dataset-catalog/#add-your-own-catalog-entries-advanced)
+so everyone benefits. The catalog is open to edit by anyone and will be community-driven.
+
+Each new dataset entry is around ~20 lines of Python pointing to a STAC API or a GeoParquet file.
+One PR adds a dataset, every rasteret user sees it in `rasteret datasets list` on the next release of rasteret.
 
 ---
 
@@ -210,6 +215,16 @@ Processing pipeline: Filter 450,000 scenes -> 22 matches -> Read 44 COG files
 
 ![Single request performance](./assets/single_timeseries_request.png)
 
+#### Single Farm NDVI Time Series (1 Year, Landsat 9)
+
+Run on AWS t3.xlarge (4 CPU) —
+
+| Library | First Run | Subsequent Runs |
+|---------|-----------|-----------------|
+| **Rasterio** (Multiprocessing) | 32 s | 24 s |
+| **Rasteret** | 3 s | 3 s |
+| **Google Earth Engine** | 10–30 s | 3–5 s |
+
 ### Cold-start comparison with TorchGeo
 
 Same AOIs, same scenes, same sampler, same DataLoader. Both paths output
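
As a quick reading of the timing table added in the last hunk, the speedup of Rasteret over a baseline is the baseline time divided by Rasteret's time. The sketch below only restates the table's numbers. The `TIMINGS` dictionary and the `speedup` helper are illustrative names, and the Google Earth Engine row is left out because the table gives it as ranges.

```python
# Timings in seconds, copied from the table above (AWS t3.xlarge, 4 CPU).
TIMINGS = {
    "Rasterio (Multiprocessing)": {"first": 32.0, "subsequent": 24.0},
    "Rasteret": {"first": 3.0, "subsequent": 3.0},
}


def speedup(baseline: str, run: str) -> float:
    """Return how many times faster Rasteret is than `baseline` for `run`."""
    return TIMINGS[baseline][run] / TIMINGS["Rasteret"][run]


first_run = speedup("Rasterio (Multiprocessing)", "first")        # 32 / 3
later_runs = speedup("Rasterio (Multiprocessing)", "subsequent")  # 24 / 3

print(f"first run: {first_run:.1f}x, subsequent runs: {later_runs:.1f}x")
```

For this single-farm benchmark that works out to roughly 10.7x on the first run and 8x on subsequent runs relative to Rasterio with multiprocessing.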
