Commit 48aea6a

committed
major revamp of readme thanks to user feedback
1 parent 6f68755 commit 48aea6a

1 file changed

+173
-47
lines changed

README.md

Lines changed: 173 additions & 47 deletions
@@ -1,25 +1,142 @@
11
# 🛰️ Rasteret
22

3-
Fast and efficient access to Cloud-Optimized GeoTIFFs (COGs), optimized for Sentinel-2 and Landsat data.
3+
Faster querying of Cloud-Optimized GeoTIFFs (COGs) with fewer HTTP requests in your workflows. Currently tested with Sentinel-2 and Landsat COG files.
44

55
> [!WARNING]
66
> Work-in-progress library. The APIs are subject to change, and as such, documentation is not yet available.
77
8+
## Table of Contents
9+
- [Features](#-features)
10+
- [Why Rasteret?](#why-this-library)
11+
- [Built-in Data Sources](#-built-in-data-sources)
12+
- [Prerequisites](#-prerequisites)
13+
- [Installation](#-installation)
14+
- [Quick Start](#-quick-start)
15+
- [License](#-license)
16+
- [Contributing](#-contributing)
17+
18+
---
19+
820
## 🚀 Features
921
- Fast byte-range based COG access
10-
- STAC Geoparquet creation with COG internal metadata columns
11-
- Paid public data support (AWS S3 Landsat)
22+
- STAC Geoparquet creation with COG header metadata
23+
- Paid S3 bucket support (AWS S3 Landsat)
1224
- Xarray and GeoDataFrame outputs
1325
- Parallel data loading
1426
- Simple high-level API
1527

28+
---
29+
30+
## Why this library?
31+
32+
### 💡 The Problem
33+
34+
Currently satellite image access requires multiple HTTP requests:
35+
- Initial request to read COG headers
36+
- Additional requests if headers are split
37+
- Final requests for actual data tiles
38+
- These requests repeat in new environments:
39+
  - New Python environments (for example, parallel Lambdas or parallel Docker containers in Kubernetes)
40+
  - Local environments where the GDAL cache has been cleared (for example, after a Jupyter kernel restart or a laptop restart)
41+
42+
### ✨ Rasteret's Solution
43+
44+
Rasteret reimagines how we access cloud-hosted satellite imagery by:
45+
- Creating local 'collections' with pre-cached COG file headers along with STAC metadata
46+
- Calculating exact byte ranges for image tiles needed, without header requests
47+
- Making a single optimized HTTP request per required tile
48+
- Ensuring COG file headers are never re-read across new Python environments
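
The byte-range idea in the list above can be sketched in a few lines. This is a minimal illustration, not Rasteret's actual implementation: `CachedHeader`, `tile_byte_ranges`, and `range_header` are hypothetical names, and the offsets and sizes are made-up values. It assumes a single-band tiled TIFF whose cached header exposes the standard `TileOffsets` and `TileByteCounts` tags in row-major order.

```python
from dataclasses import dataclass


@dataclass
class CachedHeader:
    """Hypothetical stand-in for COG header fields cached on local disk."""
    image_width: int
    image_height: int
    tile_width: int
    tile_height: int
    tile_offsets: list[int]      # TIFF TileOffsets tag, row-major
    tile_byte_counts: list[int]  # TIFF TileByteCounts tag, row-major


def tile_byte_ranges(header, col_off, row_off, width, height):
    """Return inclusive (start, end) byte ranges for tiles touching a pixel window."""
    tiles_across = -(-header.image_width // header.tile_width)  # ceiling division
    first_col = col_off // header.tile_width
    last_col = (col_off + width - 1) // header.tile_width
    first_row = row_off // header.tile_height
    last_row = (row_off + height - 1) // header.tile_height
    ranges = []
    for row in range(first_row, last_row + 1):
        for col in range(first_col, last_col + 1):
            idx = row * tiles_across + col
            start = header.tile_offsets[idx]
            ranges.append((start, start + header.tile_byte_counts[idx] - 1))
    return ranges


def range_header(start, end):
    """Build the HTTP Range header for one tile request."""
    return {"Range": f"bytes={start}-{end}"}


# Made-up 1024x1024 image with 512x512 tiles (2x2 tile grid).
header = CachedHeader(1024, 1024, 512, 512,
                      [1000, 5000, 9000, 13000],
                      [4000, 4000, 4000, 3500])

# A 100x100 pixel window that straddles the two tiles in the top row.
for start, end in tile_byte_ranges(header, 500, 0, 100, 100):
    print(range_header(start, end))
```

Because the offsets come from the local cache, each tile needs exactly one ranged GET and no request is spent on reading the header.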
49+
50+
### 📊 Performance Benchmarks
51+
52+
#### Speed Benchmarks
53+
54+
Test setup: Filter 1 year of STAC (100+ scenes), process 20 Sentinel-2 filtered scenes over an agricultural area, accessing RED and NIR bands (40 COG files total)
55+
56+
| Operation | Component | Rasterio | Rasteret | Notes |
57+
|-----------|-----------|----------|-----------|--------|
58+
| STAC Query | Metadata Search | 2.0s | 0.5s | Finding available scenes (STAC API vs Geoparquet) |
59+
| Data Access | Header Reading | 12s | - | ~0.3s per file (Rasterio) vs Not required (Rasteret) |
60+
| | Tile Reading | 32s | 8s | Actual data access |
61+
| **Total Data Access Time** | | **44s** | **8s** | **5.5x faster** (STAC query time excluded) |
62+
63+
The speed improvement comes from:
64+
- Querying local GeoParquet instead of STAC API endpoints
65+
- Eliminating repeated header requests
66+
- Optimized parallel data loading
67+
68+
69+
70+
#### Cost Analysis
71+
72+
Example: 1000 Landsat scenes (4 bands each) across 50 parallel environments
73+
74+
#### First Run Setup
75+
| Operation | Rasterio | Rasteret | Calculation |
76+
|-----------|----------|-----------|-------------|
77+
| Header Requests | $3.20 | $3.20 | 1000 scenes × 4 bands × 2 requests × $0.0004/1000 |
78+
| Data Tile Requests | $0.32 | $0.32 | 100 farms × 2 tiles × 4 bands × $0.0004/1000 |
79+
| **Total Per Environment** | **$3.52** | **$3.52** | One-time setup for Rasteret |
80+
81+
#### Subsequent Runs (50 Environments)
82+
| Operation | Rasterio | Rasteret | Notes |
83+
|-----------|----------|-----------|--------|
84+
| Header Requests | $160 | $0 | 50 × $3.20 (Rasterio) vs Cached headers (Rasteret)|
85+
| Data Tile Requests | $16 | $16 | 50 × $0.32 |
86+
| **Total** | **$176** | **$16** | **91% savings** |
87+
88+
#### Alternative: Full Images Download
89+
| Cost Type | Amount | Notes |
90+
|-----------|---------|--------|
91+
| Data Transfer | $576 | 6.4TB (1.6GB * 4000 files) × $0.09/GB |
92+
| Monthly Storage | $150 | Varies by provider |
93+
| GET Requests | Still needed | For company S3 access |
94+
| **Total** | **$726+** | Plus ongoing storage |
95+
96+
The cost breakdown:
97+
- Each COG file typically needs 2 requests to read its headers (~$0.0004 per 1000 requests)
98+
- With Rasteret, headers are read once during Collection creation
99+
- Subsequent access only requires data tile requests
100+
- In the above cases we assume 2 COG tiles are needed per farm
101+
- Cost savings compound with distributed processing (new Docker containers and Python environments) and with repeated processing
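
The totals and the savings percentage can be re-derived from the dollar figures in the tables above. The sketch below takes the per-environment costs ($3.20 for headers, $0.32 for data tiles) as given inputs rather than recomputing them from request counts; the variable names are illustrative only.

```python
# Per-environment costs, copied from the "First Run Setup" table.
header_cost_per_env = 3.20
tile_cost_per_env = 0.32
environments = 50

# Rasterio re-reads headers in every environment; Rasteret reuses cached headers.
rasterio_total = environments * (header_cost_per_env + tile_cost_per_env)
rasteret_total = environments * tile_cost_per_env
savings = 1 - rasteret_total / rasterio_total

# Full-download alternative: 4000 files of 1.6 GB each at $0.09/GB.
transfer_gb = 1.6 * 4000
transfer_cost = transfer_gb * 0.09

print(f"Rasterio: ${rasterio_total:.2f}")        # Rasterio: $176.00
print(f"Rasteret: ${rasteret_total:.2f}")        # Rasteret: $16.00
print(f"Savings:  {savings:.0%}")                # Savings:  91%
print(f"Full download transfer: ${transfer_cost:.2f}")  # Full download transfer: $576.00
```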
102+
103+
</details>
104+
105+
### 🎯 Key Benefits
106+
107+
108+
This makes Rasteret particularly effective for:
109+
- Time series analysis requiring many scenes
110+
- ML pipelines with multiple training runs
111+
- Production systems using serverless/container deployments
112+
- Multi-tenant applications accessing same data
113+
- Avoiding the need to convert COGs to Zarr for most use cases
114+
115+
116+
---
117+
118+
## 🌍 Built-in Data Sources
119+
- Sentinel-2 Level 2A
120+
- Earthsearch v1 [STAC Endpoint](https://earth-search.aws.element84.com/v1/collections/sentinel-2-l2a/) (AWS S3 US-West2 bucket)
121+
- Landsat Collection 2 Level 2 SR
122+
- USGS Landsatlook STAC Server [Endpoint](https://landsatlook.usgs.gov/stac-server/collections/landsat-c2l2-sr/) (AWS S3 US-West2 bucket)
123+
124+
## ⚠️ Known Limitations
125+
- Currently tested only with Sentinel-2 and Landsat 8/9 data
126+
- Creating and loading Rasteret Collections on S3 is not yet supported; for now, collections must be stored on local disk
127+
128+
---
129+
16130
## 📋 Prerequisites
17131
- Python 3.10.x or 3.11.x
18132
- AWS credentials (for accessing paid AWS buckets)
19133

20134
### ⚙️ AWS Credentials Setup
21135
For accessing paid AWS buckets:
22136

137+
<details>
138+
<summary><b>Setting up AWS credentials</b></summary>
139+
23140
(Preferred) You can set up your AWS credentials by creating a `~/.aws/credentials` file with the following content:
24141

25142
```
@@ -33,15 +150,20 @@ Alternatively, you can set the credentials as environment variables:
33150
export AWS_ACCESS_KEY_ID='your_access_key'
34151
export AWS_SECRET_ACCESS_KEY='your_secret_key'
35152
```
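
Before running a job against a paid bucket, it can help to confirm that one of the two options above is in place. The following is a small standard-library sketch, not part of Rasteret: `find_aws_credentials` is a hypothetical helper, the `[default]` profile with `aws_access_key_id` and `aws_secret_access_key` keys is assumed to follow the usual AWS shared-credentials layout, and checking environment variables first is an assumption that mirrors the common AWS SDK lookup order.

```python
import configparser
import os
from pathlib import Path


def find_aws_credentials(env=None, credentials_path=None, profile="default"):
    """Return 'environment', 'credentials_file', or None if nothing is found."""
    env = os.environ if env is None else env
    if env.get("AWS_ACCESS_KEY_ID") and env.get("AWS_SECRET_ACCESS_KEY"):
        return "environment"

    path = (Path(credentials_path) if credentials_path
            else Path.home() / ".aws" / "credentials")
    parser = configparser.ConfigParser()
    parser.read(path)  # a missing file is silently ignored
    if (parser.has_section(profile)
            and parser.has_option(profile, "aws_access_key_id")
            and parser.has_option(profile, "aws_secret_access_key")):
        return "credentials_file"
    return None


source = find_aws_credentials()
print(f"AWS credentials found in: {source}" if source else "No AWS credentials found")
```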
153+
</details>
154+
155+
---
36156

37157
## 📦 Installation
38158
```bash
39159
pip install rasteret
40160
```
41161

162+
---
163+
42164
## 🏃‍♂️ Quick Start
43165

44-
1. Define Areas of Interest
166+
### 1. Define Areas of Interest
45167

46168
Create polygons for your regions of interest:
47169

@@ -70,114 +192,118 @@ aoi2_polygon = Polygon([
70192
(77.56, 13.02)
71193
])
72194

73-
# get total bounds of all polygons above
195+
# Use the total bounds of all polygons above
196+
# OR give an even larger AOI that covers all your future analysis areas
197+
# like AOI of a State or a Country
74198
bbox = aoi1_polygon.union(aoi2_polygon).bounds
75199
```
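
For reference, the `bounds` of the unioned polygons is a `(minx, miny, maxx, maxy)` tuple. The plain-Python sketch below shows the same calculation without Shapely; `total_bounds` is a hypothetical helper and the coordinates are made-up values in (longitude, latitude) order, not the AOIs used above.

```python
def total_bounds(*rings):
    """Return (minx, miny, maxx, maxy) covering every coordinate ring given."""
    xs = [x for ring in rings for x, _ in ring]
    ys = [y for ring in rings for _, y in ring]
    return (min(xs), min(ys), max(xs), max(ys))


# Hypothetical AOI rings (lon, lat).
aoi1 = [(77.55, 13.01), (77.58, 13.01), (77.58, 13.04), (77.55, 13.04)]
aoi2 = [(77.56, 13.02), (77.59, 13.02), (77.59, 13.05), (77.56, 13.05)]

bbox = total_bounds(aoi1, aoi2)
print(bbox)  # (77.55, 13.01, 77.59, 13.05)
```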
76200

77-
2. Configure Rasteret
201+
### 2. Configure Rasteret
78202

79203
Set up the basic parameters for the collection, and check your workspace directory
80204
for collections that were created earlier.
81205

82206
```python
83207
# Collection configuration
208+
209+
# Give your local collection a custom name; it is attached to the
210+
# beginning of the collection name, e.g. bangalore_202401-12_landsat.
211+
# The date range and data source name are added automatically when Rasteret creates a collection.
84212
custom_name = "bangalore"
213+
214+
# pay time and cost upfront for COG headers and STAC metadata
215+
# here we are writing 1 year worth of STAC metadata and COG file headers to local disk
85216
date_range = ("2024-01-01", "2024-12-31")
217+
218+
# choose from LANDSAT / SENTINEL2
86219
data_source = DataSources.LANDSAT
87220

88-
# Set up workspace
221+
# Set up workspace folder as you wish
89222
workspace_dir = Path.home() / "rasteret_workspace"
90223
workspace_dir.mkdir(exist_ok=True)
91-
)
92224

93-
# List existing collections
225+
# List existing collections, if any, in the workspace folder (default: /home/user/rasteret_workspace)
94226
collections = Rasteret.list_collections()
95227
for c in collections:
96228
print(f"- {c['name']}: {c['data_source']}, {c['date_range']}, {c['size']} scenes")
97-
98229
```
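
The comment above gives `bangalore_202401-12_landsat` as the name produced for this configuration. A guess at the naming pattern, inferred only from that one example, is sketched below; `collection_name` is a hypothetical function, not Rasteret's API, and how Rasteret names collections whose date range spans more than one calendar year is not covered here.

```python
from datetime import date


def collection_name(custom_name, date_range, data_source):
    """Build '<custom>_<YYYYMM start>-<MM end>_<source>' (pattern assumed from one example)."""
    start = date.fromisoformat(date_range[0])
    end = date.fromisoformat(date_range[1])
    return f"{custom_name}_{start:%Y%m}-{end:%m}_{data_source.lower()}"


print(collection_name("bangalore", ("2024-01-01", "2024-12-31"), "LANDSAT"))
# bangalore_202401-12_landsat
```

Listing existing collections first, as in the step above, avoids having to guess the exact name.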
99-
3. Initialize and Create Collection
100230

101-
Create or Load a local collection:
102-
Containing internal COG metadata of scenes, and its STAC metadata
231+
### 3. Initialize and Create Collection
103232

104233
```python
105234
# Try loading existing collection
106235
try:
107-
# example name
236+
# example name given here
108237
processor = Rasteret.load_collection("bangalore_202401-12_landsat")
109238
except ValueError:
110-
# Create new collection
239+
240+
# Instantiate the Class
111241
processor = Rasteret(
112242
custom_name="bangalore",
113243
data_source=DataSources.LANDSAT,
114244
date_range=("2024-01-01", "2024-12-31")
115245
)
246+
247+
# and create a new collection
248+
# here we give the bbox for which STAC items and their COG headers will be
249+
# downloaded to local disk, filter for the LANDSAT 8 platform using PySTAC filters
250+
# on the USGS Landsat STAC, and apply a scene-level cloud cover filter
116251
processor.create_collection(
117252
bbox=bbox,
118253
cloud_cover_lt=20,
119254
platform={"in": ["LANDSAT_8"]}
120255
)
121256
```
122257

123-
4. Query collection and Process Data
124-
258+
### 4. Query the Collection and Process Data
125259

126260
```python
127-
# Query collection with filters to get the data you want
261+
# Query collection created above with filters to get the data you want
262+
# in this case 2 geometries, 2 bands, and a few PySTAC search filters are provided
128263
ds = processor.get_xarray(
129264
geometries=[aoi1_polygon,aoi2_polygon],
130265
bands=["B4", "B5"],
131266
cloud_cover_lt=20,
132267
date_range=["2024-01-10", "2024-01-30"]
133268
)
134-
135-
# returns an xarray dataset with the data for the geometries and bands specified
269+
# this returns an xarray dataset variable "ds" with the data for the geometries and bands specified
270+
# behind the scenes, the library is efficiently filtering the local STAC geoparquet,
271+
# for the LANDSAT scenes that pass the filters and dates provided
272+
# then it gets the tif URLs of the requested bands
273+
# then grabbing COG tiles only for the geometries from those tif files
274+
# and creating an xarray dataset for each geometry and its time series data
136275

137276
# Calculate NDVI
138277
ndvi = (ds.B5 - ds.B4) / (ds.B5 + ds.B4)
278+
279+
# give a data variable name for NDVI array
139280
ndvi_ds = xr.Dataset(
140281
{"NDVI": ndvi},
141282
coords=ds.coords,
142283
attrs=ds.attrs,
143284
)
144285

286+
# create an output folder if you wish to
287+
output_dir = Path(f"ndvi_results_{custom_name}")
288+
output_dir.mkdir(exist_ok=True)
289+
145290
# Save results from xarray to geotiff files, each geometry's data will be stored in
146-
# its own folder
291+
# its own folder. Here we give the file name prefix and specify
292+
# which xarray variable to save
293+
# each geometry in xarray will get its own folder
147294
output_files = save_per_geometry(ndvi_ds, output_dir, file_prefix="ndvi", data_var="NDVI")
148295

149296
for geom_id, filepath in output_files.items():
150297
print(f"Geometry {geom_id}: {filepath}")
151-
```
152-
153-
154-
## Why this library?
155-
156-
Details on why this library was made, and how it reads multiple COGs efficiently and fast -
157-
[Read the blog post here](https://blog.terrafloww.com/efficient-cloud-native-raster-data-access-an-alternative-to-rasterio-gdal/)
158-
159-
The aim of this library is to reduce the number of API calls to S3 objects (COGs), which
160-
will result in lesser time consumed for random file access and hence faster time series analysis without needing to convert COGs to other formats like Zarr or NetCDF.
161-
162-
It also reduces the cost incurred by readers of paid data sources like Landsat on AWS where GET and LIST requests are significantly reduced due to local collection of COG internal metadata.
163298

164-
Benchmarks -
299+
# example print
300+
# geometry_1 : ndvi_results_bangalore/geometry_1/ndvi_20241207.tif
301+
```
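
The NDVI step is plain band arithmetic: for Landsat 8/9, B4 is the red band and B5 is near-infrared, so NDVI = (B5 - B4) / (B5 + B4). The sketch below shows the same formula on scalar values with a guard for a zero denominator; the `ndvi` helper and the reflectance values are made up for illustration and are not part of Rasteret.

```python
def ndvi(red, nir):
    """Normalized Difference Vegetation Index for one pixel."""
    total = nir + red
    if total == 0:
        return float("nan")  # undefined when both bands are zero
    return (nir - red) / total


# Hypothetical surface reflectance values for one pixel on three dates.
red_series = [0.08, 0.10, 0.06]
nir_series = [0.40, 0.30, 0.42]

values = [ndvi(r, n) for r, n in zip(red_series, nir_series)]
print([round(v, 2) for v in values])  # [0.67, 0.5, 0.75]
```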
165302

166-
- Rasteret vs Zarr coming soon
167-
- [Rasteret vs Rasterio](https://blog.terrafloww.com/efficient-cloud-native-raster-data-access-an-alternative-to-rasterio-gdal/)
168-
169-
## 🌍 Built-in Data Sources
170-
- Sentinel-2 Level 2A
171-
- Earthsearch V1 [STAC Endpoint](https://earth-search.aws.element84.com/v1/collections/sentinel-2-l2a/) (AWS S3 US-West2 bucket)
172-
- Landsat Collection 2 Level 2 SR
173-
- USGS Landsatlook STAC Server [Endpoint](https://landsatlook.usgs.gov/stac-server/collections/landsat-c2l2-sr/)
303+
---
174304

175305
## 📝 License
176306
Apache 2.0 License
177307

178308
## 🤝 Contributing
179-
Contributions welcome!
180-
181-
## ⚠️ Known Limitations
182-
- Higher memory usage than Rasterio/GDAL
183-
- Currently tested with Sentinel-2 and Landsat 8,9 platforms
309+
Contributions welcome!
