Commit 5271c30

fix(torchgeo): time_series spatial-only intersection, tighten changelog for v0.3.0
- `time_series=True` now uses spatial-only intersection, matching TorchGeo's own `RasterDataset` behaviour, where all spatially overlapping records are stacked regardless of the sampler's time slice.
- Tighten changelog language for accuracy.
- Add duckdb install note to notebook 06.

Signed-off-by: print-sid8 <sidsub94@gmail.com>
1 parent b842e14 commit 5271c30

File tree

3 files changed: +72, -61 lines changed


docs/changelog.md

Lines changed: 41 additions & 46 deletions
@@ -5,65 +5,60 @@
 ### Highlights
 
 - License changed from AGPL-3.0-only to **Apache-2.0**.
-- **Dataset catalog**: `build()`, `register()`, `register_local()` with a
-  growing catalog of pre-registered datasets across Earth Search, Planetary
-  Computer.
-- **Multi-cloud obstore backend**: native S3, Azure Blob, and GCS routing
-  via URL auto-detection. Cross-cloud credential provider guard with
-  automatic fallback to anonymous access.
+- **Dataset catalog**: `build()` with 13 pre-registered datasets across
+  Earth Search, Planetary Computer, and AlphaEarth Foundation.
+  `register_local()` for adding your own.
+- **Multi-cloud obstore backend**: S3, Azure Blob, and GCS routing via URL
+  auto-detection, with automatic fallback to anonymous access.
 - **`create_backend()`** for authenticated reads with obstore credential
-  providers (e.g., Planetary Computer SAS).
-- **Local catalog persistence**: `register_local()` persists to
-  `~/.rasteret/datasets.local.json`; `export_local_descriptor()` for
-  sharing catalog entries alongside Collections.
-- **Torchgeo GeoDataset**: Adapter created that use rasteret's own I/O parts to create a Torchgeo
-  GeoDataset.
-- **Native dtype preservation**: COG tiles return in their source dtype (uint16, int8,
-  float32, etc.). No forced float32 conversion.
-- **Rasterio-aligned masking defaults**: AOI reads now default to `all_touched=False`
-  and fill masked/outside-coverage pixels with `nodata` when present, otherwise `0`.
-  The primary read API (`read_cog`) returns a `valid_mask`.
-- **rioxarray removed**: CRS encoding uses pyproj CF conventions directly (WKT2, PROJJSON,
-  GeoTransform). The `xarray` extra no longer pulls rioxarray.
-- **Extended TIFF header parsing**: nodata, SamplesPerPixel, PlanarConfiguration,
-  PhotometricInterpretation, ExtraSamples, GeoDoubleParams CRS support.
-- **Cross-CRS masking**: by default, uses the exact transformed polygon (rasterio-aligned).
-  Optional bbox masking remains available for bbox-style workflows.
-- **Multi-CRS auto-reprojection**: queries spanning multiple UTM zones automatically
-  reproject to the most common CRS. Cross-CRS reprojection uses GDAL's
-  `calculate_default_transform` for correct resolution handling.
-
+  providers (e.g., Planetary Computer SAS tokens).
+- **TorchGeo adapter**: `collection.to_torchgeo_dataset()` returns a
+  `GeoDataset` backed by Rasteret's async COG reader. Supports
+  `time_series=True` (`[T, C, H, W]` output), multi-CRS reprojection,
+  and works with all TorchGeo samplers and collation helpers.
+- **Native dtype preservation**: COG tiles return in their source dtype
+  (uint16, int8, float32, etc.) instead of forcing float32.
+- **Rasterio-aligned masking**: AOI reads default to `all_touched=False`
+  and fill outside-coverage pixels with `nodata` when present, otherwise `0`.
+  `read_cog` returns a `valid_mask`.
+- **rioxarray removed**: CRS encoding uses pyproj CF conventions directly.
+  The `xarray` extra no longer pulls rioxarray.
+- **Extended TIFF header parsing**: nodata, SamplesPerPixel,
+  PlanarConfiguration, PhotometricInterpretation, ExtraSamples,
+  GeoDoubleParams CRS support.
+- **Multi-CRS auto-reprojection**: queries spanning multiple UTM zones
+  reproject to the most common CRS using GDAL's
+  `calculate_default_transform`.
 
 ### Collection API
 
-- **Collection inspection**: `.bands`, `.bounds`, `.epsg`, `len()`, `__repr__()`,
-  `.describe()`, `.compare_to_catalog()` for quick metadata access without
-  materializing the full table.
-- **Filtering**: `collection.subset(cloud_cover_lt=..., date_range=..., bbox=...,
-  geometries=..., split=...)` for friendly filtering; `collection.where(expr)` for
-  raw Arrow dataset expressions. `select_split()` convenience wrapper.
+- **Inspection**: `.bands`, `.bounds`, `.epsg`, `len()`, `__repr__()`,
+  `.describe()`, `.compare_to_catalog()`.
+- **Filtering**: `collection.subset(cloud_cover_lt=..., date_range=...,
+  bbox=..., geometries=..., split=...)` and `collection.where(expr)` for
+  raw Arrow expressions.
 - **Sharing**: `collection.export("path/")` writes a portable copy;
   `rasteret.load("path/")` reloads it.
 
 ### Other changes
 
 - Arrow-native geometry internals (GeoArrow replaces Shapely in hot paths).
-- obstore as base dependency for Rust-native HTTP backend.
-- CLI: `rasteret collections build|list|info|delete|import`, `rasteret build` shortcut.
-- CLI: `rasteret datasets list|info|build|register-local|export-local|unregister-local`.
+- obstore as base dependency (Rust-native async HTTP).
+- CLI: `rasteret collections build|list|info|delete|import`,
+  `rasteret datasets list|info|build|register-local|export-local|unregister-local`.
+- TorchGeo `time_series=True` uses spatial-only intersection, matching
+  TorchGeo's own `RasterDataset` behaviour where all spatially overlapping
+  records are stacked regardless of the sampler's time slice.
 
 ### Tested
 
-- All three output paths (xarray, GDF, TorchGeo) are tested against direct
-  rasterio reads across 12 datasets (Sentinel-2, Landsat, NAIP, Copernicus DEM,
-  ESA WorldCover, AEF, and more). The TorchGeo path uses `rasterio.merge.merge`
-  as the oracle, matching TorchGeo's own read semantics. See
-  `test_dataset_pixel_comparison.py` and `test_network_smoke.py`.
+- All three output paths (xarray, GeoDataFrame, TorchGeo) tested against
+  direct rasterio reads across 12 datasets (Sentinel-2, Landsat, NAIP,
+  Copernicus DEM, ESA WorldCover, AEF, and more).
 
 ### Breaking changes
 
-- `get_xarray()` returns data in native COG dtype instead of always float32. Code that
-  assumed float32 output may need adjustment (e.g., `ds.B04.values.dtype` is now `uint16`
-  for Sentinel-2 instead of `float32`).
-- The `xarray` extra no longer installs rioxarray. If you depend on `ds.rio.*` methods,
-  install rioxarray separately.
+- `get_xarray()` returns data in native COG dtype instead of always float32.
+  Code that assumed float32 output may need adjustment.
+- The `xarray` extra no longer installs rioxarray. If you depend on
+  `ds.rio.*` methods, install rioxarray separately.
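The masking default described in the Highlights (outside-coverage pixels filled with `nodata` when the source defines one, otherwise `0`, plus a returned `valid_mask`) can be sketched in plain Python. This is an illustrative sketch only; the function name `apply_aoi_mask` and the nested-list layout are assumptions, not Rasteret's actual `read_cog` implementation.

```python
def apply_aoi_mask(tile, inside, nodata=None):
    """Illustrative sketch of the changelog's masking default.

    tile:   H x W nested lists of pixel values
    inside: H x W nested lists of booleans (True = pixel inside the AOI)
    Pixels outside the AOI are filled with `nodata` when present, else 0,
    and a boolean valid_mask is returned alongside the filled tile.
    """
    fill = nodata if nodata is not None else 0
    out = [
        [px if ok else fill for px, ok in zip(row, mask_row)]
        for row, mask_row in zip(tile, inside)
    ]
    valid_mask = [[bool(ok) for ok in mask_row] for mask_row in inside]
    return out, valid_mask


tile = [[10, 20], [30, 40]]
inside = [[True, False], [True, True]]
filled, valid = apply_aoi_mask(tile, inside, nodata=255)
# the one outside-coverage pixel takes the nodata value; valid mirrors `inside`
```

The same helper with `nodata=None` falls back to `0`, matching the "otherwise `0`" clause in the changelog entry.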

notebooks/06_non_stac_cog_collections.ipynb

Lines changed: 22 additions & 11 deletions
@@ -3,12 +3,14 @@
 {
 "cell_type": "markdown",
 "metadata": {},
-"source": "# Building from Parquet\n\nTutorials 01-02 build Collections from a STAC API via `build()`. But Rasteret\nworks with **any Parquet file that has GeoTIFF URLs**; no STAC API needed.\n\n`build_from_table()` reads Parquet directly from S3, GCS, or local paths.\nIt validates the schema, derives bounding boxes, and produces a standard\nCollection backed by Arrow.\n\nThis notebook uses the [Maxar Open Data](https://www.maxar.com/open-data)\ncatalog from [Source Cooperative](https://source.coop/maxar/maxar-opendata):\n14,979 sub-meter satellite scenes across 23 disaster events, published as\nSTAC GeoParquet with fully public COGs."
+"source": "# Building from Parquet\n\nTutorials 01-02 build Collections from a STAC API via `build()`. But Rasteret\nworks with **any Parquet file that has GeoTIFF URLs**; no STAC API needed.\n\n`build_from_table()` reads Parquet directly from S3, GCS, or local paths.\nIt validates the schema, derives bounding boxes, and produces a standard\nCollection backed by Arrow.\n\nThis notebook uses the [Maxar Open Data](https://www.maxar.com/open-data)\ncatalog from [Source Cooperative](https://source.coop/maxar/maxar-opendata):\n14,979 sub-meter satellite scenes across 23 disaster events, published as\nSTAC GeoParquet with fully public COGs.",
+"id": "cell-0"
 },
 {
 "cell_type": "markdown",
 "metadata": {},
-"source": "## 1. Build a Collection from a remote Parquet\n\n`build_from_table()` reads the Parquet from S3 via PyArrow (using obstore\nunder the hood), validates the four required columns (`id`, `datetime`,\n`geometry`, `assets`), and produces a Collection.\n\nSource Cooperative data lives in a public S3 bucket. Set\n`AWS_NO_SIGN_REQUEST` so PyArrow skips credential lookup."
+"source": "## 1. Build a Collection from a remote Parquet\n\n`build_from_table()` reads the Parquet from S3 via PyArrow (using obstore\nunder the hood), validates the four required columns (`id`, `datetime`,\n`geometry`, `assets`), and produces a Collection.\n\nSource Cooperative data lives in a public S3 bucket. Set\n`AWS_NO_SIGN_REQUEST` so PyArrow skips credential lookup.",
+"id": "cell-1"
 },
 {
 "cell_type": "code",
@@ -31,12 +33,14 @@
 "print(f\"Collection: {collection.name}\")\n",
 "print(f\"Scenes: {collection.dataset.count_rows()}\")\n",
 "print(f\"Columns: {collection.dataset.schema.names[:8]}...\")"
-]
+],
+"id": "cell-2"
 },
 {
 "cell_type": "markdown",
 "metadata": {},
-"source": "## 2. Explore the Collection with DuckDB\n\nThe Collection is backed by Arrow. DuckDB reads Arrow tables with zero\ncopy - pass the Python variable directly, no file I/O."
+"source": "## 2. Explore the Collection with DuckDB\n\nThe Collection is backed by Arrow. DuckDB reads Arrow tables with zero\ncopy - pass the Python variable directly, no file I/O.\n\n> Requires the `examples` extra: `pip install rasteret[examples]`",
+"id": "cell-3"
 },
 {
 "cell_type": "code",
@@ -67,12 +71,14 @@
 " GROUP BY event\n",
 " ORDER BY scenes DESC\n",
 "\"\"\").show()"
-]
+],
+"id": "cell-4"
 },
 {
 "cell_type": "markdown",
 "metadata": {},
-"source": "## 3. Filter\n\nRasteret's `subset()` and `where()` are convenience methods for common\nfilters. You can also filter the Arrow table directly with DuckDB,\nPyArrow, or pandas - whichever fits your workflow."
+"source": "## 3. Filter\n\nRasteret's `subset()` and `where()` are convenience methods for common\nfilters. You can also filter the Arrow table directly with DuckDB,\nPyArrow, or pandas - whichever fits your workflow.",
+"id": "cell-5"
 },
 {
 "cell_type": "code",
@@ -115,29 +121,34 @@
 "print(f\"PyArrow filter: {pc.sum(mask).as_py()} scenes\")\n",
 "\n",
 "print(\"\\nAll three query the same Arrow data. Use whichever fits your workflow.\")"
-]
+],
+"id": "cell-6"
 },
 {
 "cell_type": "markdown",
 "metadata": {},
-"source": "## 4. Export and share\n\nExport the Collection so a teammate can load it; no S3 access or Source\nCooperative account needed on their end."
+"source": "## 4. Export and share\n\nExport the Collection so a teammate can load it; no S3 access or Source\nCooperative account needed on their end.",
+"id": "cell-7"
 },
 {
 "cell_type": "code",
 "execution_count": null,
 "metadata": {},
 "outputs": [],
-"source": "import tempfile\nfrom pathlib import Path\n\nwith tempfile.TemporaryDirectory() as tmpdir:\n    export_path = Path(tmpdir) / \"maxar_collection\"\n    collection.export(export_path)\n\n    # Teammate loads it in one line:\n    reloaded = rasteret.load(export_path)\n    print(f\"Loaded: {reloaded.name}, {reloaded.dataset.count_rows()} scenes\")"
+"source": "import tempfile\nfrom pathlib import Path\n\nwith tempfile.TemporaryDirectory() as tmpdir:\n    export_path = Path(tmpdir) / \"maxar_collection\"\n    collection.export(export_path)\n\n    # Teammate loads it in one line:\n    reloaded = rasteret.load(export_path)\n    print(f\"Loaded: {reloaded.name}, {reloaded.dataset.count_rows()} scenes\")",
+"id": "cell-8"
 },
 {
 "cell_type": "markdown",
 "metadata": {},
-"source": "## 5. Column mapping (non-standard schemas)\n\nNot every Parquet file uses STAC column names. If your source uses different\nnames, provide a `column_map`:\n\n```python\ncollection = rasteret.build_from_table(\n    \"s3://my-bucket/my-catalog.parquet\",\n    name=\"custom\",\n    column_map={\"scene_id\": \"id\", \"timestamp\": \"datetime\"},\n)\n```\n\nRasteret requires four columns: `id`, `datetime`, `geometry`, `assets`.\nEverything else is passed through as-is. See the\n[Schema Contract](../../explanation/schema-contract/) for details."
+"source": "## 5. Column mapping (non-standard schemas)\n\nNot every Parquet file uses STAC column names. If your source uses different\nnames, provide a `column_map`:\n\n```python\ncollection = rasteret.build_from_table(\n    \"s3://my-bucket/my-catalog.parquet\",\n    name=\"custom\",\n    column_map={\"scene_id\": \"id\", \"timestamp\": \"datetime\"},\n)\n```\n\nRasteret requires four columns: `id`, `datetime`, `geometry`, `assets`.\nEverything else is passed through as-is. See the\n[Schema Contract](../../explanation/schema-contract/) for details.",
+"id": "cell-9"
 },
 {
 "cell_type": "markdown",
 "metadata": {},
-"source": "## Summary\n\n| Step | What happens |\n|------|-------------|\n| `build_from_table(s3_uri)` | Reads Parquet from S3/GCS/local, validates schema, creates Collection |\n| `collection.dataset.to_table()` | Arrow table - pass directly to DuckDB, PyArrow, pandas |\n| `collection.where()` / `subset()` | Convenience filters (Arrow pushdown) |\n| `collection.export()` -> `rasteret.load()` | Share a portable Collection |\n\n**When to use which build function:**\n\n| Situation | Use |\n|---|---|\n| Dataset in the catalog (Sentinel-2, Landsat, NAIP, ...) | `rasteret.build()` |\n| Custom STAC API not in the catalog | `rasteret.build_from_stac()` |\n| Existing Parquet with GeoTIFF URLs (this notebook) | `rasteret.build_from_table()` |\n| Someone shared a Collection with you | `rasteret.load()` |\n\nNext: [Parquet Filtering](../03_parquet_first_filtering/)"
+"source": "## Summary\n\n| Step | What happens |\n|------|-------------|\n| `build_from_table(s3_uri)` | Reads Parquet from S3/GCS/local, validates schema, creates Collection |\n| `collection.dataset.to_table()` | Arrow table - pass directly to DuckDB, PyArrow, pandas |\n| `collection.where()` / `subset()` | Convenience filters (Arrow pushdown) |\n| `collection.export()` -> `rasteret.load()` | Share a portable Collection |\n\n**When to use which build function:**\n\n| Situation | Use |\n|---|---|\n| Dataset in the catalog (Sentinel-2, Landsat, NAIP, ...) | `rasteret.build()` |\n| Custom STAC API not in the catalog | `rasteret.build_from_stac()` |\n| Existing Parquet with GeoTIFF URLs (this notebook) | `rasteret.build_from_table()` |\n| Someone shared a Collection with you | `rasteret.load()` |\n\nNext: [Parquet Filtering](../03_parquet_first_filtering/)",
+"id": "cell-10"
 }
 ],
 "metadata": {
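The notebook's §5 shows `column_map={"scene_id": "id", "timestamp": "datetime"}` for non-standard schemas. The rename-and-validate step it describes can be sketched in plain Python over a dict-of-columns table; the helper name `apply_column_map` and the dict layout are illustrative assumptions, not `build_from_table()`'s actual internals.

```python
REQUIRED = ("id", "datetime", "geometry", "assets")


def apply_column_map(table, column_map=None):
    """Rename columns (source name -> required name) and check the four
    required columns, mimicking the validation the notebook describes.
    `table` is a plain dict mapping column name -> list of values."""
    column_map = column_map or {}
    renamed = {column_map.get(name, name): col for name, col in table.items()}
    missing = [c for c in REQUIRED if c not in renamed]
    if missing:
        raise ValueError(f"missing required columns: {missing}")
    return renamed


raw = {
    "scene_id": ["s1"],
    "timestamp": ["2023-01-01T00:00:00Z"],
    "geometry": [b"..."],
    "assets": [{"visual": "https://example.com/s1.tif"}],
}
table = apply_column_map(raw, {"scene_id": "id", "timestamp": "datetime"})
# unmapped columns pass through unchanged; the four required names are now present
```

As in Rasteret's contract, anything beyond the four required columns is passed through as-is.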

src/rasteret/integrations/torchgeo.py

Lines changed: 9 additions & 4 deletions
@@ -714,11 +714,16 @@ def __getitem__(self, index: GeoSlice) -> Sample:
 
         t_step = 1 if t.step is None else int(t.step)
         if self.time_series:
-            # TorchGeo semantics: filter by time, then spatial, then stack.
-            interval = pd.Interval(t.start, t.stop, closed="both")
-            df = self.index.iloc[self.index.index.overlaps(interval)]
+            # TorchGeo semantics: time_series stacks ALL spatially
+            # overlapping records regardless of the sampler's time slice.
+            # TorchGeo's own RasterDataset effectively does this because
+            # files without parseable dates get [Timestamp.min, .max],
+            # making every sampler query match every file. Rasteret
+            # stores precise per-scene dates from STAC, so we must skip
+            # the time filter here; users control date range upstream
+            # via build(date_range=...) or collection.subset().
+            df = self.index.cx[x.start : x.stop, y.start : y.stop]
             df = df.iloc[::t_step]
-            df = df.cx[x.start : x.stop, y.start : y.stop]
         else:
             interval = pd.Interval(t.start, t.stop, closed="both")
             df = self.index.iloc[self.index.index.overlaps(interval)]
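The behavioural point behind this change can be illustrated without TorchGeo or Rasteret: with precise per-scene dates, a time filter would drop scenes the sampler's slice misses, while spatial-only selection stacks every overlapping scene. The `Record` type and both selection helpers below are a minimal stdlib sketch, not the adapter's actual code; the `hits[::t_step]` slice mirrors the `df.iloc[::t_step]` step in the diff.

```python
from dataclasses import dataclass


@dataclass
class Record:
    tmin: float           # record's time interval (precise, as from STAC)
    tmax: float
    bbox: tuple           # (xmin, ymin, xmax, ymax)


def overlaps_1d(a0, a1, b0, b1):
    return a0 <= b1 and b0 <= a1


def intersects(bbox, query):
    return (overlaps_1d(bbox[0], bbox[2], query[0], query[2])
            and overlaps_1d(bbox[1], bbox[3], query[1], query[3]))


def select_time_series(records, query_bbox, t_step=1):
    """Spatial-only selection (this commit's time_series branch): every
    spatially overlapping record is kept, whatever the time slice says."""
    hits = [r for r in records if intersects(r.bbox, query_bbox)]
    return hits[::t_step]


def select_single(records, query_bbox, t0, t1):
    """Time-then-space selection (the non-time_series branch)."""
    return [r for r in records
            if overlaps_1d(r.tmin, r.tmax, t0, t1)
            and intersects(r.bbox, query_bbox)]


scenes = [Record(0, 1, (0, 0, 10, 10)),   # two scenes, same footprint,
          Record(5, 6, (0, 0, 10, 10))]   # different acquisition dates

# A sampler time slice of [0, 2] would drop the second scene under a time
# filter, but spatial-only selection stacks both:
assert len(select_time_series(scenes, (2, 2, 4, 4))) == 2
assert len(select_single(scenes, (2, 2, 4, 4), 0, 2)) == 1
```

This is why the comment says users control the date range upstream: narrowing which scenes exist in the collection, rather than filtering at sample time, keeps the `[T, ...]` stack consistent across sampler queries.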
