Commit 5271c30

fix(torchgeo): time_series spatial-only intersection, tighten changelog for v0.3.0
- `time_series=True` now uses spatial-only intersection, matching TorchGeo's own `RasterDataset` behaviour, where all spatially overlapping records are stacked regardless of the sampler's time slice.
- Tighten changelog language for accuracy.
- Add duckdb install note to notebook 06.

Signed-off-by: print-sid8 <sidsub94@gmail.com>
1 parent b842e14 commit 5271c30

File tree

3 files changed: +72, -61 lines changed


docs/changelog.md

Lines changed: 41 additions & 46 deletions
@@ -5,65 +5,60 @@
 ### Highlights
 
 - License changed from AGPL-3.0-only to **Apache-2.0**.
-- **Dataset catalog**: `build()`, `register()`, `register_local()` with a
-  growing catalog of pre-registered datasets across Earth Search, Planetary
-  Computer.
-- **Multi-cloud obstore backend**: native S3, Azure Blob, and GCS routing
-  via URL auto-detection. Cross-cloud credential provider guard with
-  automatic fallback to anonymous access.
+- **Dataset catalog**: `build()` with 13 pre-registered datasets across
+  Earth Search, Planetary Computer, and AlphaEarth Foundation.
+  `register_local()` for adding your own.
+- **Multi-cloud obstore backend**: S3, Azure Blob, and GCS routing via URL
+  auto-detection, with automatic fallback to anonymous access.
 - **`create_backend()`** for authenticated reads with obstore credential
-  providers (e.g., Planetary Computer SAS).
-- **Local catalog persistence**: `register_local()` persists to
-  `~/.rasteret/datasets.local.json`; `export_local_descriptor()` for
-  sharing catalog entries alongside Collections.
-- **Torchgeo GeoDataset**: Adapter created that use rasteret's own I/O parts to create a Torchgeo
-  GeoDataset.
-- **Native dtype preservation**: COG tiles return in their source dtype (uint16, int8,
-  float32, etc.). No forced float32 conversion.
-- **Rasterio-aligned masking defaults**: AOI reads now default to `all_touched=False`
-  and fill masked/outside-coverage pixels with `nodata` when present, otherwise `0`.
-  The primary read API (`read_cog`) returns a `valid_mask`.
-- **rioxarray removed**: CRS encoding uses pyproj CF conventions directly (WKT2, PROJJSON,
-  GeoTransform). The `xarray` extra no longer pulls rioxarray.
-- **Extended TIFF header parsing**: nodata, SamplesPerPixel, PlanarConfiguration,
-  PhotometricInterpretation, ExtraSamples, GeoDoubleParams CRS support.
-- **Cross-CRS masking**: by default, uses the exact transformed polygon (rasterio-aligned).
-  Optional bbox masking remains available for bbox-style workflows.
-- **Multi-CRS auto-reprojection**: queries spanning multiple UTM zones automatically
-  reproject to the most common CRS. Cross-CRS reprojection uses GDAL's
-  `calculate_default_transform` for correct resolution handling.
-
+  providers (e.g., Planetary Computer SAS tokens).
+- **TorchGeo adapter**: `collection.to_torchgeo_dataset()` returns a
+  `GeoDataset` backed by Rasteret's async COG reader. Supports
+  `time_series=True` (`[T, C, H, W]` output), multi-CRS reprojection,
+  and works with all TorchGeo samplers and collation helpers.
+- **Native dtype preservation**: COG tiles return in their source dtype
+  (uint16, int8, float32, etc.) instead of forcing float32.
+- **Rasterio-aligned masking**: AOI reads default to `all_touched=False`
+  and fill outside-coverage pixels with `nodata` when present, otherwise `0`.
+  `read_cog` returns a `valid_mask`.
+- **rioxarray removed**: CRS encoding uses pyproj CF conventions directly.
+  The `xarray` extra no longer pulls rioxarray.
+- **Extended TIFF header parsing**: nodata, SamplesPerPixel,
+  PlanarConfiguration, PhotometricInterpretation, ExtraSamples,
+  GeoDoubleParams CRS support.
+- **Multi-CRS auto-reprojection**: queries spanning multiple UTM zones
+  reproject to the most common CRS using GDAL's
+  `calculate_default_transform`.
 
 ### Collection API
 
-- **Collection inspection**: `.bands`, `.bounds`, `.epsg`, `len()`, `__repr__()`,
-  `.describe()`, `.compare_to_catalog()` for quick metadata access without
-  materializing the full table.
-- **Filtering**: `collection.subset(cloud_cover_lt=..., date_range=..., bbox=...,
-  geometries=..., split=...)` for friendly filtering; `collection.where(expr)` for
-  raw Arrow dataset expressions. `select_split()` convenience wrapper.
+- **Inspection**: `.bands`, `.bounds`, `.epsg`, `len()`, `__repr__()`,
+  `.describe()`, `.compare_to_catalog()`.
+- **Filtering**: `collection.subset(cloud_cover_lt=..., date_range=...,
+  bbox=..., geometries=..., split=...)` and `collection.where(expr)` for
+  raw Arrow expressions.
 - **Sharing**: `collection.export("path/")` writes a portable copy;
   `rasteret.load("path/")` reloads it.
 
 ### Other changes
 
 - Arrow-native geometry internals (GeoArrow replaces Shapely in hot paths).
-- obstore as base dependency for Rust-native HTTP backend.
-- CLI: `rasteret collections build|list|info|delete|import`, `rasteret build` shortcut.
-- CLI: `rasteret datasets list|info|build|register-local|export-local|unregister-local`.
+- obstore as base dependency (Rust-native async HTTP).
+- CLI: `rasteret collections build|list|info|delete|import`,
+  `rasteret datasets list|info|build|register-local|export-local|unregister-local`.
+- TorchGeo `time_series=True` uses spatial-only intersection, matching
+  TorchGeo's own `RasterDataset` behaviour where all spatially overlapping
+  records are stacked regardless of the sampler's time slice.
 
 ### Tested
 
-- All three output paths (xarray, GDF, TorchGeo) are tested against direct
-  rasterio reads across 12 datasets (Sentinel-2, Landsat, NAIP, Copernicus DEM,
-  ESA WorldCover, AEF, and more). The TorchGeo path uses `rasterio.merge.merge`
-  as the oracle, matching TorchGeo's own read semantics. See
-  `test_dataset_pixel_comparison.py` and `test_network_smoke.py`.
+- All three output paths (xarray, GeoDataFrame, TorchGeo) tested against
+  direct rasterio reads across 12 datasets (Sentinel-2, Landsat, NAIP,
+  Copernicus DEM, ESA WorldCover, AEF, and more).
 
 ### Breaking changes
 
-- `get_xarray()` returns data in native COG dtype instead of always float32. Code that
-  assumed float32 output may need adjustment (e.g., `ds.B04.values.dtype` is now `uint16`
-  for Sentinel-2 instead of `float32`).
-- The `xarray` extra no longer installs rioxarray. If you depend on `ds.rio.*` methods,
-  install rioxarray separately.
+- `get_xarray()` returns data in native COG dtype instead of always float32.
+  Code that assumed float32 output may need adjustment.
+- The `xarray` extra no longer installs rioxarray. If you depend on
+  `ds.rio.*` methods, install rioxarray separately.
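The masking default described in the Highlights (outside-coverage pixels filled with `nodata` when the source defines one, otherwise `0`, plus a returned `valid_mask`) can be sketched in plain Python. This is an illustrative sketch only; the function name `apply_aoi_mask` and the nested-list layout are assumptions, not Rasteret's actual `read_cog` implementation.

```python
def apply_aoi_mask(tile, inside, nodata=None):
    """Illustrative sketch of the changelog's masking default.

    tile:   H x W nested lists of pixel values
    inside: H x W nested lists of booleans (True = pixel inside the AOI)
    Pixels outside the AOI are filled with `nodata` when present, else 0,
    and a boolean valid_mask is returned alongside the filled tile.
    """
    fill = nodata if nodata is not None else 0
    out = [
        [px if ok else fill for px, ok in zip(row, mask_row)]
        for row, mask_row in zip(tile, inside)
    ]
    valid_mask = [[bool(ok) for ok in mask_row] for mask_row in inside]
    return out, valid_mask


tile = [[10, 20], [30, 40]]
inside = [[True, False], [True, True]]
filled, valid = apply_aoi_mask(tile, inside, nodata=255)
# the one outside-coverage pixel takes the nodata value; valid mirrors `inside`
```

The same helper with `nodata=None` falls back to `0`, matching the "otherwise `0`" clause in the changelog entry.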

notebooks/06_non_stac_cog_collections.ipynb

Lines changed: 22 additions & 11 deletions
@@ -3,12 +3,14 @@
 {
 "cell_type": "markdown",
 "metadata": {},
-"source": "# Building from Parquet\n\nTutorials 01-02 build Collections from a STAC API via `build()`. But Rasteret\nworks with **any Parquet file that has GeoTIFF URLs**; no STAC API needed.\n\n`build_from_table()` reads Parquet directly from S3, GCS, or local paths.\nIt validates the schema, derives bounding boxes, and produces a standard\nCollection backed by Arrow.\n\nThis notebook uses the [Maxar Open Data](https://www.maxar.com/open-data)\ncatalog from [Source Cooperative](https://source.coop/maxar/maxar-opendata):\n14,979 sub-meter satellite scenes across 23 disaster events, published as\nSTAC GeoParquet with fully public COGs."
+"source": "# Building from Parquet\n\nTutorials 01-02 build Collections from a STAC API via `build()`. But Rasteret\nworks with **any Parquet file that has GeoTIFF URLs**; no STAC API needed.\n\n`build_from_table()` reads Parquet directly from S3, GCS, or local paths.\nIt validates the schema, derives bounding boxes, and produces a standard\nCollection backed by Arrow.\n\nThis notebook uses the [Maxar Open Data](https://www.maxar.com/open-data)\ncatalog from [Source Cooperative](https://source.coop/maxar/maxar-opendata):\n14,979 sub-meter satellite scenes across 23 disaster events, published as\nSTAC GeoParquet with fully public COGs.",
+"id": "cell-0"
 },
 {
 "cell_type": "markdown",
 "metadata": {},
-"source": "## 1. Build a Collection from a remote Parquet\n\n`build_from_table()` reads the Parquet from S3 via PyArrow (using obstore\nunder the hood), validates the four required columns (`id`, `datetime`,\n`geometry`, `assets`), and produces a Collection.\n\nSource Cooperative data lives in a public S3 bucket. Set\n`AWS_NO_SIGN_REQUEST` so PyArrow skips credential lookup."
+"source": "## 1. Build a Collection from a remote Parquet\n\n`build_from_table()` reads the Parquet from S3 via PyArrow (using obstore\nunder the hood), validates the four required columns (`id`, `datetime`,\n`geometry`, `assets`), and produces a Collection.\n\nSource Cooperative data lives in a public S3 bucket. Set\n`AWS_NO_SIGN_REQUEST` so PyArrow skips credential lookup.",
+"id": "cell-1"
 },
 {
 "cell_type": "code",
@@ -31,12 +33,14 @@
 "print(f\"Collection: {collection.name}\")\n",
 "print(f\"Scenes: {collection.dataset.count_rows()}\")\n",
 "print(f\"Columns: {collection.dataset.schema.names[:8]}...\")"
-]
+],
+"id": "cell-2"
 },
 {
 "cell_type": "markdown",
 "metadata": {},
-"source": "## 2. Explore the Collection with DuckDB\n\nThe Collection is backed by Arrow. DuckDB reads Arrow tables with zero\ncopy - pass the Python variable directly, no file I/O."
+"source": "## 2. Explore the Collection with DuckDB\n\nThe Collection is backed by Arrow. DuckDB reads Arrow tables with zero\ncopy - pass the Python variable directly, no file I/O.\n\n> Requires the `examples` extra: `pip install rasteret[examples]`",
+"id": "cell-3"
 },
 {
 "cell_type": "code",
@@ -67,12 +71,14 @@
 " GROUP BY event\n",
 " ORDER BY scenes DESC\n",
 "\"\"\").show()"
-]
+],
+"id": "cell-4"
 },
 {
 "cell_type": "markdown",
 "metadata": {},
-"source": "## 3. Filter\n\nRasteret's `subset()` and `where()` are convenience methods for common\nfilters. You can also filter the Arrow table directly with DuckDB,\nPyArrow, or pandas - whichever fits your workflow."
+"source": "## 3. Filter\n\nRasteret's `subset()` and `where()` are convenience methods for common\nfilters. You can also filter the Arrow table directly with DuckDB,\nPyArrow, or pandas - whichever fits your workflow.",
+"id": "cell-5"
 },
 {
 "cell_type": "code",
@@ -115,29 +121,34 @@
 "print(f\"PyArrow filter: {pc.sum(mask).as_py()} scenes\")\n",
 "\n",
 "print(\"\\nAll three query the same Arrow data. Use whichever fits your workflow.\")"
-]
+],
+"id": "cell-6"
 },
 {
 "cell_type": "markdown",
 "metadata": {},
-"source": "## 4. Export and share\n\nExport the Collection so a teammate can load it; no S3 access or Source\nCooperative account needed on their end."
+"source": "## 4. Export and share\n\nExport the Collection so a teammate can load it; no S3 access or Source\nCooperative account needed on their end.",
+"id": "cell-7"
 },
 {
 "cell_type": "code",
 "execution_count": null,
 "metadata": {},
 "outputs": [],
-"source": "import tempfile\nfrom pathlib import Path\n\nwith tempfile.TemporaryDirectory() as tmpdir:\n    export_path = Path(tmpdir) / \"maxar_collection\"\n    collection.export(export_path)\n\n    # Teammate loads it in one line:\n    reloaded = rasteret.load(export_path)\n    print(f\"Loaded: {reloaded.name}, {reloaded.dataset.count_rows()} scenes\")"
+"source": "import tempfile\nfrom pathlib import Path\n\nwith tempfile.TemporaryDirectory() as tmpdir:\n    export_path = Path(tmpdir) / \"maxar_collection\"\n    collection.export(export_path)\n\n    # Teammate loads it in one line:\n    reloaded = rasteret.load(export_path)\n    print(f\"Loaded: {reloaded.name}, {reloaded.dataset.count_rows()} scenes\")",
+"id": "cell-8"
 },
 {
 "cell_type": "markdown",
 "metadata": {},
-"source": "## 5. Column mapping (non-standard schemas)\n\nNot every Parquet file uses STAC column names. If your source uses different\nnames, provide a `column_map`:\n\n```python\ncollection = rasteret.build_from_table(\n    \"s3://my-bucket/my-catalog.parquet\",\n    name=\"custom\",\n    column_map={\"scene_id\": \"id\", \"timestamp\": \"datetime\"},\n)\n```\n\nRasteret requires four columns: `id`, `datetime`, `geometry`, `assets`.\nEverything else is passed through as-is. See the\n[Schema Contract](../../explanation/schema-contract/) for details."
+"source": "## 5. Column mapping (non-standard schemas)\n\nNot every Parquet file uses STAC column names. If your source uses different\nnames, provide a `column_map`:\n\n```python\ncollection = rasteret.build_from_table(\n    \"s3://my-bucket/my-catalog.parquet\",\n    name=\"custom\",\n    column_map={\"scene_id\": \"id\", \"timestamp\": \"datetime\"},\n)\n```\n\nRasteret requires four columns: `id`, `datetime`, `geometry`, `assets`.\nEverything else is passed through as-is. See the\n[Schema Contract](../../explanation/schema-contract/) for details.",
+"id": "cell-9"
 },
 {
 "cell_type": "markdown",
 "metadata": {},
-"source": "## Summary\n\n| Step | What happens |\n|------|-------------|\n| `build_from_table(s3_uri)` | Reads Parquet from S3/GCS/local, validates schema, creates Collection |\n| `collection.dataset.to_table()` | Arrow table - pass directly to DuckDB, PyArrow, pandas |\n| `collection.where()` / `subset()` | Convenience filters (Arrow pushdown) |\n| `collection.export()` -> `rasteret.load()` | Share a portable Collection |\n\n**When to use which build function:**\n\n| Situation | Use |\n|---|---|\n| Dataset in the catalog (Sentinel-2, Landsat, NAIP, ...) | `rasteret.build()` |\n| Custom STAC API not in the catalog | `rasteret.build_from_stac()` |\n| Existing Parquet with GeoTIFF URLs (this notebook) | `rasteret.build_from_table()` |\n| Someone shared a Collection with you | `rasteret.load()` |\n\nNext: [Parquet Filtering](../03_parquet_first_filtering/)"
+"source": "## Summary\n\n| Step | What happens |\n|------|-------------|\n| `build_from_table(s3_uri)` | Reads Parquet from S3/GCS/local, validates schema, creates Collection |\n| `collection.dataset.to_table()` | Arrow table - pass directly to DuckDB, PyArrow, pandas |\n| `collection.where()` / `subset()` | Convenience filters (Arrow pushdown) |\n| `collection.export()` -> `rasteret.load()` | Share a portable Collection |\n\n**When to use which build function:**\n\n| Situation | Use |\n|---|---|\n| Dataset in the catalog (Sentinel-2, Landsat, NAIP, ...) | `rasteret.build()` |\n| Custom STAC API not in the catalog | `rasteret.build_from_stac()` |\n| Existing Parquet with GeoTIFF URLs (this notebook) | `rasteret.build_from_table()` |\n| Someone shared a Collection with you | `rasteret.load()` |\n\nNext: [Parquet Filtering](../03_parquet_first_filtering/)",
+"id": "cell-10"
 }
 ],
 "metadata": {
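The notebook's §5 shows `column_map={"scene_id": "id", "timestamp": "datetime"}` for non-standard schemas. The rename-and-validate step it describes can be sketched in plain Python over a dict-of-columns table; the helper name `apply_column_map` and the dict layout are illustrative assumptions, not `build_from_table()`'s actual internals.

```python
REQUIRED = ("id", "datetime", "geometry", "assets")


def apply_column_map(table, column_map=None):
    """Rename columns (source name -> required name) and check the four
    required columns, mimicking the validation the notebook describes.
    `table` is a plain dict mapping column name -> list of values."""
    column_map = column_map or {}
    renamed = {column_map.get(name, name): col for name, col in table.items()}
    missing = [c for c in REQUIRED if c not in renamed]
    if missing:
        raise ValueError(f"missing required columns: {missing}")
    return renamed


raw = {
    "scene_id": ["s1"],
    "timestamp": ["2023-01-01T00:00:00Z"],
    "geometry": [b"..."],
    "assets": [{"visual": "https://example.com/s1.tif"}],
}
table = apply_column_map(raw, {"scene_id": "id", "timestamp": "datetime"})
# unmapped columns pass through unchanged; the four required names are now present
```

As in Rasteret's contract, anything beyond the four required columns is passed through as-is.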

src/rasteret/integrations/torchgeo.py

Lines changed: 9 additions & 4 deletions
@@ -714,11 +714,16 @@ def __getitem__(self, index: GeoSlice) -> Sample:
 
         t_step = 1 if t.step is None else int(t.step)
         if self.time_series:
-            # TorchGeo semantics: filter by time, then spatial, then stack.
-            interval = pd.Interval(t.start, t.stop, closed="both")
-            df = self.index.iloc[self.index.index.overlaps(interval)]
+            # TorchGeo semantics: time_series stacks ALL spatially
+            # overlapping records regardless of the sampler's time slice.
+            # TorchGeo's own RasterDataset effectively does this because
+            # files without parseable dates get [Timestamp.min, .max],
+            # making every sampler query match every file. Rasteret
+            # stores precise per-scene dates from STAC, so we must skip
+            # the time filter here; users control date range upstream
+            # via build(date_range=...) or collection.subset().
+            df = self.index.cx[x.start : x.stop, y.start : y.stop]
             df = df.iloc[::t_step]
-            df = df.cx[x.start : x.stop, y.start : y.stop]
         else:
             interval = pd.Interval(t.start, t.stop, closed="both")
             df = self.index.iloc[self.index.index.overlaps(interval)]
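The behavioural point behind this change can be illustrated without TorchGeo or Rasteret: with precise per-scene dates, a time filter would drop scenes the sampler's slice misses, while spatial-only selection stacks every overlapping scene. The `Record` type and both selection helpers below are a minimal stdlib sketch, not the adapter's actual code; the `hits[::t_step]` slice mirrors the `df.iloc[::t_step]` step in the diff.

```python
from dataclasses import dataclass


@dataclass
class Record:
    tmin: float           # record's time interval (precise, as from STAC)
    tmax: float
    bbox: tuple           # (xmin, ymin, xmax, ymax)


def overlaps_1d(a0, a1, b0, b1):
    return a0 <= b1 and b0 <= a1


def intersects(bbox, query):
    return (overlaps_1d(bbox[0], bbox[2], query[0], query[2])
            and overlaps_1d(bbox[1], bbox[3], query[1], query[3]))


def select_time_series(records, query_bbox, t_step=1):
    """Spatial-only selection (this commit's time_series branch): every
    spatially overlapping record is kept, whatever the time slice says."""
    hits = [r for r in records if intersects(r.bbox, query_bbox)]
    return hits[::t_step]


def select_single(records, query_bbox, t0, t1):
    """Time-then-space selection (the non-time_series branch)."""
    return [r for r in records
            if overlaps_1d(r.tmin, r.tmax, t0, t1)
            and intersects(r.bbox, query_bbox)]


scenes = [Record(0, 1, (0, 0, 10, 10)),   # two scenes, same footprint,
          Record(5, 6, (0, 0, 10, 10))]   # different acquisition dates

# A sampler time slice of [0, 2] would drop the second scene under a time
# filter, but spatial-only selection stacks both:
assert len(select_time_series(scenes, (2, 2, 4, 4))) == 2
assert len(select_single(scenes, (2, 2, 4, 4), 0, 2)) == 1
```

This is why the comment says users control the date range upstream: narrowing which scenes exist in the collection, rather than filtering at sample time, keeps the `[T, ...]` stack consistent across sampler queries.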
