This issue discusses what is (IMO) a best practice for stactools packages: the ability to generate a STAC item without any I/O.
Currently most stactools packages have a high-level `stac.create_item(asset_href: str, ...) -> pystac.Item` function that generates a STAC item from an href string. If the function requires reading any data or metadata, it handles that I/O itself. This is very convenient, and ideally every stactools package has a way of doing this (it's especially useful from the CLI).
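As a minimal sketch of that pattern (the function name matches the convention above, but the body and the `data` asset key are illustrative, not an existing stactools API):

```python
def create_item(asset_href: str) -> dict:
    # In a real stactools package this would open asset_href (e.g. with
    # rasterio) to read the spatial extent and other metadata -- that is
    # the I/O this issue is about. Here we just derive an id from the href
    # and return a dict standing in for a pystac.Item.
    item_id = asset_href.rsplit("/", 1)[-1].rsplit(".", 1)[0]
    return {
        "type": "Feature",
        "id": item_id,
        "assets": {"data": {"href": asset_href}},
    }
```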
Some of the more complicated stactools packages also generate cloud-optimized assets from the "source" asset at asset_href. In some of these packages, whether the output STAC item catalogs the cloud-optimized asset is directly tied to that function creating the cloud-optimized asset itself (see https://github.com/stactools-packages/goes-glm/blob/c9c3bc42685e66e0eaace599096ef6050c05eb57/src/stactools/goes_glm/stac.py#L46-L47 for example).
At a minimum, it should be easy to regenerate STAC metadata (including metadata for the cloud-optimized assets) without having to regenerate the cloud-optimized assets.
There are a couple of ways to handle this:
1. The user passes the hrefs of both the source asset and the cloud-optimized assets, and the `create_item` method is responsible for reading the data:

   ```python
   def create_item(source_asset_href, cloud_optimized_asset_hrefs, ...):
       ...
   ```

   If the user provides `cloud_optimized_asset_hrefs`, then cloud-optimized asset (re)generation can be skipped.
2. The user passes in the data (and perhaps the hrefs, to easily set the href for each asset):

   ```python
   def create_item(source_data, cloud_optimized_data):
       ...
   ```

Of these, I think we should steer package developers towards option 2, but I'm curious to hear others' thoughts. That's the approach taken by stac-table and xstac, and I think it works pretty well. Users are able to provide (essentially) any DataFrame or Dataset and we can generate STAC metadata for it. Crucially, rasterio, pyarrow / dask.dataframe, and xarray can all read data lazily, so creating and passing around a DataFrame or Dataset doesn't actually read any data (unless the method requires it).