Document best practice: I/O-free STAC item generation #369

@TomAugspurger

This issue discusses what is (IMO) a best practice for stactools packages: the ability to generate a STAC item without any I/O.

Currently, most stactools packages have a high-level stac.create_item(asset_href: str, ...) -> pystac.Item function that generates a STAC item from an href string. If the function needs to read any data or metadata, it handles that I/O itself. This is very convenient, and ideally every stactools package has a way of doing this (it is especially useful from a CLI).

Some of the more complicated stactools packages also generate cloud-optimized assets from the "source" asset at asset_href. In some of these packages, whether the output STAC item catalogs the cloud-optimized asset is directly tied to that function creating the cloud-optimized asset itself (see https://github.com/stactools-packages/goes-glm/blob/c9c3bc42685e66e0eaace599096ef6050c05eb57/src/stactools/goes_glm/stac.py#L46-L47 for example).

At a minimum, it should be easy to regenerate STAC metadata (including metadata for the cloud-optimized assets) without having to regenerate the cloud-optimized assets.

There are a couple of ways to handle this:

  1. The user passes the hrefs for both the source asset and the cloud-optimized assets. The create_item function is responsible for reading the data:

     def create_item(source_asset_href, cloud_optimized_asset_hrefs, ...):
         ...

     If the user provides cloud_optimized_asset_hrefs, then cloud-optimized asset (re)generation can be skipped.

  2. The user passes in the data (and perhaps the hrefs, to easily set the href for each asset):

     def create_item(source_data, cloud_optimized_data):
         ...
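
To make option 2 a bit more concrete, here is a minimal sketch of how an I/O-free function could sit underneath the convenient href-based entry point. Everything in it is hypothetical: `SourceMetadata`, `create_item_from_metadata`, and the `read_metadata` callable are illustrative names, not part of stactools, and the item is a plain dict rather than a `pystac.Item` so the sketch has no dependencies.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Callable, Dict, List, Optional


@dataclass
class SourceMetadata:
    """Hypothetical container for metadata already read from the source asset."""

    item_id: str
    bbox: List[float]  # [west, south, east, north]
    acquired: datetime


def create_item_from_metadata(
    metadata: SourceMetadata,
    source_href: str,
    cloud_optimized_hrefs: Optional[Dict[str, str]] = None,
) -> dict:
    """Build a STAC-item-like dict without performing any I/O."""
    west, south, east, north = metadata.bbox
    item = {
        "type": "Feature",
        "stac_version": "1.0.0",
        "id": metadata.item_id,
        "bbox": metadata.bbox,
        "geometry": {
            "type": "Polygon",
            "coordinates": [[
                [west, south], [east, south], [east, north],
                [west, north], [west, south],
            ]],
        },
        "properties": {"datetime": metadata.acquired.isoformat()},
        "assets": {"source": {"href": source_href}},
        "links": [],
    }
    # Catalog cloud-optimized assets that already exist; nothing is (re)generated.
    for key, href in (cloud_optimized_hrefs or {}).items():
        item["assets"][key] = {"href": href}
    return item


def create_item(
    source_href: str,
    read_metadata: Callable[[str], SourceMetadata],
    cloud_optimized_hrefs: Optional[Dict[str, str]] = None,
) -> dict:
    """Convenience wrapper: the only place where I/O happens."""
    metadata = read_metadata(source_href)
    return create_item_from_metadata(metadata, source_href, cloud_optimized_hrefs)
```

With this split, regenerating STAC metadata only requires calling the I/O-free function again with the existing hrefs, while the CLI can keep using the wrapper.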

Of these, I think we should steer package developers towards option 2, but I'm curious to hear others' thoughts. That's the approach taken by stac-table and xstac, and I think it works pretty well. Users are able to provide (essentially) any dataframe or Dataset and we can generate STAC metadata for it. Crucially, all of rasterio, pyarrow / dask.dataframe, and xarray can lazily read data so creating / passing around a DataFrame or Dataset doesn't actually read data (unless it's required by the method).
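
The point about lazy reading can be illustrated without any of those libraries. The sketch below uses a hypothetical `LazyDataset` stand-in (it is not the xarray API) whose cheap metadata is available immediately, while the expensive values are loaded only on first access. A hypothetical `create_item` that touches only the metadata never triggers the read.

```python
from functools import cached_property  # Python 3.8+


class LazyDataset:
    """Hypothetical stand-in for a lazily loaded dataset (not the xarray API)."""

    def __init__(self, href, attrs):
        self.href = href
        self.attrs = attrs  # cheap metadata, available up front
        self.reads = 0      # counts how often the expensive load ran

    @cached_property
    def values(self):
        # Pretend this is the expensive read of the full array.
        self.reads += 1
        return [0.0] * 1_000


def create_item(dataset):
    """Build a STAC-item-like dict from metadata only; `values` is never touched."""
    return {
        "type": "Feature",
        "id": dataset.attrs["id"],
        "properties": {"datetime": dataset.attrs["datetime"]},
        "assets": {"data": {"href": dataset.href}},
    }


ds = LazyDataset(
    "https://example.com/data.nc",
    {"id": "example", "datetime": "2022-01-01T00:00:00Z"},
)
item = create_item(ds)
print(ds.reads)  # 0: no data was read to build the item
```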

Labels: documentation, enhancement
