Description
Hi all,
I'm tentatively making a pitch to add convenience methods for converting pystac objects (Asset, Item, ItemCollection, ...) and their linked assets to commonly used data containers (xarray.Dataset, geopandas.GeoDataFrame, pandas.DataFrame, etc.).
I'm opening this in pystac since this is primarily about convenience, so that users can method-chain their way from a STAC Catalog to a data container, and pystac owns the namespaces I care about. You can already do everything I'm showing today without any changes to pystac, but it feels less nice. I really think that pd.read_csv is part of why Python is where it is today for data analytics; I want using STAC from Python to be as easy as pd.read_csv.
Secondarily, it can elevate the best-practice way to go from STAC to data containers, by providing a top-level method similar to to_dict().
As a couple hypothetical examples, to give an idea:
```python
ds = (
    catalog
    .get_collection("sentinel-2-l2a")
    .get_item("S2B_MSIL2A_20220612T182919_R027_T24XWR_20220613T123251")
    .assets["B03"]
    .to_xarray()
)
```

Or building a datacube from a pystac-client search (pystac-client subclasses pystac):
```python
ds = (
    catalog
    .search(collections="sentinel-2-l2a", bbox=bbox)
    .get_all_items()  # ItemCollection
    .to_xarray()
)
```

Implementation details
This would be optional: pystac would not add required dependencies on pandas, xarray, etc. It would merely provide the methods Item.to_xarray, Asset.to_xarray, ... Internally, those methods would try to import the implementation and raise an ImportError at runtime if the optional dependencies aren't installed.
Speaking of the implementations, there are a few things to figure out. Some relatively complicated conversions (like ItemCollection -> xarray) are implemented multiple times (https://stackstac.readthedocs.io/, https://odc-stac.readthedocs.io/en/latest/examples.html). pystac certainly wouldn't want to re-implement that conversion and would dispatch to one or the other of those libraries (perhaps letting users decide with an engine argument).
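Dispatching on an engine argument could be as simple as this sketch; the engine names and the hand-off to stackstac.stack / odc.stac.load are assumptions about how it might be wired up, not a settled design:

```python
def item_collection_to_xarray(items, engine="stackstac", **kwargs):
    """Build a datacube from STAC items by delegating to an existing library."""
    if engine == "stackstac":
        import stackstac  # optional dependency, imported lazily

        return stackstac.stack(items, **kwargs)
    if engine == "odc-stac":
        import odc.stac  # optional dependency, imported lazily

        return odc.stac.load(items, **kwargs)
    raise ValueError(f"unknown engine {engine!r}; expected 'stackstac' or 'odc-stac'")
```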
Other conversions, like Asset -> Zarr, are so straightforward they haven't really been codified in a library yet (though I have a prototype at https://github.com/TomAugspurger/staccontainers/blob/086c2a7d46520ca5213d70716726b28ba6f36ba5/staccontainers/_xarray.py#L61-L63).
Maybe those could live in pystac; I'd be happy to maintain them.
https://github.com/TomAugspurger/staccontainers might serve as an idea of what some of these implementations would look like. It isn't too complicated.
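As a rough idea of the Asset -> Zarr case, here's a minimal sketch in the spirit of that prototype. It assumes the asset carries "xarray:storage_options" / "xarray:open_kwargs" fields (as in the xarray-assets extension); treat the function names as illustrative rather than settled API:

```python
def zarr_open_kwargs(extra_fields):
    """Collect xarray.open_zarr keyword arguments from asset metadata."""
    extra_fields = extra_fields or {}
    kwargs = {"storage_options": extra_fields.get("xarray:storage_options", {})}
    kwargs.update(extra_fields.get("xarray:open_kwargs", {}))
    return kwargs


def zarr_asset_to_dataset(href, extra_fields=None):
    import xarray as xr  # optional dependency, imported lazily

    return xr.open_zarr(href, **zarr_open_kwargs(extra_fields))
```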
Problems
A non-exhaustive list of reasons not to do this:
- It's not strictly necessary: You can do all this today, with some effort.
- It's a can of worms: Why `to_xarray` and not `to_numpy()`, `to_PIL`, ...? Why `to_pandas()` and not `to_spark()`, `to_modin`, ...?
Alternatives
Alternatively, we could recommend using intake, along with intake-stac, which would wrap pystac-client and pystac. That would be the primary "user-facing" catalog people actually interact with. It already has a rich ecosystem of drivers that convert from files to data containers. I've hit some issues trying to use intake-stac, but those could presumably be fixed with some effort.
Or more generally, another library could wrap pystac / pystac-client and provide these convenience methods. But (to me) that feels needlessly complicated.
Examples
A whole bunch of examples, under the <details>, to give some ideas of the various conversions. You can view the outputs at https://notebooksharing.space/view/db3c2096bf7e9c212425d00746bb17232e6b26cdc63731022fc2697c636ca4ed#displayOptions=.
catalog -> collection -> item -> asset -> xarray (raster)
```python
ds = (
    catalog
    .get_collection("sentinel-2-l2a")
    .get_item("S2B_MSIL2A_20220612T182919_R027_T24XWR_20220613T123251")
    .assets["B03"]
    .to_xarray()
)
```

catalog -> collection -> item -> asset -> xarray (zarr)
```python
ds = (
    catalog
    .get_collection("cil-gdpcir-cc0")
    .get_item("cil-gdpcir-INM-INM-CM5-0-ssp585-r1i1p1f1-day")
    .assets["pr"]
    .to_xarray()
)
```

catalog -> collection -> item -> asset -> xarray (references)
```python
ds = (
    catalog
    .get_collection("deltares-floods")
    .get_item("NASADEM-90m-2050-0010")
    .assets["index"]
    .to_xarray()
)
```

catalog -> collection -> item -> asset -> geodataframe
```python
df = (
    catalog
    .get_collection("us-census")
    .get_item("2020-cb_2020_us_tbg_500k")
    .assets["data"]
    .to_geopandas()
)
df.head()
```

search / ItemCollection -> geopandas
```python
df = catalog.search(collections=["sentinel-2-l2a"], bbox=[9.4, 0, 9.5, 1]).to_geopandas()
df
```

Proposed Methods
We should figure out what the "target" of each of these `to_*` methods is. I think there are a couple of things to decide:
- Do we target container types (`to_dataframe`, `to_dataarray`) or libraries (`to_pandas`, `to_geopandas`, `to_xarray`, ...)?
- Do we support larger-than-memory results through an argument (`to_dataframe(..., engine="dask")` or `to_dataframe(..., npartitions=...)`) or through alternative methods (`.to_dask_dataframe()`)?
But I would loosely propose:

- `Asset.to_xarray` -> `xarray.Dataset`
- `{Item,ItemCollection}.to_xarray` -> `xarray.Dataset`
- `Asset.to_geopandas` -> `geopandas.GeoDataFrame`
- `Asset.to_pandas` -> `pandas.DataFrame`
- `Asset.to_dask_geopandas` -> `dask_geopandas.GeoDataFrame`
- `Asset.to_dask_dataframe` -> `dask.dataframe.DataFrame`
- `ItemCollection.to_geopandas` -> `geopandas.GeoDataFrame`
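To make the ItemCollection.to_geopandas entry concrete, here's a minimal sketch. It assumes items follow the standard STAC Item GeoJSON layout, and the helper names are made up for illustration:

```python
def item_records(item_dicts):
    """Flatten STAC item dicts into rows: id, GeoJSON geometry, plus properties."""
    return [
        {"id": item["id"], "geometry": item["geometry"], **item["properties"]}
        for item in item_dicts
    ]


def to_geopandas(item_dicts):
    import geopandas  # optional dependencies, imported lazily
    from shapely.geometry import shape

    rows = item_records(item_dicts)
    for row in rows:
        row["geometry"] = shape(row["geometry"])  # GeoJSON dict -> shapely geometry
    return geopandas.GeoDataFrame(rows, geometry="geometry", crs="EPSG:4326")
```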
There's a bunch more to figure out, but hopefully that's enough to get started.