Description
Below paraphrased from:
https://cloudnativegeo.slack.com/archives/C060YAB0FHV/p1747780003580869?thread_ts=1747250715.883169&cid=C060YAB0FHV
I'd like to continuously update a geoparquet-backed STAC datastore, but I don't see an obvious append workflow. Maybe this isn't idiomatic for a STAC client, and we expect a STAC server to implement this logic on the backend? In a quick and dirty prototype, I addressed this by reading all existing items into memory, taking the 'features' list, extending it with my new items, and writing the combined list back to disk (or to blob storage). See the example below:
async def add_items_to_parquet(
    self, fire_event_name: str, items: List[Dict[str, Any]]
) -> str:
    """
    Add STAC items to the consolidated GeoParquet file

    Returns:
        Path to the updated GeoParquet file
    """
    # Validate all items using `stac_pydantic`
    for item in items:
        self.validate_stac_item(item)

    # If the parquet file doesn't exist yet, just write the items directly
    if not os.path.exists(self.parquet_path):
        await rustac.write(self.parquet_path, items, format="geoparquet")
        return self.parquet_path

    # Read existing items first
    all_items = await rustac.read(self.parquet_path)
    all_items = all_items["features"]

    # Combine with new items
    all_items.extend(items)

    # Write back to parquet file
    await rustac.write(self.parquet_path, all_items, format="geoparquet")
    return self.parquet_path
@gadomski quickly and helpfully replied that:
... appends are not supported at the moment. right now the "recommended" approach is "just re-write it" (https://www.gadom.ski/presentations/2025-04-30-CNG.html#/4/3) as that can be pretty fast (a second or two) for decent numbers of items (tens of thousands, at least)
this is as much of a limitation in stac-geoparquet generally as it is with rustac in particular. because parquet requires a fixed schema, and STAC is very flexible, adding new STAC items to an existing parquet file is susceptible to schema mismatches (and therefore errors). that's why, for now, we've punted
Regardless - even if it is not rustac-py's job (it seems quite reasonable not to open this can of worms), it might be nice to include a clean example of the "just rewrite it" strategy in the docs, perhaps with some editorializing on the potential for errors from schema mismatches.
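As a starting point for such a docs example, here is a minimal sketch of what "just rewrite it" plus a schema sanity check might look like. The helper names (`field_names`, `schema_drift`, `rewrite_with_new_items`) are hypothetical and not part of rustac; the `rustac.read` / `rustac.write` calls simply mirror the prototype above. The check only compares field names, so it is a rough proxy: it will not catch a field whose type changes between old and new items.

```python
import os
from typing import Any, Dict, List, Set


def field_names(items: List[Dict[str, Any]]) -> Set[str]:
    """Collect top-level keys and `properties` keys as a rough schema proxy."""
    names: Set[str] = set()
    for item in items:
        names.update(item.keys())
        names.update(f"properties.{key}" for key in item.get("properties", {}))
    return names


def schema_drift(
    existing: List[Dict[str, Any]], new: List[Dict[str, Any]]
) -> Dict[str, Set[str]]:
    """Report fields that only the new items have, and fields they lack."""
    old_fields, new_fields = field_names(existing), field_names(new)
    return {"added": new_fields - old_fields, "missing": old_fields - new_fields}


async def rewrite_with_new_items(
    path: str, new_items: List[Dict[str, Any]]
) -> Dict[str, Set[str]]:
    """Hypothetical helper: re-write the whole file and return any field drift."""
    # Imported lazily so the pure-Python helpers above work without rustac.
    import rustac

    existing: List[Dict[str, Any]] = []
    # Assumption: a local path, as in the prototype; blob storage needs another check.
    if os.path.exists(path):
        existing = (await rustac.read(path))["features"]

    drift = (
        schema_drift(existing, new_items)
        if existing
        else {"added": set(), "missing": set()}
    )
    # No append: write the full, combined item list back out.
    await rustac.write(path, existing + new_items, format="geoparquet")
    return drift
```

A caller could log or raise on a non-empty `drift["added"]` or `drift["missing"]` before deciding whether the rewritten file is acceptable; whether a given drift actually causes an error depends on how the writer handles heterogeneous items.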