
Append to data store functionality (or provide example of reasonable approach) #139

@GondekNP

Description


Below paraphrased from:
https://cloudnativegeo.slack.com/archives/C060YAB0FHV/p1747780003580869?thread_ts=1747250715.883169&cid=C060YAB0FHV

I'd like to continuously update a geoparquet-backed STAC datastore, but I don't see an obvious append workflow. Maybe this isn't idiomatic for a STAC client; perhaps we expect a STAC server to implement this logic on the backend? In a relatively quick-and-dirty prototype I put together, I addressed this by reading all the items into memory, subsetting to 'features', extending with my new items, and writing the combined result back to disk (or to blob storage). See the example below:

    # Requires: import os, from typing import Any, Dict, List, and import rustac
    async def add_items_to_parquet(
        self, fire_event_name: str, items: List[Dict[str, Any]]
    ) -> str:
        """
        Add STAC items to the consolidated GeoParquet file.

        Returns:
            Path to the updated GeoParquet file
        """
        # Validate all items using `stac_pydantic`
        for item in items:
            self.validate_stac_item(item)

        # If the parquet file doesn't exist yet, just write the items directly
        if not os.path.exists(self.parquet_path):
            await rustac.write(self.parquet_path, items, format="geoparquet")
            return self.parquet_path

        # Read the existing item collection and keep its "features" list
        all_items = await rustac.read(self.parquet_path)
        all_items = all_items["features"]

        # Combine with the new items and rewrite the whole file
        all_items.extend(items)
        await rustac.write(self.parquet_path, all_items, format="geoparquet")

        return self.parquet_path

@gadomski quickly and helpfully replied that:

... appends are not supported at the moment. right now the "recommended" approach is "just re-write it" (https://www.gadom.ski/presentations/2025-04-30-CNG.html#/4/3) as that can be pretty fast (a second or two) for decent numbers of items (tens of thousands, at least)
this is as much of a limitation in stac-geoparquet generally as it is with rustac in particular. because parquet requires a fixed schema, and STAC is very flexible, adding new STAC items to an existing parquet file is susceptible to schema mismatches (and therefore errors). that's why, for now, we've punted
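To make the schema concern concrete: Parquet derives one fixed schema from the items it is given, so items whose `properties` keys diverge are exactly what breaks a naive append. A minimal, rustac-free sketch (pure Python; the helper name `check_property_keys` and the sample items are hypothetical) that surfaces the divergence before a rewrite:

```python
from typing import Any, Dict, List


def check_property_keys(items: List[Dict[str, Any]]) -> Dict[str, set]:
    """Report which property keys each item is missing, relative to the
    union of keys across all items. Keys present on some items but not
    others are what trigger schema-mismatch errors on a Parquet write."""
    all_keys: set = set()
    for item in items:
        all_keys.update(item.get("properties", {}).keys())
    return {
        item["id"]: all_keys - item.get("properties", {}).keys()
        for item in items
    }


# Two minimal STAC-like items whose properties diverge (made-up data)
items = [
    {"id": "a", "properties": {"datetime": "2025-01-01T00:00:00Z"}},
    {"id": "b", "properties": {"datetime": "2025-01-02T00:00:00Z",
                               "eo:cloud_cover": 3.2}},
]

missing = check_property_keys(items)
# item "a" lacks "eo:cloud_cover", so appending "b" to a file written
# from "a" alone would require a schema change
```

A check like this could run before the rewrite step, either to fail fast or to backfill missing keys with nulls.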

Regardless, even if this is not rustac-py's job (it seems quite reasonable not to open this can of worms), it might be nice to include a clean example of the "just rewrite it" strategy in the docs, perhaps with some editorializing on the potential for errors with schema mismatches.
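For such a docs example, here is a hedged sketch of the "just re-write it" pattern. To keep it runnable without rustac, it persists the item collection as JSON via the stdlib; in a real geoparquet workflow the `json.load`/`json.dump` calls would be replaced by `await rustac.read(...)` / `await rustac.write(..., format="geoparquet")`, which rewrite the whole file the same way. The function name and file layout are illustrative only.

```python
import json
import os
from typing import Any, Dict, List


def rewrite_with_new_items(path: str, new_items: List[Dict[str, Any]]) -> str:
    """Append by rewriting: read everything, extend, write everything back.

    This is the whole strategy; there is no in-place append. The stdlib
    JSON calls stand in for rustac.read / rustac.write on geoparquet.
    """
    if os.path.exists(path):
        with open(path) as f:
            # Expected shape: {"type": "FeatureCollection", "features": [...]}
            collection = json.load(f)
        features = collection["features"]
    else:
        features = []

    features.extend(new_items)

    # Rewrite the whole store in one shot
    with open(path, "w") as f:
        json.dump({"type": "FeatureCollection", "features": features}, f)
    return path
```

Calling it twice with one item each leaves a two-feature collection on disk, at the cost of re-reading and re-writing everything on every call, which is why the approach stays fast only up to moderate item counts.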
