
Conversation

@kylebarron
Collaborator

In general, converting STAC to GeoParquet runs into schema inference issues, because GeoParquet needs a strict schema while STAC can have a much looser schema, or a schema that changes per row.

The current Arrow-based conversion approach offers two alternative methods:

  • a fully in-memory approach, where schema inference happens in memory. This works, but it's constrained to datasets that fit into memory at once.
  • forcing the user to provide a full schema for their data. This is quite a lot of work, and the average user will not know how to construct an Arrow schema to describe their input data.

Instead, in chatting with @bitner, we realized that we could improve on these two approaches by leveraging the knowledge that we're working with STAC spec objects. As long as the user knows which extensions are included in a collection, stac-geoparquet can pre-define the maximal Arrow schema defined by the STAC Item specification. This allows for minimal work by the end user while enabling streaming conversions of JSON data into GeoParquet.
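
For illustration, the core of such a pre-defined schema might look like the following. This is a minimal sketch, not the schema definitions used in this PR, and the geometry column is simplified:

```python
import pyarrow as pa

# Sketch of a pre-defined "maximal" schema for a few core Item fields.
# Field names follow the STAC Item spec; the exact Arrow types used by
# stac-geoparquet may differ.
CORE_ITEM_SCHEMA = pa.schema(
    [
        pa.field("id", pa.string()),
        pa.field("stac_version", pa.string()),
        pa.field("collection", pa.string()),
        # Simplified: the real conversion stores a proper geometry column
        # (WKB/GeoArrow), not GeoJSON text.
        pa.field("geometry", pa.string()),
        pa.field("bbox", pa.list_(pa.float64())),
        pa.field("datetime", pa.timestamp("us", tz="UTC")),
    ]
)
```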

To avoid the user needing to know the full set of asset names, we define assets under a Map type, which has pros and cons as noted in radiantearth/stac-geoparquet-spec#7. In particular, it's not possible to statically infer the asset key names from the Parquet schema when using a Map type, and it's also not possible to access data from only a single asset without downloading data for every asset. For example, if you wanted to know the red asset's href, you'd have to download the hrefs for all assets, whereas a struct type would allow you to access only the red href column.
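
For reference, an assets column under this approach would be typed roughly as follows (a sketch only; the real asset struct carries more fields than shown here):

```python
import pyarrow as pa

# Assets as Map<string, struct>: one entry per asset key, each value holding
# common Asset fields. The actual struct in the PR has additional fields.
asset_value = pa.struct(
    [
        pa.field("href", pa.string()),
        pa.field("type", pa.string()),
        pa.field("title", pa.string()),
        pa.field("roles", pa.list_(pa.string())),
    ]
)
assets_type = pa.map_(pa.string(), asset_value)
```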

But converting first into a Map-based GeoParquet file, as we do in this PR, could make for an efficient ingestion process, because it would allow us to quickly find the full set of asset names.
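
For example, collecting the distinct asset keys from a Map-typed assets column only touches the keys, not the asset values. A rough sketch (the file and column names here are assumptions):

```python
import pyarrow as pa
import pyarrow.compute as pc
import pyarrow.parquet as pq

# Read only the Map-typed "assets" column and collect its distinct keys.
# "items.parquet" and the column name are placeholders.
assets = pq.read_table("items.parquet", columns=["assets"])["assets"]
keys = pa.concat_arrays([chunk.keys for chunk in assets.chunks])
unique_asset_keys = pc.unique(keys)
print(unique_asset_keys.to_pylist())
```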

So this scalable STAC ingestion would become a two-step process:

  1. Convert STAC to a "flexible schema GeoParquet"
  2. Convert this intermediate Parquet format into STAC-GeoParquet spec-compliant files. This step could also exclude any columns that are defined by the spec but not included in any JSON file. (It's easy from the Parquet metadata to see if any column is fully null).

The second step becomes much, much easier when it starts from that intermediate Parquet format, instead of trying to work directly from the JSON files.
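
As an illustration of the second step, deciding which spec-defined columns are entirely null can be done from the Parquet footer alone. A hedged sketch, assuming column statistics were written and ignoring the extra care needed for nested columns:

```python
import pyarrow.parquet as pq

def fully_null_columns(path: str) -> list[str]:
    """Return the column paths whose values are null in every row group."""
    meta = pq.ParquetFile(path).metadata
    nulls: dict[str, int] = {}
    values: dict[str, int] = {}
    for rg in range(meta.num_row_groups):
        group = meta.row_group(rg)
        for i in range(group.num_columns):
            chunk = group.column(i)
            if chunk.statistics is None:
                return []  # can't decide without column statistics
            name = chunk.path_in_schema
            nulls[name] = nulls.get(name, 0) + chunk.statistics.null_count
            values[name] = values.get(name, 0) + chunk.num_values
    return [name for name in nulls if nulls[name] == values[name]]
```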

Change list

  • Implement class-based schema handlers (PartialSchema). Note that this requires a certain amount of complexity because the schema for how we want data to reside in memory is not necessarily the same as the schema used for parsing input dicts.
    • Implement "partial" schemas for the core item spec and for several popular extensions
    • This also allows users who have data with a custom extension to implement only the custom fields defined by their extension, instead of creating a full STAC schema from scratch.
  • Test (successfully) with NAIP STAC input from Planetary Computer

This relies heavily on pyarrow.unify_schemas to work with partial schemas (one for the core spec and one for each extension).
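
Roughly, the merge looks like this; the field subsets below are made up for illustration and are not the PR's actual PartialSchema definitions:

```python
import pyarrow as pa

# Hypothetical partial schemas: a few core Item properties plus the fields
# contributed by the Electro-Optical (eo) extension.
core_partial = pa.schema(
    [
        pa.field("id", pa.string()),
        pa.field("datetime", pa.timestamp("us", tz="UTC")),
    ]
)
eo_partial = pa.schema(
    [
        pa.field("eo:cloud_cover", pa.float64()),
        pa.field("eo:snow_cover", pa.float64()),
    ]
)

# unify_schemas merges the field lists (reconciling any fields that appear in
# more than one partial schema) into the schema used for parsing and writing.
full_schema = pa.unify_schemas([core_partial, eo_partial])
```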

This continues the discussion started in radiantearth/stac-geoparquet-spec#7.

@bitner
Collaborator

bitner commented May 28, 2025

@kylebarron thinking about this again ...

Looking at how the data actually gets laid out with these structures, I think there could be some other benefits to using Map even as the final storage layout.

With assets as a Struct, the nested asset attributes (href, rel, type, etc.) end up as one column per asset attribute per asset key, whereas with a Map(String, Struct) there is a single column for all hrefs, a single column for all rels, and so on. This should compress much better than the struct representation; it should be particularly effective for any assets with proj:* definitions. For the case you put forward of getting just the red href, a reader wouldn't need to touch the entire wide red asset struct; it would just pull from that single href column. If you were fetching a lot of hrefs (and particularly hrefs from multiple assets), any benefit from a Struct might actually shift in favor of a Map.

I'm going to try to benchmark some Parquet file sizes and query performance with a basic table containing just an id and the assets stored as either a Map or a Struct, to see whether my suspicions hold true.
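
A sketch of one way to set up that comparison with pyarrow (asset keys, hrefs, and row counts are invented for the sketch; the actual measurements below used DuckDB):

```python
import os

import pyarrow as pa
import pyarrow.parquet as pq

# Toy comparison: the same asset data stored once as a Struct column and once
# as a Map column, written with two compression codecs.
asset = pa.struct([("href", pa.string()), ("type", pa.string())])
schema = pa.schema(
    [
        ("id", pa.string()),
        ("assets_struct", pa.struct([("red", asset), ("nir", asset)])),
        ("assets_map", pa.map_(pa.string(), asset)),
    ]
)
rows = [
    {
        "id": f"item-{i}",
        "assets_struct": {
            "red": {"href": f"s3://bucket/{i}/red.tif", "type": "image/tiff"},
            "nir": {"href": f"s3://bucket/{i}/nir.tif", "type": "image/tiff"},
        },
        "assets_map": [
            ("red", {"href": f"s3://bucket/{i}/red.tif", "type": "image/tiff"}),
            ("nir", {"href": f"s3://bucket/{i}/nir.tif", "type": "image/tiff"}),
        ],
    }
    for i in range(10_000)
]
table = pa.Table.from_pylist(rows, schema=schema)

for layout in ("assets_struct", "assets_map"):
    for compression in ("snappy", "zstd"):
        path = f"{layout}_{compression}.parquet"
        pq.write_table(table.select(["id", layout]), path, compression=compression)
        print(layout, compression, os.path.getsize(path), "bytes")
```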

@gadomski
Member

@bitner while you're experimenting, I'll bring in the idea of moving assets up to the top level, in the same way we did with properties (e.g. stac-utils/rustac#725). Cholmes experimented with this on a couple of million S2 items and it went well. I'm thinking something like this:

| key | value |
| --- | --- |
| `id` | the item id |
| `datetime` | the item datetime, brought up out of properties |
| `asset:<key-a>` | an asset |
| `asset:<key-b>` | another asset |

I think a lot of access patterns for stac-geoparquet are only going to want to grab/query on one asset, not all of them, so anything that enables access to only one asset would be a win.
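
A rough sketch of what that flattened layout could look like as an Arrow schema; the asset keys here are hypothetical and would in practice come from the Collection:

```python
import pyarrow as pa

# Each asset is promoted to its own top-level column named "asset:<key>".
# "visual" and "thumbnail" are hypothetical keys for illustration.
asset = pa.struct([("href", pa.string()), ("type", pa.string())])
flattened_schema = pa.schema(
    [
        pa.field("id", pa.string()),
        pa.field("datetime", pa.timestamp("us", tz="UTC")),
        pa.field("asset:visual", asset),
        pa.field("asset:thumbnail", asset),
    ]
)
```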

@bitner
Collaborator

bitner commented May 28, 2025

@gadomski that actually goes completely counter to where I was thinking: having separated keys, or a struct, really makes the schema management soooooo much more complicated.

The interesting thing: using DuckDB's default Snappy compression, the sample I just made ended up at 689K for the struct and 1.6M for the map. Using DuckDB's ZSTD compression, it was 374K for the struct and only 230K for the map.
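
For reference, re-encoding a file with ZSTD via the DuckDB Python package looks roughly like this (file names are placeholders, not the actual benchmark files):

```python
import duckdb

# Rewrite an existing Parquet file with ZSTD compression; the input and output
# paths are placeholders for whichever layout is being tested.
duckdb.sql(
    """
    COPY (SELECT * FROM 'items_map.parquet')
    TO 'items_map_zstd.parquet' (FORMAT PARQUET, COMPRESSION ZSTD)
    """
)
```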

@gadomski
Member

> having separated keys, or a struct, really makes the schema management soooooo much more complicated

I'm just coming from the STAC perspective, where for a given Collection you usually have the same set of asset keys for every item (maybe missing one, which could be nullable).
