Skip to content

Conversation

@kylebarron
Copy link
Collaborator

@kylebarron kylebarron commented Jun 3, 2024

I was adding tests for parse_stac_ndjson_to_arrow when I discovered that that was inadvertently broken in #53 due to a refactor in the schema inference. It has been confusing to work with untyped pyarrow.RecordBatch classes because they're a black box, and we have two distinct data schemas we're working with: one that is as close to the raw JSON as possible, and another that conforms to our STAC GeoParquet schema.

This PR refactors this internally by adding RawBatch and CleanBatch wrapper types (open to better naming suggestions, but these are not public, so we can easily rename in the future). They both hold an internal pyarrow.RecordBatch but RawBatch aligns as much as possible to the raw STAC JSON representation, while CleanBatch aligns to the STAC-GeoParquet schema.

These wrapper types make it much easier to reason about the shape of the data at different points of the pipeline.

Change list

  • Add more tests for STAC conversion via arrow and ndjson
  • Create RawBatch and CleanBatch for more reliable internal typing
  • Add parquet tests

@kylebarron kylebarron requested a review from TomAugspurger June 3, 2024 08:16
@kylebarron kylebarron mentioned this pull request Jun 3, 2024
@bitner
Copy link
Collaborator

bitner commented Jun 3, 2024

Rather than Raw and Clean, I'd rather something that was more explicit. StacArrowBatch and StacJsonBatch perhaps?

@kylebarron
Copy link
Collaborator Author

I had considered JsonBatch first, but I was trying to think of a different name because it's not a collection of JSON objects. And similarly StacArrowBatch isn't that descriptive because both batches are just Arrow underneath

@TomAugspurger
Copy link
Collaborator

TomAugspurger commented Jun 3, 2024 via email

@kylebarron
Copy link
Collaborator Author

I renamed to StacArrowBatch and StacJsonBatch

@kylebarron kylebarron merged commit cf0699b into main Jun 4, 2024
@kylebarron kylebarron deleted the kyle/batch-typing branch June 25, 2024 16:11
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants