@bitner
Collaborator

@bitner bitner commented May 8, 2025

Supersedes #97

  • Remove PC specific code
  • Add functions to convert data from pgstac, filtered by collection, by collection + start_datetime + end_datetime, or by a CQL2 filter, into an iterator of items, a RecordBatchReader, or a STAC GeoParquet file.
  • Add a notebook with examples of how to modify rows before dumping and how to create an incrementally updatable Parquet backup from pgstac

@bitner bitner requested a review from kylebarron May 8, 2025 15:03
@bitner
Collaborator Author

bitner commented May 8, 2025

@ghidalgo3 ^^^

    # types are consistent across all items.
    if "naip:year" in item["properties"]:
        item["properties"]["naip:year"] = int(item["properties"]["naip:year"])
    if "proj:epsg" in item["properties"]:

@ghidalgo3 ghidalgo3 May 10, 2025

I don't think stac-geoparquet should have special cases for NAIP, or any other particular collection. Can we have a callable argument to pgstac_to_iter that allows callers to inspect and modify the item before it is yielded by the iterator?

Consider this item: https://planetarycomputer.microsoft.com/api/stac/v1/collections/modis-21A2-061/items/MYD21A2.A2025113.h35v10.061.2025125160047

When I tried to export this collection, I got this error:

  File "/Users/gustavo/miniconda3/envs/pctasks312/lib/python3.12/site-packages/stac_geoparquet/pgstac_reader.py", line 89, in __call__
    item["properties"]["proj:epsg"] = int(item["properties"]["proj:epsg"])
                                      ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
TypeError: int() argument must be a string, a bytes-like object or a real number, not 'NoneType'
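
The hook requested above could look roughly like the following. This is a minimal sketch, not the actual `pgstac_to_iter` signature: `iter_items` and `fix_naip_year` are hypothetical names, and the real reader pulls items from pgstac rather than from an in-memory list. The point is only that the iterator applies a caller-supplied callable before yielding, so collection-specific fixes stay on the caller's side.

```python
from typing import Any, Callable, Iterable, Iterator, Optional

Item = dict[str, Any]
RowFunc = Callable[[Item], Item]


def iter_items(items: Iterable[Item], row_func: Optional[RowFunc] = None) -> Iterator[Item]:
    """Yield items, applying the caller-supplied row_func (if any) first.

    Hypothetical stand-in for a reader such as pgstac_to_iter; the real
    function reads from pgstac and may have a different signature.
    """
    for item in items:
        yield row_func(item) if row_func is not None else item


def fix_naip_year(item: Item) -> Item:
    """Caller-side fix: make naip:year an int so the column type is consistent."""
    props = item["properties"]
    if props.get("naip:year") is not None:
        props["naip:year"] = int(props["naip:year"])
    return item


# In-memory example items (made up for illustration).
rows = [{"id": "a", "properties": {"naip:year": "2020"}}, {"id": "b", "properties": {}}]
fixed = list(iter_items(rows, row_func=fix_naip_year))
```

With this shape, the library itself carries no NAIP-specific logic, and items without the property pass through unchanged.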

Collaborator Author

OK, I removed that and moved the special case for my test items into the row_func, which is called once per item.

I'm still having problems with item["properties"]["proj:epsg"] on OpenPC's STAC items. Some STAC items have a null properties.proj:epsg, and because this cast runs before the row_func is called, exporting fails for collections that have a null proj:epsg property, like the one linked above.

Can you remove this cast and make it the responsibility of the row_func?
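
Once the cast lives in the row_func, a caller can guard against null values. A sketch only: `cast_epsg` is a hypothetical name, and leaving a null proj:epsg untouched (rather than dropping the key) is one possible choice among several.

```python
from typing import Any


def cast_epsg(item: dict[str, Any]) -> dict[str, Any]:
    """Cast proj:epsg to int when present and non-null; leave null untouched."""
    props = item.get("properties", {})
    epsg = props.get("proj:epsg")
    if epsg is not None:
        props["proj:epsg"] = int(epsg)
    return item


# A null value no longer raises TypeError; a string value is cast to int.
with_null = cast_epsg({"properties": {"proj:epsg": None}})
with_str = cast_epsg({"properties": {"proj:epsg": "32610"}})
```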

Collaborator Author

d'oh. missed that one. removed now.

…s inline with all pgstac_to* functions, add sync pgstac_to_parquet function
@bitner
Collaborator Author

bitner commented May 12, 2025

@ghidalgo3 I made it so that you can pass a row_func that can transform items to every pgstac_to* function. I moved the code to fix the issues with naip:year in my example dataset into the row_func that is used in the notebook. I also made the sync example an actual function in the library.

)

    to_parquet(
        record_batch_reader,

Parsing output_path into a Path doesn't work well with fsspec, because it complains that only local filesystems can use Path. I don't think it's necessary to parse output_path into a Path; can you leave it as a string?
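
A related reason to keep remote destinations as plain strings, shown with the standard library only: path normalization collapses the double slash after a URL scheme. The bucket and key below are made up, and `PurePosixPath` is used so the result does not depend on the host OS. This is not the fsspec error described above, just an illustration of why Path objects and URLs don't mix well.

```python
from pathlib import PurePosixPath

url = "s3://my-bucket/exports/items.parquet"  # hypothetical remote path

# Path normalization collapses the double slash after the scheme,
# so the round-tripped string is no longer a valid URL.
mangled = str(PurePosixPath(url))
print(mangled)  # s3:/my-bucket/exports/items.parquet
```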

Collaborator Author

hmmm. Yeah, I was only thinking of the local filesystem here, and used Path because we need to make sure that directories exist before putting things in them. I'll need to look at what utilities there may be in fsspec land, or alternatively make sure that I only create directories when using a local filesystem.
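
The second option (create directories only for local paths) could be sketched with the standard library as below. `ensure_local_parent` is a hypothetical helper, not something in stac-geoparquet or fsspec, and the scheme check is a simplification: anything with a multi-letter scheme other than `file` is assumed to be remote and left alone.

```python
import os
import tempfile
from urllib.parse import urlparse


def ensure_local_parent(output_path: str) -> bool:
    """Create the parent directory only when output_path is local.

    Returns True if the path was treated as local. Hypothetical helper.
    """
    parsed = urlparse(output_path)
    scheme = parsed.scheme
    # Single-letter "schemes" are Windows drive letters (e.g. "C:\\data").
    is_local = scheme in ("", "file") or len(scheme) == 1
    if not is_local:
        return False
    # For file:// URLs, use the path component (POSIX-style URLs assumed).
    local = parsed.path if scheme == "file" else output_path
    parent = os.path.dirname(local)
    if parent:
        os.makedirs(parent, exist_ok=True)
    return True


with tempfile.TemporaryDirectory() as tmp:
    target = os.path.join(tmp, "nested", "items.parquet")
    created = ensure_local_parent(target)
    parent_exists = os.path.isdir(os.path.dirname(target))

# Remote destinations are passed through as strings, with no directory creation.
remote = ensure_local_parent("s3://my-bucket/exports/items.parquet")
```

This keeps output_path a string throughout, so the same value can be handed to an fsspec-aware writer unchanged.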

@ghidalgo3

I can't approve, but I have tested these changes in the stac-geoparquet export process for OpenPC and they work! My only ask is to add some logging.

@bitner bitner merged commit 5113f50 into main May 27, 2025
5 checks passed
@bitner bitner deleted the pgstac_reader branch May 27, 2025 20:22