Write to Delta Lake #58
Conversation
Diff context:

    from deltalake import DeltaTable
    ...
    def parse_stac_ndjson_to_delta_lake(
Should we make a generic `arrow_batches_to_delta_lake` that takes in batches, and then just make `parse_stac_ndjson_to_delta_lake` a wrapper around that?
Well, that generic `arrow_batches_to_delta_lake` function is literally just `write_deltalake`. You can pass an `Iterable[pa.RecordBatch]` directly to it. (You just also need to know the schema separately. Are you suggesting a helper that takes the first batch and passes its schema to `write_deltalake`?)
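A minimal sketch of that, assuming a `deltalake` release whose `write_deltalake` accepts an iterable of batches plus a `schema` keyword; the table path, columns, and batch generator here are made up for illustration:

```python
from datetime import datetime, timezone

import pyarrow as pa
from deltalake import write_deltalake

# Hypothetical schema known up front (an iterable of batches doesn't carry one
# the way a Table or RecordBatchReader does).
schema = pa.schema(
    [("id", pa.string()), ("datetime", pa.timestamp("us", tz="UTC"))]
)

def batches():
    # Yield pa.RecordBatch objects one at a time.
    yield pa.record_batch(
        [
            pa.array(["item-1"]),
            pa.array(
                [datetime(2020, 1, 1, tzinfo=timezone.utc)],
                type=pa.timestamp("us", tz="UTC"),
            ),
        ],
        schema=schema,
    )

# Pass the Iterable[pa.RecordBatch] and the schema straight to write_deltalake.
write_deltalake("path/to/table", batches(), schema=schema, mode="append")
```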
This PR should be ready to go; it doesn't yet solve the null-type issue described below. In a follow-up PR we may want to consider defaulting null types to string, but that may complicate schema evolution if later data has non-null values for those STAC keys.
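As a sketch of that follow-up idea, assuming it means rewriting the inferred Arrow schema before writing (the helper name is hypothetical and not part of this PR):

```python
import pyarrow as pa

def nulls_to_string(schema: pa.Schema) -> pa.Schema:
    # Replace every field inferred as the Arrow null type with a nullable
    # string field, since Delta Lake rejects null-typed columns. The data
    # would also need to be cast to this schema before writing.
    fields = [
        pa.field(f.name, pa.string(), nullable=True, metadata=f.metadata)
        if pa.types.is_null(f.type)
        else f
        for f in schema
    ]
    return pa.schema(fields)
```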
This PR adds a new function `parse_stac_ndjson_to_delta_lake` to convert a JSON source to a Delta Lake table. It is based on #57, so only look at the most recent commits; that PR should be merged first.

There's a complication here: Delta Lake refuses to write any column inferred with data type `null`. This is a problem because if all items in a STAC Collection have a `null` JSON key, that key gets inferred as an Arrow `null` type. For example, the `3dep-lidar-copc` collection in the tests has `start_datetime` and `end_datetime` fields, and so according to the spec, `datetime` is always `null`. This means we cannot write this collection to Delta Lake solely with automatic schema inference.

In the latest commit I started to implement some manual schema modifications for `datetime` and `proj:epsg`, which fixed the error for `3dep-lidar-copc`. But `3dep-lidar-dsm` has more fields that are inferred as null; in particular, two of its schema paths are both `null`. It's not ideal to hard-code manual overrides for every extension, so we should discuss how to handle this.

Possible options:

- Remove `null` fields from the JSON before reading. This would be easier to fit into an Arrow schema, but would lose meaning. E.g. `proj:epsg` set to null has specific semantic meaning that we don't want to lose.
- Keep hard-coded manual overrides for known fields such as `datetime`, `proj:epsg`, etc.
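To make the failure mode concrete, here is a small standalone illustration (not code from this PR) of how an all-null key is inferred as the Arrow `null` type, and what a manual override for `datetime` and `proj:epsg` might look like; the concrete types chosen are assumptions:

```python
import pyarrow as pa

# Two STAC-like items where `datetime` and `proj:epsg` are always null.
items = [
    {"id": "a", "datetime": None, "proj:epsg": None},
    {"id": "b", "datetime": None, "proj:epsg": None},
]

table = pa.Table.from_pylist(items)
print(table.schema)
# id: string
# datetime: null
# proj:epsg: null
#
# Delta Lake refuses to write the two null-typed columns above.

# Manual override: cast the null-typed columns to concrete types before writing.
fixed = table.cast(
    pa.schema(
        [
            pa.field("id", pa.string()),
            pa.field("datetime", pa.timestamp("us", tz="UTC")),
            pa.field("proj:epsg", pa.int64()),
        ]
    )
)
print(fixed.schema)  # datetime and proj:epsg now have concrete (all-null) columns
```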