feat: Basic geospatial support #6392
Conversation
Hey @PhysicsACE, thanks for putting this together. Geospatial support is great to see. This is a big change though (~33k lines across 220 files), and for something this size we'd really need to review it incrementally. Before we dig into the code, I'd love to open up a GH discussion to align on what we want to support as core functionality vs what should be community-driven. Once we've got that scoped out, we can figure out a good PR breakdown. For example, adding a Union type would probably warrant its own PR rather than being bundled in here. Would you be up for that?
Thanks for putting this together @PhysicsACE - exciting to see geospatial moving forward! +1 to @universalmind303's suggestion. This is a solid basis for the discussion even if the PR itself needs to be broken up. Looking at this PR, there are a few architectural points I'd want to resolve there:
@universalmind303 @desmondcheongzx Hey, thank you for the prompt feedback. I agree, this change definitely needs to be broken up, and we need a clear distinction between core and community-driven functionality. I'll start breaking up the PR into smaller chunks and open a GH discussion to lay out the necessary next steps.
Related to this work: I just started working on exposing our internal Extension datatype to the public API (#6396). This should make it much easier to add geo support without needing to change as many Daft internals.
Changes Made
This is a prototype for integrating geoarrow into daft to support geospatial datatypes. This change introduces several new logical datatypes to the daft schema: wkt, wkb, point, linestring, polygon, multipoint, multilinestring, multipolygon, geometrycollection, geometry, rect and geography. Their physical types follow the geoarrow specification. Geoarrow supports two coordinate formats: a struct variant and a fixedsizelist variant. For the purposes of this draft, I opted to use the struct format. This change also introduces UnionArrays to daft, which are a key component in geoarrow for storing collections of geometries.
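For readers unfamiliar with the two coordinate formats, here is a minimal sketch of how the same points are laid out in each. It is illustrative only: plain Python lists stand in for Arrow buffers, and the helper names (`point_from_struct`, `point_from_interleaved`) are hypothetical, not part of daft or geoarrow. The GeoArrow specification calls the struct variant "separated" and the fixedsizelist variant "interleaved".

```python
# Illustrative only: plain Python lists stand in for Arrow buffers.
points = [(1.0, 2.0), (3.0, 4.0), (5.0, 6.0)]

# Struct ("separated") layout: one child array per dimension.
struct_coords = {
    "x": [x for x, _ in points],
    "y": [y for _, y in points],
}

# FixedSizeList ("interleaved") layout: one flat values buffer,
# where every LIST_SIZE consecutive values form one coordinate.
LIST_SIZE = 2
interleaved_coords = [value for point in points for value in point]


def point_from_struct(coords, i):
    """Read point i by indexing each dimension's child array."""
    return (coords["x"][i], coords["y"][i])


def point_from_interleaved(values, i, size=LIST_SIZE):
    """Read point i by slicing one fixed-size chunk of the flat buffer."""
    return tuple(values[i * size:(i + 1) * size])
```

Both layouts hold the same data; the struct form keeps each dimension contiguous, while the fixedsizelist form keeps each point contiguous.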
GeoArrow
The geoarrow crate is under development and many of its features are behind a beta version of the package. In this draft, I only migrated the array, cast, schema and geo-expr crates to create a foundation for geospatial support (disclaimer: I used Claude Code to refactor some functions). To support translation between daft and geoarrow arrays, I made some tweaks to the geoarrow internals, since geoarrow defaults to listarrays as opposed to the largelistarrays used in daft.
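To make the listarray vs largelistarray difference concrete: in Arrow, a List uses 32-bit signed offsets and a LargeList uses 64-bit signed offsets, while the child values are laid out the same way. The sketch below shows only that offset widening, using the standard `struct` module in place of real Arrow buffers. The function names are hypothetical, and this is not the code changed in the PR.

```python
import struct


def pack_offsets(offsets, large):
    """Pack offsets as little-endian int32 (List) or int64 (LargeList)."""
    code = "q" if large else "i"
    return struct.pack("<%d%s" % (len(offsets), code), *offsets)


def widen_offsets(buf32):
    """Convert a List-style int32 offsets buffer to LargeList-style int64."""
    count = len(buf32) // 4
    offsets = struct.unpack("<%di" % count, buf32)
    return struct.pack("<%dq" % count, *offsets)


# Offsets for three lists over a shared values buffer:
# list 0 = values[0:3], list 1 = values[3:3] (empty), list 2 = values[3:5].
list_offsets = pack_offsets([0, 3, 3, 5], large=False)
large_list_offsets = widen_offsets(list_offsets)
```

The offsets carry the same logical values before and after widening; only their byte width changes.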
The only current entry point to geospatial types is casting a string array to wkt, which can then be cast to other geospatial types; geospatial-specific IO is not implemented for now. This draft only adds type definitions and basic casting operations to show what a geoarrow integration in daft could look like. I have another branch that adds built-in function support and fills in the TODOs of this change, but this draft is already large enough. This code is still messy and definitely needs more review and testing, but if the general structure and approach look sound, I can try to get it review-ready in the next couple of days.
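As a rough picture of what the string to wkt to point path does conceptually, here is a pure-Python sketch that parses 2D `POINT` WKT strings into struct-style coordinates. It is not the cast kernel from this PR and not daft's API: `wkt_to_point` is a hypothetical helper, it handles only 2D points, and it rejects `POINT EMPTY` and Z/M variants.

```python
import re

# Matches 2D points such as "POINT (1 2)" or "point(30 10)".
_POINT_RE = re.compile(
    r"^\s*POINT\s*\(\s*(\S+)\s+(\S+)\s*\)\s*$", re.IGNORECASE
)


def wkt_to_point(text):
    """Parse one WKT string into {"x": ..., "y": ...}; nulls pass through."""
    if text is None:
        return None
    match = _POINT_RE.match(text)
    if match is None:
        raise ValueError("not a 2D WKT point: %r" % (text,))
    return {"x": float(match.group(1)), "y": float(match.group(2))}


# A string column standing in for the wkt array described above.
wkt_column = ["POINT (1 2)", None, "point(30 10)"]
point_column = [wkt_to_point(value) for value in wkt_column]
```

Raising on unparseable input is one possible design; a real cast might instead produce nulls for invalid rows.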
Related Issues
Based on the discussions in #3061. Pinging @desmondcheongzx for review.