feat: Basic geospatial support #6392

Draft
PhysicsACE wants to merge 2 commits into Eventual-Inc:main from PhysicsACE:geospatial

@PhysicsACE

Changes Made

This is a prototype for integrating geoarrow into Daft to support geospatial datatypes. It introduces several new logical datatypes to the Daft schema: wkt, wkb, point, linestring, polygon, multipoint, multilinestring, multipolygon, geometrycollection, geometry, rect and geography. Their physical types follow the geoarrow specification. Geoarrow supports two coordinate formats, a struct variant and a fixedsizelist variant; for this draft, I opted for the struct format. This code also introduces UnionArrays to Daft, which are a key component in geoarrow for storing collections of geometries.
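
To make the two coordinate formats concrete, here is a minimal sketch in plain Python. It is not code from this PR: the lists stand in for Arrow buffers, and the helper names `point_from_separated` and `point_from_interleaved` are hypothetical. The struct variant (called "separated" in the geoarrow specification) keeps one child buffer per dimension, while the fixedsizelist variant ("interleaved") packs all dimensions into a single buffer.

```python
# Illustrative only: plain Python lists stand in for Arrow buffers.
points = [(1.0, 2.0), (3.0, 4.0), (5.0, 6.0)]

# Struct ("separated") layout: one child buffer per dimension.
separated = {
    "x": [x for x, _ in points],
    "y": [y for _, y in points],
}

# FixedSizeList ("interleaved") layout: a single buffer of x, y, x, y, ... values.
interleaved = [value for point in points for value in point]


def point_from_separated(coords, i):
    """Read point i by indexing each dimension buffer at the same position."""
    return (coords["x"][i], coords["y"][i])


def point_from_interleaved(values, i, dims=2):
    """Read point i by slicing dims consecutive values from the shared buffer."""
    start = i * dims
    return tuple(values[start:start + dims])


# Both layouts describe the same logical points.
for i in range(len(points)):
    assert point_from_separated(separated, i) == point_from_interleaved(interleaved, i)
```

Both layouts carry the same information; the choice mainly affects how per-dimension access and interop with other libraries behave.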

GeoArrow

The geoarrow crate is under development and many of its features are behind a beta version of the package. In this draft, I only migrated the array, cast, schema and geo-expr crates (disclaimer: I used Claude Code to refactor some functions) to create a foundation for geospatial support. I made some tweaks to the geoarrow internals to support translation between Daft and geoarrow arrays, because geoarrow defaults to listarrays as opposed to the largelistarrays used in Daft.

Currently, the only entry point to geospatial types is casting a string array to wkt, which can then be cast to other geospatial types; geospatial-specific IO is not implemented for now. This draft only adds type definitions and basic casting operations to show what a geoarrow integration in Daft could look like. I have another branch that adds built-in function support and fills in the TODOs of this change, but this draft is already large enough. The code is still messy and definitely needs more review and testing, but if the general structure and approach look sound, I can try to get it review-ready in the next couple of days.
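
As a rough illustration of what a string to wkt to geometry cast has to do per value, the sketch below parses a 2D WKT point and re-encodes it as WKB using only the Python standard library. It does not use the Daft or geoarrow APIs, and the function names (`parse_wkt_point`, `point_to_wkb`, `wkb_to_point`, `cast_wkt_points`) are hypothetical. It assumes little-endian WKB and handles only `POINT`, whereas the real cast covers every geometry type.

```python
import re
import struct

# Matches e.g. "POINT (1 2)" or "point(1.5 -2e3)"; 2D points only.
_POINT_RE = re.compile(
    r"^\s*POINT\s*\(\s*([-+0-9.eE]+)\s+([-+0-9.eE]+)\s*\)\s*$",
    re.IGNORECASE,
)


def parse_wkt_point(text):
    """Parse a 2D WKT point into an (x, y) tuple of floats."""
    match = _POINT_RE.match(text)
    if match is None:
        raise ValueError(f"not a 2D WKT point: {text!r}")
    return float(match.group(1)), float(match.group(2))


def point_to_wkb(x, y):
    """Encode a point as WKB: byte order 1 (little endian), type 1 (Point), x, y."""
    return struct.pack("<BIdd", 1, 1, x, y)


def wkb_to_point(data):
    """Decode a little-endian WKB point back into an (x, y) tuple."""
    byte_order, geom_type, x, y = struct.unpack("<BIdd", data)
    if byte_order != 1 or geom_type != 1:
        raise ValueError("expected a little-endian WKB Point")
    return x, y


def cast_wkt_points(column):
    """Apply the WKT -> WKB conversion element-wise, passing nulls through."""
    return [None if v is None else point_to_wkb(*parse_wkt_point(v)) for v in column]


wkb_column = cast_wkt_points(["POINT (1 2)", None, "POINT (3.5 -4)"])
```

Round-tripping `wkb_to_point(wkb_column[0])` gives back the parsed coordinates, which is the kind of property the cast tests in this draft would want to check across all geometry types.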

Related Issues

Based on the discussions in #3061. Pinging @desmondcheongzx for review.

@github-actions bot added the feat label on Mar 14, 2026
@universalmind303
Member

Hey @PhysicsACE, Thanks for putting this together — geospatial support is great to see. This is a big change though (~33k lines across 220 files), and for something this size we'd really need to review it incrementally.

Before we dig into the code, I'd love to open up a GH discussion to align on what we want to support as core functionality vs what should be community-driven. Once we've got that scoped out, we can figure out a good PR breakdown. For example, adding Union type would probably warrant its own PR rather than being bundled in here.

Would you be up for that?

@desmondcheongzx
Collaborator

Thanks for putting this together @PhysicsACE - exciting to see geospatial moving forward!

+1 to @universalmind303's suggestion. This is a solid basis for the discussion even if the PR itself needs to be broken up.

Looking at this PR, there's a few architectural points I'd want to resolve there:

  • Vendoring vs. depending on upstream geoarrow-rs. It looks like the bulk of this diff is a vendored fork of the geoarrow-array, geoarrow-schema, geoarrow-cast, and geoarrow-expr-geo crates with modifications to use LargeList instead of geoarrow's native List. One option we should strongly consider is depending on the upstream crates directly and handling the i32 -> i64 offset conversion at the boundary where geoarrow arrays enter Daft's internal representation. This keeps us out of the business of maintaining a fork of a pre-1.0 library. Open to other suggestions here too.

  • Scope of this PR. Right now the only entry/exit point is String -> WKT -> {geo type} -> WKT -> String via cast. No I/O integration (GeoParquet, GeoJSON, etc.), no geo expressions wired up. But geoarrow-expr-geo is included with ~3,500 lines of geo operations (area, centroid, distance, intersects, simplify, etc.) that aren't reachable from Daft yet. I'd keep the initial PR minimal: schema types + cast + tests, and bring in the expression library when we actually wire it up.

  • PR breakdown. Agree that UnionArray should be its own PR - it's a general-purpose addition to daft-core that happens to be needed by geoarrow but is independently useful. Something like:

    1. UnionArray support in daft-core
    2. Geospatial DataType variants + GeospatialMode in daft-schema
    3. Cast integration (upstream geoarrow dep + boundary conversion)
    4. Geo expressions (when ready)
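
For reference, the i32 -> i64 boundary conversion mentioned in the first point can be sketched as follows. This is a plain-Python illustration under stated assumptions, not Daft or geoarrow code: `array("i")` and `array("q")` stand in for the 32-bit and 64-bit offset buffers of List and LargeList, and `widen_offsets` and `list_slices` are hypothetical helper names. The point is that only the offsets buffer is copied and widened; the child values are left untouched.

```python
from array import array


def widen_offsets(offsets):
    """Copy List-style 32-bit offsets into a LargeList-style 64-bit buffer."""
    widened = array("q", (int(o) for o in offsets))
    if len(widened) == 0 or widened[0] < 0:
        raise ValueError("offsets must be non-empty and start at a non-negative value")
    for previous, current in zip(widened, widened[1:]):
        if current < previous:
            raise ValueError("offsets must be non-decreasing")
    return widened


def list_slices(offsets, values):
    """Materialize each list element as values[offsets[i]:offsets[i + 1]]."""
    return [values[offsets[i]:offsets[i + 1]] for i in range(len(offsets) - 1)]


# Three list elements: two values, an empty list, then three values.
offsets_i32 = array("i", [0, 2, 2, 5])
values = ["a", "b", "c", "d", "e"]

offsets_i64 = widen_offsets(offsets_i32)

# The logical lists are identical; only the offset width changed.
assert list_slices(offsets_i32, values) == list_slices(offsets_i64, values)
```

Widening is always lossless in this direction; going back from i64 to i32 would need an overflow check, which is one reason to do the conversion once at the boundary.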

@PhysicsACE
Author

@universalmind303 @desmondcheongzx Hey, thank you for the prompt feedback. I agree, this change definitely needs to be broken up and we need a clear distinction between core and community driven functionality. I'll start breaking up the PR into smaller chunks and open a GH discussion to lay out the necessary next steps.

@universalmind303
Member

Related to this work: I just started working on exposing our internal Extension datatype to the public API (#6396). This should make it much easier to add geo support without needing to change as many Daft internals.

3 participants