-
Notifications
You must be signed in to change notification settings - Fork 30
feat(python/sedonadb): Implement GDAL/OGR formats via pyogrio #283
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
feat(python/sedonadb): Implement GDAL/OGR formats via pyogrio #283
Conversation
|
Closes #137 I assume? Just creating a link |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull Request Overview
This PR implements support for reading spatial file formats using GDAL/OGR through pyogrio in the Python sedonadb package. The changes introduce a plugin architecture for external file formats and implement the first concrete format (pyogrio) that enables reading common geospatial formats like FlatGeoBuf, GeoPackage, and GeoJSON.
Key changes:
- Added
ExternalFormatSpectrait and infrastructure for pluggable file format support - Implemented PyogrioFormatSpec for GDAL/OGR format support via pyogrio
- Added spatial filter pushdown capability with bounding box extraction
- Implemented interval intersection methods for spatial filtering optimization
Reviewed Changes
Copilot reviewed 18 out of 19 changed files in this pull request and generated 3 comments.
Show a summary per file
| File | Description |
|---|---|
rust/sedona/src/context.rs |
Added read_external_format() method and tests for external format reading |
rust/sedona-geometry/src/interval.rs |
Added is_full() and intersection() methods to interval traits |
rust/sedona-geometry/src/bounding_box.rs |
Added intersection() method for bounding box operations |
rust/sedona-expr/src/spatial_filter.rs |
Added filter_bbox() method and Clone trait for spatial filter pushdown |
rust/sedona-datasource/src/utility.rs |
New file implementing ProjectedRecordBatchReader for column projection |
rust/sedona-datasource/src/spec.rs |
Enhanced Object with URL reconstruction and display, added tests |
rust/sedona-datasource/src/format.rs |
Added schema inference validation for empty object lists |
python/sedonadb/src/datasource.rs |
New file implementing Python bindings for external format specs |
python/sedonadb/src/context.rs |
Added read_external_format() method to Python context |
python/sedonadb/python/sedonadb/datasource.py |
New file with ExternalFormatSpec and PyogrioFormatSpec implementations |
python/sedonadb/python/sedonadb/context.py |
Added read_pyogrio() public API method |
python/sedonadb/tests/test_datasource.py |
New test file for pyogrio format functionality |
Various Cargo.toml files |
Added dependencies for new functionality |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
Co-authored-by: Copilot <[email protected]>
Co-authored-by: Copilot <[email protected]>
| /// Compute the maximum extent of a filter for a specific column index | ||
| /// | ||
| /// Some spatial file formats have the ability to push down a bounding box | ||
| /// into an index. This function allows deriving that bounding box based | ||
| /// on what DataFusion provides, which is a physical expression. | ||
| /// | ||
| /// Note that this always succeeds; however, for a non-spatial expression or | ||
| /// a non-spatial expression that is unsupported, the full bounding box is | ||
| /// returned. | ||
| pub fn filter_bbox(&self, column_index: usize) -> BoundingBox { | ||
| match self { | ||
| SpatialFilter::Intersects(column, bounding_box) | ||
| | SpatialFilter::Covers(column, bounding_box) => { | ||
| if column.index() == column_index { | ||
| return bounding_box.clone(); | ||
| } | ||
| } | ||
| SpatialFilter::And(lhs, rhs) => { | ||
| let lhs_box = lhs.filter_bbox(column_index); | ||
| let rhs_box = rhs.filter_bbox(column_index); | ||
| if let Ok(bounds) = lhs_box.intersection(&rhs_box) { | ||
| return bounds; | ||
| } | ||
| } | ||
| SpatialFilter::Or(lhs, rhs) => { | ||
| let mut bounds = lhs.filter_bbox(column_index); | ||
| bounds.update_box(&rhs.filter_bbox(column_index)); | ||
| return bounds; | ||
| } | ||
| SpatialFilter::LiteralFalse => { | ||
| return BoundingBox::xy(Interval::empty(), Interval::empty()) | ||
| } | ||
| SpatialFilter::HasZ(_) | SpatialFilter::Unknown => {} | ||
| } | ||
|
|
||
| BoundingBox::xy(Interval::full(), Interval::full()) | ||
| } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@Kontinuation can you sanity check this?
| fn intersection(&self, other: &Self) -> Result<Self, SedonaGeometryError> { | ||
| match (self.is_wraparound(), other.is_wraparound()) { | ||
| // Neither is wraparound | ||
| (false, false) => Ok(self.inner.intersection(&other.inner)?.into()), | ||
| // One is wraparound | ||
| (true, false) => other.intersection(self), | ||
| (false, true) => { | ||
| let inner = self.inner; | ||
| let (left, right) = other.split(); | ||
| match (inner.intersects_interval(&left), inner.intersects_interval(&right)) { | ||
| // Intersects both the left and right intervals | ||
| (true, true) => { | ||
| Err(SedonaGeometryError::Invalid(format!("Can't represent the intersection of {self:?} and {other:?} as a single WraparoundInterval"))) | ||
| }, | ||
| // Intersects only the left interval | ||
| (true, false) => Ok(inner.intersection(&left)?.into()), | ||
| // Intersects only the right interval | ||
| (false, true) => Ok(inner.intersection(&right)?.into()), | ||
| (false, false) => Ok(WraparoundInterval::empty()), | ||
| } | ||
| } | ||
| // Both are wraparound | ||
| (true, true) => { | ||
| // Both wraparound intervals represent complements of excluded regions | ||
| // Intersection of complements = complement of union of excluded regions | ||
| // self excludes (hi, lo), other excludes (other.hi, other.lo) | ||
| // We need to find the union of these excluded regions | ||
| let excluded_self = Interval::new(self.inner.hi, self.inner.lo); | ||
| let excluded_other = Interval::new(other.inner.hi, other.inner.lo); | ||
|
|
||
| // We can't use the excluded union if the excluded region of self and other | ||
| // are disjoint | ||
| if excluded_self.intersects_interval(&excluded_other) { | ||
| let excluded_union = excluded_self.merge_interval(&excluded_other); | ||
|
|
||
| // The intersection is the complement of the union of excluded regions | ||
| Ok(WraparoundInterval::new( | ||
| excluded_union.hi(), | ||
| excluded_union.lo(), | ||
| )) | ||
| } else { | ||
| Err(SedonaGeometryError::Invalid(format!("Can't represent the intersection of {self:?} and {other:?} as a single WraparoundInterval"))) | ||
| } | ||
| } | ||
| } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is also a bit of non-trivial math that could use a sanity check (the tests below have some ASCII art to help visualize the various cases)
This PR implements GDAL/OGR format reading, which also required some infrastructure. I tried to put as much infrastructure as possible in Rust so that we can reuse it when we implement this in Rust for the CLI/maybe other bindings.
There were a few missing pieces I discovered here (extracting a bounding box from a filter expression, which in turn required the ability to intersect intervals/bounding boxes which we hadn't implemented yet. Anybody should feel free to say the word and I'm happy to PR these separately to help with review.
Some cool features we support here:
https://and/or.zipped sources using/vsicurl/and/vsizip/, respectively.read_pyogrio("some/dir/*.fgb")and it should work (and is tested).