Skip to content

Conversation

@paleolimbot
Copy link
Member

@paleolimbot paleolimbot commented Nov 7, 2025

This PR implements GDAL/OGR format reading, which also required some infrastructure. I tried to put as much infrastructure as possible in Rust so that we can reuse it when we implement this in Rust for the CLI/maybe other bindings.

There were a few missing pieces I discovered here (extracting a bounding box from a filter expression, which in turn required the ability to intersect intervals/bounding boxes which we hadn't implemented yet. Anybody should feel free to say the word and I'm happy to PR these separately to help with review.

import sedona.db

sd = sedona.db.connect()
sd.options.interactive = True

url = "https://github.com/geoarrow/geoarrow-data/releases/download/v0.2.0/ns-water_water-line.fgb"
sd.read_pyogrio(url).to_view("water_line")

sd.view("water_line").head(5)

Some cool features we support here:

  • Bounding box pushdown: the (spatial) query filters are pushed into the datasource. We could potentially push some non-spatial filters but it's less clear if that would actually help (even for spatial filters we might be able to evaluate them faster than GDAL if there's no spatial index on the source or if it's a wide window because we can do it in parallel).
  • Projection pushdown: skip reading unused columns
  • https:// and/or .zipped sources using /vsicurl/ and /vsizip/, respectively.
  • Multi file support. You can read_pyogrio("some/dir/*.fgb") and it should work (and is tested).
  • Infrastructure to wire up other file formats via Python. We probably need to extend the implementation a bit to support other formats but it is mostly there. While we don't really want to implement the actual reader in Python (== holding the GIL while reading stuff), using Python to obtain the C ArrowArrayStream should be OK.

@petern48
Copy link
Collaborator

petern48 commented Nov 9, 2025

Closes #137 I assume? Just creating a link

@paleolimbot paleolimbot requested a review from Copilot November 19, 2025 15:14
Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull Request Overview

This PR implements support for reading spatial file formats using GDAL/OGR through pyogrio in the Python sedonadb package. The changes introduce a plugin architecture for external file formats and implement the first concrete format (pyogrio) that enables reading common geospatial formats like FlatGeoBuf, GeoPackage, and GeoJSON.

Key changes:

  • Added ExternalFormatSpec trait and infrastructure for pluggable file format support
  • Implemented PyogrioFormatSpec for GDAL/OGR format support via pyogrio
  • Added spatial filter pushdown capability with bounding box extraction
  • Implemented interval intersection methods for spatial filtering optimization

Reviewed Changes

Copilot reviewed 18 out of 19 changed files in this pull request and generated 3 comments.

Show a summary per file
File Description
rust/sedona/src/context.rs Added read_external_format() method and tests for external format reading
rust/sedona-geometry/src/interval.rs Added is_full() and intersection() methods to interval traits
rust/sedona-geometry/src/bounding_box.rs Added intersection() method for bounding box operations
rust/sedona-expr/src/spatial_filter.rs Added filter_bbox() method and Clone trait for spatial filter pushdown
rust/sedona-datasource/src/utility.rs New file implementing ProjectedRecordBatchReader for column projection
rust/sedona-datasource/src/spec.rs Enhanced Object with URL reconstruction and display, added tests
rust/sedona-datasource/src/format.rs Added schema inference validation for empty object lists
python/sedonadb/src/datasource.rs New file implementing Python bindings for external format specs
python/sedonadb/src/context.rs Added read_external_format() method to Python context
python/sedonadb/python/sedonadb/datasource.py New file with ExternalFormatSpec and PyogrioFormatSpec implementations
python/sedonadb/python/sedonadb/context.py Added read_pyogrio() public API method
python/sedonadb/tests/test_datasource.py New test file for pyogrio format functionality
Various Cargo.toml files Added dependencies for new functionality

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

@paleolimbot paleolimbot changed the title feat(python/sedonadb): Implement GDAL/OGR formats feat(python/sedonadb): Implement GDAL/OGR formats via pyogrio Nov 19, 2025
@paleolimbot paleolimbot marked this pull request as ready for review November 19, 2025 16:03
Comment on lines +67 to +103
/// Compute the maximum extent of a filter for a specific column index
///
/// Some spatial file formats have the ability to push down a bounding box
/// into an index. This function allows deriving that bounding box based
/// on what DataFusion provides, which is a physical expression.
///
/// Note that this always succeeds; however, for a non-spatial expression or
/// a non-spatial expression that is unsupported, the full bounding box is
/// returned.
pub fn filter_bbox(&self, column_index: usize) -> BoundingBox {
match self {
SpatialFilter::Intersects(column, bounding_box)
| SpatialFilter::Covers(column, bounding_box) => {
if column.index() == column_index {
return bounding_box.clone();
}
}
SpatialFilter::And(lhs, rhs) => {
let lhs_box = lhs.filter_bbox(column_index);
let rhs_box = rhs.filter_bbox(column_index);
if let Ok(bounds) = lhs_box.intersection(&rhs_box) {
return bounds;
}
}
SpatialFilter::Or(lhs, rhs) => {
let mut bounds = lhs.filter_bbox(column_index);
bounds.update_box(&rhs.filter_bbox(column_index));
return bounds;
}
SpatialFilter::LiteralFalse => {
return BoundingBox::xy(Interval::empty(), Interval::empty())
}
SpatialFilter::HasZ(_) | SpatialFilter::Unknown => {}
}

BoundingBox::xy(Interval::full(), Interval::full())
}
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@Kontinuation can you sanity check this?

Comment on lines +510 to +554
fn intersection(&self, other: &Self) -> Result<Self, SedonaGeometryError> {
match (self.is_wraparound(), other.is_wraparound()) {
// Neither is wraparound
(false, false) => Ok(self.inner.intersection(&other.inner)?.into()),
// One is wraparound
(true, false) => other.intersection(self),
(false, true) => {
let inner = self.inner;
let (left, right) = other.split();
match (inner.intersects_interval(&left), inner.intersects_interval(&right)) {
// Intersects both the left and right intervals
(true, true) => {
Err(SedonaGeometryError::Invalid(format!("Can't represent the intersection of {self:?} and {other:?} as a single WraparoundInterval")))
},
// Intersects only the left interval
(true, false) => Ok(inner.intersection(&left)?.into()),
// Intersects only the right interval
(false, true) => Ok(inner.intersection(&right)?.into()),
(false, false) => Ok(WraparoundInterval::empty()),
}
}
// Both are wraparound
(true, true) => {
// Both wraparound intervals represent complements of excluded regions
// Intersection of complements = complement of union of excluded regions
// self excludes (hi, lo), other excludes (other.hi, other.lo)
// We need to find the union of these excluded regions
let excluded_self = Interval::new(self.inner.hi, self.inner.lo);
let excluded_other = Interval::new(other.inner.hi, other.inner.lo);

// We can't use the excluded union if the excluded region of self and other
// are disjoint
if excluded_self.intersects_interval(&excluded_other) {
let excluded_union = excluded_self.merge_interval(&excluded_other);

// The intersection is the complement of the union of excluded regions
Ok(WraparoundInterval::new(
excluded_union.hi(),
excluded_union.lo(),
))
} else {
Err(SedonaGeometryError::Invalid(format!("Can't represent the intersection of {self:?} and {other:?} as a single WraparoundInterval")))
}
}
}
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is also a bit of non-trivial math that could use a sanity check (the tests below have some ASCII art to help visualize the various cases)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants