feat: Add metrics in DataSourceExec related to spatial predicate pruning#173
feat: Add metrics in DataSourceExec related to spatial predicate pruning#173jiayuasu merged 8 commits intoapache:mainfrom
DataSourceExec related to spatial predicate pruning#173Conversation
| (None, Some(specified_predicate)) => Some(specified_predicate), | ||
| (Some(inner_predicate), None) => Some(inner_predicate), | ||
| (Some(_), Some(specified_predicate)) => { | ||
| parquet_source = parquet_source.with_predicate(specified_predicate.clone()); |
There was a problem hiding this comment.
DataFusion's with_predicate() API will reset metrics unexpectedly. See apache/datafusion#17858
I checked the related implementation, the predicate inside inner ParquetSource and the predicate in GeoParquetFileSource should always be the same, so here I made the implementation more defensive, to avoid the datafusion bug that clear the metrics.
There was a problem hiding this comment.
Pull Request Overview
This PR adds metrics tracking for spatial predicate pruning in DataSourceExec for GeoParquet files. The metrics help understand how effective spatial filters are at pruning files and row groups during query execution.
- Added a metrics struct to track spatial pruning effectiveness at both file and row group levels
- Integrated metrics tracking into the spatial filtering logic for both file-level and row group-level pruning
- Added comprehensive tests to verify the metrics are correctly reported in query execution plans
Reviewed Changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated 1 comment.
| File | Description |
|---|---|
| rust/sedona/tests/metrics.rs | New test file verifying spatial pruning metrics in execution plans |
| rust/sedona-geoparquet/src/format.rs | Updated GeoParquetFileSource to pass metrics to file opener |
| rust/sedona-geoparquet/src/file_opener.rs | Added metrics struct and tracking for spatial pruning operations |
Tip: Customize your code reviews with copilot-instructions.md. Create the file or learn how to get started.
There was a problem hiding this comment.
Pull Request Overview
Copilot reviewed 3 out of 3 changed files in this pull request and generated no new comments.
Tip: Customize your code reviews with copilot-instructions.md. Create the file or learn how to get started.
| use sedona::context::SedonaContext; | ||
|
|
||
| #[tokio::test] | ||
| async fn geo_parquet_metrics() { |
There was a problem hiding this comment.
Should the parquet specific test be in rust/sedona-geoparquet/tests/ instead of rust/sedona/tests/ for better organization?
There was a problem hiding this comment.
It's true that all of our other pruning tests are in sedona-geoparquet (for better or worse! The top level sedona crate didn't exist when I wrote them..). We don't have access to a real ST_Intersects() there but we do have a fake one to test pruning:
sedona-db/rust/sedona-geoparquet/src/format.rs
Lines 684 to 686 in a15844b
I'd prefer to keep the pruning tests together in sedona-geoparquet but also happy to have some integration-y tests here if there's some technical reason they can't live there 🙂
There was a problem hiding this comment.
For some technical reasons, those tests are easier to be implemented as e2e/integration tests.
If we want to make it in sedona-geoparquet for better organization, it would be testing against some very low-level utility functions, then the issue is they're very volatile -- some simple refactor will require those tests to be rewritten, while integration tests are more stable.
paleolimbot
left a comment
There was a problem hiding this comment.
Thank you! All my comments are optional...looking forward to this 🙂
| use sedona::context::SedonaContext; | ||
|
|
||
| #[tokio::test] | ||
| async fn geo_parquet_metrics() { |
There was a problem hiding this comment.
It's true that all of our other pruning tests are in sedona-geoparquet (for better or worse! The top level sedona crate didn't exist when I wrote them..). We don't have access to a real ST_Intersects() there but we do have a fake one to test pruning:
sedona-db/rust/sedona-geoparquet/src/format.rs
Lines 684 to 686 in a15844b
I'd prefer to keep the pruning tests together in sedona-geoparquet but also happy to have some integration-y tests here if there's some technical reason they can't live there 🙂
rust/sedona/tests/metrics.rs
Outdated
| .await | ||
| .expect("interactive context should initialize"); | ||
|
|
||
| let geo_parquet_path = "../../submodules/sedona-testing/data/parquet/geoparquet-1.1.0.parquet"; |
There was a problem hiding this comment.
We have a helper for this one (mostly to give an actionable error if somebody forgot to initialize the submodules):
sedona-db/rust/sedona-testing/src/data.rs
Lines 43 to 48 in a15844b
There was a problem hiding this comment.
Ah right, that's for geoarrow-data and not sedona-testing. No need to add a second helper, this is great as is!
|
Thank you all for the reviews |
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Rationale
It's useful to see how many files or row groups are pruned by spatial filter. This PR extends
DataSourceExec's metrics inGeoParquetFileOpenerrelated to spatial predicate pruning:Demo
See
*_spatial_*entries in metrics:Implementation
Included a struct to hold spatial pruning related metrics isnide
GeoParquetFileOpener, and update those metrics during spatial filtering.