feat: Add metrics in DataSourceExec related to spatial predicate pruning #173

Merged
jiayuasu merged 8 commits into apache:main from 2010YOUY01:spatial-filter-metrics
Oct 4, 2025

Conversation

@2010YOUY01
Contributor

Rationale

It's useful to see how many files or row groups are pruned by the spatial filter. This PR extends the DataSourceExec metrics reported by GeoParquetFileOpener with counters for spatial predicate pruning:

#[derive(Clone)]
struct GeoParquetFileOpenerMetrics {
    /// How many file ranges are pruned by [`SpatialFilter`]
    ///
    /// Note on "file range": an opener may read only part of a file rather than the
    /// entire file; that portion is referred to as the "file range". See [`PartitionedFile`]
    /// for details.
    files_ranges_spatial_pruned: Count,
    /// How many file ranges are matched by [`SpatialFilter`]
    files_ranges_spatial_matched: Count,
    /// How many row groups are pruned by [`SpatialFilter`]
    row_groups_spatial_pruned: Count,
    /// How many row groups are matched by [`SpatialFilter`]
    row_groups_spatial_matched: Count,
}
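
For readers who want to see how counters like these might be updated, here is a minimal std-only sketch. It is not the code from this PR: `Count` below is a local stand-in for DataFusion's counter type (assumed to be a shared, atomically updated integer), and `SpatialPruningMetrics`, `record_file_range` and `record_row_groups` are hypothetical names used only for illustration.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::Arc;

/// Local stand-in for a metrics counter (assumption: shared and atomic).
#[derive(Clone, Default)]
struct Count(Arc<AtomicUsize>);

impl Count {
    fn add(&self, n: usize) {
        self.0.fetch_add(n, Ordering::Relaxed);
    }

    fn value(&self) -> usize {
        self.0.load(Ordering::Relaxed)
    }
}

/// Same four counters as the struct above, under a hypothetical name.
#[derive(Clone, Default)]
struct SpatialPruningMetrics {
    files_ranges_spatial_pruned: Count,
    files_ranges_spatial_matched: Count,
    row_groups_spatial_pruned: Count,
    row_groups_spatial_matched: Count,
}

impl SpatialPruningMetrics {
    /// Hypothetical helper: record the filter's verdict for one file range.
    fn record_file_range(&self, may_match: bool) {
        if may_match {
            self.files_ranges_spatial_matched.add(1);
        } else {
            self.files_ranges_spatial_pruned.add(1);
        }
    }

    /// Hypothetical helper: record the filter's verdict for each row group.
    fn record_row_groups(&self, may_match: &[bool]) {
        let matched = may_match.iter().filter(|m| **m).count();
        self.row_groups_spatial_matched.add(matched);
        self.row_groups_spatial_pruned.add(may_match.len() - matched);
    }
}

fn main() {
    let metrics = SpatialPruningMetrics::default();
    // One file range survives the filter; one of its three row groups matches.
    metrics.record_file_range(true);
    metrics.record_row_groups(&[true, false, false]);
    println!(
        "files matched={} pruned={}, row groups matched={} pruned={}",
        metrics.files_ranges_spatial_matched.value(),
        metrics.files_ranges_spatial_pruned.value(),
        metrics.row_groups_spatial_matched.value(),
        metrics.row_groups_spatial_pruned.value(),
    );
}
```

In this sketch, a file range that is pruned never reaches the row-group step, which would be consistent with the first demo query reporting zero for both row-group counters.
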
Demo

See *_spatial_* entries in metrics:

Sedona CLI v0.2.0
> CREATE EXTERNAL TABLE test
STORED AS PARQUET
LOCATION '/Users/yongting/Code/sedona-db/submodules/sedona-testing/data/parquet/geoparquet-1.1.0.parquet';
0 row(s)/0 column(s) fetched.
Elapsed 0.031 seconds.
// Spatial predicate that pruned the entire file
>         EXPLAIN ANALYZE
        SELECT *
        FROM test
        WHERE ST_Intersects(
            geometry,
            ST_SetSRID(
                ST_GeomFromText('POLYGON((-10 84, -10 88, 10 88, 10 84, -10 84))'),
                4326
            )
        );
┌───────────────────┬─────────────────────────────────────────────────────────────────────────────────────────┐
│     plan_type     ┆                                           plan                                          │
│        utf8       ┆                                           utf8                                          │
╞═══════════════════╪═════════════════════════════════════════════════════════════════════════════════════════╡
│ Plan with Metrics ┆ CoalesceBatchesExec: target_batch_size=8192, metrics=[output_rows=0, elapsed_compute=1. │
│                   ┆ 377µs]                                                                                  │
│                   ┆   FilterExec: st_intersects(geometry@5, 01030000000100000005...), metrics=[output_rows= │
│                   ┆ 0, elapsed_compute=14ns]                                                                │
│                   ┆     RepartitionExec: partitioning=RoundRobinBatch(14), input_partitions=1, metrics=[fet │
│                   ┆ ch_time=2.498667ms, repartition_time=1ns, send_time=14ns]                               │
│                   ┆       DataSourceExec: file_groups={1 group: [[Users/yongting/Code/sedona-db/submodules/ │
│                   ┆ sedona-testing/data/parquet/geoparquet-1.1.0.parquet]]}, projection=[pop_est, continent │
│                   ┆ , name, iso_a3, gdp_md_est, geometry, bbox], file_type=parquet, metrics=[output_rows=0, │
│                   ┆  elapsed_compute=1ns, batches_splitted=0, bytes_scanned=0, file_open_errors=0, file_sca │
│                   ┆ n_errors=0, files_ranges_pruned_statistics=0, files_ranges_spatial_matched=0, files_ran │
│                   ┆ ges_spatial_pruned=1, num_predicate_creation_errors=0, page_index_rows_matched=0, page_ │
│                   ┆ index_rows_pruned=0, predicate_evaluation_errors=0, pushdown_rows_matched=0, pushdown_r │
│                   ┆ ows_pruned=0, row_groups_matched_bloom_filter=0, row_groups_matched_statistics=0, row_g │
│                   ┆ roups_pruned_bloom_filter=0, row_groups_pruned_statistics=0, row_groups_spatial_matched │
│                   ┆ =0, row_groups_spatial_pruned=0, bloom_filter_eval_time=2ns, metadata_load_time=820.626 │
│                   ┆ µs, page_index_eval_time=126ns, row_pushdown_eval_time=2ns, statistics_eval_time=2ns, t │
│                   ┆ ime_elapsed_opening=2.100209ms, time_elapsed_processing=1.899918ms, time_elapsed_scanni │
│                   ┆ ng_total=3.709µs, time_elapsed_scanning_until_data=3.708µs]                             │
│                   ┆                                                                                         │
└───────────────────┴─────────────────────────────────────────────────────────────────────────────────────────┘
1 row(s)/2 column(s) fetched.
Elapsed 0.046 seconds.
// Spatial predicate that cannot skip the file or its row group
>         EXPLAIN ANALYZE
        SELECT *
        FROM test
        WHERE ST_Intersects(
            geometry,
            ST_SetSRID(
                ST_GeomFromText(
                    'POLYGON((-180 -18.28799, -180 83.23324, 180 83.23324, 180 -18.28799, -180 -18.28799))'
                ),
                4326
            )
        );
┌───────────────────┬─────────────────────────────────────────────────────────────────────────────────────────┐
│     plan_type     ┆                                           plan                                          │
│        utf8       ┆                                           utf8                                          │
╞═══════════════════╪═════════════════════════════════════════════════════════════════════════════════════════╡
│ Plan with Metrics ┆ CoalesceBatchesExec: target_batch_size=8192, metrics=[output_rows=5, elapsed_compute=11 │
│                   ┆ 4.079µs]                                                                                │
│                   ┆   FilterExec: st_intersects(geometry@5, 01030000000100000005...), metrics=[output_rows= │
│                   ┆ 5, elapsed_compute=919.596µs]                                                           │
│                   ┆     RepartitionExec: partitioning=RoundRobinBatch(14), input_partitions=1, metrics=[fet │
│                   ┆ ch_time=7.449916ms, repartition_time=1ns, send_time=28.347µs]                           │
│                   ┆       DataSourceExec: file_groups={1 group: [[Users/yongting/Code/sedona-db/submodules/ │
│                   ┆ sedona-testing/data/parquet/geoparquet-1.1.0.parquet]]}, projection=[pop_est, continent │
│                   ┆ , name, iso_a3, gdp_md_est, geometry, bbox], file_type=parquet, metrics=[output_rows=5, │
│                   ┆  elapsed_compute=1ns, batches_splitted=0, bytes_scanned=21777, file_open_errors=0, file │
│                   ┆ _scan_errors=0, files_ranges_pruned_statistics=0, files_ranges_spatial_matched=1, files │
│                   ┆ _ranges_spatial_pruned=0, num_predicate_creation_errors=0, page_index_rows_matched=0, p │
│                   ┆ age_index_rows_pruned=0, predicate_evaluation_errors=0, pushdown_rows_matched=0, pushdo │
│                   ┆ wn_rows_pruned=0, row_groups_matched_bloom_filter=0, row_groups_matched_statistics=0, r │
│                   ┆ ow_groups_pruned_bloom_filter=0, row_groups_pruned_statistics=0, row_groups_spatial_mat │
│                   ┆ ched=1, row_groups_spatial_pruned=0, bloom_filter_eval_time=2ns, metadata_load_time=1.0 │
│                   ┆ 62084ms, page_index_eval_time=376ns, row_pushdown_eval_time=2ns, statistics_eval_time=2 │
│                   ┆ ns, time_elapsed_opening=2.791916ms, time_elapsed_processing=5.083834ms, time_elapsed_s │
│                   ┆ canning_total=4.124333ms, time_elapsed_scanning_until_data=3.912625ms]                  │
│                   ┆                                                                                         │
└───────────────────┴─────────────────────────────────────────────────────────────────────────────────────────┘
1 row(s)/2 column(s) fetched.
Elapsed 0.035 seconds.
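
The difference between the two queries can be reproduced with a plain bounding-box test. The sketch below rests on two assumptions that the PR does not state: that pruning compares the query's envelope against the data's bounds, and that the data's bounds equal the envelope of the second query polygon. `BBox` is a hypothetical type, not a Sedona API, and a real implementation would also need to handle cases such as antimeridian wraparound.

```rust
/// Axis-aligned bounding box (hypothetical type for illustration).
#[derive(Debug, Clone, Copy)]
struct BBox {
    xmin: f64,
    ymin: f64,
    xmax: f64,
    ymax: f64,
}

impl BBox {
    /// Two boxes intersect when their ranges overlap on both axes.
    fn intersects(&self, other: &BBox) -> bool {
        self.xmin <= other.xmax
            && other.xmin <= self.xmax
            && self.ymin <= other.ymax
            && other.ymin <= self.ymax
    }
}

fn main() {
    // Assumption: data bounds taken from the second demo polygon.
    let data = BBox { xmin: -180.0, ymin: -18.28799, xmax: 180.0, ymax: 83.23324 };
    // Envelope of the first demo polygon: entirely north of the data.
    let query_1 = BBox { xmin: -10.0, ymin: 84.0, xmax: 10.0, ymax: 88.0 };
    // Envelope of the second demo polygon: same as the assumed data bounds.
    let query_2 = data;

    println!("query 1 may match: {}", data.intersects(&query_1)); // false, so prune
    println!("query 2 may match: {}", data.intersects(&query_2)); // true, so scan
}
```

Under these assumptions the first query fails the test because its southern edge (84) lies above the data's northern edge (83.23324), which matches `files_ranges_spatial_pruned=1` in the first plan.
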

Implementation

This PR adds a struct inside GeoParquetFileOpener that holds the spatial pruning metrics, and updates those metrics during spatial filtering.

(None, Some(specified_predicate)) => Some(specified_predicate),
(Some(inner_predicate), None) => Some(inner_predicate),
(Some(_), Some(specified_predicate)) => {
parquet_source = parquet_source.with_predicate(specified_predicate.clone());
Contributor Author

@2010YOUY01 2010YOUY01 Oct 2, 2025

DataFusion's with_predicate() API unexpectedly resets metrics. See apache/datafusion#17858.
I checked the related implementation: the predicate inside the inner ParquetSource and the predicate in GeoParquetFileSource should always be the same, so I made the implementation here more defensive to avoid the DataFusion bug that clears the metrics.
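
To illustrate the idea, here is a toy model of the problem and of one defensive pattern. It is not the code from this PR or from DataFusion: `ToySource` and `ensure_predicate` are hypothetical, and the toy `with_predicate` simply models the reported behavior by swapping in a fresh counter.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::Arc;

/// Toy stand-in for a file source that carries a predicate and shared metrics.
#[derive(Clone)]
struct ToySource {
    predicate: Option<String>,
    metrics: Arc<AtomicUsize>,
}

impl ToySource {
    fn new(predicate: Option<String>) -> Self {
        Self { predicate, metrics: Arc::new(AtomicUsize::new(0)) }
    }

    /// Models the reported behavior: setting a predicate also replaces the metrics.
    fn with_predicate(mut self, predicate: String) -> Self {
        self.predicate = Some(predicate);
        self.metrics = Arc::new(AtomicUsize::new(0));
        self
    }
}

/// Defensive merge: call `with_predicate` only when the inner source has none.
/// Assumes that when both predicates exist they are the same predicate.
fn ensure_predicate(source: ToySource, specified: Option<String>) -> ToySource {
    match (source.predicate.clone(), specified) {
        // Nothing to set, or already set: keep the source (and its metrics) as-is.
        (_, None) | (Some(_), Some(_)) => source,
        (None, Some(predicate)) => source.with_predicate(predicate),
    }
}

fn main() {
    let source = ToySource::new(Some("st_intersects(...)".to_string()));
    let handle = Arc::clone(&source.metrics);
    handle.fetch_add(1, Ordering::Relaxed);

    let kept = ensure_predicate(source, Some("st_intersects(...)".to_string()));
    // The counter observed through `handle` is still the one the source reports.
    println!("metrics preserved: {}", Arc::ptr_eq(&handle, &kept.metrics));
}
```

The trade-off in this pattern is that it trusts the two predicates to be identical, which is the invariant described in the comment above.
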

@jiayuasu jiayuasu requested a review from Copilot October 2, 2025 15:38
Contributor

Copilot AI left a comment

Pull Request Overview

This PR adds metrics tracking for spatial predicate pruning in DataSourceExec for GeoParquet files. The metrics help understand how effective spatial filters are at pruning files and row groups during query execution.

  • Added a metrics struct to track spatial pruning effectiveness at both file and row group levels
  • Integrated metrics tracking into the spatial filtering logic for both file-level and row group-level pruning
  • Added comprehensive tests to verify the metrics are correctly reported in query execution plans

Reviewed Changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 1 comment.

| File | Description |
| --- | --- |
| rust/sedona/tests/metrics.rs | New test file verifying spatial pruning metrics in execution plans |
| rust/sedona-geoparquet/src/format.rs | Updated GeoParquetFileSource to pass metrics to file opener |
| rust/sedona-geoparquet/src/file_opener.rs | Added metrics struct and tracking for spatial pruning operations |

Contributor

Copilot AI left a comment

Pull Request Overview

Copilot reviewed 3 out of 3 changed files in this pull request and generated no new comments.


use sedona::context::SedonaContext;

#[tokio::test]
async fn geo_parquet_metrics() {
Member

Should the Parquet-specific test live in rust/sedona-geoparquet/tests/ instead of rust/sedona/tests/ for better organization?

Member

It's true that all of our other pruning tests are in sedona-geoparquet (for better or worse! The top level sedona crate didn't exist when I wrote them..). We don't have access to a real ST_Intersects() there but we do have a fake one to test pruning:

#[rstest]
#[tokio::test]
async fn pruning_geoparquet_metadata(#[values("st_intersects", "st_contains")] udf_name: &str) {

I'd prefer to keep the pruning tests together in sedona-geoparquet but also happy to have some integration-y tests here if there's some technical reason they can't live there 🙂

Contributor Author

@2010YOUY01 2010YOUY01 Oct 4, 2025

For technical reasons, these tests are easier to implement as e2e/integration tests.
If we moved them into sedona-geoparquet for better organization, they would have to test some very low-level utility functions. Those are volatile: a simple refactor would force the tests to be rewritten, whereas integration tests are more stable.
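
As a rough illustration of why the integration style is stable, a test can depend only on the text of the EXPLAIN ANALYZE plan. The helper below is a hypothetical sketch, not the test code from this PR: `metric_value` is an invented name, and it is only meant for integer count metrics, not for timings such as `elapsed_compute=1.377µs`.

```rust
/// Extract the integer value of `name=<digits>` from a plan string.
/// Hypothetical helper; intended for count metrics only.
fn metric_value(plan: &str, name: &str) -> Option<u64> {
    let key = format!("{name}=");
    let mut search_from = 0;
    while let Some(offset) = plan[search_from..].find(&key) {
        let start = search_from + offset;
        // Reject matches that are only the tail of a longer metric name.
        let standalone = plan[..start]
            .chars()
            .next_back()
            .map_or(true, |c| !(c.is_ascii_alphanumeric() || c == '_'));
        if standalone {
            let digits: String = plan[start + key.len()..]
                .chars()
                .take_while(|c| c.is_ascii_digit())
                .collect();
            return digits.parse().ok();
        }
        search_from = start + key.len();
    }
    None
}

fn main() {
    // Shortened excerpt of the metrics from the first demo query above.
    let plan = "metrics=[output_rows=0, files_ranges_spatial_matched=0, \
                files_ranges_spatial_pruned=1, row_groups_spatial_matched=0, \
                row_groups_spatial_pruned=0]";
    println!("{:?}", metric_value(plan, "files_ranges_spatial_pruned")); // Some(1)
    println!("{:?}", metric_value(plan, "row_groups_spatial_pruned")); // Some(0)
}
```

A test built this way only breaks when a metric is renamed or removed, not when the internals of the file opener are refactored.
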

Member

@paleolimbot paleolimbot left a comment

Thank you! All my comments are optional...looking forward to this 🙂

.await
.expect("interactive context should initialize");

let geo_parquet_path = "../../submodules/sedona-testing/data/parquet/geoparquet-1.1.0.parquet";
Member

We have a helper for this one (mostly to give an actionable error if somebody forgot to initialize the submodules):

/// Find the most likely path to the geoarrow-data testing directory if it exists
///
/// This looks for a geoarrow-data checkout using the value of SEDONA_GEOARROW_DATA_DIR,
/// the directory that would be valid if running cargo run from the repository root,
/// or the directory that would be valid if running cargo test (in that order).
pub fn geoarrow_data_dir() -> Result<String> {

Member

Ah right, that's for geoarrow-data and not sedona-testing. No need to add a second helper, this is great as is!

Contributor Author

Good idea! Done in cac92f8
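
The lookup order described in the helper's doc comment earlier in this thread (an environment override first, then fallback directories, with an actionable error if none exists) might be sketched as follows. This is not the existing `geoarrow_data_dir()` helper: `find_data_dir` and the `EXAMPLE_DATA_DIR` variable are hypothetical, and the candidate paths are only inferred from the relative path used in the test.

```rust
use std::path::{Path, PathBuf};

/// Return the first existing directory among an optional override and a list
/// of fallback candidates, checked in that order. Hypothetical helper.
fn find_data_dir(env_override: Option<&str>, candidates: &[&str]) -> Result<PathBuf, String> {
    let mut tried = Vec::new();
    for dir in env_override.into_iter().chain(candidates.iter().copied()) {
        if Path::new(dir).is_dir() {
            return Ok(PathBuf::from(dir));
        }
        tried.push(dir.to_string());
    }
    Err(format!(
        "test data directory not found (tried: {}); did you run `git submodule update --init`?",
        tried.join(", ")
    ))
}

fn main() {
    // `EXAMPLE_DATA_DIR` is a made-up variable name for this sketch.
    let env_value = std::env::var("EXAMPLE_DATA_DIR").ok();
    let candidates = [
        // Valid when running from the repository root (assumption).
        "submodules/sedona-testing/data",
        // Valid when running `cargo test` from a crate two levels down (assumption).
        "../../submodules/sedona-testing/data",
    ];
    match find_data_dir(env_value.as_deref(), &candidates) {
        Ok(dir) => println!("using test data in {}", dir.display()),
        Err(message) => eprintln!("{message}"),
    }
}
```
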

@2010YOUY01
Contributor Author

Thank you all for the reviews

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
@jiayuasu jiayuasu merged commit b320337 into apache:main Oct 4, 2025
12 checks passed
@paleolimbot paleolimbot added this to the 0.2.0 milestone Nov 27, 2025