feat: Add metrics in `DataSourceExec` related to spatial predicate pruning by 2010YOUY01 · Pull Request #173 · apache/sedona-db

2010YOUY01 · 2025-10-02T09:02:21Z

Rationale

It's useful to see how many files or row groups are pruned by the spatial filter. This PR extends the DataSourceExec metrics reported by GeoParquetFileOpener with counters for spatial predicate pruning:

#[derive(Clone)]
struct GeoParquetFileOpenerMetrics {
    /// How many file ranges are pruned by [`SpatialFilter`]
    ///
    /// Note on "file range": an opener may read only part of a file rather than the
    /// entire file; that portion is referred to as the "file range". See [`PartitionedFile`]
    /// for details.
    files_ranges_spatial_pruned: Count,
    /// How many file ranges are matched by [`SpatialFilter`]
    files_ranges_spatial_matched: Count,
    /// How many row groups are pruned by [`SpatialFilter`]
    row_groups_spatial_pruned: Count,
    /// How many row groups are matched by [`SpatialFilter`]
    row_groups_spatial_matched: Count,
}

Demo

See *_spatial_* entries in metrics:

Sedona CLI v0.2.0
> CREATE EXTERNAL TABLE test
STORED AS PARQUET
LOCATION '/Users/yongting/Code/sedona-db/submodules/sedona-testing/data/parquet/geoparquet-1.1.0.parquet';
0 row(s)/0 column(s) fetched.
Elapsed 0.031 seconds.
// Spatial predicate that pruned the entire file
>         EXPLAIN ANALYZE
        SELECT *
        FROM test
        WHERE ST_Intersects(
            geometry,
            ST_SetSRID(
                ST_GeomFromText('POLYGON((-10 84, -10 88, 10 88, 10 84, -10 84))'),
                4326
            )
        );
┌───────────────────┬─────────────────────────────────────────────────────────────────────────────────────────┐
│     plan_type     ┆                                           plan                                          │
│        utf8       ┆                                           utf8                                          │
╞═══════════════════╪═════════════════════════════════════════════════════════════════════════════════════════╡
│ Plan with Metrics ┆ CoalesceBatchesExec: target_batch_size=8192, metrics=[output_rows=0, elapsed_compute=1. │
│                   ┆ 377µs]                                                                                  │
│                   ┆   FilterExec: st_intersects(geometry@5, 01030000000100000005...), metrics=[output_rows= │
│                   ┆ 0, elapsed_compute=14ns]                                                                │
│                   ┆     RepartitionExec: partitioning=RoundRobinBatch(14), input_partitions=1, metrics=[fet │
│                   ┆ ch_time=2.498667ms, repartition_time=1ns, send_time=14ns]                               │
│                   ┆       DataSourceExec: file_groups={1 group: [[Users/yongting/Code/sedona-db/submodules/ │
│                   ┆ sedona-testing/data/parquet/geoparquet-1.1.0.parquet]]}, projection=[pop_est, continent │
│                   ┆ , name, iso_a3, gdp_md_est, geometry, bbox], file_type=parquet, metrics=[output_rows=0, │
│                   ┆  elapsed_compute=1ns, batches_splitted=0, bytes_scanned=0, file_open_errors=0, file_sca │
│                   ┆ n_errors=0, files_ranges_pruned_statistics=0, files_ranges_spatial_matched=0, files_ran │
│                   ┆ ges_spatial_pruned=1, num_predicate_creation_errors=0, page_index_rows_matched=0, page_ │
│                   ┆ index_rows_pruned=0, predicate_evaluation_errors=0, pushdown_rows_matched=0, pushdown_r │
│                   ┆ ows_pruned=0, row_groups_matched_bloom_filter=0, row_groups_matched_statistics=0, row_g │
│                   ┆ roups_pruned_bloom_filter=0, row_groups_pruned_statistics=0, row_groups_spatial_matched │
│                   ┆ =0, row_groups_spatial_pruned=0, bloom_filter_eval_time=2ns, metadata_load_time=820.626 │
│                   ┆ µs, page_index_eval_time=126ns, row_pushdown_eval_time=2ns, statistics_eval_time=2ns, t │
│                   ┆ ime_elapsed_opening=2.100209ms, time_elapsed_processing=1.899918ms, time_elapsed_scanni │
│                   ┆ ng_total=3.709µs, time_elapsed_scanning_until_data=3.708µs]                             │
│                   ┆                                                                                         │
└───────────────────┴─────────────────────────────────────────────────────────────────────────────────────────┘
1 row(s)/2 column(s) fetched.
Elapsed 0.046 seconds.
// Spatial predicate that cannot skip any file or row group
>         EXPLAIN ANALYZE
        SELECT *
        FROM test
        WHERE ST_Intersects(
            geometry,
            ST_SetSRID(
                ST_GeomFromText(
                    'POLYGON((-180 -18.28799, -180 83.23324, 180 83.23324, 180 -18.28799, -180 -18.28799))'
                ),
                4326
            )
        );
┌───────────────────┬─────────────────────────────────────────────────────────────────────────────────────────┐
│     plan_type     ┆                                           plan                                          │
│        utf8       ┆                                           utf8                                          │
╞═══════════════════╪═════════════════════════════════════════════════════════════════════════════════════════╡
│ Plan with Metrics ┆ CoalesceBatchesExec: target_batch_size=8192, metrics=[output_rows=5, elapsed_compute=11 │
│                   ┆ 4.079µs]                                                                                │
│                   ┆   FilterExec: st_intersects(geometry@5, 01030000000100000005...), metrics=[output_rows= │
│                   ┆ 5, elapsed_compute=919.596µs]                                                           │
│                   ┆     RepartitionExec: partitioning=RoundRobinBatch(14), input_partitions=1, metrics=[fet │
│                   ┆ ch_time=7.449916ms, repartition_time=1ns, send_time=28.347µs]                           │
│                   ┆       DataSourceExec: file_groups={1 group: [[Users/yongting/Code/sedona-db/submodules/ │
│                   ┆ sedona-testing/data/parquet/geoparquet-1.1.0.parquet]]}, projection=[pop_est, continent │
│                   ┆ , name, iso_a3, gdp_md_est, geometry, bbox], file_type=parquet, metrics=[output_rows=5, │
│                   ┆  elapsed_compute=1ns, batches_splitted=0, bytes_scanned=21777, file_open_errors=0, file │
│                   ┆ _scan_errors=0, files_ranges_pruned_statistics=0, files_ranges_spatial_matched=1, files │
│                   ┆ _ranges_spatial_pruned=0, num_predicate_creation_errors=0, page_index_rows_matched=0, p │
│                   ┆ age_index_rows_pruned=0, predicate_evaluation_errors=0, pushdown_rows_matched=0, pushdo │
│                   ┆ wn_rows_pruned=0, row_groups_matched_bloom_filter=0, row_groups_matched_statistics=0, r │
│                   ┆ ow_groups_pruned_bloom_filter=0, row_groups_pruned_statistics=0, row_groups_spatial_mat │
│                   ┆ ched=1, row_groups_spatial_pruned=0, bloom_filter_eval_time=2ns, metadata_load_time=1.0 │
│                   ┆ 62084ms, page_index_eval_time=376ns, row_pushdown_eval_time=2ns, statistics_eval_time=2 │
│                   ┆ ns, time_elapsed_opening=2.791916ms, time_elapsed_processing=5.083834ms, time_elapsed_s │
│                   ┆ canning_total=4.124333ms, time_elapsed_scanning_until_data=3.912625ms]                  │
│                   ┆                                                                                         │
└───────────────────┴─────────────────────────────────────────────────────────────────────────────────────────┘
1 row(s)/2 column(s) fetched.
Elapsed 0.035 seconds.

Implementation

Added a struct inside GeoParquetFileOpener to hold the spatial pruning metrics, and updated those metrics during spatial filtering.
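As a rough illustration of how such counters could be updated during filtering, here is a minimal sketch. It is not the code from this PR: `Count` is a stand-in built on an atomic counter (the real type comes from DataFusion's metrics module), and `record_spatial_pruning` with its boolean inputs is a hypothetical helper invented for this example. Only the four field names come from the PR description.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::Arc;

/// Stand-in for DataFusion's `Count` metric (assumption: a shared atomic counter).
#[derive(Clone, Debug, Default)]
struct Count(Arc<AtomicUsize>);

impl Count {
    fn add(&self, n: usize) {
        self.0.fetch_add(n, Ordering::Relaxed);
    }

    fn value(&self) -> usize {
        self.0.load(Ordering::Relaxed)
    }
}

/// Field names mirror the struct shown in the PR description.
#[derive(Clone, Default)]
struct GeoParquetFileOpenerMetrics {
    files_ranges_spatial_pruned: Count,
    files_ranges_spatial_matched: Count,
    row_groups_spatial_pruned: Count,
    row_groups_spatial_matched: Count,
}

/// Hypothetical helper: `file_may_match` is the result of the file-level
/// spatial check, and `row_group_may_match` holds one result per row group.
/// Returns the indices of the row groups that still need to be read.
fn record_spatial_pruning(
    metrics: &GeoParquetFileOpenerMetrics,
    file_may_match: bool,
    row_group_may_match: &[bool],
) -> Vec<usize> {
    if !file_may_match {
        // The whole file range is skipped, so row groups are never inspected
        // and the row-group counters stay untouched.
        metrics.files_ranges_spatial_pruned.add(1);
        return Vec::new();
    }
    metrics.files_ranges_spatial_matched.add(1);

    let mut keep = Vec::new();
    for (index, &may_match) in row_group_may_match.iter().enumerate() {
        if may_match {
            metrics.row_groups_spatial_matched.add(1);
            keep.push(index);
        } else {
            metrics.row_groups_spatial_pruned.add(1);
        }
    }
    keep
}
```

Under these assumptions, a file range pruned at the file level leaves both row-group counters at zero, which is consistent with the first demo query (files_ranges_spatial_pruned=1, row_groups_spatial_matched=0, row_groups_spatial_pruned=0).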

2010YOUY01 · 2025-10-02T09:05:29Z

rust/sedona-geoparquet/src/format.rs

                (None, Some(specified_predicate)) => Some(specified_predicate),
                (Some(inner_predicate), None) => Some(inner_predicate),
-                (Some(_), Some(specified_predicate)) => {
-                    parquet_source = parquet_source.with_predicate(specified_predicate.clone());


DataFusion's with_predicate() API unexpectedly resets metrics. See apache/datafusion#17858
I checked the related implementation: the predicate inside the inner ParquetSource and the predicate in GeoParquetFileSource should always be the same, so I made the implementation more defensive here to avoid the DataFusion bug that clears the metrics.
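The diff above only shows the removed arm, not what replaced it. One way a "defensive" merge of the two predicates could look is sketched below. This is an assumption for illustration only: `Predicate` is a plain `String` stand-in for a physical expression, `resolve_predicate` is a hypothetical name, and returning an error when the two predicates differ is a design choice of this sketch, not something confirmed by the PR. The point it illustrates is that the source is never rebuilt through with_predicate(), so its metrics are left alone.

```rust
/// Stand-in for a physical predicate expression (assumption: the real code
/// holds a shared expression object rather than a string).
type Predicate = String;

/// Hypothetical helper: pick the predicate to use without rebuilding the
/// underlying source, so that existing metrics are not reset.
fn resolve_predicate(
    inner: Option<Predicate>,
    specified: Option<Predicate>,
) -> Result<Option<Predicate>, String> {
    match (inner, specified) {
        (None, None) => Ok(None),
        (None, Some(specified_predicate)) => Ok(Some(specified_predicate)),
        (Some(inner_predicate), None) => Ok(Some(inner_predicate)),
        (Some(inner_predicate), Some(specified_predicate)) => {
            // Both are expected to be identical; fail loudly instead of
            // silently replacing one with the other.
            if inner_predicate == specified_predicate {
                Ok(Some(specified_predicate))
            } else {
                Err(format!(
                    "inner predicate '{inner_predicate}' differs from specified predicate '{specified_predicate}'"
                ))
            }
        }
    }
}
```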

Copilot

Pull Request Overview

This PR adds metrics tracking for spatial predicate pruning in DataSourceExec for GeoParquet files. The metrics help understand how effective spatial filters are at pruning files and row groups during query execution.

Added a metrics struct to track spatial pruning effectiveness at both file and row group levels
Integrated metrics tracking into the spatial filtering logic for both file-level and row group-level pruning
Added comprehensive tests to verify the metrics are correctly reported in query execution plans

Reviewed Changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 1 comment.

File	Description
rust/sedona/tests/metrics.rs	New test file verifying spatial pruning metrics in execution plans
rust/sedona-geoparquet/src/format.rs	Updated GeoParquetFileSource to pass metrics to file opener
rust/sedona-geoparquet/src/file_opener.rs	Added metrics struct and tracking for spatial pruning operations


rust/sedona-geoparquet/src/format.rs

Copilot

Pull Request Overview

Copilot reviewed 3 out of 3 changed files in this pull request and generated no new comments.


rust/sedona-geoparquet/src/file_opener.rs

rust/sedona-geoparquet/src/format.rs

zhangfengcdt · 2025-10-02T15:55:41Z

rust/sedona/tests/metrics.rs

+use sedona::context::SedonaContext;
+
+#[tokio::test]
+async fn geo_parquet_metrics() {


Should the parquet specific test be in rust/sedona-geoparquet/tests/ instead of rust/sedona/tests/ for better organization?

It's true that all of our other pruning tests are in sedona-geoparquet (for better or worse! The top level sedona crate didn't exist when I wrote them..). We don't have access to a real ST_Intersects() there but we do have a fake one to test pruning:

sedona-db/rust/sedona-geoparquet/src/format.rs

Lines 684 to 686 in a15844b

#[rstest]
#[tokio::test]
async fn pruning_geoparquet_metadata(#[values("st_intersects", "st_contains")] udf_name: &str) {

I'd prefer to keep the pruning tests together in sedona-geoparquet but also happy to have some integration-y tests here if there's some technical reason they can't live there 🙂

For technical reasons, these tests are easier to implement as e2e/integration tests.
If we moved them into sedona-geoparquet for better organization, they would have to test some very low-level utility functions, and those are volatile: a simple refactor would require the tests to be rewritten, while integration tests are more stable.
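To make the integration-test idea concrete, here is a minimal sketch of a helper that pulls a counter out of EXPLAIN ANALYZE plan text so a test can assert on it. `metric_value` is a hypothetical name, not a function from this PR or from DataFusion, and the sketch assumes the plan is available as one unwrapped string (the CLI table output wraps long lines, which would split metric names).

```rust
/// Hypothetical helper: return the integer that follows `name=` in the plan
/// text, or `None` if the metric is absent or not followed by digits.
fn metric_value(plan: &str, name: &str) -> Option<u64> {
    let needle = format!("{name}=");
    let mut search_from = 0;

    while let Some(offset) = plan[search_from..].find(&needle) {
        let start = search_from + offset;
        let value_start = start + needle.len();

        // Reject matches that are only the tail of a longer metric name,
        // e.g. `spatial_pruned` inside `row_groups_spatial_pruned`.
        let is_whole_name = plan[..start]
            .chars()
            .next_back()
            .map_or(true, |c| !(c.is_alphanumeric() || c == '_'));

        if is_whole_name {
            let digits: String = plan[value_start..]
                .chars()
                .take_while(|c| c.is_ascii_digit())
                .collect();
            return digits.parse().ok();
        }
        search_from = value_start;
    }
    None
}
```

A test built on this could run the first demo query and check that `metric_value(&plan, "files_ranges_spatial_pruned")` is `Some(1)`, without depending on any low-level pruning utilities.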

paleolimbot

Thank you! All my comments are optional...looking forward to this 🙂

rust/sedona-geoparquet/src/format.rs

paleolimbot · 2025-10-02T17:19:52Z

rust/sedona/tests/metrics.rs

+use sedona::context::SedonaContext;
+
+#[tokio::test]
+async fn geo_parquet_metrics() {


It's true that all of our other pruning tests are in sedona-geoparquet (for better or worse! The top level sedona crate didn't exist when I wrote them..). We don't have access to a real ST_Intersects() there but we do have a fake one to test pruning:

sedona-db/rust/sedona-geoparquet/src/format.rs

Lines 684 to 686 in a15844b

#[rstest]
#[tokio::test]
async fn pruning_geoparquet_metadata(#[values("st_intersects", "st_contains")] udf_name: &str) {

I'd prefer to keep the pruning tests together in sedona-geoparquet but also happy to have some integration-y tests here if there's some technical reason they can't live there 🙂

paleolimbot · 2025-10-02T17:22:05Z

rust/sedona/tests/metrics.rs

+        .await
+        .expect("interactive context should initialize");
+
+    let geo_parquet_path = "../../submodules/sedona-testing/data/parquet/geoparquet-1.1.0.parquet";


We have a helper for this one (mostly to give an actionable error if somebody forgot to initialize the submodules):

sedona-db/rust/sedona-testing/src/data.rs

Lines 43 to 48 in a15844b

/// Find the most likely path to the geoarrow-data testing directory if it exists
///
/// This looks for a geoarrow-data checkout using the value of SEDONA_GEOARROW_DATA_DIR,
/// the directory that would be valid if running cargo run from the repository root,
/// or the directory that would be valid if running cargo test (in that order).
pub fn geoarrow_data_dir() -> Result<String> {

Ah right, that's for geoarrow-data and not sedona-testing. No need to add a second helper, this is great as is!

Good idea! Done in cac92f8
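For readers unfamiliar with this kind of helper, the lookup order described in the doc comment above (environment variable first, then a few candidate directories) can be sketched roughly as follows. This is not the utility added in cac92f8: `find_data_dir`, its parameters, and the error messages are hypothetical and use only the standard library.

```rust
use std::env;
use std::path::Path;

/// Hypothetical helper: resolve a test data directory from an environment
/// variable, falling back to candidate paths checked in order.
fn find_data_dir(env_var: &str, candidates: &[&str]) -> Result<String, String> {
    // An explicitly configured directory wins, but it must exist.
    if let Ok(dir) = env::var(env_var) {
        if Path::new(&dir).is_dir() {
            return Ok(dir);
        }
        return Err(format!("{env_var} is set to '{dir}', which is not a directory"));
    }

    // Otherwise try each candidate, e.g. paths valid for `cargo run` from the
    // repository root or for `cargo test` from a crate directory.
    for candidate in candidates {
        if Path::new(candidate).is_dir() {
            return Ok(candidate.to_string());
        }
    }

    // Give an actionable error if nothing was found.
    Err(format!(
        "test data directory not found; set {env_var} or initialize the git submodules"
    ))
}
```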

2010YOUY01 · 2025-10-04T14:33:20Z

Thank you all for the reviews


2010YOUY01 added 4 commits October 2, 2025 16:43

add metrics for spatial filter pruning

93cdcfd

Merge remote-tracking branch 'upstream/main' into spatial-filter-metrics

cc94fb1

cleanup

b143a95

cleanup

5bb489d

2010YOUY01 commented Oct 2, 2025

Add Apache Header

5ef0cc1

jiayuasu requested a review from Copilot October 2, 2025 15:38

Copilot AI reviewed Oct 2, 2025

jiayuasu requested review from Kontinuation, Copilot, paleolimbot and zhangfengcdt October 2, 2025 15:39

Copilot AI reviewed Oct 2, 2025

zhangfengcdt reviewed Oct 2, 2025

paleolimbot approved these changes Oct 2, 2025

Kontinuation approved these changes Oct 3, 2025

2010YOUY01 added 2 commits October 4, 2025 22:00

clarify row_groups_spatial_pruned in the comment

3df3a0e

use a utility to get test directory path

cac92f8

Update rust/sedona-geoparquet/src/format.rs

be45fd3

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

zhangfengcdt approved these changes Oct 4, 2025

jiayuasu merged commit b320337 into apache:main Oct 4, 2025
12 checks passed

paleolimbot added this to the 0.2.0 milestone Nov 27, 2025

