Skip to content

feat(query): Runtime Filter support spatial index join#19530

Merged
b41sh merged 4 commits intodatabendlabs:mainfrom
b41sh:feat-spatial-join
Mar 17, 2026
Merged

feat(query): Runtime Filter support spatial index join#19530
b41sh merged 4 commits intodatabendlabs:mainfrom
b41sh:feat-spatial-join

Conversation

@b41sh
Copy link
Member

@b41sh b41sh commented Mar 11, 2026

I hereby agree to the terms of the CLA available at: https://docs.databend.com/dev/policies/cla/

Summary

This PR targets spatial join predicates where the probe side can benefit from block pruning, such as:

  • JOIN ... ON st_within(t1.geom, t2.geom)
  • JOIN ... ON st_intersects(t1.geom, t2.geom)
  • JOIN ... ON st_contains(t1.geom, t2.geom)
  • JOIN ... ON st_equals(t1.geom, t2.geom)

It is especially useful when the build side is smaller and the probe side is a large spatial table with a spatial index.

Implementation Overview

  • Build side creates a runtime spatial filter: we construct an RTree from build-side geometries and transmit it through the existing runtime filter framework.
  • To prevent large filters, we respect spatial_runtime_filter_threshold. If the number of RTree items exceeds the threshold, we compact the tree to reduce its size before sending.
  • Probe side pruning uses a two-step strategy:
    1. Coarse pruning with SpatialStatistics (block-level bbox + SRID).
    2. Fine pruning with the spatial index (RTree intersection checks).

Other Changes

  • Removed spatial index srid and invalid_rows columns:
    • SRID can be derived from SpatialStatistics, so the extra column is redundant.
    • invalid_rows was only used for row-level checks, which are no longer required in the block-level pruning flow.
  • Extends FuseBlockPartInfo to carry spatial index location and spatial statistics for runtime pruning

for example

CREATE TABLE stores (store_id INT, category String, store_name String, status String, location Geometry, SPATIAL INDEX idx (location));

INSERT INTO stores VALUES
(1, 'coffee', 'Alpha', 'active', TO_GEOMETRY('POINT(10 10)')),
(2, 'coffee', 'Beta', 'active', TO_GEOMETRY('POINT(11 11)')),
(3, 'tea', 'Gamma', 'active', TO_GEOMETRY('POINT(50 50)')),
(4, 'coffee', 'Delta', 'inactive', TO_GEOMETRY('POINT(51 51)')),
(5, 'coffee', 'Epsilon', 'active', TO_GEOMETRY('POINT(-10 -10)')),
(6, 'tea', 'Zeta', 'inactive', TO_GEOMETRY('POINT(-11 -11)')),
(7, 'coffee', 'Eta', 'active', TO_GEOMETRY('POINT(100 100)')),
(8, 'tea', 'Theta', 'active', TO_GEOMETRY('POINT(101 101)'));

CREATE TABLE districts (district_id INT, district_name String, geom Geometry);

INSERT INTO districts VALUES
(1, 'Central', TO_GEOMETRY('POLYGON((8 8, 8 13, 13 13, 13 8, 8 8))')),
(2, 'West', TO_GEOMETRY('POLYGON((-2 -2, -2 2, 2 2, 2 -2, -2 -2))')),
(3, 'South', TO_GEOMETRY('POLYGON((18 -2, 18 2, 32 2, 32 -2, 18 -2))')),
(4, 'North', TO_GEOMETRY('POLYGON((-2 18, -2 32, 2 32, 2 18, -2 18))'));

EXPLAIN SELECT d.district_name, s.store_name FROM districts d JOIN stores s ON st_within(s.location, d.geom)
-[ EXPLAIN ]-----------------------------------
HashJoin
├── output columns: [s.store_name (#5), d.district_name (#1)]
├── join type: INNER
├── build keys: []
├── probe keys: []
├── keys is null equal: []
├── filters: [st_within(s.location (#7), d.geom (#2))]
├── build join filters:
│   └── filter id:0, build key:d.geom (#2), probe targets:[s.location (#7)@scan1], filter type:spatial
├── estimated rows: 32.00
├── TableScan(Build)
│   ├── table: default.default.districts
│   ├── scan id: 0
│   ├── output columns: [district_name (#1), geom (#2)]
│   ├── read rows: 4
│   ├── read size: < 1 KiB
│   ├── partitions total: 1
│   ├── partitions scanned: 1
│   ├── pruning stats: [segments: <range pruning: 1 to 1 cost: 1 ms>, blocks: <range pruning: 1 to 1 cost: 1 ms>]
│   ├── push downs: [filters: [], limit: NONE]
│   └── estimated rows: 4.00
└── TableScan(Probe)
    ├── table: default.default.stores
    ├── scan id: 1
    ├── output columns: [store_name (#5), location (#7)]
    ├── read rows: 8
    ├── read size: < 1 KiB
    ├── partitions total: 1
    ├── partitions scanned: 1
    ├── pruning stats: [segments: <range pruning: 1 to 1 cost: 1 ms>, blocks: <range pruning: 1 to 1 cost: 1 ms>]
    ├── push downs: [filters: [], limit: NONE]
    ├── apply join filters: [#0]
    └── estimated rows: 8.00

SELECT d.district_name, s.store_name AS store_count FROM districts d JOIN stores s ON st_within(s.location, d.geom)

╭─────────────────────────────────────╮
│   district_name  │    store_name    │
│ Nullable(String) │ Nullable(String) │
├──────────────────┼──────────────────┤
│ Central          │ Alpha            │
│ Central          │ Beta             │
╰─────────────────────────────────────╯
  • fixes: #[Link the issue here]

Tests

  • Unit Test
  • Logic Test
  • Benchmark Test
  • No Test - Explain why

Type of change

  • Bug Fix (non-breaking change which fixes an issue)
  • New Feature (non-breaking change which adds functionality)
  • Breaking Change (fix or feature that could cause existing functionality not to work as expected)
  • Documentation Update
  • Refactoring
  • Performance Improvement
  • Other (please describe):

This change is Reviewable

@github-actions github-actions bot added the pr-feature this PR introduces a new feature to the codebase label Mar 11, 2026
@b41sh b41sh force-pushed the feat-spatial-join branch from 77a6dc2 to 30db491 Compare March 16, 2026 10:02
@b41sh b41sh force-pushed the feat-spatial-join branch from cf6c29b to d3076c0 Compare March 16, 2026 10:16
@b41sh b41sh marked this pull request as ready for review March 16, 2026 10:21
@b41sh b41sh requested review from SkyFan2002 and Copilot March 16, 2026 10:21
Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR adds runtime filter support for spatial index joins, enabling block-level pruning for probe-side tables with spatial indexes when using predicates like st_within, st_intersects, st_contains, and st_equals. It also simplifies the spatial index format by removing redundant srid and invalid_rows columns.

Changes:

  • Adds spatial runtime filter infrastructure: build-side constructs an RTree, transmits it through the runtime filter framework, and probe-side uses it for two-level pruning (stats-based coarse + RTree fine)
  • Removes srid and invalid_rows columns from spatial index, deriving SRID from SpatialStatistics and adding has_empty_rect/is_valid fields
  • Extends FuseBlockPartInfo to carry spatial index location and spatial statistics for runtime pruning

Reviewed changes

Copilot reviewed 58 out of 59 changed files in this pull request and generated 2 comments.

Show a summary per file
File Description
src/query/catalog/src/runtime_filter_info.rs Adds RuntimeFilterSpatial struct and spatial stats tracking
src/query/functions/src/lib.rs Adds GENERAL_SPATIAL_FUNCTIONS constant
src/query/settings/src/settings_default.rs Adds spatial_runtime_filter_threshold setting
src/query/service/src/physical_plans/physical_hash_join.rs Extracts spatial join conditions from non-equi filters for runtime filter building
src/query/service/src/physical_plans/runtime_filter/builder.rs Marks spatial filters with is_spatial flag
src/query/service/src/physical_plans/runtime_filter/types.rs Adds is_spatial field to PhysicalRuntimeFilter
src/query/service/src/pipelines/processors/transforms/hash_join/runtime_filter/spatial.rs New: RTree building, compaction, merging, and bounds extraction
src/query/service/src/pipelines/processors/transforms/hash_join/runtime_filter/local_builder.rs Adds spatial bbox extraction during build-side processing
src/query/service/src/pipelines/processors/transforms/hash_join/runtime_filter/convert.rs Converts spatial packets to RuntimeFilterSpatial entries
src/query/service/src/pipelines/processors/transforms/hash_join/runtime_filter/merge.rs Merges spatial filter packets across partitions
src/query/service/src/pipelines/processors/transforms/hash_join/runtime_filter/packet.rs Adds SpatialPacket to runtime filter packet
src/query/storages/fuse/src/pruning/spatial_runtime_pruner.rs New: Two-level spatial pruning (stats + RTree index)
src/query/storages/fuse/src/pruning/spatial_index_pruner.rs Refactored to use SpatialIndexReader and SpatialStatistics
src/query/storages/fuse/src/io/read/spatial_index/spatial_index_reader.rs New: Shared spatial index file reader
src/query/storages/fuse/src/io/write/spatial_index_writer.rs Removes srid/invalid_rows columns, only stores RTree with valid rects
src/query/storages/fuse/src/fuse_part.rs Adds spatial index location and spatial stats to FuseBlockPartInfo
src/query/storages/fuse/src/operations/read_partitions.rs Propagates spatial stats and index location to part info
src/query/storages/fuse/src/operations/read/parquet_data_transform_reader.rs Integrates spatial runtime pruner into read pipeline
src/query/storages/fuse/src/operations/read/native_data_transform_reader.rs Integrates spatial runtime pruner into native read pipeline
src/query/storages/fuse/src/statistics/spatial_stats.rs Adds has_empty_rect, is_valid fields; changes finalize to always return stats
src/query/storages/common/table_meta/src/meta/v2/statistics.rs Adds has_empty_rect and is_valid to SpatialStatistics
src/query/storages/common/index/src/range_index.rs Handles Option<Rect> for empty geometries
src/query/storages/common/index/src/spatial_predicate.rs Uses GENERAL_SPATIAL_FUNCTIONS constant, handles empty rects
Various join implementation files Thread spatial_threshold through join types
Test files Adds spatial join tests and runtime filter unit tests

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Copy link

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: d3076c0fac

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

@b41sh b41sh merged commit 9feff14 into databendlabs:main Mar 17, 2026
89 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

pr-feature this PR introduces a new feature to the codebase

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants