Skip to content

Comments

feat: Refactor nearest to use upstream NearestProvider UDTF#309

Merged
mwiewior merged 3 commits intomasterfrom
feat/nearest-provider-udtf
Feb 17, 2026
Merged

feat: Refactor nearest to use upstream NearestProvider UDTF#309
mwiewior merged 3 commits intomasterfrom
feat/nearest-provider-udtf

Conversation

@mwiewior
Copy link
Collaborator

Summary

  • Replace SQL-generation-based nearest with direct NearestProvider usage from datafusion-bio-function-ranges (upstream PR #13)
  • Expose new parameters: k (number of nearest neighbors), overlap (include/exclude overlapping intervals), distance (opt-in distance column)
  • Delete the legacy nearest_query() SQL generator and coitreesnearest algorithm override
  • Uses an optimized NearestIntervalIndex (COITree + sorted boundary arrays) under the hood

Changes

File Change
Cargo.toml Bump datafusion-bio-function-ranges to rev dae17206
src/option.rs Add nearest_k, include_overlaps, compute_distance to RangeOptions
src/operation.rs Rewrite do_nearest() using NearestProvider with column rename + distance SQL
src/query.rs Delete nearest_query() function (~100 lines)
polars_bio/range_op.py Add k, overlap, distance params to nearest()
polars_bio/range_op_helpers.py Gate distance column on compute_distance flag
polars_bio/polars_ext.py Add k, overlap, distance to Polars extension method
tests/test_native.py Add tests for k>1, overlap=False, distance=False
tests/test_polars.py Add k>1 tests for DataFrame and LazyFrame

Test plan

  • cargo check — Rust compilation passes
  • pytest tests/test_native.py -k nearest — 10 tests pass (2 existing + 8 new)
  • pytest tests/test_polars.py -k nearest — 5 tests pass (3 existing + 2 new)
  • pytest tests/test_bioframe.py -k nearest — 2 existing tests pass (backward compat)
  • pytest tests/test_streaming.py -k nearest — 3 existing tests pass
  • pytest tests/test_pandas.py -k nearest — 2 existing tests pass
  • All 79 tests across main test files pass

🤖 Generated with Claude Code

@mwiewior mwiewior force-pushed the feat/nearest-provider-udtf branch 5 times, most recently from a1c2d04 to 799ae64 Compare February 17, 2026 14:27
Replace SQL-generation-based nearest with direct NearestProvider usage
from datafusion-bio-function-ranges, exposing k, overlap, and distance
parameters. This uses an optimized NearestIntervalIndex (COITree +
sorted boundary arrays) instead of the legacy coitreesnearest SQL JOIN.

- Bump datafusion-bio-function-ranges to rev dae17206 (PR #13)
- Add nearest_k, include_overlaps, compute_distance to RangeOptions
- Rewrite do_nearest() using NearestProvider with column rename SQL
- Delete nearest_query() SQL generator from query.rs
- Add k, overlap, distance params to Python API and Polars extension
- Gate distance column on compute_distance flag in schema generation
- Add tests for k>1, overlap=False, and distance=False

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@mwiewior mwiewior force-pushed the feat/nearest-provider-udtf branch from 799ae64 to 64f968c Compare February 17, 2026 14:32
@mwiewior
Copy link
Collaborator Author

@claude pls review

@claude
Copy link

claude bot commented Feb 17, 2026

Claude finished @mwiewior's task in 1m 55s —— View job


PR Review: feat/nearest-provider-udtf

  • Gather context and understand the PR
  • Read changed files
  • Analyze diff
  • Post review feedback

Overall this is a solid refactor — replacing the hand-rolled SQL generator with the upstream NearestProvider UDTF is the right direction. The left/right swap with the clear comment explaining the provider's indexing semantics is good defensive documentation. A few concerns worth addressing before merge:


Correctness Issues

Distance formula – boundary semantics (src/operation.rs:188-195)

The CASE expression reads:

WHEN "left_{end_2}" < "right_{start_1}" THEN CAST("right_{start_1}" - "left_{end_2}" AS BIGINT)
WHEN "right_{end_1}" < "left_{start_2}" THEN CAST("left_{start_2}" - "right_{end_1}" AS BIGINT)

For half-open intervals [start, end) this computes start_1 - end_2 when end_2 == start_1 (adjacent intervals), which gives 0 — that is correct.
For closed intervals [start, end], two intervals are touching when end_2 + 1 == start_1, but the formula gives start_1 - end_2 = 1, not 0.

The filter_op (Weak/Strict) carries the coordinate-system information, but the distance computation is not conditioned on it. The old SQL also had the same issue, but since this is an explicit refactor with new tests, it's worth deciding intentionally how distances should be handled across coordinate systems. Fix this →

deregister_table ignores errors (src/operation.rs:152)

ctx.deregister_table(table_name).unwrap();

If the table doesn't exist yet, deregister_table returns Ok(None) so this is safe for the first call. But if a prior call left the table in a bad state, unwrap() could panic at runtime. This is consistent with the rest of the file's error-handling style, but it's worth noting.


Design / API Concerns

nearest_result table name is a global shared namespace (src/operation.rs:151)

ctx is a SessionContext that may be shared across concurrent calls (depending on how Polars-bio constructs it). Using a hardcoded table name "nearest_result" plus a deregister + register pattern is not safe under concurrent nearest() calls. The do_count_overlaps_coverage_naive function uses a per-call unique name; consider the same approach here (e.g., format!("nearest_result_{}", uuid) or a counter). Fix this →

coitreesnearest guard is dead code (src/operation.rs:38-40)

The old code set bio.interval_join_algorithm to "coitreesnearest" before calling do_nearest. That is now removed (good). But the guard in do_range_operation that panics if someone passes "coitreesnearest" as overlap_alg is now technically dead code — do_nearest no longer touches overlap_alg, and the guard runs before dispatching. Consider either removing it or converting it to a proper error return.

Column name collision risk

If a user's DataFrame has a column literally named "nearest_result" (unlikely but possible) or a field whose name happens to collide with another field after suffix application (e.g., both tables have a "chrom_1" column), the dynamically-generated SQL will produce duplicate alias names. The upstream NearestProvider presumably prefixes all output columns, but the suffix-aliasing step in select_parts doesn't deduplicate. Worth checking whether the existing overlap operation handles this and applying the same guard.


Minor Issues

session / ctx clone (src/operation.rs:108)

let session = ctx.clone();
let nearest_provider = NearestProvider::new(Arc::new(session), ...);

ctx is already &SessionContext. SessionContext::clone() is cheap (it's Arc-backed), but wrapping in Arc::new(session) means the provider gets a fresh Arc pointing to the same underlying session. This is correct; just double-check that NearestProvider doesn't need exclusive ownership.

Unused filter_op in nearest path

range_opts.filter_op.unwrap() is called to produce upstream_filter_op for NearestProvider. Make sure NearestProvider actually uses it — if it doesn't (e.g., the upstream treats it as a noop for nearest), this is misleading and could be dropped.

Tests exercise new code paths but not edge cases

  • k=2 tests only assert "at least as many rows as k=1" — they don't check that at most 2 neighbors are returned per query interval.
  • overlap=False tests pass on the current data, but since the test data isn't shown, it's unclear whether the assertion all(d > 0) is exhaustive or the data happens to have all non-overlapping neighbors. Consider a test where some query intervals definitively overlap with database intervals, to confirm they are excluded.
  • No test for k combined with distance=False.

range_op_helpers.py distance schema check (range_op_helpers.py:155-158)

if range_options.range_op == RangeOp.Nearest and (
    range_options.compute_distance is None or range_options.compute_distance
):

The is None branch is checking the Python-side value. Since compute_distance is set from the Python distance parameter with a default of True, None should never reach here from the Python side. But it's a good defensive guard — consider adding a comment explaining it handles the direct-Rust-call path.


Summary

Severity Issue Location
Medium Hardcoded shared table name risks concurrency bugs operation.rs:151
Low Distance formula doesn't account for coordinate system operation.rs:188-195
Low Dead coitreesnearest guard operation.rs:38-40
Low k=2 tests don't assert upper bound on neighbors test_native.py
Nit session clone comment could be clearer operation.rs:108

mwiewior and others added 2 commits February 17, 2026 16:57
…ates

Update datafusion-bio-function-ranges to 8a684caf which fixes the
off-by-one distance bug when using strict (0-based half-open) filter.
The upstream now computes distance using raw coordinates instead of
strict-adjusted ones. Also simplifies do_nearest() to use the native
distance column from NearestProvider instead of a SQL CASE expression.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Remove dead `coitreesnearest` guard in do_range_operation (no longer
  needed since nearest uses NearestProvider directly)
- Use unique table names via atomic counter for nearest temp tables to
  avoid potential concurrency issues with shared SessionContext
- Add k=2 upper-bound tests asserting at most k neighbors per query
  interval in both native and polars test suites

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@mwiewior mwiewior merged commit e63d416 into master Feb 17, 2026
15 checks passed
@mwiewior mwiewior deleted the feat/nearest-provider-udtf branch February 17, 2026 20:03
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant