feat: Refactor nearest to use upstream NearestProvider UDTF by mwiewior · Pull Request #309 · biodatageeks/polars-bio

mwiewior · 2026-02-17T11:30:55Z

Summary

Replace SQL-generation-based nearest with direct NearestProvider usage from datafusion-bio-function-ranges (upstream PR #13)
Expose new parameters: k (number of nearest neighbors), overlap (include/exclude overlapping intervals), distance (opt-in distance column)
Delete the legacy nearest_query() SQL generator and coitreesnearest algorithm override
Uses an optimized NearestIntervalIndex (COITree + sorted boundary arrays) under the hood

Changes

File	Change
`Cargo.toml`	Bump `datafusion-bio-function-ranges` to rev `dae17206`
`src/option.rs`	Add `nearest_k`, `include_overlaps`, `compute_distance` to `RangeOptions`
`src/operation.rs`	Rewrite `do_nearest()` using `NearestProvider` with column rename + distance SQL
`src/query.rs`	Delete `nearest_query()` function (~100 lines)
`polars_bio/range_op.py`	Add `k`, `overlap`, `distance` params to `nearest()`
`polars_bio/range_op_helpers.py`	Gate distance column on `compute_distance` flag
`polars_bio/polars_ext.py`	Add `k`, `overlap`, `distance` to Polars extension method
`tests/test_native.py`	Add tests for k>1, overlap=False, distance=False
`tests/test_polars.py`	Add k>1 tests for DataFrame and LazyFrame

Test plan

cargo check — Rust compilation passes
pytest tests/test_native.py -k nearest — 10 tests pass (2 existing + 8 new)
pytest tests/test_polars.py -k nearest — 5 tests pass (3 existing + 2 new)
pytest tests/test_bioframe.py -k nearest — 2 existing tests pass (backward compat)
pytest tests/test_streaming.py -k nearest — 3 existing tests pass
pytest tests/test_pandas.py -k nearest — 2 existing tests pass
All 79 tests across main test files pass

🤖 Generated with Claude Code

Replace SQL-generation-based nearest with direct NearestProvider usage from datafusion-bio-function-ranges, exposing k, overlap, and distance parameters. This uses an optimized NearestIntervalIndex (COITree + sorted boundary arrays) instead of the legacy coitreesnearest SQL JOIN. - Bump datafusion-bio-function-ranges to rev dae17206 (PR #13) - Add nearest_k, include_overlaps, compute_distance to RangeOptions - Rewrite do_nearest() using NearestProvider with column rename SQL - Delete nearest_query() SQL generator from query.rs - Add k, overlap, distance params to Python API and Polars extension - Gate distance column on compute_distance flag in schema generation - Add tests for k>1, overlap=False, and distance=False Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

mwiewior · 2026-02-17T14:59:10Z

@claude pls review

claude · 2026-02-17T14:59:33Z

Claude finished @mwiewior's task in 1m 55s —— View job

PR Review: feat/nearest-provider-udtf

Gather context and understand the PR
Read changed files
Analyze diff
Post review feedback

Overall this is a solid refactor — replacing the hand-rolled SQL generator with the upstream NearestProvider UDTF is the right direction. The left/right swap with the clear comment explaining the provider's indexing semantics is good defensive documentation. A few concerns worth addressing before merge:

Correctness Issues

Distance formula – boundary semantics (src/operation.rs:188-195)

The CASE expression reads:

WHEN "left_{end_2}" < "right_{start_1}" THEN CAST("right_{start_1}" - "left_{end_2}" AS BIGINT)
WHEN "right_{end_1}" < "left_{start_2}" THEN CAST("left_{start_2}" - "right_{end_1}" AS BIGINT)

For half-open intervals [start, end) this computes start_1 - end_2 when end_2 == start_1 (adjacent intervals), which gives 0 — that is correct.
For closed intervals [start, end], two intervals are touching when end_2 + 1 == start_1, but the formula gives start_1 - end_2 = 1, not 0.

The filter_op (Weak/Strict) carries the coordinate-system information, but the distance computation is not conditioned on it. The old SQL also had the same issue, but since this is an explicit refactor with new tests, it's worth deciding intentionally how distances should be handled across coordinate systems. Fix this →

deregister_table ignores errors (src/operation.rs:152)

ctx.deregister_table(table_name).unwrap();

If the table doesn't exist yet, deregister_table returns Ok(None) so this is safe for the first call. But if a prior call left the table in a bad state, unwrap() could panic at runtime. This is consistent with the rest of the file's error-handling style, but it's worth noting.

Design / API Concerns

nearest_result table name is a global shared namespace (src/operation.rs:151)

ctx is a SessionContext that may be shared across concurrent calls (depending on how Polars-bio constructs it). Using a hardcoded table name "nearest_result" plus a deregister + register pattern is not safe under concurrent nearest() calls. The do_count_overlaps_coverage_naive function uses a per-call unique name; consider the same approach here (e.g., format!("nearest_result_{}", uuid) or a counter). Fix this →

coitreesnearest guard is dead code (src/operation.rs:38-40)

The old code set bio.interval_join_algorithm to "coitreesnearest" before calling do_nearest. That is now removed (good). But the guard in do_range_operation that panics if someone passes "coitreesnearest" as overlap_alg is now technically dead code — do_nearest no longer touches overlap_alg, and the guard runs before dispatching. Consider either removing it or converting it to a proper error return.

Column name collision risk

If a user's DataFrame has a column literally named "nearest_result" (unlikely but possible) or a field whose name happens to collide with another field after suffix application (e.g., both tables have a "chrom_1" column), the dynamically-generated SQL will produce duplicate alias names. The upstream NearestProvider presumably prefixes all output columns, but the suffix-aliasing step in select_parts doesn't deduplicate. Worth checking whether the existing overlap operation handles this and applying the same guard.

Minor Issues

session / ctx clone (src/operation.rs:108)

let session = ctx.clone();
let nearest_provider = NearestProvider::new(Arc::new(session), ...);

ctx is already &SessionContext. SessionContext::clone() is cheap (it's Arc-backed), but wrapping in Arc::new(session) means the provider gets a fresh Arc pointing to the same underlying session. This is correct; just double-check that NearestProvider doesn't need exclusive ownership.

Unused filter_op in nearest path

range_opts.filter_op.unwrap() is called to produce upstream_filter_op for NearestProvider. Make sure NearestProvider actually uses it — if it doesn't (e.g., the upstream treats it as a noop for nearest), this is misleading and could be dropped.

Tests exercise new code paths but not edge cases

k=2 tests only assert "at least as many rows as k=1" — they don't check that at most 2 neighbors are returned per query interval.
overlap=False tests pass on the current data, but since the test data isn't shown, it's unclear whether the assertion all(d > 0) is exhaustive or the data happens to have all non-overlapping neighbors. Consider a test where some query intervals definitively overlap with database intervals, to confirm they are excluded.
No test for k combined with distance=False.

range_op_helpers.py distance schema check (range_op_helpers.py:155-158)

if range_options.range_op == RangeOp.Nearest and (
    range_options.compute_distance is None or range_options.compute_distance
):

The is None branch is checking the Python-side value. Since compute_distance is set from the Python distance parameter with a default of True, None should never reach here from the Python side. But it's a good defensive guard — consider adding a comment explaining it handles the direct-Rust-call path.

Summary

Severity	Issue	Location
Medium	Hardcoded shared table name risks concurrency bugs	`operation.rs:151`
Low	Distance formula doesn't account for coordinate system	`operation.rs:188-195`
Low	Dead `coitreesnearest` guard	`operation.rs:38-40`
Low	`k=2` tests don't assert upper bound on neighbors	`test_native.py`
Nit	`session` clone comment could be clearer	`operation.rs:108`

…ates Update datafusion-bio-function-ranges to 8a684caf which fixes the off-by-one distance bug when using strict (0-based half-open) filter. The upstream now computes distance using raw coordinates instead of strict-adjusted ones. Also simplifies do_nearest() to use the native distance column from NearestProvider instead of a SQL CASE expression. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

- Remove dead `coitreesnearest` guard in do_range_operation (no longer needed since nearest uses NearestProvider directly) - Use unique table names via atomic counter for nearest temp tables to avoid potential concurrency issues with shared SessionContext - Add k=2 upper-bound tests asserting at most k neighbors per query interval in both native and polars test suites Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

mwiewior force-pushed the feat/nearest-provider-udtf branch 5 times, most recently from a1c2d04 to 799ae64 Compare February 17, 2026 14:27

mwiewior force-pushed the feat/nearest-provider-udtf branch from 799ae64 to 64f968c Compare February 17, 2026 14:32

mwiewior mentioned this pull request Feb 17, 2026

feat: Add optional distance column to NearestProvider output biodatageeks/datafusion-bio-functions#14

Closed

mwiewior and others added 2 commits February 17, 2026 16:57

mwiewior merged commit e63d416 into master Feb 17, 2026
15 checks passed

mwiewior deleted the feat/nearest-provider-udtf branch February 17, 2026 20:03

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Comments

feat: Refactor nearest to use upstream NearestProvider UDTF#309

feat: Refactor nearest to use upstream NearestProvider UDTF#309
mwiewior merged 3 commits intomasterfrom
feat/nearest-provider-udtf

mwiewior commented Feb 17, 2026

Uh oh!

mwiewior commented Feb 17, 2026

Uh oh!

claude bot commented Feb 17, 2026 •

edited

Loading

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Comments

Conversation

mwiewior commented Feb 17, 2026

Summary

Changes

Test plan

Uh oh!

mwiewior commented Feb 17, 2026

Uh oh!

claude bot commented Feb 17, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

PR Review: feat/nearest-provider-udtf

Correctness Issues

Design / API Concerns

Minor Issues

Summary

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

claude bot commented Feb 17, 2026 •

edited

Loading