feat: Refactor nearest to use upstream NearestProvider UDTF #309
Replace SQL-generation-based nearest with direct NearestProvider usage from datafusion-bio-function-ranges, exposing k, overlap, and distance parameters. This uses an optimized NearestIntervalIndex (COITree + sorted boundary arrays) instead of the legacy coitreesnearest SQL JOIN.

- Bump datafusion-bio-function-ranges to rev dae17206 (PR #13)
- Add nearest_k, include_overlaps, compute_distance to RangeOptions
- Rewrite do_nearest() using NearestProvider with column rename SQL
- Delete nearest_query() SQL generator from query.rs
- Add k, overlap, distance params to Python API and Polars extension
- Gate distance column on compute_distance flag in schema generation
- Add tests for k>1, overlap=False, and distance=False

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
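The semantics of the three new parameters can be illustrated with a brute-force reference in Python. This is only a sketch for reasoning about expected results, not the upstream NearestIntervalIndex algorithm. It assumes half-open `(start, end)` intervals, a distance of 0 for overlapping or touching intervals, and a deterministic tie-break by coordinates. The function names and the tuple-based row shape are hypothetical.

```python
from typing import List, Tuple

Interval = Tuple[int, int]  # (start, end), assumed half-open: start <= x < end


def gap(a: Interval, b: Interval) -> int:
    """Distance between two intervals; 0 if they overlap or touch (assumption)."""
    if a[1] <= b[0]:
        return b[0] - a[1]
    if b[1] <= a[0]:
        return a[0] - b[1]
    return 0


def overlaps(a: Interval, b: Interval) -> bool:
    """True if two half-open intervals share at least one position."""
    return a[0] < b[1] and b[0] < a[1]


def k_nearest(
    query: Interval,
    targets: List[Interval],
    k: int = 1,
    include_overlaps: bool = True,
    compute_distance: bool = True,
) -> List[tuple]:
    """Return up to k nearest targets, optionally with a distance column."""
    # Drop overlapping targets when the caller asks for strict neighbors only.
    candidates = [t for t in targets if include_overlaps or not overlaps(query, t)]
    # Rank by distance, breaking ties by coordinates (hypothetical tie-break).
    ranked = sorted(candidates, key=lambda t: (gap(query, t), t))[:k]
    if compute_distance:
        return [(t[0], t[1], gap(query, t)) for t in ranked]
    return [(t[0], t[1]) for t in ranked]


targets = [(0, 5), (8, 12), (20, 30), (40, 50)]
query = (10, 15)
with_overlaps = k_nearest(query, targets, k=2)
without_overlaps = k_nearest(query, targets, k=2, include_overlaps=False)
```

A reference like this is mainly useful in tests: whatever index the real implementation uses, each query interval should yield at most `k` rows, and the distance column should appear only when requested.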
@claude pls review
Claude finished @mwiewior's task in 1m 55s

PR Review: feat/nearest-provider-udtf
Overall this is a solid refactor, replacing the hand-rolled SQL generator with the upstream …

Correctness Issues

Distance formula: boundary semantics (…

The CASE expression reads:

    WHEN "left_{end_2}" < "right_{start_1}" THEN CAST("right_{start_1}" - "left_{end_2}" AS BIGINT)
    WHEN "right_{end_1}" < "left_{start_2}" THEN CAST("left_{start_2}" - "right_{end_1}" AS BIGINT)

For half-open intervals …

The …

    ctx.deregister_table(table_name).unwrap();

If the table doesn't exist yet, …

Design / API Concerns

The old code set …

Column name collision risk

If a user's DataFrame has a column literally named …

Minor Issues

    let session = ctx.clone();
    let nearest_provider = NearestProvider::new(Arc::new(session), ...);

Unused …

Tests exercise new code paths but not edge cases

    if range_options.range_op == RangeOp.Nearest and (
        range_options.compute_distance is None or range_options.compute_distance
    ):

The …

Summary
…ates

Update datafusion-bio-function-ranges to 8a684caf, which fixes the off-by-one distance bug when using the strict (0-based half-open) filter. The upstream now computes distance using raw coordinates instead of strict-adjusted ones. Also simplifies do_nearest() to use the native distance column from NearestProvider instead of a SQL CASE expression.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
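To make the off-by-one concrete, here is a minimal Python sketch rather than the upstream Rust code. It assumes the gap between two non-overlapping intervals is the later start minus the earlier end, as in the CASE expression quoted in the review. It treats every other case as distance 0, which is an assumption because the ELSE branch is not shown. The strict adjustment is modeled as a hypothetical shift of each end coordinate by one; the upstream adjustment may differ in detail.

```python
def gap_distance(a_start: int, a_end: int, b_start: int, b_end: int) -> int:
    """Gap between two intervals measured on the coordinates as given."""
    if a_end < b_start:
        return b_start - a_end
    if b_end < a_start:
        return a_start - b_end
    return 0  # assumed: overlapping or touching intervals have distance 0


def adjusted_gap_distance(a_start: int, a_end: int, b_start: int, b_end: int) -> int:
    """Hypothetical buggy variant: ends are shifted by one before measuring."""
    return gap_distance(a_start, a_end - 1, b_start, b_end - 1)


raw = gap_distance(1, 5, 8, 10)                 # measured on raw coordinates
adjusted = adjusted_gap_distance(1, 5, 8, 10)   # measured after the shift
```

With these inputs the raw gap is 3 and the shifted variant reports 4, which is the kind of one-unit discrepancy the commit message describes.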
- Remove dead `coitreesnearest` guard in do_range_operation (no longer needed since nearest uses NearestProvider directly)
- Use unique table names via atomic counter for nearest temp tables to avoid potential concurrency issues with shared SessionContext
- Add k=2 upper-bound tests asserting at most k neighbors per query interval in both native and polars test suites

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
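The unique-table-name idea can be sketched in Python, although the actual change lives in the Rust layer and is described as using an atomic counter. The version below swaps in a lock-protected counter as the closest Python equivalent. The function name and the `nearest_tmp` prefix are hypothetical.

```python
import itertools
import threading

# Process-wide counter; the lock makes each increment safe across threads.
_counter = itertools.count()
_counter_lock = threading.Lock()


def unique_table_name(prefix: str = "nearest_tmp") -> str:
    """Return a table name that no other caller in this process has received."""
    with _counter_lock:
        n = next(_counter)
    return f"{prefix}_{n}"
```

Because every call gets a distinct suffix, two concurrent nearest operations sharing one session context never register or deregister each other's temporary table.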
Summary
- Replace SQL-generation-based `nearest` with direct `NearestProvider` usage from `datafusion-bio-function-ranges` (upstream PR #13)
- Expose `k` (number of nearest neighbors), `overlap` (include/exclude overlapping intervals), `distance` (opt-in distance column)
- Delete `nearest_query()` SQL generator and `coitreesnearest` algorithm override
- Use `NearestIntervalIndex` (COITree + sorted boundary arrays) under the hood

Changes

| File | Change |
| --- | --- |
| `Cargo.toml` | Bump `datafusion-bio-function-ranges` to rev `dae17206` |
| `src/option.rs` | Add `nearest_k`, `include_overlaps`, `compute_distance` to `RangeOptions` |
| `src/operation.rs` | Rewrite `do_nearest()` using `NearestProvider` with column rename + distance SQL |
| `src/query.rs` | Delete `nearest_query()` function (~100 lines) |
| `polars_bio/range_op.py` | Add `k`, `overlap`, `distance` params to `nearest()` |
| `polars_bio/range_op_helpers.py` | Gate distance column on `compute_distance` flag |
| `polars_bio/polars_ext.py` | Add `k`, `overlap`, `distance` to Polars extension method |
| `tests/test_native.py` | |
| `tests/test_polars.py` | |

Test plan

- `cargo check`: Rust compilation passes
- `pytest tests/test_native.py -k nearest`: 10 tests pass (2 existing + 8 new)
- `pytest tests/test_polars.py -k nearest`: 5 tests pass (3 existing + 2 new)
- `pytest tests/test_bioframe.py -k nearest`: 2 existing tests pass (backward compat)
- `pytest tests/test_streaming.py -k nearest`: 3 existing tests pass
- `pytest tests/test_pandas.py -k nearest`: 2 existing tests pass

🤖 Generated with Claude Code