feat: Use upstream OverlapProvider and MergeProvider by mwiewior · Pull Request #310 · biodatageeks/polars-bio

mwiewior · 2026-02-17T18:30:44Z

Summary

Replace hand-crafted SQL generation for overlap in src/query.rs with the upstream OverlapProvider from datafusion-bio-function-ranges (PR Add overlap and merge UDTVFs datafusion-bio-functions#16)
Replace the ~160-line pure-Python window-function merge implementation with the upstream MergeProvider routed through the Rust backend
Add RangeOp::Merge variant and min_dist field to RangeOptions for merge configuration
Remove overlap_query() and unused projection_columns field from QueryParams
Clean up unused Python imports (pyarrow, set_coordinate_system, get_coordinate_system, logger)

Merge now gains file-path and LazyFrame streaming support (previously DataFrame-only). Net effect: ~260 lines removed.

Depends on: biodatageeks/datafusion-bio-functions#16 (update rev to main after merge)

Test plan

cargo check passes with no warnings
maturin develop --release builds successfully
All 30 targeted tests pass (test_bioframe.py, test_overlap_algorithms.py, test_suffix_handling.py)
Full test suite passes (401 tests)
Verify merge works with file paths and LazyFrame inputs (new capability)
Update datafusion-bio-function-ranges rev to main branch commit after upstream PR merges

🤖 Generated with Claude Code

mwiewior · 2026-02-17T19:21:51Z

@claude pls review

claude · 2026-02-17T19:22:11Z

Claude finished @mwiewior's task in 1m 39s —— View job

PR Review

Gather context and understand the request
Read changed files
Analyze code changes
Post review feedback

Overall this is a clean, well-motivated refactor. Replacing the hand-crafted Python merge implementation and the SQL-generation overlap logic with upstream providers shrinks the codebase meaningfully while gaining new capabilities (file-path and LazyFrame inputs for merge). Below are the findings, ordered roughly by severity.

Issues / Concerns

1. `min_dist` silently truncates floats — type mismatch in public API

polars_bio/range_op.py:594

range_options = RangeOptions(
    ...
    min_dist=int(min_dist),   # float → int truncation
)

The public signature accepts float (min_dist: float = 0) but the Rust field is Option<i64>. Fractional values like min_dist=0.5 are silently truncated to 0. The docstring says "Minimum distance between intervals to merge" without warning about this. Either:

change the public signature to int, or
add a validation/round-and-warn before casting.

Fix this →

2. Hard-coded `"n_intervals"` schema column name for Merge

polars_bio/range_op_helpers.py:148-155 and 236-245

elif range_options.range_op == RangeOp.Merge:
    cols = range_options.columns_1
    merged_schema = pl.Schema(
        {
            cols[0]: pl.Utf8,
            cols[1]: pl.Int64,
            cols[2]: pl.Int64,
            "n_intervals": pl.Int64,
        }
    )

The schema is duplicated verbatim in two branches (file-path path and DataFrame path). If MergeProvider ever changes its output column name or adds/removes columns, both places silently break. A small helper constant or function would reduce the coupling.

3. `do_overlap` constructs SQL with backtick-quoted identifiers — injection risk on unusual column names

src/operation.rs:166-185

for field in left_schema.fields() {
    select_parts.push(format!(
        "`left_{}` AS `{}{}`",
        field.name(),
        field.name(),
        suffixes.0
    ));
}

Column names containing backticks themselves (e.g. field`name) would break the generated SQL. This is a pre-existing pattern (also present in nearest_query), but worth flagging since do_overlap is new in this PR. The upstream OverlapProvider presumably surfaces these columns, so consider using DataFusion's expression API instead of SQL string building if feasible.

4. Table name collision risk for `"overlap_result"` and `"merge_result"`

src/operation.rs:160-163 and 251-254

let table_name = "overlap_result".to_string();
ctx.deregister_table(table_name.clone()).unwrap();
ctx.register_table(table_name.clone(), Arc::new(overlap_provider)).unwrap();

Using a shared SessionContext with fixed table names means concurrent calls (e.g. from multi-threaded Python code) would race. The previous SQL approach had the same issue for other operations, but the pattern is now used for two more operations. A random/unique suffix or a per-call context would be safer for future parallelism.

5. `do_range_operation`: `RangeOp::Merge` logs the algorithm even though merge doesn't use it

src/operation.rs:57-68

info!(
    "Running {} operation with algorithm {} and {} thread(s)...",
    range_options.range_op,
    ctx.state().config()...interval_join_algorithm,
    ...
);

For RangeOp::Merge, the algorithm field is irrelevant (merge doesn't do an interval join). The log message will print whatever algorithm was set by a prior call (or the default), which is misleading. The log should be moved inside the match arms that actually use the algorithm, or guarded.

6. `range_operation`: merge with a DataFrame input passes `df` as both `df1` and `df2`

polars_bio/range_op.py:597-604

return range_operation(
    df,
    df,     # same object passed twice
    range_options,
    ...
)

range_operation will call _df_to_reader(df1, ...) and _df_to_reader(df2, ...) on the same object. Depending on whether _df_to_reader returns a one-shot iterator, the second reader might be exhausted. This was presumably the same in the old implementation; verify that _df_to_reader handles this correctly (e.g. by re-creating the reader each time). The do_merge Rust function only receives left_table, so the double registration may be harmless — but range_operation will still register the same DataFrame under two table names, which is wasteful.

Nits / Minor Points

polars_bio/range_op.py:30-31: if TYPE_CHECKING: pass is a no-op. The import on line 32 is unconditional and could just be merged with the top-level imports.
polars_bio/range_op.py:582-583: Inside merge, suffixes = ("_1", "_2") is constructed only to be passed to _validate_overlap_input. Merge doesn't use suffixes; consider adding a dedicated _validate_merge_input or making the validation more targeted.
src/operation.rs:91: _ => panic!("Unsupported operation") — with the RangeOp enum, a CountOverlaps (non-naive?) variant seems to be missing from the match. Complement and Cluster variants exist in the enum but are not handled, so a Complement call would panic at runtime with a cryptic message. Consider either removing those enum variants or handling them explicitly.
Cargo.toml:17: The dependency pinned to a specific commit 6893418... as noted in the PR is intentional (pending upstream merge). The PR description already tracks this as a follow-up. Just ensure the rev is updated before merging to master.

Summary

The architecture is sound and the net ~260-line reduction is a genuine improvement. The main concerns are:

The float → int silent truncation for min_dist (user-facing correctness issue)
The duplicated hard-coded schema for merge output columns (maintainability)
The shared fixed table names (concurrency safety — lower priority if the library is single-threaded by design)

The upstream dependency pin is the blocking item per the PR description.

Replace hand-crafted SQL generation for overlap with the upstream OverlapProvider from datafusion-bio-function-ranges, and replace the ~160-line Python window-function merge implementation with the upstream MergeProvider routed through the Rust backend. This gains file-path and LazyFrame streaming support for merge (previously DataFrame-only) and removes ~260 lines of code. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

mwiewior force-pushed the feat/use-upstream-overlap-merge-providers branch 2 times, most recently from 7d8ab1a to 0e3c52b Compare February 17, 2026 19:42

mwiewior force-pushed the feat/use-upstream-overlap-merge-providers branch from 0e3c52b to 9f4ebc2 Compare February 17, 2026 19:45

mwiewior merged commit ff1242e into master Feb 17, 2026
14 checks passed

mwiewior deleted the feat/use-upstream-overlap-merge-providers branch February 17, 2026 20:03

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Comments

feat: Use upstream OverlapProvider and MergeProvider#310

feat: Use upstream OverlapProvider and MergeProvider#310
mwiewior merged 1 commit intomasterfrom
feat/use-upstream-overlap-merge-providers

mwiewior commented Feb 17, 2026

Uh oh!

mwiewior commented Feb 17, 2026

Uh oh!

claude bot commented Feb 17, 2026 •

edited

Loading

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Comments

Conversation

mwiewior commented Feb 17, 2026

Summary

Test plan

Uh oh!

mwiewior commented Feb 17, 2026

Uh oh!

claude bot commented Feb 17, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

PR Review

Issues / Concerns

1. min_dist silently truncates floats — type mismatch in public API

2. Hard-coded "n_intervals" schema column name for Merge

3. do_overlap constructs SQL with backtick-quoted identifiers — injection risk on unusual column names

4. Table name collision risk for "overlap_result" and "merge_result"

5. do_range_operation: RangeOp::Merge logs the algorithm even though merge doesn't use it

6. range_operation: merge with a DataFrame input passes df as both df1 and df2

Nits / Minor Points

Summary

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

claude bot commented Feb 17, 2026 •

edited

Loading

1. `min_dist` silently truncates floats — type mismatch in public API

2. Hard-coded `"n_intervals"` schema column name for Merge

3. `do_overlap` constructs SQL with backtick-quoted identifiers — injection risk on unusual column names

4. Table name collision risk for `"overlap_result"` and `"merge_result"`

5. `do_range_operation`: `RangeOp::Merge` logs the algorithm even though merge doesn't use it

6. `range_operation`: merge with a DataFrame input passes `df` as both `df1` and `df2`