Skip to content

Comments

feat: Use upstream OverlapProvider and MergeProvider#310

Merged
mwiewior merged 1 commit intomasterfrom
feat/use-upstream-overlap-merge-providers
Feb 17, 2026
Merged

feat: Use upstream OverlapProvider and MergeProvider#310
mwiewior merged 1 commit intomasterfrom
feat/use-upstream-overlap-merge-providers

Conversation

@mwiewior
Copy link
Collaborator

Summary

  • Replace hand-crafted SQL generation for overlap in src/query.rs with the upstream OverlapProvider from datafusion-bio-function-ranges (PR Add overlap and merge UDTVFs datafusion-bio-functions#16)
  • Replace the ~160-line pure-Python window-function merge implementation with the upstream MergeProvider routed through the Rust backend
  • Add RangeOp::Merge variant and min_dist field to RangeOptions for merge configuration
  • Remove overlap_query() and unused projection_columns field from QueryParams
  • Clean up unused Python imports (pyarrow, set_coordinate_system, get_coordinate_system, logger)

Merge now gains file-path and LazyFrame streaming support (previously DataFrame-only). Net effect: ~260 lines removed.

Depends on: biodatageeks/datafusion-bio-functions#16 (update rev to main after merge)

Test plan

  • cargo check passes with no warnings
  • maturin develop --release builds successfully
  • All 30 targeted tests pass (test_bioframe.py, test_overlap_algorithms.py, test_suffix_handling.py)
  • Full test suite passes (401 tests)
  • Verify merge works with file paths and LazyFrame inputs (new capability)
  • Update datafusion-bio-function-ranges rev to main branch commit after upstream PR merges

🤖 Generated with Claude Code

@mwiewior
Copy link
Collaborator Author

@claude pls review

@claude
Copy link

claude bot commented Feb 17, 2026

Claude finished @mwiewior's task in 1m 39s —— View job


PR Review

  • Gather context and understand the request
  • Read changed files
  • Analyze code changes
  • Post review feedback

Overall this is a clean, well-motivated refactor. Replacing the hand-crafted Python merge implementation and the SQL-generation overlap logic with upstream providers shrinks the codebase meaningfully while gaining new capabilities (file-path and LazyFrame inputs for merge). Below are the findings, ordered roughly by severity.


Issues / Concerns

1. min_dist silently truncates floats — type mismatch in public API

polars_bio/range_op.py:594

range_options = RangeOptions(
    ...
    min_dist=int(min_dist),   # float → int truncation
)

The public signature accepts float (min_dist: float = 0) but the Rust field is Option<i64>. Fractional values like min_dist=0.5 are silently truncated to 0. The docstring says "Minimum distance between intervals to merge" without warning about this. Either:

  • change the public signature to int, or
  • add a validation/round-and-warn before casting.

Fix this →


2. Hard-coded "n_intervals" schema column name for Merge

polars_bio/range_op_helpers.py:148-155 and 236-245

elif range_options.range_op == RangeOp.Merge:
    cols = range_options.columns_1
    merged_schema = pl.Schema(
        {
            cols[0]: pl.Utf8,
            cols[1]: pl.Int64,
            cols[2]: pl.Int64,
            "n_intervals": pl.Int64,
        }
    )

The schema is duplicated verbatim in two branches (file-path path and DataFrame path). If MergeProvider ever changes its output column name or adds/removes columns, both places silently break. A small helper constant or function would reduce the coupling.


3. do_overlap constructs SQL with backtick-quoted identifiers — injection risk on unusual column names

src/operation.rs:166-185

for field in left_schema.fields() {
    select_parts.push(format!(
        "`left_{}` AS `{}{}`",
        field.name(),
        field.name(),
        suffixes.0
    ));
}

Column names containing backticks themselves (e.g. field`name) would break the generated SQL. This is a pre-existing pattern (also present in nearest_query), but worth flagging since do_overlap is new in this PR. The upstream OverlapProvider presumably surfaces these columns, so consider using DataFusion's expression API instead of SQL string building if feasible.


4. Table name collision risk for "overlap_result" and "merge_result"

src/operation.rs:160-163 and 251-254

let table_name = "overlap_result".to_string();
ctx.deregister_table(table_name.clone()).unwrap();
ctx.register_table(table_name.clone(), Arc::new(overlap_provider)).unwrap();

Using a shared SessionContext with fixed table names means concurrent calls (e.g. from multi-threaded Python code) would race. The previous SQL approach had the same issue for other operations, but the pattern is now used for two more operations. A random/unique suffix or a per-call context would be safer for future parallelism.


5. do_range_operation: RangeOp::Merge logs the algorithm even though merge doesn't use it

src/operation.rs:57-68

info!(
    "Running {} operation with algorithm {} and {} thread(s)...",
    range_options.range_op,
    ctx.state().config()...interval_join_algorithm,
    ...
);

For RangeOp::Merge, the algorithm field is irrelevant (merge doesn't do an interval join). The log message will print whatever algorithm was set by a prior call (or the default), which is misleading. The log should be moved inside the match arms that actually use the algorithm, or guarded.


6. range_operation: merge with a DataFrame input passes df as both df1 and df2

polars_bio/range_op.py:597-604

return range_operation(
    df,
    df,     # same object passed twice
    range_options,
    ...
)

range_operation will call _df_to_reader(df1, ...) and _df_to_reader(df2, ...) on the same object. Depending on whether _df_to_reader returns a one-shot iterator, the second reader might be exhausted. This was presumably the same in the old implementation; verify that _df_to_reader handles this correctly (e.g. by re-creating the reader each time). The do_merge Rust function only receives left_table, so the double registration may be harmless — but range_operation will still register the same DataFrame under two table names, which is wasteful.


Nits / Minor Points

  • polars_bio/range_op.py:30-31: if TYPE_CHECKING: pass is a no-op. The import on line 32 is unconditional and could just be merged with the top-level imports.

  • polars_bio/range_op.py:582-583: Inside merge, suffixes = ("_1", "_2") is constructed only to be passed to _validate_overlap_input. Merge doesn't use suffixes; consider adding a dedicated _validate_merge_input or making the validation more targeted.

  • src/operation.rs:91: _ => panic!("Unsupported operation") — with the RangeOp enum, a CountOverlaps (non-naive?) variant seems to be missing from the match. Complement and Cluster variants exist in the enum but are not handled, so a Complement call would panic at runtime with a cryptic message. Consider either removing those enum variants or handling them explicitly.

  • Cargo.toml:17: The dependency pinned to a specific commit 6893418... as noted in the PR is intentional (pending upstream merge). The PR description already tracks this as a follow-up. Just ensure the rev is updated before merging to master.


Summary

The architecture is sound and the net ~260-line reduction is a genuine improvement. The main concerns are:

  1. The float → int silent truncation for min_dist (user-facing correctness issue)
  2. The duplicated hard-coded schema for merge output columns (maintainability)
  3. The shared fixed table names (concurrency safety — lower priority if the library is single-threaded by design)

The upstream dependency pin is the blocking item per the PR description.

@mwiewior mwiewior force-pushed the feat/use-upstream-overlap-merge-providers branch 2 times, most recently from 7d8ab1a to 0e3c52b Compare February 17, 2026 19:42
Replace hand-crafted SQL generation for overlap with the upstream
OverlapProvider from datafusion-bio-function-ranges, and replace
the ~160-line Python window-function merge implementation with the
upstream MergeProvider routed through the Rust backend.

This gains file-path and LazyFrame streaming support for merge
(previously DataFrame-only) and removes ~260 lines of code.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@mwiewior mwiewior force-pushed the feat/use-upstream-overlap-merge-providers branch from 0e3c52b to 9f4ebc2 Compare February 17, 2026 19:45
@mwiewior mwiewior merged commit ff1242e into master Feb 17, 2026
14 checks passed
@mwiewior mwiewior deleted the feat/use-upstream-overlap-merge-providers branch February 17, 2026 20:03
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant