Skip to content

Comments

feat: Add cluster, complement, and subtract range operations#311

Merged
mwiewior merged 7 commits intomasterfrom
feat/cluster-complement-subtract
Feb 19, 2026
Merged

feat: Add cluster, complement, and subtract range operations#311
mwiewior merged 7 commits intomasterfrom
feat/cluster-complement-subtract

Conversation

@mwiewior
Copy link
Collaborator

Summary

  • Integrate three new bioframe-compatible interval operations from upstream datafusion-bio-functions PR #17:
    • cluster: Assign cluster IDs to overlapping/nearby intervals (single-table, like merge)
    • complement: Compute gaps between intervals with optional chromsizes view table (new pattern)
    • subtract: Remove overlapping portions from one set of intervals using another (two-table, like overlap)
  • All operations support the full polars-bio pipeline: DataFrame, LazyFrame, pandas, file paths, and projection pushdown
  • Added 6 new tests comparing output against bioframe reference (all 16 bioframe tests pass)

Changes

Rust:

  • Updated datafusion-bio-function-ranges upstream dependency to rev with ClusterProvider, ComplementProvider, SubtractProvider
  • Added Subtract = 5 to RangeOp enum, view_table/view_columns fields to RangeOptions
  • Added do_cluster, do_complement, do_subtract dispatch functions in operation.rs

Python:

  • Added schema generators: _generate_cluster_schema, _generate_complement_schema, _generate_subtract_schema
  • Added cluster(), complement(), subtract() methods on IntervalOperations
  • Added _register_view_table() helper for complement's chromsizes view registration
  • Exposed pb.cluster(), pb.complement(), pb.subtract() in public API

Test plan

  • cargo check passes
  • maturin develop --release builds successfully
  • python -m pytest tests/test_bioframe.py -v — all 16 tests pass (6 new)
  • Smoke test with simple DataFrames for all 3 operations

🤖 Generated with Claude Code

@mwiewior mwiewior force-pushed the feat/cluster-complement-subtract branch 2 times, most recently from 88147be to bd62c71 Compare February 17, 2026 21:56
@mwiewior mwiewior force-pushed the feat/cluster-complement-subtract branch 2 times, most recently from b74cbec to b069f28 Compare February 18, 2026 13:38
Integrate three new bioframe-compatible interval operations from upstream
datafusion-bio-functions PR #17:

- cluster: Assign cluster IDs to overlapping/nearby intervals
- complement: Compute gaps between intervals (with optional chromsizes view)
- subtract: Remove overlapping portions of intervals from another set

All operations support the full polars-bio pipeline: DataFrame, LazyFrame,
pandas, file paths, and projection pushdown.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@mwiewior mwiewior force-pushed the feat/cluster-complement-subtract branch from b069f28 to 6f22c04 Compare February 19, 2026 05:35
Verifies that operations correctly handle DataFrames with extra columns
beyond the core contig/pos_start/pos_end triplet. Covers overlap, nearest,
count_overlaps, coverage, merge, cluster, complement, and subtract across
pandas, polars DataFrame, and polars LazyFrame output types.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@mwiewior mwiewior force-pushed the feat/cluster-complement-subtract branch 2 times, most recently from f741b26 to d96527f Compare February 19, 2026 07:54
…entation

Update feature comparison table, API comparison table, coordinate system
mermaid diagram, and algorithm description to include the newly implemented
cluster, complement, and subtract operations.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@mwiewior mwiewior force-pushed the feat/cluster-complement-subtract branch from d96527f to 4dac3b7 Compare February 19, 2026 08:59
mwiewior and others added 2 commits February 19, 2026 10:23
Verify and correct feature matrix against actual library APIs:
- subtract: add PyRanges0, PyGenomics, GenomicRanges support
- complement: add GenomicRanges support (via gaps())
- merge: add PyGenomics support (via iterate_reduced())
- count_overlaps: remove incorrect PyGenomics checkmark

Add documentation links in API comparison table:
- PyRanges1: cluster_overlaps, complement_ranges, subtract_overlaps,
  join_overlaps, nearest_ranges
- PyRanges0: subtract
- GenomicRanges: gaps (complement), subtract

Add footnotes 6-9 with links for methods with different names
(iterate_reduced, reduce, iterate_complement, gaps).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@mwiewior
Copy link
Collaborator Author

@claude pls review

@claude
Copy link

claude bot commented Feb 19, 2026

Claude finished @mwiewior's task in 2m 45s —— View job


Code Review

  • Read changed files and gather context
  • Analyze Rust changes (operation.rs, option.rs, Cargo.toml)
  • Analyze Python changes (range_op.py, range_op_helpers.py, init.py)
  • Analyze tests
  • Analyze docs changes

Overall the implementation is clean, well-tested, and consistent with the existing codebase patterns. The new operations follow the established dispatch model (do_cluster, do_complement, do_subtract mirroring do_merge), bioframe tests pass, and the wide-DataFrame tests cover the expected column-dropping behavior.

A few findings are below — one is a meaningful bug risk, the rest are minor.


1. _register_view_table: id() for naming is unreliable (polars_bio/range_op.py:826)

table_name = f"_view_{id(view_df)}_{hash(contig_col) & 0xFFFFFFFF:08x}"

id() returns the memory address of the Python object, which can be reused after garbage collection. If view_df is a temporary value that gets GC'd between complement calls, a newly-allocated object can land at the same address, producing an identical table_name — and py_from_polars may then silently overwrite or fail on the already-registered table.

All other operations in this codebase use monotonically-incrementing AtomicU64 counters (e.g., CLUSTER_TABLE_COUNTER). The same pattern should be used here — either a Python-side itertools.count() or a new Rust-side counter. Fix this →


2. View tables are never deregistered (polars_bio/range_op.py:814-848)

Every call to complement(view_df=...) registers a new table in the global DataFusion context and never removes it. In a long-running session where complement is called repeatedly with different view DataFrames, these tables accumulate permanently. Consider either:

  • Deregistering the table after the operation completes, or
  • Caching by content hash so repeated calls with the same view reuse the existing registration.

3. _generate_complement_schema and _generate_subtract_schema are identical (polars_bio/range_op_helpers.py:110-135)

Both functions return {columns[0]: pl.Utf8, columns[1]: pl.Int64, columns[2]: pl.Int64}. One could call the other or both could delegate to a shared _generate_interval_only_schema(columns). Minor but worth noting.


4. Behavioral difference from bioframe not prominently documented

subtract and cluster drop extra columns from the input DataFrames, which differs from bioframe's behavior (bioframe preserves extra columns). This is correctly noted in test_wide_dataframes.py's module docstring but is not mentioned in the docstrings of cluster(), complement(), or subtract() themselves.

For cluster() specifically, the docstring currently says:

Returns ... with original interval columns plus cluster, cluster_start, cluster_end.

"original interval columns" could be read as all input columns. It actually means only the 3 coordinate columns from cols. A note like "Additional input columns are not preserved" would prevent confusion. Fix this →


5. Rust: do_cluster vs do_mergemin_dist type

In src/operation.rs, do_merge and do_cluster both pass min_dist to their respective providers:

let min_dist = range_opts.min_dist.unwrap_or(0);

min_dist in RangeOptions is Option<i64>. If ClusterProvider expects a u64 or usize, there's a potential silent truncation/cast. Not necessarily a bug if the upstream accepts i64, but worth verifying against the datafusion-bio-functions upstream signature.


6. complement with no view_df produces very large intervals

The docstring correctly states "each contig spans [0, i64::MAX)" when view_df=None. This means calling complement() without a view on real-world genomic data will return complement intervals stretching from coordinate 0 to 9.2×10¹⁸. Users are likely to be surprised. A warning via logging.warning() when view_df is None would be helpful for usability.


Nits

  • cluster() is missing a read_options parameter for file-path inputs (consistent with merge() which also lacks it, but overlap() has it). Noting for future parity.
  • The new RangeOp enum values (Complement=1, Cluster=2, Subtract=5) are interleaved with existing ones. This is fine functionally, but the non-contiguous integers (gap between 4 and 6) may be confusing in future maintenance.

Summary

The implementation is solid and well-integrated into the existing architecture. The primary issue worth fixing before merge is the id()-based table naming in _register_view_table (#1), as it is a real correctness risk in any session that GC's view DataFrames. The rest are improvements or documentation gaps.

mwiewior and others added 2 commits February 19, 2026 14:18
Integrate upstream PR #21 (datafusion-bio-functions) which adds extra
column preservation for ClusterProvider and SubtractProvider using a
two-path design with FullBatchCollector + Arrow take kernel.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Replace unreliable id()-based view table naming with monotonic counter
  (itertools.count) to prevent name collisions after GC (#1)
- Add warning when complement() is called without view_df, since contigs
  will span [0, i64::MAX) which is rarely useful (#6)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@mwiewior mwiewior force-pushed the feat/cluster-complement-subtract branch from b339ea4 to 412b246 Compare February 19, 2026 13:47
@mwiewior mwiewior merged commit d9a00f2 into master Feb 19, 2026
15 checks passed
mwiewior added a commit that referenced this pull request Feb 19, 2026
These range operations were added in #311 but were missing from the
Polars LazyFrame .pb extension namespace (polars_ext.py).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
mwiewior added a commit that referenced this pull request Feb 19, 2026
)

These range operations were added in #311 but were missing from the
Polars LazyFrame .pb extension namespace (polars_ext.py).

Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant