Commit 4dac3b7

mwiewior and claude committed
docs: Add cluster, complement, and subtract to range operations documentation
Update feature comparison table, API comparison table, coordinate system mermaid diagram, and algorithm description to include the newly implemented cluster, complement, and subtract operations.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
1 parent 603ad92 commit 4dac3b7
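
For readers unfamiliar with the three operations named in the commit message, the following is a minimal pure-Python sketch of their usual semantics. It is not the polars-bio implementation and does not call the polars-bio API. The function names, the `(start, end)` tuple representation, the single-chromosome restriction, the half-open `[start, end)` coordinates, and the rule that touching intervals do not overlap are all assumptions made for illustration.

```python
from typing import List, Tuple

# Assumption: half-open [start, end) intervals on a single chromosome.
Interval = Tuple[int, int]


def _merge(intervals: List[Interval]) -> List[Interval]:
    """Collapse overlapping intervals into disjoint, sorted intervals."""
    merged: List[Interval] = []
    for start, end in sorted(intervals):
        if merged and start < merged[-1][1]:
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


def cluster(intervals: List[Interval]) -> List[Tuple[int, int, int]]:
    """Unary: tag each interval with the id of its overlapping group."""
    out = []
    cluster_id, cluster_end = -1, None
    for start, end in sorted(intervals):
        if cluster_end is None or start >= cluster_end:
            cluster_id += 1  # gap before this interval: open a new cluster
            cluster_end = end
        else:
            cluster_end = max(cluster_end, end)
        out.append((start, end, cluster_id))
    return out


def complement(intervals: List[Interval], view_start: int, view_end: int) -> List[Interval]:
    """Unary: return the gaps inside [view_start, view_end) not covered by any interval."""
    gaps: List[Interval] = []
    cursor = view_start
    for start, end in _merge(intervals):
        if start > cursor:
            gaps.append((cursor, min(start, view_end)))
        cursor = max(cursor, end)
        if cursor >= view_end:
            break
    if cursor < view_end:
        gaps.append((cursor, view_end))
    return gaps


def subtract(left: List[Interval], right: List[Interval]) -> List[Interval]:
    """Binary: return the parts of `left` intervals not covered by any `right` interval."""
    result: List[Interval] = []
    blockers = _merge(right)
    for start, end in sorted(left):
        cursor = start
        for b_start, b_end in blockers:
            if b_end <= cursor:
                continue  # blocker lies entirely before the remaining piece
            if b_start >= end:
                break  # blocker lies entirely after this interval
            if b_start > cursor:
                result.append((cursor, b_start))
            cursor = max(cursor, b_end)
            if cursor >= end:
                break
        if cursor < end:
            result.append((cursor, end))
    return result


regions = [(1, 5), (3, 8), (10, 12)]
print(cluster(regions))            # [(1, 5, 0), (3, 8, 0), (10, 12, 1)]
print(complement(regions, 0, 15))  # [(0, 1), (8, 10), (12, 15)]
print(subtract([(0, 10), (20, 30)], [(3, 5), (8, 25)]))  # [(0, 3), (5, 8), (25, 30)]
```

The sketch mirrors the split described in the docs/supplement.md change: `cluster` and `complement` take one interval set, while `subtract` takes two. Real libraries differ on details such as whether adjacent intervals count as overlapping, so the boundary rule used here should be read as one possible choice.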

5 files changed: +27 -14 lines changed

Cargo.lock

Lines changed: 3 additions & 3 deletions
Some generated files are not rendered by default.

Cargo.toml

Lines changed: 2 additions & 2 deletions
@@ -35,8 +35,8 @@ datafusion-bio-format-bed = { git = "https://github.com/biodatageeks/datafusion-
 datafusion-bio-format-fasta = { git = "https://github.com/biodatageeks/datafusion-bio-formats.git", rev = "4ba1ca3e108a5edc5d31d03bacbe04f2ddf0b64d" }
 datafusion-bio-format-pairs = { git = "https://github.com/biodatageeks/datafusion-bio-formats.git", rev = "4ba1ca3e108a5edc5d31d03bacbe04f2ddf0b64d" }
 
-datafusion-bio-function-ranges = { git = "https://github.com/biodatageeks/datafusion-bio-functions.git", rev = "d56c1aead2c28634f01e3c889e0b7bf34a7f477f" }
-datafusion-bio-function-pileup = { git = "https://github.com/biodatageeks/datafusion-bio-functions.git", rev = "d56c1aead2c28634f01e3c889e0b7bf34a7f477f", default-features = false }
+datafusion-bio-function-ranges = { git = "https://github.com/biodatageeks/datafusion-bio-functions.git", rev = "8e9acb0ad32d7e8990e6df2f50b9d4bcb9100ab5" }
+datafusion-bio-function-pileup = { git = "https://github.com/biodatageeks/datafusion-bio-functions.git", rev = "8e9acb0ad32d7e8990e6df2f50b9d4bcb9100ab5", default-features = false }
 
 async-trait = "0.1.86"
 futures = "0.3.31"

docs/api.md

Lines changed: 1 addition & 1 deletion
@@ -2,7 +2,7 @@ polars-bio API is grouped into the following categories:
 
 - **[File I/O](#polars_bio.data_input)**: Reading files in various biological formats from **local** and **[cloud](/polars-bio/features/#cloud-storage)** storage.
 - **[Data Processing](#polars_bio.data_processing)**: Exposing end user to the rich **SQL** programming interface powered by [Apache Datafusion](https://datafusion.apache.org/user-guide/sql/index.html) for operations, such as sorting, filtering and other transformations on input bioinformatic datasets registered as tables. You can easily query and process file formats such as *VCF*, *GFF*, *BAM*, *FASTQ*, *Pairs* using SQL syntax.
-- **[Interval Operations](#polars_bio.range_operations)**: Functions for performing common interval operations, such as *overlap*, *nearest*, *coverage*.
+- **[Interval Operations](#polars_bio.range_operations)**: Functions for performing common interval operations, such as *overlap*, *nearest*, *coverage*, *merge*, *cluster*, *complement*, and *subtract*.
 - **[Pileup Operations](#polars_bio.pileup_operations)**: Per-base read depth computation from BAM/SAM/CRAM files using CIGAR operations, similar to mosdepth/samtools depth.
 
 There are 2 ways of using polars-bio API:

docs/features.md

Lines changed: 20 additions & 7 deletions
@@ -5,9 +5,10 @@
 | [overlap](api.md#polars_bio.overlap) | :white_check_mark: | :white_check_mark: | :white_check_mark: | :white_check_mark: | :white_check_mark: | :white_check_mark: |
 | [nearest](api.md#polars_bio.nearest) | :white_check_mark: | :white_check_mark: | :white_check_mark: | :white_check_mark: | | :white_check_mark: |
 | [count_overlaps](api.md#polars_bio.count_overlaps) | :white_check_mark: | :white_check_mark: | :white_check_mark: | :white_check_mark: | :white_check_mark: | :white_check_mark: |
-| cluster | :white_check_mark: | | :white_check_mark: | :white_check_mark: | | |
+| [cluster](api.md#polars_bio.cluster) | :white_check_mark: | :white_check_mark: | :white_check_mark: | :white_check_mark: | | |
 | [merge](api.md#polars_bio.merge) | :white_check_mark: | :white_check_mark: | :white_check_mark: | :white_check_mark: | | :white_check_mark: |
-| complement | :white_check_mark: | :construction: | | :white_check_mark: | :white_check_mark: | |
+| [complement](api.md#polars_bio.complement) | :white_check_mark: | :white_check_mark: | | :white_check_mark: | :white_check_mark: | |
+| [subtract](api.md#polars_bio.subtract) | :white_check_mark: | :white_check_mark: | | :white_check_mark: | | |
 | [coverage](api.md#polars_bio.coverage) | :white_check_mark: | :white_check_mark: | :white_check_mark: | :white_check_mark: | | :white_check_mark: |
 | [expand](api.md#polars_bio.LazyFrame.expand) | :white_check_mark: | :white_check_mark: | :white_check_mark: | :white_check_mark: | | :white_check_mark: |
 | [sort](api.md#polars_bio.LazyFrame.sort_bedframe) | :white_check_mark: | :white_check_mark: | :white_check_mark: | :white_check_mark: | | :white_check_mark: |
@@ -78,6 +79,9 @@ flowchart TB
 count["count_overlaps()"]
 coverage["coverage()"]
 merge["merge()"]
+cluster["cluster()"]
+complement["complement()"]
+subtract["subtract()"]
 end
 
 subgraph Validation["Metadata Validation"]
@@ -97,13 +101,19 @@ flowchart TB
 polars_meta --> count
 polars_meta --> coverage
 polars_meta --> merge
+polars_meta --> cluster
+polars_meta --> complement
+polars_meta --> subtract
 pandas_meta --> overlap
 
 overlap --> validate
 nearest --> validate
 count --> validate
 coverage --> validate
 merge --> validate
+cluster --> validate
+complement --> validate
+subtract --> validate
 
 validate --> |"metadata missing"| check
 validate --> |"metadata mismatch"| error2
@@ -524,11 +534,14 @@ result = pb.sql(f"SELECT COUNT(*) FROM {table_name}")
 There is no standard API for genomic ranges operations in Python.
 This table compares the API of the libraries. The table is not exhaustive and only shows the most common operations used in benchmarking.
 
-| operation | Bioframe | polars-bio | PyRanges0 | PyRanges1 | Pybedtools | GenomicRanges |
-|------------|---------------------------------------------------------------------------------------------------------|--------------------------------------------|---------------------------------------------------------------------------------------------------------------------|---------------------------------------------------------------------------------------------------------------|-----------------------------------------------------------------------------------------------------------------------------------------------|----------------------------------------------------------------------------------------------------------------------------------------------------|
-| overlap | [overlap](https://bioframe.readthedocs.io/en/latest/api-intervalops.html#bioframe.ops.overlap) | [overlap](api.md#polars_bio.overlap) | [join](https://pyranges.readthedocs.io/en/latest/autoapi/pyranges/index.html#pyranges.PyRanges.join)<sup>1</sup> | [join_ranges](https://pyranges1.readthedocs.io/en/latest/pyranges_objects.html#pyranges.PyRanges.join_ranges) | [intersect](https://bedtools.readthedocs.io/en/latest/content/tools/intersect.html?highlight=intersect#usage-and-option-summary)<sup>2</sup> | [find_overlaps](https://biocpy.github.io/GenomicRanges/api/genomicranges.html#genomicranges.GenomicRanges.GenomicRanges.find_overlaps)<sup>3</sup> |
-| nearest | [closest](https://bioframe.readthedocs.io/en/latest/api-intervalops.html#bioframe.ops.closest) | [nearest](api.md#polars_bio.nearest) | [nearest](https://pyranges.readthedocs.io/en/latest/autoapi/pyranges/index.html#pyranges.PyRanges.nearest) | [nearest](https://pyranges1.readthedocs.io/en/latest/pyranges_objects.html#pyranges.PyRanges.nearest) | [closest](https://daler.github.io/pybedtools/autodocs/pybedtools.bedtool.BedTool.closest.html#pybedtools.bedtool.BedTool.closest)<sup>4</sup> | [nearest](https://biocpy.github.io/GenomicRanges/api/genomicranges.html#genomicranges.GenomicRanges.GenomicRanges.nearest)<sup>5</sup> |
-| read_table | [read_table](https://bioframe.readthedocs.io/en/latest/api-fileops.html#bioframe.io.fileops.read_table) | [read_table](api.md#polars_bio.read_table) | [read_bed](https://pyranges.readthedocs.io/en/latest/autoapi/pyranges/readers/index.html#pyranges.readers.read_bed) | [read_bed](https://pyranges1.readthedocs.io/en/latest/pyranges_module.html#pyranges.read_bed) | [BedTool](https://daler.github.io/pybedtools/topical-create-a-bedtool.html#creating-a-bedtool) | [read_bed](https://biocpy.github.io/GenomicRanges/tutorial.html#from-bioinformatic-file-formats) |
+| operation | Bioframe | polars-bio | PyRanges0 | PyRanges1 | Pybedtools | GenomicRanges |
+|----------------|---------------------------------------------------------------------------------------------------------|--------------------------------------------------|---------------------------------------------------------------------------------------------------------------------|---------------------------------------------------------------------------------------------------------------|-----------------------------------------------------------------------------------------------------------------------------------------------|----------------------------------------------------------------------------------------------------------------------------------------------------|
+| overlap | [overlap](https://bioframe.readthedocs.io/en/latest/api-intervalops.html#bioframe.ops.overlap) | [overlap](api.md#polars_bio.overlap) | [join](https://pyranges.readthedocs.io/en/latest/autoapi/pyranges/index.html#pyranges.PyRanges.join)<sup>1</sup> | [join_ranges](https://pyranges1.readthedocs.io/en/latest/pyranges_objects.html#pyranges.PyRanges.join_ranges) | [intersect](https://bedtools.readthedocs.io/en/latest/content/tools/intersect.html?highlight=intersect#usage-and-option-summary)<sup>2</sup> | [find_overlaps](https://biocpy.github.io/GenomicRanges/api/genomicranges.html#genomicranges.GenomicRanges.GenomicRanges.find_overlaps)<sup>3</sup> |
+| nearest | [closest](https://bioframe.readthedocs.io/en/latest/api-intervalops.html#bioframe.ops.closest) | [nearest](api.md#polars_bio.nearest) | [nearest](https://pyranges.readthedocs.io/en/latest/autoapi/pyranges/index.html#pyranges.PyRanges.nearest) | [nearest](https://pyranges1.readthedocs.io/en/latest/pyranges_objects.html#pyranges.PyRanges.nearest) | [closest](https://daler.github.io/pybedtools/autodocs/pybedtools.bedtool.BedTool.closest.html#pybedtools.bedtool.BedTool.closest)<sup>4</sup> | [nearest](https://biocpy.github.io/GenomicRanges/api/genomicranges.html#genomicranges.GenomicRanges.GenomicRanges.nearest)<sup>5</sup> |
+| cluster | [cluster](https://bioframe.readthedocs.io/en/latest/api-intervalops.html#bioframe.ops.cluster) | [cluster](api.md#polars_bio.cluster) | [cluster](https://pyranges.readthedocs.io/en/latest/autoapi/pyranges/index.html#pyranges.PyRanges.cluster) | | [cluster](https://bedtools.readthedocs.io/en/latest/content/tools/cluster.html) | |
+| complement | [complement](https://bioframe.readthedocs.io/en/latest/api-intervalops.html#bioframe.ops.complement) | [complement](api.md#polars_bio.complement) | | | [complement](https://bedtools.readthedocs.io/en/latest/content/tools/complement.html) | |
+| subtract | [subtract](https://bioframe.readthedocs.io/en/latest/api-intervalops.html#bioframe.ops.subtract) | [subtract](api.md#polars_bio.subtract) | | | [subtract](https://bedtools.readthedocs.io/en/latest/content/tools/subtract.html) | |
+| read_table | [read_table](https://bioframe.readthedocs.io/en/latest/api-fileops.html#bioframe.io.fileops.read_table) | [read_table](api.md#polars_bio.read_table) | [read_bed](https://pyranges.readthedocs.io/en/latest/autoapi/pyranges/readers/index.html#pyranges.readers.read_bed) | [read_bed](https://pyranges1.readthedocs.io/en/latest/pyranges_module.html#pyranges.read_bed) | [BedTool](https://daler.github.io/pybedtools/topical-create-a-bedtool.html#creating-a-bedtool) | [read_bed](https://biocpy.github.io/GenomicRanges/tutorial.html#from-bioinformatic-file-formats) |
 
 !!! note
 1. There is an [overlap](https://pyranges.readthedocs.io/en/latest/autoapi/pyranges/index.html#pyranges.PyRanges.overlap) method in PyRanges, but its output is only limited to indices of intervals from the other Dataframe that overlap.

docs/supplement.md

Lines changed: 1 addition & 1 deletion
@@ -3,7 +3,7 @@
 This document provides additional information about the algorithms, benchmarking setup, data, and results that were presented in the manuscript.
 
 ## Algorithm description
-`polars-bio` implements a set of *binary* interval operations on genomic ranges, such as *overlap*, *nearest*, *count-overlaps*, and *coverage*. All these operations share the very similar algorithmic structure, which is presented in the diagram below.
+`polars-bio` implements a set of interval operations on genomic ranges, including *binary* operations (*overlap*, *nearest*, *count-overlaps*, *coverage*, *subtract*) and *unary* operations (*merge*, *cluster*, *complement*). The binary operations share a very similar algorithmic structure, which is presented in the diagram below. The unary operations (*merge*, *cluster*, *complement*) take a single set of intervals and produce transformed output — merged intervals, cluster assignments, or gap intervals respectively.
 
 
 ``` mermaid
