Commit a9947ac

Release 0.24.0 and blog post (#317)
1 parent 4f77835 commit a9947ac

7 files changed, 134 additions and 4 deletions


Cargo.lock

Lines changed: 1 addition & 1 deletion
Some generated files are not rendered by default.

Cargo.toml

Lines changed: 1 addition & 1 deletion
@@ -1,6 +1,6 @@
 [package]
 name = "polars_bio"
-version = "0.23.0"
+version = "0.24.0"
 edition = "2021"
 
 [lib]
Lines changed: 130 additions & 0 deletions
@@ -0,0 +1,130 @@
---
draft: false
date:
  created: 2026-02-20
categories:
  - performance
  - benchmarks
---

# Interval operations benchmark — update February 2026

## Introduction

Back in [September 2025](benchmark-operations-2025-09.md) we benchmarked three libraries across three operations. A lot has changed since then. In December 2025, pyranges1 published a [preprint](https://www.biorxiv.org/content/10.64898/2025.12.11.693639v1) describing its Rust-powered backend ([ruranges](https://github.com/pyranges/ruranges)) and an expanded set of interval operations. On the [polars-bio](https://github.com/biodatageeks/polars-bio) side, version 0.24.0 ships a fully rewritten range-operations engine that replaces the earlier sequila-native backend. The new engine is built on upstream DataFusion UDTF providers from [datafusion-bio-function-ranges](https://github.com/biodatageeks/datafusion-bio-functions): OverlapProvider, NearestProvider, and the new coverage, cluster, complement, merge, and subtract providers.

<!-- more -->

This rewrite also expanded the operation set from three to eight. In addition to **overlap**, **nearest**, and **count_overlaps**, polars-bio 0.24.0 supports **coverage**, **cluster**, **complement**, **merge**, and **subtract**, which together cover the everyday interval manipulation tasks that genomics workflows depend on.
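
To make the operation names concrete, here is a minimal pure-Python sketch of what three of them compute on a single chromosome. It is an illustration only, not polars-bio's implementation: the function names, the half-open `(start, end)` convention, the `chrom_length` parameter, and the choice not to merge intervals that merely touch are all assumptions made for this example, and real libraries differ on these details.

```python
from typing import List, Tuple

# Assumption for this sketch: half-open intervals, start <= x < end.
Interval = Tuple[int, int]


def merge(intervals: List[Interval]) -> List[Interval]:
    """Collapse overlapping intervals into their union."""
    merged: List[Interval] = []
    for start, end in sorted(intervals):
        if merged and start < merged[-1][1]:
            # Overlaps the previous block, so extend that block.
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


def complement(intervals: List[Interval], chrom_length: int) -> List[Interval]:
    """Return the gaps in [0, chrom_length) that no interval covers."""
    gaps: List[Interval] = []
    cursor = 0
    for start, end in merge(intervals):
        if start > cursor:
            gaps.append((cursor, start))
        cursor = max(cursor, end)
    if cursor < chrom_length:
        gaps.append((cursor, chrom_length))
    return gaps


def count_overlaps(queries: List[Interval], targets: List[Interval]) -> List[int]:
    """For each query, count the targets that share at least one position."""
    return [
        sum(1 for t_start, t_end in targets if q_start < t_end and t_start < q_end)
        for q_start, q_end in queries
    ]


toy = [(1, 5), (3, 8), (10, 12)]
print(merge(toy))
print(complement(toy, chrom_length=15))
print(count_overlaps([(4, 11)], toy))
```

The quadratic `count_overlaps` loop is chosen for readability; production engines rely on sorted or indexed structures instead.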

We also added a fourth contender: [Bioframe](https://github.com/open2c/bioframe), a pandas-based genomic interval library widely used in the 3D genomics community. This gives us a broader view of the Python genomic interval landscape.

For comparability with our previous benchmarks, we continue to use the same [AIList](/polars-bio/supplement/#real-dataset) dataset. All benchmark code and raw results are available in the [polars-bio-bench](https://github.com/biodatageeks/polars-bio-bench) repository.

## Setup

### Software versions

| Library | Version |
|---|---|
| polars-bio | 0.24.0 |
| pyranges1 | 1.2.0 |
| GenomicRanges | 0.8.4 |
| bioframe | 0.8.0 |

### Benchmark test cases

**Binary operations** (overlap, nearest, count_overlaps, coverage):

| Dataset pairs | Size | # of overlaps (1-based) |
|---|---|---|
| 1-2 & 2-1 | Small | 54,246 |
| 3-7 & 7-3 | Medium | 4,408,383 |
| 7-8 & 8-7 | Large | 307,184,634 |
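
The overlap counts above are labelled as 1-based, and the coordinate convention does affect the count. As a hedged illustration (the two predicates and their names are made up for this example and are not taken from any of the benchmarked libraries), two intervals that share only an endpoint overlap under a 1-based closed convention but not under a 0-based half-open one:

```python
from typing import Tuple

Interval = Tuple[int, int]


def overlaps_closed(a: Interval, b: Interval) -> bool:
    """1-based closed convention: both endpoints belong to the interval."""
    return a[0] <= b[1] and b[0] <= a[1]


def overlaps_half_open(a: Interval, b: Interval) -> bool:
    """0-based half-open convention: the end coordinate is excluded."""
    return a[0] < b[1] and b[0] < a[1]


first, second = (100, 200), (200, 300)
print(overlaps_closed(first, second))     # the shared position 200 counts
print(overlaps_half_open(first, second))  # position 200 is outside `first`
```

When comparing overlap counts between tools, it helps to confirm that they were configured with the same convention.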

**Unary operations** (cluster, complement, merge, subtract):

| Dataset | Size | Name (intervals) |
|---|---|---|
| 1 | Small | fBrain (199K) |
| 2 | Small | exons (439K) |
| 7 | Medium | ex-anno (1,194K) |
| 3 | Medium | chainOrnAna1 (1,957K) |
| 8 | Large | ex-rna (9,945K) |
| 5 | Large | chainXenTro3Link (50,981K) |

### Operations and tool support

| Operation | polars-bio | PyRanges1 | GenomicRanges | Bioframe |
|---|---|---|---|---|
| overlap | yes | yes | yes | yes |
| nearest | yes | yes | yes | yes |
| count_overlaps | yes | yes | yes | yes |
| coverage | yes | -- | yes | yes |
| cluster | yes | yes | -- | yes |
| complement | yes | yes | yes | yes |
| merge | yes | yes | yes | yes |
| subtract | yes | yes | -- | yes |

## Results

### Speedup comparison across all operations

![all_operations_speedup_comparison.png](figures/benchmark-operations-2026-02/all_operations_speedup_comparison.png)
!!! info
    1. Missing bars indicate that the operation is not supported by the library for the given dataset.
    2. Crash bars indicate that the library failed to complete.

Key takeaways:

- **polars-bio** is the fastest library in 7 out of 8 operations on the large dataset (8-7). The sole exception is **nearest**, where **GenomicRanges** holds a 1.63x advantage.
- On small datasets (1-2, 2-1), **GenomicRanges** leads in overlap (1.74x) and count_overlaps, likely reflecting lower per-call overhead for small inputs.
- **Bioframe** falls 5-50x behind polars-bio depending on the operation and dataset size. On the large dataset it is the slowest library in six of the eight operations; in complement and merge, **GenomicRanges** is slower still (0.02x).
- For the new operations (coverage, cluster, complement, merge, subtract), **polars-bio** leads across the board. **PyRanges1** is a respectable second in the four it supports (0.23-0.61x relative to polars-bio on the large dataset). For coverage, which PyRanges1 does not support, **GenomicRanges** is second at 0.36x.

**Summary: speed on dataset 8-7 relative to polars-bio (polars-bio = 1.00x; values below 1.00x mean the library is slower than polars-bio, values above 1.00x mean it is faster):**

| Operation | polars-bio | PyRanges1 | GenomicRanges | Bioframe |
|---|---|---|---|---|
| overlap | 1.00x | 0.18x | 0.73x | 0.13x |
| nearest | 1.00x | 0.06x | 1.63x | 0.04x |
| count_overlaps | 1.00x | 0.24x | 0.19x | 0.02x |
| coverage | 1.00x | -- | 0.36x | 0.05x |
| cluster | 1.00x | 0.61x | -- | 0.20x |
| complement | 1.00x | 0.53x | 0.02x | 0.06x |
| merge | 1.00x | 0.51x | 0.02x | 0.12x |
| subtract | 1.00x | 0.23x | -- | 0.05x |
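
The relative speeds in this table can be inverted to read them as "polars-bio is N times faster". The sketch below does that arithmetic and picks the closest competitor per operation. The numbers are copied from the table; the dictionary layout and function names are assumptions introduced only for this example.

```python
from typing import Dict, Optional, Tuple

# Relative speeds on dataset 8-7, copied from the summary table
# (polars-bio = 1.00x; None marks an unsupported operation).
RELATIVE_SPEED: Dict[str, Dict[str, Optional[float]]] = {
    "overlap": {"PyRanges1": 0.18, "GenomicRanges": 0.73, "Bioframe": 0.13},
    "nearest": {"PyRanges1": 0.06, "GenomicRanges": 1.63, "Bioframe": 0.04},
    "count_overlaps": {"PyRanges1": 0.24, "GenomicRanges": 0.19, "Bioframe": 0.02},
    "coverage": {"PyRanges1": None, "GenomicRanges": 0.36, "Bioframe": 0.05},
    "cluster": {"PyRanges1": 0.61, "GenomicRanges": None, "Bioframe": 0.20},
    "complement": {"PyRanges1": 0.53, "GenomicRanges": 0.02, "Bioframe": 0.06},
    "merge": {"PyRanges1": 0.51, "GenomicRanges": 0.02, "Bioframe": 0.12},
    "subtract": {"PyRanges1": 0.23, "GenomicRanges": None, "Bioframe": 0.05},
}


def polars_bio_lead(relative_speed: float) -> float:
    """Turn a relative speed (e.g. 0.18) into polars-bio's lead factor (1 / 0.18)."""
    return 1.0 / relative_speed


def closest_competitor(operation: str) -> Tuple[str, float]:
    """Return (library, relative speed) of the fastest library other than polars-bio."""
    supported = {
        lib: ratio
        for lib, ratio in RELATIVE_SPEED[operation].items()
        if ratio is not None
    }
    library = max(supported, key=lambda lib: supported[lib])
    return library, supported[library]


for op in RELATIVE_SPEED:
    lib, ratio = closest_competitor(op)
    # A lead below 1.00x means the competitor is faster than polars-bio.
    print(f"{op:15s} closest: {lib:14s} polars-bio lead: {polars_bio_lead(ratio):.2f}x")
```

A lead factor below 1.00x, as for nearest, means the competitor is the faster library for that operation.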

### Thread scalability (dataset 8-7)

![polars_bio_scalability_8_7.png](figures/benchmark-operations-2026-02/polars_bio_scalability_8_7.png)

One of polars-bio's key advantages is transparent multithreaded execution. The table below shows wall-clock times for all eight operations on the large dataset (8-7) at 1 and 8 threads:

| Operation | 1 thread | 8 threads | Speedup |
|---|---|---|---|
| overlap | 4.03s | 0.74s | 5.43x |
| nearest | 2.55s | 0.44s | 5.77x |
| count_overlaps | 1.55s | 0.27s | 5.75x |
| coverage | 1.20s | 0.22s | 5.48x |
| subtract | 1.15s | 0.17s | 6.59x |
| merge | 0.29s | 0.05s | 5.34x |
| complement | 0.29s | 0.06s | 4.99x |
| cluster | 0.48s | 0.15s | 3.27x |

Key takeaways:

- Most operations scale well, reaching 5-6x speedup at 8 threads (roughly 60-80% parallel efficiency).
- **subtract** scales best at 6.59x, while **cluster** shows more modest scaling (3.27x). This is expected, as clustering is inherently sequential within each chromosome.
- At 8 threads, **overlap** on 307M result pairs completes in just 0.74 seconds.
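
The scaling figures can be cross-checked from the timing table with a few lines of arithmetic. The sketch below computes speedup, parallel efficiency (speedup divided by thread count), and the Karp-Flatt estimate of the serial fraction. The times are copied from the table; the function name and layout are assumptions for this example. Because the published times are rounded to two decimals, the recomputed speedups differ slightly from the Speedup column, which was presumably derived from unrounded timings.

```python
from typing import Dict, Tuple

THREADS = 8

# (1-thread seconds, 8-thread seconds), copied from the scalability table.
TIMES: Dict[str, Tuple[float, float]] = {
    "overlap": (4.03, 0.74),
    "nearest": (2.55, 0.44),
    "count_overlaps": (1.55, 0.27),
    "coverage": (1.20, 0.22),
    "subtract": (1.15, 0.17),
    "merge": (0.29, 0.05),
    "complement": (0.29, 0.06),
    "cluster": (0.48, 0.15),
}


def scaling_metrics(t_serial: float, t_parallel: float, threads: int) -> Tuple[float, float, float]:
    """Return (speedup, parallel efficiency, Karp-Flatt serial fraction)."""
    speedup = t_serial / t_parallel
    efficiency = speedup / threads
    # Karp-Flatt metric: experimentally determined serial fraction.
    serial_fraction = (1.0 / speedup - 1.0 / threads) / (1.0 - 1.0 / threads)
    return speedup, efficiency, serial_fraction


for op, (t1, t8) in TIMES.items():
    speedup, efficiency, serial = scaling_metrics(t1, t8, THREADS)
    print(f"{op:15s} speedup {speedup:4.2f}x  efficiency {efficiency:4.0%}  serial ~{serial:4.0%}")
```

On these rounded numbers, cluster comes out at about 40% efficiency with an estimated serial fraction near 21%, which is consistent with the note that clustering parallelizes less well than the other operations.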

## Summary

- **polars-bio** is the fastest single-threaded library for 7 out of 8 operations at scale. Going by the 8-7 summary table, its lead ranges from about 1.4x over the closest competitor (GenomicRanges on overlap, 0.73x) to about 50x over the slowest entries (0.02x).
- **GenomicRanges** wins the nearest operation and leads overlap and count_overlaps on small datasets, making it a solid choice when input sizes are modest.
- polars-bio scales well across threads (5-6x at 8 threads for most operations, 3.27x for cluster), turning already-fast single-threaded times into sub-second performance on the large dataset.
- **PyRanges1**, despite its recent [preprint](https://www.biorxiv.org/content/10.64898/2025.12.11.693639v1) claiming "ultrafast" performance, is slower than polars-bio in every operation it supports on the large dataset (0.06-0.61x). While it is a solid improvement over its predecessor, the "ultrafast" characterization does not hold when compared against polars-bio's [DataFusion](https://datafusion.apache.org/)-based engine.
- **Bioframe** is the slowest library in most operations on the large dataset. The exceptions are complement and merge, where GenomicRanges is slower.
- The expanded eight-operation benchmark shows that polars-bio's advantage extends beyond overlap and count_overlaps to the five new operations: coverage, cluster, complement, merge, and subtract.

All benchmark code, raw results, and additional figures are available at [polars-bio-bench](https://github.com/biodatageeks/polars-bio-bench).
Two binary files (1.13 MB and 382 KB), not rendered.

polars_bio/__init__.py

Lines changed: 1 addition & 1 deletion
@@ -107,7 +107,7 @@
 
 POLARS_BIO_MAX_THREADS = "datafusion.execution.target_partitions"
 
-__version__ = "0.23.0"
+__version__ = "0.24.0"
 __all__ = [
     "ctx",
     "FilterOp",

pyproject.toml

Lines changed: 1 addition & 1 deletion
@@ -4,7 +4,7 @@ build-backend = "maturin"
 
 [project]
 name = "polars-bio"
-version = "0.23.0"
+version = "0.24.0"
 description = "Blazing fast genomic operations on large Python dataframes"
 authors = []
 requires-python = ">=3.10"
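
This release touches the same version string in three text files (`Cargo.toml`, `pyproject.toml`, and `polars_bio/__init__.py`). A small check like the one below could catch a bump that misses one of them. It is a hypothetical helper, not part of the repository: the function names are invented here, and the regex assumes the package's own `version = "..."` or `__version__ = "..."` line is the first such line in each file, which holds for the hunks shown in this commit.

```python
import re
from typing import Dict

# Matches lines such as `version = "0.24.0"` or `__version__ = "0.24.0"`.
VERSION_LINE = re.compile(r'^(?:version|__version__)\s*=\s*"([^"]+)"', re.MULTILINE)


def first_version(text: str) -> str:
    """Return the first version string found at the start of a line."""
    match = VERSION_LINE.search(text)
    if match is None:
        raise ValueError("no version line found")
    return match.group(1)


def versions_agree(files: Dict[str, str]) -> bool:
    """True when every file's first version string is identical."""
    return len({first_version(text) for text in files.values()}) == 1


# Inline stand-ins for the three files; a real check would read them from disk.
example_files = {
    "Cargo.toml": '[package]\nname = "polars_bio"\nversion = "0.24.0"\nedition = "2021"\n',
    "pyproject.toml": '[project]\nname = "polars-bio"\nversion = "0.24.0"\n',
    "polars_bio/__init__.py": '__version__ = "0.24.0"\n__all__ = [\n    "ctx",\n]\n',
}
print(versions_agree(example_files))
```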
