| 1 | +--- |
| 2 | +draft: false |
| 3 | +date: |
| 4 | + created: 2026-02-20 |
| 5 | +categories: |
| 6 | + - performance |
| 7 | + - benchmarks |
| 8 | +--- |
| 9 | + |
| 10 | +# Interval operations benchmark — update February 2026 |
| 11 | + |
| 12 | +## Introduction |
| 13 | + |
| 14 | +Back in [September 2025](benchmark-operations-2025-09.md) we benchmarked three libraries across three operations. A lot has changed since then. In December 2025, pyranges1 published a [preprint](https://www.biorxiv.org/content/10.64898/2025.12.11.693639v1) describing its Rust-powered backend ([ruranges](https://github.com/pyranges/ruranges)) and an expanded set of interval operations. On the [polars-bio](https://github.com/biodatageeks/polars-bio) side, version 0.24.0 ships a fully rewritten range-operations engine built on upstream DataFusion UDTF providers (OverlapProvider, NearestProvider, and the new coverage/cluster/complement/merge/subtract providers from [datafusion-bio-function-ranges](https://github.com/biodatageeks/datafusion-bio-functions)), replacing the earlier sequila-native backend. |
| 15 | + |
| 16 | +<!-- more --> |
| 17 | + |
| 18 | +This rewrite also expanded the operation set from three to eight. In addition to **overlap**, **nearest**, and **count_overlaps**, polars-bio 0.24.0 supports **coverage**, **cluster**, **complement**, **merge**, and **subtract** — covering all the everyday interval manipulation tasks that genomics workflows depend on. |
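To make the operation names concrete, here is a minimal pure-Python sketch of three of them on a single chromosome. It is only an illustration of the semantics, not how polars-bio or any benchmarked library implements them. The assumptions are mine: half-open `[start, end)` coordinates, touching intervals are merged, and all intervals lie inside `[0, chrom_len)`. The function names mirror the operation names but are hypothetical helpers.

```python
from typing import List, Tuple

Interval = Tuple[int, int]  # assumed half-open [start, end) on one chromosome


def merge(intervals: List[Interval]) -> List[Interval]:
    """Collapse overlapping or touching intervals into single spans."""
    merged: List[Interval] = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1]:
            # Interval overlaps or touches the current span: extend it.
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


def complement(intervals: List[Interval], chrom_len: int) -> List[Interval]:
    """Return the gaps of [0, chrom_len) that no interval covers."""
    gaps: List[Interval] = []
    cursor = 0
    for start, end in merge(intervals):
        if start > cursor:
            gaps.append((cursor, start))
        cursor = max(cursor, end)
    if cursor < chrom_len:
        gaps.append((cursor, chrom_len))
    return gaps


def count_overlaps(queries: List[Interval], targets: List[Interval]) -> List[int]:
    """For each query, count targets sharing at least one position (naive O(n*m))."""
    return [
        sum(1 for t_start, t_end in targets if q_start < t_end and t_start < q_end)
        for q_start, q_end in queries
    ]


toy = [(1, 5), (3, 8), (10, 12)]
print(merge(toy))
print(complement(toy, 15))
print(count_overlaps([(4, 11)], toy))
```

Real libraries replace the naive loops with sorted sweeps or interval trees and handle multiple chromosomes, strands, and coordinate conventions, which is where the performance differences measured below come from.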
| 19 | + |
| 20 | +We also added a fourth contender: [Bioframe](https://github.com/open2c/bioframe), a pandas-based genomic interval library widely used in the 3D genomics community. This gives us a broader view of the Python genomic interval landscape. |
| 21 | + |
| 22 | +For comparability with our previous benchmarks, we continue to use the same [AIList](/polars-bio/supplement/#real-dataset) dataset. All benchmark code and raw results are available in the [polars-bio-bench](https://github.com/biodatageeks/polars-bio-bench) repository. |
| 23 | + |
| 24 | +## Setup |
| 25 | + |
| 26 | +### Software versions |
| 27 | + |
| 28 | +| Library | Version | |
| 29 | +|---|---| |
| 30 | +| polars-bio | 0.24.0 | |
| 31 | +| pyranges1 | 1.2.0 | |
| 32 | +| GenomicRanges | 0.8.4 | |
| 33 | +| bioframe | 0.8.0 | |
| 34 | + |
| 35 | +### Benchmark test cases |
| 36 | + |
| 37 | +**Binary operations** (overlap, nearest, count_overlaps, coverage): |
| 38 | + |
| 39 | +| Dataset pairs | Size | # of overlaps (1-based) | |
| 40 | +|---|---|---| |
| 41 | +| 1-2 & 2-1 | Small | 54,246 | |
| 42 | +| 3-7 & 7-3 | Medium | 4,408,383 | |
| 43 | +| 7-8 & 8-7 | Large | 307,184,634 | |
| 44 | + |
| 45 | +**Unary operations** (cluster, complement, merge, subtract): |
| 46 | + |
| 47 | +| Dataset | Size | Name (intervals) | |
| 48 | +|---|---|---| |
| 49 | +| 1 | Small | fBrain (199K) | |
| 50 | +| 2 | Small | exons (439K) | |
| 51 | +| 7 | Medium | ex-anno (1,194K) | |
| 52 | +| 3 | Medium | chainOrnAna1 (1,957K) | |
| 53 | +| 8 | Large | ex-rna (9,945K) | |
| 54 | +| 5 | Large | chainXenTro3Link (50,981K) | |
| 55 | + |
| 56 | +### Operations and tool support |
| 57 | + |
| 58 | +| Operation | polars-bio | PyRanges1 | GenomicRanges | Bioframe | |
| 59 | +|---|---|---|---|---| |
| 60 | +| overlap | yes | yes | yes | yes | |
| 61 | +| nearest | yes | yes | yes | yes | |
| 62 | +| count_overlaps | yes | yes | yes | yes | |
| 63 | +| coverage | yes | -- | yes | yes | |
| 64 | +| cluster | yes | yes | -- | yes | |
| 65 | +| complement | yes | yes | yes | yes | |
| 66 | +| merge | yes | yes | yes | yes | |
| 67 | +| subtract | yes | yes | -- | yes | |
| 68 | + |
| 69 | +## Results |
| 70 | + |
| 71 | +### Speedup comparison across all operations |
| 72 | + |
| 73 | + |
| 74 | +!!! info |
| 75 | + 1. Missing bars indicate that the operation is not supported by the library for the given dataset. |
| 76 | + 2. Crash bars indicate that the library failed to complete. |
| 77 | + |
| 78 | +Key takeaways: |
| 79 | + |
| 80 | +- **polars-bio** is the fastest library in 7 out of 8 operations on the large dataset (8-7). The sole exception is **nearest**, where **GenomicRanges** holds a 1.63x advantage. |
| 81 | +- On small datasets (1-2, 2-1), **GenomicRanges** leads in overlap (1.74x) and count_overlaps, reflecting lower per-call overhead for small inputs. |
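| 82 | +- **Bioframe** is the slowest library in most operations, falling 5-50x behind polars-bio depending on the operation and dataset size. On the large dataset the exceptions are complement and merge, where **GenomicRanges** is slower still (0.02x). |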
| 83 | +- For the new operations (coverage, cluster, complement, merge, subtract), **polars-bio** leads across the board. **PyRanges1** is a respectable second in the four it supports (0.23-0.61x relative to polars-bio on the large dataset); for coverage, which PyRanges1 does not support, **GenomicRanges** is second at 0.36x. |
| 84 | + |
| 85 | +**Summary: dataset 8-7 speed relative to polars-bio (polars-bio = 1.00x; values below 1.00x mean the library is slower than polars-bio, values above 1.00x mean it is faster):** |
| 86 | + |
| 87 | +| Operation | polars-bio | PyRanges1 | GenomicRanges | Bioframe | |
| 88 | +|---|---|---|---|---| |
| 89 | +| overlap | 1.00x | 0.18x | 0.73x | 0.13x | |
| 90 | +| nearest | 1.00x | 0.06x | 1.63x | 0.04x | |
| 91 | +| count_overlaps | 1.00x | 0.24x | 0.19x | 0.02x | |
| 92 | +| coverage | 1.00x | -- | 0.36x | 0.05x | |
| 93 | +| cluster | 1.00x | 0.61x | -- | 0.20x | |
| 94 | +| complement | 1.00x | 0.53x | 0.02x | 0.06x | |
| 95 | +| merge | 1.00x | 0.51x | 0.02x | 0.12x | |
| 96 | +| subtract | 1.00x | 0.23x | -- | 0.05x | |
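The relative values in the table can be turned into "how many times slower" factors by taking the reciprocal. The short sketch below does this with the numbers copied from the table; the dictionary layout and helper names are illustrative assumptions, and `None` marks an unsupported operation.

```python
from typing import Dict, Optional

# Relative speed on dataset 8-7, copied from the summary table (polars-bio = 1.00).
RELATIVE_SPEED: Dict[str, Dict[str, Optional[float]]] = {
    "overlap":        {"polars-bio": 1.00, "PyRanges1": 0.18, "GenomicRanges": 0.73, "Bioframe": 0.13},
    "nearest":        {"polars-bio": 1.00, "PyRanges1": 0.06, "GenomicRanges": 1.63, "Bioframe": 0.04},
    "count_overlaps": {"polars-bio": 1.00, "PyRanges1": 0.24, "GenomicRanges": 0.19, "Bioframe": 0.02},
    "coverage":       {"polars-bio": 1.00, "PyRanges1": None, "GenomicRanges": 0.36, "Bioframe": 0.05},
    "cluster":        {"polars-bio": 1.00, "PyRanges1": 0.61, "GenomicRanges": None, "Bioframe": 0.20},
    "complement":     {"polars-bio": 1.00, "PyRanges1": 0.53, "GenomicRanges": 0.02, "Bioframe": 0.06},
    "merge":          {"polars-bio": 1.00, "PyRanges1": 0.51, "GenomicRanges": 0.02, "Bioframe": 0.12},
    "subtract":       {"polars-bio": 1.00, "PyRanges1": 0.23, "GenomicRanges": None, "Bioframe": 0.05},
}


def supported(operation: str) -> Dict[str, float]:
    """Drop libraries that do not support the operation."""
    return {lib: v for lib, v in RELATIVE_SPEED[operation].items() if v is not None}


def fastest(operation: str) -> str:
    row = supported(operation)
    return max(row, key=row.get)


def slowest(operation: str) -> str:
    row = supported(operation)
    return min(row, key=row.get)


def slowdown_vs_polars_bio(operation: str, library: str) -> float:
    """Reciprocal of the relative speed: 0.02x means 50 times slower."""
    return 1.0 / supported(operation)[library]


for op in RELATIVE_SPEED:
    lib = slowest(op)
    factor = slowdown_vs_polars_bio(op, lib)
    print(f"{op:15s} fastest={fastest(op):14s} slowest={lib:14s} ({factor:.1f}x slower)")
```

Because the table values are rounded to two decimals, the derived factors are approximate, especially for the smallest entries.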
| 97 | + |
| 98 | + |
| 99 | +### Thread scalability (dataset 8-7) |
| 100 | + |
| 101 | + |
| 102 | +One of polars-bio's key advantages is transparent multithreaded execution. The table below shows wall-clock times for all eight operations on the large dataset (8-7) as thread count increases: |
| 103 | + |
| 104 | +| Operation | 1 thread | 8 threads | Speedup | |
| 105 | +|---|---|---|---| |
| 106 | +| overlap | 4.03s | 0.74s | 5.43x | |
| 107 | +| nearest | 2.55s | 0.44s | 5.77x | |
| 108 | +| count_overlaps | 1.55s | 0.27s | 5.75x | |
| 109 | +| coverage | 1.20s | 0.22s | 5.48x | |
| 110 | +| subtract | 1.15s | 0.17s | 6.59x | |
| 111 | +| merge | 0.29s | 0.05s | 5.34x | |
| 112 | +| complement | 0.29s | 0.06s | 4.99x | |
| 113 | +| cluster | 0.48s | 0.15s | 3.27x | |
| 114 | + |
| 115 | +Key takeaways: |
| 116 | + |
| 117 | +- Most operations scale well, reaching a 5-6x speedup at 8 threads (roughly 60-75% parallel efficiency). |
| 118 | +- **subtract** scales best at 6.59x, while **cluster** shows more modest scaling (3.27x) — expected, as clustering is inherently sequential within each chromosome. |
| 119 | +- At 8 threads, **overlap** on 307M result pairs completes in just 0.74 seconds. |
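As a rough way to read these scaling numbers, the sketch below converts the reported speedups into parallel efficiency (speedup divided by thread count) and into an implied serial fraction. The serial fraction uses Amdahl's law, which is a simplifying assumption on my part and not something the benchmark measures directly. The speedups are copied from the table above.

```python
THREADS = 8

# Reported 1-thread to 8-thread speedups from the thread scalability table.
REPORTED_SPEEDUP = {
    "overlap": 5.43,
    "nearest": 5.77,
    "count_overlaps": 5.75,
    "coverage": 5.48,
    "subtract": 6.59,
    "merge": 5.34,
    "complement": 4.99,
    "cluster": 3.27,
}


def parallel_efficiency(speedup: float, threads: int) -> float:
    """Fraction of ideal linear scaling achieved (1.0 would be perfect)."""
    return speedup / threads


def amdahl_serial_fraction(speedup: float, threads: int) -> float:
    """Serial fraction s implied by Amdahl's law, S = 1 / (s + (1 - s) / n).

    Solving for s gives s = (n / S - 1) / (n - 1).
    """
    return (threads / speedup - 1.0) / (threads - 1.0)


for op, speedup in sorted(REPORTED_SPEEDUP.items(), key=lambda kv: -kv[1]):
    eff = parallel_efficiency(speedup, THREADS)
    serial = amdahl_serial_fraction(speedup, THREADS)
    print(f"{op:15s} speedup={speedup:.2f}x efficiency={eff:.0%} implied serial={serial:.1%}")
```

Under this model, cluster's lower speedup corresponds to a noticeably larger serial share than the other operations, which is consistent with the explanation that clustering is sequential within each chromosome.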
| 120 | + |
| 121 | +## Summary |
| 122 | + |
| 123 | +- **polars-bio** is the fastest single-threaded library for 7 out of 8 operations at scale, with speedups over the alternatives ranging from roughly 1.4x to 50x. |
| 124 | +- **GenomicRanges** wins the nearest operation at scale and leads overlap and count_overlaps on small datasets, making it a solid choice when input sizes are modest. |
| 125 | +- polars-bio delivers excellent thread scaling (5-6x at 8 threads), turning already-fast single-threaded times into sub-second performance on large datasets. |
| 126 | +- **PyRanges1**, despite its recent [preprint](https://www.biorxiv.org/content/10.64898/2025.12.11.693639v1) claiming "ultrafast" performance, is slower than polars-bio in every operation it supports on the large dataset (0.06-0.61x). While it is a solid improvement over its predecessor, the "ultrafast" characterization does not hold when compared against polars-bio's [DataFusion](https://datafusion.apache.org/)-based engine. |
| 127 | +- **Bioframe** is the slowest library in most cases; on the large dataset, only complement and merge have a slower contender (**GenomicRanges** at 0.02x). |
| 128 | +- The expanded eight-operation benchmark confirms that polars-bio's advantage extends beyond overlap and nearest to all standard range operation categories — coverage, cluster, complement, merge, and subtract. |
| 129 | + |
| 130 | +All benchmark code, raw results, and additional figures are available at [polars-bio-bench](https://github.com/biodatageeks/polars-bio-bench). |