Commit 7f132d0

Use compressed-lists for GRangesList, classes extend BiocObject. (#160)
* bump packages versions
* extend granges with biocobject
* rename `_validate` to `validate`
* run actions from 3.10-3.14
1 parent 6cb0473 commit 7f132d0

30 files changed: +783 -1293 lines changed

.github/workflows/run-tests.yml

Lines changed: 1 addition & 1 deletion
@@ -28,7 +28,7 @@ jobs:
 test:
 strategy:
 matrix:
-python: ["3.9", "3.10", "3.11", "3.12", "3.13"]
+python: ["3.10", "3.11", "3.12", "3.13"]
 platform:
 - ubuntu-latest
 - macos-latest

CHANGELOG.md

Lines changed: 7 additions & 1 deletion
@@ -1,5 +1,12 @@
 # Changelog

+## Version 0.8.0
+
+- Rename module files to follow PEP guidelines
+- Rename `GenomicRangesList` to `CompressedGenomicRangesList` and now extends compressed-lists
+- Classes extend `BiocObject` from biocutils, provides a default metadata attribute and helper functions.
+- rename `validate` to `_validate` for consistency with the rest of the packages and classes.
+
 ## Version 0.7.0 - 0.7.3

 - Changes to switch to LTLA/nclist-cpp in the iranges package for overlap and search operations.
@@ -62,7 +69,6 @@ An rewrite of the package to use the new and improve IRanges packages (>= 0.4.2)
 - Coerce `GenomicRangesList` to `GenomicRanges`.
 - Add tests and documentation.

-
 ## Version 0.4.21 - 0.4.24

 - Optimize `intersect` operation on large number of genomic regions

README.md

Lines changed: 13 additions & 12 deletions
@@ -48,6 +48,7 @@ print(len(gg), len(df))

 ## output
 ## 77 77> [!NOTE]
+
 > `ends` are expected to be inclusive to be consistent with Bioconductor representations. If they are not, we recommend subtracting 1 from the `ends`.

 #### UCSC or GTF file
@@ -212,16 +213,16 @@ print(hits)
 [1] 1 1677082
 [2] 2 1003411

-## `GenomicRangesList`
+## `CompressedGenomicRangesList`

-Just as it sounds, a `GenomicRangesList` is a named-list like object. If you are wondering why you need this class, a `GenomicRanges` object lets us specify multiple genomic elements, usually where the genes start and end. Genes are themselves made of many sub-regions, e.g. exons. `GenomicRangesList` allows us to represent this nested structure.
+Just as it sounds, a `CompressedGenomicRangesList` is a named-list like object. If you are wondering why you need this class, a `GenomicRanges` object lets us specify multiple genomic elements, usually where the genes start and end. Genes are themselves made of many sub-regions, e.g. exons. `CompressedGenomicRangesList` allows us to represent this nested structure.

 **Currently, this class is limited in functionality.**

-To construct a GenomicRangesList
+To construct a CompressedGenomicRangesList

 ```python
-from genomicranges import GenomicRanges, GenomicRangesList
+from genomicranges import GenomicRanges, CompressedGenomicRangesList
 from iranges import IRanges
 from biocframe import BiocFrame

@@ -238,12 +239,12 @@ gr2 = GenomicRanges(
 strand=["-", "+", "*"],
 mcols=BiocFrame({"score": [2, 3, 4]}),
 )
-grl = GenomicRangesList(ranges=[gr1, gr2], names=["gene1", "gene2"])
+grl = CompressedGenomicRangesList.from_list(lst=[gr1, gr2], names=["gene1", "gene2"])
 print(grl)
 ```

 ## output
-GenomicRangesList with 2 ranges and 2 metadata columns
+CompressedGenomicRangesList with 2 ranges and 2 metadata columns

 Name: gene1
 GenomicRanges with 4 ranges and 4 metadata columns
@@ -270,12 +271,12 @@ print(grl)

 Performance comparison between Python and R GenomicRanges implementations. The query dataset contains approximately 564,000 intervals, while the subject dataset contains approximately 71 million intervals.

-| Operation | Python/GenomicRanges | Python/GenomicRanges (5 threads) | R/GenomicRanges |
-|-----------|---------------------|-----------------------------------|-----------------|
-| Overlap | 2.80s | 2.06s | 4.40s |
-| Overlap (single chromosome) | 6.73s | 5.19s | 10.06s |
-| Nearest | 2.27s | 1.5s | 42.16s |
-| Nearest (single chromosome) | 4.7s | 4.67s | 11.01s |
+| Operation | Python/GenomicRanges | Python/GenomicRanges (5 threads) | R/GenomicRanges |
+| --------------------------- | -------------------- | -------------------------------- | --------------- |
+| Overlap | 2.80s | 2.06s | 4.40s |
+| Overlap (single chromosome) | 6.73s | 5.19s | 10.06s |
+| Nearest | 2.27s | 1.5s | 42.16s |
+| Nearest (single chromosome) | 4.7s | 4.67s | 11.01s |

 > [!NOTE]
 > The single chromosome benchmark ignores chromosome/sequence information and performs overlap operations solely on intervals.

docs/conf.py

Lines changed: 1 addition & 0 deletions
@@ -315,6 +315,7 @@
 "biocutils": ("https://biocpy.github.io/BiocUtils", None),
 "iranges": ("https://biocpy.github.io/IRanges", None),
 "polars": ("https://docs.pola.rs/api/python/stable/", None),
+"compressed-lists": ("https://biocpy.github.io/compressed-lists", None),
 }

 print(f"loading configurations for {project} {version} ...", file=sys.stderr)

docs/tutorial.md

Lines changed: 13 additions & 38 deletions
@@ -10,7 +10,7 @@ kernelspec:

 An `IRanges` holds a **start** position and a **width**, and is typically used to represent coordinates along a genomic sequence. The interpretation of the **start** position depends on the application; for sequences, the **start** is usually a 1-based position, but other use cases may allow zero or even negative values, e.g., circular genomes. Ends are considered inclusive. `IRanges` uses [LTLa/nclist-cpp](https://github.com/LTLA/nclist-cpp) under the hood to perform fast overlap and search-based operations.

-The package provides a `GenomicRanges` class to specify multiple genomic elements, typically where genes start and end. Genes are themselves made of many subregions, such as exons, and a `GenomicRangesList` enables the representation of this nested structure.
+The package provides a `GenomicRanges` class to specify multiple genomic elements, typically where genes start and end. Genes are themselves made of many subregions, such as exons, and a `CompressedGenomicRangesList` enables the representation of this nested structure.

 Moreover, the package also provides a `SeqInfo` class to update or modify sequence information stored in the object. Learn more about this in the [GenomeInfoDb package](https://bioconductor.org/packages/release/bioc/html/GenomeInfoDb.html).

@@ -68,10 +68,9 @@ human_gr = genomicranges.read_ucsc(genome="hg19")
 print(human_gr)
 ```

-
 ## Preferred way

-To construct a `GenomicRanges` object, we need to provide sequence information and genomic coordinates. This is achieved through the combination of the `seqnames` and `ranges` parameters. Additionally, you have the option to specify the `strand`, represented as a list of "+" (or 1) for the forward strand, "-" (or -1) for the reverse strand, or "*" (or 0) if the strand is unknown. You can also provide a NumPy vector that utilizes either the string or numeric representation to specify the `strand`. Optionally, you can use the `mcols` parameter to provide additional metadata about each genomic region.
+To construct a `GenomicRanges` object, we need to provide sequence information and genomic coordinates. This is achieved through the combination of the `seqnames` and `ranges` parameters. Additionally, you have the option to specify the `strand`, represented as a list of "+" (or 1) for the forward strand, "-" (or -1) for the reverse strand, or "\*" (or 0) if the strand is unknown. You can also provide a NumPy vector that utilizes either the string or numeric representation to specify the `strand`. Optionally, you can use the `mcols` parameter to provide additional metadata about each genomic region.

 ```{code-cell}
 from genomicranges import GenomicRanges
@@ -427,7 +426,7 @@ print(binned_avg_gr)
 ```

 ::: {tip}
-Now you might wonder how can I generate these ***bins***?
+Now you might wonder how can I generate these **_bins_**?
 :::

 # Generate tiles or bins
@@ -469,7 +468,7 @@ print(tiles)
 ```{code-cell}
 seqlengths = {"chr1": 100, "chr2": 75, "chr3": 200}

-tiles = GenomicRanges.tile_genome(seqlengths=seqlengths, n=10)
+tiles = GenomicRanges.tile_genome(seqlengths=seqlengths, ntile=10)
 print(tiles)
 ```

@@ -547,8 +546,6 @@ query_hits = gr.nearest(find_regions)

 query_hits = gr.precede(find_regions)

-query_hits = gr.follow(find_regions)
-
 print(query_hits)
 ```

@@ -609,7 +606,7 @@ print(combined)
 # Misc operations

 - **invert_strand**: flip the strand for each interval
-- **sample**: randomly choose ***k*** intervals
+- **sample**: randomly choose **_k_** intervals

 ```{code-cell}
 # invert strand
@@ -619,20 +616,22 @@ inv_gr = gr.invert_strand()
 samp_gr = gr.sample(k=4)
 ```

-# `GenomicRangesList` class
+# `CompressedGenomicRangesList` class

-Just as it sounds, a `GenomicRangesList` is a named-list like object.
+Just as it sounds, a `CompressedGenomicRangesList` is a named-list like object.

 If you are wondering why you need this class, a `GenomicRanges` object enables the
 specification of multiple genomic elements, usually where genes start and end.
 Genes, in turn, consist of various subregions, such as exons.
-The `GenomicRangesList` allows us to represent this nested structure.
+The `CompressedGenomicRangesList` allows us to represent this nested structure.

 As of now, this class has limited functionality, serving as a read-only class with basic accessors.

 ```{code-cell}
+from genomicranges import CompressedGenomicRangesList, GenomicRanges
+from iranges import IRanges
+from biocframe import BiocFrame

-from genomicranges import GenomicRangesList
 a = GenomicRanges(
 seqnames=["chr1", "chr2", "chr1", "chr3"],
 ranges=IRanges([1, 3, 2, 4], [10, 30, 50, 60]),
@@ -647,33 +646,17 @@ b = GenomicRanges(
 mcols=BiocFrame({"score": [2, 3, 4]}),
 )

-grl = GenomicRangesList(ranges=[a,b], names=["gene1", "gene2"])
+grl = CompressedGenomicRangesList.from_list(lst=[a,b], names=["gene1", "gene2"])
 print(grl)
 ```

-
 ## Properties

 ```{code-cell}
 grl.start
 grl.width
 ```

-## Combine `GenomicRangeslist` object
-
-Similar to the combine function from `GenomicRanges`,
-
-```{code-cell}
-grla = GenomicRangesList(ranges=[a], names=["a"])
-grlb = GenomicRangesList(ranges=[b, a], names=["b", "c"])
-
-# or use the combine generic
-from biocutils.combine import combine
-cgrl = combine(grla, grlb)
-```
-
-The functionality in `GenomicRangesLlist` is limited to read-only and a few methods. Updates are expected to be made as more features become available.
-
 ## Empty ranges

 Both of these classes can also contain no range information, and they tend to be useful when incorporates into larger data structures but do not contain any data themselves.
@@ -686,15 +669,7 @@ empty_gr = GenomicRanges.empty()
 print(empty_gr)
 ```

-Similarly, an empty `GenomicRangesList` can be created:
-
-```{code-cell}
-empty_grl = GenomicRangesList.empty(n=100)
-
-print(empty_grl)
-```
-
-----
+---

 ## Futher reading

setup.cfg

Lines changed: 4 additions & 3 deletions
@@ -49,10 +49,11 @@ python_requires = >=3.9
 # For more information, check out https://semver.org/.
 install_requires =
 importlib-metadata; python_version<"3.8"
-biocframe>=0.6.2
-iranges>=0.5.4
-biocutils>=0.2.1
+biocframe>=0.7.1
+iranges>=0.7.0
+biocutils>=0.3.1
 numpy
+compressed_lists>=0.4.0

 [options.packages.find]
 where = src
