Skip to content

Commit 99c87cc

Browse files
committed
add accessors back, fix tests and doc tutorial
1 parent 17a93d4 commit 99c87cc

File tree

11 files changed

+390
-58
lines changed

11 files changed

+390
-58
lines changed

README.md

Lines changed: 13 additions & 12 deletions
Original file line numberDiff line numberDiff line change
@@ -48,6 +48,7 @@ print(len(gg), len(df))
4848

4949
## output
5050
## 77 77> [!NOTE]
51+
5152
> `ends` are expected to be inclusive to be consistent with Bioconductor representations. If they are not, we recommend subtracting 1 from the `ends`.
5253
5354
#### UCSC or GTF file
@@ -212,16 +213,16 @@ print(hits)
212213
[1] 1 1677082
213214
[2] 2 1003411
214215

215-
## `GenomicRangesList`
216+
## `CompressedGenomicRangesList`
216217

217-
Just as it sounds, a `GenomicRangesList` is a named-list like object. If you are wondering why you need this class, a `GenomicRanges` object lets us specify multiple genomic elements, usually where the genes start and end. Genes are themselves made of many sub-regions, e.g. exons. `GenomicRangesList` allows us to represent this nested structure.
218+
Just as it sounds, a `CompressedGenomicRangesList` is a named-list like object. If you are wondering why you need this class, a `GenomicRanges` object lets us specify multiple genomic elements, usually where the genes start and end. Genes are themselves made of many sub-regions, e.g. exons. `CompressedGenomicRangesList` allows us to represent this nested structure.
218219

219220
**Currently, this class is limited in functionality.**
220221

221-
To construct a GenomicRangesList
222+
To construct a CompressedGenomicRangesList
222223

223224
```python
224-
from genomicranges import GenomicRanges, GenomicRangesList
225+
from genomicranges import GenomicRanges, CompressedGenomicRangesList
225226
from iranges import IRanges
226227
from biocframe import BiocFrame
227228

@@ -238,12 +239,12 @@ gr2 = GenomicRanges(
238239
strand=["-", "+", "*"],
239240
mcols=BiocFrame({"score": [2, 3, 4]}),
240241
)
241-
grl = GenomicRangesList(ranges=[gr1, gr2], names=["gene1", "gene2"])
242+
grl = CompressedGenomicRangesList.from_list(lst=[gr1, gr2], names=["gene1", "gene2"])
242243
print(grl)
243244
```
244245

245246
## output
246-
GenomicRangesList with 2 ranges and 2 metadata columns
247+
CompressedGenomicRangesList with 2 ranges and 2 metadata columns
247248

248249
Name: gene1
249250
GenomicRanges with 4 ranges and 4 metadata columns
@@ -270,12 +271,12 @@ print(grl)
270271

271272
Performance comparison between Python and R GenomicRanges implementations. The query dataset contains approximately 564,000 intervals, while the subject dataset contains approximately 71 million intervals.
272273

273-
| Operation | Python/GenomicRanges | Python/GenomicRanges (5 threads) | R/GenomicRanges |
274-
|-----------|---------------------|-----------------------------------|-----------------|
275-
| Overlap | 2.80s | 2.06s | 4.40s |
276-
| Overlap (single chromosome) | 6.73s | 5.19s | 10.06s |
277-
| Nearest | 2.27s | 1.5s | 42.16s |
278-
| Nearest (single chromosome) | 4.7s | 4.67s | 11.01s |
274+
| Operation | Python/GenomicRanges | Python/GenomicRanges (5 threads) | R/GenomicRanges |
275+
| --------------------------- | -------------------- | -------------------------------- | --------------- |
276+
| Overlap | 2.80s | 2.06s | 4.40s |
277+
| Overlap (single chromosome) | 6.73s | 5.19s | 10.06s |
278+
| Nearest | 2.27s | 1.5s | 42.16s |
279+
| Nearest (single chromosome) | 4.7s | 4.67s | 11.01s |
279280

280281
> [!NOTE]
281282
> The single chromosome benchmark ignores chromosome/sequence information and performs overlap operations solely on intervals.

docs/conf.py

Lines changed: 1 addition & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -315,6 +315,7 @@
315315
"biocutils": ("https://biocpy.github.io/BiocUtils", None),
316316
"iranges": ("https://biocpy.github.io/IRanges", None),
317317
"polars": ("https://docs.pola.rs/api/python/stable/", None),
318+
"compressed-lists": ("https://biocpy.github.io/compressed-lists", None),
318319
}
319320

320321
print(f"loading configurations for {project} {version} ...", file=sys.stderr)

docs/tutorial.md

Lines changed: 13 additions & 38 deletions
Original file line numberDiff line numberDiff line change
@@ -10,7 +10,7 @@ kernelspec:
1010

1111
An `IRanges` holds a **start** position and a **width**, and is typically used to represent coordinates along a genomic sequence. The interpretation of the **start** position depends on the application; for sequences, the **start** is usually a 1-based position, but other use cases may allow zero or even negative values, e.g., circular genomes. Ends are considered inclusive. `IRanges` uses [LTLa/nclist-cpp](https://github.com/LTLA/nclist-cpp) under the hood to perform fast overlap and search-based operations.
1212

13-
The package provides a `GenomicRanges` class to specify multiple genomic elements, typically where genes start and end. Genes are themselves made of many subregions, such as exons, and a `GenomicRangesList` enables the representation of this nested structure.
13+
The package provides a `GenomicRanges` class to specify multiple genomic elements, typically where genes start and end. Genes are themselves made of many subregions, such as exons, and a `CompressedGenomicRangesList` enables the representation of this nested structure.
1414

1515
Moreover, the package also provides a `SeqInfo` class to update or modify sequence information stored in the object. Learn more about this in the [GenomeInfoDb package](https://bioconductor.org/packages/release/bioc/html/GenomeInfoDb.html).
1616

@@ -68,10 +68,9 @@ human_gr = genomicranges.read_ucsc(genome="hg19")
6868
print(human_gr)
6969
```
7070

71-
7271
## Preferred way
7372

74-
To construct a `GenomicRanges` object, we need to provide sequence information and genomic coordinates. This is achieved through the combination of the `seqnames` and `ranges` parameters. Additionally, you have the option to specify the `strand`, represented as a list of "+" (or 1) for the forward strand, "-" (or -1) for the reverse strand, or "*" (or 0) if the strand is unknown. You can also provide a NumPy vector that utilizes either the string or numeric representation to specify the `strand`. Optionally, you can use the `mcols` parameter to provide additional metadata about each genomic region.
73+
To construct a `GenomicRanges` object, we need to provide sequence information and genomic coordinates. This is achieved through the combination of the `seqnames` and `ranges` parameters. Additionally, you have the option to specify the `strand`, represented as a list of "+" (or 1) for the forward strand, "-" (or -1) for the reverse strand, or "\*" (or 0) if the strand is unknown. You can also provide a NumPy vector that utilizes either the string or numeric representation to specify the `strand`. Optionally, you can use the `mcols` parameter to provide additional metadata about each genomic region.
7574

7675
```{code-cell}
7776
from genomicranges import GenomicRanges
@@ -427,7 +426,7 @@ print(binned_avg_gr)
427426
```
428427

429428
::: {tip}
430-
Now you might wonder how can I generate these ***bins***?
429+
Now you might wonder how can I generate these **_bins_**?
431430
:::
432431

433432
# Generate tiles or bins
@@ -469,7 +468,7 @@ print(tiles)
469468
```{code-cell}
470469
seqlengths = {"chr1": 100, "chr2": 75, "chr3": 200}
471470
472-
tiles = GenomicRanges.tile_genome(seqlengths=seqlengths, n=10)
471+
tiles = GenomicRanges.tile_genome(seqlengths=seqlengths, ntile=10)
473472
print(tiles)
474473
```
475474

@@ -547,8 +546,6 @@ query_hits = gr.nearest(find_regions)
547546
548547
query_hits = gr.precede(find_regions)
549548
550-
query_hits = gr.follow(find_regions)
551-
552549
print(query_hits)
553550
```
554551

@@ -609,7 +606,7 @@ print(combined)
609606
# Misc operations
610607

611608
- **invert_strand**: flip the strand for each interval
612-
- **sample**: randomly choose ***k*** intervals
609+
- **sample**: randomly choose **_k_** intervals
613610

614611
```{code-cell}
615612
# invert strand
@@ -619,20 +616,22 @@ inv_gr = gr.invert_strand()
619616
samp_gr = gr.sample(k=4)
620617
```
621618

622-
# `GenomicRangesList` class
619+
# `CompressedGenomicRangesList` class
623620

624-
Just as it sounds, a `GenomicRangesList` is a named-list like object.
621+
Just as it sounds, a `CompressedGenomicRangesList` is a named-list like object.
625622

626623
If you are wondering why you need this class, a `GenomicRanges` object enables the
627624
specification of multiple genomic elements, usually where genes start and end.
628625
Genes, in turn, consist of various subregions, such as exons.
629-
The `GenomicRangesList` allows us to represent this nested structure.
626+
The `CompressedGenomicRangesList` allows us to represent this nested structure.
630627

631628
As of now, this class has limited functionality, serving as a read-only class with basic accessors.
632629

633630
```{code-cell}
631+
from genomicranges import CompressedGenomicRangesList, GenomicRanges
632+
from iranges import IRanges
633+
from biocframe import BiocFrame
634634
635-
from genomicranges import GenomicRangesList
636635
a = GenomicRanges(
637636
seqnames=["chr1", "chr2", "chr1", "chr3"],
638637
ranges=IRanges([1, 3, 2, 4], [10, 30, 50, 60]),
@@ -647,33 +646,17 @@ b = GenomicRanges(
647646
mcols=BiocFrame({"score": [2, 3, 4]}),
648647
)
649648
650-
grl = GenomicRangesList(ranges=[a,b], names=["gene1", "gene2"])
649+
grl = CompressedGenomicRangesList.from_list(lst=[a,b], names=["gene1", "gene2"])
651650
print(grl)
652651
```
653652

654-
655653
## Properties
656654

657655
```{code-cell}
658656
grl.start
659657
grl.width
660658
```
661659

662-
## Combine `GenomicRangeslist` object
663-
664-
Similar to the combine function from `GenomicRanges`,
665-
666-
```{code-cell}
667-
grla = GenomicRangesList(ranges=[a], names=["a"])
668-
grlb = GenomicRangesList(ranges=[b, a], names=["b", "c"])
669-
670-
# or use the combine generic
671-
from biocutils.combine import combine
672-
cgrl = combine(grla, grlb)
673-
```
674-
675-
The functionality in `GenomicRangesLlist` is limited to read-only and a few methods. Updates are expected to be made as more features become available.
676-
677660
## Empty ranges
678661

679662
Both of these classes can also contain no range information, and they tend to be useful when incorporates into larger data structures but do not contain any data themselves.
@@ -686,15 +669,7 @@ empty_gr = GenomicRanges.empty()
686669
print(empty_gr)
687670
```
688671

689-
Similarly, an empty `GenomicRangesList` can be created:
690-
691-
```{code-cell}
692-
empty_grl = GenomicRangesList.empty(n=100)
693-
694-
print(empty_grl)
695-
```
696-
697-
----
672+
---
698673

699674
## Futher reading
700675

src/genomicranges/__init__.py

Lines changed: 1 addition & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -19,4 +19,4 @@
1919
from .grangeslist import CompressedGenomicRangesList
2020
from .io.gtf import read_gtf
2121
from .io.ucsc import read_ucsc
22-
from .SeqInfo import SeqInfo
22+
from .sequence_info import SeqInfo

src/genomicranges/granges.py

Lines changed: 76 additions & 4 deletions
Original file line numberDiff line numberDiff line change
@@ -10,7 +10,7 @@
1010
from biocframe import BiocFrame
1111
from iranges import IRanges
1212

13-
from .SeqInfo import SeqInfo, merge_SeqInfo
13+
from .sequence_info import SeqInfo, merge_SeqInfo
1414
from .utils import (
1515
STRAND_MAP,
1616
compute_up_down,
@@ -3122,7 +3122,7 @@ def binned_average(
31223122
#######################
31233123

31243124
def split(self, groups: list) -> "CompressedGenomicRangesList":
3125-
"""Split the `GenomicRanges` object into a :py:class:`~genomicranges.GenomicRangesList.GenomicRangesList`.
3125+
"""Split the `GenomicRanges` object into a :py:class:`~genomicranges.grangeslist.CompressedGenomicRangesList`.
31263126
31273127
Args:
31283128
groups:
@@ -3132,7 +3132,7 @@ def split(self, groups: list) -> "CompressedGenomicRangesList":
31323132
in the object.
31333133
31343134
Returns:
3135-
A `GenomicRangesList` containing the groups and their
3135+
A `CompressedGenomicRangesList` containing the groups and their
31363136
corresponding elements.
31373137
"""
31383138

@@ -3188,7 +3188,7 @@ def subtract(
31883188
Defaults to False.
31893189
31903190
Returns:
3191-
A `GenomicRangesList` with the same size as ``self`` containing
3191+
A `CompressedGenomicRangesList` with the same size as ``self`` containing
31923192
the subtracted regions.
31933193
"""
31943194

@@ -3220,6 +3220,78 @@ def subtract(
32203220

32213221
return CompressedGenomicRangesList.from_list(lst=psetdiff.values(), names=list(psetdiff.keys()))
32223222

3223+
##########################
3224+
######>> pairwise <<######
3225+
##########################
3226+
3227+
def pintersect(self, other: GenomicRanges, ignore_strand: bool = False) -> GenomicRanges:
3228+
"""Parallel intersection of genomic ranges.
3229+
3230+
Computes the intersection for each parallel pair of ranges in ``self`` and ``other``.
3231+
If seqnames mismatch or strands are incompatible (and not ignored), the result
3232+
for that index is an empty range (width 0).
3233+
3234+
Args:
3235+
other:
3236+
The other ``GenomicRanges`` object. Must have the same length as ``self``.
3237+
3238+
ignore_strand:
3239+
Whether to ignore strands. Defaults to False.
3240+
3241+
Returns:
3242+
A new ``GenomicRanges`` object.
3243+
"""
3244+
if len(self) != len(other):
3245+
raise ValueError("'self' and 'other' must have the same length.")
3246+
3247+
merged_seqinfo = merge_SeqInfo([self.seqinfo, other.seqinfo])
3248+
3249+
s_names = self.get_seqnames(as_type="list")
3250+
o_names = other.get_seqnames(as_type="list")
3251+
3252+
s_strand = self.get_strand(as_type="numpy")
3253+
o_strand = other.get_strand(as_type="numpy")
3254+
3255+
new_starts = np.maximum(self.start, other.start)
3256+
new_ends = np.minimum(self.end, other.end)
3257+
3258+
match_seqnames = np.array([x == y for x, y in zip(s_names, o_names)])
3259+
3260+
if not ignore_strand:
3261+
match_strands = (s_strand * o_strand) != -1
3262+
mask = match_seqnames & match_strands
3263+
else:
3264+
mask = match_seqnames
3265+
3266+
no_overlap = new_starts > new_ends
3267+
invalid = (~mask) | no_overlap
3268+
3269+
final_starts = new_starts.copy()
3270+
final_ends = new_ends.copy()
3271+
3272+
final_starts[invalid] = 1
3273+
final_ends[invalid] = 0
3274+
3275+
final_widths = final_ends - final_starts + 1
3276+
final_widths[final_widths < 0] = 0
3277+
3278+
if ignore_strand:
3279+
new_strands = np.zeros(len(self), dtype=int)
3280+
else:
3281+
new_strands = s_strand.copy()
3282+
use_other = s_strand == 0
3283+
new_strands[use_other] = o_strand[use_other]
3284+
new_strands[invalid] = 0
3285+
3286+
new_ranges = IRanges(final_starts, final_widths)
3287+
3288+
return GenomicRanges(
3289+
seqnames=s_names,
3290+
ranges=new_ranges,
3291+
strand=new_strands,
3292+
seqinfo=merged_seqinfo,
3293+
)
3294+
32233295

32243296
def _fast_combine_GenomicRanges(*x: GenomicRanges) -> GenomicRanges:
32253297
return GenomicRanges(

0 commit comments

Comments
 (0)