Commit c56546c

Merge pull request #581 from tskit-dev/more-docs-2
Work on document Dataset interface
2 parents e164bc3 + 7162ef9 commit c56546c

File tree: 10 files changed (+251 / -50 lines)


docs/alignments_analysis.md

Lines changed: 136 additions & 1 deletion
@@ -53,9 +53,144 @@ and {attr}`Dataset.num_variants` attributes.
 
 To get information on the metadata fields that are present, we can use
 
-
 ```{code-cell}
 ds.metadata.field_descriptors()
 ```
+:::{warning}
+The ``description`` column is currently empty because of a bug in the
+data ingest pipeline for the Viridian data. Later versions will include
+this information so that the dataset is self-describing.
+See [GitHub issue](https://github.com/tskit-dev/sc2ts/issues/579).
+:::
+
+
+
+## Accessing per-sample information
+
+The easiest way to get information about a single sample is through
+the ``.metadata`` and ``.alignment`` interfaces. First, let's get
+the sample IDs for the first 10 samples:
+
+```{code-cell}
+ds.sample_id[:10]
+```
+Then, we can get the metadata for a given sample as a dictionary using
+the {attr}`Dataset.metadata` interface:
+
+```{code-cell}
+ds.metadata["SRR11597146"]
+```
+
+Similarly, we can get the integer encoded alignment for a sample using
+the {attr}`Dataset.alignment` interface:
+
+```{code-cell}
+ds.alignment["SRR11597146"]
+```
+
+:::{seealso}
+See the section {ref}`sec_alignments_analysis_data_encoding` for
+details on the integer encoding for alignment data used here.
+:::
+
+Both the ``.metadata`` and ``.alignment`` interfaces are **cached**
+(avoiding repeated decompression of the same underlying Zarr chunks)
+and support iteration, and so provide an efficient way of accessing
+data in bulk. For example, here we compute the mean number of
+gap ("-") characters per sample:
+
+```{code-cell}
+import numpy as np
+
+GAP = sc2ts.IUPAC_ALLELES.index("-")
+
+gap_count = np.zeros(ds.num_samples)
+for j, a in enumerate(ds.alignment.values()):
+    gap_count[j] = np.sum(a == GAP)
+np.mean(gap_count)
+```
+
+:::{warning}
+The arrays returned by the ``alignment`` interface are **zero-based**, so you
+must compensate when working in **one-based** coordinates.
+:::
+
+If you want to access
+specific slices of the array based on **one-based** coordinates, it's important
+to take this zero-based indexing into account. Suppose we wanted to
+access the first 10 bases of Spike for a given sample. The first
+base of Spike is 21563 in standard one-based coordinates. While we could do
+some arithmetic to compensate, the simplest translation is to
+prepend a placeholder value to the alignment array:
+
+```{code-cell}
+a = np.append([-1], ds.alignment["SRR11597146"])
+spike_start = 21_563
+a[spike_start: spike_start + 10]
+```
+
+(sec_alignments_analysis_data_encoding)=
+
+## Alignment data encoding
+
+A key element of processing data efficiently in [tskit](https://tskit.dev) and VCF
+Zarr is to use numpy arrays of integers to represent allelic states,
+instead of the classical approach of using strings. In sc2ts, alleles are
+given fixed integer representations, such that A=0, C=1, G=2, and T=3. So, to
+represent the DNA string "AACTG" we would use the numpy array [0, 0, 1, 3, 2]
+instead. This has many advantages and makes it much easier to write efficient code.
+
+The drawback of this is that it's not as easy to inspect and debug, and we must
+always be aware of the translation required.
+
+Sc2ts provides some utilities for doing this. The easiest way to get the string
+values is to use the {func}`decode_alleles` function:
+
+```{code-cell}
+a = sc2ts.decode_alleles(ds.alignment["SRR11597146"])
+a
+```
+This is a numpy string array, which can still be processed quite efficiently.
+However, it is best to stay in the native integer encoding where possible, as it
+is much more efficient.
+
+
+Sc2ts uses the [IUPAC](https://www.bioinformatics.org/sms/iupac.html)
+uncertainty codes to encode ambiguous bases, and the {attr}`sc2ts.IUPAC_ALLELES`
+variable stores the mapping from these values to their integer indexes.
+
+```{code-cell}
+sc2ts.IUPAC_ALLELES
+```
+
+Thus, "A" corresponds to 0, "-" to 4, and so on.
+
+
+### Missing data
+
+Missing data is an important element of the data model. Usually, missing data is
+encoded as an "N" character in the alignments. However, there is no "N"
+in the ``IUPAC_ALLELES`` list above. This is because missing data is handled specially
+in VCF Zarr by mapping it to the reserved ``-1`` value. Missing data can therefore be flagged
+easily and handled correctly by downstream utilities.
+
+:::{warning}
+It is important to take this into account when translating the integer encoded data into
+strings, because -1 indexes the last element of a list in Python. Please
+use the {func}`decode_alleles` function to avoid this tripwire.
+:::
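
A minimal sketch of the tripwire described in this warning, assuming ``sc2ts.IUPAC_ALLELES`` is the string ``"ACGT-RYSWKMBDHV."`` used in sc2ts/jit.py below:

```python
import numpy as np
import sc2ts

a = np.array([0, 3, -1], dtype=np.int8)  # "A", "T", and one missing base

# Naive indexing wraps -1 around to the *last* allele (".") instead of "N".
naive = np.array(tuple(sc2ts.IUPAC_ALLELES))[a]
print(naive)                    # ['A' 'T' '.']  <- silently wrong

# decode_alleles treats the reserved -1 value as missing data ("N").
print(sc2ts.decode_alleles(a))  # ['A' 'T' 'N']
```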
+
+
+## Accessing by variant
+
+A unique feature of the VCF Zarr encoding used here is that we can efficiently access
+the alignment data by sample **and** by site. The best way to access data by site
+is to use the {meth}`Dataset.variants` method.
 
+:::{note}
+The {meth}`Dataset.variants` method is deliberately designed to mirror the API
+of the corresponding [tskit](https://tskit.dev) function
+({meth}`tskit.TreeSequence.variants`).
+:::
 
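
Since {meth}`Dataset.variants` is documented as mirroring {meth}`tskit.TreeSequence.variants`, per-site access would plausibly look like the sketch below; the ``genotypes`` attribute and no-argument call are assumptions borrowed from the tskit API rather than details confirmed by this diff:

```python
import numpy as np

# Hypothetical usage sketch: count missing calls (-1) at each site.
missing_per_site = []
for var in ds.variants():
    missing_per_site.append(np.sum(var.genotypes == -1))

print(np.mean(missing_per_site))  # mean number of missing calls per site
```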

docs/api.md

Lines changed: 2 additions & 2 deletions
@@ -35,7 +35,7 @@ notebooks.
 ```{eval-rst}
 .. autosummary::
    Dataset
-   decode_alignment
+   decode_alleles
    mask_ambiguous
    mask_flanking_deletions
 ```
@@ -44,7 +44,7 @@ notebooks.
 .. autoclass:: Dataset
    :members:
 
-.. autofunction:: decode_alignment
+.. autofunction:: decode_alleles
 
 .. autofunction:: mask_ambiguous
 

sc2ts/__init__.py

Lines changed: 1 addition & 1 deletion
@@ -2,5 +2,5 @@
 
 # star imports are fine here as it's just a bunch of constants
 from .core import *
-from .dataset import mask_ambiguous, mask_flanking_deletions, decode_alignment, Dataset
+from .dataset import mask_ambiguous, mask_flanking_deletions, decode_alleles, Dataset
 from .stats import node_data, mutation_data

sc2ts/cli.py

Lines changed: 1 addition & 1 deletion
@@ -157,7 +157,7 @@ def import_alignments(dataset, fastas, initialise, progress, verbose):
         position=1,
     )
     for k, v in a_bar:
-        alignments[k] = jit.encode_alignment(v)
+        alignments[k] = jit.encode_alleles(v)
     sc2ts.Dataset.append_alignments(dataset, alignments)
 
 

sc2ts/dataset.py

Lines changed: 45 additions & 12 deletions
@@ -24,18 +24,22 @@
 DEFAULT_ZARR_COMPRESSOR = numcodecs.Blosc(cname="zstd", clevel=7, shuffle=0)
 
 
-def decode_alignment(a):
+def decode_alleles(a):
     """
-    Decode an array of integer-encoded alleles into their IUPAC string values.
+    Decode an array of integer-encoded alleles into their IUPAC string values
+    returned as a numpy array.
 
     The input should use the encoding defined by ``core.IUPAC_ALLELES``,
-    with ``-1`` representing missing data; a trailing ``\"N\"`` allele is
-    added here for convenience when working with masked arrays.
+    with ``-1`` representing missing data.
 
     :param numpy.ndarray a: Integer-encoded alignment array.
     :return: Array of single-character IUPAC allele codes.
     :rtype: numpy.ndarray
     """
+    if np.any(a < -1):
+        raise ValueError("Negative values < -1 not supported")
+    if np.any(a >= len(core.IUPAC_ALLELES)):
+        raise ValueError("Unknown allele value")
     alleles = np.array(tuple(core.IUPAC_ALLELES + "N"), dtype=str)
     return alleles[a]
 
@@ -246,7 +250,7 @@ def __init__(self, path, chunk_cache_size=1, date_field=None):
 
         :param str path: Path to a directory or ``.zip`` Zarr store.
         :param int chunk_cache_size: Maximum number of chunks to cache for
-            haplotypes and metadata. Defaults to 1.
+            alignments and metadata. Defaults to 1.
         :param str date_field: Name of the metadata field to use as the
             sample date, or ``None`` to disable date handling. Defaults
             to ``None``.
@@ -259,16 +263,16 @@ def __init__(self, path, chunk_cache_size=1, date_field=None):
         else:
             self.store = zarr.DirectoryStore(path)
         self.root = zarr.open(self.store, mode="r")
-        self.sample_id = self.root["sample_id"][:].astype(str)
+        self._sample_id = self.root["sample_id"][:].astype(str)
 
         # TODO we should be storing this mapping in the Zarr somehow.
         self.sample_id_map = {
-            sample_id: k for k, sample_id in enumerate(self.sample_id)
+            sample_id: k for k, sample_id in enumerate(self._sample_id)
         }
-        self.haplotypes = CachedHaplotypeMapping(
+        self._alignment = CachedHaplotypeMapping(
             self.root, self.sample_id_map, chunk_cache_size
         )
-        self.metadata = CachedMetadataMapping(
+        self._metadata = CachedMetadataMapping(
             self.root,
             self.sample_id_map,
             date_field,
@@ -284,6 +288,35 @@ def __iter__(self):
     def __len__(self):
         return len(self.root)
 
+    @property
+    def alignment(self):
+        """
+        Efficient mapping of sample ID strings to integer encoded alignment data.
+
+        The returned object is dictionary-like, implementing the Mapping protocol.
+        Access to the underlying Zarr store is mediated by a chunk cache, so that
+        chunks are not repeatedly decompressed.
+        """
+        return self._alignment
+
+    @property
+    def metadata(self):
+        """
+        Efficient mapping of sample ID strings to metadata dictionaries.
+
+        The returned object is dictionary-like, implementing the Mapping protocol.
+        Access to the underlying Zarr store is mediated by a chunk cache, so that
+        chunks are not repeatedly decompressed.
+        """
+        return self._metadata
+
+    @property
+    def sample_id(self):
+        """
+        The sample IDs as a numpy string array.
+        """
+        return self._sample_id
+
     @property
     def samples_chunk_size(self):
         return self.root["call_genotype"].chunks[1]
@@ -380,8 +413,8 @@ def write_fasta(self, out, sample_id=None):
             sample_id = self.sample_id
 
         for sid in sample_id:
-            h = self.haplotypes[sid]
-            a = decode_alignment(h)
+            h = self.alignment[sid]
+            a = decode_alleles(h)
             print(f">{sid}", file=out)
             # FIXME this is probably a terrible way to write a large numpy string to
             # a file
@@ -409,7 +442,7 @@ def copy(
         alignments = {}
         bar = tqdm.tqdm(sample_id, desc="Samples", disable=not show_progress)
         for s in bar:
-            alignments[s] = self.haplotypes[s]
+            alignments[s] = self.alignment[s]
             if len(alignments) == samples_chunk_size:
                 Dataset.append_alignments(path, alignments)
                 alignments = {}

sc2ts/inference.py

Lines changed: 3 additions & 3 deletions
@@ -489,7 +489,7 @@ def preprocess(
     samples = []
     bar = get_progress(strains, progress_title, "preprocess", show_progress)
     for strain in bar:
-        alignment = dataset.haplotypes[strain]
+        alignment = dataset.alignment[strain]
         alignment = _dataset.mask_flanking_deletions(alignment)
         sample = Sample(strain)
        # No padding zero site in the alignment
@@ -1227,10 +1227,10 @@ def make_tsb(ts, num_alleles, mirror_coordinates=False):
        ts = tree_ops.insert_vestigial_root_edge(ts)
 
    # Convert arrays for numba compatibility
-    ancestral_state = jit.encode_alignment(
+    ancestral_state = jit.encode_alleles(
        np.asarray(ts.sites_ancestral_state, dtype="U1")
    )
-    derived_state = jit.encode_alignment(
+    derived_state = jit.encode_alleles(
        np.asarray(ts.mutations_derived_state, dtype="U1")
    )

sc2ts/jit.py

Lines changed: 1 addition & 1 deletion
@@ -203,7 +203,7 @@ def count(ts):
 
 # FIXME make cache optional.
 @numba.njit(cache=True)
-def encode_alignment(h):
+def encode_alleles(h):
     # Just so numba knows this is a constant string.
     alleles = "ACGT-RYSWKMBDHV."
     n = h.shape[0]
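
Together with {func}`decode_alleles` in sc2ts/dataset.py above, this gives a matched encode/decode pair. A rough round-trip sketch, assuming ``sc2ts.jit`` is importable and that ``encode_alleles`` accepts a length-1 unicode numpy array as at the call sites in sc2ts/inference.py:

```python
import numpy as np
from sc2ts import decode_alleles, jit

# Encode single characters to their integer indexes, then decode them back.
chars = np.array(list("AACTG-"), dtype="U1")
encoded = jit.encode_alleles(chars)   # expected: [0, 0, 1, 3, 2, 4]
assert (decode_alleles(encoded) == chars).all()
```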

tests/sc2ts_fixtures.py

Lines changed: 5 additions & 5 deletions
@@ -47,7 +47,7 @@ def encoded_alignments(path):
     fr = data_import.FastaReader(path)
     alignments = {}
     for k, v in fr.items():
-        alignments[k] = jit.encode_alignment(v[1:])
+        alignments[k] = jit.encode_alleles(v[1:])
     return alignments
 
 
@@ -205,8 +205,8 @@ def recombinant_alignments(dataset):
     Generate some recombinant alignments from existing haplotypes
     """
     strains = ["SRR11597188", "SRR11597163"]
-    left_a = dataset.haplotypes[strains[0]]
-    right_a = dataset.haplotypes[strains[1]]
+    left_a = dataset.alignment[strains[0]]
+    right_a = dataset.alignment[strains[1]]
     # Recombine in the middle
     bp = 9_999
     h = left_a.copy()
@@ -243,7 +243,7 @@ def recombinant_example_2(tmp_path, fx_ts_map, fx_dataset, ds_path):
     # Pick a distinct strain to be the root of our two new haplotypes added
     # on the first day.
     root_strain = "SRR11597116"
-    a = fx_dataset.haplotypes[root_strain]
+    a = fx_dataset.alignment[root_strain]
     base_ts = fx_ts_map["2020-02-13"]
     # This sequence has a bunch of Ns at the start, so we have to go inwards
     # from them to make sure we're not masking them out.
@@ -310,7 +310,7 @@ def recombinant_example_3(tmp_path, fx_ts_map, fx_dataset, ds_path):
     # Pick a distinct strain to be the root of our three new haplotypes added
     # on the first day.
     root_strain = "SRR11597116"
-    a = fx_dataset.haplotypes[root_strain]
+    a = fx_dataset.alignment[root_strain]
     base_ts = fx_ts_map["2020-02-13"]
     # This sequence has a bunch of Ns at the start, so we have to go inwards
     # from them to make sure we're not masking them out.
