@@ -53,9 +53,144 @@ and {attr}`Dataset.num_variants` attributes.
5353
5454To get information on the metadata fields that are present, we can use
5555
56-
5756``` {code-cell}
5857ds.metadata.field_descriptors()
5958```
59+ :::{warning}
60+ The `description` column is currently empty because of a bug in the
61+ data ingest pipeline for the Viridian data. Later versions will include
62+ this information so that the dataset is self-describing.
63+ See the [GitHub issue](https://github.com/tskit-dev/sc2ts/issues/579).
64+ :::
65+
66+
67+
68+ ## Accessing per-sample information
69+
70+ The easiest way to get information about a single sample is through
71+ the `.metadata` and `.alignment` interfaces. First, let's get
72+ the sample IDs for the first 10 samples:
73+
74+ ``` {code-cell}
75+ ds.sample_id[:10]
76+ ```
77+ Then, we can get the metadata for a given sample as a dictionary using
78+ the {attr}`Dataset.metadata` interface:
79+
80+ ``` {code-cell}
81+ ds.metadata["SRR11597146"]
82+ ```
83+
84+ Similarly, we can get the integer encoded alignment for a sample using
85+ the {attr}`Dataset.alignment` interface:
86+
87+ ``` {code-cell}
88+ ds.alignment["SRR11597146"]
89+ ```
90+
91+ :::{seealso}
92+ See the section {ref}`sec_alignments_analysis_data_encoding` for
93+ details on the integer encoding for alignment data used here.
94+ :::
95+
96+ Both the `.metadata` and `.alignment` interfaces are **cached**
97+ (avoiding repeated decompression of the same underlying Zarr chunks)
98+ and support iteration, and so provide an efficient way of accessing
99+ data in bulk. For example, here we compute the mean number of
100+ gap ("-") characters per sample:
101+
102+ ``` {code-cell}
103+ import numpy as np
104+
105+ GAP = sc2ts.IUPAC_ALLELES.index("-")
106+
107+ gap_count = np.zeros(ds.num_samples)
108+ for j, a in enumerate(ds.alignment.values()):
109+     gap_count[j] = np.sum(a == GAP)
110+ np.mean(gap_count)
111+ ```
112+
113+ :::{warning}
114+ The arrays returned by the `alignment` interface are **zero-based**, so you
115+ must compensate when using **one-based** coordinates.
116+ :::
117+
118+ If you want to access
119+ specific slices of the array based on **one-based** coordinates, it's important
120+ to take the zero-based indexing into account. Suppose we wanted to
121+ access the first 10 bases of Spike for a given sample. The first
122+ base of Spike is 21563 in standard one-based coordinates. While we could do
123+ some arithmetic to compensate, the simplest way to translate is to
124+ prepend a placeholder value to the alignment array:
125+
126+ ``` {code-cell}
127+ a = np.append([-1], ds.alignment["SRR11597146"])
128+ spike_start = 21_563
129+ a[spike_start: spike_start + 10]
130+ ```
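
The equivalence between the two approaches can be checked on a small toy array. This is a minimal sketch using only numpy: the array `toy` and the coordinates `start` and `length` are made up for illustration and do not come from the dataset.

```python
import numpy as np

# Hypothetical zero-based, integer-encoded alignment (not real data).
toy = np.array([0, 1, 2, 3, 0, 1, 2, 3], dtype=np.int8)

# One-based start coordinate and length of the region we want.
start = 3
length = 4

# Option 1: compensate with arithmetic.
by_arithmetic = toy[start - 1 : start - 1 + length]

# Option 2: prepend a placeholder so that index j holds one-based position j.
padded = np.append([-1], toy)
by_padding = padded[start : start + length]

print(by_arithmetic)
print(by_padding)
```

Both slices contain the same values. Note that `np.append` returns a new array, so the prepending approach costs one copy of the alignment; for a single sample this is usually negligible.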
131+
132+ (sec_alignments_analysis_data_encoding)=
133+
134+ ## Alignment data encoding
135+
136+ A key element of processing data efficiently in [tskit](https://tskit.dev) and VCF
137+ Zarr is to use numpy
138+ arrays of integers to represent allelic states, instead of the classical
139+ approach of using strings. In sc2ts, alleles are given fixed integer
140+ representations, such that A=0, C=1, G=2, and T=3. So, to represent the DNA
141+ string "AACTG" we would use the numpy array [0, 0, 1, 3, 2] instead. This has
142+ many advantages and makes it much easier to write efficient code.
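
As a rough illustration of this encoding, the sketch below maps a DNA string to an integer array. The `encode` helper and the `ALLELES` string are hypothetical names introduced here for illustration; they are not part of the sc2ts API and cover only the four bases mentioned above.

```python
import numpy as np

# Assumed allele ordering for this sketch, matching A=0, C=1, G=2, T=3.
ALLELES = "ACGT"


def encode(seq):
    """Map a DNA string to an int8 array of allele indexes (illustrative helper)."""
    return np.array([ALLELES.index(base) for base in seq], dtype=np.int8)


encoded = encode("AACTG")
print(encoded)  # [0 0 1 3 2]

# Integer arrays make vectorised operations straightforward, e.g. counting Gs.
num_g = int(np.sum(encoded == 2))
print(num_g)  # 1
```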
143+
144+ The drawback of this is that it's not as easy to inspect and debug, and we must
145+ always be aware of the translation required.
146+
147+ Sc2ts provides some utilities for doing this. The easiest way to get the string
148+ values is to use the {func}`decode_alleles` function:
149+
150+ ``` {code-cell}
151+ a = sc2ts.decode_alleles(ds.alignment["SRR11597146"])
152+ a
153+ ```
154+ This is a numpy string array, which can still be processed quite efficiently.
155+ However, it is best to stay in native integer encoding where possible, as it
156+ is much more efficient.
157+
158+
159+ Sc2ts uses the [IUPAC](https://www.bioinformatics.org/sms/iupac.html)
160+ uncertainty codes to encode ambiguous bases, and the {attr}`sc2ts.IUPAC_ALLELES`
161+ variable stores the mapping from these values to their integer indexes.
162+
163+ ``` {code-cell}
164+ sc2ts.IUPAC_ALLELES
165+ ```
166+
167+ Thus, "A" corresponds to 0, "-" to 4 and so on.
168+
169+
170+ ### Missing data
171+
172+ Missing data is an important element of the data model. Usually, missing data is
173+ encoded as an "N" character in the alignments. However, there is no "N"
174+ in the `IUPAC_ALLELES` list above. This is because missing data is handled specially
175+ in VCF Zarr by mapping to the reserved `-1` value. Missing data can therefore be flagged
176+ easily and handled correctly by downstream utilities.
177+
178+ :::{warning}
179+ It is important to take this into account when translating the integer encoded data into
180+ strings, because -1 is interpreted as the last element of the list in Python. Please
181+ use the {func}`decode_alleles` function to avoid this tripwire.
182+ :::
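
The pitfall can be reproduced with a few lines of numpy. This is only a sketch: the `alleles` array below is a hypothetical, shortened stand-in for `sc2ts.IUPAC_ALLELES`, and the masking shown is one possible way to handle `-1`, not a description of how `decode_alleles` works internally.

```python
import numpy as np

# Hypothetical, shortened allele list for illustration only; the real
# sc2ts.IUPAC_ALLELES contains more codes.
alleles = np.array(["A", "C", "G", "T", "-"])

# Integer-encoded data in which -1 marks a missing call.
encoded = np.array([0, 1, -1, 3, 4])

# Naive lookup: -1 silently wraps around to the last allele ("-").
naive = alleles[encoded]

# Safer lookup: mask missing values and substitute "N" explicitly.
safe = np.where(encoded == -1, "N", alleles[encoded])

print("".join(naive))  # AC-T-
print("".join(safe))   # ACNT-
```

In the naive result the missing call is indistinguishable from a genuine gap, which is why an explicit check for `-1` is needed before any direct indexing.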
183+
184+
185+ ## Accessing by variant
186+
187+ A unique feature of the VCF Zarr encoding used here is that we can efficiently access
188+ the alignment data by sample **and** by site. The best way to access data by site
189+ is to use the {meth}`Dataset.variants` method.
60190
191+ :::{note}
192+ The {meth}`Dataset.variants` method is deliberately designed to mirror the API
193+ of the corresponding [tskit](https://tskit.dev) function
194+ ({meth}`tskit.TreeSequence.variants`).
195+ :::
61196