Commit 25cf173

Merge pull request #582 from tskit-dev/more-docs-3
More docs 3
2 parents c56546c + 606d278 commit 25cf173

6 files changed: +113 −37 lines


docs/alignments_analysis.md

Lines changed: 70 additions & 2 deletions

````diff
@@ -189,8 +189,76 @@ the alignment data by sample **and** by site. The best way to access data by site
 is to use the {meth}`Dataset.variants` method.
 
 :::{note}
-The {meth}`Dataset.variants` method is deliberately designed to mirror the API
+The {meth}`Dataset.variants` method is deliberately designed to follow the semantics
 of the corresponding [tskit](https://tskit.dev) function
-({meth}`tskit.TreeSequence.variants`).
+({meth}`tskit.TreeSequence.variants`), enabling straightforward joint analysis of the
+ARG and alignments.
 :::
 
+Here we use this interface to count the number of samples that carry the gap character
+at each site along the genome:
+
+```{code-cell}
+GAP = sc2ts.IUPAC_ALLELES.index("-")
+
+gap_count = np.zeros(ds.num_variants)
+for j, var in enumerate(ds.variants()):
+    gap_count[j] = np.sum(var.genotypes == GAP)
+
+gap_count
+```
+
+Here, we can see that all 1000 samples in our small subset have flanking deletions called.
+
+We can use the ``position`` argument to supply a list of (**one-based**) site positions
+of interest:
+
+```{code-cell}
+spike_pos = np.arange(21_563, 25_385)
+gap_count = np.zeros_like(spike_pos)
+for j, var in enumerate(ds.variants(position=spike_pos)):
+    gap_count[j] = np.sum(var.genotypes == GAP)
+
+gap_count
+```
+
+We can also use the ``sample_id`` argument to specify subsets of samples.
+
+## Bulk metadata analysis
+
+Accessing the metadata row-by-row using the ``.metadata`` mapping above is
+inefficient when we want to look at large numbers of samples. In this case,
+it is much more convenient to export the metadata to a Pandas dataframe
+using the {meth}`Dataset.metadata_dataframe` method and then work with this.
+
+```{code-cell}
+df = ds.metadata_dataframe()
+df
+```
+
+Then, suppose we want to find all samples from the USA:
+
+```{code-cell}
+usa_samples = df[df["Country"] == "USA"].index
+usa_samples
+```
+
+:::{important}
+For performance reasons it's a good idea to use the ``fields`` parameter to
+{meth}`Dataset.metadata_dataframe` to limit the amount of metadata decoded
+to what you actually need.
+:::
+
+
+## Getting FASTA output
+
+Getting FASTA output is straightforward using the {meth}`Dataset.write_fasta`
+method. Here, we use the ``sample_id`` argument to write the FASTA alignments
+of the USA samples found in the last example:
+
+```{code-cell}
+with open("/tmp/usa.fa", "w") as f:
+    ds.write_fasta(f, sample_id=usa_samples)
+```
+
+
````
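The per-site counting loop in the docs above depends on the sc2ts ``Dataset`` object; the same pattern can be sketched with plain NumPy on a synthetic genotype matrix (the ``genotypes`` array and its values are made up for illustration):

```python
import numpy as np

# The sc2ts allele-integer encoding (the value of sc2ts.IUPAC_ALLELES);
# the gap character "-" encodes to integer 4.
IUPAC_ALLELES = "ACGT-RYSWKMBDHV."
GAP = IUPAC_ALLELES.index("-")

# Synthetic stand-in for iterating ds.variants(): rows are sites,
# columns are samples, values are integer-encoded alleles.
genotypes = np.array(
    [
        [0, 4, 4],  # A, -, -
        [1, 1, 4],  # C, C, -
        [4, 4, 4],  # -, -, -
    ]
)

# Count how many samples carry a gap at each site (one value per row).
gap_count = (genotypes == GAP).sum(axis=1)
print(gap_count)  # [2 1 3]
```

Vectorising with a boolean comparison and a row-wise sum gives the same result as the per-variant loop, one count per site.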

docs/api.md

Lines changed: 5 additions & 18 deletions

````diff
@@ -8,9 +8,6 @@ Inference is driven via the command line interface (see the
 listed here are intended for working with tree sequences and datasets
 that have already been generated.
 
-The reference documentation is concise and exhaustive; for higher level
-discussion and worked examples, see the project README and example
-notebooks.
 
 ```{eval-rst}
 .. currentmodule:: sc2ts
@@ -30,7 +27,7 @@ notebooks.
 .. autofunction:: mutation_data
 ```
 
-## Dataset access
+## Alignment and metadata analysis
 
 ```{eval-rst}
 .. autosummary::
@@ -44,6 +41,9 @@ notebooks.
 .. autoclass:: Dataset
    :members:
 
+.. autoclass:: Variant
+   :members:
+
 .. autofunction:: decode_alleles
 
 .. autofunction:: mask_ambiguous
@@ -55,25 +55,12 @@ notebooks.
 
 ```{eval-rst}
 .. autosummary::
-   REFERENCE_STRAIN
-   REFERENCE_DATE
-   REFERENCE_GENBANK
-   REFERENCE_SEQUENCE_LENGTH
-   IUPAC_ALLELES
    decode_flags
   flags_summary
 ```
 
 ```{eval-rst}
-.. autodata:: REFERENCE_STRAIN
-
-.. autodata:: REFERENCE_DATE
-
-.. autodata:: REFERENCE_GENBANK
-
-.. autodata:: REFERENCE_SEQUENCE_LENGTH
-
-.. autodata:: IUPAC_ALLELES
+.. data:: IUPAC_ALLELES
 
 .. autofunction:: decode_flags
````
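The last hunk swaps ``autodata`` (which imports the module and pulls in the docstring automatically) for a hand-written ``data`` directive. As a sketch, assuming Sphinx's Python domain, the manual form carries its description inline:

```rst
.. currentmodule:: sc2ts

.. data:: IUPAC_ALLELES

   The allele-integer encoding used by sc2ts.
```

This avoids importing the package at documentation build time for that one constant.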

sc2ts/__init__.py

Lines changed: 1 addition & 1 deletion

```diff
@@ -2,5 +2,5 @@
 
 # star imports are fine here as it's just a bunch of constants
 from .core import *
-from .dataset import mask_ambiguous, mask_flanking_deletions, decode_alleles, Dataset
+from .dataset import mask_ambiguous, mask_flanking_deletions, decode_alleles, Dataset, Variant
 from .stats import node_data, mutation_data
```

sc2ts/core.py

Lines changed: 4 additions & 0 deletions

```diff
@@ -22,7 +22,11 @@
 # NOTE!! This string is also used in the jit module where it's
 # hard-coded into a numba function, so if this ever changes
 # it needs to be updated there also!
+
 IUPAC_ALLELES = "ACGT-RYSWKMBDHV."
+"""
+The allele-integer encoding used by sc2ts.
+"""
 
 NODE_IS_MUTATION_OVERLAP = 1 << 21
 NODE_IS_REVERSION_PUSH = 1 << 22
```
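The docstring added above documents ``IUPAC_ALLELES`` as the allele-integer encoding: an allele's integer code is its index in the string. A minimal round-trip sketch (the ``encode``/``decode`` helpers are illustrative, not part of sc2ts):

```python
import numpy as np

# Copy of the constant from sc2ts/core.py above.
IUPAC_ALLELES = "ACGT-RYSWKMBDHV."

def encode(seq):
    # Map each allele character to its index in IUPAC_ALLELES.
    lookup = {c: i for i, c in enumerate(IUPAC_ALLELES)}
    return np.array([lookup[c] for c in seq], dtype=np.int8)

def decode(genotypes):
    # Map integer codes back to allele characters.
    alleles = np.array(list(IUPAC_ALLELES))
    return "".join(alleles[genotypes])

codes = encode("ACGT-R.")
print(codes.tolist())  # [0, 1, 2, 3, 4, 5, 15]
print(decode(codes))   # ACGT-R.
```

The index-based scheme is why the docs example can write ``sc2ts.IUPAC_ALLELES.index("-")`` to get the integer code for a gap.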

sc2ts/dataset.py

Lines changed: 25 additions & 16 deletions

```diff
@@ -232,29 +232,27 @@ def field_descriptors(self):
 
 @dataclasses.dataclass
 class Variant:
+    """
+    Represents a single variant, including the genomic position and the integer encoded
+    genotypes.
+    """
     position: int
     genotypes: np.ndarray
     alleles: list
 
 
 class Dataset(collections.abc.Mapping):
+    """
+    Open a sc2ts VCF Zarr dataset for convenient access to alignments and metadata.
 
+    The dataset is opened read-only from ``path``, which may be either a
+    directory store or a consolidated ``.zip`` file.
+
+    :param str path: Path to a directory or ``.zip`` Zarr store.
+    :param int chunk_cache_size: Maximum number of chunks to cache for
+        alignments and metadata. Defaults to 1.
+    """
     def __init__(self, path, chunk_cache_size=1, date_field=None):
-        """
-        Open a sc2ts VCF Zarr dataset for convenient access to alignments and metadata.
-
-        The dataset is opened read-only from ``path``, which may be either a
-        directory store or a consolidated ``.zip`` file. The ``date_field``
-        argument specifies which metadata field should be interpreted as the
-        sample date when constructing :attr:`metadata`.
-
-        :param str path: Path to a directory or ``.zip`` Zarr store.
-        :param int chunk_cache_size: Maximum number of chunks to cache for
-            alignments and metadata. Defaults to 1.
-        :param str date_field: Name of the metadata field to use as the
-            sample date, or ``None`` to disable date handling. Defaults
-            to ``None``.
-        """
         logger.info(f"Loading dataset @{path} using {date_field} as date field")
         self.date_field = date_field
         self.path = pathlib.Path(path)
@@ -310,6 +308,17 @@ def metadata(self):
         """
         return self._metadata
 
+    def metadata_dataframe(self, fields=None):
+        """
+        Returns the metadata in this dataset as a Pandas dataframe,
+        indexed by sample_id.
+
+        :param fields: List of metadata fields to include; if ``None``,
+            all fields are used.
+        :return: Pandas dataframe
+        """
+        return self.metadata.as_dataframe(fields)
+
     @property
     def sample_id(self):
         """
@@ -356,7 +365,7 @@ def variants(self, sample_id=None, position=None):
         """
         Iterate over variants at the specified positions for the given samples.
 
-        Yields Variant objects containing the genomic position, encoded
+        Yields :class:`.Variant` objects containing the genomic position, encoded
         genotypes, and allele labels for each requested site.
 
         :param sample_id: Iterable of sample IDs to include; if ``None``,
```
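The ``Variant`` dataclass above is a plain container; how ``Dataset.variants`` populates it can be sketched with an in-memory generator (the ``variants`` function and the example data here are hypothetical stand-ins, not the real implementation, which streams genotypes from Zarr chunks):

```python
import dataclasses
import numpy as np

@dataclasses.dataclass
class Variant:
    # Same fields as the dataclass committed in sc2ts/dataset.py.
    position: int
    genotypes: np.ndarray
    alleles: list

def variants(positions, genotype_matrix, alleles):
    # Yield one Variant per site, row by row from a genotype matrix.
    for pos, g in zip(positions, genotype_matrix):
        yield Variant(position=pos, genotypes=g, alleles=alleles)

ALLELES = list("ACGT-RYSWKMBDHV.")
G = np.array([[0, 0, 4], [2, 3, 3]])
for var in variants([21_563, 21_564], G, ALLELES):
    print(var.position, var.genotypes.tolist())
```

A frozen dataclass of (position, genotypes, alleles) is deliberately shaped like tskit's own ``Variant``, so downstream code can treat the two uniformly.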

tests/test_dataset.py

Lines changed: 8 additions & 0 deletions

```diff
@@ -507,6 +507,14 @@ def test_samples_for_date(self, fx_dataset):
         samples = fx_dataset.metadata.samples_for_date("2020-01-19")
         assert samples == ["SRR11772659"]
 
+    def test_metadata_dataframe(self, fx_dataset):
+        df1 = fx_dataset.metadata.as_dataframe()
+        df2 = fx_dataset.metadata_dataframe()
+        assert df1.shape[0] == df2.shape[0]
+        for col, data1 in df2.items():
+            data2 = df1[col]
+            nt.assert_array_equal(data1.to_numpy(), data2.to_numpy())
+
     def test_as_dataframe(self, fx_dataset, fx_metadata_df):
         df1 = fx_dataset.metadata.as_dataframe()
         df2 = fx_metadata_df.loc[df1.index]
```

(Note: inside the loop, each column of ``df2`` must be compared against the corresponding column of ``df1``; comparing ``df2`` against itself would make the test vacuous.)