Commit 06a36a7

Merge pull request #577 from tskit-dev/add-dataset-docs-1
Add subset dataset and start on documenting dataset
2 parents b73627a + 72a2419 commit 06a36a7

6 files changed: +109 -17 lines changed

docs/alignments_analysis.md

Lines changed: 59 additions & 0 deletions
@@ -1,2 +1,61 @@
+---
+jupytext:
+  text_representation:
+    extension: .md
+    format_name: myst
+    format_version: 0.12
+    jupytext_version: 1.9.1
+kernelspec:
+  display_name: Python 3
+  language: python
+  name: python3
+---
+
+```{eval-rst}
+.. currentmodule:: sc2ts
+```
+
+
 (sec_alignments_analysis)=
+
 # Alignments analysis
+
+## Prerequisites
+
+Download the first 1000 samples of the Viridian dataset (450K):
+
+```
+curl -O https://raw.githubusercontent.com/tskit-dev/sc2ts/refs/heads/main/docs/viridian_mafft_subset_1000_v1.vcz.zip
+```
+
+We'll use this small subset as an example throughout.
+
+
+## Loading and getting information
+
+To load up a dataset, we use the {class}`Dataset` constructor:
+
+
+```{code-cell}
+import sc2ts
+
+ds = sc2ts.Dataset("viridian_mafft_subset_1000_v1.vcz.zip")
+ds
+```
+
+When we return the dataset object from a notebook cell (as here) it prints
+out a summary of the contents. Here, we're working with a small subset
+of the Viridian dataset consisting of the first 1000 samples at the 29903
+sites in the SARS-CoV-2 genome.
+
+The basic information is also available in the {attr}`Dataset.num_samples`
+and {attr}`Dataset.num_variants` attributes.
+
+To get information on the metadata fields that are present, we can use
+
+
+```{code-cell}
+ds.metadata.field_descriptors()
+```
+
+

docs/build.sh

Lines changed: 1 addition & 1 deletion
@@ -6,7 +6,7 @@
 
 REPORTDIR=_build/html/reports
 
-jupyter-book build .
+jupyter-book build -W .
 RETVAL=$?
 if [ $RETVAL -ne 0 ]; then
     if [ -e $REPORTDIR ]; then
Lines changed: 29 additions & 16 deletions
@@ -1,20 +1,22 @@
-.. _sc2ts_sec_cli:
 
-Command line interface
-======================
+(sc2ts_sec_cli)=
+
+# Command line interface
 
 The ``sc2ts`` package provides a command line interface for running
 inference and working with sc2ts datasets. After installation, the
-``sc2ts`` entry point should be available::
+``sc2ts`` entry point should be available
 
-    $ sc2ts --help
+```
+$ sc2ts --help
+```
 
 You can also invoke the CLI via the module::
+```
+$ python -m sc2ts --help
+```
 
-    $ python -m sc2ts --help
-
-Order of high-level commands
-----------------------------
+## Order of high-level commands
 
 In a typical end-to-end workflow, the main subcommands are used in the
 following order:
@@ -28,11 +30,22 @@ following order:
 4. ``minimise-metadata`` to generate an analysis-ready ARG with compact
    metadata suitable for use with the Python analysis APIs.
 
-Below we list all subcommands and options provided by the CLI. This
-output is generated directly from the Click definitions in
-``sc2ts.cli`` using the ``sphinx-click`` extension, and so stays in
-sync with the implementation.
 
-.. click:: sc2ts.cli:cli
-   :prog: sc2ts
-   :nested: full
+## CLI reference
+
+<!-- Below we list all subcommands and options provided by the CLI. This -->
+<!-- output is generated directly from the Click definitions in -->
+<!-- ``sc2ts.cli`` using the ``sphinx-click`` extension, and so stays in -->
+<!-- sync with the implementation. -->
+
+:::{todo}
+Add the sphinx-click output here somehow.
+:::
+
+<!-- ```{eval-rst} -->
+<!-- .. click:: sc2ts.cli:cli -->
+<!-- :prog: sc2ts infer -->
+<!-- :nested: full -->
+<!-- ``` -->
+
+

docs/make_viridian_subset.py

Lines changed: 12 additions & 0 deletions
@@ -0,0 +1,12 @@
+import sc2ts
+
+ds = sc2ts.Dataset("../viridian_mafft_2024-10-14_v1.vcz.zip")
+print(ds)
+
+samples = ds["sample_id"][:]
+k = 1000
+samples = samples[:k]
+path = f"viridian_mafft_subset_{k}_v1.vcz"
+ds.copy(path, sample_id=samples)
+sc2ts.Dataset.create_zip(path, path + ".zip")
+

449 KB
Binary file not shown.

sc2ts/dataset.py

Lines changed: 8 additions & 0 deletions
@@ -294,10 +294,18 @@ def variants_chunk_size(self):
 
     @property
     def num_samples(self):
+        """
+        Return the number of samples in this dataset.
+        """
         return self.root.call_genotype.shape[1]
 
     @property
     def num_variants(self):
+        """
+        Return the number of variants in this dataset. Note that this does not mean that
+        there's necessarily variation at each site; the terminology is borrowed from
+        VCF Zarr.
+        """
         return self.root.call_genotype.shape[0]
 
     def __str__(self):
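
Reviewer note: the two documented properties read their counts off the shape of `call_genotype`, with variants on the first axis and samples on the second. The following is a minimal sketch of that convention, not the real `sc2ts.Dataset` or a Zarr store. The names `FakeArray`, `FakeRoot` and `MiniDataset` are hypothetical stand-ins, and the trailing ploidy dimension of size 1 is an assumption based on the VCF Zarr layout.

```python
from dataclasses import dataclass


@dataclass
class FakeArray:
    """Hypothetical stand-in for a Zarr array; only the shape matters here."""
    shape: tuple


@dataclass
class FakeRoot:
    """Hypothetical stand-in for the Zarr group holding the genotype calls."""
    call_genotype: FakeArray


class MiniDataset:
    """Simplified mirror of the two properties documented in this commit."""

    def __init__(self, root):
        self.root = root

    @property
    def num_samples(self):
        # Samples are indexed along the second axis of call_genotype.
        return self.root.call_genotype.shape[1]

    @property
    def num_variants(self):
        # "Variants" are sites, indexed along the first axis; a site is
        # counted whether or not any sample actually varies there.
        return self.root.call_genotype.shape[0]


# Shape assumed to be (variants, samples, ploidy). The counts match the
# 1000-sample, 29903-site subset described in docs/alignments_analysis.md.
root = FakeRoot(call_genotype=FakeArray(shape=(29903, 1000, 1)))
ds = MiniDataset(root)
print(ds.num_variants, ds.num_samples)
```

With the real subset file, the equivalent check would be to compare `ds.num_samples` and `ds.num_variants` against the summary printed by the notebook cell.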
