
Commit 23c6aa3

Merge pull request #573 from tskit-dev/more-docs
More docs
2 parents db95518 + f55b3c3 commit 23c6aa3

14 files changed: +334 −424 lines

README.md

Lines changed: 4 additions & 270 deletions

````diff
@@ -11,274 +11,8 @@ access.
 convenient and efficient access to the full Viridian dataset (alignments and metadata)
 in a single file using the [VCF Zarr specification](https://doi.org/10.1093/gigascience/giaf049).
 
-Please see the [preprint](https://www.biorxiv.org/content/10.1101/2023.06.08.544212v2)
-for details.
-
-## Installation
-
-Install sc2ts from PyPI:
-
-```
-python -m pip install sc2ts
-```
-
-This installs the minimum requirement to enable the
-[ARG analysis](#ARG-analysis-API) and [Dataset](#Dataset-API)s.
-To run [inference](#inference), you must install some extra
-dependencies using the 'inference' optional extra:
-
-```
-python -m pip install sc2ts[inference]
-```
-
````
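A quick sanity check that the base install described above worked (the `__version__` attribute is an assumption based on common packaging convention, hence the guarded lookup):

```python
import sc2ts  # raises ImportError if the install failed

# __version__ is assumed, not confirmed by the README; fall back gracefully.
print(getattr(sc2ts, "__version__", "unknown"))
```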
````diff
-## ARG analysis API
-
-The sc2ts API provides two convenience functions to compute summary
-dataframes for the nodes and mutations in a sc2ts-output ARG.
-
-To see some examples, first download the (31MB) sc2ts inferred ARG
-from [Zenodo](https://zenodo.org/records/17558489/):
-
-```
-curl -O https://zenodo.org/records/17558489/files/sc2ts_viridian_v1.2.trees.tsz
-```
-
-We can then use these like
-
-```python
-import sc2ts
-import tszip
-
-ts = tszip.load("sc2ts_viridian_v1.2.trees.tsz")
-
-df_node = sc2ts.node_data(ts)
-df_mutation = sc2ts.mutation_data(ts)
-```
-
-See the [live demo](https://tskit.dev/explore/lab/index.html?path=sc2ts.ipynb)
-for a browser based interactive demo of using these dataframes for
-real-time pandemic-scale analysis.
-
````
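The dataframes returned by `node_data` and `mutation_data` are ordinary pandas objects; a small sketch of the kind of summary they enable, using columns visible in the `node_data` output shown further down this diff:

```python
# Count recombination nodes in the ARG (column names taken from the
# node_data output printed later in this README).
recombinants = df_node[df_node.is_recombinant]
print(len(recombinants), "recombination nodes")

# Distribution of mutation counts over sample nodes.
print(df_node[df_node.is_sample].num_mutations.describe())
```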
````diff
-## Dataset API
-
-Sc2ts also provides a convenient API for accessing large-scale
-alignments and metadata stored in
-[VCF Zarr](https://doi.org/10.1093/gigascience/giaf049) format.
-
-Resources:
-
-- See this [notebook](https://github.com/tskit-dev/sc2ts-paper/blob/main/notebooks/example_data_processing.ipynb)
-  for an example in which we access the data variant-by-variant and
-  which explains the low-level data encoding
-- See the [VCF Zarr publication](https://doi.org/10.1093/gigascience/giaf049)
-  for more details on and benchmarks on this dataset.
-
-
-**TODO** Add some references to API documentation
-
````
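For orientation, a minimal sketch of opening such a store with the generic `zarr` library; the array names follow the VCF Zarr specification, and the exact layout of the Viridian store (documented in the notebook linked above) may differ or include extra fields:

```python
import zarr

# Open an unzipped VCF Zarr store; the path matches the Viridian
# download used in the inference walkthrough below.
root = zarr.open("viridian_mafft_2024-10-14_v1.vcz", mode="r")

# Array names per the VCF Zarr spec -- an assumption about this store.
print(root["sample_id"].shape)        # one entry per sequenced sample
print(root["variant_position"][:5])   # genomic coordinates of variants
```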
````diff
-## Inference
-
-### Command line inference
-
-Inference is intended to be run from the command-line primarily,
-and most likely orchestrated via a shell script or Snakemake file, etc.
-
-The CLI is split into subcommands. Get help by running the CLI without arguments:
-
-```
-python3 -m sc2ts
-```
-
-**TODO document the process of getting a Zarr dataset and using it**
-
-
-## Inference
-
-Here we'll run through a quick example of how to get inference running
-on a local machine using an example config file, using the Viridian data downloaded
-from Zenodo.
-
-### Prerequisites
-
-First, install the "inference" version of sc2ts from pypi:
-
-```
-python -m pip install sc2ts[inference]
-```
-
-**This is essential! The base install of sc2ts contains the minimal
-dependencies required to access the analysis utilities outlined above.**
-
-Then, download the (401MB) Viridian dataset in
-[VCF Zarr format](https://doi.org/10.1093/gigascience/giaf049) from
-[Zenodo](https://zenodo.org/records/16314739):
-
-```
-curl -O https://zenodo.org/records/16314739/files/viridian_mafft_2024-10-14_v1.vcz.zip
-```
````
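The download is a zip archive, so it presumably needs unpacking before use (assuming the standard `unzip` tool):

```
unzip viridian_mafft_2024-10-14_v1.vcz.zip
```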
```
119-
### CLI
120-
121-
Inference is performed using the CLI, which is composed of number of subcommands.
122-
See the online help for more information:
123-
124-
```
125-
python -m sc2ts --help
126-
```
127-
128-
### Primary inference
129-
130-
Primary inference is performed using the ``infer`` subcommand of the CLI,
131-
and all parameters are specified using a toml file.
132-
133-
The [example config file](example_config.toml) can be used to perform
134-
inference over a short period, to demonstrate how sc2ts works:
135-
136-
```
137-
python3 -m sc2ts infer example_config.toml --stop=2020-02-02
138-
```
139-
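For a sense of what such a config carries, here is a hypothetical sketch pieced together from details mentioned below (the `ex1` run prefix, the `example_inference` output directory, the log level of 2); the key names are illustrative assumptions, and `example_config.toml` in the repository is the authoritative reference:

```toml
# Illustrative only -- these key names are assumptions, not sc2ts's schema.
run_id = "ex1"                                # prefix for the daily .ts files
results_dir = "example_inference"             # where outputs are written
dataset = "viridian_mafft_2024-10-14_v1.vcz"  # the Viridian VCF Zarr store
log_level = 2                                 # full debug output
```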
````diff
-Once this finishes (it should take a few minutes and requires ~5GB RAM), the results of the
-inference will be in the ``example_inference`` directory (as specified in the
-config file) and look something like this:
-
-```
-$ tree example_inference
-example_inference
-├── ex1
-│   ├── ex1_2020-01-01.ts
-│   ├── ex1_2020-01-10.ts
-│   ├── ex1_2020-01-12.ts
-│   ├── ex1_2020-01-19.ts
-│   ├── ex1_2020-01-24.ts
-│   ├── ex1_2020-01-25.ts
-│   ├── ex1_2020-01-28.ts
-│   ├── ex1_2020-01-29.ts
-│   ├── ex1_2020-01-30.ts
-│   ├── ex1_2020-01-31.ts
-│   ├── ex1_2020-02-01.ts
-│   └── ex1_init.ts
-├── ex1.log
-└── ex1.matches.db
-```
-
-Here we've run inference for all dates in January 2020 for which we have data, plus the 1st Feb.
-The results of inference for each day are stored in the
-``example_inference/ex1`` directory as tskit files representing the ARG
-inferred up to that day. There is a lot of redundancy in keeping all these
-daily files lying around, but it is useful to be able to go back to the
-state of the ARG at a particular date and they don't take up much space.
-
-The file ``ex1.log`` contains the log file. The config file set the log-level
-to 2, which is full debug output. There is a lot of useful information in there,
-and it can be very helpful when debugging, so we recommend keeping the logs.
-
-The ``ex1.matches.db`` is the "match DB" which stores information about the
-HMM match for each sample. This is mainly used to store exact matches
-found during inference.
-
-The ARGs output during primary inference (this step here) have a lot of
-debugging metadata included (see the section on the Debug utilities below)
-
-Primary inference can be stopped and picked up again at any point using
-the ``--start`` option.
-
-
````
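Each daily file is an ordinary tskit tree sequence, so the state of the ARG on any given day can be inspected directly (file name taken from the listing above):

```python
import tskit

# Load the ARG as inferred up to 24 January 2020.
ts = tskit.load("example_inference/ex1/ex1_2020-01-24.ts")
print(ts.num_samples, "samples,", ts.num_trees, "trees,",
      ts.num_mutations, "mutations")
```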
````diff
-### Postprocessing
-
-Once we've finished primary inference we can run postprocessing to perform
-a few housekeeping tasks. Continuing the example above:
-
-```
-$ python3 -m sc2ts postprocess -vv \
-    --match-db example_inference/ex1.matches.db \
-    example_inference/ex1/ex1_2020-02-01.ts \
-    example_inference/ex1_2020-02-01_pp.ts
-```
-
-Among other things, this incorporates the exact matches in the match DB
-into the final ARG.
-
````
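Since the exact matches are folded into the final ARG, a quick before/after comparison should show the effect; a sketch assuming the incorporated matches appear as additional sample nodes:

```python
import tskit

before = tskit.load("example_inference/ex1/ex1_2020-02-01.ts")
after = tskit.load("example_inference/ex1_2020-02-01_pp.ts")
# If exact matches are added as samples, the postprocessed file should
# contain at least as many sample nodes as the raw daily output.
print(before.num_samples, "->", after.num_samples)
```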
````diff
-### Generating final analysis file
-
-To generate the final analysis-ready file (used as input to the analysis
-APIs above) we need to run ``minimise-metadata``. This removes all but
-the most necessary metadata from the ARG, and recodes node metadata
-using the [struct codec](https://tskit.dev/tskit/docs/stable/metadata.html#structured-array-metadata)
-for efficiency. On our example above:
-
-```
-$ python -m sc2ts minimise-metadata \
-    -m strain sample_id \
-    -m Viridian_pangolin pango \
-    example_inference/ex1_2020-02-01_pp.ts \
-    example_inference/ex1_2020-02-01_pp_mm.ts
-```
-
-This recodes the metadata in the input tree sequence such that
-the existing ``strain`` field is renamed to ``sample_id``
-(for compatibility with VCF Zarr) and the ``Viridian_pangolin``
-field (extracted from the Viridian metadata) is renamed to ``pango``.
-
-We can then use the analysis APIs on this file:
-
-```python
-import sc2ts
-import tskit
-
-ts = tskit.load("example_inference/ex1_2020-02-01_pp_mm.ts")
-dfn = sc2ts.node_data(ts)
-print(dfn)
-```
-
-giving something like:
-
-```
-   pango         sample_id  node_id  is_sample  is_recombinant  num_mutations        date
-0         Vestigial_ignore        0      False           False              0  2019-12-25
-1          Wuhan/Hu-1/2019        1      False           False              0  2019-12-26
-2      A       SRR11772659        2       True           False              1  2020-01-19
-3      B       SRR11397727        3       True           False              0  2020-01-24
-4      B       SRR11397730        4       True           False              0  2020-01-24
-..   ...               ...      ...        ...             ...            ...         ...
-60     A       SRR11597177       60       True           False              0  2020-01-30
-61     A       SRR11597197       61       True           False              0  2020-01-30
-62     B       SRR11597144       62       True           False              0  2020-02-01
-63     B       SRR11597148       63       True           False              0  2020-02-01
-64     B       SRR25229386       64       True           False              0  2020-02-01
-```
-
````
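These per-node dataframes drop straight into standard pandas workflows; for example, tallying samples per day and Pango lineage using the columns shown above:

```python
# Count sample nodes by collection date and Pango lineage.
samples = dfn[dfn.is_sample]
print(samples.groupby(["date", "pango"]).size())
```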
````diff
-## Development
-
-To run the unit tests, use
-
-```
-python3 -m pytest
-```
-
-You may need to regenerate some cached test fixtures occasionaly (particularly
-if getting cryptic errors when running the test suite). To do this, run
-
-```
-rm -fR tests/data/cache/
-```
-
-and rerun tests as above.
-
-### Debug utilities
-
-The tree sequence files output during primary inference have a lot
-of debugging metadata, and there are some developer tools for inspecting
-this in the ``sc2ts.debug`` package. In particular, the ``ArgInfo``
-class has a lot of useful utilities designed to be used in a Jupyter
-notebook. Note that ``matplotlib`` is required for these. Use it like:
-
-```python
-import sc2ts.debug as sd
-import tskit
-
-ts = tskit.load("path_to_daily_inference.ts")
-ai = sd.ArgInfo(ts)
-ai # view summary in notebook
-```
-
+Please see the online [documentation](https://tskit.dev/sc2ts/docs) for details
+on the software
+and the [preprint](https://www.biorxiv.org/content/10.1101/2023.06.08.544212v2)
+for information on the method and the inferred ARG.
 
````

docs/_config.yml

Lines changed: 1 addition & 1 deletion

````diff
@@ -4,7 +4,7 @@
 title: sc2ts manual
 author: sc2ts developers
 logo: sc2ts.png
-copyright: "2024"
+copyright: "2025"
 only_build_toc_files: true
 
 execute:
````

docs/_toc.yml

Lines changed: 11 additions & 0 deletions

````diff
@@ -1,7 +1,18 @@
 format: jb-book
 root: intro
 parts:
+- caption: Getting started
+  chapters:
+  - file: installation
+- caption: Usage
+  chapters:
+  - file: inference
+  - file: arg_analysis
+  - file: alignments_analysis
 - caption: Interfaces
   chapters:
   - file: cli
   - file: api
+- caption: Misc
+  chapters:
+  - file: development
````
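The `_config.yml` and `_toc.yml` above are Jupyter Book files (`format: jb-book`), so the manual can presumably be built locally with the standard command:

```
jupyter-book build docs/
```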

docs/alignments_analysis.md

Lines changed: 2 additions & 0 deletions

````diff
@@ -0,0 +1,2 @@
+(sec_alignments_analysis)=
+# Alignments analysis
````

docs/api.md

Lines changed: 2 additions & 0 deletions

````diff
@@ -1,3 +1,5 @@
+(sec_python_api)=
+
 # Python API
 
 This page documents the public Python API exposed by ``sc2ts``.
````
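The `(sec_...)=` lines added across these pages are MyST label targets; elsewhere in the book they can be cross-referenced with the standard `{ref}` role, along these lines:

```md
See {ref}`sec_python_api` for the public Python API.
```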

docs/arg_analysis.md

Lines changed: 51 additions & 0 deletions

````diff
@@ -0,0 +1,51 @@
+(sec_arg_analysis)=
+# ARG analysis
+
+
+## ARG analysis API
+
+The sc2ts API provides two convenience functions to compute summary
+dataframes for the nodes and mutations in a sc2ts-output ARG.
+
+To see some examples, first download the (31MB) sc2ts inferred ARG
+from [Zenodo](https://zenodo.org/records/17558489/):
+
+```
+curl -O https://zenodo.org/records/17558489/files/sc2ts_viridian_v1.2.trees.tsz
+```
+
+We can then use these like
+
+```python
+import sc2ts
+import tszip
+
+ts = tszip.load("sc2ts_viridian_v1.2.trees.tsz")
+
+df_node = sc2ts.node_data(ts)
+df_mutation = sc2ts.mutation_data(ts)
+```
+
+See the [live demo](https://tskit.dev/explore/lab/index.html?path=sc2ts.ipynb)
+for a browser based interactive demo of using these dataframes for
+real-time pandemic-scale analysis.
+
+## Dataset API
+
+Sc2ts also provides a convenient API for accessing large-scale
+alignments and metadata stored in
+[VCF Zarr](https://doi.org/10.1093/gigascience/giaf049) format.
+
+Resources:
+
+- See this [notebook](https://github.com/jeromekelleher/sc2ts-paper/blob/main/notebooks/example_data_processing.ipynb)
+  for an example in which we access the data variant-by-variant and
+  which explains the low-level data encoding
+- See the [VCF Zarr publication](https://doi.org/10.1093/gigascience/giaf049)
+  for more details on and benchmarks on this dataset.
+
+
+**TODO** Add some references to API documentation
+
+
+
````