Commit 8ff1582

Port the inference docs
1 parent e6d1243 commit 8ff1582

3 files changed: +159 -169 lines changed

README.md

Lines changed: 0 additions & 169 deletions
@@ -76,176 +76,7 @@ for more details on and benchmarks on this dataset.
76 76

77 77
**TODO** Add some references to API documentation
78 78

79-
## Inference
80 79

81-
### Command line inference
82-
83-
Inference is intended to be run from the command-line primarily,
84-
and most likely orchestrated via a shell script or Snakemake file, etc.
85-
86-
The CLI is split into subcommands. Get help by running the CLI without arguments:
87-
88-
```
89-
python3 -m sc2ts
90-
```
91-
92-
**TODO document the process of getting a Zarr dataset and using it**
93-
94-
95-
## Inference
96-
97-
Here we'll run through a quick example of how to get inference running
98-
on a local machine using an example config file, using the Viridian data downloaded
99-
from Zenodo.
100-
101-
### Prerequisites
102-
103-
First, install the "inference" version of sc2ts from pypi:
104-
105-
```
106-
python -m pip install sc2ts[inference]
107-
```
108-
109-
**This is essential! The base install of sc2ts contains the minimal
110-
dependencies required to access the analysis utilities outlined above.**
111-
112-
Then, download the (401MB) Viridian dataset in
113-
[VCF Zarr format](https://doi.org/10.1093/gigascience/giaf049) from
114-
[Zenodo](https://zenodo.org/records/16314739):
115-
116-
```
117-
curl -O https://zenodo.org/records/16314739/files/viridian_mafft_2024-10-14_v1.vcz.zip
118-
```
119-
### CLI
120-
121-
Inference is performed using the CLI, which is composed of number of subcommands.
122-
See the online help for more information:
123-
124-
```
125-
python -m sc2ts --help
126-
```
127-
128-
### Primary inference
129-
130-
Primary inference is performed using the ``infer`` subcommand of the CLI,
131-
and all parameters are specified using a toml file.
132-
133-
The [example config file](example_config.toml) can be used to perform
134-
inference over a short period, to demonstrate how sc2ts works:
135-
136-
```
137-
python3 -m sc2ts infer example_config.toml --stop=2020-02-02
138-
```
139-
140-
Once this finishes (it should take a few minutes and requires ~5GB RAM), the results of the
141-
inference will be in the ``example_inference`` directory (as specified in the
142-
config file) and look something like this:
143-
144-
```
145-
$ tree example_inference
146-
example_inference
147-
├── ex1
148-
│   ├── ex1_2020-01-01.ts
149-
│   ├── ex1_2020-01-10.ts
150-
│   ├── ex1_2020-01-12.ts
151-
│   ├── ex1_2020-01-19.ts
152-
│   ├── ex1_2020-01-24.ts
153-
│   ├── ex1_2020-01-25.ts
154-
│   ├── ex1_2020-01-28.ts
155-
│   ├── ex1_2020-01-29.ts
156-
│   ├── ex1_2020-01-30.ts
157-
│   ├── ex1_2020-01-31.ts
158-
│   ├── ex1_2020-02-01.ts
159-
│   └── ex1_init.ts
160-
├── ex1.log
161-
└── ex1.matches.db
162-
```
163-
164-
Here we've run inference for all dates in January 2020 for which we have data, plus the 1st Feb.
165-
The results of inference for each day are stored in the
166-
``example_inference/ex1`` directory as tskit files representing the ARG
167-
inferred up to that day. There is a lot of redundancy in keeping all these
168-
daily files lying around, but it is useful to be able to go back to the
169-
state of the ARG at a particular date and they don't take up much space.
170-
171-
The file ``ex1.log`` contains the log file. The config file set the log-level
172-
to 2, which is full debug output. There is a lot of useful information in there,
173-
and it can be very helpful when debugging, so we recommend keeping the logs.
174-
175-
The ``ex1.matches.db`` is the "match DB" which stores information about the
176-
HMM match for each sample. This is mainly used to store exact matches
177-
found during inference.
178-
179-
The ARGs output during primary inference (this step here) have a lot of
180-
debugging metadata included (see the section on the Debug utilities below)
181-
182-
Primary inference can be stopped and picked up again at any point using
183-
the ``--start`` option.
184-
185-
186-
### Postprocessing
187-
188-
Once we've finished primary inference we can run postprocessing to perform
189-
a few housekeeping tasks. Continuing the example above:
190-
191-
```
192-
$ python3 -m sc2ts postprocess -vv \
193-
--match-db example_inference/ex1.matches.db \
194-
example_inference/ex1/ex1_2020-02-01.ts \
195-
example_inference/ex1_2020-02-01_pp.ts
196-
```
197-
198-
Among other things, this incorporates the exact matches in the match DB
199-
into the final ARG.
200-
201-
### Generating final analysis file
202-
203-
To generate the final analysis-ready file (used as input to the analysis
204-
APIs above) we need to run ``minimise-metadata``. This removes all but
205-
the most necessary metadata from the ARG, and recodes node metadata
206-
using the [struct codec](https://tskit.dev/tskit/docs/stable/metadata.html#structured-array-metadata)
207-
for efficiency. On our example above:
208-
209-
```
210-
$ python -m sc2ts minimise-metadata \
211-
-m strain sample_id \
212-
-m Viridian_pangolin pango \
213-
example_inference/ex1_2020-02-01_pp.ts \
214-
example_inference/ex1_2020-02-01_pp_mm.ts
215-
```
216-
217-
This recodes the metadata in the input tree sequence such that
218-
the existing ``strain`` field is renamed to ``sample_id``
219-
(for compatibility with VCF Zarr) and the ``Viridian_pangolin``
220-
field (extracted from the Viridian metadata) is renamed to ``pango``.
221-
222-
We can then use the analysis APIs on this file:
223-
224-
```python
225-
import sc2ts
226-
import tskit
227-
228-
ts = tskit.load("example_inference/ex1_2020-02-01_pp_mm.ts")
229-
dfn = sc2ts.node_data(ts)
230-
print(dfn)
231-
```
232-
233-
giving something like:
234-
235-
```
236-
pango sample_id node_id is_sample is_recombinant num_mutations date
237-
0 Vestigial_ignore 0 False False 0 2019-12-25
238-
1 Wuhan/Hu-1/2019 1 False False 0 2019-12-26
239-
2 A SRR11772659 2 True False 1 2020-01-19
240-
3 B SRR11397727 3 True False 0 2020-01-24
241-
4 B SRR11397730 4 True False 0 2020-01-24
242-
.. ... ... ... ... ... ... ...
243-
60 A SRR11597177 60 True False 0 2020-01-30
244-
61 A SRR11597197 61 True False 0 2020-01-30
245-
62 B SRR11597144 62 True False 0 2020-02-01
246-
63 B SRR11597148 63 True False 0 2020-02-01
247-
64 B SRR25229386 64 True False 0 2020-02-01
248-
```
249 80

250 81
## Development
251 82

docs/inference.md

Lines changed: 159 additions & 0 deletions
@@ -2,3 +2,162 @@
2 2

3 3
# Inference
4 4

5+
6+
Here we'll run through a quick example of how to get inference running
7+
on a local machine, using an example config file and the Viridian data downloaded
8+
from Zenodo.
9+
10+
Inference is performed using the CLI, which is composed of a number of subcommands.
11+
See the {ref}`sc2ts_sec_cli` section for more information.
12+
13+
## Prerequisites
14+
15+
First, install the "inference" version of sc2ts from PyPI:
16+
17+
```
18+
python -m pip install sc2ts[inference]
19+
```
20+
21+
**This is essential! The base install of sc2ts contains only the minimal
22+
dependencies required to access the analysis utilities outlined above.**
23+
24+
Then, download the (401MB) Viridian dataset in
25+
[VCF Zarr format](https://doi.org/10.1093/gigascience/giaf049) from
26+
[Zenodo](https://zenodo.org/records/16314739):
27+
28+
```
29+
curl -O https://zenodo.org/records/16314739/files/viridian_mafft_2024-10-14_v1.vcz.zip
30+
```
31+
32+
Also, download the example configuration file:
33+
34+
```
35+
curl -O https://raw.githubusercontent.com/tskit-dev/sc2ts/refs/heads/main/docs/example_config.toml
36+
```
37+
38+
39+
## Primary inference
40+
41+
Primary inference is performed using the ``infer`` subcommand of the CLI,
42+
and all parameters are specified using a toml file.
43+
44+
The [example config file](example_config.toml) can be used to perform
45+
inference over a short period, to demonstrate how sc2ts works:
46+
47+
48+
```
49+
python3 -m sc2ts infer example_config.toml --stop=2020-02-02
50+
```
51+
52+
Once this finishes (it should take a few minutes and requires ~5GB RAM), the results of the
53+
inference will be in the ``example_inference`` directory (as specified in the
54+
config file) and look something like this:
55+
56+
```
57+
$ tree example_inference
58+
example_inference
59+
├── ex1
60+
│   ├── ex1_2020-01-01.ts
61+
│   ├── ex1_2020-01-10.ts
62+
│   ├── ex1_2020-01-12.ts
63+
│   ├── ex1_2020-01-19.ts
64+
│   ├── ex1_2020-01-24.ts
65+
│   ├── ex1_2020-01-25.ts
66+
│   ├── ex1_2020-01-28.ts
67+
│   ├── ex1_2020-01-29.ts
68+
│   ├── ex1_2020-01-30.ts
69+
│   ├── ex1_2020-01-31.ts
70+
│   ├── ex1_2020-02-01.ts
71+
│   └── ex1_init.ts
72+
├── ex1.log
73+
└── ex1.matches.db
74+
```
75+
76+
Here we've run inference for all dates in January 2020 for which we have data, plus the 1st Feb.
77+
The results of inference for each day are stored in the
78+
``example_inference/ex1`` directory as tskit files representing the ARG
79+
inferred up to that day. There is a lot of redundancy in keeping all these
80+
daily files lying around, but it is useful to be able to go back to the
81+
state of the ARG at a particular date and they don't take up much space.
82+
83+
The file ``ex1.log`` is the log file. The config file sets the log level
84+
to 2, which is full debug output. There is a lot of useful information in there,
85+
and it can be very helpful when debugging, so we recommend keeping the logs.
86+
87+
The ``ex1.matches.db`` file is the "match DB", which stores information about the
88+
HMM match for each sample. This is mainly used to store exact matches
89+
found during inference.
90+
91+
The ARGs output during primary inference (this step here) have a lot of
92+
debugging metadata included (see the section on the Debug utilities below).
93+
94+
Primary inference can be stopped and picked up again at any point using
95+
the ``--start`` option.
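
When picking a run back up, it is useful to know the last day for which output was written. The sketch below finds the most recent daily file in a run directory by parsing the dates in filenames such as ``ex1_2020-01-31.ts``. The helper names are hypothetical and the filename pattern is inferred from the listing above. How the resulting date maps onto the value given to ``--start`` should be checked against the CLI help:

```python
import datetime
import pathlib
import re

# Matches names such as "ex1_2020-01-31.ts"; "ex1_init.ts" is skipped.
DATED_TS = re.compile(r"^(?P<run>.+)_(?P<date>\d{4}-\d{2}-\d{2})\.ts$")


def dated_tree_sequences(run_dir):
    """Return (date, path) pairs for the daily files in run_dir, oldest first."""
    found = []
    for path in pathlib.Path(run_dir).glob("*.ts"):
        match = DATED_TS.match(path.name)
        if match is not None:
            date = datetime.date.fromisoformat(match.group("date"))
            found.append((date, path))
    return sorted(found)


def latest_tree_sequence(run_dir):
    """Return the (date, path) pair for the most recent daily file, or None."""
    found = dated_tree_sequences(run_dir)
    return found[-1] if found else None


if __name__ == "__main__":
    latest = latest_tree_sequence("example_inference/ex1")
    if latest is None:
        print("No daily tree sequences found")
    else:
        print(f"Latest daily file: {latest[1]} ({latest[0].isoformat()})")
```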
96+
97+
:::{todo}
98+
Add documentation for the toml config file
99+
:::
100+
101+
## Postprocessing
102+
103+
Once we've finished primary inference we can run postprocessing to perform
104+
a few housekeeping tasks. Continuing the example above:
105+
106+
```
107+
$ python3 -m sc2ts postprocess -vv \
108+
--match-db example_inference/ex1.matches.db \
109+
example_inference/ex1/ex1_2020-02-01.ts \
110+
example_inference/ex1_2020-02-01_pp.ts
111+
```
112+
113+
Among other things, this incorporates the exact matches in the match DB
114+
into the final ARG.
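
The same call can be assembled from Python, for example when scripting over several runs. This is a minimal sketch: the `postprocess_command` helper and its ``_pp.ts`` naming rule are illustrative assumptions that mirror the example above, and only the subcommand and options already shown are used:

```python
import pathlib
import shlex
import sys


def postprocess_command(ts_path, match_db, out_dir):
    """Build the argument list for the postprocess step shown above."""
    ts_path = pathlib.Path(ts_path)
    # "ex1_2020-02-01.ts" -> "ex1_2020-02-01_pp.ts", following the example.
    out_path = pathlib.Path(out_dir) / f"{ts_path.stem}_pp.ts"
    return [
        sys.executable, "-m", "sc2ts", "postprocess", "-vv",
        "--match-db", str(match_db),
        str(ts_path),
        str(out_path),
    ]


if __name__ == "__main__":
    cmd = postprocess_command(
        "example_inference/ex1/ex1_2020-02-01.ts",
        "example_inference/ex1.matches.db",
        "example_inference",
    )
    # Print the command for inspection rather than running it.
    print(shlex.join(cmd))
    # To run it (requires the sc2ts[inference] install):
    #   import subprocess
    #   subprocess.run(cmd, check=True)
```

Building the command as a list avoids shell quoting problems if paths contain spaces.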
115+
116+
## Generating final analysis file
117+
118+
To generate the final analysis-ready file (used as input to the analysis
119+
APIs above) we need to run ``minimise-metadata``. This removes all but
120+
the most necessary metadata from the ARG, and recodes node metadata
121+
using the [struct codec](https://tskit.dev/tskit/docs/stable/metadata.html#structured-array-metadata)
122+
for efficiency. On our example above:
123+
124+
```
125+
$ python -m sc2ts minimise-metadata \
126+
-m strain sample_id \
127+
-m Viridian_pangolin pango \
128+
example_inference/ex1_2020-02-01_pp.ts \
129+
example_inference/ex1_2020-02-01_pp_mm.ts
130+
```
131+
132+
This recodes the metadata in the input tree sequence such that
133+
the existing ``strain`` field is renamed to ``sample_id``
134+
(for compatibility with VCF Zarr) and the ``Viridian_pangolin``
135+
field (extracted from the Viridian metadata) is renamed to ``pango``.
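
The effect of the two ``-m`` options can be pictured as a key rename on each node's metadata. The sketch below shows only that idea on a plain dictionary; it is not how sc2ts implements the step, it does not show the struct codec recoding, and the `rename_fields` helper is a hypothetical name. The example values are taken from the sample output in this section:

```python
def rename_fields(metadata, renames):
    """Return a copy of metadata with keys renamed according to renames."""
    # Keys without an entry in renames are kept under their existing name.
    return {renames.get(key, key): value for key, value in metadata.items()}


# Old name -> new name, matching the two -m options in the command above.
renames = {"strain": "sample_id", "Viridian_pangolin": "pango"}

before = {"strain": "SRR11772659", "Viridian_pangolin": "A"}
after = rename_fields(before, renames)
print(after)
```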
136+
137+
We can then use the analysis APIs on this file:
138+
139+
```python
140+
import sc2ts
141+
import tskit
142+
143+
ts = tskit.load("example_inference/ex1_2020-02-01_pp_mm.ts")
144+
dfn = sc2ts.node_data(ts)
145+
print(dfn)
146+
```
147+
148+
giving something like:
149+
150+
```
151+
pango sample_id node_id is_sample is_recombinant num_mutations date
152+
0 Vestigial_ignore 0 False False 0 2019-12-25
153+
1 Wuhan/Hu-1/2019 1 False False 0 2019-12-26
154+
2 A SRR11772659 2 True False 1 2020-01-19
155+
3 B SRR11397727 3 True False 0 2020-01-24
156+
4 B SRR11397730 4 True False 0 2020-01-24
157+
.. ... ... ... ... ... ... ...
158+
60 A SRR11597177 60 True False 0 2020-01-30
159+
61 A SRR11597197 61 True False 0 2020-01-30
160+
62 B SRR11597144 62 True False 0 2020-02-01
161+
63 B SRR11597148 63 True False 0 2020-02-01
162+
64 B SRR25229386 64 True False 0 2020-02-01
163+
```
