Commit a71d1e3

Merge pull request #565 from jeromekelleher/file-slimming

File slimming

2 parents 69c2eb1 + ab60c53

5 files changed: +224 -66 lines changed

README.md

Lines changed: 177 additions & 9 deletions
@@ -1,11 +1,20 @@
 # sc2ts
 Infer a succinct tree sequence from SARS-CoV-2 variation data
 
-**This is an early alpha version not intended for production use!!**
 
 If you are interested in helping to develop sc2ts or would like to
 work with the inferred ARGs, please get in touch.
 
+To get started, download the ARG in tszip format from
+[Zenodo](https://zenodo.org/records/17558489/):
+
+```
+curl -O https://zenodo.org/records/17558489/files/sc2ts_viridian_v1.2.trees.tsz
+```
+
+
 ## Installation
 
 ** TODO document local install **
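The downloaded ARG can then be loaded for analysis with the tszip Python package, along these lines (a minimal sketch, assuming tszip is installed alongside sc2ts):

```python
# Minimal sketch: decompress the downloaded .tsz file into a regular
# tskit.TreeSequence (assumes "pip install tszip" has been run).
import tszip

ts = tszip.decompress("sc2ts_viridian_v1.2.trees.tsz")
print(ts.num_samples, "samples,", ts.num_trees, "trees")
```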
@@ -26,17 +35,159 @@ python3 -m sc2ts
 **TODO document the process of getting a Zarr dataset and using it**
 
 
-## Licensing
+## Inference
+
+Here we'll run through a quick example of how to get inference running
+on a local machine, using an example config file and the Viridian data
+downloaded from Zenodo.
+
+### Prerequisites
+
+First, install the "inference" version of sc2ts from PyPI:
+
+```
+python -m pip install sc2ts[inference]
+```
+
+**This is essential! The base install of sc2ts contains only the minimal
+dependencies required for the analysis utilities outlined above.**
+
+Then, download the Viridian dataset in
+[VCF Zarr format](https://doi.org/10.1093/gigascience/giaf049) from
+[Zenodo](https://zenodo.org/records/16314739):
 
-The code is marked as licensed under the MIT license,
-but because the current implementation uses the matching
-engine from tsinfer (which is GPL licensed) this code is
-therefore also GPL.
+```
+curl -O https://zenodo.org/records/16314739/files/viridian_mafft_2024-10-14_v1.vcz.zip
+```
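The zip file is a VCF Zarr store, which sc2ts reads directly during inference. If you want to peek at it first, something like the following should work (a sketch, assuming the zarr Python package is installed; ``sample_id`` and ``variant_position`` are standard VCF Zarr fields):

```python
# Sketch: open the zipped Zarr store read-only, without unpacking it.
import zarr

store = zarr.storage.ZipStore("viridian_mafft_2024-10-14_v1.vcz.zip", mode="r")
root = zarr.open(store, mode="r")
print(root["sample_id"][:5])         # first few sample identifiers
print(root["variant_position"][:5])  # first few genome positions
```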
+### CLI
+
+Inference is performed using the CLI, which is composed of a number of
+subcommands. See the online help for more information:
+
+```
+python -m sc2ts --help
+```
+
+### Primary inference
+
+Primary inference is performed using the ``infer`` subcommand of the CLI,
+and all parameters are specified using a toml file.
 
-However, we plan to switch out the matching engine for an
-implementation provided by tskit, which is MIT licensed.
-This will be done before the first official release.
+Then run inference under the [example config](example_config.toml)
+for a little while to see how things work:
+
+```
+python3 -m sc2ts infer example_config.toml --stop=2020-02-02
+```
+
+Once this finishes (it should take a few minutes), the results of the
+inference will be in the ``example_inference`` directory (as specified in the
+config file) and look something like this:
+
+```
+$ tree example_inference
+example_inference
+├── ex1
+│   ├── ex1_2020-01-01.ts
+│   ├── ex1_2020-01-10.ts
+│   ├── ex1_2020-01-12.ts
+│   ├── ex1_2020-01-19.ts
+│   ├── ex1_2020-01-24.ts
+│   ├── ex1_2020-01-25.ts
+│   ├── ex1_2020-01-28.ts
+│   ├── ex1_2020-01-29.ts
+│   ├── ex1_2020-01-30.ts
+│   ├── ex1_2020-01-31.ts
+│   ├── ex1_2020-02-01.ts
+│   └── ex1_init.ts
+├── ex1.log
+└── ex1.matches.db
+```
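Each of these daily files is an ordinary tree sequence, so you can open one directly to check on progress (a quick sketch using standard tskit calls):

```python
# Sketch: load the ARG as inferred up to a given day and print basic stats.
import tskit

ts = tskit.load("example_inference/ex1/ex1_2020-02-01.ts")
print(ts.num_samples, "samples,", ts.num_nodes, "nodes,", ts.num_trees, "trees")
```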
 
+Here we've run inference for all dates in January 2020 for which we have data,
+plus February 1st. The results of inference for each day are stored in the
+``example_inference/ex1`` directory as a tskit file representing the ARG
+inferred up to that day. There is a lot of redundancy in keeping all these
+daily files lying around, but it is useful to be able to go back to the
+state of the ARG at a particular date, and they don't take up much space.
+
+The ``ex1.log`` file contains the logs. The config file set the log level
+to 2, which is full debug output. There is a lot of useful information in there,
+and it can be very helpful when debugging, so we recommend keeping the logs.
+
+The ``ex1.matches.db`` file is the "match DB", which stores information about
+the HMM match for each sample. This is mainly used to store exact matches
+found during inference.
+
+The ARGs output during primary inference (this step) have a lot of
+debugging metadata included (see the section on the debug utilities below).
+
+Primary inference can be stopped and picked up again at any point using
+the ``--start`` option.
+
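For example, to pick the run above up again partway through (assuming ``--start`` takes a date in the same format as ``--stop``):

```
python3 -m sc2ts infer example_config.toml --start=2020-01-15 --stop=2020-02-02
```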
+### Postprocessing
+
+Once we've finished primary inference we can run postprocessing to perform
+a few housekeeping tasks. Continuing the example above:
+
+```
+$ python3 -m sc2ts postprocess -vv \
+    --match-db example_inference/ex1.matches.db \
+    example_inference/ex1/ex1_2020-02-01.ts \
+    example_inference/ex1_2020-02-01_pp.ts
+```
+
+Among other things, this incorporates the exact matches in the match DB
+into the final ARG.
+
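A quick sanity check is to compare the ARG before and after postprocessing (a sketch; since the exact matches are incorporated, the postprocessed file will generally contain more samples):

```python
# Sketch: compare sample counts before and after postprocessing.
import tskit

before = tskit.load("example_inference/ex1/ex1_2020-02-01.ts")
after = tskit.load("example_inference/ex1_2020-02-01_pp.ts")
print(before.num_samples, "->", after.num_samples)
```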
+### Generating the final analysis file
+
+To generate the final analysis-ready file (used as input to the analysis
+APIs above) we need to run ``minimise-metadata``. This removes all but
+the most necessary metadata from the ARG, and recodes node metadata
+using the [struct codec](https://tskit.dev/tskit/docs/stable/metadata.html#structured-array-metadata)
+for efficiency. For our example above:
+
+```
+$ python -m sc2ts minimise-metadata \
+    -m strain sample_id \
+    -m Viridian_pangolin pango \
+    example_inference/ex1_2020-02-01_pp.ts \
+    example_inference/ex1_2020-02-01_pp_mm.ts
+```
+
+This recodes the metadata in the input tree sequence such that
+the existing ``strain`` field is renamed to ``sample_id``
+(for compatibility with VCF Zarr) and the ``Viridian_pangolin``
+field (extracted from the Viridian metadata) is renamed to ``pango``.
+
+We can then use the analysis APIs on this file:
+
+```python
+import sc2ts
+import tskit
+
+ts = tskit.load("example_inference/ex1_2020-02-01_pp_mm.ts")
+dfn = sc2ts.node_data(ts)
+print(dfn)
+```
+
+giving something like:
+
+```
+   pango         sample_id  node_id  is_sample  is_recombinant  num_mutations        date
+0         Vestigial_ignore        0      False           False              0  2019-12-25
+1          Wuhan/Hu-1/2019        1      False           False              0  2019-12-26
+2      A       SRR11772659        2       True           False              1  2020-01-19
+3      B       SRR11397727        3       True           False              0  2020-01-24
+4      B       SRR11397730        4       True           False              0  2020-01-24
+..   ...               ...      ...        ...             ...            ...         ...
+60     A       SRR11597177       60       True           False              0  2020-01-30
+61     A       SRR11597197       61       True           False              0  2020-01-30
+62     B       SRR11597144       62       True           False              0  2020-02-01
+63     B       SRR11597148       63       True           False              0  2020-02-01
+64     B       SRR25229386       64       True           False              0  2020-02-01
+```
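The output above suggests ``node_data`` returns a pandas DataFrame, so the usual pandas operations apply; for instance (a sketch, continuing from the snippet above):

```python
# Sketch: standard pandas queries over the per-node data frame.
samples = dfn[dfn.is_sample]
print(samples.pango.value_counts())    # samples per Pango lineage
print(samples.groupby("date").size())  # samples added per day
print(dfn[dfn.is_recombinant])         # recombination nodes, if any
```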
 
 ## Development
 
@@ -55,4 +206,21 @@ rm -fR tests/data/cache/
 
 and rerun tests as above.
 
+### Debug utilities
+
+The tree sequence files output during primary inference have a lot
+of debugging metadata, and there are some developer tools for inspecting
+this in the ``sc2ts.debug`` package. In particular, the ``ArgInfo``
+class has a lot of useful utilities designed to be used in a Jupyter
+notebook. Use it like this:
+
+```python
+import sc2ts.debug as sd
+import tskit
+
+ts = tskit.load("path_to_daily_inference.ts")
+ai = sd.ArgInfo(ts)
+ai  # view summary in notebook
+```
 
example_config.toml

Lines changed: 47 additions & 0 deletions
@@ -0,0 +1,47 @@
+dataset = "viridian_mafft_2024-10-14_v1.vcz.zip"
+date_field = "Date_tree"
+
+run_id = "ex1"
+# Configure where the result files are stored. For simplicity
+# we put them all in the "example_inference" directory.
+# It can be a good idea to store the match_db on fast storage,
+# as it gets updated and synced a lot, which can be slow
+# on shared file systems.
+results_dir = "example_inference/"
+log_dir = "example_inference/"
+matches_dir = "example_inference/"
+# This is full debug output, which is verbose (but useful!)
+log_level = 2
+
+# Dates to exclude from inference. This one is a large outlier in terms of the
+# number of samples, and enriched for incorrectly assigned dates.
+exclude_dates = ["2020-12-31"]
+
+# The set of site positions to mask during inference (list of integers).
+# Real inferences will almost certainly want to mask some sites, although
+# it's not clear which sites exactly.
+exclude_sites = []
+
+[extend_parameters]
+num_mismatches = 4
+hmm_cost_threshold = 7
+max_missing_sites = 500
+deletions_as_missing = true
+# max_daily_samples = 1000
+
+# Knobs for tuning retro group insertion
+min_group_size = 10
+min_root_mutations = 2
+max_recurrent_mutations = 2
+max_mutations_per_sample = 5
+retrospective_window = 7
+
+num_threads = -1
+memory_limit = 32
+
+include_samples = []
+
+[[override]]
+start = "2020-01-01"
+stop = "2020-03-01"
+parameters.max_missing_sites = 10000
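The config is plain TOML, so it can also be read programmatically, e.g. to check what an inference run was parameterised with (a sketch; ``tomllib`` is in the standard library from Python 3.11, older versions can use the ``tomli`` package):

```python
# Sketch: load the example config and inspect a few values.
import tomllib

with open("example_config.toml", "rb") as f:
    config = tomllib.load(f)

print(config["run_id"])                               # "ex1"
print(config["extend_parameters"]["num_mismatches"])  # 4
print(config["override"][0]["parameters"])            # {'max_missing_sites': 10000}
```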

get_lineage_data.sh

Lines changed: 0 additions & 6 deletions
This file was deleted.

get_problematic_sites.sh

Lines changed: 0 additions & 7 deletions
This file was deleted.

run.sh

Lines changed: 0 additions & 44 deletions
This file was deleted.
