11# sc2ts
22Infer a succinct tree sequence from SARS-COV-2 variation data
33
4- ** This is an early alpha version not intended for production use!!**
54
65If you are interested in helping to develop sc2ts or would like to
76work with the inferred ARGS, please get in touch.
87
8+
9+ Then, download the ARG in tszip format from
10+ [ Zenodo] ( https://zenodo.org/records/17558489/ ) :
11+
12+ ```
13+ curl -O https://zenodo.org/records/17558489/files/sc2ts_viridian_v1.2.trees.tsz
14+ ```
15+
16+
17+
918## Installation
1019
1120** TODO document local install **
@@ -26,17 +35,159 @@ python3 -m sc2ts
2635** TODO document the process of getting a Zarr dataset and using it**
2736
2837
29- ## Licensing
38+ ## Inference
39+
40+ Here we'll run through a quick example of how to get inference running
41+ on a local machine using an example config file, using the Viridian data downloaded
42+ from Zenodo.
43+
44+ ### Prerequisites
45+
46+ First, install the "inference" version of sc2ts from pypi:
47+
48+ ```
49+ python -m pip install sc2ts[inference]
50+ ```
51+
52+ ** This is essential! The base install of sc2ts contains the minimal
53+ dependencies required to access the analysis utilities outlined above.**
54+
55+ Then, download the Viridian dataset in
56+ [ VCF Zarr format] ( https://doi.org/10.1093/gigascience/giaf049 ) from
57+ [ Zenodo] ( https://zenodo.org/records/16314739 ) :
3058
31- The code is marked as licensed under the MIT license,
32- but because the current implementation is used the matching
33- engine from tsinfer (which is GPL licensed) this code is
34- therefore also GPL.
59+ ```
60+ curl -O https://zenodo.org/records/16314739/files/viridian_mafft_2024-10-14_v1.vcz.zip
61+ ```
62+ ### CLI
63+
64+ Inference is performed using the CLI, which is composed of number of subcommands.
65+ See the online help for more information:
66+
67+ ```
68+ python -m sc2ts --help
69+ ```
70+
71+ ### Primary inference
72+
73+ Primary inference is performed using the `` infer `` subcommand of the CLI,
74+ and all parameters are specified using a toml file.
3575
36- However, we plan to switch out the matching engine for an
37- implementation provided by tskit, which is MIT licensed.
38- This will be done before the first official release.
76+ Then inference under the [ example config] ( example_config.toml )
77+ for little while to see how things work:
78+
79+ ```
80+ python3 -m sc2ts infer example_config.toml --stop=2020-02-02
81+ ```
82+
83+ Once this finishes (it should take a few minutes), the results of the
84+ inference will be in the `` example_inference `` directory (as specified in the
85+ config file) and look something like this:
86+
87+ ```
88+ $ tree example_inference
89+ example_inference
90+ ├── ex1
91+ │ ├── ex1_2020-01-01.ts
92+ │ ├── ex1_2020-01-10.ts
93+ │ ├── ex1_2020-01-12.ts
94+ │ ├── ex1_2020-01-19.ts
95+ │ ├── ex1_2020-01-24.ts
96+ │ ├── ex1_2020-01-25.ts
97+ │ ├── ex1_2020-01-28.ts
98+ │ ├── ex1_2020-01-29.ts
99+ │ ├── ex1_2020-01-30.ts
100+ │ ├── ex1_2020-01-31.ts
101+ │ ├── ex1_2020-02-01.ts
102+ │ └── ex1_init.ts
103+ ├── ex1.log
104+ └── ex1.matches.db
105+ ```
39106
107+ Here we've run inference for all dates in January 2020 for which we have data
108+ and Feb 01. The results of inference for each day is stored in the
109+ `` example_inference/ex1 `` directory as a tskit file representing the ARG
110+ inferred up to that day. There is a lot of redundancy in keeping all these
111+ daily files lying around, but it is useful to be able to go back to the
112+ state of the ARG at a particular date and they don't take up much space.
113+
114+ The file `` ex1.log `` contains the log file. The config file set the log-level
115+ to 2, which is full debug output. There is a lot of useful information in there,
116+ and it can be very helpful when debugging, so we recommend keeping the logs.
117+
118+ The `` ex1.matches.db `` is the "match DB" which stores information about the
119+ HMM match for each sample. This is mainly used to store exact matches
120+ found during inference.
121+
122+ The ARGs output during primary inference (this step here) have a lot of
123+ debugging metadata included (see the section on the Debug utilities below)
124+
125+ Primary inference can be stopped and picked up again at any point using
126+ the `` --start `` option.
127+
128+
129+ ### Postprocessing
130+
131+ Once we've finished primary inference we can run postprocessing to perform
132+ a few housekeeping tasks. Continuing the example above:
133+
134+ ```
135+ $ python3 -m sc2ts postprocess -vv \
136+ --match-db example_inference/ex1.matches.db \
137+ example_inference/ex1/ex1_2020-02-01.ts \
138+ example_inference/ex1_2020-02-01_pp.ts
139+ ```
140+
141+ Among other things, this incorporates the exact matches in the match DB
142+ into the final ARG.
143+
144+ ### Generating final analysis file
145+
146+ To generate the final analysis ready file (used as input to the analysis
147+ APIs above) we need to run `` minimise-metadata `` . This removes all but
148+ the most necessary metadata from the ARG, and recodes node metadata
149+ using the [ struct codec] ( https://tskit.dev/tskit/docs/stable/metadata.html#structured-array-metadata )
150+ for efficiency. On our example above:
151+
152+ ```
153+ $ python -m sc2ts minimise-metadata \
154+ -m strain sample_id \
155+ -m Viridian_pangolin pango \
156+ example_inference/ex1_2020-02-01_pp.ts \
157+ example_inference/ex1_2020-02-01_pp_mm.ts
158+ ```
159+
160+ This recodes the metadata in the input tree sequence such that
161+ the existing `` strain `` field is renamed to `` sample_id ``
162+ (for compatibility with VCF Zarr) and the `` Viridian_pangolin ``
163+ field (extracted from the Viridian metadata) is renamed to `` pango `` .
164+
165+ We can then use the analysis APIs on this file:
166+ ``` python
167+ import sc2ts
168+ import tskit
169+
170+ ts = tskit.load(" example_inference/ex1_2020-02-01_pp_mm.ts" )
171+ dfn = sc2ts.node_data(ts)
172+ print (dfn)
173+ ```
174+
175+ giving something like:
176+
177+ ```
178+ pango sample_id node_id is_sample is_recombinant num_mutations date
179+ 0 Vestigial_ignore 0 False False 0 2019-12-25
180+ 1 Wuhan/Hu-1/2019 1 False False 0 2019-12-26
181+ 2 A SRR11772659 2 True False 1 2020-01-19
182+ 3 B SRR11397727 3 True False 0 2020-01-24
183+ 4 B SRR11397730 4 True False 0 2020-01-24
184+ .. ... ... ... ... ... ... ...
185+ 60 A SRR11597177 60 True False 0 2020-01-30
186+ 61 A SRR11597197 61 True False 0 2020-01-30
187+ 62 B SRR11597144 62 True False 0 2020-02-01
188+ 63 B SRR11597148 63 True False 0 2020-02-01
189+ 64 B SRR25229386 64 True False 0 2020-02-01
190+ ```
40191
41192## Development
42193
@@ -55,4 +206,21 @@ rm -fR tests/data/cache/
55206
56207and rerun tests as above.
57208
209+ ### Debug utilities
210+
211+ The tree sequences files output during primary inference have a lot
212+ of debugging metadata, and there are some developer tools for inspecting
213+ this in the `` sc2ts.debug `` package. In particular, the `` ArgInfo ``
214+ class has a lot of useful utilities designed to be used in a Jupyter
215+ notebook. Use it like
216+
217+ ``` python
218+ import sc2ts.debug as sd
219+ import tskit
220+
221+ ts = tskit.load(" path_to_daily_inference.ts" )
222+ ai = sd.ArgInfo(ts)
223+ ai # view summary in notebook
224+ ```
225+
58226
0 commit comments