convenient and efficient access to the full Viridian dataset (alignments and metadata)
in a single file using the [VCF Zarr specification](https://doi.org/10.1093/gigascience/giaf049).

Please see the online [documentation](https://tskit.dev/sc2ts/docs) for details
on the software, and the [preprint](https://www.biorxiv.org/content/10.1101/2023.06.08.544212v2)
for information on the method and the inferred ARG.

## Installation

Install sc2ts from PyPI:

```
python -m pip install sc2ts
```

This installs the minimal requirements for the
[ARG analysis](#arg-analysis-api) and [Dataset](#dataset-api) APIs.
To run [inference](#inference), you must install some extra
dependencies using the 'inference' optional extra:

```
python -m pip install sc2ts[inference]
```

## ARG analysis API

The sc2ts API provides two convenience functions to compute summary
dataframes for the nodes and mutations in a sc2ts-output ARG.

To see some examples, first download the (31MB) sc2ts-inferred ARG
from [Zenodo](https://zenodo.org/records/17558489/):

```
curl -O https://zenodo.org/records/17558489/files/sc2ts_viridian_v1.2.trees.tsz
```

We can then compute the node and mutation dataframes like this:

```python
import sc2ts
import tszip

# Load the compressed ARG downloaded above.
ts = tszip.load("sc2ts_viridian_v1.2.trees.tsz")

df_node = sc2ts.node_data(ts)
df_mutation = sc2ts.mutation_data(ts)
```
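
These are ordinary pandas dataframes, so standard pandas operations apply.
As a minimal sketch (assuming the node columns shown in the example output
later in this README, such as `is_sample`, `is_recombinant`, and `date`):

```python
# Count the sample nodes added per day (column names as in the
# node_data output shown in the inference example below).
samples_per_day = df_node[df_node.is_sample].groupby("date").size()
print(samples_per_day.head())

# Pull out any recombination nodes in the ARG.
print(df_node[df_node.is_recombinant])
```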

See the [live demo](https://tskit.dev/explore/lab/index.html?path=sc2ts.ipynb)
for a browser-based interactive demo of using these dataframes for
real-time pandemic-scale analysis.

## Dataset API

Sc2ts also provides a convenient API for accessing large-scale
alignments and metadata stored in
[VCF Zarr](https://doi.org/10.1093/gigascience/giaf049) format.

Resources:

- See this [notebook](https://github.com/tskit-dev/sc2ts-paper/blob/main/notebooks/example_data_processing.ipynb)
  for an example in which we access the data variant-by-variant and
  which explains the low-level data encoding (see also the sketch after
  this list)
- See the [VCF Zarr publication](https://doi.org/10.1093/gigascience/giaf049)
  for more details on the format and benchmarks on this dataset
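
As a flavour of the low-level encoding, the dataset can be opened directly
with the `zarr` Python package. This is a minimal sketch rather than the
sc2ts API: it assumes the Zenodo file described under Prerequisites below
has been downloaded, and uses array names from the VCF Zarr specification:

```python
import zarr

# Open the zipped VCF Zarr store (zarr 2.x ZipStore; alternatively,
# unzip the archive and open the resulting directory). If the arrays
# are nested under a top-level directory inside the archive, open
# that sub-group instead.
store = zarr.ZipStore("viridian_mafft_2024-10-14_v1.vcz.zip", mode="r")
root = zarr.open(store, mode="r")

# Arrays defined by the VCF Zarr specification.
print(root["sample_id"][:5])         # sample identifiers
print(root["variant_position"][:5])  # genomic position of each variant
print(root["call_genotype"].shape)   # (variants, samples, ploidy)
```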

**TODO** Add some references to API documentation

## Inference

Here we'll run through a quick example of how to get inference running
on a local machine, using an example config file and the Viridian data
downloaded from Zenodo.

### Prerequisites

First, install the "inference" version of sc2ts from PyPI:

```
python -m pip install sc2ts[inference]
```

**This is essential! The base install of sc2ts contains only the minimal
dependencies required for the analysis utilities outlined above.**

Then, download the (401MB) Viridian dataset in
[VCF Zarr format](https://doi.org/10.1093/gigascience/giaf049) from
[Zenodo](https://zenodo.org/records/16314739):

```
curl -O https://zenodo.org/records/16314739/files/viridian_mafft_2024-10-14_v1.vcz.zip
```

### CLI

Inference is intended to be run primarily from the command line, most
likely orchestrated via a shell script, Snakemake file, or similar.
The CLI is composed of a number of subcommands; see the online help
for more information:

```
python -m sc2ts --help
```

### Primary inference

Primary inference is performed using the `infer` subcommand of the CLI,
and all parameters are specified using a TOML file.

The [example config file](example_config.toml) can be used to perform
inference over a short period, to demonstrate how sc2ts works:

```
python3 -m sc2ts infer example_config.toml --stop=2020-02-02
```

Once this finishes (it should take a few minutes and requires ~5GB of
RAM), the results of the inference will be in the `example_inference`
directory (as specified in the config file) and look something like this:

```
$ tree example_inference
example_inference
├── ex1
│   ├── ex1_2020-01-01.ts
│   ├── ex1_2020-01-10.ts
│   ├── ex1_2020-01-12.ts
│   ├── ex1_2020-01-19.ts
│   ├── ex1_2020-01-24.ts
│   ├── ex1_2020-01-25.ts
│   ├── ex1_2020-01-28.ts
│   ├── ex1_2020-01-29.ts
│   ├── ex1_2020-01-30.ts
│   ├── ex1_2020-01-31.ts
│   ├── ex1_2020-02-01.ts
│   └── ex1_init.ts
├── ex1.log
└── ex1.matches.db
```
163-
164- Here we've run inference for all dates in January 2020 for which we have data, plus the 1st Feb.
165- The results of inference for each day are stored in the
166- `` example_inference/ex1 `` directory as tskit files representing the ARG
167- inferred up to that day. There is a lot of redundancy in keeping all these
168- daily files lying around, but it is useful to be able to go back to the
169- state of the ARG at a particular date and they don't take up much space.
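
Each of these daily files is a standard tskit tree sequence, so we can
inspect the state of the ARG at a given point in time. A minimal sketch,
using one of the dates from the listing above:

```python
import tskit

# Load the ARG as inferred up to a particular day.
ts = tskit.load("example_inference/ex1/ex1_2020-01-24.ts")
print(ts.num_samples, ts.num_trees, ts.num_mutations)
```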

The file `ex1.log` contains the logs. The config file sets the log level
to 2, which gives full debug output. There is a lot of useful information
in there, and it can be very helpful when debugging, so we recommend
keeping the logs.

The `ex1.matches.db` file is the "match DB", which stores information
about the HMM match for each sample. This is mainly used to store exact
matches found during inference.

The ARGs output during primary inference (this step) include a lot of
debugging metadata (see the section on debug utilities below).

Primary inference can be stopped and picked up again at any point using
the `--start` option.

### Postprocessing

Once we've finished primary inference, we can run postprocessing to
perform a few housekeeping tasks. Continuing the example above:

```
$ python3 -m sc2ts postprocess -vv \
    --match-db example_inference/ex1.matches.db \
    example_inference/ex1/ex1_2020-02-01.ts \
    example_inference/ex1_2020-02-01_pp.ts
```

Among other things, this incorporates the exact matches in the match DB
into the final ARG.
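
As a quick sanity check we can compare the two files with the standard
tskit API. The expectation that exact matches appear as extra sample
nodes is an assumption based on the description above, not documented
behaviour:

```python
import tskit

before = tskit.load("example_inference/ex1/ex1_2020-02-01.ts")
after = tskit.load("example_inference/ex1_2020-02-01_pp.ts")

# The postprocessed ARG incorporates the exact matches from the match
# DB, so (we assume) it should contain at least as many sample nodes.
print(before.num_samples, after.num_samples)
```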

### Generating the final analysis file

To generate the final analysis-ready file (used as input to the analysis
APIs above) we need to run `minimise-metadata`. This removes all but the
most necessary metadata from the ARG, and recodes node metadata using the
[struct codec](https://tskit.dev/tskit/docs/stable/metadata.html#structured-array-metadata)
for efficiency. For our example above:

```
$ python -m sc2ts minimise-metadata \
    -m strain sample_id \
    -m Viridian_pangolin pango \
    example_inference/ex1_2020-02-01_pp.ts \
    example_inference/ex1_2020-02-01_pp_mm.ts
```

This recodes the metadata in the input tree sequence such that the
existing `strain` field is renamed to `sample_id` (for compatibility
with VCF Zarr) and the `Viridian_pangolin` field (extracted from the
Viridian metadata) is renamed to `pango`.

We can then use the analysis APIs on this file:

```python
import sc2ts
import tskit

ts = tskit.load("example_inference/ex1_2020-02-01_pp_mm.ts")
dfn = sc2ts.node_data(ts)
print(dfn)
```

giving something like:

```
     pango         sample_id  node_id  is_sample  is_recombinant  num_mutations        date
0           Vestigial_ignore        0      False           False              0  2019-12-25
1            Wuhan/Hu-1/2019        1      False           False              0  2019-12-26
2        A       SRR11772659        2       True           False              1  2020-01-19
3        B       SRR11397727        3       True           False              0  2020-01-24
4        B       SRR11397730        4       True           False              0  2020-01-24
..     ...               ...      ...        ...             ...            ...         ...
60       A       SRR11597177       60       True           False              0  2020-01-30
61       A       SRR11597197       61       True           False              0  2020-01-30
62       B       SRR11597144       62       True           False              0  2020-02-01
63       B       SRR11597148       63       True           False              0  2020-02-01
64       B       SRR25229386       64       True           False              0  2020-02-01
```
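
The mutation dataframe can be inspected in the same way. A minimal
sketch; we make no assumptions here about its exact columns:

```python
import sc2ts
import tskit

ts = tskit.load("example_inference/ex1_2020-02-01_pp_mm.ts")

# Summary dataframe for the mutations in the ARG.
dfm = sc2ts.mutation_data(ts)
print(dfm.head())
```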

## Development

To run the unit tests, use:

```
python3 -m pytest
```

You may need to regenerate some cached test fixtures occasionally
(particularly if you are getting cryptic errors when running the test
suite). To do this, run

```
rm -fR tests/data/cache/
```

and rerun the tests as above.

### Debug utilities

The tree sequence files output during primary inference contain a lot of
debugging metadata, and there are some developer tools for inspecting
this in the `sc2ts.debug` package. In particular, the `ArgInfo` class
has a lot of useful utilities designed to be used in a Jupyter notebook.
Note that `matplotlib` is required for these. Use it like:

```python
import sc2ts.debug as sd
import tskit

# Load one of the daily ARGs produced by primary inference.
ts = tskit.load("path_to_daily_inference.ts")
ai = sd.ArgInfo(ts)
ai  # view the summary in a notebook
```