docs/source/examples.rst
Lines changed: 43 additions & 33 deletions
@@ -3,7 +3,7 @@
3
3
Examples
4
4
========
5
5
6
-
This page provides examples for the three main ways to execute quicksand. The regular run,
6
+
This page shows examples for the three main ways to execute quicksand. The regular run,
7
7
a run with fixed references and a rerun with fixed references within an existing run-folder.
8
8
9
9
Please see the :ref:`quickstart-page` section to download a test-dataset (:code:`split`)
@@ -25,9 +25,9 @@ Execute quicksand like this::
25
25
--bedfiles refseq/genomes/ \
26
26
--masked refseq/masked/
27
27
28
-
The output files are grouped by family-level in the :code:`out/` directory. Extracted family-sequences
29
-
after the KrakenUniq run are stored in :code:`out/{family}/1-extracted/` while mapped, deduped and filtered sequences are saved to the
30
-
:code:`out/{family}/best/{step}/` directory after the respective processing step::
28
+
The output files are grouped by family-level in the :code:`out/` directory. Sequences binned by family-level
29
+
(after KrakenUniq) are stored in :code:`out/{family}/1-extracted/` while mapped, deduped and bedfiltered sequences are saved in the
30
+
:code:`out/{family}/best/{step}/` directories after the respective processing step::
31
31
32
32
quicksand_v2.3
33
33
├── out
@@ -45,25 +45,32 @@ after the KrakenUniq run are stored in :code:`out/{family}/1-extracted/` while m
45
45
└── final_report.tsv
46
46
47
47
48
-
See the :code:`final_report.tsv` for a summary of the quicksand run
48
+
See the :code:`final_report.tsv` for a summary of the quicksand run.
49
+
50
+
Filter the final_report
51
+
~~~~~~~~~~~~~~~~~~~~~~~
52
+
53
+
The default quicksand-output (:code:`final_report.tsv`) is **unfiltered**, because the best
54
+
filtering thresholds might differ between sites (and projects). However, we provide a filtered version of the report, :code:`filtered_report_05p_05b.tsv`,
55
+
with the default filter-thresholds applied. These thresholds apply to the :code:`FamPercentage` column (>=0.5%)
56
+
and the :code:`ProportionExpectedBreadth` column (>=0.5).
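
If other thresholds suit a project better, the unfiltered report can be filtered by hand. The following is a minimal sketch using only the Python standard library, not part of quicksand itself. Only the :code:`FamPercentage` and :code:`ProportionExpectedBreadth` column names are taken from this page; the :code:`Family` column, the example rows and the function name :code:`filter_report` are assumptions made for illustration.

```python
import csv
import io

# Hypothetical excerpt of a final_report.tsv. Only the two filter columns are
# named in the documentation; the 'Family' column and all values are made up.
EXAMPLE_REPORT = (
    "Family\tFamPercentage\tProportionExpectedBreadth\n"
    "Hominidae\t45.2\t0.97\n"
    "Bovidae\t0.3\t0.85\n"
    "Felidae\t2.1\t0.12\n"
)


def filter_report(handle, min_percentage=0.5, min_breadth=0.5):
    """Return the rows of a tab-separated report that pass both thresholds."""
    reader = csv.DictReader(handle, delimiter="\t")
    kept = []
    for row in reader:
        try:
            percentage = float(row["FamPercentage"])
            breadth = float(row["ProportionExpectedBreadth"])
        except (TypeError, ValueError):
            continue  # skip rows with missing or non-numeric values
        if percentage >= min_percentage and breadth >= min_breadth:
            kept.append(row)
    return kept


passing = filter_report(io.StringIO(EXAMPLE_REPORT))
print([row["Family"] for row in passing])
```

To filter a real report, pass an open file handle (for example :code:`open("final_report.tsv", newline="")`) instead of the in-memory example, and adjust the two thresholds as needed.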
49
57
50
58
Fixed references
51
59
~~~~~~~~~~~~~~~~~
52
60
53
-
quicksand is designed to work with target-enriched DNA sequences and to account for
54
-
expected families in the data. For families of interest
55
-
provide an input-file with the :code:`--fixed` flag, which specifies the reference-genomes
56
-
to use for the sequences assigned by KrakenUniq to the given family. Tags are used for the
57
-
file-names and should be unique!::
61
+
quicksand is designed to work with target-enriched data. To account for
62
+
expected taxa in the sequences, users can provide a TSV-file with the :code:`--fixed` flag. This file specifies for each family the reference-genome(s)
63
+
that quicksand uses for mapping sequences assigned by KrakenUniq to the given family.
64
+
The 'Tags' are used in the same way as the 'Species' (e.g. in the file-names) and should be unique!::
docs/source/filters.rst
Lines changed: 30 additions & 1 deletion
@@ -4,12 +4,41 @@ Filters
4
4
========
5
5
6
6
The output of quicksand is not filtered and therefore contains false-positive family-assignments. For the analysis of the quicksand-output, we recommend
7
-
applying tow sets of filters.
7
+
applying two sets of filters.
8
8
9
9
Percentage based filter
10
10
~~~~~~~~~~~~~~~~~~~~~~~~
11
11
12
+
Filters based on a minimum percentage of sequences assigned per biological family have been shown to be effective in removing misidentified sequences.
13
+
In simulated data, we compared the relative contribution of correctly and incorrectly identified families to the total number of mapped and
14
+
deduplicated sequences. We found that false-positive families are supported by a low percentage of sequences (median below 0.1%),
15
+
but are generally more abundant in larger and more damaged datasets. We therefore recommend a percentage threshold of at least 0.5% of the total sequences.
16
+
17
+
The percentage of mapped and deduplicated sequences per family is listed in the :code:`FamPercentage` column of the :code:`final_report.tsv` file.
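
The idea behind the percentage filter can be sketched in a few lines of Python. This is an illustration only: it assumes that the percentage is each family's share of all mapped and deduplicated sequences in a sample, and the family names, counts and the function name :code:`family_percentages` are hypothetical. quicksand's own calculation may differ in detail.

```python
def family_percentages(counts):
    """Convert per-family sequence counts into percentages of the total."""
    total = sum(counts.values())
    if total == 0:
        # No sequences at all: every family contributes 0%.
        return {family: 0.0 for family in counts}
    return {family: 100.0 * n / total for family, n in counts.items()}


# Hypothetical counts of mapped and deduplicated sequences per family.
counts = {"Hominidae": 950, "Bovidae": 48, "Felidae": 2}
percentages = family_percentages(counts)

# Keep families supported by at least 0.5% of the sequences.
passing = sorted(f for f, p in percentages.items() if p >= 0.5)
print(passing)
```

In this made-up example, the family with 2 of 1000 sequences (0.2%) falls below the recommended threshold and is dropped.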
18
+
12
19
13
20
Breadth of coverage based filter
14
21
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
15
22
23
+
As a second filter, we calculate the evenness of coverage of mapped sequences along the reference genome for each family-level assignment.
24
+
High evenness of coverage indicates that sequences are randomly distributed across the reference genome, which is expected when the source of the
25
+
mapped DNA is closely related to the reference, such as from the same or a closely related species.
26
+
In contrast, sequences from more diverged sources may only map to conserved regions, resulting in clustered alignments and low evenness of coverage.
27
+
28
+
The parameter reported in the quicksand final summary report for evaluating coverage evenness is the 'proportion of expected breadth'.
29
+
Breadth of coverage is defined as the proportion of the reference genome covered by at least one sequence, while genomic coverage (or depth of coverage)
30
+
defines the average number of times each base in the reference genome is covered by mapped sequences.
31
+
32
+
Under the assumption of random mapping to the correct reference genome,
33
+
the breadth of coverage is a function of the genomic coverage and can be calculated using the formula empirically determined by Olm et al. 2021 [1]_.
34
+
35
+
.. math::

   \text{breadth of coverage} = 1 - e^{-0.883 \times \text{coverage}} \qquad (1)
36
+
37
+
We refer to the calculated breadth of coverage as the expected breadth of coverage, as it assumes mapping to the correct reference genome.
38
+
To evaluate deviations from this expectation, we calculate for each family the proportion of expected breadth (PEB),
39
+
defined as the ratio of the observed to the expected breadth of coverage.
40
+
For correct family assignments, the observed breadth of coverage matches the expectation (PEB around 1),
41
+
while the false-positive families show PEB values between 1 and 0.2.
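
The calculation described above can be sketched as follows. This is a minimal illustration of formula (1) and of the ratio of observed to expected breadth, not quicksand's actual implementation; the function names :code:`expected_breadth` and :code:`proportion_expected_breadth` and the example numbers are assumptions.

```python
import math


def expected_breadth(coverage):
    """Expected breadth of coverage for a given depth of coverage, formula (1)."""
    return 1.0 - math.exp(-0.883 * coverage)


def proportion_expected_breadth(observed_breadth, coverage):
    """Ratio of observed to expected breadth of coverage (PEB)."""
    expected = expected_breadth(coverage)
    if expected == 0:
        return float("nan")  # undefined when nothing is expected to be covered
    return observed_breadth / expected


# Hypothetical family: 1x depth of coverage, 20% of the reference covered.
peb = proportion_expected_breadth(observed_breadth=0.20, coverage=1.0)
print(f"expected breadth: {expected_breadth(1.0):.3f}, PEB: {peb:.3f}")
```

A family whose sequences cluster in a few conserved regions covers less of the reference than its depth of coverage predicts, so its PEB drops below 1.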
42
+
43
+
44
+
.. [1] Olm, M.R., Crits-Christoph, A., Bouma-Gregson, K. et al. inStrain profiles population microdiversity from metagenomic data and sensitively detects shared microbial strains. Nat Biotechnol **39**, 727–736 (2021). https://doi.org/10.1038/s41587-020-00797-0