Commit 92f10d3

Update documentation
1 parent b5940b8 commit 92f10d3

4 files changed

+78
-36
lines changed

docs/source/configuration.rst

Lines changed: 1 addition & 1 deletion
@@ -48,7 +48,7 @@ The following nextflow specific ENV variables can be set::
4848
NXF_WORK <path> Corresponds to the -w flag
4949
NXF_OPTS <ARGS> Hand args over to the Java Virtual Machine.
5050
In case of a heap-space error, assign more space with the
51-
Arguments: "-Xms128g -Xmx128g" (allocates 128GB heap-space for the run)
51+
Arguments: "-Xms10g -Xmx20g" (allocates 10GB initial and up to 20GB maximum heap-space for the run)
5252

5353
.. _work:
5454

docs/source/examples.rst

Lines changed: 43 additions & 33 deletions
@@ -3,7 +3,7 @@
33
Examples
44
========
55

6-
This page provides examples for the three main ways to execute quicksand. The regular run,
6+
This page shows examples for the three main ways to execute quicksand. The regular run,
77
a run with fixed references and a rerun with fixed references within an existing run-folder.
88

99
Please see the :ref:`quickstart-page` section to download a test-dataset (:code:`split`)
@@ -25,9 +25,9 @@ Execute quicksand like this::
2525
--bedfiles refseq/genomes/ \
2626
--masked refseq/masked/
2727

28-
The output files are grouped by family-level in the :code:`out/` directory. Extracted family-sequences
29-
after the KrakenUniq run are stored in :code:`out/{family}/1-extracted/` while mapped, deduped and filtered sequences are saved to the
30-
:code:`out/{family}/best/{step}/` directory after the respective processing step::
28+
The output files are grouped by family-level in the :code:`out/` directory. Sequences binned by family-level
29+
(after KrakenUniq) are stored in :code:`out/{family}/1-extracted/` while mapped, deduped and bedfiltered sequences are saved in the
30+
:code:`out/{family}/best/{step}/` directories after the respective processing step::
3131

3232
quicksand_v2.3
3333
├── out
@@ -45,25 +45,32 @@ after the KrakenUniq run are stored in :code:`out/{family}/1-extracted/` while m
4545
└── final_report.tsv
4646

4747

48-
See the :code:`final_report.tsv` for a summary of the quicksand run
48+
See the :code:`final_report.tsv` for a summary of the quicksand run.
49+
50+
Filter the final_report
51+
~~~~~~~~~~~~~~~~~~~~~~~
52+
53+
The default quicksand-output (:code:`final_report.tsv`) is **unfiltered**, because the best
54+
filtering thresholds might differ between sites (and projects). However, we provide a filtered version of the report :code:`filtered_report_05p_05b.tsv`
55+
with the default filter-thresholds applied. The thresholds require a value of at least 0.5% in the :code:`FamPercentage` column
56+
and of at least 0.5 in the :code:`ProportionExpectedBreadth` column.
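
The two default thresholds can also be applied by hand. The following is a minimal sketch, not part of quicksand: it assumes the report is tab-separated, that :code:`FamPercentage` is given in percent and that the function name :code:`filter_report` and the three example rows are purely illustrative.

```python
import csv
import io


def filter_report(rows, min_percentage=0.5, min_breadth=0.5):
    """Keep only rows that pass both thresholds (hypothetical helper)."""
    kept = []
    for row in rows:
        try:
            percentage = float(row["FamPercentage"])
            breadth = float(row["ProportionExpectedBreadth"])
        except (KeyError, ValueError):
            continue  # skip rows with missing or non-numeric values
        if percentage >= min_percentage and breadth >= min_breadth:
            kept.append(row)
    return kept


# Made-up excerpt for illustration; a real report has more columns.
example = (
    "Family\tFamPercentage\tProportionExpectedBreadth\n"
    "Suidae\t12.5\t0.93\n"
    "Bovidae\t0.2\t0.81\n"
    "Hominidae\t3.1\t0.30\n"
)
rows = list(csv.DictReader(io.StringIO(example), delimiter="\t"))
passing = filter_report(rows)
print([row["Family"] for row in passing])  # prints ['Suidae']
```

Passing other values for :code:`min_percentage` and :code:`min_breadth` allows testing site-specific thresholds on the unfiltered report.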
4957

5058
Fixed references
5159
~~~~~~~~~~~~~~~~~
5260

53-
quicksand is designed to work with target-enriched DNA sequences and to account for
54-
expected families in the data. For families of interest
55-
provide an input-file with the :code:`--fixed` flag, which specifies the reference-genomes
56-
to use for the sequences assigned by KrakenUniq to the given family. Tags are used for the
57-
file-names and should be unique!::
61+
quicksand is designed to work with target-enriched data. To account for
62+
expected taxa in the sequences, users can provide a TSV-file with the :code:`--fixed` flag. This file specifies for each family the reference-genome(s)
63+
that quicksand uses for mapping sequences assigned by KrakenUniq to the given family.
64+
The tags are used in the same way as the species names (e.g. in the file-names) and should be unique!::
5865

5966
file: fixed-references.tsv
6067

6168
Taxon Tag Genome
62-
Hominidae Homo_sapiens /path/to/reference.fasta
63-
Hominidae Another_human /path/to/reference.fasta
69+
Hominidae Homo_sapiens /path/to/reference_1.fasta
70+
Hominidae Another_human /path/to/reference_2.fasta
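
Because the tags end up in file-names, it can help to check them for duplicates before starting a run. This is a small sketch under the assumption that the file is tab-separated with the header shown above; the helper :code:`duplicated_tags` is hypothetical and not part of quicksand.

```python
import csv
import io
from collections import Counter


def duplicated_tags(handle):
    """Return a sorted list of tags that occur more than once."""
    reader = csv.DictReader(handle, delimiter="\t")
    counts = Counter(row["Tag"] for row in reader)
    return sorted(tag for tag, n in counts.items() if n > 1)


# In-memory stand-in for fixed-references.tsv
example = (
    "Taxon\tTag\tGenome\n"
    "Hominidae\tHomo_sapiens\t/path/to/reference_1.fasta\n"
    "Hominidae\tAnother_human\t/path/to/reference_2.fasta\n"
)
print(duplicated_tags(io.StringIO(example)))  # prints []
```

For a file on disk, pass an open file handle instead of the :code:`io.StringIO` object.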
6471

6572

66-
and start the execution with::
73+
Run quicksand with::
6774

6875
nextflow run mpieva/quicksand -r v2.3 \
6976
-profile singularity \
@@ -73,9 +80,9 @@ and start the execution with::
7380
--bedfiles refseq/masked/ \
7481
--fixed fixed-references.tsv
7582

76-
The output file structure remains the same as before. For families specified in the :code:`fixed-references.tsv` file output-files
83+
The output file structure remains mostly the same. For families specified in the :code:`fixed-references.tsv` file output-files
7784
appear in the :code:`out/{family}/fixed/{step}/` directory, together with additional output-files
78-
that are useful in additional downstream-analyses, such as the extracted deaminated reads::
85+
that might be useful for additional downstream-analyses, such as the extracted deaminated reads::
7986

8087
quicksand_v2.3
8188
├── out
@@ -89,6 +96,8 @@ that are useful in additional downstream-analyses, such as the extracted deamina
8996
│ │ └── {RG}.{family}.{Tag}.bam
9097
│ ├── 3-deduped
9198
│ │ └── {RG}.{family}.{Tag}_deduped.bam
99+
│ ├── 4-bedfiltered #(only if --fixed_bedfiltering)
100+
│ │ └── {RG}.{family}.{Tag}_deduped_bedfiltered.bam
92101
│ ├── 5-deaminated
93102
│ │ ├── {RG}.{family}.{Tag}_deduped_deaminated_1term.bam
94103
│ │ └── {RG}.{family}.{Tag}_deduped_deaminated_3term.bam
@@ -103,22 +112,21 @@ Rerun
103112
~~~~~~
104113

105114
This mode is used to repeat a run with a different set of fixed references.
106-
Imagine beeing interested in the evolution of the Suidae family after having analyzed all samples with
107-
quicksand already.
108-
109-
And in the final report of the analysis some lines look like this::
115+
For example, the final report of the analysis contains a line like this::
110116

111117
Family Species Reference ReadsMapped ProportionMapped ReadsDeduped
112118
Suidae Sus_scrofa_taivanus best 1208 0.9028 1000
113119

114-
The assigned species was based on the KrakenUniq results and probably doesnt resemble the "real" species as
120+
The assigned ('best') species is based on the KrakenUniq results and might not reflect the "real" species, as
115121
RefSeq contains only a limited number of reference genomes. For any analyses that go beyond the family level, a
116-
reanalysis with a suitable reference genome is required.
122+
reanalysis with a suitable reference genome might be required.
117123

118-
After collecting the reference genome(s) for the Suidae family, prepare a fresh fixed-references file::
124+
After collecting additional reference genomes for the Suidae family, prepare a fresh fixed-references file::
119125

120126
Taxon Tag Genome
121127
Suidae super_cool_pig /path/to/reference.fasta
128+
Suidae super_cool_pig2 /path/to/reference2.fasta
129+
Suidae super_cool_pig3 /path/to/reference3.fasta
122130

123131
and rerun the pipeline with::
124132

@@ -127,25 +135,25 @@ and rerun the pipeline with::
127135
--rerun \
128136
--fixed fixed-references.tsv
129137

130-
The (additional) output files are the ones created by the :code:`--fixed` flag::
138+
The (additional) output files are then the ones created by the :code:`--fixed` flag::
131139

132140
quicksand_v2.3
133141
├── out
134-
│ └── {family}
142+
│ └── Suidae
135143
│ ├── 1-extracted
136-
│ │ └── {RG}_extractedReads-{family}.bam
137-
│ └── fixed // (family in fixed)
144+
│ │ └── {RG}_extractedReads-Suidae.bam
145+
│ └── fixed
138146
│ ├── 2-aligned
139-
│ │ └── {RG}.{family}.{Tag}.bam
147+
│ │ └── {RG}.Suidae.{Tag}.bam
140148
│ ├── 3-deduped
141-
│ │ └── {RG}.{family}.{Tag}_deduped.bam
149+
│ │ └── {RG}.Suidae.{Tag}_deduped.bam
142150
│ ├── 5-deaminated
143-
│ │ ├── {RG}.{family}.{Tag}_deduped_deaminated_1term.bam
144-
│ │ └── {RG}.{family}.{Tag}_deduped_deaminated_3term.bam
151+
│ │ ├── {RG}.Suidae.{Tag}_deduped_deaminated_1term.bam
152+
│ │ └── {RG}.Suidae.{Tag}_deduped_deaminated_3term.bam
145153
│ └── 6-mpileups
146-
│ ├── {RG}.{family}.{Tag}_term1_mpiled.tsv
147-
│ ├── {RG}.{family}.{Tag}_term3_mpiled.tsv
148-
│ └── {RG}.{family}.{Tag}_all_mpiled.tsv
154+
│ ├── {RG}.Suidae.{Tag}_term1_mpiled.tsv
155+
│ ├── {RG}.Suidae.{Tag}_term3_mpiled.tsv
156+
│ └── {RG}.Suidae.{Tag}_all_mpiled.tsv
149157
...
150158
└── final_report.tsv
151159

@@ -155,5 +163,7 @@ The report contains now additional lines for the Suidae family with the 'fixed'
155163
Family Species Reference ReadsMapped ProportionMapped ReadsDeduped
156164
Suidae Sus_scrofa_taivanus best 1208 0.9028 1000
157165
Suidae super_cool_pig fixed 1052 0.8024 976
166+
Suidae super_cool_pig2 fixed 1000 0.9001 800
167+
Suidae super_cool_pig3 fixed 860 0.7551 550
158168

159169
The final report contains a mix of best (old run) and fixed (rerun) reference entries.

docs/source/filters.rst

Lines changed: 30 additions & 1 deletion
@@ -4,12 +4,41 @@ Filters
44
========
55

66
The output of quicksand is not filtered and therefore contains false-positive family-assignments. For the analysis of the quicksand-output, we recommend
7-
applying tow sets of filters.
7+
applying two sets of filters.
88

99
Percentage based filter
1010
~~~~~~~~~~~~~~~~~~~~~~~~
1111

12+
Filters based on a minimum percentage of sequences assigned per biological family have been shown to be effective in removing misidentified sequences.
13+
In simulated data, we compared the relative contribution of correctly and incorrectly identified families to the total number of mapped and
14+
deduplicated sequences. We found that false-positive families are supported by a low percentage of sequences (median below 0.1%),
15+
but are generally more abundant in larger and more damaged datasets. We therefore recommend a percentage threshold of at least 0.5% of the total sequences.
16+
17+
The percentage of mapped and deduplicated sequences per family is listed in the :code:`FamPercentage` column of the :code:`final_report.tsv` file.
18+
1219

1320
Breadth of coverage based filter
1421
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
1522

23+
As a second filter, we calculate the evenness of coverage of mapped sequences along the reference genome for each family-level assignment.
24+
High evenness of coverage indicates that sequences are randomly distributed across the reference genome, which is expected when the source of the
25+
mapped DNA is closely related to the reference, such as from the same or a closely related species.
26+
In contrast, sequences from more diverged sources may only map to conserved regions, resulting in clustered alignments and low evenness of coverage.
27+
28+
The parameter reported in the quicksand final summary report for evaluating coverage evenness is the 'proportion of expected breadth'.
29+
Breadth of coverage is defined as the proportion of the reference genome covered by at least one sequence, while genomic coverage (or depth of coverage)
30+
is defined as the average number of times each base in the reference genome is covered by mapped sequences.
31+
32+
Under the assumption of random mapping to the correct reference genome,
33+
the breadth of coverage is a function of the genomic coverage and can be calculated using the formula empirically determined by Olm et al. 2021 [1]_.
34+
35+
(1) breadth of coverage = 1 - e^(-0.883 * coverage)
36+
37+
We refer to the calculated breadth of coverage as the expected breadth of coverage, as it assumes mapping to the correct reference genome.
38+
To evaluate deviations from this expectation, we calculate for each family the proportion of expected breadth (PEB),
39+
defined as the ratio of the observed to the expected breadth of coverage.
40+
For correct family assignments, the observed breadth of coverage matches the expectations (PEB around 1),
41+
while the false-positive families show PEB values between 1 and 0.2.
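
As a worked illustration of formula (1) and the ratio described above, the following sketch computes the expected breadth and the proportion of expected breadth. The function names are assumptions made for this example and do not refer to quicksand code.

```python
import math


def expected_breadth(coverage):
    """Expected breadth of coverage for a given genomic coverage, formula (1)."""
    return 1.0 - math.exp(-0.883 * coverage)


def proportion_expected_breadth(observed_breadth, coverage):
    """Ratio of the observed to the expected breadth of coverage."""
    if coverage <= 0:
        raise ValueError("coverage must be greater than zero")
    return observed_breadth / expected_breadth(coverage)


# Illustrative values: half of the reference covered at 1x average depth
coverage = 1.0
observed = 0.5
print(round(expected_breadth(coverage), 3))
print(round(proportion_expected_breadth(observed, coverage), 3))
```

An observed breadth equal to the expected breadth gives a value of 1, while sequences clustered in a few regions give an observed breadth below the expectation and therefore a lower value.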
42+
43+
44+
.. [1] Olm, M.R., Crits-Christoph, A., Bouma-Gregson, K. et al. inStrain profiles population microdiversity from metagenomic data and sensitively detects shared microbial strains. Nat Biotechnol **39**, 727–736 (2021). https://doi.org/10.1038/s41587-020-00797-0

docs/source/in_and_out.rst

Lines changed: 4 additions & 1 deletion
@@ -44,6 +44,8 @@ layed out as follows::
4444
│ │ └── {RG}.{family}.{species}.bam
4545
│ ├── 3-deduped
4646
│ │ └── {RG}.{family}.{species}_deduped.bam
47+
│ ├── 4-bedfiltered #(if --fixed_bedfiltering)
48+
│ │ └── {RG}.{family}.{species}_deduped_bedfiltered.bam
4749
│ ├── 5-deaminated
4850
│ │ ├── {RG}.{family}.{species}_deduped_deaminated_1term.bam
4951
│ │ └── {RG}.{family}.{species}_deduped_deaminated_3term.bam
@@ -66,7 +68,8 @@ layed out as follows::
6668
├── work
6769
│ └── ...
6870
├── cc_estimates.tsv
69-
├── filtered_report_{N}p_{N}b.tsv
71+
├── filtered_report_{N}p_{N}b.tsv # final_report.tsv with applied filters
72+
├── R_final_report.tsv # final_report.tsv, but with R-friendly headers
7073
└── final_report.tsv
7174

7275
