Update documentation

merszym · merszym · commit 92f10d3d4bd8 · 2025-05-13T14:46:37.000+02:00
diff --git a/docs/source/configuration.rst b/docs/source/configuration.rst
@@ -48,7 +48,7 @@ The following nextflow specific ENV variables can be set::
     NXF_WORK                 <path> Corresponds to the -w flag
     NXF_OPTS                 <ARGS> Hand args over to the Java Virtual Machine.
                              In case of a heap-space error, assign more space with the
-                             Arguments: "-Xms128g -Xmx128g" (allocates 128GB heap-space for the run)
+                             Arguments: "-Xms10g -Xmx20g" (allocates 10GB initial and up to 20GB maximum heap-space for the run)
 
 .. _work:
 
diff --git a/docs/source/examples.rst b/docs/source/examples.rst
@@ -3,7 +3,7 @@
 Examples
 ========
 
-This page provides examples for the three main ways to execute quicksand. The regular run,
+This page shows examples for the three main ways to execute quicksand. The regular run,
 a run with fixed references and a rerun with fixed references within an existing run-folder.
 
 Please see the :ref:`quickstart-page` section to download a test-dataset (:code:`split`)
@@ -25,9 +25,9 @@ Execute quicksand like this::
         --bedfiles refseq/genomes/ \
         --masked   refseq/masked/
 
-The output files are grouped by family-level in the :code:`out/` directory. Extracted family-sequences
-after the KrakenUniq run are stored in :code:`out/{family}/1-extracted/` while mapped, deduped and filtered sequences are saved to the
-:code:`out/{family}/best/{step}/` directory after the respective processing step::
+The output files are grouped by family-level in the :code:`out/` directory. Sequences binned by family-level
+(after KrakenUniq) are stored in :code:`out/{family}/1-extracted/` while mapped, deduped and bedfiltered sequences are saved in the
+:code:`out/{family}/best/{step}/` directories after the respective processing step::
 
     quicksand_v2.3
     ├── out
@@ -45,25 +45,32 @@ after the KrakenUniq run are stored in :code:`out/{family}/1-extracted/` while m
     └── final_report.tsv
 
 
-See the :code:`final_report.tsv` for a summary of the quicksand run
+See the :code:`final_report.tsv` for a summary of the quicksand run. 
+
+Filter the final_report
+~~~~~~~~~~~~~~~~~~~~~~~
+
+The default quicksand-output (:code:`final_report.tsv`) is **unfiltered**, because the best
+filtering thresholds might differ between sites (and projects). However, we provide a filtered version of the report, :code:`filtered_report_05p_05b.tsv`,
+with the default filter-thresholds applied: at least 0.5% in the :code:`FamPercentage` column
+and at least 0.5 in the :code:`ProportionExpectedBreadth` column.
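The default thresholds described above can also be applied by hand, for example to try different cut-offs. The following is a minimal sketch, not part of quicksand: the function name ``filter_report`` and the example rows are hypothetical, and it assumes that :code:`FamPercentage` is stored in percent and :code:`ProportionExpectedBreadth` as a ratio.

```python
import csv
import io

# Thresholds taken from the text above; adjust per site or project.
MIN_FAM_PERCENTAGE = 0.5    # FamPercentage, assumed to be given in percent
MIN_EXPECTED_BREADTH = 0.5  # ProportionExpectedBreadth, a ratio


def filter_report(handle, min_percentage=MIN_FAM_PERCENTAGE,
                  min_breadth=MIN_EXPECTED_BREADTH):
    """Return the rows of a tab-separated report that pass both thresholds."""
    reader = csv.DictReader(handle, delimiter="\t")
    kept = []
    for row in reader:
        try:
            percentage = float(row["FamPercentage"])
            breadth = float(row["ProportionExpectedBreadth"])
        except (KeyError, TypeError, ValueError):
            continue  # skip rows with missing or non-numeric values
        if percentage >= min_percentage and breadth >= min_breadth:
            kept.append(row)
    return kept


# Hypothetical, heavily shortened report used only for illustration.
example_report = (
    "Family\tFamPercentage\tProportionExpectedBreadth\n"
    "Hominidae\t85.2\t0.97\n"
    "Suidae\t0.3\t0.91\n"
    "Bovidae\t4.1\t0.22\n"
)

passing = filter_report(io.StringIO(example_report))
print([row["Family"] for row in passing])
```

To filter a real report, pass an open file handle instead of the ``io.StringIO`` object, e.g. ``open("final_report.tsv", newline="")``.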
 
 Fixed references
 ~~~~~~~~~~~~~~~~~
 
-quicksand is designed to work with target-enriched DNA sequences and to account for
-expected families in the data. For families of interest
-provide an input-file with the :code:`--fixed` flag, which specifies the reference-genomes
-to use for the sequences assigned by KrakenUniq to the given family. Tags are used for the
-file-names and should be unique!::
+quicksand is designed to work with target-enriched data. To account for
+expected taxa in the sequences, users can provide a TSV-file with the :code:`--fixed` flag. This file specifies for each family the reference-genome(s) 
+that quicksand uses for mapping sequences assigned by KrakenUniq to the given family. 
+The 'Tags' are used in the same way as the 'Species' (e.g. in the file-names) and should be unique!::
 
     file: fixed-references.tsv
 
     Taxon       Tag             Genome
-    Hominidae   Homo_sapiens    /path/to/reference.fasta
-    Hominidae   Another_human   /path/to/reference.fasta
+    Hominidae   Homo_sapiens    /path/to/reference_1.fasta
+    Hominidae   Another_human   /path/to/reference_2.fasta
 
 
-and start the execution with::
+Run quicksand with::
 
     nextflow run mpieva/quicksand -r v2.3 \
         -profile   singularity \
@@ -73,9 +80,9 @@ and start the execution with::
         --bedfiles refseq/masked/ \
         --fixed    fixed-references.tsv
 
-The output file structure remains the same as before. For families specified in the :code:`fixed-references.tsv` file output-files
+The output file structure remains mostly the same. For families specified in the :code:`fixed-references.tsv` file output-files
 appear in the :code:`out/{family}/fixed/{step}/` directory, together with additional output-files
-that are useful in additional downstream-analyses, such as the extracted deaminated reads::
+that might be useful for additional downstream-analyses, such as the extracted deaminated reads::
 
     quicksand_v2.3
     ├── out
@@ -89,6 +96,8 @@ that are useful in additional downstream-analyses, such as the extracted deamina
     │              │     └── {RG}.{family}.{Tag}.bam
     │              ├── 3-deduped
     │              │     └── {RG}.{family}.{Tag}_deduped.bam
+    │              ├── 4-bedfiltered #(only if --fixed_bedfiltering)
+    │              │     └── {RG}.{family}.{Tag}_deduped_bedfiltered.bam
     │              ├── 5-deaminated
     │              │     ├── {RG}.{family}.{Tag}_deduped_deaminated_1term.bam
     │              │     └── {RG}.{family}.{Tag}_deduped_deaminated_3term.bam
@@ -103,22 +112,21 @@ Rerun
 ~~~~~~
 
 This mode is used to repeat a run with a different set of fixed references.
-Imagine beeing interested in the evolution of the Suidae family after having analyzed all samples with
-quicksand already.
-
-And in the final report of the analysis some lines look like this::
+For example, the final report of the analysis looks like this::
 
     Family    Species                   Reference     ReadsMapped    ProportionMapped    ReadsDeduped
     Suidae    Sus_scrofa_taivanus       best          1208           0.9028              1000
 
-The assigned species was based on the KrakenUniq results and probably doesnt resemble the "real" species as
+The assigned ('best') species is based on the KrakenUniq results and might not reflect the "real" species, as
 RefSeq contains only limited amounts of reference genomes. For any analyses that go beyond the family level, a
-reanalysis with a suitable reference genome is required.
+reanalysis with a suitable reference genome might be required.
 
-After collecting the reference genome(s) for the Suidae family, prepare a fresh fixed-references file::
+So after collecting more reference genome(s) for the Suidae family, prepare a fresh fixed-references file::
 
     Taxon       Tag                 Genome
     Suidae      super_cool_pig      /path/to/reference.fasta
+    Suidae      super_cool_pig2     /path/to/reference2.fasta
+    Suidae      super_cool_pig3     /path/to/reference3.fasta
 
 and rerun the pipeline with::
 
@@ -127,25 +135,25 @@ and rerun the pipeline with::
         --rerun    \
         --fixed    fixed-references.tsv
 
-The (additional) output files are the ones created by the :code:`--fixed` flag::
+The (additional) output files are then the ones created by the :code:`--fixed` flag::
 
     quicksand_v2.3
     ├── out
-    │    └── {family}
+    │    └── Suidae
     │         ├── 1-extracted
-    │         │    └── {RG}_extractedReads-{family}.bam
-    │         └── fixed // (family in fixed)
+    │         │    └── {RG}_extractedReads-Suidae.bam
+    │         └── fixed
     │              ├── 2-aligned
-    │              │     └── {RG}.{family}.{Tag}.bam
+    │              │     └── {RG}.Suidae.{Tag}.bam
     │              ├── 3-deduped
-    │              │     └── {RG}.{family}.{Tag}_deduped.bam
+    │              │     └── {RG}.Suidae.{Tag}_deduped.bam
     │              ├── 5-deaminated
-    │              │     ├── {RG}.{family}.{Tag}_deduped_deaminated_1term.bam
-    │              │     └── {RG}.{family}.{Tag}_deduped_deaminated_3term.bam
+    │              │     ├── {RG}.Suidae.{Tag}_deduped_deaminated_1term.bam
+    │              │     └── {RG}.Suidae.{Tag}_deduped_deaminated_3term.bam
     │              └── 6-mpileups
-    │                    ├── {RG}.{family}.{Tag}_term1_mpiled.tsv
-    │                    ├── {RG}.{family}.{Tag}_term3_mpiled.tsv
-    │                    └── {RG}.{family}.{Tag}_all_mpiled.tsv
+    │                    ├── {RG}.Suidae.{Tag}_term1_mpiled.tsv
+    │                    ├── {RG}.Suidae.{Tag}_term3_mpiled.tsv
+    │                    └── {RG}.Suidae.{Tag}_all_mpiled.tsv
     ...
     └── final_report.tsv
 
@@ -155,5 +163,7 @@ The report contains now additional lines for the Suidae family with the 'fixed'
     Family    Species                   Reference     ReadsMapped    ProportionMapped    ReadsDeduped
     Suidae    Sus_scrofa_taivanus       best          1208           0.9028              1000
     Suidae    super_cool_pig            fixed         1052           0.8024              976
+    Suidae    super_cool_pig2           fixed         1000           0.9001              800
+    Suidae    super_cool_pig3           fixed         860            0.7551              550
 
 The final report contains a mix of best (old run) and fixed (rerun) reference entries.
diff --git a/docs/source/filters.rst b/docs/source/filters.rst
@@ -4,12 +4,41 @@ Filters
 ========
 
 The output of quicksand is not filtered and therefore contains false-positive family-assignments. For the analysis of the quicksand-output, we recommend
-applying tow sets of filters.
+applying two sets of filters.
 
 Percentage based filter
 ~~~~~~~~~~~~~~~~~~~~~~~~
 
+Filters based on a minimum percentage of sequences assigned per biological family have been shown to be effective in removing misidentified sequences.
+In simulated data, we compared the relative contribution of correctly and incorrectly identified families to the total number of mapped and 
+deduplicated sequences. We found that false-positive families are supported by a low percentage of sequences (median below 0.1%), 
+but are generally more abundant in larger and more damaged datasets. We therefore recommend a percentage threshold of at least 0.5% of the total sequences.
+
+The percentage of mapped and deduplicated sequences per family is listed in the :code:`FamPercentage` column of the :code:`final_report.tsv` file.
+
 
 Breadth of coverage based filter
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 
+As a second filter, we calculate the evenness of coverage of mapped sequences along the reference genome for each family-level assignment. 
+High evenness of coverage indicates that sequences are randomly distributed across the reference genome, which is expected when the source of the 
+mapped DNA is closely related to the reference, such as from the same or a closely related species. 
+In contrast, sequences from more diverged sources may only map to conserved regions, resulting in clustered alignments and low evenness of coverage. 
+
+The parameter reported in the quicksand final summary report for evaluating coverage evenness is the 'proportion of expected breadth'. 
+Breadth of coverage is defined as the proportion of the reference genome covered by at least one sequence, while genomic coverage (or depth of coverage)
+is the average number of times each base in the reference genome is covered by mapped sequences.
+
+Under the assumption of random mapping to the correct reference genome, 
+the breadth of coverage is a function of the genomic coverage and can be calculated using the formula empirically determined by Olm et al. 2021 [1]_.
+
+(1) :math:`\text{breadth of coverage} = 1 - e^{-0.883 \cdot \text{coverage}}`
+
+We refer to the calculated breadth of coverage as the expected breadth of coverage, as it assumes mapping to the correct reference genome. 
+To evaluate deviations from this expectation, we calculated for each family the proportion of expected breadth (PEB),
+defined as the ratio of the observed to the expected breadth of coverage.
+For correct family assignments, the observed breadth of coverage matches the expectation (PEB around 1),
+while false-positive families show PEB values between 0.2 and 1.
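The calculation described above can be sketched in a few lines. This is an illustration of formula (1) and of the ratio of observed to expected breadth, not the code used by quicksand; the function names and the example values are hypothetical.

```python
import math


def expected_breadth(coverage):
    """Expected breadth of coverage for a given mean depth, following formula (1)."""
    return 1.0 - math.exp(-0.883 * coverage)


def proportion_expected_breadth(observed_breadth, coverage):
    """Ratio of the observed to the expected breadth of coverage (PEB)."""
    expected = expected_breadth(coverage)
    if expected <= 0:
        raise ValueError("coverage must be greater than zero")
    return observed_breadth / expected


# Hypothetical example: 1-fold mean coverage, 30% of the reference covered.
coverage = 1.0
observed = 0.30
print(expected_breadth(coverage))
print(proportion_expected_breadth(observed, coverage))
```

With these example values the observed breadth is only about half of the expected breadth, which is the kind of deviation the filter is meant to detect.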
+
+
+.. [1] Olm, M.R., Crits-Christoph, A., Bouma-Gregson, K. et al. inStrain profiles population microdiversity from metagenomic data and sensitively detects shared microbial strains. Nat Biotechnol **39**, 727–736 (2021). https://doi.org/10.1038/s41587-020-00797-0
diff --git a/docs/source/in_and_out.rst b/docs/source/in_and_out.rst
@@ -44,6 +44,8 @@ layed out as follows::
     │              │     └── {RG}.{family}.{species}.bam
     │              ├── 3-deduped
     │              │     └── {RG}.{family}.{species}_deduped.bam
+    │              ├── 4-bedfiltered #(only if --fixed_bedfiltering)
+    │              │     └── {RG}.{family}.{species}_deduped_bedfiltered.bam
     │              ├── 5-deaminated
     │              │     ├── {RG}.{family}.{species}_deduped_deaminated_1term.bam
     │              │     └── {RG}.{family}.{species}_deduped_deaminated_3term.bam
@@ -66,7 +68,8 @@ layed out as follows::
     ├── work
     │    └── ...
     ├── cc_estimates.tsv
-    ├── filtered_report_{N}p_{N}b.tsv
+    ├── filtered_report_{N}p_{N}b.tsv # final_report.tsv with applied filters
+    ├── R_final_report.tsv # final_report.tsv, but with R-friendly headers
     └── final_report.tsv