diff --git a/.gitmodules b/.gitmodules
new file mode 100644
index 00000000..6ee38518
--- /dev/null
+++ b/.gitmodules
@@ -0,0 +1,3 @@
+[submodule "Amplicon/Illumina/Workflow_Documentation/NF_AmpIllumina"]
+ path = Amplicon/Illumina/Workflow_Documentation/NF_AmpIllumina
+ url = https://github.com/nasa/GeneLab_AmpliconSeq_Workflow
diff --git a/3rd_Party_Licenses/Amplicon_and_Metagenomics_3rd_Party_Software.md b/3rd_Party_Licenses/Amplicon_and_Metagenomics_3rd_Party_Software.md
deleted file mode 100644
index c38136d9..00000000
--- a/3rd_Party_Licenses/Amplicon_and_Metagenomics_3rd_Party_Software.md
+++ /dev/null
@@ -1,37 +0,0 @@
-## The NASA GeneLab [Amplicon](../Amplicon) and [Metagenomics](../Metagenomics) Processing Pipelines software also makes use of the following 3rd party Open Source software:
-
-|3rd Party Software Name|License|License URL|Copyright Notice|
-|:----------------------|:------|:----------|:----------------------|
-|MetaPhlAn3|[The MIT License (MIT) Copyright (c) 2015, Duy Tin Truong, Nicola Segata and Curtis Huttenhower](Amplicon_and_Metagenomics_3rd_Party_Software_Licenses/MetaPhlAn3_license.pdf)|[https://github.com/biobakery/MetaPhlAn/blob/3.0/license.txt](https://github.com/biobakery/MetaPhlAn/blob/3.0/license.txt)|Copyright (c) 2015, Duy Tin Truong, Nicola Segata and Curtis Huttenhower|
-|FastQC|[GNU GENERAL PUBLIC LICENSE Version 2, June 1991](Amplicon_and_Metagenomics_3rd_Party_Software_Licenses/FASTQC_LICENSE.pdf)|[https://github.com/s-andrews/FastQC/blob/master/LICENSE](https://github.com/s-andrews/FastQC/blob/master/LICENSE)|Copyright (C) 1989, 1991 Free Software Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA Everyone is permitted to copy and distribute verbatim copies of this license document, but changing it is not allowed.|
-|MultiQC|[GNU GENERAL PUBLIC LICENSE Version 3, 29 June 2007](Amplicon_and_Metagenomics_3rd_Party_Software_Licenses/MultiQC_LICENSE.pdf)|[https://github.com/ewels/MultiQC/blob/master/LICENSE](https://github.com/ewels/MultiQC/blob/master/LICENSE)|Copyright (C) 2007 Free Software Foundation, Inc Everyone is permitted to copy and distribute verbatim copies of this license document, but changing it is not allowed.|
-|Cutadapt|[Copyright (c) 2010-2020 Marcel Martin](Amplicon_and_Metagenomics_3rd_Party_Software_Licenses/cutadapt_LICENSE.pdf)|[https://github.com/marcelm/cutadapt/blob/main/LICENSE](https://github.com/marcelm/cutadapt/blob/main/LICENSE)|Copyright (c) 2010-2020 Marcel Martin Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:|
-|DADA2|[GNU LESSER GENERAL PUBLIC LICENSE Version 3, 29 June 2007](Amplicon_and_Metagenomics_3rd_Party_Software_Licenses/dada2_LICENSE.pdf)|[https://www.bioconductor.org/packages/release/bioc/licenses/dada2/LICENSE](https://www.bioconductor.org/packages/release/bioc/licenses/dada2/LICENSE)|Copyright (C) 2007 Free Software Foundation, Inc. Everyone is permitted to copy and distribute verbatim copies of this license document, but changing it is not allowed.|
-|DECIPHER|[GNU GENERAL PUBLIC LICENSE Version 3, 29 June](Amplicon_and_Metagenomics_3rd_Party_Software_Licenses/DECIPHER_gpl-3.0.pdf)|[https://www.gnu.org/licenses/gpl-3.0.en.html](https://www.gnu.org/licenses/gpl-3.0.en.html)|Copyright © 2007 Free Software Foundation, Inc. Everyone is permitted to copy and distribute verbatim copies of this license document, but changing it is not allowed.|
-|biomformat|[Modified BSD](Amplicon_and_Metagenomics_3rd_Party_Software_Licenses/The_BIOM_Format_License_biom_format.org.pdf)|[http://biom-format.org/BIOM_LICENSE.html](http://biom-format.org/BIOM_LICENSE.html)|Copyright (c) 2011-2014, The BIOM Format Development Team All rights reserved.|
-|bbduk|[BBTools Copyright (c) 2014, The Regents of the University of California, through Lawrence Berkeley National Laboratory (subject to receipt of any required approvals from the U.S. Dept. of Energy). All rights reserved.](Amplicon_and_Metagenomics_3rd_Party_Software_Licenses/bbduk_license.pdf)|[https://github.com/BioInfoTools/BBMap/blob/a9ceda047a7c918dc090de0fdbf6f924292d4a1f/license.txt](https://github.com/BioInfoTools/BBMap/blob/a9ceda047a7c918dc090de0fdbf6f924292d4a1f/license.txt)|BBTools Copyright (c) 2014, The Regents of the University of California, through Lawrence Berkeley National Laboratory (subject to receipt of any required approvals from the U.S. Dept. of Energy). All rights reserved.|
-|vsearch|[This software is dual-licensed and available under a choice of one of two licenses, either under the terms of the GNU General Public License version 3 or the BSD 2-Clause License.](Amplicon_and_Metagenomics_3rd_Party_Software_Licenses/VSEARCH_LICENSE.pdf)|[https://github.com/torognes/vsearch/blob/master/LICENSE.txt](https://github.com/torognes/vsearch/blob/master/LICENSE.txt)|Copyright (C) 2014-2021, Torbjorn Rognes, Frederic Mahe and Tomas Flouri All rights reserved. Contact: Torbjorn Rognes , Department of Informatics, University of Oslo, PO Box 1080 Blindern, NO-0316 Oslo, Norway This software is dual-licensed and available under a choice of one of two licenses, either under the terms of the GNU General Public License version 3 or the BSD 2-Clause License.|
-|kraken2|[The MIT License (MIT) Copyright (c) 2017-2018 Derrick Wood](Amplicon_and_Metagenomics_3rd_Party_Software_Licenses/kraken2_LICENSE.pdf)|[https://github.com/DerrickWood/kraken2/blob/master/LICENSE](https://github.com/DerrickWood/kraken2/blob/master/LICENSE)|Copyright (c) 2017-2021 Derrick Wood Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:|
-|megahit|[GNU GENERAL PUBLIC LICENSE Version 3, 29 June 2007](Amplicon_and_Metagenomics_3rd_Party_Software_Licenses/megahit_LICENSE.pdf)|[https://github.com/voutcn/megahit/blob/master/LICENSE](https://github.com/voutcn/megahit/blob/master/LICENSE)|Copyright (C) 2007 Free Software Foundation, Inc. Everyone is permitted to copy and distribute verbatim copies of this license document, but changing it is not allowed.|
-|bit|[GNU GENERAL PUBLIC LICENSE Version 3, 29 June 2007](Amplicon_and_Metagenomics_3rd_Party_Software_Licenses/bit_LICENSE.pdf)|[https://github.com/AstrobioMike/bioinf_tools/blob/master/LICENSE](https://github.com/AstrobioMike/bioinf_tools/blob/master/LICENSE)|Copyright (C) 2007 Free Software Foundation, Inc. Everyone is permitted to copy and distribute verbatim copies of this license document, but changing it is not allowed.|
-|bowtie2|[GNU GENERAL PUBLIC LICENSE Version 3, 29 June 2007](Amplicon_and_Metagenomics_3rd_Party_Software_Licenses/bowtie2_LICENSE.pdf)|[https://github.com/BenLangmead/bowtie2/blob/master/LICENSE](https://github.com/BenLangmead/bowtie2/blob/master/LICENSE)|Copyright (C) 2007 Free Software Foundation, Inc. Everyone is permitted to copy and distribute verbatim copies of this license document, but changing it is not allowed.|
-|samtools|[The MIT/Expat License Copyright (C) 2008-2020 Genome Research Ltd.](Amplicon_and_Metagenomics_3rd_Party_Software_Licenses/samtools_LICENSE.pdf)|[https://github.com/samtools/samtools/blob/develop/LICENSE](https://github.com/samtools/samtools/blob/develop/LICENSE)|Copyright (C) 2008-2021 Genome Research Ltd. Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:|
-|prodigal|[GNU GENERAL PUBLIC LICENSE Version 3, 29 June 2007](Amplicon_and_Metagenomics_3rd_Party_Software_Licenses/Prodigal_LICENSE.pdf)|[https://github.com/hyattpd/Prodigal/blob/GoogleImport/LICENSE](https://github.com/hyattpd/Prodigal/blob/GoogleImport/LICENSE)|Copyright (C) 2007 Free Software Foundation, Inc. Everyone is permitted to copy and distribute verbatim copies of this license document, but changing it is not allowed.|
-|KOFamScan|[MIT License Copyright (c) 2019 Takuya Aramaki](Amplicon_and_Metagenomics_3rd_Party_Software_Licenses/kofam_scan_LICENSE.pdf)|[https://github.com/takaram/kofam_scan/blob/master/LICENSE.txt](https://github.com/takaram/kofam_scan/blob/master/LICENSE.txt)|Copyright (c) 2019 Takuya Aramaki Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:|
-|CAT|[The MIT License (MIT) Copyright (c) 2019 Universiteit Utrecht](Amplicon_and_Metagenomics_3rd_Party_Software_Licenses/CAT_LICENSE.pdf)|[https://github.com/dutilh/CAT/blob/master/LICENSE.md](https://github.com/dutilh/CAT/blob/master/LICENSE.md)|Copyright (c) 2019 Universiteit Utrecht Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:|
-|Metabat2|[MetaBAT (2014-075), The Regents of the University of California, through Lawrence Berkeley National Laboratory (subject to receipt of any required approvals from the U.S. Dept. of Energy). All rights reserved.](Amplicon_and_Metagenomics_3rd_Party_Software_Licenses/MetaBAT_license.pdf)|[https://bitbucket.org/berkeleylab/metabat/src/master/license.txt](https://bitbucket.org/berkeleylab/metabat/src/master/license.txt)|MetaBAT (2014-075), The Regents of the University of California, through Lawrence Berkeley National Laboratory (subject to receipt of any required approvals from the U.S. Dept. of Energy). All rights reserved.|
-|checkm|[GNU GENERAL PUBLIC LICENSE Version 3, 29 June 2007](Amplicon_and_Metagenomics_3rd_Party_Software_Licenses/CheckM_LICENSE.pdf)|[https://github.com/Ecogenomics/CheckM/blob/master/LICENSE](https://github.com/Ecogenomics/CheckM/blob/master/LICENSE)|Copyright (C) 2007 Free Software Foundation, Inc. Everyone is permitted to copy and distribute verbatim copies of this license document, but changing it is not allowed.|
-|gtdbtk|[GNU GENERAL PUBLIC LICENSE Version 3, 29 June 2007](Amplicon_and_Metagenomics_3rd_Party_Software_Licenses/GTDBTk_LICENSE.pdf)|[https://github.com/Ecogenomics/GTDBTk/blob/master/LICENSE](https://github.com/Ecogenomics/GTDBTk/blob/master/LICENSE)|Copyright (C) 2007 Free Software Foundation, Inc. Everyone is permitted to copy and distribute verbatim copies of this license document, but changing it is not allowed.|
-|HUMAnN3|[The HUMAnN software is licensed under the MIT license. Copyright (c) 2014 Harvard School of Public Health](Amplicon_and_Metagenomics_3rd_Party_Software_Licenses/HUMAnN_LICENSE.pdf)|[https://github.com/biobakery/humann/blob/master/LICENSE](https://github.com/biobakery/humann/blob/master/LICENSE)|Copyright (c) 2014 Harvard School of Public Health Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:|
-|Snakemake|[The MIT License (MIT) Copyright (c) 2012-2019 Johannes Köster](Amplicon_and_Metagenomics_3rd_Party_Software_Licenses/License_Snakemake_6.7.0_documentation.pdf)|[https://snakemake.readthedocs.io/en/stable/project_info/license.html](https://snakemake.readthedocs.io/en/stable/project_info/license.html)|Copyright (c) 2012-2019 Johannes Köster Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:|
-|genelab-utils|[GNU GENERAL PUBLIC LICENSE Version 3, 29 June 2007](Amplicon_and_Metagenomics_3rd_Party_Software_Licenses/genelab-utils_LICENSE.pdf)|[https://github.com/AstrobioMike/GeneLab-utils/blob/main/LICENSE](https://github.com/AstrobioMike/GeneLab-utils/blob/main/LICENSE)|Copyright (C) 2007 Free Software Foundation, Inc Everyone is permitted to copy and distribute verbatim copies of this license document, but changing it is not allowed.|
-|KEGGDecoder|[The MIT License (MIT) Copyright (c) 2019 Benjamin Tully](Amplicon_and_Metagenomics_3rd_Party_Software_Licenses/KEGGDecoder_LICENSE.pdf)|[https://github.com/bjtully/BioData/blob/master/LICENSE](https://github.com/bjtully/BioData/blob/master/LICENSE) | Copyright (c) 2019 Benjamin Tully Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:|
-|R|[GNU GENERAL PUBLIC LICENSE Version 2, June 1991, and Version 3, 29 June 2007](Amplicon_and_Metagenomics_3rd_Party_Software_Licenses/R_GPL-2_and_GPL-3_LICENSES.pdf)|[https://www.r-project.org/Licenses/](https://www.r-project.org/Licenses/)|Version 2: Copyright (C) 1989, 1991 Free Software Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA Everyone is permitted to copy and distribute verbatim copies of this license document, but changing it is not allowed.; Version 3: Copyright (C) 2007 Free Software Foundation, Inc. http://fsf.org/ Everyone is permitted to copy and distribute verbatim copies of this license document, but changing it is not allowed.|
-|vegan|[GNU GENERAL PUBLIC LICENSE Version 2, June 1991](Amplicon_and_Metagenomics_3rd_Party_Software_Licenses/vegan_LICENSE.pdf)|[https://github.com/vegandevs/vegan/blob/master/LICENSE](https://github.com/vegandevs/vegan/blob/master/LICENSE)|Copyright (C) 1989, 1991 Free Software Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA Everyone is permitted to copy and distribute verbatim copies of this license document, but changing it is not allowed.|
-|tidyverse|[MIT Licence Copyright (c) 2021](Amplicon_and_Metagenomics_3rd_Party_Software_Licenses/tidyverse_LICENSE.pdf)|[https://tidyverse.tidyverse.org/LICENSE.html](https://tidyverse.tidyverse.org/LICENSE.html)|MIT License Copyright (c) 2021 tidyverse authors|
-|dendextend|[GNU GENERAL PUBLIC LICENSE Version 2, June 1991, and Version 3, 29 June 2007](Amplicon_and_Metagenomics_3rd_Party_Software_Licenses/R_GPL-2_and_GPL-3_LICENSES.pdf)|[https://talgalili.github.io/dendextend/](https://talgalili.github.io/dendextend/)|Version 2: Copyright (C) 1989, 1991 Free Software Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA Everyone is permitted to copy and distribute verbatim copies of this license document, but changing it is not allowed.; Version 3: Copyright (C) 2007 Free Software Foundation, Inc. http://fsf.org/ Everyone is permitted to copy and distribute verbatim copies of this license document, but changing it is not allowed.|
-|ggrepel|[GNU GENERAL PUBLIC LICENSE Version 3, 29 June 2007](Amplicon_and_Metagenomics_3rd_Party_Software_Licenses/ggrepel_LICENSE.pdf)|[https://github.com/slowkow/ggrepel/blob/master/LICENSE](https://github.com/slowkow/ggrepel/blob/master/LICENSE)|Copyright (C) 2007 Free Software Foundation, Inc. Everyone is permitted to copy and distribute verbatim copies of this license document, but changing it is not allowed.|
-|dplyr|[MIT License Copyright (c) 2022 dplyr authors](Amplicon_and_Metagenomics_3rd_Party_Software_Licenses/dplyr_LICENSE.pdf)|[https://github.com/tidyverse/dplyr/blob/main/LICENSE.md](https://github.com/tidyverse/dplyr/blob/main/LICENSE.md)|Copyright (c) 2022 dplyr authors Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:|
-|rcolorbrewer|[Apache License Version 2.0, January 2004](Amplicon_and_Metagenomics_3rd_Party_Software_Licenses/colorbrewer_LICENSE.pdf)|[https://github.com/axismaps/colorbrewer/blob/master/LICENCE.txt](https://github.com/axismaps/colorbrewer/blob/master/LICENCE.txt)|Grant of Copyright License. Subject to the terms and conditions of this License, each Contributor hereby grants to You a perpetual, worldwide, non-exclusive, no-charge, royalty-free, irrevocable copyright license to reproduce, prepare Derivative Works of, publicly display, publicly perform, sublicense, and distribute the Work and such Derivative Works in Source or Object form.|
-|DESeq2|[GNU LESSER GENERAL PUBLIC LICENSE Version 3, 29 June 2007](Amplicon_and_Metagenomics_3rd_Party_Software_Licenses/DESeq2_LICENSE.pdf)|[https://bioconductor.org/packages/release/bioc/html/DESeq2.html](https://bioconductor.org/packages/release/bioc/html/DESeq2.html)|Copyright (C) 2007 Free Software Foundation, Inc. http://fsf.org/ Everyone is permitted to copy and distribute verbatim copies of this license document, but changing it is not allowed.|
-|phyloseq|[GNU AFFERO GENERAL PUBLIC LICENSE Version 3, 19 November 2007](Amplicon_and_Metagenomics_3rd_Party_Software_Licenses/phyloseq_LICENSE.pdf)|[https://bioconductor.org/packages/release/bioc/html/phyloseq.html](https://bioconductor.org/packages/release/bioc/html/phyloseq.html)|Copyright (C) 2007 Free Software Foundation, Inc. Everyone is permitted to copy and distribute verbatim copies of this license document, but changing it is not allowed.|
\ No newline at end of file
diff --git a/3rd_Party_Licenses/Amplicon_and_Metagenomics_3rd_Party_Software_Licenses/DECIPHER_gpl-3.0.pdf b/3rd_Party_Licenses/Amplicon_and_Metagenomics_3rd_Party_Software_Licenses/DECIPHER_gpl-3.0.pdf
deleted file mode 100644
index 26e8fdf7..00000000
Binary files a/3rd_Party_Licenses/Amplicon_and_Metagenomics_3rd_Party_Software_Licenses/DECIPHER_gpl-3.0.pdf and /dev/null differ
diff --git a/3rd_Party_Licenses/Amplicon_and_Metagenomics_3rd_Party_Software_Licenses/DESeq2_LICENSE.pdf b/3rd_Party_Licenses/Amplicon_and_Metagenomics_3rd_Party_Software_Licenses/DESeq2_LICENSE.pdf
deleted file mode 100644
index 5c5d5118..00000000
Binary files a/3rd_Party_Licenses/Amplicon_and_Metagenomics_3rd_Party_Software_Licenses/DESeq2_LICENSE.pdf and /dev/null differ
diff --git a/3rd_Party_Licenses/Amplicon_and_Metagenomics_3rd_Party_Software_Licenses/The_BIOM_Format_License_biom_format.org.pdf b/3rd_Party_Licenses/Amplicon_and_Metagenomics_3rd_Party_Software_Licenses/The_BIOM_Format_License_biom_format.org.pdf
deleted file mode 100644
index fc94e7de..00000000
Binary files a/3rd_Party_Licenses/Amplicon_and_Metagenomics_3rd_Party_Software_Licenses/The_BIOM_Format_License_biom_format.org.pdf and /dev/null differ
diff --git a/3rd_Party_Licenses/Amplicon_and_Metagenomics_3rd_Party_Software_Licenses/VSEARCH_LICENSE.pdf b/3rd_Party_Licenses/Amplicon_and_Metagenomics_3rd_Party_Software_Licenses/VSEARCH_LICENSE.pdf
deleted file mode 100644
index 77bb8591..00000000
Binary files a/3rd_Party_Licenses/Amplicon_and_Metagenomics_3rd_Party_Software_Licenses/VSEARCH_LICENSE.pdf and /dev/null differ
diff --git a/3rd_Party_Licenses/Amplicon_and_Metagenomics_3rd_Party_Software_Licenses/colorbrewer_LICENSE.pdf b/3rd_Party_Licenses/Amplicon_and_Metagenomics_3rd_Party_Software_Licenses/colorbrewer_LICENSE.pdf
deleted file mode 100644
index ea9e9490..00000000
Binary files a/3rd_Party_Licenses/Amplicon_and_Metagenomics_3rd_Party_Software_Licenses/colorbrewer_LICENSE.pdf and /dev/null differ
diff --git a/3rd_Party_Licenses/Amplicon_and_Metagenomics_3rd_Party_Software_Licenses/cutadapt_LICENSE.pdf b/3rd_Party_Licenses/Amplicon_and_Metagenomics_3rd_Party_Software_Licenses/cutadapt_LICENSE.pdf
deleted file mode 100644
index 67f5bfb9..00000000
Binary files a/3rd_Party_Licenses/Amplicon_and_Metagenomics_3rd_Party_Software_Licenses/cutadapt_LICENSE.pdf and /dev/null differ
diff --git a/3rd_Party_Licenses/Amplicon_and_Metagenomics_3rd_Party_Software_Licenses/dada2_LICENSE.pdf b/3rd_Party_Licenses/Amplicon_and_Metagenomics_3rd_Party_Software_Licenses/dada2_LICENSE.pdf
deleted file mode 100644
index 21a9d25e..00000000
Binary files a/3rd_Party_Licenses/Amplicon_and_Metagenomics_3rd_Party_Software_Licenses/dada2_LICENSE.pdf and /dev/null differ
diff --git a/3rd_Party_Licenses/Amplicon_and_Metagenomics_3rd_Party_Software_Licenses/dplyr_LICENSE.pdf b/3rd_Party_Licenses/Amplicon_and_Metagenomics_3rd_Party_Software_Licenses/dplyr_LICENSE.pdf
deleted file mode 100644
index cc0416a3..00000000
Binary files a/3rd_Party_Licenses/Amplicon_and_Metagenomics_3rd_Party_Software_Licenses/dplyr_LICENSE.pdf and /dev/null differ
diff --git a/3rd_Party_Licenses/Amplicon_and_Metagenomics_3rd_Party_Software_Licenses/ggrepel_LICENSE.pdf b/3rd_Party_Licenses/Amplicon_and_Metagenomics_3rd_Party_Software_Licenses/ggrepel_LICENSE.pdf
deleted file mode 100644
index bbeaba9d..00000000
Binary files a/3rd_Party_Licenses/Amplicon_and_Metagenomics_3rd_Party_Software_Licenses/ggrepel_LICENSE.pdf and /dev/null differ
diff --git a/3rd_Party_Licenses/Amplicon_and_Metagenomics_3rd_Party_Software_Licenses/phyloseq_LICENSE.pdf b/3rd_Party_Licenses/Amplicon_and_Metagenomics_3rd_Party_Software_Licenses/phyloseq_LICENSE.pdf
deleted file mode 100644
index 249b1ec6..00000000
Binary files a/3rd_Party_Licenses/Amplicon_and_Metagenomics_3rd_Party_Software_Licenses/phyloseq_LICENSE.pdf and /dev/null differ
diff --git a/3rd_Party_Licenses/Amplicon_and_Metagenomics_3rd_Party_Software_Licenses/tidyverse_LICENSE.pdf b/3rd_Party_Licenses/Amplicon_and_Metagenomics_3rd_Party_Software_Licenses/tidyverse_LICENSE.pdf
deleted file mode 100644
index 08554be5..00000000
Binary files a/3rd_Party_Licenses/Amplicon_and_Metagenomics_3rd_Party_Software_Licenses/tidyverse_LICENSE.pdf and /dev/null differ
diff --git a/3rd_Party_Licenses/Amplicon_and_Metagenomics_3rd_Party_Software_Licenses/vegan_LICENSE.pdf b/3rd_Party_Licenses/Amplicon_and_Metagenomics_3rd_Party_Software_Licenses/vegan_LICENSE.pdf
deleted file mode 100644
index 947f00b0..00000000
Binary files a/3rd_Party_Licenses/Amplicon_and_Metagenomics_3rd_Party_Software_Licenses/vegan_LICENSE.pdf and /dev/null differ
diff --git a/3rd_Party_Licenses/Metagenomics_3rd_Party_Software.md b/3rd_Party_Licenses/Metagenomics_3rd_Party_Software.md
new file mode 100644
index 00000000..88378014
--- /dev/null
+++ b/3rd_Party_Licenses/Metagenomics_3rd_Party_Software.md
@@ -0,0 +1,24 @@
+## The NASA GeneLab [Metagenomics](../Metagenomics) Processing Pipelines software also makes use of the following 3rd party Open Source software:
+
+|3rd Party Software Name|License|License URL|Copyright Notice|
+|:----------------------|:------|:----------|:----------------------|
+|FastQC|[GNU GENERAL PUBLIC LICENSE Version 2, June 1991](Metagenomics_3rd_Party_Software_Licenses/FASTQC_LICENSE.pdf)|[https://github.com/s-andrews/FastQC/blob/master/LICENSE](https://github.com/s-andrews/FastQC/blob/master/LICENSE)|Copyright (C) 1989, 1991 Free Software Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA Everyone is permitted to copy and distribute verbatim copies of this license document, but changing it is not allowed.|
+|MultiQC|[GNU GENERAL PUBLIC LICENSE Version 3, 29 June 2007](Metagenomics_3rd_Party_Software_Licenses/MultiQC_LICENSE.pdf)|[https://github.com/ewels/MultiQC/blob/master/LICENSE](https://github.com/ewels/MultiQC/blob/master/LICENSE)|Copyright (C) 2007 Free Software Foundation, Inc Everyone is permitted to copy and distribute verbatim copies of this license document, but changing it is not allowed.|
+|bbduk|[BBTools Copyright (c) 2014, The Regents of the University of California, through Lawrence Berkeley National Laboratory (subject to receipt of any required approvals from the U.S. Dept. of Energy). All rights reserved.](Metagenomics_3rd_Party_Software_Licenses/bbduk_license.pdf)|[https://github.com/BioInfoTools/BBMap/blob/a9ceda047a7c918dc090de0fdbf6f924292d4a1f/license.txt](https://github.com/BioInfoTools/BBMap/blob/a9ceda047a7c918dc090de0fdbf6f924292d4a1f/license.txt)|BBTools Copyright (c) 2014, The Regents of the University of California, through Lawrence Berkeley National Laboratory (subject to receipt of any required approvals from the U.S. Dept. of Energy). All rights reserved.|
+|megahit|[GNU GENERAL PUBLIC LICENSE Version 3, 29 June 2007](Metagenomics_3rd_Party_Software_Licenses/megahit_LICENSE.pdf)|[https://github.com/voutcn/megahit/blob/master/LICENSE](https://github.com/voutcn/megahit/blob/master/LICENSE)|Copyright (C) 2007 Free Software Foundation, Inc. Everyone is permitted to copy and distribute verbatim copies of this license document, but changing it is not allowed.|
+|bit|[GNU GENERAL PUBLIC LICENSE Version 3, 29 June 2007](Metagenomics_3rd_Party_Software_Licenses/bit_LICENSE.pdf)|[https://github.com/AstrobioMike/bioinf_tools/blob/master/LICENSE](https://github.com/AstrobioMike/bioinf_tools/blob/master/LICENSE)|Copyright (C) 2007 Free Software Foundation, Inc. Everyone is permitted to copy and distribute verbatim copies of this license document, but changing it is not allowed.|
+|bowtie2|[GNU GENERAL PUBLIC LICENSE Version 3, 29 June 2007](Metagenomics_3rd_Party_Software_Licenses/bowtie2_LICENSE.pdf)|[https://github.com/BenLangmead/bowtie2/blob/master/LICENSE](https://github.com/BenLangmead/bowtie2/blob/master/LICENSE)|Copyright (C) 2007 Free Software Foundation, Inc. Everyone is permitted to copy and distribute verbatim copies of this license document, but changing it is not allowed.|
+|samtools|[The MIT/Expat License Copyright (C) 2008-2020 Genome Research Ltd.](Metagenomics_3rd_Party_Software_Licenses/samtools_LICENSE.pdf)|[https://github.com/samtools/samtools/blob/develop/LICENSE](https://github.com/samtools/samtools/blob/develop/LICENSE)|Copyright (C) 2008-2021 Genome Research Ltd. Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:|
+|prodigal|[GNU GENERAL PUBLIC LICENSE Version 3, 29 June 2007](Metagenomics_3rd_Party_Software_Licenses/Prodigal_LICENSE.pdf)|[https://github.com/hyattpd/Prodigal/blob/GoogleImport/LICENSE](https://github.com/hyattpd/Prodigal/blob/GoogleImport/LICENSE)|Copyright (C) 2007 Free Software Foundation, Inc. Everyone is permitted to copy and distribute verbatim copies of this license document, but changing it is not allowed.|
+|KOFamScan|[MIT License Copyright (c) 2019 Takuya Aramaki](Metagenomics_3rd_Party_Software_Licenses/kofam_scan_LICENSE.pdf)|[https://github.com/takaram/kofam_scan/blob/master/LICENSE.txt](https://github.com/takaram/kofam_scan/blob/master/LICENSE.txt)|Copyright (c) 2019 Takuya Aramaki Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:|
+|CAT|[The MIT License (MIT) Copyright (c) 2019 Universiteit Utrecht](Metagenomics_3rd_Party_Software_Licenses/CAT_LICENSE.pdf)|[https://github.com/dutilh/CAT/blob/master/LICENSE.md](https://github.com/dutilh/CAT/blob/master/LICENSE.md)|Copyright (c) 2019 Universiteit Utrecht Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:|
+|Metabat2|[MetaBAT (2014-075), The Regents of the University of California, through Lawrence Berkeley National Laboratory (subject to receipt of any required approvals from the U.S. Dept. of Energy). All rights reserved.](Metagenomics_3rd_Party_Software_Licenses/MetaBAT_license.pdf)|[https://bitbucket.org/berkeleylab/metabat/src/master/license.txt](https://bitbucket.org/berkeleylab/metabat/src/master/license.txt)|MetaBAT (2014-075), The Regents of the University of California, through Lawrence Berkeley National Laboratory (subject to receipt of any required approvals from the U.S. Dept. of Energy). All rights reserved.|
+|checkm|[GNU GENERAL PUBLIC LICENSE Version 3, 29 June 2007](Metagenomics_3rd_Party_Software_Licenses/CheckM_LICENSE.pdf)|[https://github.com/Ecogenomics/CheckM/blob/master/LICENSE](https://github.com/Ecogenomics/CheckM/blob/master/LICENSE)|Copyright (C) 2007 Free Software Foundation, Inc. Everyone is permitted to copy and distribute verbatim copies of this license document, but changing it is not allowed.|
+|gtdbtk|[GNU GENERAL PUBLIC LICENSE Version 3, 29 June 2007](Metagenomics_3rd_Party_Software_Licenses/GTDBTk_LICENSE.pdf)|[https://github.com/Ecogenomics/GTDBTk/blob/master/LICENSE](https://github.com/Ecogenomics/GTDBTk/blob/master/LICENSE)|Copyright (C) 2007 Free Software Foundation, Inc. Everyone is permitted to copy and distribute verbatim copies of this license document, but changing it is not allowed.|
+|KEGGDecoder|[The MIT License (MIT) Copyright (c) 2019 Benjamin Tully](Metagenomics_3rd_Party_Software_Licenses/KEGGDecoder_LICENSE.pdf)|[https://github.com/bjtully/BioData/blob/master/LICENSE](https://github.com/bjtully/BioData/blob/master/LICENSE) | Copyright (c) 2019 Benjamin Tully Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:|
+|HUMAnN3|[The HUMAnN software is licensed under the MIT license. Copyright (c) 2014 Harvard School of Public Health](Metagenomics_3rd_Party_Software_Licenses/HUMAnN_LICENSE.pdf)|[https://github.com/biobakery/humann/blob/master/LICENSE](https://github.com/biobakery/humann/blob/master/LICENSE)|Copyright (c) 2014 Harvard School of Public Health Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:|
+|MetaPhlAn3|[The MIT License (MIT) Copyright (c) 2015, Duy Tin Truong, Nicola Segata and Curtis Huttenhower](Metagenomics_3rd_Party_Software_Licenses/MetaPhlAn3_license.pdf)|[https://github.com/biobakery/MetaPhlAn/blob/3.0/license.txt](https://github.com/biobakery/MetaPhlAn/blob/3.0/license.txt)|Copyright (c) 2015, Duy Tin Truong, Nicola Segata and Curtis Huttenhower|
+|kraken2|[The MIT License (MIT) Copyright (c) 2017-2018 Derrick Wood](Metagenomics_3rd_Party_Software_Licenses/kraken2_LICENSE.pdf)|[https://github.com/DerrickWood/kraken2/blob/master/LICENSE](https://github.com/DerrickWood/kraken2/blob/master/LICENSE)|Copyright (c) 2017-2021 Derrick Wood Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:|
+|Snakemake|[The MIT License (MIT) Copyright (c) 2012-2019 Johannes Köster](Metagenomics_3rd_Party_Software_Licenses/License_Snakemake_6.7.0_documentation.pdf)|[https://snakemake.readthedocs.io/en/stable/project_info/license.html](https://snakemake.readthedocs.io/en/stable/project_info/license.html)|Copyright (c) 2012-2019 Johannes Köster Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:|
+|genelab-utils|[GNU GENERAL PUBLIC LICENSE Version 3, 29 June 2007](Metagenomics_3rd_Party_Software_Licenses/genelab-utils_LICENSE.pdf)|[https://github.com/AstrobioMike/GeneLab-utils/blob/main/LICENSE](https://github.com/AstrobioMike/GeneLab-utils/blob/main/LICENSE)|Copyright (C) 2007 Free Software Foundation, Inc Everyone is permitted to copy and distribute verbatim copies of this license document, but changing it is not allowed.|
+|R|[GNU GENERAL PUBLIC LICENSE Version 2, June 1991, and Version 3, 29 June 2007](Metagenomics_3rd_Party_Software_Licenses/R_GPL-2_and_GPL-3_LICENSES.pdf)|[https://www.r-project.org/Licenses/](https://www.r-project.org/Licenses/)|Version 2: Copyright (C) 1989, 1991 Free Software Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA Everyone is permitted to copy and distribute verbatim copies of this license document, but changing it is not allowed.; Version 3: Copyright (C) 2007 Free Software Foundation, Inc. http://fsf.org/ Everyone is permitted to copy and distribute verbatim copies of this license document, but changing it is not allowed.|
diff --git a/3rd_Party_Licenses/Amplicon_and_Metagenomics_3rd_Party_Software_Licenses/CAT_LICENSE.pdf b/3rd_Party_Licenses/Metagenomics_3rd_Party_Software_Licenses/CAT_LICENSE.pdf
similarity index 100%
rename from 3rd_Party_Licenses/Amplicon_and_Metagenomics_3rd_Party_Software_Licenses/CAT_LICENSE.pdf
rename to 3rd_Party_Licenses/Metagenomics_3rd_Party_Software_Licenses/CAT_LICENSE.pdf
diff --git a/3rd_Party_Licenses/Amplicon_and_Metagenomics_3rd_Party_Software_Licenses/CheckM_LICENSE.pdf b/3rd_Party_Licenses/Metagenomics_3rd_Party_Software_Licenses/CheckM_LICENSE.pdf
similarity index 100%
rename from 3rd_Party_Licenses/Amplicon_and_Metagenomics_3rd_Party_Software_Licenses/CheckM_LICENSE.pdf
rename to 3rd_Party_Licenses/Metagenomics_3rd_Party_Software_Licenses/CheckM_LICENSE.pdf
diff --git a/3rd_Party_Licenses/Amplicon_and_Metagenomics_3rd_Party_Software_Licenses/FASTQC_LICENSE.pdf b/3rd_Party_Licenses/Metagenomics_3rd_Party_Software_Licenses/FASTQC_LICENSE.pdf
similarity index 100%
rename from 3rd_Party_Licenses/Amplicon_and_Metagenomics_3rd_Party_Software_Licenses/FASTQC_LICENSE.pdf
rename to 3rd_Party_Licenses/Metagenomics_3rd_Party_Software_Licenses/FASTQC_LICENSE.pdf
diff --git a/3rd_Party_Licenses/Amplicon_and_Metagenomics_3rd_Party_Software_Licenses/GTDBTk_LICENSE.pdf b/3rd_Party_Licenses/Metagenomics_3rd_Party_Software_Licenses/GTDBTk_LICENSE.pdf
similarity index 100%
rename from 3rd_Party_Licenses/Amplicon_and_Metagenomics_3rd_Party_Software_Licenses/GTDBTk_LICENSE.pdf
rename to 3rd_Party_Licenses/Metagenomics_3rd_Party_Software_Licenses/GTDBTk_LICENSE.pdf
diff --git a/3rd_Party_Licenses/Amplicon_and_Metagenomics_3rd_Party_Software_Licenses/HUMAnN_LICENSE.pdf b/3rd_Party_Licenses/Metagenomics_3rd_Party_Software_Licenses/HUMAnN_LICENSE.pdf
similarity index 100%
rename from 3rd_Party_Licenses/Amplicon_and_Metagenomics_3rd_Party_Software_Licenses/HUMAnN_LICENSE.pdf
rename to 3rd_Party_Licenses/Metagenomics_3rd_Party_Software_Licenses/HUMAnN_LICENSE.pdf
diff --git a/3rd_Party_Licenses/Amplicon_and_Metagenomics_3rd_Party_Software_Licenses/KEGGDecoder_LICENSE.pdf b/3rd_Party_Licenses/Metagenomics_3rd_Party_Software_Licenses/KEGGDecoder_LICENSE.pdf
similarity index 100%
rename from 3rd_Party_Licenses/Amplicon_and_Metagenomics_3rd_Party_Software_Licenses/KEGGDecoder_LICENSE.pdf
rename to 3rd_Party_Licenses/Metagenomics_3rd_Party_Software_Licenses/KEGGDecoder_LICENSE.pdf
diff --git a/3rd_Party_Licenses/Amplicon_and_Metagenomics_3rd_Party_Software_Licenses/License_Snakemake_6.7.0_documentation.pdf b/3rd_Party_Licenses/Metagenomics_3rd_Party_Software_Licenses/License_Snakemake_6.7.0_documentation.pdf
similarity index 100%
rename from 3rd_Party_Licenses/Amplicon_and_Metagenomics_3rd_Party_Software_Licenses/License_Snakemake_6.7.0_documentation.pdf
rename to 3rd_Party_Licenses/Metagenomics_3rd_Party_Software_Licenses/License_Snakemake_6.7.0_documentation.pdf
diff --git a/3rd_Party_Licenses/Amplicon_and_Metagenomics_3rd_Party_Software_Licenses/MetaBAT_license.pdf b/3rd_Party_Licenses/Metagenomics_3rd_Party_Software_Licenses/MetaBAT_license.pdf
similarity index 100%
rename from 3rd_Party_Licenses/Amplicon_and_Metagenomics_3rd_Party_Software_Licenses/MetaBAT_license.pdf
rename to 3rd_Party_Licenses/Metagenomics_3rd_Party_Software_Licenses/MetaBAT_license.pdf
diff --git a/3rd_Party_Licenses/Amplicon_and_Metagenomics_3rd_Party_Software_Licenses/MetaPhlAn3_license.pdf b/3rd_Party_Licenses/Metagenomics_3rd_Party_Software_Licenses/MetaPhlAn3_license.pdf
similarity index 100%
rename from 3rd_Party_Licenses/Amplicon_and_Metagenomics_3rd_Party_Software_Licenses/MetaPhlAn3_license.pdf
rename to 3rd_Party_Licenses/Metagenomics_3rd_Party_Software_Licenses/MetaPhlAn3_license.pdf
diff --git a/3rd_Party_Licenses/Amplicon_and_Metagenomics_3rd_Party_Software_Licenses/MultiQC_LICENSE.pdf b/3rd_Party_Licenses/Metagenomics_3rd_Party_Software_Licenses/MultiQC_LICENSE.pdf
similarity index 100%
rename from 3rd_Party_Licenses/Amplicon_and_Metagenomics_3rd_Party_Software_Licenses/MultiQC_LICENSE.pdf
rename to 3rd_Party_Licenses/Metagenomics_3rd_Party_Software_Licenses/MultiQC_LICENSE.pdf
diff --git a/3rd_Party_Licenses/Amplicon_and_Metagenomics_3rd_Party_Software_Licenses/Prodigal_LICENSE.pdf b/3rd_Party_Licenses/Metagenomics_3rd_Party_Software_Licenses/Prodigal_LICENSE.pdf
similarity index 100%
rename from 3rd_Party_Licenses/Amplicon_and_Metagenomics_3rd_Party_Software_Licenses/Prodigal_LICENSE.pdf
rename to 3rd_Party_Licenses/Metagenomics_3rd_Party_Software_Licenses/Prodigal_LICENSE.pdf
diff --git a/3rd_Party_Licenses/Amplicon_and_Metagenomics_3rd_Party_Software_Licenses/R_GPL-2_and_GPL-3_LICENSES.pdf b/3rd_Party_Licenses/Metagenomics_3rd_Party_Software_Licenses/R_GPL-2_and_GPL-3_LICENSES.pdf
similarity index 100%
rename from 3rd_Party_Licenses/Amplicon_and_Metagenomics_3rd_Party_Software_Licenses/R_GPL-2_and_GPL-3_LICENSES.pdf
rename to 3rd_Party_Licenses/Metagenomics_3rd_Party_Software_Licenses/R_GPL-2_and_GPL-3_LICENSES.pdf
diff --git a/3rd_Party_Licenses/Amplicon_and_Metagenomics_3rd_Party_Software_Licenses/bbduk_license.pdf b/3rd_Party_Licenses/Metagenomics_3rd_Party_Software_Licenses/bbduk_license.pdf
similarity index 100%
rename from 3rd_Party_Licenses/Amplicon_and_Metagenomics_3rd_Party_Software_Licenses/bbduk_license.pdf
rename to 3rd_Party_Licenses/Metagenomics_3rd_Party_Software_Licenses/bbduk_license.pdf
diff --git a/3rd_Party_Licenses/Amplicon_and_Metagenomics_3rd_Party_Software_Licenses/bit_LICENSE.pdf b/3rd_Party_Licenses/Metagenomics_3rd_Party_Software_Licenses/bit_LICENSE.pdf
similarity index 100%
rename from 3rd_Party_Licenses/Amplicon_and_Metagenomics_3rd_Party_Software_Licenses/bit_LICENSE.pdf
rename to 3rd_Party_Licenses/Metagenomics_3rd_Party_Software_Licenses/bit_LICENSE.pdf
diff --git a/3rd_Party_Licenses/Amplicon_and_Metagenomics_3rd_Party_Software_Licenses/bowtie2_LICENSE.pdf b/3rd_Party_Licenses/Metagenomics_3rd_Party_Software_Licenses/bowtie2_LICENSE.pdf
similarity index 100%
rename from 3rd_Party_Licenses/Amplicon_and_Metagenomics_3rd_Party_Software_Licenses/bowtie2_LICENSE.pdf
rename to 3rd_Party_Licenses/Metagenomics_3rd_Party_Software_Licenses/bowtie2_LICENSE.pdf
diff --git a/3rd_Party_Licenses/Amplicon_and_Metagenomics_3rd_Party_Software_Licenses/genelab-utils_LICENSE.pdf b/3rd_Party_Licenses/Metagenomics_3rd_Party_Software_Licenses/genelab-utils_LICENSE.pdf
similarity index 100%
rename from 3rd_Party_Licenses/Amplicon_and_Metagenomics_3rd_Party_Software_Licenses/genelab-utils_LICENSE.pdf
rename to 3rd_Party_Licenses/Metagenomics_3rd_Party_Software_Licenses/genelab-utils_LICENSE.pdf
diff --git a/3rd_Party_Licenses/Amplicon_and_Metagenomics_3rd_Party_Software_Licenses/kofam_scan_LICENSE.pdf b/3rd_Party_Licenses/Metagenomics_3rd_Party_Software_Licenses/kofam_scan_LICENSE.pdf
similarity index 100%
rename from 3rd_Party_Licenses/Amplicon_and_Metagenomics_3rd_Party_Software_Licenses/kofam_scan_LICENSE.pdf
rename to 3rd_Party_Licenses/Metagenomics_3rd_Party_Software_Licenses/kofam_scan_LICENSE.pdf
diff --git a/3rd_Party_Licenses/Amplicon_and_Metagenomics_3rd_Party_Software_Licenses/kraken2_LICENSE.pdf b/3rd_Party_Licenses/Metagenomics_3rd_Party_Software_Licenses/kraken2_LICENSE.pdf
similarity index 100%
rename from 3rd_Party_Licenses/Amplicon_and_Metagenomics_3rd_Party_Software_Licenses/kraken2_LICENSE.pdf
rename to 3rd_Party_Licenses/Metagenomics_3rd_Party_Software_Licenses/kraken2_LICENSE.pdf
diff --git a/3rd_Party_Licenses/Amplicon_and_Metagenomics_3rd_Party_Software_Licenses/megahit_LICENSE.pdf b/3rd_Party_Licenses/Metagenomics_3rd_Party_Software_Licenses/megahit_LICENSE.pdf
similarity index 100%
rename from 3rd_Party_Licenses/Amplicon_and_Metagenomics_3rd_Party_Software_Licenses/megahit_LICENSE.pdf
rename to 3rd_Party_Licenses/Metagenomics_3rd_Party_Software_Licenses/megahit_LICENSE.pdf
diff --git a/3rd_Party_Licenses/Amplicon_and_Metagenomics_3rd_Party_Software_Licenses/samtools_LICENSE.pdf b/3rd_Party_Licenses/Metagenomics_3rd_Party_Software_Licenses/samtools_LICENSE.pdf
similarity index 100%
rename from 3rd_Party_Licenses/Amplicon_and_Metagenomics_3rd_Party_Software_Licenses/samtools_LICENSE.pdf
rename to 3rd_Party_Licenses/Metagenomics_3rd_Party_Software_Licenses/samtools_LICENSE.pdf
diff --git a/3rd_Party_Licenses/README.md b/3rd_Party_Licenses/README.md
new file mode 100644
index 00000000..62ac55d0
--- /dev/null
+++ b/3rd_Party_Licenses/README.md
@@ -0,0 +1,10 @@
+## The NASA GeneLab Processing Pipelines utilize 3rd party Open Source software as indicated:
+
+| Assay Type | 3rd Party Software Table | 3rd Party Software Licenses |
+|:-------------------------------|:-------------------------|:----------------------------|
+| Amplicon Sequencing | [Amplicon 3rd Party Software](https://github.com/nasa/GeneLab_AmpliconSeq_Workflow/tree/main/License/3rd_Party_Licenses/README.md) | [Amplicon 3rd Party Licenses](https://github.com/nasa/GeneLab_AmpliconSeq_Workflow/tree/main/License/3rd_Party_Licenses) |
+| Metagenomics | [Metagenomics 3rd Party Software](./Metagenomics_3rd_Party_Software.md) | [Metagenomics 3rd Party Licenses](./Metagenomics_3rd_Party_Software_Licenses/) |
+| Methylation Sequencing | [Methyl-Seq 3rd Party Software](./Methyl-Seq_3rd_Party_Software.md) | [Methyl-Seq 3rd Party Licenses](./Methyl-Seq_3rd_Party_Software_Licenses/) |
+| Microarray - Affymetrix | [Microarray Affymetrix 3rd Party Software](./Microarray_Affymetrix_3rd_Party_Software.md) | [Microarray Affymetrix 3rd Party Licenses](./Microarray_Affymetrix_3rd_Party_Software_Licenses/) |
+| Microarray - Agilent 1-channel | [Microarray Agilent 1-channel 3rd Party Software](./Microarray_Agilent_1_Channel_3rd_Party_Software.md) | [Microarray Agilent 1-channel 3rd Party Licenses](./Microarray_Agilent_1_Channel_3rd_Party_Software_Licenses/) |
+| (bulk) RNAseq | [RNASeq 3rd Party Software](./RNAseq_3rd_Party_Software.md) | [RNAseq 3rd Party Licenses](./RNAseq_3rd_Party_Software_Licenses/) |
diff --git a/Amplicon/454-and-IonTorrent/README.md b/Amplicon/454-and-IonTorrent/README.md
index d79aa62c..3286f065 100644
--- a/Amplicon/454-and-IonTorrent/README.md
+++ b/Amplicon/454-and-IonTorrent/README.md
@@ -7,7 +7,7 @@
---
-
+
---
diff --git a/Amplicon/Illumina/Pipeline_GL-DPPD-7104_Versions/GL-DPPD-7104-C.md b/Amplicon/Illumina/Pipeline_GL-DPPD-7104_Versions/GL-DPPD-7104-C.md
new file mode 100644
index 00000000..867e15f2
--- /dev/null
+++ b/Amplicon/Illumina/Pipeline_GL-DPPD-7104_Versions/GL-DPPD-7104-C.md
@@ -0,0 +1,3614 @@
+# Bioinformatics pipeline for amplicon Illumina sequencing data
+
+> **This page holds an overview and instructions for how GeneLab processes Illumina amplicon sequencing datasets. Exact processing commands for specific datasets that have been released are available in the [GLDS_Processing_Scripts](../GLDS_Processing_Scripts) sub-directory and/or are provided with their processed data in the [Open Science Data Repository (OSDR)](https://osdr.nasa.gov/bio/repo/).**
+
+---
+
+**Date:** May 5, 2025
+**Revision:** C
+**Document Number:** GL-DPPD-7104
+
+**Submitted by:**
+Olabiyi Obayomi, Alexis Torres, and Michael D. Lee (GeneLab Data Processing Team)
+
+**Approved by:**
+Samrawit Gebre (OSDR Project Manager)
+Danielle Lopez (OSDR Deputy Project Manager)
+Jonathan Galazka (OSDR Project Scientist)
+Amanda Saravia-Butler (GeneLab Science Lead)
+Barbara Novak (GeneLab Data Processing Lead)
+
+---
+
+## Updates from previous version
+
+Software Updates and Changes:
+
+| Program | Previous Version | New Version |
+|:-------------|:-----------------|:--------------|
+| FastQC | 0.11.9 | 0.12.1 |
+| MultiQC | 1.9 | 1.27.1 |
+| Cutadapt | 2.3 | 5.0 |
+| R-base | 4.1.1 | 4.4.2 |
+| DADA2 | 1.20.0 | 1.34.0 |
+| DECIPHER | 2.20.0 | 3.2.0 |
+| biomformat | 1.20.0 | 1.34.0 |
+| ANCOMBC | N/A | 2.8.0 |
+| broom | N/A | 1.0.7 |
+| DescTools | N/A | 0.99.59 |
+| DESeq2 | N/A | 1.46.0 |
+| dp_tools | N/A | 1.3.8 |
+| FSA | N/A | 0.9.6 |
+| ggdendro | N/A | 0.2.0 |
+| ggrepel | N/A | 0.9.6 |
+| glue | N/A | 1.8.0 |
+| hexbin | N/A | 1.28.3 |
+| mia | N/A | 1.14.0 |
+| phyloseq | N/A | 1.50.0 |
+| rcolorbrewer | N/A | 1.1.3 |
+| taxize | N/A | 0.10.0 |
+| tidyverse | N/A | 2.0.0 |
+| vegan | N/A | 2.6-10 |
+| vsn | N/A | 3.74.0 |
+| patchwork | N/A | 1.3.0 |
+| rstatix | N/A | 0.7.2 |
+| multcompView | N/A | 0.1-10 |
+| scales | N/A | 1.4.0 |
+| dendextend | N/A | 1.19.0 |
+
+- Added new processing steps in R to generate processed data outputs for alpha and beta diversity, taxonomic summary plots, and differential abundance:
+ - Alpha Diversity Analysis ([Step 7](#7-alpha-diversity-analysis))
+ - Beta Diversity Analysis ([Step 8](#8-beta-diversity-analysis))
+ - Group-wise and Sample-wise Taxonomic Summary Plots ([Step 9](#9-taxonomy-plots))
+  - Differential Abundance Testing ([Step 10](#10-differential-abundance-testing)) with
+    ANCOMBC 1 ([Step 10a](#10a-ancombc-1)), ANCOMBC 2 ([Step 10b](#10b-ancombc-2)), and DESeq2 ([Step 10c](#10c-deseq2))
+- Added the assay-specific suffix ("_GLAmpSeq") to output file names where needed for OSDR
+- Updated [DECIPHER](https://www2.decipher.codes/data/Downloads/TrainingSets/) reference files to the following:
+ - ITS UNITE: "UNITE\_v2024\_April2024.RData"
+ - SILVA SSU r138: "SILVA\_SSU\_r138\_2\_2024.RData"
+ - PR2 v4.13: "PR2\_v4\_13\_March2021.RData"
+- Replaced reference links to the DECIPHER [website](https://www2.decipher.codes/data/Downloads/TrainingSets/) with persistent reference links to the DECIPHER databases hosted on Figshare:
+ - [SILVA SSU r138](https://figshare.com/ndownloader/files/52846199)
+ - [UNITE v2024](https://figshare.com/ndownloader/files/52846346)
+ - [PR2 v4.13](https://figshare.com/ndownloader/files/46241917)
+
+---
+
+# Table of contents
+
+- [**Software used**](#software-used)
+- [**Reference databases used**](#reference-databases-used)
+- [**General processing overview with example commands**](#general-processing-overview-with-example-commands)
+ - [**1. Raw Data QC**](#1-raw-data-qc)
+ - [1a. Raw Data QC](#1a-raw-data-qc)
+ - [1b. Compile Raw Data QC](#1b-compile-raw-data-qc)
+ - [**2. Trim Primers**](#2-trim-primers)
+ - [**3. Quality Filtering**](#3-quality-filtering)
+ - [**4. Filtered Data QC**](#4-filtered-data-qc)
+ - [4a. Filtered Data QC](#4a-filtered-data-qc)
+ - [4b. Compile Filtered Data QC](#4b-compile-filtered-data-qc)
+ - [**5. Calculate Error model, Apply DADA2 Algorithm, Assign Taxonomy, and Create Output Tables**](#5-calculate-error-model-apply-dada2-algorithm-assign-taxonomy-and-create-output-tables)
+ - [5a. Learning the Error Rates](#5a-learning-the-error-rates)
+ - [5b. Inferring Sequences](#5b-inferring-sequences)
+ - [5c. Merging Forward and Reverse Reads; Not Needed if Data are Single-End](#5c-merging-forward-and-reverse-reads-not-needed-if-data-are-single-end)
+ - [5d. Generating Sequence Table with Counts per Sample](#5d-generating-sequence-table-with-counts-per-sample)
+ - [5e. Removing Putative Chimeras](#5e-removing-putative-chimeras)
+ - [5f. Assigning Taxonomy](#5f-assigning-taxonomy)
+ - [5g. Generating and Writing Standard Outputs](#5g-generating-and-writing-standard-outputs)
+ - [**6. Amplicon Seq Data Analysis Set Up**](#6-amplicon-seq-data-analysis-set-up)
+ - [6a. Create Sample Runsheet](#6a-create-sample-runsheet)
+ - [6b. R Environment Set Up](#6b-r-environment-set-up)
+ - [6b.i. Load Libraries](#6bi-load-libraries)
+ - [6b.ii. Define Custom Functions](#6bii-define-custom-functions)
+ - [6b.iii. Set Variables](#6biii-set-variables)
+ - [6b.iv. Read-in Input Tables](#6biv-read-in-input-tables)
+ - [6b.v. Preprocessing](#6bv-preprocessing)
+ - [**7. Alpha Diversity Analysis**](#7-alpha-diversity-analysis)
+ - [7a. Rarefaction Curves](#7a-rarefaction-curves)
+ - [7b. Richness and Diversity Estimates](#7b-richness-and-diversity-estimates)
+ - [7c. Plot Richness and Diversity Estimates](#7c-plot-richness-and-diversity-estimates)
+ - [**8. Beta Diversity Analysis**](#8-beta-diversity-analysis)
+ - [**9. Taxonomy Plots**](#9-taxonomy-plots)
+ - [**10. Differential Abundance Testing**](#10-differential-abundance-testing)
+ - [10a. ANCOMBC 1](#10a-ancombc-1)
+ - [10b. ANCOMBC 2](#10b-ancombc-2)
+  - [10c. DESeq2](#10c-deseq2)
+
+---
+
+# Software used
+
+|Program|Version|Relevant Links|
+|:------|:-----:|:-------------|
+|FastQC|0.12.1|[https://www.bioinformatics.babraham.ac.uk/projects/fastqc/](https://www.bioinformatics.babraham.ac.uk/projects/fastqc/)|
+|MultiQC|1.27.1|[https://multiqc.info/](https://multiqc.info/)|
+|Cutadapt|5.0|[https://cutadapt.readthedocs.io/en/stable/](https://cutadapt.readthedocs.io/en/stable/)|
+|R-base|4.4.2|[https://www.r-project.org/](https://www.r-project.org/)|
+|DADA2|1.34.0|[https://www.bioconductor.org/packages/release/bioc/html/dada2.html](https://www.bioconductor.org/packages/release/bioc/html/dada2.html)|
+|DECIPHER|3.2.0|[https://bioconductor.org/packages/release/bioc/html/DECIPHER.html](https://bioconductor.org/packages/release/bioc/html/DECIPHER.html)|
+|biomformat|1.34.0|[https://github.com/joey711/biomformat](https://github.com/joey711/biomformat)|
+|ANCOMBC|2.8.0|[https://github.com/FrederickHuangLin/ANCOMBC](https://github.com/FrederickHuangLin/ANCOMBC)|
+|broom|1.0.7|[https://CRAN.R-project.org/package=broom](https://CRAN.R-project.org/package=broom)|
+|DescTools|0.99.59|[https://andrisignorell.github.io/DescTools/](https://andrisignorell.github.io/DescTools/)|
+|DESeq2|1.46.0|[https://bioconductor.org/packages/release/bioc/html/DESeq2.html](https://bioconductor.org/packages/release/bioc/html/DESeq2.html)|
+|FSA|0.9.6|[https://CRAN.R-project.org/package=FSA](https://CRAN.R-project.org/package=FSA)|
+|ggdendro|0.2.0|[https://CRAN.R-project.org/package=ggdendro](https://CRAN.R-project.org/package=ggdendro)|
+|ggrepel|0.9.6|[https://CRAN.R-project.org/package=ggrepel](https://CRAN.R-project.org/package=ggrepel)|
+|glue|1.8.0|[https://glue.tidyverse.org/](https://glue.tidyverse.org/)|
+|hexbin|1.28.3|[https://CRAN.R-project.org/package=hexbin](https://CRAN.R-project.org/package=hexbin)|
+|mia|1.14.0|[https://github.com/microbiome/mia](https://github.com/microbiome/mia)|
+|phyloseq|1.50.0|[https://bioconductor.org/packages/release/bioc/html/phyloseq.html](https://bioconductor.org/packages/release/bioc/html/phyloseq.html)|
+|rcolorbrewer|1.1.3|[https://CRAN.R-project.org/package=RColorBrewer](https://CRAN.R-project.org/package=RColorBrewer)|
+|taxize|0.10.0|[https://docs.ropensci.org/taxize/](https://docs.ropensci.org/taxize/)|
+|tidyverse|2.0.0|[https://CRAN.R-project.org/package=tidyverse](https://CRAN.R-project.org/package=tidyverse)|
+|vegan|2.6-10|[https://cran.r-project.org/package=vegan](https://cran.r-project.org/package=vegan)|
+|vsn|3.74.0|[https://bioconductor.org/packages/release/bioc/html/vsn.html](https://bioconductor.org/packages/release/bioc/html/vsn.html)|
+|patchwork|1.3.0|[https://CRAN.R-project.org/package=patchwork](https://CRAN.R-project.org/package=patchwork)|
+|rstatix|0.7.2|[https://CRAN.R-project.org/package=rstatix](https://CRAN.R-project.org/package=rstatix)|
+|multcompView|0.1-10|[https://CRAN.R-project.org/package=multcompView](https://CRAN.R-project.org/package=multcompView)|
+|scales|1.4.0|[https://CRAN.R-project.org/package=scales](https://CRAN.R-project.org/package=scales)|
+|dendextend|1.19.0|[https://CRAN.R-project.org/package=dendextend](https://CRAN.R-project.org/package=dendextend)|
+
+# Reference databases used
+
+
+|Program used|Database|DECIPHER Link|GeneLab Figshare Link|GeneLab Download Date|
+|:-----------|:------:|:------------|--------------------:|--------------------:|
+|DECIPHER| SILVA SSU r138_2 | [https://www2.decipher.codes/data/Downloads/TrainingSets/SILVA_SSU_r138_2_2024.RData](https://www2.decipher.codes/data/Downloads/TrainingSets/SILVA_SSU_r138_2_2024.RData) |[SILVA_SSU_r138_2_2024.RData](https://figshare.com/ndownloader/files/52846199)| 03/06/2025 |
+|DECIPHER| UNITE v2024 | [https://www2.decipher.codes/data/Downloads/TrainingSets/UNITE_v2024_April2024.RData](https://www2.decipher.codes/data/Downloads/TrainingSets/UNITE_v2024_April2024.RData) | [UNITE_v2024_April2024.RData](https://figshare.com/ndownloader/files/52846346)| 03/06/2025 |
+|DECIPHER| PR2 v4.13 | [https://www2.decipher.codes/data/Downloads/TrainingSets/PR2_v4_13_March2021.RData](https://www2.decipher.codes/data/Downloads/TrainingSets/PR2_v4_13_March2021.RData) | [PR2_v4_13_March2021.RData](https://figshare.com/ndownloader/files/46241917)| 05/10/2024 |
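+
+The training sets listed above can be retrieved directly from the persistent GeneLab Figshare links, for example with `curl` as in the sketch below. The local file names simply mirror the DECIPHER file names and are an illustrative choice, not a pipeline requirement.
+
+```
+# download the DECIPHER training sets from the GeneLab Figshare links above
+curl -L -o SILVA_SSU_r138_2_2024.RData https://figshare.com/ndownloader/files/52846199
+curl -L -o UNITE_v2024_April2024.RData https://figshare.com/ndownloader/files/52846346
+curl -L -o PR2_v4_13_March2021.RData https://figshare.com/ndownloader/files/46241917
+```
+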
+---
+
+# General processing overview with example commands
+
+> Exact processing commands for specific datasets are available in the [GLDS_Processing_Scripts](../GLDS_Processing_Scripts) sub-directory of this repository, and/or are provided with their processed data in the [Open Science Data Repository (OSDR)](https://osdr.nasa.gov/bio/repo/).
+>
+> Output files listed in **bold** below are included with each Amplicon Seq processed dataset in the [Open Science Data Repository (OSDR)](https://osdr.nasa.gov/bio/repo/).
+
+---
+
+## 1. Raw Data QC
+
+
+
+### 1a. Raw Data QC
+
+```bash
+fastqc -o raw_fastqc_output *.fastq.gz
+```
+
+**Parameter Definitions:**
+
+* `-o` – the output directory to store results
+* `*.fastq.gz` – the input reads are specified as a positional argument, and can be given all at once with wildcards like this, or as individual arguments with spaces in between them
+
+**Input Data:**
+
+* \*fastq.gz (raw reads)
+
+**Output Data:**
+
+* \*fastqc.html (FastQC output html summary)
+* \*fastqc.zip (FastQC output data)
+
+
+
+
+### 1b. Compile Raw Data QC
+
+```bash
+multiqc --interactive -n raw_multiqc_GLAmpSeq -o /path/to/raw_multiqc/output/raw_multiqc_GLAmpSeq_report /path/to/directory/containing/raw_fastqc/files
+
+zip -r raw_multiqc_GLAmpSeq_report.zip raw_multiqc_GLAmpSeq_report
+```
+
+**Parameter Definitions:**
+
+- `--interactive` – force reports to use interactive plots
+- `-n` – prefix name for output files
+- `-o` – the output directory to store results
+- `/path/to/directory/containing/raw_fastqc/files` – the directory holding the output data from the FastQC run, provided as a positional argument
+
+**Input Data:**
+
+* \*fastqc.zip (FastQC output data, output from [Step 1a](#1a-raw-data-qc))
+
+**Output Data:**
+
+* **raw_multiqc_GLAmpSeq_report.zip** (zip containing the following)
+ * **raw_multiqc_GLAmpSeq.html** (multiqc output html summary)
+ * **raw_multiqc_GLAmpSeq_data** (directory containing multiqc output data)
+
+
+
+---
+
+## 2. Trim Primers
+
+The location and orientation of primers in the data is important to understand in deciding how to do this step. `cutadapt` has many options for primer identification and removal, which are described in detail in the [cutadapt adapter type documentation](https://cutadapt.readthedocs.io/en/stable/guide.html#adapter-types).
+
+The following example commands show how it was done for some samples of [GLDS-200](https://osdr.nasa.gov/bio/repo/data/studies/OSD-200), which was 2x250 sequencing of the 16S gene using these primers:
+* forward: 5'-GTGCCAGCMGCCGCGGTAA-3'
+* reverse: 5'-GGACTACVSGGGTATCTAAT-3'
+
+Due to the size of the target amplicon and the type of sequencing done here, both the forward and reverse primers are expected to be present on each forward and reverse read. The cutadapt command therefore takes "linked" primers as input for the forward and reverse reads, specified in the example command below by the `...` between them. It also expects the primers to start at the first position of the reads ("anchored"), specified with the leading `^` characters in the example command below.
+
+The following website is useful for reverse complementing primers and dealing with degenerate bases appropriately: [http://arep.med.harvard.edu/labgc/adnan/projects/Utilities/revcomp.html](http://arep.med.harvard.edu/labgc/adnan/projects/Utilities/revcomp.html)
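+
+If preferred, the reverse complement can also be obtained programmatically in R. The following is a minimal, illustrative sketch (not part of the workflow commands) that assumes the Biostrings package, a dependency of DADA2 and DECIPHER, is available; the primer shown is the GLDS-200 reverse primer from above:
+
+```R
+library(Biostrings)
+
+# Reverse complement the reverse primer; IUPAC degenerate bases are handled (V -> B, S -> S)
+reverseComplement(DNAString("GGACTACVSGGGTATCTAAT"))
+# ATTAGATACCCSBGTAGTCC -- the sequence that follows the "..." in the -a option below
+```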
+
+```bash
+cutadapt -a ^GTGCCAGCMGCCGCGGTAA...ATTAGATACCCSBGTAGTCC -A ^GGACTACVSGGGTATCTAAT...TTACCGCGGCKGCTGGCAC \
+ -o sample1_R1_trimmed.fastq.gz -p sample1_R2_trimmed.fastq.gz sample1_R1_raw.fastq.gz sample1_R2_raw.fastq.gz \
+ --discard-untrimmed
+```
+
+**Parameter Definitions:**
+
+* `-a` – specifies the primers and orientations expected on the forward reads (when primers are linked as noted above)
+* `-A` – specifies the primers and orientations expected on the reverse reads (when primers are linked as noted above)
+* `-o` – specifies file path/name of forward, primer-trimmed reads
+* `-p` – specifies file path/name of reverse, primer-trimmed reads
+* `sample1_R1_raw.fastq.gz` – this and following "R2" file are positional arguments specifying the forward and reverse reads, respectively, for input
+* `--discard-untrimmed` – this filters out those reads where the primers were not found as expected
+
+**Input Data:**
+
+* \*fastq.gz (raw reads)
+
+**Output Data:**
+
+* **\*trimmed.fastq.gz** (trimmed reads)
+* **trimmed-read-counts_GLAmpSeq.tsv** (per sample read counts before and after trimming)
+* **cutadapt_GLAmpSeq.log** (log file of standard output and error from cutadapt)
+
+
+
+---
+
+## 3. Quality Filtering
+> The following is run in an R environment.
+
+The specific settings required will depend on the dataset being processed. These include parameters such as `truncLen`, which might depend on the target amplicon and its size, and `maxEE`, which might depend on the quality of the sequencing run. For instance, when working with ITS data, it may be preferable to omit the `truncLen` parameter if the amplified target region is expected to vary in length beyond the read size. More information on these parameters can be found at these sites:
+* [https://benjjneb.github.io/dada2/tutorial.html](https://benjjneb.github.io/dada2/tutorial.html)
+* [https://astrobiomike.github.io/amplicon/dada2_workflow_ex](https://astrobiomike.github.io/amplicon/dada2_workflow_ex)
+
+
+The following is an example from a [GLDS-200](https://osdr.nasa.gov/bio/repo/data/studies/OSD-200) sample that used paired-end 2x250 sequencing with the following 16S primers:
+* forward: 5'-GTGCCAGCMGCCGCGGTAA-3'
+* reverse: 5'-GGACTACVSGGGTATCTAAT-3'
+
+```R
+filtered_out <- filterAndTrim(fwd="sample1_R1_trimmed.fastq.gz", filt="sample1_R1_filtered.fastq.gz",
+ rev="sample1_R2_trimmed.fastq.gz", filt.rev="sample1_R2_filtered.fastq.gz",
+ truncLen=c(220, 160), maxN=0, maxEE=c(2,2),
+ truncQ=2, rm.phix=TRUE, compress=TRUE, multithread=TRUE)
+```
+
+**Parameter Definitions:**
+
+* `filtered_out <-` – specifies the variable that will store the summary results within our R environment
+* `filterAndTrim()` – the DADA2 function we are calling, with the following parameters set within it
+* `fwd=` – specifying the path to the forward reads, here "sample1_R1_trimmed.fastq.gz"
+* `filt=` – specifying the path to where the output forward reads will be written
+* `rev=` – specifying the path to the reverse reads, here "sample1_R2_trimmed.fastq.gz"; only applicable if paired-end
+* `filt.rev=` – specifying the path to where the output reverse reads will be written; only applicable if paired-end
+* `truncLen=c(220, 160)` – specifying the forward reads to be truncated at 220 bp, and the reverse to be truncated at 160 bps (note that this parameter also functions as a minimum-length filter); would only have 1 value if not paired-end
+* `maxN=0` – setting the maximum allowed Ns to 0, any reads with an N will be filtered out
+* `maxEE=c(2,2)` – setting maximum expected error allowed to 2 for each forward and reverse read; would only have 1 value if not paired-end
+* `truncQ=2` – looking from the lower-quality end of each read, truncate at the first base with a quality score lower than 2
+* `rm.phix=TRUE` – filter out reads with exact kmers matching the PhiX genome
+* `compress=TRUE` – gzip-compress the output filtered reads
+* `multithread=TRUE` – determine number of cores available and run in parallel when possible (can also take an integer specifying the number to run)
+
+**Input Data:**
+
+* \*.trimmed.fastq.gz (primer-trimmed reads, output from [Step 2](#2-trim-primers))
+
+**Output Data:**
+
+* **\*filtered.fastq.gz** (filtered reads)
+* **filtered-read-counts_GLAmpSeq.tsv** (a tab-separated file containing per sample read counts before and after filtering)
+
+
+
+---
+
+## 4. Filtered Data QC
+
+
+
+### 4a. Filtered Data QC
+```bash
+fastqc -o filtered_fastqc_output/ *filtered.fastq.gz
+```
+
+**Parameter Definitions:**
+
+* `-o` – the output directory to store results
+* `*filtered.fastq.gz` – the input reads are specified as a positional argument, and can be given all at once with wildcards like this, or as individual arguments with spaces in between them
+
+**Input Data:**
+
+* \*fastq.gz (filtered reads)
+
+**Output Data:**
+
+* \*fastqc.html (FastQC output html summary)
+* \*fastqc.zip (FastQC output data)
+
+
+
+### 4b. Compile Filtered Data QC
+```bash
+multiqc --interactive -n filtered_multiqc_GLAmpSeq -o /path/to/filtered_multiqc/output/filtered_multiqc_GLAmpSeq_report /path/to/directory/containing/filtered_fastqc/files
+
+zip -r filtered_multiqc_GLAmpSeq_report.zip filtered_multiqc_GLAmpSeq_report
+```
+
+**Parameter Definitions:**
+
+- `--interactive` – force reports to use interactive plots
+- `-n` – prefix name for output files
+- `-o` – the output directory to store results
+- `/path/to/directory/containing/filtered_fastqc/files` – the directory holding the output data from the FastQC run, provided as a positional argument
+
+**Input Data:**
+
+* \*fastqc.zip (FastQC output data, output from [Step 4a](#4a-filtered-data-qc))
+
+**Output Data:**
+
+* **filtered_multiqc_GLAmpSeq_report.zip** (zip containing the following)
+ * **filtered_multiqc_GLAmpSeq.html** (multiqc output html summary)
+ * **filtered_multiqc_GLAmpSeq_data** (directory containing multiqc output data)
+
+
+
+---
+
+## 5. Calculate Error Model, Apply DADA2 Algorithm, Assign Taxonomy, and Create Output Tables
+> The following is run in an R environment.
+
+These example commands as written assume paired-end data, with notes included on what would be different if working with single-end data. The taxonomy reference database used below is an example only, suitable for the example 16S dataset ([GLDS-200](https://osdr.nasa.gov/bio/repo/data/studies/OSD-200)) used here. Other taxonomy reference databases designed for DECIPHER can be found here: [https://www2.decipher.codes/data/Downloads/TrainingSets/](https://www2.decipher.codes/data/Downloads/TrainingSets/)
+
+
+
+### 5a. Learning the Error Rates
+```R
+## Forward error rates ##
+forward_errors <- learnErrors(fls="sample1_R1_filtered.fastq.gz", multithread=TRUE)
+
+## Reverse error rates (skip if single-end data) ##
+reverse_errors <- learnErrors(fls="sample1_R2_filtered.fastq.gz", multithread=TRUE)
+```
+
+**Parameter Definitions:**
+
+* `learnErrors()` – the DADA2 function we are calling, with the following parameters set within it
+* `fls=` – specifies the path to the filtered reads (either forward or reverse)
+* `multithread=TRUE` – determine number of cores available and run in parallel when possible (can also take an integer specifying the number of cores to use)
+
+**Input Data:**
+
+* \*filtered.fastq.gz (filtered reads, output from [Step 3](#3-quality-filtering))
+
+**Output Data:**
+
+* `forward_errors` (a named list containing a numeric matrix with the forward error rates)
+* `reverse_errors` (a named list containing a numeric matrix with the reverse error rates (only for paired-end data))
+
+
+
+### 5b. Inferring Sequences
+```R
+## Inferring forward sequences ##
+forward_seqs <- dada(derep="sample1_R1_filtered.fastq.gz", err=forward_errors, pool="pseudo", multithread=TRUE)
+
+## Inferring reverse sequences (skip if single-end)##
+reverse_seqs <- dada(derep="sample1_R2_filtered.fastq.gz", err=reverse_errors, pool="pseudo", multithread=TRUE)
+```
+
+**Parameter Definitions:**
+
+* `dada()` – the DADA2 function we are calling, with the following parameters set within it
+* `derep=` – the path to the filtered reads (either forward or reverse)
+* `err=` – the object holding the error profile for the inferred reads (either forward or reverse)
+* `pool="pseudo"` – setting the method of incorporating information from multiple samples, "pseudo" instructs the algorithm to perform pseudo-pooling between individually processed samples
+* `multithread=TRUE` – determine number of cores available and run in parallel when possible (can also take an integer specifying the number of cores to use)
+
+**Input Data:**
+
+* \*filtered.fastq.gz (filtered reads, output from [Step 3](#3-quality-filtering))
+* `forward_errors` (a named list containing a numeric matrix with the forward error rates, output from [Step 5a](#5a-learning-the-error-rates))
+* `reverse_errors` (a named list containing a numeric matrix with the reverse error rates, output from [Step 5a](#5a-learning-the-error-rates) (only for paired-end))
+
+**Output Data:**
+
+* `forward_seqs` (a dada-class object containing the forward-read inferred sequences)
+* `reverse_seqs` (a dada-class object containing the reverse-read inferred sequences (only for paired-end))
+
+
+
+### 5c. Merging Forward and Reverse Reads; Skip if Data are Single-End
+```R
+merged_contigs <- mergePairs(dadaF=forward_seqs, derepF="sample1_R1_filtered.fastq.gz", dadaR=reverse_seqs, derepR="sample1_R2_filtered.fastq.gz")
+```
+
+**Parameter Definitions:**
+
+* `merged_contigs <-` – specifies the variable that will store the results within our R environment
+* `mergePairs()` – the DADA2 function we are calling, with the following parameters set within it
+* `dadaF=` – specifies the object holding the forward-read inferred sequences
+* `derepF=` – specifies the path to the filtered forward reads
+* `dadaR=` – specifies the object holding the reverse-read inferred sequences
+* `derepR=` – specifies the path to the filtered reverse reads
+
+**Input Data:**
+
+* \*filtered.fastq.gz (filtered reads, output from [Step 3](#3-quality-filtering))
+* `forward_seqs` (a dada-class object containing forward-read inferred sequences, output from [Step 5b](#5b-inferring-sequences))
+* `reverse_seqs` (a dada-class object containing reverse-read inferred sequences, output from [Step 5b](#5b-inferring-sequences))
+
+**Output Data:**
+
+* `merged_contigs` (a dataframe containing the merged contigs)
+
+
+
+### 5d. Generating Sequence Table with Counts per Sample
+```R
+seqtab <- makeSequenceTable(merged_contigs)
+```
+
+**Parameter Definitions:**
+
+* `seqtab <-` - specifies the variable that will store the results within our R environment
+* `makeSequenceTable()` - the DADA2 function we are calling, with either `merged_contigs` for paired-end data (as in this example) or `forward_seqs` for single-end data as input.
+
+**Input Data:**
+
+* `merged_contigs` or `forward_seqs` (a variable containing the merged contigs for paired-end data, output from [Step 5c](#5c-merging-forward-and-reverse-reads-skip-if-data-are-single-end), or the forward-read inferred sequences for single-end data, output from [Step 5b](#5b-inferring-sequences))
+
+**Output Data:**
+
+* `seqtab` (a named integer matrix containing the sequence table)
+
+
+
+### 5e. Removing Putative Chimeras
+```R
+seqtab.nochim <- removeBimeraDenovo(unqs=seqtab, method="consensus", multithread=TRUE)
+```
+
+**Parameter Definitions:**
+
+* `seqtab.nochim <-` – specifies the variable that will store the results within our R environment
+* `removeBimeraDenovo()` – the DADA2 function we are calling, with the following parameters set within it
+* `unqs=` – specifying the `seqtab` object created above
+* `method=` – specifying the method for putative-chimera identification and removal, "consensus" instructs the function to check the samples in the sequence table independently for bimeras and make a consensus decision on each sequence variant
+* `multithread=TRUE` – determine number of cores available and run in parallel when possible (can also take an integer specifying the number to run)
+
+**Input Data:**
+
+* `seqtab` (a named integer matrix containing the sequence table, output from [Step 5d](#5d-generating-sequence-table-with-counts-per-sample))
+
+**Output Data:**
+
+* `seqtab.nochim` (a named integer matrix containing the sequence table filtered to exclude putative chimeras)
+
+
+
+### 5f. Assigning Taxonomy
+
+```R
+## Creating a DNAStringSet object from the ASVs: ##
+dna <- DNAStringSet(getSequences(seqtab.nochim))
+
+## Downloading the reference R taxonomy object: ##
+download.file(url = "https://figshare.com/ndownloader/files/52846199",
+ destfile = "SILVA_SSU_r138_2_2024.RData",
+ method = "libcurl",
+ headers = c("User-Agent" = "Mozilla/5.0"))
+
+## Loading taxonomy object: ##
+load("SILVA_SSU_r138_2_2024.RData")
+
+## Classifying sequences: ##
+tax_info <- IdTaxa(test=dna, trainingSet=trainingSet, strand="both", processors=NULL)
+```
+
+**Parameter Definitions:**
+
+- `download.file()` - the R utils function used to download the taxonomy database file
+ - `url=` - reference database URL address to download
+ - `destfile=` - local path/name for the downloaded file
+ - `method=` - specifies the download method to use
+ - `headers=` - HTTP headers to pass with the download request
+- `IdTaxa()` - the DECIPHER function used to classify the sequences
+ - `test=dna` - DNAStringSet object holding sequences to classify
+ - `trainingSet=trainingSet` - specifies the reference database to use
+ - `strand="both"` - specifies to check taxonomy assignment in both orientations
+ - `processors=NULL` - specifies the number of processors to use, `NULL` indicates to use all available cores or an integer may be provided to manually specify the number to use
+
+**Input Data:**
+
+* `seqtab.nochim` (a named integer matrix containing the filtered sequence table, output from [Step 5e](#5e-removing-putative-chimeras))
+* `trainingSet` (a variable provided in the RData object containing the reference database, SILVA_SSU_r138_2_2024.RData)
+
+**Output Data:**
+
+* `tax_info` (the DECIPHER Taxa object containing assigned taxons)
+
+
+
+### 5g. Generating and Writing Standard Outputs
+
+```R
+## Giving sequences more manageable names (e.g. ASV_1, ASV_2, …,): ##
+asv_seqs <- colnames(seqtab.nochim)
+asv_headers <- vector(dim(seqtab.nochim)[2], mode="character")
+
+for (i in 1:dim(seqtab.nochim)[2]) {
+ asv_headers[i] <- paste(">ASV", i, sep="_")
+}
+
+## Making then writing a fasta of final ASV seqs: ##
+asv_fasta <- c(rbind(asv_headers, asv_seqs))
+write(asv_fasta, "ASVs_GLAmpSeq.fasta")
+
+## Making then writing a count table: ##
+asv_tab <- t(seqtab.nochim)
+row.names(asv_tab) <- sub(">", "", asv_headers)
+
+write.table(asv_tab, "counts_GLAmpSeq.tsv", sep="\t", quote=F, col.names=NA)
+
+## Creating table of taxonomy and setting any that are unclassified as "NA": ##
+ranks <- c("domain", "phylum", "class", "order", "family", "genus", "species")
+tax_tab <- t(sapply(tax_info, function(x) {
+ m <- match(ranks, x$rank)
+ taxa <- x$taxon[m]
+ taxa[startsWith(taxa, "unclassified_")] <- NA
+ taxa
+}))
+colnames(tax_tab) <- ranks
+rownames(tax_tab) <- gsub(pattern=">", replacement="", x=asv_headers)
+
+write.table(tax_tab, "taxonomy_GLAmpSeq.tsv", sep = "\t", quote=F, col.names=NA)
+
+## Generating then writing biom file format: ##
+biom_object <- make_biom(data=asv_tab, observation_metadata=tax_tab)
+write_biom(biom_object, "taxonomy-and-counts_GLAmpSeq.biom")
+
+## Making a combined taxonomy and count table ##
+tax_and_count_tab <- merge(tax_tab, asv_tab, by="row.names")
+write.table(tax_and_count_tab, "taxonomy-and-counts_GLAmpSeq.tsv", sep="\t", quote=FALSE, row.names=FALSE)
+```
+
+**Input Data:**
+
+* `seqtab.nochim` (a named integer matrix containing the filtered sequence table, output from [Step 5e](#5e-removing-putative-chimeras))
+* `tax_info` (the DECIPHER Taxa object containing assigned taxons, output from [Step 5f](#5f-assigning-taxonomy))
+
+**Output Data:**
+
+* **ASVs_GLAmpSeq.fasta** (a fasta file containing the inferred sequences)
+* **counts_GLAmpSeq.tsv** (a tab-separated file containing the sample feature count table)
+* **taxonomy_GLAmpSeq.tsv** (a tab-separated file containing the taxonomy table)
+* **taxonomy-and-counts_GLAmpSeq.tsv** (a tab-separated file containing the combined taxonomy and count table)
+* **taxonomy-and-counts_GLAmpSeq.biom** (a biom-formatted file containing the count and taxonomy table)
+* **read-count-tracking_GLAmpSeq.tsv** (a tab-separated file containing the read counts at each processing step; see the sketch below for one way this table can be assembled)
+
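+The read-count-tracking table is not generated by the commands shown above. The following is a minimal, single-sample sketch (object names follow the examples in Steps 3 and 5; the column names are illustrative only) of how the counts at each step could be tallied:
+
+```R
+## Helper to count reads in dada-class and merger objects ##
+getN <- function(x) sum(getUniques(x))
+
+## Tally read counts at each processing step for the example sample ##
+read_tracking <- data.frame(
+  sample       = "sample1",
+  input        = filtered_out[1, "reads.in"],   # reads entering quality filtering
+  filtered     = filtered_out[1, "reads.out"],  # reads passing quality filtering
+  denoised_fwd = getN(forward_seqs),
+  denoised_rev = getN(reverse_seqs),            # omit if single-end
+  merged       = getN(merged_contigs),          # omit if single-end
+  non_chimeric = sum(seqtab.nochim)
+)
+
+write.table(read_tracking, "read-count-tracking_GLAmpSeq.tsv",
+            sep="\t", quote=FALSE, row.names=FALSE)
+```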
+
+
+---
+
+## 6. Amplicon Seq Data Analysis Set Up
+
+
+
+### 6a. Create Sample Runsheet
+
+> Note: Rather than running the command below to create the runsheet needed for processing, the runsheet may also be created manually by following the examples for [Paired-end](https://github.com/nasa/GeneLab_AmpliconSeq_Workflow/blob/main/examples/runsheet/PE_file.csv) and [Single-end](https://github.com/nasa/GeneLab_AmpliconSeq_Workflow/blob/main/examples/runsheet/SE_file.csv) samples. When creating this table manually, the most important columns for the analyses below are:
+
+* `sample_id` - column with unique sample names.
+* `groups` - column with the groups/treatments that each sample belongs to. This column is used to define the groups compared in the analyses below.
+
+```bash
+### Download the *ISA.zip file from the OSDR ###
+dpt-get-isa-archive \
+ --accession OSD-###
+
+### Parse the metadata from the *ISA.zip file to create a sample runsheet ###
+dpt-isa-to-runsheet --accession OSD-### \
+ --config-type amplicon \
+ --config-version Latest \
+ --isa-archive *ISA.zip
+```
+
+**Parameter Definitions:**
+
+* `--accession OSD-###` - OSD accession ID (replace ### with the OSD number being processed), used to retrieve the URLs for the ISA archive and raw reads hosted on the OSDR
+* `--config-type` - instructs the script to extract the metadata required for Amplicon Sequencing data processing from the ISA archive
+* `--config-version` - specifies the `dp-tools` configuration version to use, a value of `Latest` will specify the most recent version
+* `--isa-archive` - specifies the *ISA.zip file for the respective OSD dataset, downloaded in the `dpt-get-isa-archive` command
+
+**Input Data:**
+
+* *No input data required, but the OSD accession ID needs to be indicated, which is used to download the respective ISA archive*
+
+**Output Data:**
+
+* *ISA.zip (compressed ISA directory containing Investigation, Study, and Assay (ISA) metadata files for the respective OSD dataset, used to define sample groups - the *ISA.zip file is located in the [OSDR](https://osdr.nasa.gov/bio/repo/) under 'Files' -> 'Study Metadata Files')
+
+* **{OSD-Accession-ID}_amplicon_v{version}_runsheet.csv** (a comma-separated sample metadata file containing sample group information, version denotes the dp_tools schema used to specify the metadata to extract from the ISA archive)
+ > NOTE: if there are multiple valid Amplicon Sequencing assays in the dataset, then multiple runsheets will be generated (1 for each assay). The runsheet filenames will also include the value from the "Parameter Value[Library Selection]" column as well as the assay table name in between the OSD Accession ID and the `config_type`. For example, for OSD-268, which has both "16S" and "ITS" assays, two files are generated: OSD-268_16S_a_OSD-268_amplicon-sequencing_16s_illumina_amplicon_v1_runsheet.csv and OSD-268_ITS_a_OSD-268_amplicon-sequencing_its_illumina_amplicon_v1_runsheet.csv.
+
+
+
+> The remainder of this document is performed in R.
+
+
+### 6b. R Environment Set Up
+
+
+#### 6b.i. Load Libraries
+
+```R
+library(vegan)
+library(phyloseq)
+library(glue)
+library(FSA)
+library(multcompView)
+library(rstatix)
+library(patchwork)
+library(RColorBrewer)
+library(DESeq2)
+library(ggdendro)
+library(broom)
+library(ggrepel)
+library(tools)
+library(ANCOMBC)
+library(DescTools)
+library(taxize)
+library(mia)
+library(utils)
+library(scales)
+library(tidyverse)
+library(vsn)
+library(hexbin)
+```
+
+#### 6b.ii. Define Custom Functions
+
+#### calculate_text_size()
+
+ calculates text size for plotting based on number of samples and minimum text size
+
+ ```R
+ calculate_text_size <- function(num_samples, start_samples = 25, min_size = 3) {
+ max_size <- 11 # Maximum size for up to start_samples
+ slope <- -0.15
+
+ if (num_samples <= start_samples) {
+ return(max_size)
+ } else {
+ # Calculate the current size with the hard coded slope
+ current_size = max_size + slope * (num_samples - start_samples)
+
+ # Ensure the size doesn't go below the minimum
+ return(max(current_size, min_size))
+ }
+ }
+ ```
+
+ **Function Parameter Definitions:**
+ - `num_samples=` - the number of samples to plot
+ - `start_samples=25` - the number of samples to start with, used to specify text size
+ - `min_size=3` - the minimum text size for plotting
+
+ **Returns:** a numeric value representing the calculated text size
+
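+ A hypothetical usage (the sample number shown is illustrative only):
+
+ ```R
+ # For a study with 60 samples: 11 + (-0.15) * (60 - 25) = 5.75
+ label_size <- calculate_text_size(num_samples = 60)
+ ```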
+
+#### expandy()
+
+ wrapper around `ggplot2::expand_limits` that expands a plot's y-limit based on a specified set of y-values
+
+ ```R
+ expandy <- function(vec, ymin=NULL) {
+ # vec [NUMERIC] - a numeric vector of y values.
+
+ max.val <- max(vec, na.rm=TRUE) + 0.1
+
+ expand_limits(y=c(ymin, max.val))
+ }
+ ```
+
+ **Function Parameter Definitions:**
+ - `vec=` - a numeric vector containing y-values
+ - `ymin=` - the minimum y-limit
+
+ **Returns:** a ggplot2 layer (via `expand_limits()`) that, when added to a plot, expands the y-axis limits to include the supplied values
+
+
+
+#### transform_phyloseq()
+
+ create a phyloseq object with the appropriate sample count transformation depending on the supplied transformation method ('rarefy' or 'vst')
+
+ ```R
+ transform_phyloseq <- function( feature_table, metadata, method, rarefaction_depth=500){
+
+ # Rarefaction
+ if(method == 'rarefy'){
+ # Create phyloseq object
+ ASV_physeq <- phyloseq(otu_table(feature_table, taxa_are_rows = TRUE),
+ sample_data(metadata))
+
+ # Get the count for every sample sorted in ascending order
+ seq_per_sample <- colSums(feature_table) %>% sort()
+ # Minimum sequences/count value
+ depth <- min(seq_per_sample)
+
+ # Loop through the sequences per sample and return the count
+ # nearest to the minimum required rarefaction depth
+ for (count in seq_per_sample) {
+ # Get the count equal to rarefaction_depth or nearest to it
+ if(count >= rarefaction_depth) {
+ depth <- count
+ break
+ }
+ }
+
+ #----- Rarefy sample counts to even depth per sample
+ ps <- rarefy_even_depth(physeq = ASV_physeq,
+ sample.size = depth,
+ rngseed = 1,
+ replace = FALSE,
+ verbose = FALSE)
+
+ # Variance Stabilizing Transformation
+ }else if(method == "vst"){
+
+ # Using deseq
+ # Keep only ASVs with at least 1 count
+ feature_table <- feature_table[rowSums(feature_table) > 0, ]
+ # Add +1 pseudocount for VST for vst transformation
+ feature_table <- feature_table + 1
+
+ # Make the order of samples in metadata match the order in feature table
+ metadata <- metadata[colnames(feature_table),]
+
+ # Create VST normalized counts matrix
+ # ~1 means no design
+ deseq_counts <- DESeqDataSetFromMatrix(countData = feature_table,
+ colData = metadata,
+ design = ~1)
+ deseq_counts_vst <- varianceStabilizingTransformation(deseq_counts)
+ vst_trans_count_tab <- assay(deseq_counts_vst)
+
+ # Making a phyloseq object with our transformed table
+ vst_count_phy <- otu_table(object = vst_trans_count_tab, taxa_are_rows = TRUE)
+ sample_info_tab_phy <- sample_data(metadata)
+ ps <- phyloseq(vst_count_phy, sample_info_tab_phy)
+ }else{
+ stop("Please supply a valid normalization method, either 'rarefy' or 'vst' ")
+ }
+ return(ps)
+ }
+ ```
+ **Function Parameter Definitions:**
+ - `feature_table=` - a dataframe containing feature/ASV counts with samples as columns and features as rows
+ - `metadata=` - a dataframe containing sample metadata with samples as row names and sample info as columns
+ - `method=` - a string specifying the transformation to use: either "rarefy" (rarefaction) or "vst" (variance stabilizing transformation)
+ - `rarefaction_depth=500` - the minimum rarefaction depth; samples are rarefied to the smallest per-sample count that is at least this value (or to the overall minimum count if no sample reaches it)
+
+ **Returns:** a phyloseq object created from the feature_table, metadata, and specified transformation method
+
+
+#### make_dendrogram()
+
+ compute hierarchical clustering and create a dendrogram
+
+ ```R
+ make_dendrogram <- function(dist_obj, metadata, groups_colname,
+ group_colors, legend_title){
+
+ # Hierarchical Clustering
+ sample_clust <- hclust(d = dist_obj, method = "ward.D2")
+
+ # Extract clustering data for plotting
+ hcdata <- dendro_data(sample_clust, type = "rectangle")
+ segment_data <- segment(hcdata) # specifications for tree structure
+ label_data <- label(hcdata) %>%
+ left_join(metadata %>%
+ rownames_to_column("label")) # Labels are sample names
+
+ # Plot dendrogram
+ dendrogram <- ggplot() +
+ # Plot tree
+ geom_segment(data = segment_data,
+ aes(x = x, y = y, xend = xend, yend = yend)
+ ) +
+ # Add sample text labels to tree
+ geom_text(data = label_data ,
+ aes(x = x, y = y, label = label,
+ color = !!sym(groups_colname) , hjust = 0),
+ size = 4.5, key_glyph = "rect") +
+ scale_color_manual(values = group_colors) +
+ coord_flip() +
+ scale_y_reverse(expand = c(0.2, 0)) +
+ labs(color = legend_title) +
+ theme_dendro() +
+ guides(colour = guide_legend(override.aes = list(size = 5)))+
+ theme(legend.key = element_rect(fill=NA),
+ text = element_text(face = 'bold'),
+ legend.title = element_text(size = 12, face='bold'),
+ legend.text = element_text(face = 'bold', size = 11))
+
+ return(dendrogram)
+ }
+ ```
+ **Function Parameter Definitions:**
+ - `dist_obj=` - a distance object of class 'dist' holding the calculated distance (Euclidean, Bray-Curtis, etc.) between samples
+ - `metadata=` - a dataframe containing sample metadata with samples as row names and sample info as columns
+ - `groups_colname=` - name of the column in the metadata dataframe to use for specifying sample groups
+ - `group_colors=` - a named character vector of colors for each group in `groups_colname`
+ - `legend_title=` - legend title to use for plotting
+
+ **Returns:** a dendrogram plot
+
+
+#### run_stats()
+
+ run variance and adonis (analysis of variance using distance matrices) tests
+
+ ```R
+ run_stats <- function(dist_obj, metadata, groups_colname){
+
+ # Retrieve sample names from the dist object
+ samples <- attr(dist_obj, "Label")
+ # Subset metadata to contain only samples in the dist_obj
+ metadata <- metadata[samples,]
+
+ # Run variance test and present the result in a nicely formatted table / dataframe
+ variance_test <- betadisper(d = dist_obj,
+ group = metadata[[groups_colname]]) %>%
+ anova() %>% # Make results anova-like
+ broom::tidy() %>% # make the table 'tidy'
+ mutate(across(where(is.numeric), ~round(.x, digits = 2))) # round-up numeric columns
+
+ # Run Adonis test
+ adonis_res <- adonis2(formula = dist_obj ~ metadata[[groups_colname]])
+
+ adonis_test <- adonis_res %>%
+ broom::tidy() %>% # Make tidy table
+ mutate(across(where(is.numeric), ~round(.x, digits = 2))) # round-up numeric columns
+
+ # Return a named list with the variance and adonis test results
+ return(list(variance = variance_test, adonis = adonis_test))
+ }
+ ```
+ **Function Parameter Definitions:**
+ - `dist_obj=` - distance object of class 'dist' holding the calculated distance (Euclidean, Bray-Curtis, etc.) between samples
+ - `metadata=` - a dataframe containing sample metadata with samples as row names and sample info as columns
+ - `groups_colname=` - string specifying the name of the column in the metadata dataframe to use for specifying sample groups
+
+ **Returns:** a named list containing the variance and adonis test results as dataframes
+
+
+#### plot_pcoa()
+
+ generate a Principal Coordinate Analysis (PCoA) plot using the phyloseq::ordinate function
+
+ ```R
+ plot_pcoa <- function(ps, stats_res, distance_method,
+ groups_colname, group_colors, legend_title,
+ addtext=FALSE) {
+
+ # Generating a PCoA with phyloseq
+ pcoa <- ordinate(physeq = ps, method = "PCoA", distance = distance_method)
+ eigen_vals <- pcoa$values$Eigenvalues
+
+ # Calculate the percentage of variance
+ percent_variance <- eigen_vals / sum(eigen_vals) * 100
+
+ # Retrieving plot labels
+ r2_value <- stats_res$adonis[["R2"]][1]
+ prf_value <- stats_res$adonis[["p.value"]][1]
+ label_PC1 <- sprintf("PC1 [%.1f%%]", percent_variance[1])
+ label_PC2 <- sprintf("PC2 [%.1f%%]", percent_variance[2])
+
+ # Retrieving pcoa vectors
+ vectors_df <- pcoa$vectors %>%
+ as.data.frame() %>%
+ rownames_to_column("samples")
+ # Creating a dataframe for plotting
+ plot_df <- sample_data(ps) %>%
+ as.matrix() %>%
+ as.data.frame() %>%
+ rownames_to_column("samples") %>%
+ select(samples, !!groups_colname) %>%
+ right_join(vectors_df, join_by("samples"))
+ # Plot pcoa
+ p <- ggplot(plot_df, aes(x=Axis.1, y=Axis.2,
+ color=!!sym(groups_colname),
+ label=samples)) +
+ geom_point(size=1)
+
+ # Add text
+ if(addtext){
+ p <- p + geom_text(show.legend = FALSE,
+ hjust = 0.3, vjust = -0.4, size = 4)
+ }
+
+ # Add annotations to pcoa plot
+ p <- p + labs(x = label_PC1, y = label_PC2, color = legend_title) +
+ coord_fixed(sqrt(eigen_vals[2]/eigen_vals[1])) +
+ scale_color_manual(values = group_colors) +
+ theme_bw() + theme(text = element_text(size = 15, face="bold"),
+ legend.direction = "vertical",
+ legend.justification = "center",
+ legend.title = element_text(hjust=0.1)) +
+ annotate("text", x = Inf, y = -Inf,
+ label = paste("R2:", toString(round(r2_value, 3))),
+ hjust = 1.1, vjust = -2, size = 4)+
+ annotate("text", x = Inf, y = -Inf,
+ label = paste("Pr(>F)", toString(round(prf_value,4))),
+ hjust = 1.1, vjust = -0.5, size = 4) + ggtitle("PCoA")
+
+ return(p)
+ }
+ ```
+
+ **Function Parameter Definitions:**
+ - `ps=` - phyloseq object constructed from feature, taxonomy, and metadata tables
+ - `stats_res=` - named list containing the variance and adonis test results as dataframes, generated using [run_stats()](#run_stats)
+ - `distance_method=` - string specifying the method used to calculate the distance between samples; values can be "euclidean" (Euclidean distance) or "bray" (Bray-Curtis dissimilarity)
+ - `groups_colname=` - string specifying the name of the column in the metadata dataframe to use for specifying sample groups
+ - `group_colors=` - named character vector of colors for each group in `groups_colname`
+ - `legend_title=` - string specifying the legend title to use for plotting
+ - `addtext=FALSE` - boolean value specifying if the sample labels should be added to the pcoa plot
+
+ **Returns:** a PCoA plot
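+
+ As an illustration of how these helpers might be chained together (a hypothetical sketch, not an exact command from the workflow), a transformed phyloseq object can feed the distance, statistics, dendrogram, and PCoA functions defined above; the `feature_table`, `metadata`, `groups_colname`, and `group_colors` objects are assumed to come from the set-up steps in [6b.iv](#6biv-read-in-input-tables) and [6b.v](#6bv-preprocessing) below:
+
+ ```R
+ # Hypothetical chained usage of the helper functions defined above
+ ps <- transform_phyloseq(feature_table, metadata, method = "vst")
+ dist_obj <- phyloseq::distance(ps, method = "euclidean")
+
+ stats_res <- run_stats(dist_obj, metadata, groups_colname)
+ dendro <- make_dendrogram(dist_obj, metadata, groups_colname,
+                           group_colors, legend_title = "Groups")
+ pcoa_plot <- plot_pcoa(ps, stats_res, distance_method = "euclidean",
+                        groups_colname = groups_colname, group_colors = group_colors,
+                        legend_title = "Groups", addtext = FALSE)
+ ```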
+
+
+#### remove_rare_features()
+
+ filter out rare features from a feature table by occurrence in a fraction of samples depending on the supplied cut-off
+
+ ```R
+ remove_rare_features <- function(feature_table, cut_off_percent=0.75){
+
+ # Filter by occurrence in a fraction of samples
+ # Define a cut-off for determining what's rare
+ cut_off <- cut_off_percent * ncol(feature_table)
+ # Get the occurrence for each feature
+ feature_occurrence <- rowSums(feature_table > 0)
+ # Get names of the abundant features
+ abund_features <- names(feature_occurrence[feature_occurrence >= cut_off])
+ # Remove rare features
+ abun_features.m <- feature_table[abund_features,]
+ return(abun_features.m)
+ }
+ ```
+**Function Parameter Definitions:**
+ - `feature_table=` - dataframe containing feature/ASV counts with samples as columns and features as rows
+ - `cut_off_percent=0.75` - decimal value between 0.001 and 1 specifying the fraction of the total number of samples to determine the most abundant features; by default it removes features that are not present in 3/4 of the total number of samples
+
+ **Returns:** a dataframe of feature/ASV counts filtered to include only features passing the specified threshold
+
+
+#### process_taxonomy()
+
+ reformat taxonomy table to regularize naming/formatting. Removes extraneous prefixes/suffixes from taxonomy names; replaces empty values with "Other".
+
+ ```R
+ process_taxonomy <- function(taxonomy, prefix='\\w__') {
+
+ # Ensure that all columns are of character data type
+ taxonomy <- apply(X = taxonomy, MARGIN = 2, FUN = as.character)
+
+ # Loop over every column (rank, i.e. domain to species) and make the necessary edits
+ for (rank in colnames(taxonomy)) {
+ # Delete the taxonomy prefix
+ taxonomy[,rank] <- gsub(pattern = prefix, x = taxonomy[, rank],
+ replacement = '')
+ # Delete _number at the end of taxonomy names inserted by the new version of DECIPHER
+ taxonomy[,rank] <- gsub(pattern ="_[0-9]+$", x = taxonomy[, rank], replacement = '')
+
+ indices <- which(is.na(taxonomy[,rank]))
+ taxonomy[indices, rank] <- rep(x = "Other", times=length(indices))
+ # replace empty cell with the string 'Other'
+ indices <- which(taxonomy[,rank] == "")
+ taxonomy[indices,rank] <- rep(x = "Other", times=length(indices))
+ }
+ # Replace _ with space
+ taxonomy <- apply(X = taxonomy, MARGIN = 2,
+ FUN = gsub, pattern = "_", replacement = " ") %>%
+ as.data.frame(stringsAsFactors = FALSE)
+ return(taxonomy)
+ }
+ ```
+ **Function Parameter Definitions:**
+ - `taxonomy=` - dataframe of ASV taxonomy assignments to be processed
+ - `prefix='\\w__'` - a regular expression specifying the characters to remove from taxon names; use '\\w__' for greengenes and 'D_\\d__' for SILVA
+
+ **Returns:** a dataframe containing the reformatted ASV taxonomy assignments
+
+
+#### format_taxonomy_table()
+
+ reformat taxonomy table by replacing a known placeholder name (e.g. "Other") with the name from the previous (higher) rank plus a suffix
+
+ ```R
+ format_taxonomy_table <- function(taxonomy, stringToReplace="Other", suffix=";Other") {
+
+ for (taxa_index in seq_along(taxonomy)) {
+
+ indices <- grep(x = taxonomy[,taxa_index], pattern = stringToReplace)
+
+ taxonomy[indices,taxa_index] <-
+ paste0(taxonomy[indices,taxa_index-1],
+ rep(x = suffix, times=length(indices)))
+
+ }
+ return(taxonomy)
+ }
+ ```
+ **Function Parameter Definitions:**
+ - `taxonomy=` - dataframe of ASV taxonomy assignments to be processed
+ - `stringToReplace="Other"` - specifies the string to replace, "Other" is used by default
+ - `suffix=";Other"` - specifies the replacement string, ";Other" is used by default
+
+ **Returns:** a dataframe containing the reformatted ASV taxonomy assignments
+
+
+#### fix_names()
+
+ reformat taxonomy table by appending a set of suffixes to a set of known names
+
+ ```R
+ fix_names<- function(taxonomy,stringToReplace,suffix){
+
+ for(index in seq_along(stringToReplace)){
+ taxonomy <- format_taxonomy_table(taxonomy = taxonomy,
+ stringToReplace=stringToReplace[index],
+ suffix=suffix[index])
+ }
+ return(taxonomy)
+ }
+ ```
+ **Custom Functions Used:**
+ - [format_taxonomy_table()](#format_taxonomy_table)
+
+ **Function Parameter Definitions:**
+ - `taxonomy=` - dataframe of ASV taxonomy assignments to be processed
+ - `stringToReplace=` - a vector of regex strings specifying what to replace in the taxonomy dataframe
+ - `suffix=` - a vector of regex strings specifying the replacement strings to use
+
+ **Returns:** a dataframe containing the reformatted ASV taxonomy assignments
+
+
+#### make_feature_table()
+
+ generate taxon level count matrix based on a taxonomy table and an existing feature table, only retains taxa found in at least one sample
+
+ ```R
+ make_feature_table <- function(count_matrix,taxonomy,
+ taxon_level, samples2keep=NULL){
+
+ feature_counts_df <- data.frame(taxon_level=taxonomy[,taxon_level],
+ count_matrix, check.names = FALSE,
+ stringsAsFactors = FALSE)
+
+ feature_counts_df <- aggregate(.~taxon_level,data = feature_counts_df,
+ FUN = sum)
+ rownames(feature_counts_df) <- feature_counts_df[,"taxon_level"]
+ feature_table <- feature_counts_df[,-1]
+ # Retain only taxa found in at least one sample
+ taxa2keep <- rowSums(feature_table) > 0
+ feature_table <- feature_table[taxa2keep,]
+
+ if(!is.null(samples2keep)){
+ feature_table <- feature_table[,samples2keep]
+ # Retain only taxa found in at least one sample
+ taxa2keep <- rowSums(feature_table) > 0
+ feature_table <- feature_table[taxa2keep,]
+ }
+ return(feature_table)
+ }
+ ```
+ **Function Parameter Definitions:**
+ - `count_matrix=` - a dataframe containing ASVs or OTUs and their respective counts
+ - `taxonomy=` - dataframe of ASV taxonomy assignments to be processed
+ - `taxon_level=` - a string specifying the taxonomic rank to collapse counts to, i.e. one of domain through species
+ - `samples2keep=NULL` - a character vector of sample names to keep; the default value of NULL keeps all samples
+
+ **Returns:** a dataframe containing a taxon level count matrix filtered for taxa found in at least one sample
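+
+ A hypothetical usage, assuming the `feature_table` and `taxonomy_table` dataframes prepared in [6b.iv](#6biv-read-in-input-tables) and [6b.v](#6bv-preprocessing) below:
+
+ ```R
+ # Collapse ASV counts to phylum-level counts (illustrative only)
+ phylum_table <- make_feature_table(count_matrix = feature_table,
+                                    taxonomy = taxonomy_table,
+                                    taxon_level = "phylum")
+ ```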
+
+
+#### group_low_abund_taxa()
+
+ group rare taxa or return a table with the rare taxa
+
+ ```R
+ group_low_abund_taxa <- function(abund_table, threshold=0.05,
+ rare_taxa=FALSE) {
+
+ # Initialize an empty vector that will contain the indices for the
+ # low abundance columns/ taxa to group
+ taxa_to_group <- c()
+ #initialize the index variable of species with low abundance (taxa/columns)
+ index <- 1
+
+ # Loop over every column or taxa then check to see if the max abundance is less than the set threshold
+ # if true, save the index in the taxa_to_group vector variable
+
+ for (column in seq_along(abund_table)){
+ if(max(abund_table[, column], na.rm = TRUE) < threshold ){
+ taxa_to_group[index] <- column
+ index = index + 1
+ }
+ }
+ if(is.null(taxa_to_group)){
+ message(glue::glue("Rare taxa were not grouped. Please provide a threshold higher than {threshold} to group rare taxa."))
+ return(abund_table)
+ }
+
+ if(rare_taxa){
+ abund_table <- abund_table[, taxa_to_group, drop=FALSE]
+ }else{
+ #remove the low abundant taxa or columns
+ abundant_taxa <-abund_table[, -(taxa_to_group), drop=FALSE]
+ #get the rare taxa
+ # rare_taxa <-abund_table[, taxa_to_group]
+ rare_taxa <- subset(x = abund_table, select = taxa_to_group)
+ #get the proportion of each sample that makes up the rare taxa
+ rare <- rowSums(rare_taxa)
+ #bind the abundant taxa to the rare taxa
+ abund_table <- cbind(abundant_taxa, rare)
+ #rename the columns i.e the taxa
+ colnames(abund_table) <- c(colnames(abundant_taxa), "Rare")
+ }
+
+ return(abund_table)
+ }
+ ```
+ **Function Parameter Definitions:**
+ - `abund_table=` - relative abundance matrix with taxa as columns and samples as rows
+ - `threshold=0.05` - a number between 0.001 and 1 specifying the threshold for grouping rare taxa
+ - `rare_taxa=FALSE` - boolean specifying if only rare taxa should be returned, if set to TRUE then a table with only the rare taxa will be returned
+
+ **Returns:** a dataframe containing a relative abundance matrix with taxa as columns and samples as rows
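+
+ A hypothetical usage, building on the `phylum_table` sketch above; note that this function expects relative abundances with samples as rows and taxa as columns:
+
+ ```R
+ # Convert per-sample counts to relative abundances and orient samples as rows
+ phylum_rel_abund <- t(apply(phylum_table, 2, function(x) x / sum(x))) %>% as.data.frame()
+
+ # Group taxa that never exceed 5% relative abundance in any sample into "Rare"
+ grouped_abund <- group_low_abund_taxa(phylum_rel_abund, threshold = 0.05)
+ ```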
+
+
+#### collapse_samples()
+
+ collapse samples in a feature table with a user-defined function based on group in metadata
+
+ ```R
+ collapse_samples <- function(taxon_table, metadata, group, fun=sum, convertToRelativeAbundance=FALSE){
+
+ common.ids <- intersect(rownames(taxon_table), rownames(metadata))
+ metadata <- droplevels(metadata[common.ids, , drop=FALSE])
+ taxon_table <- taxon_table[common.ids, , drop=FALSE]
+ taxon_table <- cbind(subset(x = metadata, select=group), taxon_table)
+
+ taxon_table <- aggregate(reformulate(termlabels = group, response = '.'),
+ data = taxon_table, FUN = fun)
+ rownames(taxon_table) <- taxon_table[, 1]
+ taxon_table <- taxon_table[,-1]
+ if(convertToRelativeAbundance){
+ taxon_table <- t(apply(X = taxon_table, MARGIN = 1, FUN = function(x) x/sum(x)))
+ }
+
+ final <- list(taxon_table,metadata)
+ names(final) <- c("taxon_table","metadata")
+ return(final)
+ }
+ ```
+ **Function Parameter Definitions:**
+ - `taxon_table=` - a dataframe of feature counts with samples as rows and features (e.g. ASVs or OTUs) as columns
+ - `metadata=` - a dataframe containing sample metadata with samples as row names and sample info as columns
+ - `group=` - a string specifying the column in the metadata dataframe containing the sample groups that will be used to collapse the samples
+ - `fun=sum` - the R function to apply when collapsing the samples; `sum` is used by default
+ - `convertToRelativeAbundance=FALSE` - a boolean specifying whether or not the value in the taxon table should be converted to per sample relative abundance values
+
+ **Returns:** a named list containing two dataframes: `taxon_table` and `metadata`
+ - a dataframe of aggregated feature counts by group
+ - a dataframe containing group specific metadata for the aggregated feature count
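+
+ A hypothetical usage, assuming the `feature_table` and `metadata` objects prepared in [6b.iv](#6biv-read-in-input-tables) below, and that the group column is named "groups" as in the runsheet (the feature table is transposed because this function expects samples as rows):
+
+ ```R
+ # Mean relative abundance of each ASV per group (illustrative only)
+ collapsed <- collapse_samples(taxon_table = as.data.frame(t(feature_table)),
+                               metadata = metadata,
+                               group = "groups",
+                               fun = mean,
+                               convertToRelativeAbundance = TRUE)
+ group_taxon_table <- collapsed$taxon_table
+ ```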
+
+
+#### get_ncbi_ids()
+
+ retrieve NCBI taxonomy id for a given taxonomy name using taxize
+
+ ```R
+ taxize_options(ncbi_sleep = 0.8)
+ get_ncbi_ids <- function(taxonomy, target_region){
+
+ if(target_region == "ITS"){
+ search_string <- "fungi"
+ }else if(target_region == "18S"){
+ search_string <- "eukaryote"
+ }else{
+ search_string <- "bacteria"
+ }
+
+ uid <- get_uid(taxonomy, division_filter = search_string)
+ tax_ids <- uid[1:length(uid)]
+ return(tax_ids)
+
+ }
+ ```
+ **Function Parameter Definitions:**
+ - `taxonomy=` - a character vector of taxonomy names that will be used to search for the respective NCBI IDs
+ - `target_region=` - amplicon target region to analyze; options are "16S", "18S", or "ITS"
+
+ **Returns:** a vector of NCBI taxonomic identifiers (UIDs); NA is returned for any name with no match
+
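+ A hypothetical usage, assuming the cleaned `taxonomy_table` from [6b.v](#6bv-preprocessing) below and a 16S dataset (this queries NCBI over the network):
+
+ ```R
+ # Look up NCBI taxonomy UIDs for the unique phylum names
+ phylum_ids <- get_ncbi_ids(unique(taxonomy_table$phylum), target_region = "16S")
+ ```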
+
+#### find_bad_taxa()
+
+ error handling function for ANCOMBC::ancombc2
+
+ ```R
+ find_bad_taxa <- function(cnd){
+ split_res <- strsplit(conditionMessage(cnd), "\n")
+
+ if(split_res == "replacement has 0 rows, data has 1" ||
+ split_res == "All taxa contain structural zeros") {
+
+ return(
+ list(res=data.frame(taxon=split_res, lfc=NA, se=NA,
+ W=NA, p=NA, q=NA, diff=NA, pass_ss=NA))
+ )
+ }
+
+ bad_taxa <- split_res[[c(1L, 2L)]]
+ bad_taxa <- .subset2(strsplit(bad_taxa, ", "), 1L)
+ return(bad_taxa)
+ }
+ ```
+ **Function Parameter Definitions:**
+ - `cnd=` - specifies the error condition to catch when running the ANCOMBC::ancombc2 function
+
+ **Returns:** a list containing one empty dataframe named 'res' with the same format as the ANCOMBC::ancombc2 primary result
+
+
+#### ancombc2()
+
+ wrapper around ANCOMBC::ancombc2() function that adds error handling
+
+ ```R
+ ancombc2 <- function(data, ...) {
+
+ tryCatch(
+ ANCOMBC::ancombc2(data = data, ...),
+ error = function(cnd) {
+
+ res <- find_bad_taxa(cnd)
+ if( is.data.frame(res[[1]]) ){
+ # Returns a manually created empty data.frame
+ return(res)
+ }else{
+ # Returns the names of the bad taxa to exclude from further analysis
+ bad_taxa <- res # renaming for readability
+ }
+
+ # Second error catcher in case it fails the first one
+ tryCatch(
+ ANCOMBC::ancombc2(data = data[!rownames(data) %in% bad_taxa, ], ...),
+
+ error = function(cnd) {
+ # Returns a manually created empty data.frame
+ find_bad_taxa(cnd)
+ })
+ }
+ )
+ }
+ ```
+ **Custom Functions Used:**
+ - [find_bad_taxa()](#find_bad_taxa)
+
+ **Function Parameter Definitions:**
+ - `data=` - specifies the TreeSummarizedExperiment containing the feature, taxonomy, and metadata to be analyzed using ancombc2
+ - `...` - Other arguments passed on to ancombc2
+
+ **Returns:** an ancombc2 result, or an empty result as returned by [find_bad_taxa()](#find_bad_taxa)
+
+
+#### gm_mean()
+
+ calculates the geometric mean
+
+ ```R
+ gm_mean <- function(x, na.rm=TRUE) {
+ exp(sum(log(x[x > 0]), na.rm=na.rm) / length(x))
+ }
+ ```
+ **Function Parameter Definitions:**
+ - `x=` - a numeric vector specifying the values to calculate the geometric mean on
+ - `na.rm=TRUE` - boolean specifying if NAs should be removed prior to calculating the geometric mean; default is TRUE
+
+ **Returns:** a numeric value representing the geometric mean
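+
+ A hypothetical usage, e.g. on per-sample library sizes from the `feature_table` of [6b.iv](#6biv-read-in-input-tables) below:
+
+ ```R
+ # Geometric mean of the total counts per sample (zero counts are excluded by the function)
+ gm_mean(colSums(feature_table))
+ ```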
+
+
+#### plotSparsity()
+
+ Plots ASV sparsity. A modification of DESeq2::plotSparsity to generate a ggplot.
+
+ ```R
+ plotSparsity <- function (x, normalized = TRUE, feature="ASV", ...) {
+
+
+ if (is(x, "DESeqDataSet")) {
+
+ x <- counts(x, normalized = normalized)
+ }
+
+ rs <- MatrixGenerics::rowSums(x)
+ rmx <- apply(x, 1, max)
+
+ # Prepare plot dataframe
+ df <- data.frame(rs=rs, rmx=rmx) %>%
+ mutate(x=rs, y=rmx/rs) %>%
+ filter(x>0)
+
+ # Plot
+ ggplot(data = df, aes(x=x, y=y), ...) +
+ geom_point(size=3) +
+ scale_x_log10() +
+ scale_y_continuous(limits = c(0,1)) +
+ theme_bw() +
+ labs(title = glue("Concentration of {feature} counts over total sum of {feature} counts"),
+ x=glue("Sum of counts per {feature}"),
+ y=glue("Max {feature} count / Sum of {feature} counts")) +
+ theme(axis.text = element_text(face = "bold", size = 12),
+ axis.title = element_text(face = "bold", size = 14),
+ title = element_text(face = "bold", size = 14))
+
+ }
+
+ ```
+ **Function Parameter Definitions:**
+ - `x=` - a matrix or DESeqDataSet to plot
+ - `normalized=TRUE` - boolean specifying whether to normalize the counts from a DESeqDataSet; default is TRUE
+ - `feature=` - a string specifying which feature type ("ASV", "OTU", "gene", etc.) is being plotted; default is "ASV"
+ - `...=` - any named argument(s) that can be passed to the ggplot2::ggplot function.
+
+ **Returns:** a sparsity plot of type ggplot
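+
+ A hypothetical usage on the raw ASV count table (any count matrix can be supplied; a DESeqDataSet would also work):
+
+ ```R
+ # Sparsity plot: max ASV count / total ASV count vs. total ASV count, on the unnormalized table
+ sparsity_plot <- plotSparsity(as.matrix(feature_table), feature = "ASV")
+ ```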
+
+
+
+
+
+#### 6b.iii. Set Variables
+
+```R
+# Define a custom palette for plotting
+custom_palette <- c("#1F78B4","#33A02C","#FB9A99","#E31A1C","#6A3D9A",
+ "#FDBF6F", "#FF7F00","#CAB2D6","#FF00FFFF", "#B15928",
+ "#000000","#FFC0CBFF", "#A6CEE3", "#8B864EFF","#F0027F",
+ "#666666","#1B9E77", "#E6AB02","#A6761D","#FFFF00FF",
+ "#00FFFFFF", "#FFFF99", "#B2182B","#FDDBC7","#D1E5F0",
+ "#B2DF8A","#CC0033","#FF00CC","#330033", "#999933",
+ "#FF9933", "#FFFAFAFF",colors())
+
+# Remove white colors
+pattern_to_filter <- "white|snow|azure|gray|#FFFAFAFF|aliceblue"
+custom_palette <- custom_palette[-c(21:23, grep(pattern = pattern_to_filter,
+ x = custom_palette,
+ ignore.case = TRUE))]
+# Custom theme for plotting
+publication_format <- theme_bw() +
+ theme(panel.grid = element_blank()) +
+ theme(axis.ticks.length=unit(-0.15, "cm"),
+ axis.text.x=element_text(margin=ggplot2::margin(t=0.5,r=0,b=0,l=0,unit ="cm")),
+ axis.text.y=element_text(margin=ggplot2::margin(t=0,r=0.5,b=0,l=0,unit ="cm")),
+ axis.title = element_text(size = 18,face ='bold.italic', color = 'black'),
+ axis.text = element_text(size = 16,face ='bold', color = 'black'),
+ legend.position = 'right',
+ legend.title = element_text(size = 15,face ='bold', color = 'black'),
+ legend.text = element_text(size = 14,face ='bold', color = 'black'),
+ strip.text = element_text(size = 14,face ='bold', color = 'black'))
+```
+
+**Input Data:**
+
+*No input data required*
+
+**Output Data:**
+
+* `custom_palette` (a vector of strings specifying a custom color palette for coloring plots)
+* `publication_format` (a ggplot::theme object specifying the custom theme for plotting)
+
+
+
+#### 6b.iv. Read-in Input Tables
+
+```R
+diff_abund_out_dir <- "differential_abundance/"
+if(!dir.exists(diff_abund_out_dir)) dir.create(diff_abund_out_dir)
+assay_suffix <- "_GLAmpSeq"
+output_prefix <- ""
+custom_palette <- {COLOR_VECTOR}
+groups_colname <- "groups"
+sample_colname <- "Sample Name"
+metadata_file <- file.path("{OSD-Accession-ID}_AmpSeq_v{version}_runsheet.csv")
+features_file <- file.path("counts_GLAmpSeq.tsv")
+taxonomy_file <- file.path("taxonomy_GLAmpSeq.tsv")
+
+# Read-in metadata and convert from tibble to dataframe
+metadata <- read_csv(file = metadata_file) %>% as.data.frame()
+# Set row names
+row.names(metadata) <- metadata[[sample_colname]]
+# Write out Sample Table
+write_csv(x = metadata %>% select(!!sym(sample_colname), !!sym(groups_colname)),
+ file = glue("{diff_abund_out_dir}{output_prefix}SampleTable{assay_suffix}.csv"))
+
+# Delete sample column since the rownames now contain sample names
+metadata[,sample_colname] <- NULL
+# Get unique group names
+group_column_values <- metadata %>% pull(!!sym(groups_colname))
+group_levels <- unique(group_column_values)
+
+# Write out table listing contrasts used for all differential abundance methods
+# Get pairwise combinations
+pairwise_comp.m <- utils::combn(group_levels, 2)
+# Create comparison names
+comparisons <- paste0("(", pairwise_comp.m[2,], ")v(", pairwise_comp.m[1,], ")")
+names(comparisons) <- comparisons
+# Create contrasts table
+contrasts_df <- data.frame(
+ " " = c("1", "2"),
+ rbind(pairwise_comp.m[2,], pairwise_comp.m[1,]) %>% as.data.frame() %>% setNames(comparisons),
+ check.names = FALSE
+)
+write_csv(x = contrasts_df,
+ file = glue("{diff_abund_out_dir}{output_prefix}contrasts{assay_suffix}.csv"))
+
+# Add colors to metadata that equals the number of groups
+num_colors <- length(group_levels)
+palette <- 'Set1'
+number_of_colors_in_palette <- 9
+if(num_colors <= number_of_colors_in_palette){
+ colors <- RColorBrewer::brewer.pal(n = num_colors, name = palette)
+}else{
+ colors <- custom_palette[1:num_colors]
+}
+
+# ------ Metadata ----- #
+# Assign color names to each group
+group_colors <- setNames(colors, group_levels)
+metadata <- metadata %>%
+ mutate(color = map_chr(!!sym(groups_colname),
+ function(group) { group_colors[group] }
+ )
+ ) # assign group specific colors to each row in metadata
+
+# Retrieve sample names
+sample_names <- rownames(metadata)
+deseq2_sample_names <- make.names(sample_names, unique = TRUE)
+
+# Subset metadata to contain only the groups and color columns
+sample_info_tab <- metadata %>%
+ select(!!groups_colname, color) %>% # select groups and color columns
+ arrange(!!sym(groups_colname)) # metadata by groups column
+
+# Retrieves unique colors
+values <- sample_info_tab %>% pull(color) %>% unique()
+
+# ---- Import Feature or ASV table ---- #
+feature_table <- read.table(file = features_file, header = TRUE,
+ row.names = 1, sep = "\t",
+ check.names = FALSE)
+
+# ---- Import Taxonomy table ---- #
+taxonomy_table <- read.table(file = taxonomy_file, header = TRUE,
+ row.names = 1, sep = "\t",
+ check.names = FALSE)
+```
+
+**Input Data:**
+
+* `diff_abund_out_dir` (a string specifying the path to the output folder for the differential abundance results, default is "differential_abundance/")
+* `assay_suffix` (a string specifying the suffix to be added to output files; default is the GeneLab assay suffix, "_GLAmpSeq")
+* `output_prefix` (a string specifying an additional prefix to be added to the output files; default is no additional prefix, "")
+* `groups_colname` (a string specifying the name of the column in the metadata table containing the group names)
+* `sample_colname` (a string specifying the name of the column in the metadata table containing the sample names)
+* `custom_palette` (a vector of strings specifying a custom color palette for coloring plots, output from [6b.iii. Set Variables](#6biii-set-variables))
+* {OSD-Accession-ID}_AmpSeq_v{version}_runsheet.csv (a comma-separated sample metadata file containing sample group information, output from [Step 6a](#6a-create-sample-runsheet))
+* counts_GLAmpSeq.tsv (a tab-separated file containing sample feature counts table (i.e. ASV or OTU table), output from [Step 5g](#5g-generating-and-writing-standard-outputs))
+* taxonomy_GLAmpSeq.tsv (a tab-separated file containing feature taxonomy table containing ASV taxonomy assignments, output from [Step 5g](#5g-generating-and-writing-standard-outputs))
+
+**Output Data:**
+
+* `metadata` (a dataframe containing the sample metadata, with samples as row names and sample info as columns)
+* `feature_table` (a dataframe containing the sample feature counts table (i.e. ASV or OTU table) from the input counts file)
+* `taxonomy_table` (a dataframe containing ASV taxonomy assignments from the input taxonomy file)
+* `sample_info_tab` (a dataframe containing a subset of the metadata dataframe with only the groups and color columns)
+* `values` (a character vector of unique color values for each group)
+* `sample_names` (a character vector of sample names)
+* `deseq2_sample_names` (a character vector of unique sample names)
+* `group_colors` (a named character vector of colors for each group)
+* `group_levels` (a character vector of unique group names)
+* **differential_abundance/SampleTable_GLAmpSeq.csv** (a comma-separated file containing a table with two columns: "Sample Name" and "groups"; the output_prefix denotes the method used to compute the differential abundance)
+* **differential_abundance/contrasts_GLAmpSeq.csv** (a comma-separated file listing all pairwise group comparisons)
+
+
+
+#### 6b.v. Preprocessing
+Filters the feature and taxonomy tables to include only features that (a) pass the specified prevalence and library count thresholds and (b) are not from Chloroplast or Mitochondrial Organelle contamination.
+
+```R
+feature_table <- {DATAFRAME} # from step [Read-in Input Tables]
+taxonomy_table <- {DATAFRAME} # from step [Read-in Input Tables]
+target_region <- "16S" # 16S, 18S, or ITS
+remove_rare <- FALSE # TRUE OR FALSE
+prevalence_cutoff <- 0
+library_cutoff <- 0
+
+
+if(remove_rare){
+
+ # Remove samples with less than library-cutoff
+ message(glue("Dropping samples with less than {library_cutoff} read counts"))
+ feature_table <- feature_table[,colSums(feature_table) >= library_cutoff]
+ # Remove rare ASVs
+ message(glue("Dropping features with prevalence less than {prevalence_cutoff * 100}%"))
+ feature_table <- remove_rare_features(feature_table,
+ cut_off_percent = prevalence_cutoff)
+}
+
+# Preprocess ASV and taxonomy tables
+
+message(glue("There are {sum(is.na(taxonomy_table$domain))} features without
+ taxonomy assignments. Dropping them..."))
+
+# Dropping features that couldn't be assigned taxonomy
+# For beta and alpha diversity only, unassigned ASVs are not dropped in DA analyses
+taxonomy_table <- taxonomy_table[!is.na(taxonomy_table$domain),]
+
+# Handle case where no domain was assigned but a phylum was.
+if(all(is.na(taxonomy_table$domain))){
+
+ if(target_region == "ITS"){
+ taxonomy_table$domain <- "Fungi"
+ }else if(target_region == "18S"){
+ taxonomy_table$domain <- "Eukaryotes"
+ }else{
+ taxonomy_table$domain <- "Bacteria"
+ }
+
+}
+
+# Removing Chloroplast and Mitochondria Organelle DNA contamination
+asvs2drop <- taxonomy_table %>%
+ unite(col="taxonomy",domain:species) %>%
+ filter(str_detect(taxonomy, "[Cc]hloroplast|[Mm]itochondria")) %>%
+ row.names()
+taxonomy_table <- taxonomy_table[!(rownames(taxonomy_table) %in% asvs2drop),]
+
+# Clean taxonomy names
+feature_names <- rownames(taxonomy_table)
+taxonomy_table <- process_taxonomy(taxonomy_table)
+rownames(taxonomy_table) <- feature_names
+taxonomy_table <- fix_names(taxonomy_table, "Other", ";_")
+
+# Subset tables
+
+# Get features common to the taxonomy and feature table
+common_ids <- intersect(rownames(feature_table), rownames(taxonomy_table))
+
+# Subset the feature and taxonomy tables to contain
+# only features found in both tables
+feature_table <- feature_table[common_ids,]
+taxonomy_table <- taxonomy_table[common_ids,]
+
+# drop samples with zero sequence counts
+samples2keep <- colnames(feature_table)[colSums(feature_table) > 0]
+
+feature_table <- feature_table[, samples2keep]
+metadata <- metadata[samples2keep,]
+```
+**Custom Functions Used:**
+
+* [remove_rare_features()](#remove_rare_features)
+* [process_taxonomy()](#process_taxonomy)
+* [fix_names()](#fix_names)
+
+**Parameter Definitions:**
+
+* `remove_rare` - boolean specifying if rare features and samples should be filtered out based on the `prevalence_cutoff` and `library_cutoff` cutoff thresholds, respectively, prior to analysis; default is FALSE
+* `prevalence_cutoff` - a decimal between 0 and 1 specifying the proportion of samples required to contain a taxon in order to keep the taxon when `remove_rare` is set to TRUE; default is 0, i.e. do not exclude any taxon / feature
+* `library_cutoff` - a numerical value specifying the number of total counts a sample must have across all features to be retained when `remove_rare` is set to TRUE; default is 0, i.e. no samples will be dropped
+* `target_region` - a string specifying the amplicon target region; options are either "16S", "18S", or "ITS"
+
+**Input Data:**
+
+* `feature_table` (a dataframe containing sample feature counts (i.e. ASV or OTU table), output from [6b.iv. Read-in Input Tables](#6biv-read-in-input-tables))
+* `taxonomy_table` (a dataframe containing ASV taxonomy assignments, output from [6b.iv. Read-in Input Tables](#6biv-read-in-input-tables))
+
+**Output Data:**
+
+* `feature_table` (a dataframe containing a filtered subset of the sample feature counts (i.e. ASV or OTU table) after removing features that do not meet the filtering thresholds or that belong to Chloroplast or Mitochondrial organelles, and samples with zero sequence counts)
+* `taxonomy_table` (a dataframe containing a filtered subset of the feature taxonomy table after removing ASV taxonomy assignments for features that do not meet the filtering thresholds or that belong to Chloroplast or Mitochondrial organelles)
+
+
+
+---
+
+## 7. Alpha Diversity Analysis
+
+Alpha diversity examines the variety and abundance of taxa within individual samples. Rarefaction curves are utilized to
+visually represent this diversity, plotting the number of unique sequences (ASVs) identified against the total number of
+sequences sampled, offering a perspective on the saturation and completeness of sampling. Metrics such as Observed
+features and the Shannon diversity index are used to quantify richness (the total number of unique sequences) and
+diversity (a combination of richness and evenness) within these samples, respectively.
+
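+To make the distinction between richness and diversity concrete, below is a minimal sketch (not part of the pipeline) that computes the two metrics for a single hypothetical sample using the vegan package; the toy counts and object names are illustrative only.
+
+```R
+# Minimal illustration (not pipeline code): richness vs. diversity for one sample
+library(vegan)
+
+# Hypothetical ASV counts for a single sample
+toy_counts <- c(ASV1 = 120, ASV2 = 30, ASV3 = 5, ASV4 = 0)
+
+# Richness: the number of ASVs detected (Observed features)
+observed_features <- sum(toy_counts > 0)
+
+# Diversity: the Shannon index combines richness and evenness
+shannon_index <- diversity(toy_counts, index = "shannon")
+
+observed_features # 3
+shannon_index     # ~0.63
+```
+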
+### 7a. Rarefaction Curves
+```R
+# Create output directory if it doesn't already exist
+alpha_diversity_out_dir <- "alpha_diversity/"
+if(!dir.exists(alpha_diversity_out_dir)) dir.create(alpha_diversity_out_dir)
+sample_info_tab <- {DATAFRAME}
+feature_table <- {DATAFRAME}
+taxonomy_table <- {DATAFRAME}
+group_colors <- {NAMED_VECTOR}
+groups_colname <- "groups"
+rarefaction_depth <- 500
+legend_title <- "Groups"
+assay_suffix <- "_GLAmpSeq"
+output_prefix <- ""
+
+# Create phyloseq object
+ASV_physeq <- phyloseq(otu_table(feature_table, taxa_are_rows = TRUE),
+ tax_table(as.matrix(taxonomy_table)),
+ sample_data(sample_info_tab))
+
+seq_per_sample <- colSums(feature_table) %>% sort()
+
+# Get rarefaction depth
+# minimum value
+depth <- min(seq_per_sample)
+
+# Error if the number of sequences per sample left after filtering is
+# insufficient for diversity analysis
+if(max(seq_per_sample) < 100){
+
+ print(seq_per_sample)
+ stop(glue("The maximum sequence count per sample ({max(seq_per_sample)}) is less than 100. \
+ Therefore, alpha diversity analysis cannot be performed."))
+}
+
+# -------------------- Rarefy sample counts to even depth per sample
+ps.rarefied <- rarefy_even_depth(physeq = ASV_physeq,
+ sample.size = depth,
+ rngseed = 1,
+ replace = FALSE,
+ verbose = FALSE)
+
+
+# ------------------- Rarefaction curve
+# Calculate a rough estimate of the sample step size for plotting.
+# This is meant to keep plotting time constant regardless of sample depth
+step <- (50*depth)/1000
+
+p <- rarecurve(t(otu_table(ps.rarefied)) %>% as.data.frame(),
+ step = step,
+ col = sample_info_tab[["color"]],
+ lwd = 2, ylab = "ASVs", cex=0.5,
+ label = FALSE, tidy = TRUE)
+
+
+sample_info_tab_names <- sample_info_tab %>% rownames_to_column("Site")
+
+p <- p %>% left_join(sample_info_tab_names, by = "Site")
+
+# Sample rarefaction curves
+
+rareplot <- ggplot(p, aes(x = Sample, y = Species,
+ group = Site, color = !!sym(groups_colname))) +
+ geom_line() +
+ scale_color_manual(values = group_colors) +
+ labs(x = "Number of Sequences", y = "Number of ASVs", color = legend_title) +
+ theme_bw() +
+ theme(legend.position = "right",
+ text = element_text(face = 'bold', size = 15),
+ legend.text = element_text(face = 'bold', size = 14),
+ legend.direction = "vertical",
+ legend.justification = "center",
+ legend.box.just = "center",
+ legend.title = element_text(size = 15, face='bold'),
+ panel.grid.major = element_blank(),
+ panel.grid.minor = element_blank(),
+ plot.margin = margin(t = 10, r = 20, b = 10, l = 10, unit = "pt"))
+
+ggsave(filename = glue("{alpha_diversity_out_dir}/{output_prefix}rarefaction_curves{assay_suffix}.png"),
+ plot=rareplot, width = 14, height = 8.33, dpi = 300, limitsize = FALSE)
+```
+
+**Input Data:**
+
+* `alpha_diversity_out_dir` (a string specifying the path to the output folder for the alpha diversity results, default is "alpha_diversity/")
+* `rarefaction_depth` (an integer specifying the minimum number of reads to simulate during rarefaction for alpha diversity estimation)
+* `groups_colname` (a string specifying the name of the column in the `sample_info_tab` table containing the group names)
+* `legend_title` (a string specifying the legend title for plotting)
+* `assay_suffix` (a string specifying the suffix to be added to output files; default is the Genelab assay suffix, "_GLAmpSeq")
+* `output_prefix` (a string specifying an additional prefix to be added to the output files; default is no additional prefix, "")
+* `sample_info_tab` (a dataframe containing a subset of the metadata dataframe with only the groups and color columns, output from [6b.iv. Read-in Input Tables](#6biv-read-in-input-tables))
+* `feature_table` (a dataframe containing a filtered subset of the samples feature dataframe (i.e. ASV), output from [6b.v. Preprocessing](#6bv-preprocessing))
+* `taxonomy_table` (a dataframe containing a filtered subset of the feature taxonomy dataframe with ASV taxonomy assignments, output from [6b.v. Preprocessing](#6bv-preprocessing))
+* `group_colors` (a named character vector of colors for each group, output from [6b.iv. Read-in Input Tables](#6biv-read-in-input-tables))
+
+**Output Data:**
+
+* `ps.rarefied` (a phyloseq object of the sample features (i.e. ASV) with feature counts derived from the `feature_table`, resampled such that all samples have the same library size)
+* **alpha_diversity/rarefaction_curves_GLAmpSeq.png** (plot containing the rarefaction curves for each sample)
+
+
+
+### 7b. Richness and Diversity Estimates
+```R
+metadata <- {DATAFRAME}
+groups_colname <- "groups"
+assay_suffix <- "_GLAmpSeq"
+output_prefix <- ""
+
+# ------------------ Richness and diversity estimates ------------------#
+
+# Statistics table
+diversity_metrics <- c("Observed", "Chao1", "Shannon", "Simpson")
+names(diversity_metrics) <- diversity_metrics
+diversity.df <- estimate_richness(ps.rarefied,
+ measures = diversity_metrics) %>%
+ select(-se.chao1) %>%
+ rownames_to_column("samples")
+
+merged_table <- metadata %>%
+ rownames_to_column("samples") %>%
+ inner_join(diversity.df)
+
+diversity_stats <- map_dfr(.x = diversity_metrics, function(metric){
+
+
+ number_of_groups <- merged_table[,groups_colname] %>% unique() %>% length()
+
+ if (number_of_groups < 2){
+ warning_file <- glue("{alpha_diversity_out_dir}{output_prefix}alpha_diversity_warning.txt")
+ original_groups <- length(unique(metadata[[groups_colname]]))
+ writeLines(
+ text = glue("Group count information:
+Original number of groups: {original_groups}
+Number of groups after filtering: {number_of_groups}
+
+There are less than two groups to compare, hence, pairwise comparisons cannot be performed.
+Please ensure that your metadata contains two or more groups to compare..."),
+ con = warning_file
+ )
+ quit(status = 0)
+ }else if(number_of_groups == 2){
+
+ df <- data.frame(y=merged_table[,metric], x=merged_table[,groups_colname]) %>%
+ wilcox_test(y~x) %>%
+ adjust_pvalue(method = "bonferroni") %>%
+ select(group1, group2, W=statistic, p, p.adj) %>%
+ mutate(Metric=metric) %>%
+ add_significance(p.col='p.adj', output.col = 'p.signif') %>%
+ select(Metric,group1, group2, W, p, p.adj, p.signif)
+
+ }else{
+
+ res <- dunnTest(merged_table[,metric],merged_table[,groups_colname])
+
+ df <- res$res %>%
+ separate(col = Comparison, into = c("group1", "group2"), sep = " - ") %>%
+ mutate(Metric=metric) %>%
+ rename(p=P.unadj, p.adj=P.adj) %>%
+ add_significance(p.col='p.adj', output.col = 'p.signif') %>%
+ select(Metric,group1, group2, Z, p, p.adj, p.signif)
+
+ }
+
+ return(df)
+})
+
+# Write diversity statistics table to file
+write_csv(x = diversity_stats,
+ file = glue("{alpha_diversity_out_dir}/{output_prefix}statistics_table{assay_suffix}.csv"))
+
+# Get different letters indicating statistically significant group comparisons for every diversity metric
+comp_letters <- data.frame(group = group_levels)
+colnames(comp_letters) <- groups_colname
+
+walk(.x = diversity_metrics, function(metric = .x) {
+
+ sub_comp <- diversity_stats %>% filter(Metric == metric)
+
+ sanitize <- function(x) gsub("-", "_", x)
+ g1 <- sanitize(sub_comp$group1)
+ g2 <- sanitize(sub_comp$group2)
+
+ safe_names <- paste(g1, g2, sep = "-")
+ orig_names <- paste(sub_comp$group1, sub_comp$group2, sep = "-")
+ safe_to_orig <- setNames(orig_names, safe_names)
+
+ p_values <- setNames(sub_comp$p.adj, safe_names)
+
+ letters <- multcompView::multcompLetters(p_values)$Letters
+ names(letters) <- safe_to_orig[names(letters)]
+
+ letters_df <- enframe(letters,
+ name = groups_colname,
+ value = glue("{metric}_letter"))
+
+ comp_letters <<- comp_letters %>% left_join(letters_df)
+})
+
+# Summary table
+diversity_table <- metadata %>%
+ rownames_to_column("samples") %>%
+ inner_join(diversity.df) %>%
+ group_by(!!sym(groups_colname)) %>%
+ summarise(N = n(), across(Observed:Simpson,
+ .fns = list(mean = mean, se = se),
+ .names = "{.col}_{.fn}")) %>%
+ mutate(across(where(is.numeric), ~round(.x, digits = 2))) %>%
+ left_join(comp_letters) %>%
+ mutate(Observed = glue("{Observed_mean} ± {Observed_se}{Observed_letter}"),
+ Chao1 = glue("{Chao1_mean} ± {Chao1_se}{Chao1_letter}"),
+ Shannon = glue("{Shannon_mean} ± {Shannon_se}{Shannon_letter}"),
+ Simpson = glue("{Simpson_mean} ± {Simpson_se}{Simpson_letter}")
+ ) %>%
+ select(-contains("_"))
+
+# Write diversity summary table to file
+write_csv(x = diversity_table,
+ file = glue("{alpha_diversity_out_dir}/{output_prefix}summary_table{assay_suffix}.csv"))
+```
+
+**Input Data:**
+
+* `groups_colname` (a string specifying the name of the column in the metadata table containing the group names)
+* `assay_suffix` (a string specifying the suffix to be added to output files; default is the Genelab assay suffix, "_GLAmpSeq")
+* `output_prefix` (a string specifying an additional prefix to be added to the output files; default is no additional prefix, "")
+* `metadata` (a dataframe containing the sample metadata, with samples as row names and sample info as columns, output from [6b.iv. Read-in Input Tables](#6biv-read-in-input-tables))
+* `ps.rarefied` (a phyloseq object of the sample features (i.e. ASV) with feature counts, resampled such that all samples have the same library size, output from [7a. Rarefaction Curves](#7a-rarefaction-curves))
+* `group_levels` (a character vector of unique group names, output from [6b.iv. Read-in Input Tables](#6biv-read-in-input-tables))
+
+**Output Data:**
+
+* **alpha_diversity/statistics_table_GLAmpSeq.csv** (a comma-separated table containing the test statistic (W for two-group Wilcoxon comparisons, Z for Dunn's test), p-value, and adjusted p-value for each pairwise comparison for all metrics evaluated: Observed, Chao1, Shannon, and Simpson)
+* **alpha_diversity/summary_table_GLAmpSeq.csv** (a comma-separated table containing the sample number and mean +/- standard error of each metric (Observed, Chao1, Shannon, and Simpson) for each group)
+
+
+
+### 7c. Plot Richness and Diversity Estimates
+
+```R
+sample_info_tab <- {DATAFRAME}
+metadata <- {DATAFRAME}
+groups_colname <- "groups"
+legend_title <- "Groups"
+assay_suffix <- "_GLAmpSeq"
+output_prefix <- ""
+
+# ------------------ Make richness by sample dot plots ---------------------- #
+
+number_of_samples <- length(rownames(sample_info_tab))
+richness_sample_label_size <- calculate_text_size(number_of_samples)
+metrics2plot <- c("Observed", "Shannon")
+names(metrics2plot) <- metrics2plot
+
+samples_order <- metadata %>% arrange(!!sym(groups_colname)) %>% rownames()
+
+richness_by_sample <- plot_richness(ps.rarefied, color = groups_colname,
+ measures = metrics2plot)
+
+richness_by_sample <- ggplot(richness_by_sample$data %>%
+ mutate(samples = factor(samples,
+ levels=samples_order)),
+ aes(x=samples, y=value, colour = !!sym(groups_colname))) +
+ geom_point() +
+ geom_errorbar(aes(ymin=value-se, ymax = value+se),
+ width=0.2, position=position_dodge(0.9)) +
+ facet_wrap(~variable, scales = "free_y") +
+ scale_color_manual(values = group_colors) +
+ theme_bw() +labs(x = NULL, color = legend_title, y="Alpha Diversity Measure") +
+ theme(
+ text = element_text(face = 'bold', size = 15),
+ legend.text = element_text(face = 'bold', size = 14),
+ legend.position = "bottom",
+ legend.direction = "vertical",
+ legend.justification = "center",
+ legend.box.just = "center",
+ legend.title = element_text(face = 'bold', size = 15, hjust = 0.09),
+ axis.text.x = element_text(angle = 90,
+ size = richness_sample_label_size,
+ vjust = 0.5, # Vertically center the text
+ hjust = 1),
+ axis.ticks.length=unit(-0.15, "cm"),
+ strip.text = element_text(size = 14,face ='bold')
+ )
+
+# Save sample plot
+ggsave(filename = glue("{alpha_diversity_out_dir}/{output_prefix}richness_and_diversity_estimates_by_sample{assay_suffix}.png"),
+ plot=richness_by_sample, width = 14, height = 8.33,
+ dpi = 300, units = "in", limitsize = FALSE)
+
+# ------------------- Make richness by group box plots ----------------------- #
+richness_by_group <- plot_richness(ps.rarefied, x = groups_colname,
+ color = groups_colname,
+ measures = metrics2plot)
+
+p <- map(.x = metrics2plot, .f = function(metric){
+
+ p <- ggplot(richness_by_group$data %>% filter(variable == metric),
+ aes(x=!!sym(groups_colname), y=value, fill=!!sym(groups_colname))
+ ) +
+ geom_point() +
+ geom_boxplot() +
+ scale_fill_manual(values = group_colors) +
+ theme_bw() + labs(fill = legend_title, x = NULL, y= metric) +
+ theme(
+ text = element_text(size = 15, face = 'bold'),
+ legend.text = element_text(face = 'bold', size = 14),
+ legend.position = "right",
+ legend.direction = "vertical",
+ legend.justification = "center",
+ legend.box.just = "center",
+ legend.title = element_text(face = 'bold', size = 15),
+ axis.text.x = element_blank(),
+ axis.ticks.length=unit(-0.15, "cm"),
+ strip.text = element_text(size = 14,face ='bold')
+ )
+
+
+ summary_table <- p$data %>%
+ select(!!sym(groups_colname), value) %>%
+ group_by(!!sym(groups_colname)) %>%
+ summarise(max=max(value), range=max(value)-min(value)) %>%
+ left_join(comp_letters %>%
+ select(!!sym(groups_colname), label= !!sym( glue("{metric}_letter") )
+ )
+ )
+ text_size <- 6
+
+ # Calculate a constant to add to the max value of each group
+ # to determine where each group text will be added
+ toAdd <- if_else(condition = max(summary_table$range) <= 5,
+ true = min(summary_table$range),
+ false = (median(summary_table$range) - min(summary_table$range)) / 20
+ )
+
+ # Add text to plot
+ p + geom_text(data=summary_table,
+ mapping = aes(y=max+toAdd, label=label, fontface = "bold"),
+ size = text_size)
+})
+
+richness_by_group <- wrap_plots(p, ncol = 2, guides = 'collect') +
+ plot_annotation(caption = "If letters are shared between two groups, then they are not significantly different (q-value > 0.05)",
+ theme = theme(plot.caption = element_text(face = 'bold.italic'))
+ )
+
+# Save group plot
+width <- 3.6 * length(group_levels)
+ggsave(filename = glue("{output_prefix}richness_and_diversity_estimates_by_group{assay_suffix}.png"),
+ plot=richness_by_group, width = width,
+ height = 8.33, dpi = 300, units = "in",
+ path = alpha_diversity_out_dir)
+
+```
+**Custom Functions Used:**
+
+* [calculate_text_size()](#calculate_text_size)
+
+**Input Data:**
+
+* `groups_colname` (a string specifying the name of the column in the metadata table containing the group names)
+* `legend_title` (a string specifying the legend title for plotting)
+* `assay_suffix` (a string specifying the suffix to be added to output files; default is the Genelab assay suffix, "_GLAmpSeq")
+* `output_prefix` (a string specifying an additional prefix to be added to the output files; default is no additional prefix, "")
+* `metadata` (a dataframe containing the sample metadata, with samples as row names and sample info as columns, output from [6b.iv. Read-in Input Tables](#6biv-read-in-input-tables))
+* `sample_info_tab` (a dataframe containing a subset of the metadata dataframe with only the groups and color columns, output from [6b.iv. Read-in Input Tables](#6biv-read-in-input-tables))
+* `ps.rarefied` (a phyloseq object of the sample features (i.e. ASV) with feature counts, resampled such that all samples have the same library size, output from [7a. Rarefaction Curves](#7a-rarefaction-curves))
+* `group_levels` (a character vector of unique group names, output from [6b.iv. Read-in Input Tables](#6biv-read-in-input-tables))
+* `group_colors` (a named character vector of colors for each group, output from [6b.iv. Read-in Input Tables](#6biv-read-in-input-tables))
+
+
+**Output Data:**
+
+* **alpha_diversity/richness_and_diversity_estimates_by_sample_GLAmpSeq.png** (dot plots containing richness and diversity estimates for each sample)
+* **alpha_diversity/richness_and_diversity_estimates_by_group_GLAmpSeq.png** (box plots containing richness and diversity estimates for each group)
+
+
+
+---
+
+## 8. Beta Diversity Analysis
+
+Beta diversity measures the variation in species composition between different samples or environments. A common practice in working with a new dataset is to generate some exploratory visualizations like ordinations and hierarchical clusterings. These give us a quick overview of how our samples relate to each other and can be a way to check for problems like batch effects.
+
+Two normalization methods are supported before performing hierarchical clustering: variance stabilizing transformation (VST) and rarefaction. After rarefaction, the default Bray-Curtis dissimilarity can be used to generate dissimilarity matrices for hierarchical clustering. VST, however, generates negative values which are incompatible with calculating Bray-Curtis dissimilarity. For VST transformed data, Euclidean distance is used instead.
+
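+As a minimal illustration of why each normalization method is paired with a particular distance metric (this sketch is not part of the pipeline and uses made-up counts), Bray-Curtis dissimilarity is defined for non-negative abundances such as rarefied counts, whereas Euclidean distance also handles the signed values a variance stabilizing transformation can produce:
+
+```R
+# Minimal illustration (not pipeline code): pairing normalization with a distance metric
+library(vegan)
+
+# Hypothetical counts for two samples (rows = samples, columns = ASVs)
+counts <- rbind(sampleA = c(10, 0, 5, 20),
+                sampleB = c(2, 8, 0, 15))
+
+# Bray-Curtis on non-negative (e.g. rarefied) counts
+vegdist(counts, method = "bray")
+
+# A VST-like transform can yield negative values (this simple centering is only a
+# stand-in for a real variance stabilizing transformation); Euclidean distance is
+# used for such data
+vst_like <- counts - mean(counts)
+vegdist(vst_like, method = "euclidean")
+```
+
+The pipeline step itself is shown below.
+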
+```R
+beta_diversity_out_dir <- "beta_diversity/"
+if(!dir.exists(beta_diversity_out_dir)) dir.create(beta_diversity_out_dir)
+metadata <- {DATAFRAME}
+feature_table <- {DATAFRAME}
+group_colors <- {NAMED_VECTOR}
+groups_colname <- "groups"
+rarefaction_depth <- 500
+legend_title <- "Groups"
+assay_suffix <- "_GLAmpSeq"
+output_prefix <- ""
+distance_methods <- c("euclidean", "bray")
+normalization_methods <- c("vst", "rarefy")
+
+# Check and adjust rarefaction depth to preserve at least 2 groups
+library_sizes <- colSums(feature_table)
+min_lib_size <- min(library_sizes)
+max_lib_size <- max(library_sizes)
+
+# Check group-wise library sizes
+metadata_with_libsizes <- metadata
+metadata_with_libsizes$library_size <- library_sizes[rownames(metadata)]
+
+group_lib_stats <- metadata_with_libsizes %>%
+ group_by(!!sym(groups_colname)) %>%
+ summarise(
+ n_samples = n(),
+ min_lib = min(library_size),
+ max_lib = max(library_size),
+ median_lib = median(library_size),
+ .groups = 'drop'
+ )
+
+# Find max depth that preserves at least 2 groups
+groups_surviving_at_depth <- function(depth) {
+ sum(group_lib_stats$min_lib >= depth)
+}
+
+if(groups_surviving_at_depth(rarefaction_depth) < 2) {
+
+ # Find the depth that preserves exactly 2 groups (use the 2nd highest group minimum)
+ group_mins <- sort(group_lib_stats$min_lib, decreasing = TRUE)
+ if(length(group_mins) >= 2) {
+ adjusted_depth <- group_mins[2] # Use 2nd highest group minimum directly
+ } else {
+ adjusted_depth <- max(10, floor(min_lib_size * 0.8))
+ }
+
+ warning_msg <- c(
+ paste("Original rarefaction depth:", rarefaction_depth),
+ paste("Total groups in data:", nrow(group_lib_stats)),
+ "",
+ "Group-wise library size stats:",
+ paste(capture.output(print(group_lib_stats, row.names = FALSE)), collapse = "\n"),
+ "",
+ paste("WARNING: Rarefaction depth", rarefaction_depth, "would preserve only",
+ groups_surviving_at_depth(rarefaction_depth), "group(s)"),
+ paste("Beta diversity analysis requires at least 2 groups for statistical tests."),
+ "",
+ paste("Automatically adjusted rarefaction depth to:", adjusted_depth),
+ paste("This should preserve", groups_surviving_at_depth(adjusted_depth), "groups for analysis.")
+ )
+
+ writeLines(warning_msg, glue("{beta_diversity_out_dir}/{output_prefix}rarefaction_depth_warning.txt"))
+ message("WARNING: Rarefaction depth adjusted from ", rarefaction_depth, " to ", adjusted_depth,
+ " to preserve at least 2 groups - see ", output_prefix, "rarefaction_depth_warning.txt")
+
+ # Update the rarefaction depth
+ rarefaction_depth <- adjusted_depth
+}
+
+options(warn=-1) # ignore warnings
+# Run the analysis
+walk2(.x = normalization_methods, .y = distance_methods,
+ .f = function(normalization_method, distance_method){
+
+ # Create transformed phyloseq object
+ ps <- transform_phyloseq(feature_table, metadata,
+ method = normalization_method,
+ rarefaction_depth = rarefaction_depth)
+
+ # ---------Clustering and dendrogram plotting
+
+ # Extract normalized count table
+ count_tab <- otu_table(ps)
+
+ # VSD validation check point
+ if(normalization_method == "vst"){
+
+ # Visualize the sd vs the rank of the mean plot.
+    mean_sd_plot <- meanSdPlot(t(count_tab))$gg +
+ theme_bw() +
+ labs(title = "MEAN-SD VST validation Plot",
+ x="Rank Of Mean",
+ y="Standard Deviation") +
+ theme(axis.text = element_text(face = "bold", size = 12),
+ axis.title = element_text(face = "bold", size = 14),
+ title = element_text(face = "bold", size = 14))
+
+ # Save VSD validation plot
+ ggsave(filename = glue("{beta_diversity_out_dir}/{output_prefix}vsd_validation_plot.png"),
+           plot = mean_sd_plot, width = 14, height = 10,
+ dpi = 300, units = "in", limitsize = FALSE)
+ }
+
+
+ # Calculate distance between samples
+ dist_obj <- vegdist(t(count_tab), method = distance_method)
+
+ # Make dendrogram
+ dendrogram <- make_dendrogram(dist_obj, metadata, groups_colname,
+ group_colors, legend_title)
+
+ # Save dendrogram
+ ggsave(filename = glue("{beta_diversity_out_dir}/{output_prefix}{distance_method}_dendrogram{assay_suffix}.png"),
+ plot = dendrogram, width = 14, height = 10,
+ dpi = 300, units = "in", limitsize = FALSE)
+
+ #---------------------------- Run stats
+ # Checking homogeneity of variance and comparing groups using adonis test
+
+ stats_res <- run_stats(dist_obj, metadata, groups_colname)
+ write_csv(x = stats_res$variance,
+ file = glue("{beta_diversity_out_dir}/{output_prefix}{distance_method}_variance_table{assay_suffix}.csv"))
+
+ write_csv(x = stats_res$adonis,
+ file = glue("{beta_diversity_out_dir}/{output_prefix}{distance_method}_adonis_table{assay_suffix}.csv"))
+
+ #---------------------------- Make PCoA
+ # Unlabeled PCoA plot
+ ordination_plot_u <- plot_pcoa(ps, stats_res, distance_method,
+ groups_colname, group_colors, legend_title)
+ ggsave(filename=glue("{beta_diversity_out_dir}/{output_prefix}{distance_method}_PCoA_without_labels{assay_suffix}.png"),
+ plot=ordination_plot_u, width = 14, height = 8.33,
+ dpi = 300, units = "in", limitsize = FALSE)
+
+ # Labeled PCoA plot
+ ordination_plot <- plot_pcoa(ps, stats_res, distance_method,
+ groups_colname, group_colors, legend_title,
+ addtext=TRUE)
+ ggsave(filename=glue("{beta_diversity_out_dir}/{output_prefix}{distance_method}_PCoA_w_labels{assay_suffix}.png"),
+ plot=ordination_plot, width = 14, height = 8.33,
+ dpi = 300, units = "in", limitsize = FALSE)
+
+})
+```
+**Custom Functions Used:**
+
+* [transform_phyloseq()](#transform_phyloseq)
+* [make_dendrogram()](#make_dendrogram)
+* [run_stats()](#run_stats)
+* [plot_pcoa()](#plot_pcoa)
+
+**Input Data:**
+
+* `rarefaction_depth` (an integer specifying the minimum number of reads to simulate during rarefaction)
+* `groups_colname` (a string specifying the name of the column in the metadata table containing the group names)
+* `legend_title` (a string specifying the legend title for plotting)
+* `assay_suffix` (a string specifying the suffix to be added to output files; default is the Genelab assay suffix, "_GLAmpSeq")
+* `output_prefix` (a string specifying an additional prefix to be added to the output files; default is no additional prefix, "")
+* `normalization_methods` (a string vector specifying the method(s) to use for normalizing sample counts; "vst" (variance stabilizing transform) and "rarefy" (rarefaction) are supported)
+* `distance_methods` (a string vector specifying the method(s) to use to calculate the distance between samples; "vst" transformed data uses "euclidean" (Euclidean distance) and "rarefy" transformed data uses "bray" (Bray-Curtis distance))
+* `metadata` (a dataframe containing the sample metadata, with samples as row names and sample info as columns, output from [6b.iv. Read-in Input Tables](#6biv-read-in-input-tables))
+* `feature_table` (a dataframe containing a filtered subset of the samples feature dataframe (i.e. ASV), output from [6b.v. Preprocessing](#6bv-preprocessing))
+* `group_colors` (a named character vector of colors for each group, output from [6b.iv. Read-in Input Tables](#6biv-read-in-input-tables))
+
+**Output Data:**
+
+* **beta_diversity/{distance_method}_dendrogram_GLAmpSeq.png** (dendrogram(s) of the hierarchical clustering of samples based on the specified distance, Euclidean or Bray-Curtis, colored by experimental groups)
+* **beta_diversity/{distance_method}_adonis_table_GLAmpSeq.csv** (comma-separated table(s) containing the degrees of freedom (df), sum of squares (SumOfSqs), coefficient of determination (R^2), F-statistic (statistic), and p-value for the model (variation explained by experimental groups) and residual (unexplained variation) sources of variation (terms) for the specified distance analysis, Euclidean or Bray-Curtis)
+* **beta_diversity/{distance_method}_variance_table_GLAmpSeq.csv** (comma-separated table(s) containing the degrees of freedom (df), sum of squares (sumsq), mean square (meansq), F-statistic (statistic), and p-value for the groups (variation explained by experimental groups) and residual (unexplained variation) sources of variation (terms) for the specified distance analysis, Euclidean or Bray-Curtis)
+* **beta_diversity/{distance_method}_PCoA_without_labels_GLAmpSeq.png** (Principal Coordinates Analysis plots of VST-transformed and rarefied ASV counts, using the Euclidean and Bray-Curtis distance methods respectively, without sample labels)
+* **beta_diversity/{distance_method}_PCoA_w_labels_GLAmpSeq.png** (Principal Coordinates Analysis plots of VST-transformed and rarefied ASV counts, using the Euclidean and Bray-Curtis distance methods respectively, with sample labels)
+* **beta_diversity/vsd_validation_plot.png** (VST transformation validation diagnostic plot)
+
+
+
+---
+
+## 9. Taxonomy Plots
+
+Taxonomy summaries provide insights into the composition of microbial communities at various taxonomy levels.
+
+```R
+taxonomy_plots_out_dir <- "taxonomy_plots/"
+if(!dir.exists(taxonomy_plots_out_dir)) dir.create(taxonomy_plots_out_dir)
+metadata <- {DATAFRAME}
+feature_table <- {DATAFRAME}
+taxonomy_table <- {DATAFRAME}
+custom_palette <- {COLOR_VECTOR}
+publication_format <- {GGPLOT_THEME}
+groups_colname <- "groups"
+assay_suffix <- "_GLAmpSeq"
+output_prefix <- ""
+
+# -------------------------Prepare feature tables -------------------------- #
+# For ITS and 18S datasets the taxonomy columns may also contain kingdom and division taxonomy levels
+# which will break the code. To avoid this, we only plot the phylum to species levels.
+taxon_levels <- c("phylum", "class", "order", "family", "genus", "species") # Plot only phylum to species
+names(taxon_levels) <- taxon_levels
+taxon_tables <- map(.x = taxon_levels,
+ .f = make_feature_table,
+ count_matrix = feature_table,
+ taxonomy = taxonomy_table)
+
+# ----------------------- Sample abundance plots -------------------------- #
+group_rare <- TRUE
+samples_order <- metadata %>% arrange(!!sym(groups_colname)) %>% rownames()
+dont_group <- c("phylum")
+# In percentage
+# phylum 1%, class 3%, order 3%, family 8%, genus 8% and species 9%
+thresholds <- c(phylum=1,class=3, order=3, family=8, genus=8, species=9)
+# Convert from wide to long format
+relAbundance_tbs_rare_grouped <- map2(.x = taxon_levels,
+ .y = taxon_tables,
+ .f = function(taxon_level=.x,
+ taxon_table=.y){
+
+ print(taxon_level)
+ taxon_table <- apply(X = taxon_table, MARGIN = 2,
+ FUN = function(x) x/sum(x)) * 100
+
+
+ taxon_table <- as.data.frame(taxon_table %>% t())
+ if(group_rare && !(taxon_level %in% dont_group)){
+
+ taxon_table <- group_low_abund_taxa(taxon_table %>%
+ as.data.frame(check.names=FALSE,
+                                                stringsAsFactors=FALSE),
+ threshold = thresholds[taxon_level])
+
+ }
+ taxon_table$samples <- rownames(taxon_table)
+
+
+ # Change data frame from wide to long format
+ taxon_table <- taxon_table %>%
+ pivot_longer(cols = -samples, names_to = taxon_level, values_to = "relativeAbundance")
+ taxon_table$samples <- factor(x = taxon_table$samples,
+ levels = samples_order)
+ return(taxon_table)
+ })
+
+x_lab <- "Samples"
+y_lab <- "Relative abundance (%)"
+x <- 'samples'
+y <- "relativeAbundance"
+facet_by <- reformulate(groups_colname)
+number_of_samples <- length(samples_order)
+
+
+if(number_of_samples >= 30 ){
+
+ plot_width <- 0.6 * number_of_samples
+
+}else{
+
+ plot_width <- 14
+}
+
+# Make sample plots
+walk2(.x = relAbundance_tbs_rare_grouped, .y = taxon_levels,
+ .f = function(relAbundance_tb, taxon_level){
+
+ df <- relAbundance_tb %>%
+ left_join(metadata %>% rownames_to_column("samples"))
+
+ p <- ggplot(data = df, mapping = aes(x= !!sym(x), y=!!sym(y) )) +
+ geom_col(aes(fill = !!sym(taxon_level) )) +
+ facet_wrap(facet_by, scales = "free",
+ nrow = 1, labeller = label_wrap_gen(width=10)) +
+ publication_format +
+ labs(x = x_lab , y = y_lab, fill= tools::toTitleCase(taxon_level)) +
+ scale_fill_manual(values = custom_palette) +
+ theme(axis.text.x=element_text(
+ margin=margin(t=0.5,r=0,b=0,l=0,unit ="cm"),
+ angle = 90,
+ hjust = 0.5, vjust = 0.5)) +
+ labs(x=NULL)
+
+ ggsave(filename = glue("{taxonomy_plots_out_dir}/{output_prefix}samples_{taxon_level}{assay_suffix}.png"),
+ plot=p, width = plot_width, height = 8.5, dpi = 300, limitsize = FALSE)
+
+ })
+
+# ------------------------ Group abundance plots ----------------------------- #
+# In percentage
+# phylum 1% and 2% for class to species.
+thresholds <- c(phylum=1,class=2, order=2, family=2, genus=2, species=2)
+
+# Convert from wide to long format for every treatment group of interest
+group_rare <- TRUE # should rare taxa be grouped ?
+maximum_number_of_taxa <- 500 # If the number of taxa is more than this then rare taxa will be grouped anyway.
+
+group_relAbundance_tbs <- map2(.x = taxon_levels, .y = taxon_tables,
+ .f = function(taxon_level=.x, taxon_table=.y){
+
+ taxon_table <- as.data.frame(taxon_table %>% t())
+ taxon_table <- (collapse_samples(taxon_table = taxon_table,
+ metadata = metadata, group = groups_colname,
+ convertToRelativeAbundance = TRUE)$taxon_table * 100 ) %>%
+ as.data.frame(check.names=FALSE)
+
+ if(ncol(taxon_table) > maximum_number_of_taxa){
+ group_rare <- TRUE
+ }
+
+ if(group_rare && !(taxon_level %in% dont_group)){
+ taxon_table <- group_low_abund_taxa(taxon_table %>%
+ as.data.frame(check.names=FALSE,
+                                                stringsAsFactors=FALSE),
+ threshold = thresholds[taxon_level])
+ group_rare <- FALSE
+ }
+
+ taxon_table[,groups_colname] <- rownames(taxon_table)
+
+
+ # Change from wide to long format
+ taxon_table <- taxon_table %>%
+ pivot_longer(cols = -!!sym(groups_colname),
+ names_to = taxon_level,
+ values_to = "relativeAbundance")
+
+ return(taxon_table)
+
+ })
+
+# Make bar plots
+y_lab <- "Relative abundance (%)"
+y <- "relativeAbundance"
+number_of_groups <- length(group_levels)
+plot_width <- 2.5 * number_of_groups
+
+# Cap the maximum plot width to 50 regardless of the number of groups
+if(plot_width > 50 ){
+
+ plot_width <- 50
+}
+
+walk2(.x = group_relAbundance_tbs, .y = taxon_levels,
+ .f = function(relAbundance_tb=.x, taxon_level=.y){
+
+ p <- ggplot(data = relAbundance_tb %>%
+ mutate(X = str_wrap(!!sym(groups_colname),
+ width = 10) # wrap long group names
+ ),
+ mapping = aes(x = X , y = !!sym(y))) +
+ geom_col(aes(fill = !!sym(taxon_level))) +
+ publication_format +
+ theme(axis.text.x=element_text(
+ margin=margin(t=0.5,r=0,b=0,l=0,unit ="cm"),
+ angle = 0,
+ hjust = 0.5, vjust = 0.5)) +
+ labs(x = NULL , y = y_lab, fill = tools::toTitleCase(taxon_level)) +
+ scale_fill_manual(values = custom_palette)
+ ggsave(filename = glue("{taxonomy_plots_out_dir}/{output_prefix}groups_{taxon_level}{assay_suffix}.png"),
+ plot=p, width = plot_width, height = 10, dpi = 300, limitsize = FALSE)
+ })
+```
+
+**Custom Functions Used**
+
+* [make_feature_table()](#make_feature_table)
+* [group_low_abund_taxa()](#group_low_abund_taxa)
+* [collapse_samples()](#collapse_samples)
+
+**Input Data:**
+
+* `groups_colname` (a string specifying the name of the column in the metadata table containing the group names)
+* `assay_suffix` (a string specifying the suffix to be added to output files; default is the Genelab assay suffix, "_GLAmpSeq")
+* `output_prefix` (a string specifying an additional prefix to be added to the output files; default is no additional prefix, "")
+* `metadata` (a dataframe containing the sample metadata, with samples as row names and sample info as columns, output from [6b.iv. Read-in Input Tables](#6biv-read-in-input-tables))
+* `feature_table` (a dataframe containing a filtered subset of the samples feature dataframe (i.e. ASV), output from [6b.v. Preprocessing](#6bv-preprocessing))
+* `taxonomy_table` (a dataframe containing a filtered subset of the feature taxonomy dataframe with ASV taxonomy assignments, output from [6b.v. Preprocessing](#6bv-preprocessing))
+* `custom_palette` (a vector of strings specifying a custom color palette for coloring plots, output from [6b.iii. Set Variables](#6biii-set-variables))
+* `publication_format` (a ggplot::theme object specifying the custom theme for plotting, output from [6b.iii. Set Variables](#6biii-set-variables))
+* `group_levels` (a character vector of unique group names, output from [6b.iv. Read-in Input Tables](#6biv-read-in-input-tables))
+
+
+**Output Data:**
+
+* **taxonomy_plots/samples_{taxon_level}_GLAmpSeq.png** (barplots of the relative abundance at the specified taxon level for each sample)
+* **taxonomy_plots/groups_{taxon_level}_GLAmpSeq.png** (barplots of the relative abundance at the specified taxon level for each group)
+
+Where `{taxon_level}` is each of phylum, class, order, family, genus, and species.
+
+> Please note that the species plot can be misleading as short amplicon sequences can't be used to accurately predict species.
+
+
+
+---
+
+
+## 10. Differential Abundance Testing
+
+Differential abundance testing aims to uncover specific taxa that exhibit notable variations across different conditions, complemented by visualizations like volcano plots to illustrate these disparities and their implications for ASV abundance and overall microbial community dynamics. ANCOMBC 1, ANCOMBC 2, and DESeq2 provide three different methods for calculating differential abundance. ANCOMBC 1 and 2 were specifically designed to handle the compositional nature of microbiome data. ANCOMBC 2 is an improved version of ANCOMBC 1, particularly for datasets with high sparsity, small sample sizes, or longitudinal and correlated experimental designs. This pipeline also implements DESeq2 because it is a popular choice for differential abundance analysis. DESeq2 assumes a negative binomial model and can have issues with sparse data, which is frequently the case in microbiome datasets. Two diagnostic plots (VST validation and ASV sparsity plots) help assess whether DESeq2 is appropriate for a given dataset: the VST validation plot assesses whether VST is successfully stabilizing variance, and the ASV sparsity plot helps users determine whether their data are too sparse for DESeq2 to reliably assess differential abundance.
+
+> In general, we recommend using the ANCOMBC 2 differential abundance data, although all differential abundance data outputs should be evaluated in the context of the question(s) the user seeks to answer with these data.
+
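+As a rough, hypothetical illustration of the kind of check the ASV sparsity plot summarizes (this sketch is not the pipeline's implementation; the toy table below stands in for the filtered feature table from [6b.v. Preprocessing](#6bv-preprocessing)), the fraction of zero counts can be computed directly from the feature table:
+
+```R
+# Minimal illustration (not pipeline code): quantifying sparsity of a feature table
+# Toy ASV x sample count table standing in for the real feature_table
+feature_table <- data.frame(sample1 = c(10, 0, 0, 3),
+                            sample2 = c(0, 0, 25, 1),
+                            row.names = paste0("ASV", 1:4))
+
+# Overall fraction of zero counts across the table
+overall_sparsity <- sum(feature_table == 0) / prod(dim(feature_table))
+
+# Per-ASV fraction of samples with zero counts
+per_asv_sparsity <- rowMeans(feature_table == 0)
+
+overall_sparsity  # 0.5 here; very sparse tables suggest DESeq2 may be a poor fit
+per_asv_sparsity
+```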
+
+### 10a. ANCOMBC 1
+
+```R
+# Create output directory if it doesn't already exist
+diff_abund_out_dir <- "differential_abundance/ancombc1/"
+if(!dir.exists(diff_abund_out_dir)) dir.create(diff_abund_out_dir, recursive = TRUE)
+metadata <- {DATAFRAME}
+feature_table <- {DATAFRAME}
+taxonomy_table <- {DATAFRAME}
+feature <- "ASV"
+groups_colname <- "groups"
+samples_column <- "Sample Name"
+assay_suffix <- "_GLAmpSeq"
+target_region <- "16S" # "16S", "18S" or "ITS"
+output_prefix <- ""
+prevalence_cutoff <- 0
+library_cutoff <- 0
+remove_struc_zero <- FALSE
+threads <- 5
+
+# Get long ASV taxonomy names and clean them
+species <- taxonomy_table %>%
+  unite(species, domain:species, sep = ";") %>%
+  pull() %>%
+  str_replace_all("Other", "_")
+
+taxonomy_table <- fix_names(taxonomy_table, "Other", ";_")
+
+taxonomy_table[,"species"] <- species
+
+# Create phyloseq object from feature, taxonomy and sample metadata tables
+ps <- phyloseq(otu_table(feature_table, taxa_are_rows = TRUE),
+ sample_data(metadata),
+ tax_table(as.matrix(taxonomy_table)))
+
+# Convert phyloseq to tree summarized experiment object
+tse <- mia::makeTreeSummarizedExperimentFromPhyloseq(ps)
+
+# Get unique group comparison as a matrix
+pairwise_comp.m <- utils::combn((metadata[,groups_colname] %>% unique %>% sort), 2)
+pairwise_comp_df <- pairwise_comp.m %>% as.data.frame
+# Name the columns in the pairwise matrix as group1vgroup2
+colnames(pairwise_comp_df) <- map_chr(pairwise_comp_df,
+ \(col) str_c(col, collapse = "v"))
+comparisons <- colnames(pairwise_comp_df)
+names(comparisons) <- comparisons
+
+
+# ---------------------- Run ANCOMBC 1 ---------------------------------- #
+set.seed(123)
+final_results_bc1 <- map(pairwise_comp_df, function(col){
+
+ group1 <- col[1]
+ group2 <- col[2]
+
+ # Subset the treeSummarizedExperiment object to contain only samples
+ # in group1 and group2
+ tse_sub <- tse[, tse[[groups_colname]] %in% c(group1, group2)]
+
+ # Note that by default, levels of a categorical variable in R are sorted
+ # alphabetically.
+ # Changing the reference group by reordering the factor levels
+ tse_sub[[groups_colname]] <- factor(tse_sub[[groups_colname]] , levels = c(group1, group2))
+
+ # Run ancombc (uses default parameters unless specified in pipeline Parameter Definitions)
+ tryCatch({
+ out <- ancombc(data = tse_sub,
+ formula = groups_colname,
+ p_adj_method = "fdr", prv_cut = prevalence_cutoff,
+ lib_cut = library_cutoff,
+ group = groups_colname , struc_zero = remove_struc_zero,
+ neg_lb = TRUE,
+ conserve = TRUE,
+ n_cl = threads, verbose = TRUE)
+
+ # ------ Set data frame names ---------#
+ # lnFC
+ lfc <- out$res$lfc %>%
+ as.data.frame() %>%
+ select(-contains("Intercept")) %>%
+ set_names(
+ c("taxon",
+ glue("Lnfc_({group2})v({group1})"))
+ )
+
+ # SE
+ se <- out$res$se %>%
+ as.data.frame() %>%
+ select(-contains("Intercept")) %>%
+ set_names(
+ c("taxon",
+ glue("Lnfc.SE_({group2})v({group1})"))
+ )
+
+ # W
+ W <- out$res$W %>%
+ as.data.frame() %>%
+ select(-contains("Intercept")) %>%
+ set_names(
+ c("taxon",
+ glue("Stat_({group2})v({group1})"))
+ )
+
+ # p_val
+ p_val <- out$res$p_val %>%
+ as.data.frame() %>%
+ select(-contains("Intercept")) %>%
+ set_names(
+ c("taxon",
+ glue("P.value_({group2})v({group1})"))
+ )
+
+ # q_val
+ q_val <- out$res$q_val %>%
+ as.data.frame() %>%
+ select(-contains("Intercept")) %>%
+ set_names(
+ c("taxon",
+ glue("Q.value_({group2})v({group1})"))
+ )
+
+ # Diff_abn
+ diff_abn <- out$res$diff_abn %>%
+ as.data.frame() %>%
+ select(-contains("Intercept")) %>%
+ set_names(
+ c("taxon",
+ glue("Diff_({group2})v({group1})"))
+ )
+
+ # Merge the dataframes to one results dataframe
+ res <- lfc %>%
+ left_join(se) %>%
+ left_join(W) %>%
+ left_join(p_val) %>%
+ left_join(q_val) %>%
+ left_join(diff_abn)
+
+ return(res)
+ }, error = function(e) {
+ # Create log message
+ log_msg <- c(
+ "\nANCOMBC1 analysis failed for comparison: ", group1, " vs ", group2,
+ "\nError: ", e$message,
+ "\n\nDiagnostics:",
+ paste("- Number of taxa after filtering:", nrow(taxonomy_table)),
+      paste("- Number of samples in group", group1, ":", sum(tse_sub[[groups_colname]] == group1)),
+      paste("- Number of samples in group", group2, ":", sum(tse_sub[[groups_colname]] == group2)),
+ "\nPossibly insufficient data for ANCOMBC1 analysis. Consider adjusting filtering parameters or group assignments."
+ )
+
+ # Write to log file
+ writeLines(log_msg,
+ file.path(diff_abund_out_dir,
+ glue("{output_prefix}ancombc1_failure.txt")))
+
+ # Print to console and quit
+ message(log_msg)
+ quit(status = 0)
+ })
+})
+
+# ------------ Create merged stats pairwise dataframe ----------------- #
+# Initialize the merged stats dataframe to contain the taxon column for joining
+merged_stats_df <- final_results_bc1[[names(final_results_bc1)[1]]] %>%
+ as.data.frame() %>% select(taxon)
+
+# Loop over the results of every comparison and join each one to the pre-existing
+# stats table
+walk(comparisons[names(final_results_bc1)], .f = function(comparison){
+
+ # Get comparison specific statistics
+ df <- final_results_bc1[[comparison]] %>% as.data.frame()
+
+ # Merge it to the pre-existing statistics table
+ merged_stats_df <<- merged_stats_df %>%
+ dplyr::full_join(df, by = join_by("taxon"))
+
+})
+
+# Sort ASVs in ascending order
+merged_stats_df <- merged_stats_df %>%
+ rename(!!feature := taxon) %>%
+ mutate(!!feature := SortMixed(!!sym(feature)))
+
+# ------ Get comparison names
+# Since all column groups i.e. lnFC, pval, W, etc. have the same
+# suffixes as comparison names, we only need to extract the comparison names
+# from one of them. Here we extract them from the "lnFC" prefixed columns
+comp_names <- merged_stats_df %>%
+ select(starts_with("Lnfc_", ignore.case = FALSE)) %>%
+ colnames() %>% str_remove_all("Lnfc_")
+names(comp_names) <- comp_names
+
+# -------------- Make volcano plots ------------------ #
+volcano_plots <- map(comp_names, function(comparison){
+
+ # Construct column names for columns to be selected
+ comp_col <- c(
+ glue("Lnfc_{comparison}"),
+ glue("Lnfc.SE_{comparison}"),
+ glue("Stat_{comparison}"),
+ glue("P.value_{comparison}"),
+ glue("Q.value_{comparison}"),
+ glue("Diff_{comparison}")
+ )
+
+ sub_res_df <- merged_stats_df %>%
+ select(!!feature, all_of(comp_col)) %>% drop_na()
+ colnames(sub_res_df) <- str_replace_all(colnames(sub_res_df),
+ pattern = "(.+)_.+",
+ replacement = "\\1")
+
+ # Set default pvalue and plot dimensions
+ p_val <- 0.1
+ plot_width_inches <- 11.1
+ plot_height_inches <- 8.33
+
+ # Retrieve a vector of the 2 groups being compared
+ groups_vec <- comparison %>%
+ str_replace_all("\\)v\\(", ").vs.(") %>% # replace ')v(' with ').vs.(' to enhance accurate groups splitting
+ str_remove_all("\\(|\\)") %>% # remove brackets
+ str_split(".vs.") %>% unlist # split groups to list then convert to a vector
+
+ group1 <- groups_vec[1]
+ group2 <- groups_vec[2]
+
+ ###### Long x-axis label adjustments ##########
+ x_label <- glue("ln Fold Change\n\n( {group1} vs {group2} )")
+ label_length <- nchar(x_label)
+ max_allowed_label_length <- plot_width_inches * 10
+
+ # Construct x-axis label with new line breaks if was too long
+ if (label_length > max_allowed_label_length){
+ x_label <- glue("ln Fold Change\n\n( {group1} \n vs \n {group2} )")
+ }
+
+ # Make plot
+ p <- ggplot(sub_res_df %>% mutate(diff = Q.value <= p_val),
+ aes(x=Lnfc, y=-log10(Q.value),
+ color=diff, label=!!sym(feature))) +
+ geom_point(alpha=0.7, size=2) +
+ scale_color_manual(values=c("TRUE"="red", "FALSE"="black"),
+ labels=c(paste0("qval > ", p_val),
+ paste0("qval \u2264 ", p_val))) +
+ geom_hline(yintercept = -log10(p_val), linetype = "dashed") +
+ ggrepel::geom_text_repel(show.legend = FALSE) +
+ expandy(-log10(sub_res_df$Q.value)) + # Expand plot y-limit
+ coord_cartesian(clip = 'off') +
+ scale_y_continuous(oob = scales::oob_squish_infinite) + # prevent clipping of infinite values
+ labs(x= x_label, y="-log10(Q-value)",
+ title = "Volcano Plot", color=NULL,
+ caption = glue("dotted line: q-value = {p_val}")) +
+ theme_bw() +
+ theme(legend.position="top", legend.key = element_rect(colour=NA),
+ plot.caption = element_text(face = 'bold.italic'))
+ # Save plot
+ file_name <- glue("{output_prefix}{comparison %>% str_replace_all('[:space:]+','_')}_volcano.png")
+ ggsave(filename = file_name,
+ plot = p, device = "png", width = plot_width_inches,
+ height = plot_height_inches, units = "in",
+ dpi = 300, path = diff_abund_out_dir)
+
+ return(p)
+})
+
+# ------------------- Add NCBI id to feature, i.e. ASV -------------- #
+# Get the best/least possible taxonomy name for the ASVs
+tax_names <- map_chr(str_replace_all(taxonomy_table$species, ";_","") %>%
+ str_split(";"),
+ function(row) row[length(row)])
+
+df <- data.frame(ASV=rownames(taxonomy_table), best_taxonomy=tax_names)
+colnames(df) <- c(feature, "best_taxonomy")
+
+# Pull NCBI IDS for unique taxonomy names
+# Filter out unannotated entries before querying NCBI
+valid_taxonomy <- df$best_taxonomy %>% unique() %>% setdiff("_")
+df2_valid <- data.frame(best_taxonomy = valid_taxonomy) %>%
+ mutate(NCBI_id=get_ncbi_ids(best_taxonomy, target_region),
+ .after = best_taxonomy)
+
+# Add unannotated entries with NA NCBI_id
+df2_invalid <- data.frame(best_taxonomy = "_", NCBI_id = NA)
+df2 <- rbind(df2_valid, df2_invalid)
+
+df <- df %>%
+ left_join(df2, join_by("best_taxonomy")) %>%
+ right_join(merged_stats_df)
+
+# Manually creating a normalized table because normalized
+# tables differ by comparison
+normalized_table <- as.data.frame(feature_table + 1) %>%
+ rownames_to_column(feature) %>%
+ mutate(across( where(is.numeric), log ) )
+
+# Create a missing values / NAs dataframe of samples that were dropped
+# due to prefiltering steps (prevalence and library cutoff filtering)
+# prior to running ANCOMBC
+samples <- metadata[[samples_column]]
+samplesdropped <- setdiff(x = samples, y = colnames(normalized_table)[-1])
+missing_df <- data.frame(ASV=normalized_table[[feature]],
+ matrix(data = NA,
+ nrow = nrow(normalized_table),
+ ncol = length(samplesdropped)
+ )
+)
+colnames(missing_df) <- c(feature,samplesdropped)
+
+# Create mean and standard deviation table
+group_levels <- metadata[, groups_colname] %>% unique() %>% sort()
+group_means_df <- normalized_table[feature]
+walk(group_levels, function(group_level){
+
+ mean_col <- glue("Group.Mean_({group_level})")
+ std_col <- glue("Group.Stdev_({group_level})")
+
+ # Samples that belong to the current group
+ Samples <- metadata %>%
+ filter(!!sym(groups_colname) == group_level) %>%
+ pull(!!sym(samples_column))
+ # Samples that belong to the current group that are in the normalized table
+ Samples <- intersect(colnames(normalized_table), Samples)
+
+ # Calculate means and standard deviations for the current comparison
+ temp_df <- normalized_table %>% select(!!feature, all_of(Samples)) %>%
+ rowwise() %>%
+ mutate(!!mean_col := mean(c_across(where(is.numeric)), na.rm = TRUE),
+ !!std_col := sd(c_across(where(is.numeric)), na.rm = TRUE) ) %>%
+ select(!!feature,!!sym(mean_col), !!sym(std_col))
+
+  # Merge the current comparison's means and standard deviations
+ # to previous ones
+ group_means_df <<- group_means_df %>% left_join(temp_df)
+
+})
+
+# Append missing sample columns to the normalized table
+normalized_table <- normalized_table %>%
+ left_join(missing_df, by = feature) %>%
+ select(!!feature, all_of(samples))
+
+# Compute per-ASV means and standard deviations of the normalized counts across all samples
+All_mean_sd <- normalized_table %>%
+ rowwise() %>%
+ mutate(All.mean=mean(c_across(where(is.numeric)), na.rm = TRUE),
+ All.stdev=sd(c_across(where(is.numeric)), na.rm = TRUE) ) %>%
+ select(!!feature, All.mean, All.stdev)
+
+# Merge the taxonomy table to the stats table
+merged_df <- df %>%
+ left_join(taxonomy_table %>%
+ as.data.frame() %>%
+ rownames_to_column(feature)) %>%
+ select(!!feature, domain:species,everything())
+
+# Merge all the pre-combined dataframes in the desired order
+merged_df <- merged_df %>%
+ select(!!sym(feature):NCBI_id) %>%
+ left_join(normalized_table, by = feature) %>%
+ left_join(merged_df) %>%
+ left_join(All_mean_sd) %>%
+ left_join(group_means_df, by = feature) %>%
+ mutate(across(where(is.matrix), as.numeric))
+
+# Write out results of differential abundance using ANCOMBC 1
+output_file <- glue("{diff_abund_out_dir}/{output_prefix}ancombc1_differential_abundance{assay_suffix}.csv")
+# Write the combined table to file, first dropping all of ANCOMBC's
+# inferred differential abundance (Diff_) columns
+write_csv(merged_df %>%
+ select(-starts_with("Diff_")),
+ output_file)
+
+```
+**Custom Functions Used:**
+
+* [expandy()](#expandy)
+* [get_ncbi_ids()](#get_ncbi_ids)
+* [fix_names()](#fix_names)
+
+**Parameter Definition:**
+
+* `ancombc()` - ANCOMBC::ancombc function (*using the following non-default values:*)
+ * `data` - TreeSummarizedExperiment object created from `feature_table` input data
+ * `formula` - a string specifying the variable in the metadata to use for the fixed effects formula (e.g. group names), set by `groups_colname` input data
+ * `prv_cut` - fraction between 0 and 1 specifying the taxon prevalence cut-off, set by `prevalence_cutoff` input data
+ * `lib_cut` - a numerical threshold for filtering samples based on library sizes, set by `library_cutoff` input data
+ * `group` - the name of the group variable in the metadata, set by `groups_colname` input data
+ * `struc_zero` - logical value indicating whether or not group-wise structural zeros should be detected, set by `remove_struc_zero` input data
+ * `n_cl` - the number of processes to run in parallel, set by `threads` input data
+ * `p_adj_method` - a string specifying the p-value adjustment method for multiple comparisons testing, set to "fdr" to standardize the multiple comparisons method across all three differential abundance methods
+ * `neg_lb` - logical value specifying whether to classify a taxon as a structural zero using its asymptotic lower bound, set to "TRUE"
+  * `conserve` - logical value indicating whether or not a conservative variance estimator should be used for the test statistic, set to "TRUE"
+ * `verbose` - logical value specifying whether or not to generate verbose output
+
+
+**Input Data:**
+
+* `feature` (a string specifying the feature type, i.e. "ASV" or "OTU")
+* `groups_colname` (a string specifying the name of the column in the metadata table containing the group names)
+* `samples_column` (a string specifying the name of the column in the metadata table containing the sample names)
+* `assay_suffix` (a string specifying the suffix to be added to output files; default is the Genelab assay suffix, "_GLAmpSeq")
+* `output_prefix` (a string specifying an additional prefix to be added to the output files; default is no additional prefix, "")
+* `threads` (a number specifying the number of cpus to use for parallel processing)
+* `prevalence_cutoff` (a decimal between 0 and 1 specifying the proportion of samples required to contain a taxon in order to keep the taxon when `remove_rare` (set in [Step 6b.v. Preprocessing](#6bv-preprocessing)) is set to TRUE; default is 0, i.e. do not exclude any taxon / feature)
+* `library_cutoff` (a numerical variable specifying the number of total counts a sample must have across all features to be retained when `remove_rare` (set in [Step 6b.v. Preprocessing](#6bv-preprocessing)) is set to TRUE; default is 0, i.e. no samples will be dropped)
+* `target_region` (a string specifying the amplicon target region; options are either "16S", "18S", or "ITS")
+* `remove_struc_zero` (a boolean variable specifying whether or not structural zeros (a.k.a ASVs with zero count in at least one group) should be removed; default is FALSE i.e. structural zeros won't be removed)
+* `metadata` (a dataframe containing the sample metadata, with samples as row names and sample info as columns, output from [6b.iv. Read-in Input Tables](#6biv-read-in-input-tables))
+* `feature_table` (a dataframe containing a filtered subset of the samples feature dataframe (i.e. ASV), output from [6b.v. Preprocessing](#6bv-preprocessing))
+* `taxonomy_table` (a dataframe containing a filtered subset of the feature taxonomy dataframe with ASV taxonomy assignments, output from [6b.v. Preprocessing](#6bv-preprocessing))
+
+
+
+**Output Data:**
+
+* **differential_abundance/ancombc1/(group2)v(group1)_volcano.png** (volcano plots for each pairwise group comparison, where group1 and group2 are the names of the two groups compared)
+* **differential_abundance/ancombc1/ancombc1_differential_abundance_GLAmpSeq.csv** (a comma-separated ANCOM-BC1 differential abundance results table containing the following columns:
+ - ASV (identified ASVs)
+ - taxonomic assignment columns
+ - NCBI identifier for the best taxonomic assignment for each ASV
+ - Normalized abundance values for each ASV for each sample
+ - For each pairwise group comparison:
+ - natural log of the fold change (Lnfc)
+ - standard error for the lnFC (Lnfc.SE)
+ - test statistic from the primary result (Stat)
+ - P-value (P.value)
+ - Adjusted p-value (Q.value)
+ - All.mean (mean across all samples)
+ - All.stdev (standard deviation across all samples)
+ - For each group:
+ - Group.Mean_(group) (mean within group)
+ - Group.Stdev_(group) (standard deviation within group))
+
+
+
+---
+
+### 10b. ANCOMBC 2
+
+```R
+diff_abund_out_dir <- "differential_abundance/ancombc2/"
+if(!dir.exists(diff_abund_out_dir)) dir.create(diff_abund_out_dir, recursive = TRUE)
+metadata <- {DATAFRAME}
+feature_table <- {DATAFRAME}
+taxonomy_table <- {DATAFRAME}
+feature <- "ASV"
+target_region <- "16S" # "16S" , "18S" or "ITS"
+groups_colname <- "groups"
+samples_column <- "Sample Name"
+assay_suffix <- "_GLAmpSeq"
+output_prefix <- ""
+prevalence_cutoff <- 0 # from [Step 6b.v. Preprocessing]
+library_cutoff <- 0 # from [Step 6b.v. Preprocessing]
+remove_struc_zero <- FALSE
+threads <- 5
+group <- groups_colname # alias for the grouping column name, used in the column-renaming code below
+
+# Get long ASV taxonomy names and clean them
+species <- taxonomy_table %>%
+  unite(species, domain:species, sep = ";") %>%
+  pull() %>%
+  str_replace_all("Other", "_")
+
+taxonomy_table <- fix_names(taxonomy_table, "Other", ";_")
+
+taxonomy_table[,"species"] <- species
+
+# Create phyloseq object
+ps <- phyloseq(otu_table(feature_table, taxa_are_rows = TRUE),
+ sample_data(metadata),
+ tax_table(as.matrix(taxonomy_table)))
+
+# Convert phyloseq to tree summarized experiment object
+tse <- mia::makeTreeSummarizedExperimentFromPhyloseq(ps)
+
+# Getting the reference group and making sure that it is the reference
+# used in the analysis
+group_levels <- metadata[, groups_colname] %>% unique() %>% sort()
+ref_group <- group_levels[1] # the first group is used as the reference group by default
+tse[[groups_colname]] <- factor(tse[[groups_colname]] , levels = group_levels)
+
+
+# ---------------------- Run ANCOMBC2 ---------------------------------- #
+# Run ancombc2 (uses default parameters unless specified in pipeline Parameter Definitions)
+output <- ancombc2(data = tse,
+ fix_formula = groups_colname,
+ p_adj_method = "fdr",
+ prv_cut = prevalence_cutoff,
+ lib_cut = library_cutoff, s0_perc = 0.05,
+ group = groups_colname, struc_zero = remove_struc_zero,
+ n_cl = threads, verbose = TRUE,
+ pairwise = TRUE,
+ iter_control = list(tol = 1e-5, max_iter = 20,
+ verbose = FALSE),
+ mdfdr_control = list(fwer_ctrl_method = "fdr", B = 100),
+ lme_control = NULL, trend_control = NULL)
+
+# For 2-group comparisons, use res instead of mapping across pairwise results in res_pair
+is_two_group <- length(unique(tse[[group]])) == 2
+
+# Create new column names - the original column names given by ANCOMBC are
+# difficult to understand
+tryCatch({
+ # Check if this is a 2-group comparison (using res instead of res_pair)
+ if(is_two_group) {
+ # For 2-group comparisons, use the group-specific columns
+ group_cols <- colnames(output$res)[grepl(paste0("^[a-zA-Z_]+_", group), colnames(output$res))]
+ if(length(group_cols) > 0) {
+ # Extract group name from the first group-specific column
+ group_name <- str_replace(group_cols[1], paste0("^[a-zA-Z_]+_", group), "")
+ # Create comparison name
+ comparison_name <- glue("({group_name})v({ref_group})")
+
+ new_colnames <- c(
+ feature, # Keep the feature column name
+ glue("Lnfc_{comparison_name}"),
+ glue("Lnfc.SE_{comparison_name}"),
+ glue("Stat_{comparison_name}"),
+ glue("P.value_{comparison_name}"),
+ glue("Q.value_{comparison_name}"),
+ glue("Diff_{comparison_name}"),
+ glue("Passed_ss_{comparison_name}")
+ )
+ } else {
+ stop("Could not identify group-specific column for 2-group comparison")
+ }
+ } else {
+ # Multi-group comparisons
+ new_colnames <- map_chr(output$res_pair %>% colnames,
+ function(colname) {
+ # Columns comparing a group to the reference group
+ if(str_count(colname,group) == 1){
+ str_replace_all(string=colname,
+ pattern=glue("(.+)_{group}(.+)"),
+ replacement=glue("\\1_(\\2)v({ref_group})")) %>%
+ str_replace(pattern = "^lfc_", replacement = "Lnfc_") %>%
+ str_replace(pattern = "^se_", replacement = "Lnfc.SE_") %>%
+ str_replace(pattern = "^W_", replacement = "Stat_") %>%
+ str_replace(pattern = "^p_", replacement = "P.value_") %>%
+ str_replace(pattern = "^q_", replacement = "Q.value_") %>%
+ str_replace(pattern = "^diff_", replacement = "Diff_") %>%
+ str_replace(pattern = "^passed_ss_", replacement = "Passed_ss_")
+
+ # Columns with normal two groups comparison
+ } else if(str_count(colname,group) == 2){
+
+ str_replace_all(string=colname,
+ pattern=glue("(.+)_{group}(.+)_{group}(.+)"),
+ replacement=glue("\\1_(\\2)v(\\3)")) %>%
+ str_replace(pattern = "^lfc_", replacement = "Lnfc_") %>%
+ str_replace(pattern = "^se_", replacement = "Lnfc.SE_") %>%
+ str_replace(pattern = "^W_", replacement = "Stat_") %>%
+ str_replace(pattern = "^p_", replacement = "P.value_") %>%
+ str_replace(pattern = "^q_", replacement = "Q.value_") %>%
+ str_replace(pattern = "^diff_", replacement = "Diff_") %>%
+ str_replace(pattern = "^passed_ss_", replacement = "Passed_ss_")
+
+ # Feature/ ASV column
+ } else{
+
+ return(colname)
+ }
+ } )
+ }
+}, error = function(e) {
+ writeLines(c("ANCOMBC2 script failed at res_pair processing:", e$message,
+ "\n\nDiagnostics:",
+ paste("- Number of taxa after filtering:", nrow(taxonomy_table)),
+ paste("- Number of groups:", length(unique(tse[[group]]))),
+ paste("- Sample sizes per group:"),
+ paste(" ", paste(names(table(tse[[group]])), "=", table(tse[[group]]), collapse="\n ")),
+ "\nPossibly insufficient data for ANCOMBC2 analysis. Consider adjusting filtering parameters or group assignments."),
+ file.path(diff_abund_out_dir, glue("{output_prefix}ancombc2_failure.txt")))
+ quit(status = 0)
+})
+
+# Change the column named taxon to the feature name e.g. ASV
+new_colnames[new_colnames == "taxon"] <- feature
+
+
+# Rename columns
+if(is_two_group) {
+ # For 2-group comparisons, we need to select the group-specific columns and rename them
+ # The columns are named like "lfc_groupsGround Control", "se_groupsGround Control", etc.
+
+ group_specific_cols <- colnames(output$res)[grepl(paste0("^[a-zA-Z_]+_", group), colnames(output$res))]
+
+ # Create a new data frame with the selected columns
+ paired_stats_df <- output$res %>%
+ select(taxon, all_of(group_specific_cols)) %>%
+ set_names(new_colnames)
+} else {
+ # Multi-group comparisons
+ paired_stats_df <- output$res_pair %>% set_names(new_colnames)
+}
+
+# Get the unique comparison names
+uniq_comps <- str_replace_all(new_colnames, ".+_(\\(.+\\))", "\\1") %>% unique()
+uniq_comps <- uniq_comps[-match(feature, uniq_comps)]
+
+# ------ Sort columns by group comparisons --------#
+# Create a data frame containing only the feature/ASV column
+res_df <- paired_stats_df[1]
+walk(uniq_comps, function(comp){
+
+ # Get the results for a comparison
+ temp_df <- paired_stats_df %>% select(!!sym(feature), contains(comp))
+
+ # Merge the current comparison to previous comparisons by feature/ASV id
+ res_df <<- res_df %>% left_join(temp_df)
+})
+
+# --------- Add NCBI id to feature ---------------#
+
+# Get the best taxonomy assigned to each ASV
+tax_names <- map_chr(str_replace_all(taxonomy_table$species, ";_","") %>%
+ str_split(";"),
+ function(row) row[length(row)])
+
+df <- data.frame(ASV=rownames(taxonomy_table), best_taxonomy=tax_names)
+colnames(df) <- c(feature, "best_taxonomy")
+
+# Querying NCBI...
+# Pull NCBI IDS for unique taxonomy names
+# Filter out unannotated entries before querying NCBI
+valid_taxonomy <- df$best_taxonomy %>% unique() %>% setdiff("_")
+df2_valid <- data.frame(best_taxonomy = valid_taxonomy) %>%
+ mutate(NCBI_id=get_ncbi_ids(best_taxonomy, target_region),
+ .after = best_taxonomy)
+
+# Add unannotated entries with NA NCBI_id
+df2_invalid <- data.frame(best_taxonomy = "_", NCBI_id = NA)
+df2 <- rbind(df2_valid, df2_invalid)
+
+df <- df %>%
+ left_join(df2, join_by("best_taxonomy")) %>%
+ right_join(res_df)
+
+# Retrieve the normalized table
+normalized_table <- output$bias_correct_log_table %>%
+ rownames_to_column(feature) %>%
+ mutate(across(where(is.numeric), ~replace_na(.x, replace=0)))
+
+# Create a missing values / NAs dataframe of samples that were dropped
+# due to prefiltering steps (prevalence and library cut offs filtering)
+# prior to running ANCOMBC2
+samples <- metadata[[samples_column]]
+samplesdropped <- setdiff(x = samples, y = colnames(normalized_table)[-1])
+missing_df <- data.frame(ASV=normalized_table[[feature]],
+ matrix(data = NA,
+ nrow = nrow(normalized_table),
+ ncol = length(samplesdropped)
+ )
+ )
+colnames(missing_df) <- c(feature, samplesdropped)
+
+group_means_df <- normalized_table[feature]
+walk(group_levels, function(group_level){
+
+ mean_col <- glue("Group.Mean_({group_level})")
+ std_col <- glue("Group.Stdev_({group_level})")
+
+ # Samples that belong to the current group
+ Samples <- metadata %>%
+ filter(!!sym(groups_colname) == group_level) %>%
+ pull(!!sym(samples_column))
+ # Samples that belong to the current group that are in the normalized table
+ Samples <- intersect(colnames(normalized_table), Samples)
+
+ temp_df <- normalized_table %>% select(!!feature, all_of(Samples)) %>%
+ rowwise() %>%
+ mutate(!!mean_col := mean(c_across(where(is.numeric)), na.rm = TRUE),
+ !!std_col := sd(c_across(where(is.numeric)), na.rm = TRUE) ) %>%
+ select(!!feature,!!sym(mean_col), !!sym(std_col))
+
+ group_means_df <<- group_means_df %>% left_join(temp_df)
+
+})
+
+# Append missing samples columns to normalized table
+normalized_table <- normalized_table %>%
+ left_join(missing_df, by = feature) %>%
+ select(!!feature, all_of(samples))
+
+# Calculate global mean and standard deviation
+All_mean_sd <- normalized_table %>%
+ rowwise() %>%
+ mutate(All.mean=mean(c_across(where(is.numeric)), na.rm = TRUE),
+ All.stdev=sd(c_across(where(is.numeric)), na.rm = TRUE) ) %>%
+ select(!!feature, All.mean, All.stdev)
+
+# Append the taxonomy table to the ncbi and stats table
+merged_df <- df %>%
+ left_join(taxonomy_table %>%
+ as.data.frame() %>%
+ rownames_to_column(feature)) %>%
+ select(!!feature,domain:species,everything())
+
+# Combine tables in the desired order
+merged_df <- merged_df %>%
+ select(!!sym(feature):NCBI_id) %>%
+ left_join(normalized_table, by = feature) %>%
+ left_join(merged_df) %>%
+ left_join(All_mean_sd) %>%
+ left_join(group_means_df, by = feature)
+
+# Writing out results of differential abundance using ANCOMBC2...
+output_file <- glue("{diff_abund_out_dir}{output_prefix}ancombc2_differential_abundance{assay_suffix}.csv")
+# Write out merged stats table but before that
+# drop ANCOMBC inferred differential abundance columns
+write_csv(merged_df %>%
+ select(-starts_with("diff_")),
+ output_file)
+
+# ---------------------- Visualization --------------------------------------- #
+# ------------ Make volcano ---------------- #
+volcano_plots <- map(uniq_comps, function(comparison){
+
+ comp_col <- c(
+ glue("Lnfc_{comparison}"),
+ glue("Lnfc.SE_{comparison}"),
+ glue("Stat_{comparison}"),
+ glue("P.value_{comparison}"),
+ glue("Q.value_{comparison}"),
+ glue("Diff_{comparison}"),
+ glue("Passed_ss_{comparison}")
+ )
+
+ sub_res_df <- res_df %>%
+ select(!!feature, all_of(comp_col))
+ colnames(sub_res_df) <- str_replace_all(colnames(sub_res_df),
+ pattern = "(.+)_.+",
+ replacement = "\\1")
+ # Set default qvalue and plot dimensions.
+ p_val <- 0.1
+ plot_width_inches <- 11.1
+ plot_height_inches <- 8.33
+
+ groups_vec <- comparison %>%
+ str_replace_all("\\)v\\(", ").vs.(") %>%
+ str_remove_all("\\(|\\)") %>%
+ str_split(".vs.") %>% unlist
+
+ group1 <- groups_vec[1]
+ group2 <- groups_vec[2]
+
+ ######Long x-axis label adjustments##########
+ x_label <- glue("ln Fold Change\n\n( {group1} vs {group2} )")
+ label_length <- nchar(x_label)
+ max_allowed_label_length <- plot_width_inches * 10
+
+ # Construct x-axis label with new line breaks if was too long
+ if (label_length > max_allowed_label_length){
+ x_label <- glue("ln Fold Change\n\n( {group1} \n vs \n {group2} )")
+ }
+ #######################################
+
+ p <- ggplot(sub_res_df %>% mutate(diff = Q.value <= p_val),
+ aes(x=Lnfc, y=-log10(Q.value), color=diff, label=!!sym(feature))) +
+ geom_point(alpha=0.7, size=2) +
+ scale_color_manual(values=c("TRUE"="red", "FALSE"="black"),
+ labels=c(paste0("qval > ", p_val),
+ paste0("qval \u2264 ", p_val))) +
+ geom_hline(yintercept = -log10(p_val), linetype = "dashed") +
+ ggrepel::geom_text_repel(show.legend = FALSE) +
+ expandy(-log10(sub_res_df$Q.value)) + # Expand plot y-limit
+ coord_cartesian(clip = 'off') +
+ scale_y_continuous(oob = scales::oob_squish_infinite) + # prevent clipping of infinite values
+ labs(x= x_label, y="-log10(Q-value)",
+ title = "Volcano Plot", color=NULL,
+ caption = glue("dotted line: q-value = {p_val}")) +
+ theme_bw() +
+ theme(legend.position="top", legend.key = element_rect(colour=NA),
+ plot.caption = element_text(face = 'bold.italic'))
+
+ # Save plot
+ file_name <- glue("{output_prefix}{comparison %>% str_replace_all('[:space:]+','_')}_volcano.png")
+ ggsave(filename = file_name,
+ plot = p, device = "png",
+ width = plot_width_inches,
+ height = plot_height_inches,
+ units = "in", dpi = 300, path = diff_abund_out_dir)
+
+ return(p)
+})
+```
+
+**Custom Functions Used**
+
+* [expandy()](#expandy)
+* [get_ncbi_ids()](#get_ncbi_ids)
+* [fix_names()](#fix_names)
+* [ancombc2()](#ancombc2) (the wrapper function that calls ANCOMBC::ancombc2)
+
+**Parameter Definitions:**
+
+* `ancombc2()` - ANCOMBC::ancombc2 function (*pipeline uses default values unless defined below*)
+ * `data` - a TreeSummarizedExperiment object created from `feature_table` input data
+ * `fix_formula` - a string specifying the variable in the metadata to use for the fixed effects formula (e.g. group names), set by `groups_colname` input data
+ * `prv_cut` - fraction between 0 and 1 specifying the taxon prevalence cut-off, set by `prevalence_cutoff` input data
+ * `lib_cut` - a numerical threshold for filtering samples based on library sizes, set by `library_cutoff` input data
+ * `group` - the name of the group variable in the metadata, set by `groups_colname` input data
+ * `struc_zero` - logical value indicating whether or not group-wise structural zeros should be detected, set by `remove_struc_zero` input data
+ * `n_cl` - specifies the number of processes to run in parallel, set to `threads` input data
+ * `p_adj_method` - a string specifying the p-value adjustment method for multiple comparisons testing, set to "fdr" to standardize the multiple comparisons method across all three differential abundance methods.
+ * `pairwise` - logical value indicating whether or not to perform the pairwise directional test, set to "TRUE" to compute all pairwise comparisons
+ * `iter_control` - a named list of control parameters for the iterative MLE or RMEL algorithm
+ * `tol` - iteration convergence tolerance, set to "1e-5" to match ancombc
+ * `mdfdr_control` - a named list of control parameters for mixed directional false discovery rate (mdFDR)
+ * `fwer_ctrl_method` - family-wise error controlling procedure, set to 'fdr' to match p_adj_method
+ * `lme_control` - a named list of control parameters for mixed model fitting, set to 'NULL' to disable
+ * `verbose` - logical value specifying whether or not to generate verbose output
+
+**Input Data:**
+
+* `feature` (a string specifying the feature type, i.e. "ASV" or "OTU")
+* `groups_colname` (a string specifying the name of the column in the metadata table containing the group names)
+* `samples_column` (a string specifying the name of the column in the metadata table containing the sample names)
+* `assay_suffix` (a string specifying the suffix to be added to output files; default is the GeneLab assay suffix, "_GLAmpSeq")
+* `output_prefix` (a string specifying an additional prefix to be added to the output files; default is no additional prefix, "")
+* `threads` (a number specifying the number of CPUs to use for parallel processing)
+* `prevalence_cutoff` (a decimal between 0 and 1 specifying the proportion of samples required to contain a taxon in order to keep the taxon when `remove_rare` (set in [Step 6b.v. Preprocessing](#6bv-preprocessing)) is set to TRUE; default is 0, i.e. do not exclude any taxon/feature)
+* `library_cutoff` (a numerical value specifying the number of total counts a sample must have across all features to be retained when `remove_rare` (set in [Step 6b.v. Preprocessing](#6bv-preprocessing)) is set to TRUE; default is 0, i.e. no samples will be dropped)
+* `target_region` (a string specifying the amplicon target region; options are either "16S", "18S", or "ITS")
+* `remove_struc_zero` (a boolean value specifying whether or not structural zeros (i.e. ASVs with zero counts across all samples of at least one group) should be removed; default is FALSE, i.e. structural zeros won't be removed)
+* `metadata` (a dataframe containing the sample metadata, with samples as row names and sample info as columns, output from [6b.iv. Read-in Input Tables](#6biv-read-in-input-tables))
+* `feature_table` (a dataframe containing a filtered subset of the samples feature dataframe (i.e. ASV), output from [6b.v. Preprocessing](#6bv-preprocessing))
+* `taxonomy_table` (a dataframe containing a filtered subset of the feature taxonomy dataframe with ASV taxonomy assignments, output from [6b.v. Preprocessing](#6bv-preprocessing))
+
+**Output Data:**
+
+* **differential_abundance/ancombc2/(group1)v(group2)_volcano.png** (volcano plots for each pairwise comparison)
+* **differential_abundance/ancombc2/ancombc2_differential_abundance_GLAmpSeq.csv** (a comma-separated ANCOM-BC2 differential abundance results table containing the following columns:
+ - ASV (identified ASVs)
+ - taxonomic assignment columns
+ - NCBI identifier for the best taxonomic assignment for each ASV
+ - Normalized abundance values for each ASV for each sample
+ - For each pairwise group comparison:
+ - natural log of the fold change (Lnfc)
+ - standard error for the lnFC (Lnfc.SE)
+ - test statistic from the primary result (Stat)
+ - P-value (P.value)
+ - Adjusted p-value (Q.value)
+ - All.mean (mean across all samples)
+ - All.stdev (standard deviation across all samples)
+ - For each group:
+ - Group.Mean_(group) (mean within group)
+ - Group.Stdev_(group) (standard deviation within group))
+
+
+
+---
+
+### 10c. DESeq2
+
+```R
+# Create output directory if it doesn't already exist
+diff_abund_out_dir <- "differential_abundance/deseq2/"
+if(!dir.exists(diff_abund_out_dir)) dir.create(diff_abund_out_dir, recursive = TRUE)
+metadata <- {DATAFRAME}
+feature_table <- {DATAFRAME}
+taxonomy_table <- {DATAFRAME}
+feature <- "ASV"
+samples_column <- "Sample Name"
+groups_colname <- "groups"
+group <- groups_colname # alias for the group column name, used in the contrasts below
+assay_suffix <- "_GLAmpSeq"
+target_region <- "16S" # "16S", "18S" or "ITS"
+output_prefix <- ""
+
+# Get long ASV taxonomy names and clean
+species <- taxonomy_table %>%
+ unite(species,domain:species,sep = ";") %>%
+  pull() %>% str_replace_all("Other", "_")
+
+taxonomy_table <- fix_names(taxonomy_table, "Other", ";_")
+
+taxonomy_table[,"species"] <- species
+
+# Create phyloseq object from the feature, taxonomy and metadata tables
+ASV_physeq <- phyloseq(otu_table(feature_table, taxa_are_rows = TRUE),
+ tax_table(as.matrix(taxonomy_table)),
+ sample_data(metadata))
+# Convert the phyloseq object to a deseq object
+deseq_obj <- phyloseq_to_deseq2(physeq = ASV_physeq,
+ design = reformulate(groups_colname))
+
+# Add pseudocount if any 0 count samples are present
+if (sum(colSums(counts(deseq_obj)) == 0) > 0) {
+ # Add pseudo count of 1
+ count_data <- counts(deseq_obj) + 1
+  # Make all columns integer type
+ count_data <- as.matrix(apply(count_data, 2, as.integer))
+ rownames(count_data) <- rownames(counts(deseq_obj))
+ colnames(count_data) <- colnames(counts(deseq_obj))
+ counts(deseq_obj) <- count_data
+}
+
+# ---------------------- Run DESeq ---------------------------------- #
+# https://rdrr.io/bioc/phyloseq/src/inst/doc/phyloseq-mixture-models.R
+deseq_modeled <- tryCatch({
+ # Attempt to run DESeq, if error occurs then attempt an alternative
+ # size factor estimation method
+ DESeq(deseq_obj)
+}, error = function(e) {
+  message("Error encountered in DESeq, applying alternative method for size factor estimation...")
+
+ geoMeans <- apply(counts(deseq_obj), 1, gm_mean)
+
+ # Apply the alternative size factor estimation method
+ deseq_obj <- estimateSizeFactors(deseq_obj, geoMeans=geoMeans)
+
+ # Call DESeq again with alternative geom mean size est
+ tryCatch({
+ DESeq(deseq_obj)
+ }, error = function(e2) {
+
+ writeLines(c("Error:", e2$message,
+ "\nUsing gene-wise estimates as final estimates instead of standard curve fitting."),
+ file.path(diff_abund_out_dir, glue("{output_prefix}deseq2_warning.txt")))
+
+ # Use gene-wise estimates as final estimates
+ deseq_obj <- estimateDispersionsGeneEst(deseq_obj)
+ dispersions(deseq_obj) <- mcols(deseq_obj)$dispGeneEst
+ # Continue with testing using nbinomWaldTest
+ nbinomWaldTest(deseq_obj)
+ })
+})
+
+
+# Make ASV Sparsity plot
+sparsity_plot <- plotSparsity(deseq_modeled)
+ggsave(filename = glue("{diff_abund_out_dir}/{output_prefix}asv_sparsity_plot.png"),
+ plot = sparsity_plot, width = 14, height = 10, dpi = 300, units = "in")
+
+# Get unique group comparison as a matrix
+pairwise_comp.m <- utils::combn((metadata[,group] %>% unique %>% sort), 2)
+pairwise_comp_df <- pairwise_comp.m %>% as.data.frame
+# Set the colnames as group1vgroup2
+colnames(pairwise_comp_df) <- map_chr(pairwise_comp_df,
+ \(col) str_c(col, collapse = "v"))
+comparisons <- colnames(pairwise_comp_df)
+names(comparisons) <- comparisons
+
+# Retrieve statistics table
+merged_stats_df <- data.frame(ASV=rownames(feature_table))
+colnames(merged_stats_df) <- feature
+
+walk(pairwise_comp_df, function(col){
+
+ group1 <- col[1]
+ group2 <- col[2]
+
+# Retrieve the statistics table for the current pair and rename the columns
+df <- results(deseq_modeled, contrast = c(group, group2, group1)) %>%
+ data.frame() %>%
+ rownames_to_column(feature) %>%
+ set_names(c(feature ,
+ glue("baseMean_({group2})v({group1})"),
+ glue("Log2fc_({group2})v({group1})"),
+ glue("lfcSE_({group2})v({group1})"),
+ glue("Stat_({group2})v({group1})"),
+ glue("P.value_({group2})v({group1})"),
+ glue("Adj.p.value_({group2})v({group1})")
+ ))
+
+ merged_stats_df <<- merged_stats_df %>%
+ dplyr::left_join(df, join_by(!!feature))
+})
+
+# ---------------------- Add NCBI id to feature, i.e. ASV
+# Get the best / lowest possible taxonomy assignment for the features, i.e. ASVs
+tax_names <- map_chr(str_replace_all(taxonomy_table$species, ";_","") %>%
+ str_split(";"),
+ function(row) row[length(row)])
+
+df <- data.frame(ASV=rownames(taxonomy_table), best_taxonomy=tax_names)
+colnames(df) <- c(feature, "best_taxonomy")
+
+# Pull NCBI IDS for unique taxonomy names
+# Filter out unannotated entries before querying NCBI
+valid_taxonomy <- df$best_taxonomy %>% unique() %>% setdiff("_")
+df2_valid <- data.frame(best_taxonomy = valid_taxonomy) %>%
+ mutate(NCBI_id=get_ncbi_ids(best_taxonomy, target_region),
+ .after = best_taxonomy)
+
+# Add unannotated entries with NA NCBI_id
+df2_invalid <- data.frame(best_taxonomy = "_", NCBI_id = NA)
+df2 <- rbind(df2_valid, df2_invalid)
+
+# -------- Retrieve deseq normalized table from the deseq model
+normalized_table <- counts(deseq_modeled, normalized=TRUE) %>%
+ as.data.frame() %>%
+ rownames_to_column(feature)
+
+# Creating a dataframe of samples that were dropped because they didn't
+# meet our cut-off criteria
+samples <- metadata[[samples_column]]
+samplesdropped <- setdiff(x = samples, y = colnames(normalized_table)[-1])
+missing_df <- data.frame(ASV=normalized_table[[feature]],
+ matrix(data = NA,
+ nrow = nrow(normalized_table),
+ ncol = length(samplesdropped)
+ )
+)
+colnames(missing_df) <- c(feature,samplesdropped)
+
+# Calculate mean and standard deviation of all ASVs for each group in
+# a dataframe called group_means_df
+group_levels <- metadata[, groups_colname] %>% unique() %>% sort()
+group_means_df <- normalized_table[feature]
+walk(group_levels, function(group_level){
+
+ # Initializing mean and std column names
+ mean_col <- glue("Group.Mean_({group_level})")
+ std_col <- glue("Group.Stdev_({group_level})")
+
+ # Get a vector of samples that belong to the current group
+ Samples <- metadata %>%
+ filter(!!sym(groups_colname) == group_level) %>%
+ pull(!!sym(samples_column))
+ # Retain only samples that belong to the current group that are in the normalized table
+ Samples <- intersect(colnames(normalized_table), Samples)
+
+ # Calculate the means and standard deviations for the current group
+ temp_df <- normalized_table %>% select(!!feature, all_of(Samples)) %>%
+ rowwise() %>%
+ mutate(!!mean_col := mean(c_across(where(is.numeric)), na.rm = TRUE),
+ !!std_col := sd(c_across(where(is.numeric)), na.rm = TRUE) ) %>%
+ select(!!feature,!!sym(mean_col), !!sym(std_col))
+
+ group_means_df <<- group_means_df %>% left_join(temp_df)
+
+})
+
+# Append mean, standard deviation and missing samples to the normalized table
+normalized_table <- normalized_table %>%
+ left_join(missing_df, by = feature) %>%
+ select(!!feature, all_of(samples))
+
+# Calculate global means and standard deviations
+All_mean_sd <- normalized_table %>%
+ rowwise() %>%
+ mutate(All.mean=mean(c_across(where(is.numeric)), na.rm = TRUE),
+ All.stdev=sd(c_across(where(is.numeric)), na.rm = TRUE) ) %>%
+ select(!!feature, All.mean, All.stdev)
+
+# Add taxonomy
+merged_df <- df %>% # statistics table
+ left_join(taxonomy_table %>%
+ as.data.frame() %>%
+ rownames_to_column(feature)) %>% # append taxonomy table
+ select(!!feature, domain:species,everything()) # select columns of interest
+
+# Merge all prepared tables in the desired order
+merged_df <- merged_df %>%
+ select(!!sym(feature):NCBI_id) %>% # select only the features and NCBI ids
+ left_join(normalized_table, by = feature) %>% # append the normalized table
+ left_join(merged_df) %>% # append the stats table
+ left_join(All_mean_sd) %>% # append the global/ASV means and stds
+ left_join(group_means_df, by = feature) %>% # append the group means and stds
+  mutate(across(where(is.matrix), as.numeric)) # convert matrix columns to numeric columns
+
+# Defining the output file
+output_file <- glue("{diff_abund_out_dir}/{output_prefix}deseq2_differential_abundance{assay_suffix}.csv")
+# Writing out results of differential abundance using DESeq2
+# after dropping baseMean columns
+write_csv(merged_df %>%
+ select(-starts_with("baseMean_")),
+ output_file)
+
+# ------------------------- Make volcano plots ------------------------ #
+# Loop over group pairs and make a volcano comparing the pair
+walk(pairwise_comp_df, function(col){
+
+ group1 <- col[1]
+ group2 <- col[2]
+
+ # Setting plot dimensions
+ plot_width_inches <- 11.1
+ plot_height_inches <- 8.33
+  p_val <- 0.1 # adjusted p-value cutoff
+
+ # Retrieve data for plotting
+ deseq_res <- results(deseq_modeled, contrast = c(group, group2, group1))
+ volcano_data <- as.data.frame(deseq_res)
+ volcano_data <- volcano_data[!is.na(volcano_data$padj), ]
+ volcano_data$significant <- volcano_data$padj <= p_val
+
+ ######Long x-axis label adjustments##########
+ x_label <- glue("Log2 Fold Change\n\n( {group2} vs {group1} )")
+ label_length <- nchar(x_label)
+ max_allowed_label_length <- plot_width_inches * 10
+
+ # Construct x-axis label with new line breaks if was too long
+ if (label_length > max_allowed_label_length){
+ x_label <- glue("Log2 Fold Change\n\n( {group2} \n vs \n {group1} )")
+ }
+ #######################################
+
+ # ASVs promoted in space on right, reduced on left
+ p <- ggplot(volcano_data %>%
+ as.data.frame() %>%
+ rownames_to_column(feature),
+ aes(x = log2FoldChange, y = -log10(padj),
+ color = significant, label = !!sym(feature))
+ ) +
+ geom_point(alpha=0.7, size=2) +
+ geom_hline(yintercept = -log10(p_val), linetype = "dashed") +
+ scale_color_manual(values=c("black", "red"),
+ labels=c(paste0("padj > ", p_val),
+ paste0("padj \u2264 ", p_val))) +
+ ggrepel::geom_text_repel(show.legend = FALSE) +
+ expandy(-log10(volcano_data$padj)) + # Expand plot y-limit
+ coord_cartesian(clip = 'off') +
+ scale_y_continuous(oob = scales::oob_squish_infinite) +
+ theme_bw() +
+ labs(title = "Volcano Plot",
+ x = x_label,
+ y = "-Log10 P-value",
+ color = NULL,
+ caption = glue("dotted line: padj = {p_val}")) +
+ theme(legend.position="top", legend.key = element_rect(colour=NA),
+ plot.caption = element_text(face = 'bold.italic'))
+
+ # --- Save Plot
+ # Replace space in group name with underscore
+ group1 <- str_replace_all(group1, "[:space:]+", "_")
+ group2 <- str_replace_all(group2, "[:space:]+", "_")
+ ggsave(filename = glue("{output_prefix}({group2})v({group1})_volcano.png"),
+ plot = p,
+ width = plot_width_inches,
+ height = plot_height_inches,
+ dpi = 300,
+ path = diff_abund_out_dir)
+})
+```
+
+**Custom Functions Used:**
+
+* [expandy()](#expandy)
+* [get_ncbi_ids()](#get_ncbi_ids)
+* [fix_names()](#fix_names)
+* [gm_mean()](#gm_mean)
+* [plotSparsity()](#plotSparsity)
+
+**Parameter Definitions:**
+* *pipeline uses default values for `DESeq()` analysis*
+
+**Input Data:**
+
+* `feature` (a string specifying the feature type, i.e. "ASV" or "OTU")
+* `groups_colname` (a string specifying the name of the column in the metadata table containing the group names)
+* `samples_column` (a string specifying the name of the column in the metadata table containing the sample names)
+* `assay_suffix` (a string specifying the suffix to be added to output files; default is the GeneLab assay suffix, "_GLAmpSeq")
+* `output_prefix` (a string specifying an additional prefix to be added to the output files; default is no additional prefix, "")
+* `target_region` (a string specifying the amplicon target region; options are either "16S", "18S", or "ITS")
+* `metadata` (a dataframe containing the sample metadata, with samples as row names and sample info as columns, output from [6b.iv. Read-in Input Tables](#6biv-read-in-input-tables))
+* `feature_table` (a dataframe containing a filtered subset of the samples feature dataframe (i.e. ASV), output from [6b.v. Preprocessing](#6bv-preprocessing))
+* `taxonomy_table` (a dataframe containing a filtered subset of the feature taxonomy dataframe with ASV taxonomy assignments, output from [6b.v. Preprocessing](#6bv-preprocessing))
+
+**Output Data:**
+
+* **differential_abundance/deseq2/(group1)v(group2)_volcano.png** (volcano plots for each pairwise comparison)
+* **differential_abundance/deseq2/deseq2_differential_abundance_GLAmpSeq.csv** (a comma-separated DESeq2 differential abundance results table containing the following columns:
+ - ASV (identified ASVs)
+ - taxonomic assignment columns
+ - NCBI identifier for the best taxonomic assignment for each ASV
+ - Normalized abundance values for each ASV for each sample
+ - For each pairwise group comparison:
+ - log2 of the fold change (Log2fc)
+ - standard error for the log2FC (lfcSE)
+ - test statistic from the primary result (Stat)
+ - P-value (P.value)
+ - Adjusted p-value (Adj.p.value)
+ - All.mean (mean across all samples)
+ - All.stdev (standard deviation across all samples)
+ - For each group:
+ - Group.Mean_(group) (mean within group)
+ - Group.Stdev_(group) (standard deviation within group))
+* **differential_abundance/deseq2/asv_sparsity_plot.png** (a diagnostic plot of ASV sparsity to be used to assess if running DESeq2 is appropriate)
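+
+As with the ANCOM-BC tables, the DESeq2 results table can be examined outside the pipeline. The sketch below (illustration only; the comparison label is hypothetical) splits significant features into those enriched or depleted in the first group of the comparison label, based on the sign of the log2 fold change:
+
+```R
+library(readr)
+library(dplyr)
+
+# Read in the DESeq2 differential abundance results table
+dab <- read_csv("differential_abundance/deseq2/deseq2_differential_abundance_GLAmpSeq.csv")
+
+# Hypothetical pairwise comparison label - replace with one present in your table
+comparison <- "(Flight)v(Ground Control)"
+padj_col <- paste0("Adj.p.value_", comparison)
+lfc_col  <- paste0("Log2fc_", comparison)
+
+sig      <- dab %>% filter(.data[[padj_col]] <= 0.05)
+enriched <- sig %>% filter(.data[[lfc_col]] > 0) # higher in the first group of the label
+depleted <- sig %>% filter(.data[[lfc_col]] < 0) # lower in the first group of the label
+```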
+
+
+---
diff --git a/Amplicon/Illumina/README.md b/Amplicon/Illumina/README.md
index db5a8d75..a2b54fbd 100644
--- a/Amplicon/Illumina/README.md
+++ b/Amplicon/Illumina/README.md
@@ -1,13 +1,13 @@
# GeneLab bioinformatics processing pipeline for Illumina amplicon sequencing data
-> **The document [`GL-DPPD-7104-B.md`](Pipeline_GL-DPPD-7104_Versions/GL-DPPD-7104-B.md) holds an overview and example commands for how GeneLab processes Illumina amplicon sequencing datasets. See the [Repository Links](#repository-links) descriptions below for more information. Processed data output files and processing code are provided for each GLDS dataset in the [Open Science Data Repository (OSDR)](https://osdr.nasa.gov/bio/repo/).**
+> **The document [`GL-DPPD-7104-C.md`](Pipeline_GL-DPPD-7104_Versions/GL-DPPD-7104-C.md) holds an overview and example commands for how GeneLab processes Illumina amplicon sequencing datasets. See the [Repository Links](#repository-links) descriptions below for more information. Processed data output files and processing code are provided for each GLDS dataset in the [Open Science Data Repository (OSDR)](https://osdr.nasa.gov/bio/repo/).**
>
> Note: The exact processing commands and AmpIllumina version used for specific GLDS datasets can be found in the *_processing_info.zip file under "Files" for each respective GLDS dataset in the [Open Science Data Repository (OSDR)](https://osdr.nasa.gov/bio/repo/).
---
-
+
---
@@ -30,6 +30,7 @@
**Developed by:**
Michael D. Lee (Mike.Lee@nasa.gov)
**Maintained by:**
+Olabiyi A. Obayomi (olabiyi.a.obayomi@nasa.gov)
Michael D. Lee (Mike.Lee@nasa.gov)
Alexis Torres (alexis.torres@nasa.gov)
Amanda Saravia-Butler (amanda.m.saravia-butler@nasa.gov)
diff --git a/Amplicon/Illumina/Workflow_Documentation/NF_AmpIllumina b/Amplicon/Illumina/Workflow_Documentation/NF_AmpIllumina
new file mode 160000
index 00000000..216dcadc
--- /dev/null
+++ b/Amplicon/Illumina/Workflow_Documentation/NF_AmpIllumina
@@ -0,0 +1 @@
+Subproject commit 216dcadc250b286af92a5ba0b8830e842b44cb61
diff --git a/Amplicon/Illumina/Workflow_Documentation/README.md b/Amplicon/Illumina/Workflow_Documentation/README.md
index 450d8c86..4285062b 100644
--- a/Amplicon/Illumina/Workflow_Documentation/README.md
+++ b/Amplicon/Illumina/Workflow_Documentation/README.md
@@ -6,9 +6,11 @@
|Pipeline Version|Current Workflow Version (for respective pipeline version)|
|:---------------|:---------------------------------------------------------|
-|*[GL-DPPD-7104-B.md](../Pipeline_GL-DPPD-7104_Versions/GL-DPPD-7104-B.md)|[1.2.2](SW_AmpIllumina-B)|
-|[GL-DPPD-7104-A.md](../Pipeline_GL-DPPD-7104_Versions/GL-DPPD-7104-A.md)|[1.1.1](SW_AmpIllumina-A)|
+|*[GL-DPPD-7104-C.md](../Pipeline_GL-DPPD-7104_Versions/GL-DPPD-7104-C.md)|[NF_AmpIllumina_1.0.0](https://github.com/nasa/GeneLab_AmpliconSeq_Workflow)|
+|[GL-DPPD-7104-B.md](../Pipeline_GL-DPPD-7104_Versions/GL-DPPD-7104-B.md)|[SW_AmpIllumina-B_1.2.3](SW_AmpIllumina-B)|
+|[GL-DPPD-7104-A.md](../Pipeline_GL-DPPD-7104_Versions/GL-DPPD-7104-A.md)|[SW_AmpIllumina-A_1.1.1](SW_AmpIllumina-A)|
*Current GeneLab Pipeline/Workflow Implementation
-> See the [workflow change log](SW_AmpIllumina-B/CHANGELOG.md) to access previous workflow versions and view all changes associated with each version update.
+> See the [NF_AmpIllumina Change Log](https://github.com/nasa/GeneLab_AmpliconSeq_Workflow/blob/main/CHANGELOG.md) to access the most recent changes to the workflow and view all changes associated with each update.
+> All workflow changes associated with the previous version of the GeneLab Amplicon Pipeline ([GL-DPPD-7104-B](https://github.com/nasa/GeneLab_Data_Processing/blob/master/Amplicon/Illumina/Pipeline_GL-DPPD-7104_Versions/GL-DPPD-7104-B.md) and earlier) can be found in the [SW_AmpIllumina-B Change Log](https://github.com/nasa/GeneLab_Data_Processing/blob/master/Amplicon/Illumina/Workflow_Documentation/SW_AmpIllumina-B/CHANGELOG.md)
diff --git a/Amplicon/Illumina/Workflow_Documentation/SW_AmpIllumina-B/CHANGELOG.md b/Amplicon/Illumina/Workflow_Documentation/SW_AmpIllumina-B/CHANGELOG.md
index 59e04298..49048695 100644
--- a/Amplicon/Illumina/Workflow_Documentation/SW_AmpIllumina-B/CHANGELOG.md
+++ b/Amplicon/Illumina/Workflow_Documentation/SW_AmpIllumina-B/CHANGELOG.md
@@ -35,7 +35,7 @@
- Added [run_workflow.py](workflow_code/scripts/run_workflow.py)
- Sets up runsheet for OSDR datasets. Uses a runsheet to set up [config.yaml](workflow_code/config.yaml) and [unique-sample-IDs.txt](workflow_code/unique-sample-IDs.txt). Runs the Snakemake workflow.
- Updated instructions in [README.md](README.md) to use [run_workflow.py](workflow_code/scripts/run_workflow.py)
-- Added downstream analysis visualizations rule using [Illumina-R-Visualizations.R](workflow_code/scripts/Illumina-R-visualizations.R)
+- Added downstream analysis visualizations rule using [Illumina-R-Visualizations.R](workflow_code/visualizations/Illumina-R-visualizations.R)
- Volcano plots, dendrogram, PCoA, rarefaction, richness, taxonomy plots
diff --git a/Amplicon/Illumina/Workflow_Documentation/SW_AmpIllumina-B/README.md b/Amplicon/Illumina/Workflow_Documentation/SW_AmpIllumina-B/README.md
index 0605b646..72506160 100644
--- a/Amplicon/Illumina/Workflow_Documentation/SW_AmpIllumina-B/README.md
+++ b/Amplicon/Illumina/Workflow_Documentation/SW_AmpIllumina-B/README.md
@@ -1,6 +1,214 @@
# SW_AmpIllumina-B Workflow Information and Usage Instructions
-The SW_AmpIllumina-B workflow is curently under development and will be available soon.
-> A beta version of SW_AmpIllumina-B is available under development branch: [dev2-amplicon-add-runsheet-visualizations](https://github.com/nasa/GeneLab_Data_Processing/tree/dev2-amplicon-add-runsheet-visualizations/Amplicon/Illumina/Workflow_Documentation/SW_AmpIllumina-B)
+## General workflow info
+The current GeneLab Illumina amplicon sequencing data processing pipeline (AmpIllumina), [GL-DPPD-7104-B.md](https://github.com/nasa/GeneLab_Data_Processing/blob/master/Amplicon/Illumina/Pipeline_GL-DPPD-7104_Versions/GL-DPPD-7104-B.md), is implemented as a [Snakemake](https://snakemake.readthedocs.io/en/stable/) workflow and utilizes [conda](https://docs.conda.io/en/latest/) environments to install/run all tools. This workflow (SW_AmpIllumina-B) is run using the command line interface (CLI) of any unix-based system. The workflow can be used even if you are unfamiliar with Snakemake and conda, but if you want to learn more about those, [this Snakemake tutorial](https://snakemake.readthedocs.io/en/stable/tutorial/tutorial.html) within [Snakemake's documentation](https://snakemake.readthedocs.io/en/stable/) is a good place to start for that, and an introduction to conda with installation help and links to other resources can be found [here at Happy Belly Bioinformatics](https://astrobiomike.github.io/unix/conda-intro).
+
+
+---
+
+## Utilizing the workflow
+
+- [1. Install conda, mamba, and `genelab-utils` package](#1-install-conda-mamba-and-genelab-utils-package)
+- [2. Download the workflow template files](#2-download-the-workflow-template-files)
+- [3. Run the workflow using `run_workflow.py`](#3-run-the-workflow-using-run_workflowpy)
+ - [3a. Approach 1: Run the workflow on a GeneLab Amplicon (Illumina) sequencing dataset with automatic retrieval of raw read files and metadata](#3a-approach-1-run-the-workflow-on-a-genelab-amplicon-illumina-sequencing-dataset-with-automatic-retrieval-of-raw-read-files-and-metadata)
+ - [3b. Approach 2: Run the workflow on a non-OSD dataset using a user-created runsheet](#3b-approach-2-run-the-workflow-on-a-non-osd-dataset-using-a-user-created-runsheet)
+- [4. Parameter definitions](#4-parameter-definitions)
+- [5. Additional output files](#5-additional-output-files)
+
+
+
+___
+
+### 1. Install conda, mamba, and `genelab-utils` package
+We recommend installing a Miniconda Python3 version appropriate for your system, as exemplified in [the above link](https://astrobiomike.github.io/unix/conda-intro#getting-and-installing-conda).
+
+Once conda is installed on your system, we recommend installing [mamba](https://github.com/mamba-org/mamba#mamba), as it generally allows for much faster conda installations:
+
+```bash
+conda install -n base -c conda-forge mamba
+```
+
+> You can read a quick intro to mamba [here](https://astrobiomike.github.io/unix/conda-intro#bonus-mamba-no-5) if wanted.
+
+Once mamba is installed, you can install the genelab-utils conda package in a new environment with the following command:
+
+```bash
+mamba create -n genelab-utils -c conda-forge -c bioconda -c defaults -c astrobiomike 'genelab-utils==1.3.35'
+```
+
+The environment then needs to be activated and updated by running the following commands:
+
+```bash
+conda activate genelab-utils
+pip install --upgrade pyOpenSSL
+```
+
+
+___
+
+### 2. Download the workflow template files
+
+
+All files required for utilizing the GeneLab workflow for processing Illumina amplicon sequencing data are in the [workflow_code](workflow_code) directory. To get a copy of the latest SW_AmpIllumina-B version on to your system, run the following command:
+
+```bash
+GL-get-workflow Amplicon-Illumina
+```
+
+This downloads the workflow into a directory called `SW_AmpIllumina-*/`, with the workflow version number at the end.
+
+> Note: If wanting an earlier version, the wanted version can be provided as an optional argument like so:
+> ```bash
+> GL-get-workflow Amplicon-Illumina --wanted-version 1.0.0
+> ```
+
+
+
+___
+
+### 3. Run the workflow using `run_workflow.py`
+
+While in the `SW_AmpIllumina-*/` directory that was downloaded in [step 2](#2-download-the-workflow-template-files), you are now able to run the workflow using the `run_workflow.py` script in the [scripts/](workflow_code/scripts) sub-directory to set up the configuration files needed to execute the workflow.
+
+> Note: The commands to run the workflow in each approach listed below allow for two sets of options. The options specified outside of the quotation marks are specific to the `run_workflow.py` script, and the options specified within the quotation marks are specific to `snakemake`.
+
+
+
+___
+
+#### 3a. Approach 1: Run the workflow on a GeneLab Amplicon (Illumina) sequencing dataset with automatic retrieval of raw read files and metadata
+
+> This approach processes data hosted on the [NASA Open Science Data Repository (OSDR)](https://osdr.nasa.gov/bio/repo/). Upon execution, the command downloads then parses the OSD ISA.zip file to create a runsheet containing link(s) to the raw reads and the metadata required for processing. The runsheet is then used to prepare the necessary configuration files before executing the workflow using the specified Snakemake run command.
+
+```bash
+python ./scripts/run_workflow.py --OSD OSD-487 --target 16S --run "snakemake --use-conda --conda-prefix ${CONDA_PREFIX}/envs -j 2 -p"
+```
+
+
+
+___
+
+#### 3b. Approach 2: Run the workflow on a non-OSD dataset using a user-created runsheet
+
+> If processing a non-OSD dataset, you must manually create the runsheet for your dataset to run the workflow. Specifications for creating a runsheet manually are described [here](examples/runsheet/README.md).
+
+```bash
+python ./scripts/run_workflow.py --runsheetPath <path/to/runsheet.csv> --run "snakemake --use-conda --conda-prefix ${CONDA_PREFIX}/envs -j 2 -p"
+```
+
+
+
+___
+
+### 4. Parameter definitions
+
+
+
+**Parameter definitions for `run_workflow.py`:**
+
+* `--OSD OSD-###` - specifies the OSD dataset to process through the SW_AmpIllumina workflow (replace ### with the OSD number)
+ > *Used for Approach 1 only.*
+
+* `--target` - this is a required parameter that specifies the genomic target for the assay. Options: 16S, 18S, ITS. This determines which reference database is used for taxonomic classification, and it is used to select the appropriate dataset from an OSD study when multiple options are available.
+ > *Note: Swift Amplicon 16S+ITS datasets, which use multiple primer sets, are not suitable for processing with this workflow. OSD datasets of this type are processed using alternative methods.*
+
+* `--runsheetPath` - specifies the path to a local runsheet containing the metadata and raw reads location (as a link or local path), used for processing a non-OSD dataset through the SW_AmpIllumina workflow
+ > *Optionally used for Approach 2 only, the form can be used instead of providing a runsheet on the NASA EDGE platform.*
+
+* `--run` - specifies the command used to execute the snakemake workflow; snakemake-specific parameters are defined below
+
+* `--outputDir` - specifies the output directory for the output files generated by the workflow
+  > *This is an optional parameter that can be added outside the quotation marks in either approach to specify the output directory. If this option is not used, the output files will be written to the current working directory, i.e. the `SW_AmpIllumina-B_1.2.3` directory that was downloaded in [step 2](#2-download-the-workflow-template-files).*
+
+* `--trim-primers TRUE/FALSE` - specifies to trim primers (TRUE) or not (FALSE). Default: TRUE
+ > *Note: Primers should virtually always be trimmed from amplicon datasets. This option is here for cases where they have already been removed.*
+
+* `--min_trimmed_length` - specifies the minimum length of trimmed reads during cutadapt filtering. For paired-end data: if one read gets filtered, both reads are discarded. Default: 130
+ > *Note: For paired-end data, all filtering and trimming should leave a minimum of an 8-base overlap of forward and reverse reads.*
+
+* `--primers-linked TRUE/FALSE` - if set to TRUE, instructs cutadapt to treat the primers as linked. Default: FALSE
+ > *Note: See [cutadapt documentation here](https://cutadapt.readthedocs.io/en/stable/recipes.html#trimming-amplicon-primers-from-paired-end-reads) for more info.*
+
+* `--anchor-primers TRUE/FALSE` - indicates if primers should be anchored (TRUE) or not (FALSE) when provided to cutadapt. Default: FALSE
+ > *Note: See [cutadapt documentation here](https://cutadapt.readthedocs.io/en/stable/guide.html#anchored-5adapters) for more info.*
+
+* `--discard-untrimmed TRUE/FALSE` - if set to TRUE, instructs cutadapt to remove reads if the primers were not found in the expected location; if set to FALSE, these reads are kept. Default: TRUE
+
+* `--left-trunc` - dada2 parameter that specifies the length at which to truncate the forward reads; bases beyond this length will be removed and reads shorter than this length are discarded. Default: 0 (no truncation)
+ > *Note: See dada2 [filterAndTrim documentation](https://rdrr.io/bioc/dada2/man/filterAndTrim.html) for more info.*
+
+* `--right-trunc` - dada2 parameter that specifies the length at which to truncate the reverse reads; bases beyond this length will be removed and reads shorter than this length are discarded. Default: 0 (no truncation)
+ > *Note: See dada2 [filterAndTrim documentation](https://rdrr.io/bioc/dada2/man/filterAndTrim.html) for more info.*
+
+* `--left-maxEE` - dada2 parameter that specifies the maximum expected error (maxEE) allowed for each forward read; reads with a higher maxEE than provided will be discarded. Default: 1
+ > *Note: See dada2 [filterAndTrim documentation](https://rdrr.io/bioc/dada2/man/filterAndTrim.html) for more info.*
+
+* `--right-maxEE` - dada2 parameter that specifies the maximum expected error (maxEE) allowed for each reverse read; reads with a higher maxEE than provided will be discarded. Default: 1
+ > *Note: See dada2 [filterAndTrim documentation](https://rdrr.io/bioc/dada2/man/filterAndTrim.html) for more info.*
+
+* `--concatenate_reads_only TRUE/FALSE` - if set to TRUE, specifies to concatenate forward and reverse reads only with dada2 instead of merging paired reads. Default: FALSE
+
+* `--output-prefix ""` - specifies the prefix to use on all output files to distinguish multiple primer sets, leave as an empty string if only one primer set is being processed (if used, be sure to include a connecting symbol, e.g. "ITS-"). Default: ""
+
+* `--specify-runsheet` - specifies the runsheet to use when multiple runsheets are generated
+ > *Optional parameter used in Approach 1 for datasets that have multiple assays for the same amplicon target (e.g. [OSD-249](https://osdr.nasa.gov/bio/repo/data/studies/OSD-249)).*
+
+* `--visualizations TRUE/FALSE` - if set to TRUE, the [visualizations script](workflow_code/visualizations/Illumina-R-visualizations.R) will be run. Default: TRUE
+ > *Note: For instructions on manually executing the visualizations script, refer to the [stand-alone execution documentation](./workflow_code/visualizations/README.md).*
+
+
+
+**Parameter definitions for `snakemake`**
+
+* `--use-conda` – specifies to use the conda environments included in the workflow (these are specified in the [envs](workflow_code/envs) directory)
+* `--conda-prefix` – indicates where the needed conda environments will be stored. Adding this option will also allow the same conda environments to be re-used when processing additional datasets, rather than making new environments each time you run the workflow. The value listed for this option, `${CONDA_PREFIX}/envs`, points to the default location for conda environments (note: the variable `${CONDA_PREFIX}` will be expanded to the appropriate location on whichever system it is run on).
+* `-j` – assigns the number of jobs Snakemake should run concurrently
+* `-p` – specifies to print out each command being run to the screen
+* `--cluster-status` – specifies a script for monitoring the status of jobs on a cluster, improving Snakemake's handling of job timeouts and exceeding memory limits
+ > This is an optional parameter that can be used on a SLURM cluster by adding `--cluster-status scripts/slurm-status.py` to the Snakemake command.
+
+See `snakemake -h` and [Snakemake's documentation](https://snakemake.readthedocs.io/en/stable/) for more options and details.
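+
+As a combined example (hypothetical values; adjust the OSD accession, target, and output path for your dataset and system), several of the `run_workflow.py` options above can be supplied alongside the quoted `snakemake` command:
+
+```bash
+python ./scripts/run_workflow.py --OSD OSD-487 --target 16S \
+       --min_trimmed_length 130 --discard-untrimmed TRUE \
+       --outputDir <path/to/output_directory/> \
+       --run "snakemake --use-conda --conda-prefix ${CONDA_PREFIX}/envs -j 2 -p"
+```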
+
+
+
+___
+
+### 5. Additional output files
+
+The outputs from the `run_workflow.py` and differential abundance analysis (DAA) / visualizations scripts are described below:
+> Note: Outputs from the Amplicon Seq - Illumina pipeline are documented in the [GL-DPPD-7104-B.md](https://github.com/nasa/GeneLab_Data_Processing/blob/master/Amplicon/Illumina/Pipeline_GL-DPPD-7104_Versions/GL-DPPD-7104-B.md) processing protocol.
+
+- **Metadata Outputs:**
+ - \*_AmpSeq_v1_runsheet.csv (table containing metadata required for processing, including the raw reads files location)
+ - \*-ISA.zip (the ISA archive of the OSD datasets to be processed, downloaded from the OSDR)
+ - config.yaml (configuration file containing the metadata from the runsheet (\*_AmpSeq_v1_runsheet.csv), required for running the SW_AmpIllumina workflow)
+ - unique-sample-IDs.txt (text file containing the IDs of each sample used, required for running the SW_AmpIllumina workflow)
+- **DAA and Visualization Outputs:**
+  - dendrogram_by_group_GLAmpSeq.png (Dendrogram of Euclidean distance-based hierarchical clustering of the samples, colored by experimental groups)
+  - PCoA_w_labels_GLAmpSeq.png (Principal Coordinates Analysis plot of VST-transformed ASV counts, with sample labels)
+  - PCoA_without_labels_GLAmpSeq.png (Principal Coordinates Analysis plot of VST-transformed ASV counts, without labels)
+ - rarefaction_curves_GLAmpSeqs.png (Rarefaction plot visualizing species richness for each sample)
+ - richness_and_diversity_estimates_by_sample_GLAmpSeq.png (Chao1 richness estimates and Shannon diversity estimates for each sample)
+ - richness_and_diversity_estimates_by_group_GLAmpSeq.png (Chao1 richness estimates and Shannon diversity estimates for each group)
+  - relative_classes_GLAmpSeq.png (Bar plot taxonomic summary of the proportions of taxa identified in each group, by class)
+  - relative_phyla_GLAmpSeq.png (Bar plot taxonomic summary of the proportions of taxa identified in each group, by phylum)
+  - samplewise_relative_classes_GLAmpSeq.png (Bar plot taxonomic summary of the proportions of taxa identified in each sample, by class)
+  - samplewise_relative_phyla_GLAmpSeq.png (Bar plot taxonomic summary of the proportions of taxa identified in each sample, by phylum)
+ - normalized_counts_GLAmpSeq.tsv (Size factor normalized ASV counts table)
+ - {group1}\_vs_{group2}.csv (Differential abundance tables for all pairwise contrasts of groups)
+ - volcano\_{group1}\_vs_{group2}.png (Volcano plots for all pairwise contrasts of groups)
diff --git a/Amplicon/Illumina/Workflow_Documentation/SW_AmpIllumina-B/workflow_code/Snakefile b/Amplicon/Illumina/Workflow_Documentation/SW_AmpIllumina-B/workflow_code/Snakefile
new file mode 100644
index 00000000..394dde82
--- /dev/null
+++ b/Amplicon/Illumina/Workflow_Documentation/SW_AmpIllumina-B/workflow_code/Snakefile
@@ -0,0 +1,697 @@
+############################################################################################
+## Snakefile for GeneLab's Illumina amplicon workflow ##
+## Version 1.2.1 ##
+## Initially developed by Michael D. Lee (Mike.Lee@nasa.gov) ##
+## Developed and maintained by Michael D. Lee and Alexis Torres (alexis.torres@nasa.gov) ##
+############################################################################################
+
+import os
+
+configfile: "config.yaml"
+
+enable_visualizations = config["enable_visualizations"]
+
+########################################
+############# General Info #############
+########################################
+
+"""
+See the corresponding 'config.yaml' file for general use information.
+Variables that may need to be adjusted should be changed there, not here.
+"""
+
+## example usage command ##
+# snakemake --use-conda --conda-prefix ${CONDA_PREFIX}/envs -j 2 -p
+
+# `--use-conda` – this specifies to use the conda environments included in the workflow
+# `--conda-prefix` – this allows us to point to where the needed conda environments should be stored. Including this means if we use the workflow on a different dataset somewhere else in the future, it will re-use the same conda environments rather than make new ones. The value listed here, `${CONDA_PREFIX}/envs`, is the default location for conda environments (the variable `${CONDA_PREFIX}` will be expanded to the appropriate location on whichever system it is run on).
+# `-j` – this lets us set how many jobs Snakemake should run concurrently (keep in mind that many of the thread and cpu parameters set in the config.yaml file will be multiplied by this)
+# `-p` – specifies to print out each command being run to the screen
+
+# See `snakemake -h` for more options and details.
+
+########################################
+####### Assay-specific GL suffix #######
+########################################
+
+assay_suffix = "_GLAmpSeq"
+
+
+########################################
+#### Reading samples file into list ####
+########################################
+
+sample_IDs_file = config["sample_info_file"]
+sample_ID_list = [line.strip() for line in open(sample_IDs_file)]
+
+# making sure there are all unique names
+if len(set(sample_ID_list)) != len(sample_ID_list):
+
+ print("\n Not all sample IDs in the " + str(config["sample_info_file"]) + " file are unique :(\n")
+ print(" Exiting for now.\n")
+ exit()
+
+########################################
+######## Setting up directories ########
+########################################
+
+# Initialize the list of needed directories without plots_dir
+if config["trim_primers"] == "TRUE":
+ needed_dirs = [
+ config["info_out_dir"],
+ config["fastqc_out_dir"],
+ config["trimmed_reads_dir"],
+ config["filtered_reads_dir"],
+ config["final_outputs_dir"],
+ "benchmarks"
+ ]
+else:
+ needed_dirs = [
+ config["info_out_dir"],
+ config["fastqc_out_dir"],
+ config["filtered_reads_dir"],
+ config["final_outputs_dir"],
+ "benchmarks"
+ ]
+
+# Conditionally add plots_dir if enable_visualizations is True
+if enable_visualizations == "TRUE":
+ needed_dirs.append(config["plots_dir"])
+
+# Try to create the directories
+for dir in needed_dirs:
+ try:
+ os.makedirs(dir, exist_ok=True)
+ except Exception as e:
+ print(f"Could not create directory {dir}: {e}")
+
+########################################
+########## Setting up outputs ##########
+########################################
+
+# Base rule all inputs (final outs) for PE, with or without trimming
+base_PE_inputs = [
+ expand(config["filtered_reads_dir"] + "{ID}" + config["filtered_R1_suffix"], ID = sample_ID_list),
+ expand(config["filtered_reads_dir"] + "{ID}" + config["filtered_R2_suffix"], ID = sample_ID_list),
+ config["final_outputs_dir"] + config["output_prefix"] + f"taxonomy{assay_suffix}.tsv",
+ config["final_outputs_dir"] + config["output_prefix"] + f"taxonomy-and-counts{assay_suffix}.biom.zip",
+ config["final_outputs_dir"] + config["output_prefix"] + f"ASVs{assay_suffix}.fasta",
+ config["final_outputs_dir"] + config["output_prefix"] + f"read-count-tracking{assay_suffix}.tsv",
+ config["final_outputs_dir"] + config["output_prefix"] + f"counts{assay_suffix}.tsv",
+ config["final_outputs_dir"] + config["output_prefix"] + f"taxonomy-and-counts{assay_suffix}.tsv",
+ config["fastqc_out_dir"] + config["output_prefix"] + f"raw_multiqc{assay_suffix}_report.zip",
+ config["fastqc_out_dir"] + config["output_prefix"] + f"filtered_multiqc{assay_suffix}_report.zip"
+]
+
+# Base rule all inputs (final outs) for SE, with or without trimming
+base_SE_inputs = [
+ expand(config["filtered_reads_dir"] + "{ID}" + config["filtered_R1_suffix"], ID = sample_ID_list),
+ config["final_outputs_dir"] + config["output_prefix"] + f"taxonomy{assay_suffix}.tsv",
+ config["final_outputs_dir"] + config["output_prefix"] + f"taxonomy-and-counts{assay_suffix}.biom.zip",
+ config["final_outputs_dir"] + config["output_prefix"] + f"ASVs{assay_suffix}.fasta",
+ config["final_outputs_dir"] + config["output_prefix"] + f"read-count-tracking{assay_suffix}.tsv",
+ config["final_outputs_dir"] + config["output_prefix"] + f"counts{assay_suffix}.tsv",
+ config["final_outputs_dir"] + config["output_prefix"] + f"taxonomy-and-counts{assay_suffix}.tsv",
+ config["fastqc_out_dir"] + config["output_prefix"] + f"raw_multiqc{assay_suffix}_report.zip",
+ config["fastqc_out_dir"] + config["output_prefix"] + f"filtered_multiqc{assay_suffix}_report.zip"
+]
+
+# Add additional inputs for trimming
+if config["trim_primers"] == "TRUE":
+ if config["data_type"] == "PE":
+ base_PE_inputs += [
+ expand(config["trimmed_reads_dir"] + "{ID}" + config["primer_trimmed_R1_suffix"], ID = sample_ID_list),
+ expand(config["trimmed_reads_dir"] + "{ID}" + config["primer_trimmed_R2_suffix"], ID = sample_ID_list),
+ config["trimmed_reads_dir"] + config["output_prefix"] + f"cutadapt{assay_suffix}.log",
+ config["trimmed_reads_dir"] + config["output_prefix"] + f"trimmed-read-counts{assay_suffix}.tsv",
+ ]
+ else: # SE with primer trimming
+ base_SE_inputs += [
+ expand(config["trimmed_reads_dir"] + "{ID}" + config["primer_trimmed_R1_suffix"], ID = sample_ID_list),
+ config["trimmed_reads_dir"] + config["output_prefix"] + f"cutadapt{assay_suffix}.log",
+ config["trimmed_reads_dir"] + config["output_prefix"] + f"trimmed-read-counts{assay_suffix}.tsv",
+ ]
+
+# Conditional addition of visualization outputs (color legend only to keep it simple)
+visualization_outputs = [config["plots_dir"] + config["output_prefix"] + f"color_legend{assay_suffix}.png"] if enable_visualizations == "TRUE" else []
+
+########################################
+############# Rules start ##############
+########################################
+
+#### rules if paired-end data ####
+if config["data_type"] == "PE":
+
+ rule all:
+ input: base_PE_inputs + visualization_outputs
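+ # once all final outputs exist, combine the per-rule benchmark tables and copy the run's metadata into the info output directory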
+ shell:
+ """
+ bash scripts/combine-benchmarks.sh
+ python scripts/copy_info.py
+ """
+
+
+ # R processing rule for paired-end data
+ if config["trim_primers"] == "TRUE":
+
+ rule run_R_PE:
+ conda:
+ "envs/R.yaml"
+ input:
+ expand(config["trimmed_reads_dir"] + "{ID}" + config["primer_trimmed_R1_suffix"], ID = sample_ID_list),
+ expand(config["trimmed_reads_dir"] + "{ID}" + config["primer_trimmed_R2_suffix"], ID = sample_ID_list),
+ config["trimmed_reads_dir"] + config["output_prefix"] + f"trimmed-read-counts{assay_suffix}.tsv"
+ output:
+ expand(config["filtered_reads_dir"] + "{ID}" + config["filtered_R1_suffix"], ID = sample_ID_list),
+ expand(config["filtered_reads_dir"] + "{ID}" + config["filtered_R2_suffix"], ID = sample_ID_list),
+ config["final_outputs_dir"] + config["output_prefix"] + f"taxonomy{assay_suffix}.tsv",
+ config["final_outputs_dir"] + config["output_prefix"] + f"taxonomy-and-counts{assay_suffix}.biom",
+ config["final_outputs_dir"] + config["output_prefix"] + f"ASVs{assay_suffix}.fasta",
+ config["final_outputs_dir"] + config["output_prefix"] + f"read-count-tracking{assay_suffix}.tsv",
+ config["final_outputs_dir"] + config["output_prefix"] + f"counts{assay_suffix}.tsv",
+ config["final_outputs_dir"] + config["output_prefix"] + f"taxonomy-and-counts{assay_suffix}.tsv"
+ params:
+ left_trunc = config["left_trunc"],
+ right_trunc = config["right_trunc"],
+ left_maxEE = config["left_maxEE"],
+ right_maxEE = config["right_maxEE"],
+ trim_primers = config["trim_primers"],
+ trimmed_reads_dir = config["trimmed_reads_dir"],
+ filtered_reads_dir = config["filtered_reads_dir"],
+ primer_trimmed_R1_suffix = config["primer_trimmed_R1_suffix"],
+ primer_trimmed_R2_suffix = config["primer_trimmed_R2_suffix"],
+ filtered_R1_suffix = config["filtered_R1_suffix"],
+ filtered_R2_suffix = config["filtered_R2_suffix"],
+ final_outputs_dir = config["final_outputs_dir"],
+ target_region = config["target_region"],
+ output_prefix = config["output_prefix"],
+ concatenate_reads_only = config["concatenate_reads_only"],
+ assay_suffix = assay_suffix
+ resources:
+ mem_mb = 200000,
+ cpus = 10
+ log:
+ "R-processing.log"
+ benchmark:
+ "benchmarks/run_R-benchmarks.tsv"
+ shell:
+ """
+ Rscript scripts/Illumina-PE-R-processing.R "{params.left_trunc}" "{params.right_trunc}" "{params.left_maxEE}" "{params.right_maxEE}" "{params.trim_primers}" "{sample_IDs_file}" "{params.trimmed_reads_dir}" "{params.filtered_reads_dir}" "{params.primer_trimmed_R1_suffix}" "{params.primer_trimmed_R2_suffix}" "{params.filtered_R1_suffix}" "{params.filtered_R2_suffix}" "{params.final_outputs_dir}" "{params.output_prefix}" "{params.target_region}" "{params.concatenate_reads_only}" "{params.assay_suffix}" > {log} 2>&1
+ """
+
+ # if we did not trim the primers
+ else:
+
+ rule run_R_PE:
+ conda:
+ "envs/R.yaml"
+ input:
+ expand(config["raw_reads_dir"] + "{ID}" + config["raw_R1_suffix"], ID = sample_ID_list),
+ expand(config["raw_reads_dir"] + "{ID}" + config["raw_R2_suffix"], ID = sample_ID_list)
+ output:
+ expand(config["filtered_reads_dir"] + "{ID}" + config["filtered_R1_suffix"], ID = sample_ID_list),
+ expand(config["filtered_reads_dir"] + "{ID}" + config["filtered_R2_suffix"], ID = sample_ID_list),
+ config["final_outputs_dir"] + config["output_prefix"] + f"taxonomy{assay_suffix}.tsv",
+ config["final_outputs_dir"] + config["output_prefix"] + f"taxonomy-and-counts{assay_suffix}.biom",
+ config["final_outputs_dir"] + config["output_prefix"] + f"ASVs{assay_suffix}.fasta",
+ config["final_outputs_dir"] + config["output_prefix"] + f"read-count-tracking{assay_suffix}.tsv",
+ config["final_outputs_dir"] + config["output_prefix"] + f"counts{assay_suffix}.tsv",
+ config["final_outputs_dir"] + config["output_prefix"] + f"taxonomy-and-counts{assay_suffix}.tsv"
+ params:
+ left_trunc = config["left_trunc"],
+ right_trunc = config["right_trunc"],
+ left_maxEE = config["left_maxEE"],
+ right_maxEE = config["right_maxEE"],
+ trim_primers = config["trim_primers"],
+ raw_reads_dir = config["raw_reads_dir"],
+ filtered_reads_dir = config["filtered_reads_dir"],
+ raw_R1_suffix = config["raw_R1_suffix"],
+ raw_R2_suffix = config["raw_R2_suffix"],
+ filtered_R1_suffix = config["filtered_R1_suffix"],
+ filtered_R2_suffix = config["filtered_R2_suffix"],
+ final_outputs_dir = config["final_outputs_dir"],
+ target_region = config["target_region"],
+ output_prefix = config["output_prefix"],
+ concatenate_reads_only = config["concatenate_reads_only"],
+ assay_suffix = assay_suffix
+ resources:
+ mem_mb = 200000,
+ cpus = 10
+ log:
+ "R-processing.log"
+ benchmark:
+ "benchmarks/run_R-benchmarks.tsv"
+ shell:
+ """
+ Rscript scripts/Illumina-PE-R-processing.R "{params.left_trunc}" "{params.right_trunc}" "{params.left_maxEE}" "{params.right_maxEE}" "{params.trim_primers}" "{sample_IDs_file}" "{params.raw_reads_dir}" "{params.filtered_reads_dir}" "{params.raw_R1_suffix}" "{params.raw_R2_suffix}" "{params.filtered_R1_suffix}" "{params.filtered_R2_suffix}" "{params.final_outputs_dir}" "{params.output_prefix}" "{params.target_region}" "{params.concatenate_reads_only}" "{params.assay_suffix}" > {log} 2>&1
+ """
+
+
+ # cutadapt rule for paired-end data
+ rule cutadapt_PE:
+ """ this rule runs cutadapt. It is only executed if config["trim_primers"] is "TRUE" """
+ conda:
+ "envs/cutadapt.yaml"
+ input:
+ R1 = config["raw_reads_dir"] + "{ID}" + config["raw_R1_suffix"],
+ R2 = config["raw_reads_dir"] + "{ID}" + config["raw_R2_suffix"]
+ output:
+ R1 = config["trimmed_reads_dir"] + "{ID}" + config["primer_trimmed_R1_suffix"],
+ R2 = config["trimmed_reads_dir"] + "{ID}" + config["primer_trimmed_R2_suffix"],
+ log = config["trimmed_reads_dir"] + "{ID}-cutadapt.log",
+ trim_counts = config["trimmed_reads_dir"] + "{ID}-trimmed-counts.tsv"
+ params:
+ F_linked_primer = config["F_linked_primer"],
+ R_linked_primer = config["R_linked_primer"],
+ F_primer = config["F_primer"],
+ R_primer = config["R_primer"],
+ min_cutadapt_len = config["min_cutadapt_len"],
+ primers_linked = config["primers_linked"],
+ discard_untrimmed = config["discard_untrimmed"]
+ log:
+ config["trimmed_reads_dir"] + "{ID}-cutadapt.log"
+ benchmark:
+ "benchmarks/cutadapt-{ID}-benchmarks.tsv"
+ shell:
+ """
+ # command depends on if primers are linked or not
+ if [ {params.primers_linked} == "TRUE" ]; then
+
+ if [ {params.discard_untrimmed} == "TRUE" ]; then
+ cutadapt -a {params.F_linked_primer} -A {params.R_linked_primer} -o {output.R1} -p {output.R2} --discard-untrimmed -m {params.min_cutadapt_len} {input.R1} {input.R2} > {log} 2>&1
+ else
+ cutadapt -a {params.F_linked_primer} -A {params.R_linked_primer} -o {output.R1} -p {output.R2} -m {params.min_cutadapt_len} {input.R1} {input.R2} > {log} 2>&1
+ fi
+
+ else
+
+ if [ {params.discard_untrimmed} == "TRUE" ]; then
+ cutadapt -g {params.F_primer} -G {params.R_primer} -o {output.R1} -p {output.R2} --discard-untrimmed -m {params.min_cutadapt_len} {input.R1} {input.R2} > {log} 2>&1
+ else
+ cutadapt -g {params.F_primer} -G {params.R_primer} -o {output.R1} -p {output.R2} -m {params.min_cutadapt_len} {input.R1} {input.R2} > {log} 2>&1
+ fi
+
+ fi
+
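+ # pull the numbers of read pairs processed and written from the cutadapt log into a per-sample count line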
+ paste <( printf "{wildcards.ID}" ) <( grep "read pairs processed" {output.log} | tr -s " " "\t" | cut -f 5 | tr -d "," ) <( grep "Pairs written" {output.log} | tr -s " " "\t" | cut -f 5 | tr -d "," ) > {output.trim_counts}
+ """
+
+ # rule for raw fastqc for paired-end data
+ rule raw_fastqc_PE:
+ """
+ This rule runs fastqc on all raw input fastq files.
+ """
+
+ conda:
+ "envs/qc.yaml"
+ input:
+ config["raw_reads_dir"] + "{ID}" + config["raw_R1_suffix"],
+ config["raw_reads_dir"] + "{ID}" + config["raw_R2_suffix"]
+ output:
+ config["raw_reads_dir"] + "{ID}" + config["raw_R1_suffix"].rsplit(".", 2)[0] + "_fastqc.zip",
+ config["raw_reads_dir"] + "{ID}" + config["raw_R2_suffix"].rsplit(".", 2)[0] + "_fastqc.zip"
+ benchmark:
+ "benchmarks/raw_fastqc-{ID}-benchmarks.tsv"
+ shell:
+ """
+ fastqc {input} -t 2 -q
+ """
+
+ # rule for raw multiqc for paired-end data
+ rule raw_multiqc_PE:
+ """
+ This rule collates all raw fastqc outputs.
+ """
+
+ conda:
+ "envs/qc.yaml"
+ input:
+ expand(config["raw_reads_dir"] + "{ID}" + config["raw_R1_suffix"].rsplit(".", 2)[0] + "_fastqc.zip", ID = sample_ID_list),
+ expand(config["raw_reads_dir"] + "{ID}" + config["raw_R2_suffix"].rsplit(".", 2)[0] + "_fastqc.zip", ID = sample_ID_list)
+ params:
+ int_out_dir = config["output_prefix"] + "raw_multiqc_report",
+ out_filename_prefix = config["output_prefix"] + "raw_multiqc",
+ int_out_data_dir = config["output_prefix"] + "raw_multiqc_data",
+ int_html_file = config["output_prefix"] + "raw_multiqc.html",
+ int_zip = config["output_prefix"] + "raw_multiqc_report.zip",
+ r1_html_files = expand(config["raw_reads_dir"] + "{ID}" + config["raw_R1_suffix"].rsplit(".", 2)[0] + "_fastqc.html", ID = sample_ID_list),
+ r2_html_files = expand(config["raw_reads_dir"] + "{ID}" + config["raw_R2_suffix"].rsplit(".", 2)[0] + "_fastqc.html", ID = sample_ID_list),
+ config_file = "config/multiqc.config"
+ output:
+ final_out_zip = config["fastqc_out_dir"] + config["output_prefix"] + f"raw_multiqc{assay_suffix}_report.zip"
+ benchmark:
+ "benchmarks/raw_multiqc-benchmarks.tsv"
+ shell:
+ """
+ multiqc -q -n {params.out_filename_prefix} --force --cl-config 'max_table_rows: 99999999' --interactive --config {params.config_file} {input} > /dev/null 2>&1
+
+ # removing the individual fastqc files
+ rm -rf {input} {params.r1_html_files} {params.r2_html_files}
+
+ # making an output report directory and moving things into it
+ mkdir -p {params.int_out_dir}
+ mv {params.int_html_file} {params.int_out_data_dir} {params.int_out_dir}
+
+ # zipping and removing unzipped dir
+ zip -q -r {params.int_zip} {params.int_out_dir} && rm -rf {params.int_out_dir}
+
+ # moving to final wanted location
+ mv {params.int_zip} {output.final_out_zip}
+ """
+
+
+ # rule for filtered fastqc for paired-end data (inherits from rule raw_fastqc_PE)
+ use rule raw_fastqc_PE as filtered_fastqc_PE with:
+ input:
+ config["filtered_reads_dir"] + "{ID}" + config["filtered_R1_suffix"],
+ config["filtered_reads_dir"] + "{ID}" + config["filtered_R2_suffix"]
+ output:
+ config["filtered_reads_dir"] + "{ID}" + config["filtered_R1_suffix"].rsplit(".", 2)[0] + "_fastqc.zip",
+ config["filtered_reads_dir"] + "{ID}" + config["filtered_R2_suffix"].rsplit(".", 2)[0] + "_fastqc.zip"
+ benchmark:
+ "benchmarks/filtered_fastqc-{ID}-benchmarks.tsv"
+
+
+ # rule for filtered multiqc for paired-end data (inherits from raw_multiqc_PE)
+ use rule raw_multiqc_PE as filtered_multiqc_PE with:
+ input:
+ expand(config["filtered_reads_dir"] + "{ID}" + config["filtered_R1_suffix"].rsplit(".", 2)[0] + "_fastqc.zip", ID = sample_ID_list),
+ expand(config["filtered_reads_dir"] + "{ID}" + config["filtered_R2_suffix"].rsplit(".", 2)[0] + "_fastqc.zip", ID = sample_ID_list)
+ params:
+ int_out_dir = config["output_prefix"] + "filtered_multiqc_report",
+ out_filename_prefix = config["output_prefix"] + "filtered_multiqc",
+ int_out_data_dir = config["output_prefix"] + "filtered_multiqc_data",
+ int_html_file = config["output_prefix"] + "filtered_multiqc.html",
+ int_zip = config["output_prefix"] + "filtered_multiqc_report.zip",
+ r1_html_files = expand(config["filtered_reads_dir"] + "{ID}" + config["filtered_R1_suffix"].rsplit(".", 2)[0] + "_fastqc.html", ID = sample_ID_list),
+ r2_html_files = expand(config["filtered_reads_dir"] + "{ID}" + config["filtered_R2_suffix"].rsplit(".", 2)[0] + "_fastqc.html", ID = sample_ID_list),
+ config_file = "config/multiqc.config"
+ output:
+ final_out_zip = config["fastqc_out_dir"] + config["output_prefix"] + f"filtered_multiqc{assay_suffix}_report.zip"
+ benchmark:
+ "benchmarks/filtered_multiqc-benchmarks.tsv"
+
+
+
+#### end of rules specific for paired-end data ####
+
+##################################
+#### rules if single-end data ####
+##################################
+if config["data_type"] == "SE":
+
+ rule all:
+ input: base_SE_inputs + visualization_outputs
+ shell:
+ """
+ bash scripts/combine-benchmarks.sh
+ python scripts/copy_info.py
+ """
+
+
+
+ # R processing rule for single-end data
+ if config["trim_primers"] == "TRUE":
+
+ rule run_R_SE:
+ conda:
+ "envs/R.yaml"
+ input:
+ expand(config["trimmed_reads_dir"] + "{ID}" + config["primer_trimmed_R1_suffix"], ID = sample_ID_list),
+ config["trimmed_reads_dir"] + config["output_prefix"] + f"trimmed-read-counts{assay_suffix}.tsv"
+ output:
+ expand(config["filtered_reads_dir"] + "{ID}" + config["filtered_R1_suffix"], ID = sample_ID_list),
+ config["final_outputs_dir"] + config["output_prefix"] + f"taxonomy{assay_suffix}.tsv",
+ config["final_outputs_dir"] + config["output_prefix"] + f"taxonomy-and-counts{assay_suffix}.biom",
+ config["final_outputs_dir"] + config["output_prefix"] + f"ASVs{assay_suffix}.fasta",
+ config["final_outputs_dir"] + config["output_prefix"] + f"read-count-tracking{assay_suffix}.tsv",
+ config["final_outputs_dir"] + config["output_prefix"] + f"counts{assay_suffix}.tsv",
+ config["final_outputs_dir"] + config["output_prefix"] + f"taxonomy-and-counts{assay_suffix}.tsv"
+ params:
+ left_trunc = config["left_trunc"],
+ left_maxEE = config["left_maxEE"],
+ trim_primers = config["trim_primers"],
+ trimmed_reads_dir = config["trimmed_reads_dir"],
+ filtered_reads_dir = config["filtered_reads_dir"],
+ primer_trimmed_R1_suffix = config["primer_trimmed_R1_suffix"],
+ filtered_R1_suffix = config["filtered_R1_suffix"],
+ final_outputs_dir = config["final_outputs_dir"],
+ target_region = config["target_region"],
+ output_prefix = config["output_prefix"],
+ assay_suffix = assay_suffix
+ resources:
+ mem_mb = 200000,
+ cpus = 10
+ log:
+ "R-processing.log"
+ benchmark:
+ "benchmarks/run_R-benchmarks.tsv"
+ shell:
+ """
+ Rscript scripts/Illumina-SE-R-processing.R "{params.left_trunc}" "{params.left_maxEE}" "{params.trim_primers}" "{sample_IDs_file}" "{params.trimmed_reads_dir}" "{params.filtered_reads_dir}" "{params.primer_trimmed_R1_suffix}" "{params.filtered_R1_suffix}" "{params.final_outputs_dir}" "{params.output_prefix}" "{params.target_region}" "{params.assay_suffix}" > {log} 2>&1
+ """
+
+ # if we did not trim the primers
+ else:
+
+ rule run_R_SE:
+ conda:
+ "envs/R.yaml"
+ input:
+ expand(config["raw_reads_dir"] + "{ID}" + config["raw_R1_suffix"], ID = sample_ID_list)
+ output:
+ expand(config["filtered_reads_dir"] + "{ID}" + config["filtered_R1_suffix"], ID = sample_ID_list),
+ config["final_outputs_dir"] + config["output_prefix"] + f"taxonomy{assay_suffix}.tsv",
+ config["final_outputs_dir"] + config["output_prefix"] + f"taxonomy-and-counts{assay_suffix}.biom",
+ config["final_outputs_dir"] + config["output_prefix"] + f"ASVs{assay_suffix}.fasta",
+ config["final_outputs_dir"] + config["output_prefix"] + f"read-count-tracking{assay_suffix}.tsv",
+ config["final_outputs_dir"] + config["output_prefix"] + f"counts{assay_suffix}.tsv",
+ config["final_outputs_dir"] + config["output_prefix"] + f"taxonomy-and-counts{assay_suffix}.tsv"
+ params:
+ left_trunc = config["left_trunc"],
+ left_maxEE = config["left_maxEE"],
+ trim_primers = config["trim_primers"],
+ raw_reads_dir = config["raw_reads_dir"],
+ filtered_reads_dir = config["filtered_reads_dir"],
+ raw_R1_suffix = config["raw_R1_suffix"],
+ filtered_R1_suffix = config["filtered_R1_suffix"],
+ final_outputs_dir = config["final_outputs_dir"],
+ target_region = config["target_region"],
+ output_prefix = config["output_prefix"],
+ assay_suffix = assay_suffix
+ resources:
+ mem_mb = 200000,
+ cpus = 10
+ log:
+ "R-processing.log"
+ benchmark:
+ "benchmarks/run_R-benchmarks.tsv"
+ shell:
+ """
+ Rscript scripts/Illumina-SE-R-processing.R "{params.left_trunc}" "{params.left_maxEE}" "{params.trim_primers}" "{sample_IDs_file}" "{params.raw_reads_dir}" "{params.filtered_reads_dir}" "{params.raw_R1_suffix}" "{params.filtered_R1_suffix}" "{params.final_outputs_dir}" "{params.output_prefix}" "{params.target_region}" "{params.assay_suffix}" > {log} 2>&1
+ """
+
+
+ # cutadapt rule for single-end data
+ rule cutadapt_SE:
+ """ this rule runs cutadapt. It is only executed if config["trim_primers"] is "TRUE" """
+ conda:
+ "envs/cutadapt.yaml"
+ input:
+ R1 = config["raw_reads_dir"] + "{ID}" + config["raw_R1_suffix"]
+ output:
+ R1 = config["trimmed_reads_dir"] + "{ID}" + config["primer_trimmed_R1_suffix"],
+ log = config["trimmed_reads_dir"] + "{ID}-cutadapt.log",
+ trim_counts = config["trimmed_reads_dir"] + "{ID}-trimmed-counts.tsv"
+ params:
+ F_linked_primer = config["F_linked_primer"],
+ F_primer = config["F_primer"],
+ min_cutadapt_len = config["min_cutadapt_len"],
+ primers_linked = config["primers_linked"],
+ discard_untrimmed = config["discard_untrimmed"]
+ log:
+ config["trimmed_reads_dir"] + "{ID}-cutadapt.log"
+ benchmark:
+ "benchmarks/cutadapt-{ID}-benchmarks.tsv"
+ shell:
+ """
+ # command depends on if primers are linked or not
+ if [ {params.primers_linked} == "TRUE" ]; then
+
+ if [ {params.discard_untrimmed} == "TRUE" ]; then
+ cutadapt -a {params.F_linked_primer} -o {output.R1} --discard-untrimmed -m {params.min_cutadapt_len} {input.R1} > {log} 2>&1
+ else
+ cutadapt -a {params.F_linked_primer} -o {output.R1} -m {params.min_cutadapt_len} {input.R1} > {log} 2>&1
+ fi
+
+ else
+
+ if [ {params.discard_untrimmed} == "TRUE" ]; then
+ cutadapt -g {params.F_primer} -o {output.R1} --discard-untrimmed -m {params.min_cutadapt_len} {input.R1} > {log} 2>&1
+ else
+ cutadapt -g {params.F_primer} -o {output.R1} -m {params.min_cutadapt_len} {input.R1} > {log} 2>&1
+ fi
+
+ fi
+
+ paste <( printf "{wildcards.ID}" ) <( grep "reads processed" {output.log} | tr -s " " "\t" | cut -f 4 | tr -d "," ) <( grep "Reads written" {output.log} | tr -s " " "\t" | cut -f 5 | tr -d "," ) > {output.trim_counts}
+ """
+
+
+ # rule for raw fastqc for single-end data
+ rule raw_fastqc_SE:
+ """
+ This rule runs fastqc on all raw input fastq files.
+ """
+
+ conda:
+ "envs/qc.yaml"
+ input:
+ config["raw_reads_dir"] + "{ID}" + config["raw_R1_suffix"]
+ output:
+ config["raw_reads_dir"] + "{ID}" + config["raw_R1_suffix"].rsplit(".", 2)[0] + "_fastqc.zip"
+ benchmark:
+ "benchmarks/raw_fastqc-{ID}-benchmarks.tsv"
+ shell:
+ """
+ fastqc {input} -t 1 -q
+ """
+
+
+ # rule for raw multiqc for single-end data
+ rule raw_multiqc_SE:
+ """
+ This rule collates all raw fastqc outputs.
+ """
+
+ conda:
+ "envs/qc.yaml"
+ input:
+ expand(config["raw_reads_dir"] + "{ID}" + config["raw_R1_suffix"].rsplit(".", 2)[0] + "_fastqc.zip", ID = sample_ID_list)
+ params:
+ int_out_dir = config["output_prefix"] + "raw_multiqc_report",
+ out_filename_prefix = config["output_prefix"] + "raw_multiqc",
+ int_out_data_dir = config["output_prefix"] + "raw_multiqc_data",
+ int_html_file = config["output_prefix"] + "raw_multiqc.html",
+ int_zip = config["output_prefix"] + "raw_multiqc_report.zip",
+ r1_html_files = expand(config["raw_reads_dir"] + "{ID}" + config["raw_R1_suffix"].rsplit(".", 2)[0] + "_fastqc.html", ID = sample_ID_list),
+ config_file = "config/multiqc.config"
+ output:
+ final_out_zip = config["fastqc_out_dir"] + config["output_prefix"] + f"raw_multiqc{assay_suffix}_report.zip"
+ benchmark:
+ "benchmarks/raw_multiqc-benchmarks.tsv"
+ shell:
+ """
+ multiqc -q -n {params.out_filename_prefix} --force --cl-config 'max_table_rows: 99999999' --interactive --config {params.config_file} {input} > /dev/null 2>&1
+
+ # removing the individual fastqc files
+ rm -rf {input} {params.r1_html_files}
+
+ # making an output report directory and moving things into it
+ mkdir -p {params.int_out_dir}
+ mv {params.int_html_file} {params.int_out_data_dir} {params.int_out_dir}
+
+ # zipping and removing unzipped dir
+ zip -q -r {params.int_zip} {params.int_out_dir} && rm -rf {params.int_out_dir}
+
+ # moving to final wanted location
+ mv {params.int_zip} {output.final_out_zip}
+ """
+
+
+ # rule for filtered fastqc for single-end data (inherits from rule raw_fastqc_SE)
+ use rule raw_fastqc_SE as filtered_fastqc_SE with:
+ input:
+ config["filtered_reads_dir"] + "{ID}" + config["filtered_R1_suffix"]
+ output:
+ config["filtered_reads_dir"] + "{ID}" + config["filtered_R1_suffix"].rsplit(".", 2)[0] + "_fastqc.zip"
+ benchmark:
+ "benchmarks/filtered_fastqc-{ID}-benchmarks.tsv"
+
+
+ # rule for filtered multiqc for single-end data (inherits from rule raw_multiqc_SE)
+ use rule raw_multiqc_SE as filtered_multiqc_SE with:
+ input:
+ expand(config["filtered_reads_dir"] + "{ID}" + config["filtered_R1_suffix"].rsplit(".", 2)[0] + "_fastqc.zip", ID = sample_ID_list)
+ params:
+ int_out_dir = config["output_prefix"] + "filtered_multiqc_report",
+ out_filename_prefix = config["output_prefix"] + "filtered_multiqc",
+ int_out_data_dir = config["output_prefix"] + "filtered_multiqc_data",
+ int_html_file = config["output_prefix"] + "filtered_multiqc.html",
+ int_zip = config["output_prefix"] + "filtered_multiqc_report.zip",
+ r1_html_files = expand(config["filtered_reads_dir"] + "{ID}" + config["filtered_R1_suffix"].rsplit(".", 2)[0] + "_fastqc.html", ID = sample_ID_list),
+ config_file = "config/multiqc.config"
+ output:
+ final_out_zip = config["fastqc_out_dir"] + config["output_prefix"] + f"filtered_multiqc{assay_suffix}_report.zip"
+ benchmark:
+ "benchmarks/filtered_multiqc-benchmarks.tsv"
+
+
+#### end of rules specific for single-end data ####
+
+##################################################################
+#### rules that are the same whether paired-end or single-end ####
+##################################################################
+rule r_visualizations:
+ """ This rule generates R visualizations using trimmed read data and grouping info from the runsheet"""
+ conda:
+ "visualizations/R_visualizations.yaml"
+ input:
+ runsheet = config["runsheet"],
+ sample_info = config["sample_info_file"],
+ counts = config["final_outputs_dir"] + config["output_prefix"] + f"counts{assay_suffix}.tsv",
+ taxonomy = config["final_outputs_dir"] + config["output_prefix"] + f"taxonomy{assay_suffix}.tsv"
+ output:
+ # the color legend png stands in for the full set of plot outputs, since which plots are produced varies by dataset
+ legend = config["plots_dir"] + config["output_prefix"] + f"color_legend{assay_suffix}.png"
+ params:
+ assay_suffix = assay_suffix,
+ plots_dir = config["plots_dir"],
+ output_prefix = config["output_prefix"]
+ resources:
+ mem_mb = 200000,
+ cpus = 10
+ log:
+ "R-visualizations.log"
+ benchmark:
+ "benchmarks/r-visualizations-benchmarks.tsv"
+ shell:
+ """
+ Rscript visualizations/Illumina-R-visualizations.R "{input.runsheet}" "{input.sample_info}" "{input.counts}" "{input.taxonomy}" "{params.plots_dir}" "{params.output_prefix}" "{params.assay_suffix}" > {log} 2>&1
+ """
+
+
+rule zip_biom:
+ input:
+ config["final_outputs_dir"] + config["output_prefix"] + f"taxonomy-and-counts{assay_suffix}.biom"
+ output:
+ config["final_outputs_dir"] + config["output_prefix"] + f"taxonomy-and-counts{assay_suffix}.biom.zip"
+ shell:
+ """
+ zip -j -q {output} {input} && rm {input}
+ """
+
+
+rule combine_cutadapt_logs_and_summarize:
+ """ this rule combines the cutadapt logs and summarizes them. It is only executed if config["trim_primers"] is "TRUE" """
+ input:
+ counts = expand(config["trimmed_reads_dir"] + "{ID}-trimmed-counts.tsv", ID = sample_ID_list),
+ logs = expand(config["trimmed_reads_dir"] + "{ID}-cutadapt.log", ID = sample_ID_list)
+ output:
+ combined_log = config["trimmed_reads_dir"] + config["output_prefix"] + f"cutadapt{assay_suffix}.log",
+ combined_counts = config["trimmed_reads_dir"] + config["output_prefix"] + f"trimmed-read-counts{assay_suffix}.tsv"
+ benchmark:
+ "benchmarks/combine_cutadapt_logs_and_summarize-benchmarks.tsv"
+ shell:
+ """
+ cat {input.logs} > {output.combined_log}
+ rm {input.logs}
+
+ cat <( printf "sample\traw_reads\tcutadapt_trimmed\n" ) <( cat {input.counts} ) > {output.combined_counts}
+ rm {input.counts}
+ """
+
+
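+# convenience rule to remove all output directories created by the workflow (run by targeting it directly, e.g. `snakemake clean_all -j 1`)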
+rule clean_all:
+ shell:
+ "rm -rf {needed_dirs}"
\ No newline at end of file
diff --git a/Amplicon/Illumina/Workflow_Documentation/SW_AmpIllumina-B/workflow_code/config.yaml b/Amplicon/Illumina/Workflow_Documentation/SW_AmpIllumina-B/workflow_code/config.yaml
new file mode 100644
index 00000000..d2cfcbf9
--- /dev/null
+++ b/Amplicon/Illumina/Workflow_Documentation/SW_AmpIllumina-B/workflow_code/config.yaml
@@ -0,0 +1,150 @@
+############################################################################################
+## Configuration file for GeneLab Illumina amplicon processing workflow ##
+## Initially developed by Michael D. Lee (Mike.Lee@nasa.gov) ##
+## Developed and maintained by Michael D. Lee and Alexis Torres (alexis.torres@nasa.gov) ##
+############################################################################################
+
+
+############################################################
+##################### VARIABLES TO SET #####################
+############################################################
+
+###########################################################################
+##### These need to match what is specific to our system and our data #####
+###########################################################################
+
+## Path to ISA archive, only needed for saving a copy as metadata:
+isa_archive:
+ ""
+
+## Path to runsheet:
+runsheet:
+ "OSD-num_amplicon_v1_runsheet.csv"
+
+## Set to "PE" for paired-end, "SE" for single-end.
+data_type:
+ "PE"
+
+## single-column file with unique sample identifiers:
+sample_info_file:
+ "unique-sample-IDs.txt"
+
+## input reads directory (can be relative to workflow directory, or needs to be full path):
+raw_reads_dir:
+ "raw_reads/"
+
+## raw read suffixes:
+ # e.g. for paired-end data, Sample-1_R1_raw.fastq.gz would be _R1_raw.fastq.gz for 'raw_R1_suffix' below
+ # e.g. if single-end, Sample-1.fastq.gz would be .fastq.gz for 'raw_R1_suffix' below, and 'raw_R2_suffix' won't be used
+raw_R1_suffix:
+ "_R1_raw.fastq.gz"
+raw_R2_suffix:
+ "_R2_raw.fastq.gz"
+
+## if we are trimming primers or not ("TRUE", or "FALSE")
+trim_primers:
+ "TRUE"
+
+## primer sequences if we are trimming them (include anchoring symbols, e.g. '^', as needed, see: https://cutadapt.readthedocs.io/en/stable/guide.html#adapter-types)
+F_primer:
+ "^GTGCCAGCMGCCGCGGTAA"
+R_primer:
+ "^GGACTACHVGGGTWTCTAA"
+
+## should cutadapt treat these as linked primers? (https://cutadapt.readthedocs.io/en/stable/recipes.html#trimming-amplicon-primers-from-paired-end-reads)
+primers_linked:
+ "TRUE"
+
+## if primers are linked, we need to provide them as below, where the second half, following three periods, is the other primer reverse-complemented
+ # (can reverse complement while retaining ambiguous bases at this site: http://arep.med.harvard.edu/labgc/adnan/projects/Utilities/revcomp.html)
+ # include anchoring symbols, e.g. '^', as needed, see: https://cutadapt.readthedocs.io/en/stable/guide.html#adapter-types
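+ # e.g. F_linked_primer is F_primer, then '...', then the reverse complement of R_primer (and vice versa for R_linked_primer)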
+F_linked_primer:
+ "^GTGCCAGCMGCCGCGGTAA...TTAGAWACCCBDGTAGTCC"
+R_linked_primer:
+ "^GGACTACHVGGGTWTCTAA...TTACCGCGGCKGCTGGCAC"
+
+## discard untrimmed, sets the "--discard-untrimmed" option if TRUE
+discard_untrimmed:
+ "TRUE"
+
+## target region (16S, 18S, or ITS is acceptable)
+ # this determines which reference database is used for taxonomic classification
+ # all are pulled from the pre-packaged DECIPHER downloads page here: http://www2.decipher.codes/Downloads.html
+ # 16S uses SILVA
+ # ITS uses UNITE
+ # 18S uses PR2
+target_region:
+ "16S"
+
+## concatenate only with dada2 instead of merging paired reads if TRUE
+ # this is typically used with primer sets like 515-926, which capture 18S fragments that are typically too long to merge
+ # note that 16S and 18S should have been separated already prior to running this workflow
+ # this should likely be left as FALSE for any option other than "18S" above
+
+concatenate_reads_only:
+ "FALSE"
+
+## values to be passed to dada2's filterAndTrim() function:
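+ # left_trunc/right_trunc are passed to truncLen (0 means no truncation); left_maxEE/right_maxEE are passed to maxEE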
+left_trunc:
+ 0
+right_trunc:
+ 0
+left_maxEE:
+ 1
+right_maxEE:
+ 1
+
+## minimum length threshold for cutadapt
+min_cutadapt_len:
+ 130
+
+######################################################################
+##### The rest only need to be altered if we want to change them #####
+######################################################################
+
+## filename suffixes
+primer_trimmed_R1_suffix:
+ "_R1_trimmed.fastq.gz"
+primer_trimmed_R2_suffix:
+ "_R2_trimmed.fastq.gz"
+
+filtered_R1_suffix:
+ "_R1_filtered.fastq.gz"
+filtered_R2_suffix:
+ "_R2_filtered.fastq.gz"
+
+## output prefix (set one if needing to distinguish outputs from multiple primer sets; leave as an empty string if not; include a connecting symbol when adding one, e.g. "ITS-")
+output_prefix:
+ ""
+
+## output directories (all relative to processing directory, they will be created if needed)
+info_out_dir:
+ "workflow_output/Metadata/"
+fastqc_out_dir:
+ "workflow_output/FastQC_Outputs/"
+trimmed_reads_dir:
+ "workflow_output/Trimmed_Sequence_Data/"
+filtered_reads_dir:
+ "workflow_output/Filtered_Sequence_Data/"
+final_outputs_dir:
+ "workflow_output/Final_Outputs/"
+plots_dir:
+ "workflow_output/Final_Outputs/Plots/"
+
+enable_visualizations:
+ "TRUE"
+
+############################################################
+###################### GENERAL INFO ########################
+############################################################
+# Workflow is equipped to work with paired-end or single-end Illumina data (set via 'data_type' above), and reads are expected to be gzipped
+
+## example usage command ##
+# snakemake --use-conda --conda-prefix ${CONDA_PREFIX}/envs -j 2 -p
+
+# `--use-conda` – this specifies to use the conda environments included in the workflow
+# `--conda-prefix` – this allows us to point to where the needed conda environments should be stored...
+# `-j` – this lets us set how many jobs Snakemake should run concurrently...
+# `-p` – specifies to print out each command being run to the screen
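+
+# a dry-run can be previewed first by adding the `-n` flag, e.g.:
+# snakemake --use-conda --conda-prefix ${CONDA_PREFIX}/envs -j 2 -n -p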
+
+# See `snakemake -h` for more options and details.
diff --git a/Amplicon/Illumina/Workflow_Documentation/SW_AmpIllumina-B/workflow_code/config/multiqc.config b/Amplicon/Illumina/Workflow_Documentation/SW_AmpIllumina-B/workflow_code/config/multiqc.config
new file mode 100644
index 00000000..7338ff1b
--- /dev/null
+++ b/Amplicon/Illumina/Workflow_Documentation/SW_AmpIllumina-B/workflow_code/config/multiqc.config
@@ -0,0 +1,6 @@
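+# MultiQC settings: strip these suffixes from sample names and hide the analysis paths/time in the report header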
+extra_fn_clean_exts:
+ - "_raw"
+ - "_filtered"
+
+show_analysis_paths: False
+show_analysis_time: False
diff --git a/Amplicon/Illumina/Workflow_Documentation/SW_AmpIllumina-B/workflow_code/envs/R.yaml b/Amplicon/Illumina/Workflow_Documentation/SW_AmpIllumina-B/workflow_code/envs/R.yaml
new file mode 100644
index 00000000..73690882
--- /dev/null
+++ b/Amplicon/Illumina/Workflow_Documentation/SW_AmpIllumina-B/workflow_code/envs/R.yaml
@@ -0,0 +1,9 @@
+channels:
+ - conda-forge
+ - bioconda
+ - defaults
+dependencies:
+ - r-base==4.3.2
+ - bioconductor-dada2==1.30.0
+ - bioconductor-decipher==2.30.0
+ - bioconductor-biomformat==1.30.0
diff --git a/Amplicon/Illumina/Workflow_Documentation/SW_AmpIllumina-B/workflow_code/envs/cutadapt.yaml b/Amplicon/Illumina/Workflow_Documentation/SW_AmpIllumina-B/workflow_code/envs/cutadapt.yaml
new file mode 100644
index 00000000..8598a8e7
--- /dev/null
+++ b/Amplicon/Illumina/Workflow_Documentation/SW_AmpIllumina-B/workflow_code/envs/cutadapt.yaml
@@ -0,0 +1,6 @@
+channels:
+ - conda-forge
+ - bioconda
+ - defaults
+dependencies:
+ - cutadapt==4.6
diff --git a/Amplicon/Illumina/Workflow_Documentation/SW_AmpIllumina-B/workflow_code/envs/qc.yaml b/Amplicon/Illumina/Workflow_Documentation/SW_AmpIllumina-B/workflow_code/envs/qc.yaml
new file mode 100644
index 00000000..73e48c83
--- /dev/null
+++ b/Amplicon/Illumina/Workflow_Documentation/SW_AmpIllumina-B/workflow_code/envs/qc.yaml
@@ -0,0 +1,9 @@
+channels:
+ - conda-forge
+ - bioconda
+ - defaults
+dependencies:
+ - fastqc==0.12.1
+ - multiqc==1.19
+ - zip==3.0
+ - python==3.8
diff --git a/Amplicon/Illumina/Workflow_Documentation/SW_AmpIllumina-B/workflow_code/scripts/Illumina-PE-R-processing.R b/Amplicon/Illumina/Workflow_Documentation/SW_AmpIllumina-B/workflow_code/scripts/Illumina-PE-R-processing.R
new file mode 100644
index 00000000..3b87b5d2
--- /dev/null
+++ b/Amplicon/Illumina/Workflow_Documentation/SW_AmpIllumina-B/workflow_code/scripts/Illumina-PE-R-processing.R
@@ -0,0 +1,273 @@
+##################################################################################
+## R processing script for Illumina paired-end amplicon data ##
+## Developed by Michael D. Lee (Mike.Lee@nasa.gov) ##
+##################################################################################
+
+# as called from the associated Snakefile, this expects to be run as: Rscript Illumina-PE-R-processing.R followed by the 17 positional arguments parsed below
+ # where <left_trunc> and <right_trunc> are the values to be passed to the truncLen parameter of dada2's filterAndTrim()
+ # and <left_maxEE> and <right_maxEE> are the values to be passed to the maxEE parameter of dada2's filterAndTrim()
+
+# checking at least 17 arguments were provided, first 4 are integers, and setting variables used within R:
+args <- commandArgs(trailingOnly = TRUE)
+
+if (length(args) < 17) {
+ stop("At least 17 positional arguments are required, see top of this R script for more info.", call.=FALSE)
+} else {
+ suppressWarnings(left_trunc <- as.integer(args[1]))
+ suppressWarnings(right_trunc <- as.integer(args[2]))
+ suppressWarnings(left_maxEE <- as.integer(args[3]))
+ suppressWarnings(right_maxEE <- as.integer(args[4]))
+
+ suppressWarnings(GL_trimmed_primers <- args[5])
+ suppressWarnings(sample_IDs_file <- args[6])
+ suppressWarnings(input_reads_dir <- args[7])
+ suppressWarnings(filtered_reads_dir <- args[8])
+ suppressWarnings(input_file_R1_suffix <- args[9])
+ suppressWarnings(input_file_R2_suffix <- args[10])
+ suppressWarnings(filtered_filename_R1_suffix <- args[11])
+ suppressWarnings(filtered_filename_R2_suffix <- args[12])
+ suppressWarnings(final_outputs_dir <- args[13])
+ suppressWarnings(output_prefix <- args[14])
+ suppressWarnings(target_region <- args[15])
+ suppressWarnings(concatenate_reads_only <- args[16])
+ suppressWarnings(assay_suffix <- args[17])
+
+}
+
+if ( is.na(left_trunc) || is.na(right_trunc) || is.na(left_maxEE) || is.na(right_maxEE) ) {
+ stop("All 4 first arguments must be integers, see top of R script for more info.", call.=FALSE)
+}
+
+if ( ! GL_trimmed_primers %in% c("TRUE", "FALSE") ) {
+ stop("The fifth positional argument needs to be 'TRUE' or 'FALSE' for whether or not GL trimmed primers on this dataset, see top of R script and config.yaml for more info.", call.=FALSE)
+} else {
+ GL_trimmed_primers <- as.logical(GL_trimmed_primers)
+}
+
+if ( ! concatenate_reads_only %in% c("TRUE", "FALSE") ) {
+ stop("The sixteenth positional argument needs to be 'TRUE' or 'FALSE' for whether or not the mergePairs function of dada2 should just concatenate the reads on this dataset, see top of R script and config.yaml for more info.", call.=FALSE)
+} else {
+ concatenate_reads_only <- as.logical(concatenate_reads_only)
+}
+
+# general procedure comes largely from these sources:
+ # https://benjjneb.github.io/dada2/tutorial.html
+ # https://astrobiomike.github.io/amplicon/dada2_workflow_ex
+
+ # loading libraries
+library(dada2); packageVersion("dada2")
+library(DECIPHER); packageVersion("DECIPHER")
+library(biomformat); packageVersion("biomformat")
+
+ ### general processing ###
+ # reading in unique sample names into variable
+sample.names <- scan(sample_IDs_file, what="character")
+
+ # setting variables holding the paths to the forward and reverse primer-trimmed reads (or "raw" if primers were trimmed prior to submission to GeneLab)
+forward_reads <- paste0(input_reads_dir, sample.names, input_file_R1_suffix)
+reverse_reads <- paste0(input_reads_dir, sample.names, input_file_R2_suffix)
+
+ # setting variables holding what will be the output paths of all forward and reverse filtered reads
+forward_filtered_reads <- paste0(filtered_reads_dir, sample.names, filtered_filename_R1_suffix)
+reverse_filtered_reads <- paste0(filtered_reads_dir, sample.names, filtered_filename_R2_suffix)
+
+ # adding sample names to the vectors holding the filtered-reads' paths
+names(forward_filtered_reads) <- sample.names
+names(reverse_filtered_reads) <- sample.names
+
+ # running filtering step
+ # reads are written to the files specified in the variables, the "filtered_out" object holds the summary results within R
+filtered_out <- filterAndTrim(fwd=forward_reads, forward_filtered_reads, reverse_reads, reverse_filtered_reads, truncLen=c(left_trunc,right_trunc), maxN=0, maxEE=c(left_maxEE,right_maxEE), truncQ=2, rm.phix=TRUE, compress=TRUE, multithread=10)
+
+ # making and writing out summary table that includes counts of filtered reads
+if ( GL_trimmed_primers ) {
+
+ filtered_count_summary_tab <- data.frame(sample=sample.names, cutadapt_trimmed=filtered_out[,1], dada2_filtered=filtered_out[,2])
+
+} else {
+
+ filtered_count_summary_tab <- data.frame(sample=sample.names, starting_reads=filtered_out[,1], dada2_filtered=filtered_out[,2])
+
+}
+
+write.table(filtered_count_summary_tab, paste0(filtered_reads_dir, output_prefix, "filtered-read-counts", assay_suffix, ".tsv"), sep="\t", quote=F, row.names=F)
+
+ # learning errors step
+forward_errors <- learnErrors(forward_filtered_reads, multithread=10)
+reverse_errors <- learnErrors(reverse_filtered_reads, multithread=10)
+
+ # inferring sequences
+forward_seqs <- dada(forward_filtered_reads, err=forward_errors, pool="pseudo", multithread=10)
+reverse_seqs <- dada(reverse_filtered_reads, err=reverse_errors, pool="pseudo", multithread=10)
+
+ # merging forward and reverse reads (just concatenating if that was specified)
+if ( concatenate_reads_only ) {
+
+ merged_contigs <- mergePairs(forward_seqs, forward_filtered_reads, reverse_seqs, reverse_filtered_reads, verbose=TRUE, justConcatenate=TRUE)
+
+} else {
+
+ merged_contigs <- mergePairs(forward_seqs, forward_filtered_reads, reverse_seqs, reverse_filtered_reads, verbose=TRUE)
+
+}
+
+
+ # generating a sequence table that holds the counts of each sequence per sample
+seqtab <- makeSequenceTable(merged_contigs)
+
+ # removing putative chimeras
+seqtab.nochim <- removeBimeraDenovo(seqtab, method="consensus", multithread=10, verbose=TRUE)
+
+ # checking what percentage of sequences were retained after chimera removal
+sum(seqtab.nochim)/sum(seqtab) * 100
+
+ # making and writing out a summary table that includes read counts at all steps
+ # helper function
+getN <- function(x) sum(getUniques(x))
+
+if ( GL_trimmed_primers ) {
+
+ raw_and_trimmed_read_counts <- read.table(paste0(input_reads_dir, output_prefix, "trimmed-read-counts", assay_suffix, ".tsv"), header=T, sep="\t")
+ # reading in filtered read counts
+ filtered_read_counts <- read.table(paste0(filtered_reads_dir, output_prefix, "filtered-read-counts", assay_suffix, ".tsv"), header=T, sep="\t")
+
+ count_summary_tab <- data.frame(raw_and_trimmed_read_counts, dada2_filtered=filtered_read_counts[,3],
+ dada2_denoised_F=sapply(forward_seqs, getN),
+ dada2_denoised_R=sapply(reverse_seqs, getN),
+ dada2_merged=rowSums(seqtab),
+ dada2_chimera_removed=rowSums(seqtab.nochim),
+ final_perc_reads_retained=round(rowSums(seqtab.nochim)/raw_and_trimmed_read_counts$raw_reads * 100, 1),
+ row.names=NULL)
+
+} else {
+
+ count_summary_tab <- data.frame(filtered_count_summary_tab,
+ dada2_denoised_F=sapply(forward_seqs, getN),
+ dada2_denoised_R=sapply(reverse_seqs, getN),
+ dada2_merged=rowSums(seqtab),
+ dada2_chimera_removed=rowSums(seqtab.nochim),
+ final_perc_reads_retained=round(rowSums(seqtab.nochim)/filtered_count_summary_tab$starting_reads * 100, 1),
+ row.names=NULL)
+
+}
+
+write.table(count_summary_tab, paste0(final_outputs_dir, output_prefix, "read-count-tracking", assay_suffix, ".tsv"), sep = "\t", quote=F, row.names=F)
+
+ ### assigning taxonomy ###
+ # creating a DNAStringSet object from the ASVs
+dna <- DNAStringSet(getSequences(seqtab.nochim))
+
+ # downloading reference R taxonomy object (at some point this will be stored somewhere on GeneLab's server and we won't download it, but should leave the code here, just commented out)
+cat("\n\n Downloading reference database...\n\n")
+if ( target_region == "16S" ) {
+ download.file("https://www2.decipher.codes/data/Downloads/TrainingSets/SILVA_SSU_r138_2019.RData", "SILVA_SSU_r138_2019.RData")
+ # loading reference taxonomy object
+ load("SILVA_SSU_r138_2019.RData")
+ # removing downloaded file
+ file.remove("SILVA_SSU_r138_2019.RData")
+
+ ranks <- c("domain", "phylum", "class", "order", "family", "genus", "species")
+
+} else if (target_region == "ITS" ) {
+
+ download.file("https://www2.decipher.codes/data/Downloads/TrainingSets/UNITE_v2023_July2023.RData", "UNITE_v2023_July2023.RData")
+ # loading reference taxonomy object
+ load("UNITE_v2023_July2023.RData")
+ # removing downloaded file
+ file.remove("UNITE_v2023_July2023.RData")
+
+ ranks <- c("kingdom", "phylum", "class", "order", "family", "genus", "species")
+
+} else if (target_region == "18S" ) {
+
+ download.file("https://www2.decipher.codes/data/Downloads/TrainingSets/PR2_v4_13_March2021.RData", "PR2_v4_13_March2021.RData")
+ # loading reference taxonomy object
+ load("PR2_v4_13_March2021.RData")
+ # removing downloaded file
+ file.remove("PR2_v4_13_March2021.RData")
+
+ ranks <- c("kingdom", "division", "phylum", "class", "order", "family", "genus", "species")
+
+} else {
+ cat("\n\n The requested target_region of ", target_region, " is not accepted.\n\n")
+ quit(status = 1)
+}
+
+ # classifying
+cat("\n\n Assigning taxonomy...\n\n")
+tax_info <- IdTaxa(dna, trainingSet, strand="both", processors=NULL)
+
+ ### generating and writing out standard outputs ###
+ # giving our sequences more manageable names (e.g. ASV_1, ASV_2..., rather than the sequence itself)
+asv_seqs <- colnames(seqtab.nochim)
+asv_headers <- vector(dim(seqtab.nochim)[2], mode="character")
+
+# adding the target region to the ASV headers if an output prefix was provided
+if ( output_prefix != "" ) {
+ for (i in 1:dim(seqtab.nochim)[2]) {
+ asv_headers[i] <- paste(">ASV", target_region, i, sep="_")
+ }
+} else {
+ for (i in 1:dim(seqtab.nochim)[2]) {
+ asv_headers[i] <- paste(">ASV", i, sep="_")
+ }
+}
+
+cat("\n\n Making and writing outputs...\n\n")
+ # making and writing out a fasta of our final ASV seqs:
+asv_fasta <- c(rbind(asv_headers, asv_seqs))
+write(asv_fasta, paste0(final_outputs_dir, output_prefix, "ASVs", assay_suffix, ".fasta"))
+
+ # making and writing out a count table:
+asv_tab <- t(seqtab.nochim)
+asv_ids <- sub(">", "", asv_headers)
+row.names(asv_tab) <- NULL
+asv_tab <- data.frame("ASV_ID"=asv_ids, asv_tab, check.names=FALSE)
+
+write.table(asv_tab, paste0(final_outputs_dir, output_prefix, "counts", assay_suffix, ".tsv"), sep="\t", quote=F, row.names=FALSE)
+
+ # making and writing out a taxonomy table:
+ # vector of desired ranks was created above in ITS/16S/18S target_region if statement
+
+ # creating table of taxonomy and setting any that are unclassified as "NA"
+tax_tab <- t(sapply(tax_info, function(x) {
+ m <- match(ranks, x$rank)
+ taxa <- x$taxon[m]
+ taxa[startsWith(taxa, "unclassified_")] <- NA
+ taxa
+}))
+
+colnames(tax_tab) <- ranks
+row.names(tax_tab) <- NULL
+
+# need to add domain values if this is ITS (due to how the reference taxonomy object is structured, which doesn't have a domain entry)
+if ( target_region == "ITS" ) {
+
+ # only want to add if at least a kingdom was identified (so we don't add Eukarya if nothing was found)
+ new_vec <- ifelse(!is.na(tax_tab[, "kingdom"]) & tax_tab[, "kingdom"] != "NA", "Eukarya", "NA")
+
+ tax_tab <- data.frame("ASV_ID"=asv_ids, "domain" = new_vec, tax_tab, check.names=FALSE)
+
+} else {
+
+ tax_tab <- data.frame("ASV_ID"=asv_ids, tax_tab, check.names=FALSE)
+
+}
+
+# need to change "kingdom" to "domain" if this is 18S (due to how the reference taxonomy object is structured)
+if ( target_region == "18S" ) {
+ colnames(tax_tab)[colnames(tax_tab) == "kingdom"] <- "domain"
+}
+
+write.table(tax_tab, paste0(final_outputs_dir, output_prefix, "taxonomy", assay_suffix, ".tsv"), sep = "\t", quote=F, row.names=FALSE)
+
+ ### generating and writing out biom file format ###
+biom_object <- make_biom(data=asv_tab, observation_metadata=tax_tab)
+write_biom(biom_object, paste0(final_outputs_dir, output_prefix, "taxonomy-and-counts", assay_suffix, ".biom"))
+
+ # making a tsv of combined tax and counts
+tax_and_count_tab <- merge(tax_tab, asv_tab)
+write.table(tax_and_count_tab, paste0(final_outputs_dir, output_prefix, "taxonomy-and-counts", assay_suffix, ".tsv"), sep="\t", quote=FALSE, row.names=FALSE)
+
+cat("\n\n Session info:\n\n")
+sessionInfo()
diff --git a/Amplicon/Illumina/Workflow_Documentation/SW_AmpIllumina-B/workflow_code/scripts/Illumina-SE-R-processing.R b/Amplicon/Illumina/Workflow_Documentation/SW_AmpIllumina-B/workflow_code/scripts/Illumina-SE-R-processing.R
new file mode 100644
index 00000000..ed898d6c
--- /dev/null
+++ b/Amplicon/Illumina/Workflow_Documentation/SW_AmpIllumina-B/workflow_code/scripts/Illumina-SE-R-processing.R
@@ -0,0 +1,242 @@
+##################################################################################
+## R processing script for Illumina single-end amplicon data ##
+## Developed by Michael D. Lee (Mike.Lee@nasa.gov) ##
+##################################################################################
+
+# as called from the associated Snakefile, this expects to be run as: Rscript Illumina-SE-R-processing.R followed by the 12 positional arguments parsed below
+ # where <left_trunc> is the value to be passed to the truncLen parameter of dada2's filterAndTrim()
+ # and <left_maxEE> is the value to be passed to the maxEE parameter of dada2's filterAndTrim()
+
+# checking arguments were provided, first 2 are integers, and setting variables used within R:
+args <- commandArgs(trailingOnly = TRUE)
+
+if (length(args) < 12) {
+ stop("At least 12 positional arguments are required, see top of this R script for more info.", call.=FALSE)
+} else {
+ suppressWarnings(left_trunc <- as.integer(args[1]))
+ suppressWarnings(left_maxEE <- as.integer(args[2]))
+
+ suppressWarnings(GL_trimmed_primers <- args[3])
+ suppressWarnings(sample_IDs_file <- args[4])
+ suppressWarnings(input_reads_dir <- args[5])
+ suppressWarnings(filtered_reads_dir <- args[6])
+ suppressWarnings(input_file_R1_suffix <- args[7])
+ suppressWarnings(filtered_filename_R1_suffix <- args[8])
+ suppressWarnings(final_outputs_dir <- args[9])
+ suppressWarnings(output_prefix <- args[10])
+ suppressWarnings(target_region <- args[11])
+ suppressWarnings(assay_suffix <- args[12])
+
+}
+
+if ( is.na(left_trunc) || is.na(left_maxEE) ) {
+ stop("The 2 first arguments must be integers, see top of R script for more info.", call.=FALSE)
+}
+
+if ( ! GL_trimmed_primers %in% c("TRUE", "FALSE") ) {
+ stop("The third positional argument needs to be 'TRUE' or 'FALSE' for whether or not GL trimmed primers on this dataset, see top of R script for more info.", call.=FALSE)
+} else {
+ GL_trimmed_primers <- as.logical(GL_trimmed_primers)
+}
+
+# general procedure comes largely from these sources:
+ # https://benjjneb.github.io/dada2/tutorial.html
+ # https://astrobiomike.github.io/amplicon/dada2_workflow_ex
+
+ # loading libraries
+library(dada2); packageVersion("dada2")
+library(DECIPHER); packageVersion("DECIPHER")
+library(biomformat); packageVersion("biomformat")
+
+ ### general processing ###
+ # reading in unique sample names into variable
+sample.names <- scan(sample_IDs_file, what="character")
+
+ # setting variables holding the paths to the forward primer-trimmed reads (or "raw" if primers were trimmed prior to submission to GeneLab)
+forward_reads <- paste0(input_reads_dir, sample.names, input_file_R1_suffix)
+
+ # setting variables holding what will be the output paths of all forward filtered reads
+forward_filtered_reads <- paste0(filtered_reads_dir, sample.names, filtered_filename_R1_suffix)
+
+ # adding sample names to the vectors holding the filtered-reads' paths
+names(forward_filtered_reads) <- sample.names
+
+ # running filtering step
+ # reads are written to the files specified in the variables, the "filtered_out" object holds the summary results within R
+filtered_out <- filterAndTrim(fwd=forward_reads, forward_filtered_reads, truncLen=c(left_trunc), maxN=0, maxEE=c(left_maxEE), truncQ=2, rm.phix=TRUE, compress=TRUE, multithread=10)
+
+ # making and writing out summary table that includes counts of filtered reads
+if ( GL_trimmed_primers ) {
+
+ filtered_count_summary_tab <- data.frame(sample=sample.names, cutadapt_trimmed=filtered_out[,1], dada2_filtered=filtered_out[,2])
+
+} else {
+
+ filtered_count_summary_tab <- data.frame(sample=sample.names, starting_reads=filtered_out[,1], dada2_filtered=filtered_out[,2])
+
+}
+
+write.table(filtered_count_summary_tab, paste0(filtered_reads_dir, output_prefix, "filtered-read-counts", assay_suffix, ".tsv"), sep="\t", quote=F, row.names=F)
+
+ # learning errors step
+forward_errors <- learnErrors(forward_filtered_reads, multithread=10)
+
+ # inferring sequences
+forward_seqs <- dada(forward_filtered_reads, err=forward_errors, pool="pseudo", multithread=10)
+
+ # generating a sequence table that holds the counts of each sequence per sample
+seqtab <- makeSequenceTable(forward_seqs)
+
+ # removing putative chimeras
+seqtab.nochim <- removeBimeraDenovo(seqtab, method="consensus", multithread=10, verbose=TRUE)
+
+ # checking what percentage of sequences were retained after chimera removal
+sum(seqtab.nochim)/sum(seqtab) * 100
+
+ # making and writing out a summary table that includes read counts at all steps
+ # helper function
+getN <- function(x) sum(getUniques(x))
+
+if ( GL_trimmed_primers ) {
+
+ raw_and_trimmed_read_counts <- read.table(paste0(input_reads_dir, output_prefix, "trimmed-read-counts", assay_suffix, ".tsv"), header=T, sep="\t")
+ # reading in filtered read counts
+ filtered_read_counts <- read.table(paste0(filtered_reads_dir, output_prefix, "filtered-read-counts", assay_suffix, ".tsv"), header=T, sep="\t")
+
+ count_summary_tab <- data.frame(raw_and_trimmed_read_counts, dada2_filtered=filtered_read_counts[,3],
+ dada2_denoised_F=sapply(forward_seqs, getN),
+ dada2_chimera_removed=rowSums(seqtab.nochim),
+ final_perc_reads_retained=round(rowSums(seqtab.nochim)/raw_and_trimmed_read_counts$raw_reads * 100, 1),
+ row.names=NULL)
+
+} else {
+
+ count_summary_tab <- data.frame(filtered_count_summary_tab,
+ dada2_denoised_F=sapply(forward_seqs, getN),
+ dada2_chimera_removed=rowSums(seqtab.nochim),
+ final_perc_reads_retained=round(rowSums(seqtab.nochim)/filtered_count_summary_tab$starting_reads * 100, 1),
+ row.names=NULL)
+
+}
+
+write.table(count_summary_tab, paste0(final_outputs_dir, output_prefix, "read-count-tracking", assay_suffix, ".tsv"), sep = "\t", quote=F, row.names=F)
+
+ ### assigning taxonomy ###
+ # creating a DNAStringSet object from the ASVs
+dna <- DNAStringSet(getSequences(seqtab.nochim))
+
+ # downloading reference R taxonomy object (at some point this will be stored somewhere on GeneLab's server and we won't download it, but should leave the code here, just commented out)
+cat("\n\n Downloading reference database...\n\n")
+if ( target_region == "16S" ) {
+ download.file("https://www2.decipher.codes/data/Downloads/TrainingSets/SILVA_SSU_r138_2019.RData", "SILVA_SSU_r138_2019.RData")
+ # loading reference taxonomy object
+ load("SILVA_SSU_r138_2019.RData")
+ # removing downloaded file
+ file.remove("SILVA_SSU_r138_2019.RData")
+
+ ranks <- c("domain", "phylum", "class", "order", "family", "genus", "species")
+
+} else if (target_region == "ITS" ) {
+
+ download.file("https://www2.decipher.codes/data/Downloads/TrainingSets/UNITE_v2023_July2023.RData", "UNITE_v2023_July2023.RData")
+ # loading reference taxonomy object
+ load("UNITE_v2023_July2023.RData")
+ # removing downloaded file
+ file.remove("UNITE_v2023_July2023.RData")
+
+ ranks <- c("kingdom", "phylum", "class", "order", "family", "genus", "species")
+
+} else if (target_region == "18S" ) {
+
+ download.file("https://www2.decipher.codes/data/Downloads/TrainingSets/PR2_v4_13_March2021.RData", "PR2_v4_13_March2021.RData")
+ # loading reference taxonomy object
+ load("PR2_v4_13_March2021.RData")
+ # removing downloaded file
+ file.remove("PR2_v4_13_March2021.RData")
+
+ ranks <- c("kingdom", "division", "phylum", "class", "order", "family", "genus", "species")
+
+} else {
+
+ cat("\n\n The requested target_region of ", target_region, " is not accepted.\n\n")
+ quit(status = 1)
+}
+
+ # classifying
+cat("\n\n Assigning taxonomy...\n\n")
+tax_info <- IdTaxa(dna, trainingSet, strand="both", processors=NULL)
+
+ ### generating and writing out standard outputs ###
+ # giving our sequences more manageable names (e.g. ASV_1, ASV_2..., rather than the sequence itself)
+asv_seqs <- colnames(seqtab.nochim)
+asv_headers <- vector(dim(seqtab.nochim)[2], mode="character")
+
+# adding the target region to the ASV headers if an output prefix was provided
+if ( output_prefix != "" ) {
+ for (i in 1:dim(seqtab.nochim)[2]) {
+ asv_headers[i] <- paste(">ASV", target_region, i, sep="_")
+ }
+} else {
+ for (i in 1:dim(seqtab.nochim)[2]) {
+ asv_headers[i] <- paste(">ASV", i, sep="_")
+ }
+}
+
+cat("\n\n Making and writing outputs...\n\n")
+ # making and writing out a fasta of our final ASV seqs:
+asv_fasta <- c(rbind(asv_headers, asv_seqs))
+write(asv_fasta, paste0(final_outputs_dir, output_prefix, "ASVs", assay_suffix, ".fasta"))
+
+ # making and writing out a count table:
+asv_tab <- t(seqtab.nochim)
+asv_ids <- sub(">", "", asv_headers)
+row.names(asv_tab) <- NULL
+asv_tab <- data.frame("ASV_ID"=asv_ids, asv_tab, check.names=FALSE)
+
+write.table(asv_tab, paste0(final_outputs_dir, output_prefix, "counts", assay_suffix, ".tsv"), sep="\t", quote=F, row.names=FALSE)
+
+ # making and writing out a taxonomy table:
+ # vector of desired ranks was created above in the ITS/16S/18S target_region if statement
+
+ # creating table of taxonomy and setting any that are unclassified as "NA"
+tax_tab <- t(sapply(tax_info, function(x) {
+ m <- match(ranks, x$rank)
+ taxa <- x$taxon[m]
+ taxa[startsWith(taxa, "unclassified_")] <- NA
+ taxa
+}))
+
+colnames(tax_tab) <- ranks
+row.names(tax_tab) <- NULL
+
+# need to add domain values if this is ITS (due to how the reference taxonomy object is structured, which doesn't have a domain entry)
+if ( target_region == "ITS" ) {
+
+ # only want to add if at least a kingdom was identified (so we don't add Eukarya if nothing was found)
+ new_vec <- ifelse(!is.na(tax_tab[, "kingdom"]) & tax_tab[, "kingdom"] != "NA", "Eukarya", "NA")
+
+ tax_tab <- data.frame("ASV_ID"=asv_ids, "domain" = new_vec, tax_tab, check.names=FALSE)
+
+} else {
+
+ tax_tab <- data.frame("ASV_ID"=asv_ids, tax_tab, check.names=FALSE)
+
+}
+
+# need to change "kingdom" to "domain" if this is 18S (due to how the reference taxonomy object is structured)
+if ( target_region == "18S" ) {
+ colnames(tax_tab)[colnames(tax_tab) == "kingdom"] <- "domain"
+}
+
+write.table(tax_tab, paste0(final_outputs_dir, output_prefix, "taxonomy", assay_suffix, ".tsv"), sep = "\t", quote=F, row.names=FALSE)
+
+ ### generating and writing out biom file format ###
+biom_object <- make_biom(data=asv_tab, observation_metadata=tax_tab)
+write_biom(biom_object, paste0(final_outputs_dir, output_prefix, "taxonomy-and-counts", assay_suffix, ".biom"))
+
+ # making a tsv of combined tax and counts
+tax_and_count_tab <- merge(tax_tab, asv_tab)
+write.table(tax_and_count_tab, paste0(final_outputs_dir, output_prefix, "taxonomy-and-counts", assay_suffix, ".tsv"), sep="\t", quote=FALSE, row.names=FALSE)
+
+cat("\n\n Session info:\n\n")
+sessionInfo()
diff --git a/Amplicon/Illumina/Workflow_Documentation/SW_AmpIllumina-B/workflow_code/scripts/combine-benchmarks.sh b/Amplicon/Illumina/Workflow_Documentation/SW_AmpIllumina-B/workflow_code/scripts/combine-benchmarks.sh
new file mode 100755
index 00000000..7c006303
--- /dev/null
+++ b/Amplicon/Illumina/Workflow_Documentation/SW_AmpIllumina-B/workflow_code/scripts/combine-benchmarks.sh
@@ -0,0 +1,18 @@
+#!/usr/bin/env bash
+set -e
+
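+# combines all per-rule benchmark tables in benchmarks/ into a single all-benchmarks.tsv,
+# prepending a "process" column derived from each benchmark filename
+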
+ls benchmarks/ > benchmark-filenames.tmp
+
+head -n 1 benchmarks/$( head -n 1 benchmark-filenames.tmp ) > benchmark-header.tmp
+
+paste <( printf "process" ) benchmark-header.tmp > building-tab.tmp
+
+for file in $(cat benchmark-filenames.tmp)
+do
+
+ cat <( paste <( echo ${file} | sed 's/-benchmarks.tsv//' ) <( tail -n +2 benchmarks/${file} ) ) >> building-tab.tmp
+
+done
+
+mv building-tab.tmp all-benchmarks.tsv
+rm -rf benchmark-filenames.tmp benchmark-header.tmp
diff --git a/Amplicon/Illumina/Workflow_Documentation/SW_AmpIllumina-B/workflow_code/scripts/copy_info.py b/Amplicon/Illumina/Workflow_Documentation/SW_AmpIllumina-B/workflow_code/scripts/copy_info.py
new file mode 100644
index 00000000..db43e445
--- /dev/null
+++ b/Amplicon/Illumina/Workflow_Documentation/SW_AmpIllumina-B/workflow_code/scripts/copy_info.py
@@ -0,0 +1,64 @@
+import os
+import shutil
+import yaml
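+
+# Copies the run's configuration, logs, runsheet, benchmarks, and the envs/scripts/config
+# directories into the processing-info output directory defined in config.yaml.
+# Intended to be run from the workflow directory (it reads ./config.yaml).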
+
+def copy_file(src, dest):
+ try:
+ shutil.copy(src, dest)
+ except Exception as e:
+ print(f"Error copying {src} to {dest}: {e}")
+
+def copy_directory(src, dest):
+ try:
+ shutil.copytree(src, dest, dirs_exist_ok=True)
+ except Exception as e:
+ print(f"Error copying {src} directory to {dest}: {e}")
+
+def main(config, sample_IDs_file):
+ info_out_dir = config["info_out_dir"]
+ output_prefix = config.get("output_prefix", "") # Get the output_prefix, default to empty string if not found
+ os.makedirs(info_out_dir, exist_ok=True)
+ os.makedirs(os.path.join(info_out_dir, "benchmarks"), exist_ok=True)
+
+ # Files to copy
+ files_to_copy = [
+ ("config.yaml", os.path.join(info_out_dir, "config.yaml")),
+ (sample_IDs_file, os.path.join(info_out_dir, os.path.basename(sample_IDs_file))),
+ (config["runsheet"], os.path.join(info_out_dir, os.path.basename(config["runsheet"]))),
+ ("R-processing.log", os.path.join(info_out_dir, "R-processing.log")),
+ ("all-benchmarks.tsv", os.path.join(info_out_dir,"all-benchmarks.tsv")),
+ ("Snakefile", os.path.join(info_out_dir, "Snakefile"))
+ ]
+
+ # Check and add "R-visualizations.log" if it exists (visualizations are optional)
+ r_visualizations_log_path = "R-visualizations.log"
+ if os.path.isfile(r_visualizations_log_path):
+ files_to_copy.append((r_visualizations_log_path, os.path.join(info_out_dir, "R-visualizations.log")))
+
+ # Optional ISA archive
+ if config.get("isa_archive") and os.path.isfile(config["isa_archive"]):
+ files_to_copy.append((config["isa_archive"], os.path.join(info_out_dir, os.path.basename(config["isa_archive"]))))
+
+ # Directories to copy
+ directories_to_copy = [
+ ("benchmarks", os.path.join(info_out_dir, "benchmarks")),
+ ("envs", os.path.join(info_out_dir, "envs")),
+ ("scripts", os.path.join(info_out_dir, "scripts")),
+ ("config", os.path.join(info_out_dir, "config"))
+ ]
+
+ # Copy directories
+ for src, dest in directories_to_copy:
+ copy_directory(src, dest)
+
+ # Copy files
+ for src, dest in files_to_copy:
+ copy_file(src, dest)
+
+
+if __name__ == "__main__":
+ with open('config.yaml') as f:
+ config = yaml.safe_load(f)
+ sample_IDs_file = config['sample_info_file']
+
+ main(config, sample_IDs_file)
\ No newline at end of file
diff --git a/Amplicon/Illumina/Workflow_Documentation/SW_AmpIllumina-B/workflow_code/scripts/run_workflow.py b/Amplicon/Illumina/Workflow_Documentation/SW_AmpIllumina-B/workflow_code/scripts/run_workflow.py
new file mode 100644
index 00000000..63b6c235
--- /dev/null
+++ b/Amplicon/Illumina/Workflow_Documentation/SW_AmpIllumina-B/workflow_code/scripts/run_workflow.py
@@ -0,0 +1,768 @@
+import argparse
+import subprocess
+import os
+import sys
+import tempfile
+import re
+import shutil
+import pandas as pd
+#import pandera as pa
+import requests
+import yaml
+####################
+## 1. For OSD ARG #
+####################
+# 1. Process the OSD arg to proper format
+# 2. Download the ISA file
+# 3. Convert to runsheet(s)
+# 4. Select which runsheet to use
+
+########################
+## 1. For runsheet arg #
+########################
+# 1. Select which runsheet to use
+
+##########################
+## 2. Neutral flow after #
+##########################
+# 1. Validate schema of runsheet
+# 2. Check if read_paths are URLs, prompt for download
+# 3. Create config.yaml and unique-sample-IDs.txt
+# 4. If --run is used: run the workflow
+
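+# Example invocations (accession shown as a placeholder):
+#   python scripts/run_workflow.py --OSD OSD-### --target 16S --run
+#   python scripts/run_workflow.py --runsheetPath /path/to/runsheet.csv --run
+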
+# Process OSD arg: if numeric, append OSD-, if OSD-# or GLDS-#, leave it
+def process_osd_argument(osd_arg):
+ # Check if the argument is just numeric
+ if osd_arg.isdigit():
+ return f"OSD-{osd_arg}"
+ # Check if it's already in the correct format (OSD-numeric or GLDS-numeric)
+ elif re.match(r'^(OSD|GLDS)-\d+$', osd_arg):
+ return osd_arg
+ else:
+ print("Invalid format for --OSD argument. Use 'numeric', 'OSD-numeric', or 'GLDS-numeric'.")
+ sys.exit(1)
+
+# Check provided OSD/GLDS is not on the list of those that can't be autoprocessed
+def check_provided_osd_or_glds(osd_arg):
+ # dictionaries of OSD/GLDS accessions and reason for not running, key = ID: value = reason
+ # there are 3 because ID can be provided prefixed with "OSD-", "GLDS-", or nothing - not the most efficient here, but ¯\_(ツ)_/¯
+ not_autoprocessable_OSD_dict = {
+ "OSD-65": "This dataset has multiple different primers mixed in different orientations in each individual sample, and the workflow is unable to handle it in an automated fashion.",
+ "OSD-66": "This dataset is not a standard amplicon dataset. It is comprised of hundreds of different primers targeting different regions of specific organisms, and the workflow is unable to handle it.",
+ "OSD-82": "This dataset is still multiplexed, and we don't yet have the mapping information to split the samples apart appropriately."
+ }
+
+ not_autoprocessable_GLDS_dict = {
+ "GLDS-65": "This dataset has multiple different primers mixed in different orientations in each individual sample, and the workflow is unable to handle it in an automated fashion.",
+ "GLDS-66": "This dataset is not a standard amplicon dataset. It is comprised of hundreds of different primers targeting different regions of specific organisms, and the workflow is unable to handle it.",
+ "GLDS-82": "This dataset is still multiplexed, and we don't yet have the mapping information to split the samples apart appropriately."
+ }
+
+ not_autoprocessable_dict = {
+ "65": "This dataset has multiple different primers mixed in different orientations in each individual sample, and the workflow is unable to handle it in an automated fashion.",
+ "66": "This dataset is not a standard amplicon dataset. It is comprised of hundreds of different primers targeting different regions of specific organisms, and the workflow is unable to handle it.",
+ "82": "This dataset is still multiplexed, and we don't yet have the mapping information to split the samples apart appropriately."
+ }
+
+ # checking based on OSD IDs
+ if osd_arg in not_autoprocessable_OSD_dict:
+ print(f"\nThe specified dataset {osd_arg} is unable to be processed with this workflow.")
+ print(f" Reason: {not_autoprocessable_OSD_dict[osd_arg]}\n")
+ sys.exit(1)
+
+ # checking based on GLDS IDs
+ if osd_arg in not_autoprocessable_GLDS_dict:
+ print(f"\n The specified dataset {osd_arg} is unable to be processed with this workflow.")
+ print(f" Reason: {not_autoprocessable_GLDS_dict[osd_arg]}\n")
+ sys.exit(1)
+
+ # checking based on plain IDs
+ if osd_arg in not_autoprocessable_dict:
+ print(f"\n The specified dataset {osd_arg} is unable to be processed with this workflow.")
+ print(f" Reason: {not_autoprocessable_dict[osd_arg]}\n")
+ sys.exit(1)
+
+# Run dpt-get-isa-archive in a temp folder, move it back to cd, return the filename
+def download_isa_archive(accession_number):
+ with tempfile.TemporaryDirectory() as temp_dir:
+ try:
+ # Run the command in the temporary directory
+ subprocess.run(
+ ["dpt-get-isa-archive", "--accession", str(accession_number)],
+ check=True,
+ text=True,
+ cwd=temp_dir
+ )
+
+ # Find the downloaded zip file in the temp directory
+ downloaded_files = [f for f in os.listdir(temp_dir) if f.endswith('.zip')]
+ if not downloaded_files:
+ print("No ISA archive file was downloaded.", file=sys.stderr)
+ return None
+
+ # Assuming there's only one file, get its name
+ downloaded_file = downloaded_files[0]
+
+ # Move the file back to the current directory
+ shutil.move(os.path.join(temp_dir, downloaded_file), downloaded_file)
+
+ full_path = os.path.abspath(downloaded_file)
+ return full_path
+
+ except subprocess.CalledProcessError as e:
+ print("An error occurred while downloading ISA archive.", file=sys.stderr)
+ sys.exit(1)
+
+# Run dpt-isa-to-runsheet in a temp folder, move runsheet(s) back to cd, return list of runsheet(s)
+def convert_isa_to_runsheet(accession_number, isa_zip):
+ with tempfile.TemporaryDirectory() as temp_dir:
+ # Copy the ISA archive to the temporary directory
+ temp_isa_zip_path = shutil.copy(isa_zip, temp_dir)
+
+ try:
+ # Run the dpt-isa-to-runsheet command in the temporary directory
+ subprocess.run(
+ ["dpt-isa-to-runsheet", "--accession", accession_number, "--config-type", "amplicon", "--config-version", "Latest", "--isa-archive", os.path.basename(temp_isa_zip_path)],
+ check=True,
+ cwd=temp_dir,
+ stdout=sys.stdout,
+ stderr=sys.stderr
+ )
+
+ # Get the list of created files in the temp directory
+ created_files = [f for f in os.listdir(temp_dir) if os.path.isfile(os.path.join(temp_dir, f)) and f != os.path.basename(temp_isa_zip_path)]
+
+ # Move the created files back to the current directory
+ moved_files = []
+ for file in created_files:
+ shutil.move(os.path.join(temp_dir, file), file)
+ moved_files.append(file)
+
+ return moved_files
+
+ except subprocess.CalledProcessError as e:
+ print("An error occurred while converting ISA archive to runsheet.", file=sys.stderr)
+ sys.exit(1)
+
+
+def handle_runsheet_selection(runsheet_files, target=None, specified_runsheet=None):
+ selected_runsheet = None
+
+ # Change specified_runsheet to a basename in case a path is used as an arg for run_workflow.py
+ if specified_runsheet:
+ specified_runsheet_basename = os.path.basename(specified_runsheet)
+ else:
+ specified_runsheet_basename = None
+
+ # Use the specified runsheet if provided (compare by basename so a full path also works)
+ if specified_runsheet_basename and specified_runsheet_basename in runsheet_files:
+ selected_runsheet = specified_runsheet_basename
+ print(f"Using specified runsheet: {selected_runsheet}")
+ return selected_runsheet
+
+ if len(runsheet_files) == 1:
+ runsheet = runsheet_files[0]
+ if target:
+ try:
+ runsheet_df = pd.read_csv(runsheet)
+ target_region = runsheet_df['Parameter Value[Library Selection]'].unique()[0]
+ if target.lower() == target_region.lower():
+ selected_runsheet = runsheet
+ except Exception as e:
+ print(f"Error reading {runsheet}: {e}")
+ else:
+ # only one runsheet was generated, so use it even if no target was specified
+ selected_runsheet = runsheet
+ if selected_runsheet:
+ print(f"Using runsheet: {selected_runsheet}")
+ else:
+ print("The runsheet found does not match the specified genomic target. Check the --target argument or specify a runsheet using --specify-runsheet.")
+ return None
+
+ elif len(runsheet_files) > 1:
+ if target:
+ matching_runsheets = []
+ for runsheet in runsheet_files:
+ try:
+ runsheet_df = pd.read_csv(runsheet)
+ target_region = runsheet_df['Parameter Value[Library Selection]'].unique()[0]
+ if target.lower() == target_region.lower():
+ matching_runsheets.append(runsheet)
+ except Exception as e:
+ print(f"Error reading {runsheet}: {e}")
+
+ if len(matching_runsheets) == 1:
+ # One matching runsheet found
+ selected_runsheet = matching_runsheets[0]
+ print(f"Using runsheet: {selected_runsheet}")
+
+ elif len(matching_runsheets) > 1:
+ # Multiple matching runsheets found
+ print("The study contains multiple assays with the same target. Please specify one of the following runsheet names as a parameter for the --specify-runsheet argument:")
+ for rs in matching_runsheets:
+ print(rs)
+ return None
+
+ else:
+ # No matching runsheets found
+ print("No runsheet matches the specified genomic target. Please check the target or specify a runsheet using --specify-runsheet.")
+ return None
+
+ else:
+ # No target specified and multiple runsheets are available
+ print("Multiple runsheets found but no genomic target specified. Cannot proceed. Use -t {16S, 18S, ITS} or --target {16S, 18S, ITS} to specify which assay/dataset to use.")
+ return None
+
+ # Remove unselected runsheet files if a runsheet was selected
+ if selected_runsheet:
+ unselected_runsheets = [file for file in runsheet_files if file != selected_runsheet]
+ for file in unselected_runsheets:
+ try:
+ os.remove(file)
+ except Exception as e:
+ pass
+
+ return selected_runsheet
+
+def check_runsheet_read_paths(runsheet_df):
+ # Check if a string is a URL / genelab URL
+ def is_url(s):
+ return "http://" in s or "https://" in s or "genelab-data.ndc.nasa.gov" in s
+
+
+
+ # Check the first row to determine if the paths are URLs or local paths
+ first_row = runsheet_df.iloc[0]
+
+ uses_url = is_url(first_row['read1_path'])
+ if uses_url:
+ print("Runsheet references URLs.")
+ else:
+ print("Runsheet references local read files.")
+
+ return uses_url
+
+def sample_IDs_from_local(runsheet_df, output_file='unique-sample-IDs.txt'):
+ # Check if the DataFrame is paired-end
+ paired_end = runsheet_df['paired_end'].eq(True).all()
+
+ with open(output_file, 'w') as file:
+ for index, row in runsheet_df.iterrows():
+ # Extract base names minus the suffixes
+ base_read1 = os.path.basename(row['read1_path']).replace(row['raw_R1_suffix'], '')
+
+ if paired_end:
+ base_read2 = os.path.basename(row['read2_path']).replace(row['raw_R2_suffix'], '')
+ # Check if base names match for paired-end data, necessary for snakemake arg expansion
+ if base_read1 != base_read2:
+ print(f"Mismatch in sample IDs in row {index}: {base_read1} vs {base_read2}")
+ sys.exit(1)
+
+ # Write the base name to the file
+ file.write(f"{base_read1}\n")
+
+ print(f"Unique sample IDs written to {output_file}")
+
+def handle_url_downloads(runsheet_df, output_file='unique-sample-IDs.txt'):
+ print("Downloading read files...")
+ # Check if the DataFrame is paired-end
+ paired_end = runsheet_df['paired_end'].eq(True).all()
+ # Write 'Sample Name' into unique-sample-IDs.txt
+ with open(output_file, 'w') as file:
+ for sample_name in runsheet_df['Sample Name']:
+ file.write(sample_name + '\n')
+
+ # Create ./raw_reads/ directory if it does not exist
+ raw_reads_dir = os.path.abspath('./raw_reads/')
+ if not os.path.exists(raw_reads_dir):
+ os.makedirs(raw_reads_dir)
+
+ # Initialize count for skipped downloads
+ skipped_downloads_count = 0
+ # Iterate over each row and download files if they don't exist
+ for _, row in runsheet_df.iterrows():
+ sample_id = row['Sample Name']
+ read1_path = os.path.join(raw_reads_dir, sample_id + row['raw_R1_suffix'])
+ read2_path = os.path.join(raw_reads_dir, sample_id + row['raw_R2_suffix']) if paired_end else None
+
+ # Download Read 1 if it doesn't exist
+ if not os.path.exists(read1_path):
+ download_url_to_file(row['read1_path'], read1_path)
+ else:
+ skipped_downloads_count += 1
+
+ # Download Read 2 if it doesn't exist and if paired_end
+ if paired_end and read2_path and not os.path.exists(read2_path):
+ download_url_to_file(row['read2_path'], read2_path)
+ elif paired_end and read2_path:
+ skipped_downloads_count += 1
+
+ # Print the number of skipped downloads
+ if skipped_downloads_count > 0:
+ print(f"{skipped_downloads_count} read file(s) were already present and were not downloaded.")
+
+def download_url_to_file(url, file_path, max_retries=3, timeout_seconds=120):
+ retries = 0
+ success = False
+
+ while retries < max_retries and not success:
+ try:
+ response = requests.get(url, stream=True, timeout=timeout_seconds)
+ response.raise_for_status() # Raises an HTTPError for bad status codes
+
+ with open(file_path, 'wb') as file:
+ shutil.copyfileobj(response.raw, file)
+ success = True
+
+ except (requests.exceptions.HTTPError, requests.exceptions.ConnectionError, requests.exceptions.Timeout) as e:
+ retries += 1
+ print(f"Attempt {retries}: Error occurred: {e}")
+
+ except requests.exceptions.RequestException as e:
+ print(f"An unexpected error occurred: {e}")
+ break
+
+ if not success:
+ print("Failed to download the read files.")
+
+
+def reverse_complement(seq):
+ complement = {'A': 'T', 'T': 'A', 'G': 'C', 'C': 'G',
+ 'R': 'Y', 'Y': 'R', 'S': 'S', 'W': 'W',
+ 'K': 'M', 'M': 'K', 'B': 'V', 'V': 'B',
+ 'D': 'H', 'H': 'D', 'N': 'N'}
+ return ''.join(complement.get(base, base) for base in reversed(seq))
+
+def create_config_yaml(isa_zip,
+ runsheet_file,
+ runsheet_df,
+ uses_urls,
+ output_dir,
+ min_trimmed_length,
+ trim_primers,
+ primers_linked,
+ anchor_primers,
+ discard_untrimmed,
+ left_trunc,
+ right_trunc,
+ left_maxEE,
+ right_maxEE,
+ concatenate_reads_only,
+ output_prefix,
+ enable_visualizations):
+
+ # Extract necessary variables from runsheet_df
+ data_type = "PE" if runsheet_df['paired_end'].eq(True).all() else "SE"
+ raw_R1_suffix = runsheet_df['raw_R1_suffix'].unique()[0]
+ raw_R2_suffix = runsheet_df['raw_R2_suffix'].unique()[0] if data_type == "PE" else ""
+ f_primer = runsheet_df['F_Primer'].unique()[0]
+ r_primer = runsheet_df['R_Primer'].unique()[0] if data_type == "PE" else ""
+ target_region = runsheet_df['Parameter Value[Library Selection]'].unique()[0]
+
+ # Determine raw_reads_directory
+ if uses_urls:
+ raw_reads_directory = os.path.abspath('./raw_reads/') + '/'
+ else:
+ read1_path_dir = os.path.dirname(runsheet_df['read1_path'].iloc[0])
+ raw_reads_directory = os.path.abspath(read1_path_dir) + '/' if read1_path_dir else "./"
+
+
+ # Other default values
+ output_dir = os.path.abspath(output_dir) + '/'
+ primer_anchor = "^" if anchor_primers == "TRUE" else ""
+
+ f_linked_primer = f"{primer_anchor}{f_primer}...{reverse_complement(r_primer)}"
+ r_linked_primer = f"{primer_anchor}{r_primer}...{reverse_complement(f_primer)}"
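+ # linked-primer strings follow cutadapt's FWD...revcomp(REV) convention (see the cutadapt links written into the config comments below)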
+
+ # Make output_dir if it doesn't exist
+ if not os.path.exists(output_dir):
+ os.makedirs(output_dir)
+
+ info_out_dir = os.path.join(output_dir, output_prefix + "Processing_Info") + os.sep
+ fastqc_out_dir = os.path.join(output_dir, "FastQC_Outputs") + os.sep
+ trimmed_reads_dir = os.path.join(output_dir, "Trimmed_Sequence_Data") + os.sep
+ filtered_reads_dir = os.path.join(output_dir, "Filtered_Sequence_Data") + os.sep
+ final_outputs_dir = os.path.join(output_dir, "Final_Outputs") + os.sep
+ plots_dir = final_outputs_dir + "Plots" + os.sep
+
+ # Write to config.yaml
+ with open('config.yaml', 'w') as file:
+ file.write("############################################################################################\n")
+ file.write("## Configuration file for GeneLab Illumina amplicon processing workflow ##\n")
+ file.write("## Developed by Michael D. Lee (Mike.Lee@nasa.gov) ##\n")
+ file.write("############################################################################################\n\n")
+
+ file.write("############################################################\n")
+ file.write("##################### VARIABLES TO SET #####################\n")
+ file.write("############################################################\n\n")
+
+ file.write("###########################################################################\n")
+ file.write("##### These need to match what is specific to our system and our data #####\n")
+ file.write("###########################################################################\n\n")
+
+ file.write("## Path to ISA archive, only needed for saving a copy as metadata:\n")
+ file.write(f"isa_archive:\n \"{isa_zip}\"\n\n")
+
+ file.write("## Path to runsheet:\n")
+ file.write(f"runsheet:\n \"{os.path.abspath(runsheet_file)}\"\n\n")
+
+ file.write("## Set to \"PE\" for paired-end, \"SE\" for single-end.\n")
+ file.write(f"data_type:\n \"{data_type}\"\n\n")
+
+ file.write("## single-column file with unique sample identifiers:\n")
+ file.write("sample_info_file:\n \"unique-sample-IDs.txt\"\n\n")
+
+ file.write("## input reads directory (can be relative to workflow directory, or needs to be full path):\n")
+ file.write(f"raw_reads_dir:\n \"{raw_reads_directory}\"\n\n")
+
+ file.write("## raw read suffixes:\n")
+ file.write(" # e.g. for paired-end data, Sample-1_R1_raw.fastq.gz would be _R1_raw.fastq.gz for 'raw_R1_suffix' below\n")
+ file.write(" # e.g. if single-end, Sample-1.fastq.gz would be .fastq.gz for 'raw_R1_suffix' below, and 'raw_R2_suffix' won't be used\n")
+ file.write(f"raw_R1_suffix:\n \"{raw_R1_suffix}\"\n")
+ file.write(f"raw_R2_suffix:\n \"{raw_R2_suffix}\"\n\n")
+
+ file.write("## if we are trimming primers or not (\"TRUE\", or \"FALSE\")\n")
+ file.write(f"trim_primers:\n \"{trim_primers}\"\n\n")
+
+ file.write("## primer sequences if we are trimming them (include anchoring symbols, e.g. '^', as needed, see: https://cutadapt.readthedocs.io/en/stable/guide.html#adapter-types)\n")
+ file.write(f"F_primer:\n \"{primer_anchor}{f_primer}\"\n")
+ file.write(f"R_primer:\n \"{primer_anchor}{r_primer}\"\n\n")
+
+ # For linked primers
+ file.write("## should cutadapt treat these as linked primers? (https://cutadapt.readthedocs.io/en/stable/recipes.html#trimming-amplicon-primers-from-paired-end-reads)\n")
+ file.write(f"primers_linked:\n \"{primers_linked}\"\n\n")
+ file.write("## if primers are linked, we need to provide them as below, where the second half, following three periods, is the other primer reverse-complemented\n")
+ file.write(f" # (can reverse complement while retaining ambiguous bases at this site: http://arep.med.harvard.edu/labgc/adnan/projects/Utilities/revcomp.html)\n")
+ file.write(f" # include anchoring symbols, e.g. '^', as needed, see: https://cutadapt.readthedocs.io/en/stable/guide.html#adapter-types\n")
+ file.write(f"F_linked_primer:\n \"{f_linked_primer}\"\n")
+ file.write(f"R_linked_primer:\n \"{r_linked_primer}\"\n\n")
+
+ file.write("## discard untrimmed, sets the \"--discard-untrimmed\" option if TRUE\n")
+ file.write(f"discard_untrimmed:\n \"{discard_untrimmed}\"\n\n")
+
+ file.write("## target region (16S, 18S, or ITS is acceptable)\n")
+ file.write(" # this determines which reference database is used for taxonomic classification\n")
+ file.write(" # all are pulled from the pre-packaged DECIPHER downloads page here: http://www2.decipher.codes/Downloads.html\n")
+ file.write(" # 16S uses SILVA\n")
+ file.write(" # ITS uses UNITE\n")
+ file.write(" # 18S uses PR2\n")
+ file.write(f"target_region:\n \"{target_region}\"\n\n")
+
+ file.write("## concatenate only with dada2 instead of merging paired reads if TRUE\n")
+ file.write(" # this is typically used with primers like 515-926, that captured 18S fragments that are typically too long to merge\n")
+ file.write(" # note that 16S and 18S should have been separated already prior to running this workflow\n")
+ file.write(" # this should likely be left as FALSE for any option other than \"18S\" above\n\n")
+
+ file.write(f"concatenate_reads_only:\n \"{concatenate_reads_only}\"\n\n")
+ file.write(f"## values to be passed to dada2's filterAndTrim() function:\n")
+ file.write(f"left_trunc:\n {left_trunc}\n")
+ file.write(f"right_trunc:\n {right_trunc}\n")
+ file.write(f"left_maxEE:\n {left_maxEE}\n")
+ file.write(f"right_maxEE:\n {right_maxEE}\n\n")
+
+ file.write("## minimum length threshold for cutadapt\n")
+ file.write(f"min_cutadapt_len:\n {min_trimmed_length}\n\n")
+
+ file.write("######################################################################\n")
+ file.write("##### The rest only need to be altered if we want to change them #####\n")
+ file.write("######################################################################\n\n")
+
+ file.write("## filename suffixes\n")
+ file.write("primer_trimmed_R1_suffix:\n \"_R1_trimmed.fastq.gz\"\n")
+ file.write("primer_trimmed_R2_suffix:\n \"_R2_trimmed.fastq.gz\"\n\n")
+
+ file.write("filtered_R1_suffix:\n \"_R1_filtered.fastq.gz\"\n")
+ file.write("filtered_R2_suffix:\n \"_R2_filtered.fastq.gz\"\n\n")
+
+ file.write("## output prefix (if needed to distinguish from multiple primer sets, leave as empty string if not, include connecting symbol if adding, e.g. \"ITS-\")\n")
+ file.write(f"output_prefix:\n \"{output_prefix}\"\n\n")
+
+ file.write("## output directories (all relative to processing directory, they will be created if needed)\n")
+ file.write(f"info_out_dir:\n \"{info_out_dir}\"\n")
+ file.write(f"fastqc_out_dir:\n \"{fastqc_out_dir}\"\n")
+ file.write(f"trimmed_reads_dir:\n \"{trimmed_reads_dir}\"\n")
+ file.write(f"filtered_reads_dir:\n \"{filtered_reads_dir}\"\n")
+ file.write(f"final_outputs_dir:\n \"{final_outputs_dir}\"\n")
+ file.write(f"plots_dir:\n \"{plots_dir}\"\n\n")
+
+ file.write(f"enable_visualizations:\n \"{enable_visualizations}\"\n\n")
+
+ # For general info and example usage command
+ file.write("############################################################\n")
+ file.write("###################### GENERAL INFO ########################\n")
+ file.write("############################################################\n")
+ file.write("# Workflow is currently equipped to work with paired-end data only, and reads are expected to be gzipped\n\n")
+ file.write("## example usage command ##\n")
+ file.write("# snakemake --use-conda --conda-prefix ${CONDA_PREFIX}/envs -j 2 -p\n\n")
+ file.write("# `--use-conda` – this specifies to use the conda environments included in the workflow\n")
+ file.write("# `--conda-prefix` – this allows us to point to where the needed conda environments should be stored...\n")
+ file.write("# `-j` – this lets us set how many jobs Snakemake should run concurrently...\n")
+ file.write("# `-p` – specifies to print out each command being run to the screen\n\n")
+ file.write("# See `snakemake -h` for more options and details.\n")
+ print("config.yaml was successfully created.")
+
+# (see the create_config_yaml() call in main() below for example usage)
+
+# Check for single primer set, also check for invalid characters in primers used, exit if either
+def validate_primer_sequences(runsheet_df):
+ errors = []
+
+ # Check that there is only 1 entry in each primer column
+ if len(runsheet_df['F_Primer'].unique()) > 1:
+ errors.append(f"Multiple primer sequences present in F_Primer: {runsheet_df['F_Primer'].unique()}.")
+
+ if len(runsheet_df['R_Primer'].unique()) > 1:
+ errors.append(f"Multiple primer sequences present in R_primer: {runsheet_df['R_Primer'].unique()}.")
+
+
+ # Check for non-letter characters in primer sequences
+ def has_non_letter_characters(primer):
+ # Pattern to find any character that is not a letter
+ non_letter_pattern = re.compile(r'[^A-Za-z]')
+ return non_letter_pattern.search(primer)
+
+ # Check each unique primer in the F_Primer and R_Primer columns
+ for f_primer in runsheet_df['F_Primer'].unique():
+ if has_non_letter_characters(f_primer):
+ errors.append(f"Non-letter characters detected in F_Primer: '{f_primer}'")
+
+ for r_primer in runsheet_df['R_Primer'].unique():
+ if has_non_letter_characters(r_primer):
+ errors.append(f"Non-letter characters detected in R_Primer: '{r_primer}'")
+
+ if errors:
+ print("Error: Invalid primer sequence(s) detected in the runsheet.")
+ for error in errors:
+ print(f" - {error}")
+ print("Correct the primer sequences in the runsheet and rerun the workflow from the runsheet using the --runsheetPath argument.")
+ sys.exit(1)
+
+
+def main():
+ # Argument parser setup with short argument names and an automatic help option
+ parser = argparse.ArgumentParser(
+ description='Run workflow for GeneLab data processing.',
+ add_help=True,
+ usage='%(prog)s [options]' # Custom usage message
+ )
+
+ parser.add_argument('-o', '--OSD',
+ metavar='osd_number',
+ help='Set up the Snakemake workflow for a GeneLab OSD dataset and pull necessary read files and metadata. Acceptable formats: ###, OSD-###, GLDS-###',
+ type=str)
+
+ parser.add_argument('-t', '--target',
+ choices=['16S', '18S', 'ITS'],
+ help='Specify the genomic target for the assay. Options: 16S, 18S, ITS. This is used to select the appropriate dataset from an OSD study when multiple options are available.',
+ type=str)
+
+ parser.add_argument('-r', '--runsheetPath',
+ metavar='/path/to/runsheet.csv',
+ help='Set up the Snakemake workflow using a specified runsheet file.',
+ type=str)
+
+ parser.add_argument('-x', '--run',
+ metavar='command',
+ nargs='?',
+ const="snakemake --use-conda --conda-prefix ${CONDA_PREFIX}/envs -j 2 -p",
+ type=str,
+ help='Specifies the command used to execute the snakemake workflow; Default: "snakemake --use-conda --conda-prefix ${CONDA_PREFIX}/envs -j 2 -p"')
+
+ parser.add_argument('-d', '--outputDir',
+ metavar='/path/to/outputDir/',
+ default='./workflow_output/', # Default value
+ help='Specifies the output directory for the output files generated by the workflow. Default: ./workflow_output/',
+ type=str)
+
+ parser.add_argument('--specify-runsheet',
+ help='Specifies the runsheet for an OSD dataset by name. Only used if there are multiple datasets with the same target in the study.',
+ metavar='runsheet_name',
+ type=str)
+
+ parser.add_argument('--trim-primers',
+ choices=['TRUE', 'FALSE'],
+ default='TRUE',
+ help='Specifies to trim primers (TRUE) or not (FALSE). Default: TRUE',
+ type=str)
+
+ parser.add_argument('-m', '--min_trimmed_length',
+ metavar='length',
+ default=130, # Default value
+ help='Specifies the MINIMUM length of trimmed reads. For paired-end data: if one read gets filtered, both reads are discarded. Default: 130',
+ type=int)
+
+ parser.add_argument('--primers-linked',
+ choices=['TRUE', 'FALSE'],
+ default='FALSE',
+ help='If set to TRUE, instructs cutadapt to treat the primers as linked. Default: FALSE',
+ type=str)
+
+ parser.add_argument('--anchor-primers',
+ choices=['TRUE', 'FALSE'],
+ default='FALSE',
+ help='Indicates if primers should be anchored (TRUE) or not (FALSE). Default: FALSE',
+ type=str)
+
+ parser.add_argument('--discard-untrimmed',
+ choices=['TRUE', 'FALSE'],
+ default='TRUE',
+ help='If set to TRUE, instructs cutadapt to remove reads if the primers were not found in the expected location; if FALSE, these reads are kept. Default: TRUE',
+ type=str)
+
+ parser.add_argument('--left-trunc',
+ default=0,
+ help='Specifies the length of the forward reads, bases beyond this length will be truncated and reads shorter than this length are discarded. Default: 0 (no truncation)',
+ metavar='length',
+ type=int)
+
+ parser.add_argument('--right-trunc',
+ default=0,
+ help='Specifies the length of the reverse reads, bases beyond this length will be truncated and reads shorter than this length are discarded. Default: 0 (no truncation)',
+ metavar='length',
+ type=int)
+
+ parser.add_argument('--left-maxEE',
+ default=1,
+ help='Specifies the maximum expected error (maxEE) allowed for each forward read, reads with higher than maxEE will be discarded. Default: 1',
+ metavar='max_error',
+ type=int)
+
+ parser.add_argument('--right-maxEE',
+ default=1,
+ help='Specifies the maximum expected error (maxEE) allowed for each reverse read, reads with higher than maxEE will be discarded. Default: 1',
+ metavar='max_error',
+ type=int)
+
+ parser.add_argument('--concatenate_reads_only',
+ choices=['TRUE', 'FALSE'],
+ default='FALSE',
+ help='If set to TRUE, specifies to concatenate forward and reverse reads only with dada2 instead of merging paired reads. Default: FALSE',
+ type=str)
+
+ parser.add_argument('--output-prefix',
+ default='',
+ help='Specifies the prefix to use on all output files to distinguish multiple primer sets, leave as an empty string if only one primer set is being processed. Default: ""',
+ metavar='prefix',
+ type=str)
+
+ parser.add_argument('--visualizations',
+ choices=['TRUE', 'FALSE'],
+ default='TRUE',
+ help='If set to FALSE, disables visualization of workflow results. Default: TRUE')
+
+ # Check if no arguments were provided
+ if len(sys.argv) == 1:
+ parser.print_help()
+ sys.exit(1)
+
+ try:
+ args = parser.parse_args()
+ except SystemExit:
+ parser.print_help()
+ sys.exit(1)
+
+ output_dir = args.outputDir
+ min_trimmed_length = args.min_trimmed_length
+ target = args.target
+ isa_zip = ""
+
+ # If OSD is used, pull ISA metadata for the study, create and select the runsheet
+ if args.OSD:
+ accession_number = process_osd_argument(args.OSD)
+
+ # checking OSD/GLDS ID is not on the list of those the workflow definitely can't handle
+ check_provided_osd_or_glds(args.OSD)
+
+ isa_zip = download_isa_archive(accession_number)
+ if isa_zip:
+ runsheet_files = convert_isa_to_runsheet(accession_number, isa_zip)
+ if runsheet_files:
+ runsheet_file = handle_runsheet_selection(runsheet_files, target, args.specify_runsheet)
+ if runsheet_file is None:
+ sys.exit()
+ else:
+ print("No runsheet files were created.", file=sys.stderr)
+ sys.exit(1)
+ else:
+ print("No ISA archive was downloaded. Cannot proceed to runsheet conversion.", file=sys.stderr)
+ sys.exit(1)
+
+ # If a runsheet is specified, use that runsheet
+ elif args.runsheetPath:
+ runsheet_file = args.runsheetPath
+
+ # Load the runsheet if a file is specified
+ # Create unique-sample-IDs.txt based on filenames or 'Sample Name' if URLs
+ # Download files if necessary
+ if args.OSD or args.runsheetPath:
+ if runsheet_file:
+ #runsheet_df = validate_runsheet_schema(runsheet_file)
+ runsheet_df = pd.read_csv(runsheet_file)
+ if runsheet_df is not None:
+ uses_urls = check_runsheet_read_paths(runsheet_df)
+
+ # Check for primer file / invalid primers
+ validate_primer_sequences(runsheet_df)
+
+ # Create the 'unique-sample-IDs.txt' file and download read files if necessary
+ if uses_urls:
+ handle_url_downloads(runsheet_df, output_file='unique-sample-IDs.txt')
+ else:
+ sample_IDs_from_local(runsheet_df, output_file='unique-sample-IDs.txt')
+
+ # Create the config.yaml file
+ create_config_yaml(isa_zip=isa_zip,
+ runsheet_file=runsheet_file,
+ runsheet_df=runsheet_df,
+ uses_urls=uses_urls,
+ output_dir=output_dir,
+ min_trimmed_length=args.min_trimmed_length,
+ trim_primers=args.trim_primers,
+ primers_linked=args.primers_linked,
+ anchor_primers=args.anchor_primers,
+ discard_untrimmed=args.discard_untrimmed,
+ left_trunc=args.left_trunc,
+ right_trunc=args.right_trunc,
+ left_maxEE=args.left_maxEE,
+ right_maxEE=args.right_maxEE,
+ concatenate_reads_only=args.concatenate_reads_only,
+ output_prefix=args.output_prefix,
+ enable_visualizations=args.visualizations
+ )
+
+ print("Snakemake workflow setup is complete.")
+ else:
+ print("Failed to validate the runsheet file.", file=sys.stderr)
+ sys.exit(1)
+ else:
+ print("No runsheet file specified.", file=sys.stderr)
+ sys.exit(1)
+
+ # Run the snakemake workflow if --run is used
+ if args.run:
+ snakemake_command = args.run
+ print(f"Running Snakemake command: {snakemake_command}")
+ subprocess.run(snakemake_command, shell=True, check=True)
+
+ # # Remove sample ID file
+ # with open('config.yaml', 'r') as file:
+ # config_data = yaml.safe_load(file)
+ # sample_info_file = config_data.get('sample_info_file', '') # Default to empty string if not found
+
+ # if sample_info_file and os.path.exists(sample_info_file):
+ # os.remove(sample_info_file)
+
+ # if isa_zip:
+ # try:
+ # os.remove(isa_zip)
+ # except FileNotFoundError:
+ # pass # Ignore file not found error silently
+ # except Exception:
+ # pass
+ # # Remove all files if OSD run
+ # if args.OSD:
+ # os.remove(runsheet_file) # Assuming runsheet_file is a variable holding the file name
+ # os.remove("config.yaml") # Ensure this is the correct file name
+
+ # if args.runsheetPath:
+ # os.remove("config.yaml") # Ensure this is the correct file name
+
+
+
+if __name__ == "__main__":
+ main()
diff --git a/Amplicon/Illumina/Workflow_Documentation/SW_AmpIllumina-B/workflow_code/scripts/slurm-status.py b/Amplicon/Illumina/Workflow_Documentation/SW_AmpIllumina-B/workflow_code/scripts/slurm-status.py
new file mode 100755
index 00000000..2acb7e3e
--- /dev/null
+++ b/Amplicon/Illumina/Workflow_Documentation/SW_AmpIllumina-B/workflow_code/scripts/slurm-status.py
@@ -0,0 +1,17 @@
+#!/usr/bin/env python
+import subprocess
+import sys
+
+jobid = sys.argv[1]
+
+# if wanting to use, this should be added to the snakemake call from the root workflow dir: `--cluster-status scripts/slurm-status.py`
+
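+# query sacct for the job's state; only the first word of the first line is kept (e.g. COMPLETED, RUNNING, FAILED)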
+output = subprocess.check_output("sacct -j %s --format State --noheader | head -1 | awk '{print $1}'" % jobid, shell=True).decode().strip()
+
+running_status=["PENDING", "CONFIGURING", "COMPLETING", "RUNNING", "SUSPENDED"]
+if "COMPLETED" in output:
+ print("success")
+elif any(r in output for r in running_status):
+ print("running")
+else:
+ print("failed")
diff --git a/Amplicon/Illumina/Workflow_Documentation/SW_AmpIllumina-B/workflow_code/unique-sample-IDs.txt b/Amplicon/Illumina/Workflow_Documentation/SW_AmpIllumina-B/workflow_code/unique-sample-IDs.txt
new file mode 100644
index 00000000..4389d342
--- /dev/null
+++ b/Amplicon/Illumina/Workflow_Documentation/SW_AmpIllumina-B/workflow_code/unique-sample-IDs.txt
@@ -0,0 +1,2 @@
+Sample1
+Sample2
diff --git a/Amplicon/Illumina/Workflow_Documentation/SW_AmpIllumina-B/workflow_code/visualizations/Illumina-R-visualizations.R b/Amplicon/Illumina/Workflow_Documentation/SW_AmpIllumina-B/workflow_code/visualizations/Illumina-R-visualizations.R
new file mode 100644
index 00000000..5d9d8b55
--- /dev/null
+++ b/Amplicon/Illumina/Workflow_Documentation/SW_AmpIllumina-B/workflow_code/visualizations/Illumina-R-visualizations.R
@@ -0,0 +1,641 @@
+pdf(file = NULL)
+library(vegan)
+library(tidyverse)
+library(dendextend)
+library(phyloseq)
+library(DESeq2)
+library(ggrepel)
+library(dplyr)
+library(RColorBrewer)
+library(grid)
+
+##################################################################################
+## R visualization script for Illumina paired-end amplicon data ##
+##################################################################################
+# This script is automatically executed as part of the Snakemake workflow when the run_workflow.py --visualizations TRUE argument is used.
+# This script can also be manually executed using processed data from the workflow
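+#   e.g. (argument order as read below): Rscript Illumina-R-visualizations.R <runsheet.csv> <sample_IDs.txt> <counts.tsv> <taxonomy.tsv> <plots_dir/> <output_prefix> <assay_suffix>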
+
+# Store command line args as variables #
+args <- commandArgs(trailingOnly = TRUE)
+runsheet_file <- paste0(args[1])
+sample_info <- paste0(args[2])
+counts <- paste0(args[3])
+taxonomy <- paste0(args[4])
+plots_dir <- paste0(args[5])
+output_prefix <- paste0(args[6])
+assay_suffix <- paste0(args[7])
+########################################
+
+RColorBrewer_Palette <- "Set1"
+
+# Runsheet read1 path/filename column name
+read1_path_colname <- 'read1_path'
+# Runsheet read1 suffix column name
+raw_R1_suffix_colname <- 'raw_R1_suffix'
+# Runsheet groups column name
+groups_colname <- 'groups'
+# Runsheet colors column name
+color_colname <- 'color'
+
+####################
+# Helper functions #
+####################
+
+# Identify the matching rows by removing suffix from basename of file
+remove_suffix <- function(path, suffix) {
+ file_name <- basename(path)
+ sub(suffix, "", file_name)
+}
+
+# Remove the longest common prefix from the sample names (only used for visualizations)
+longest_common_prefix <- function(strs) {
+ if (length(strs) == 1) return(strs)
+
+ prefix <- strs[[1]]
+ for (str in strs) {
+ while (substring(str, 1, nchar(prefix)) != prefix) {
+ prefix <- substr(prefix, 1, nchar(prefix) - 1)
+ }
+ }
+
+ return(prefix)
+}
+remove_common_prefix <- function(strs) {
+ prefix <- longest_common_prefix(strs)
+ sapply(strs, function(x) substr(x, nchar(prefix) + 1, nchar(x)))
+}
+
+# Adjust cex based on number of samples
+adjust_cex <- function(num_samples, start_samples = 40, end_samples = 150, default_cex = 1, min_cex = 0.6) {
+ slope <- (min_cex - default_cex) / (end_samples - start_samples)
+
+ new_cex <- default_cex + slope * (num_samples - start_samples)
+
+ adjusted_cex <- max(min(new_cex, default_cex), min_cex)
+
+ return(adjusted_cex)
+}
+
+
+# Extract legend from a plot
+g_legend <- function(a.gplot){
+ tmp <- ggplot_gtable(ggplot_build(a.gplot))
+ leg <- which(sapply(tmp$grobs, function(x) x$name) == "guide-box")
+ legend <- tmp$grobs[[leg]]
+ legend
+}
+
+###########################################
+# Read in data, create output directories #
+###########################################
+
+# Assign the directory paths to variables
+beta_diversity_out_dir <- file.path(plots_dir, "beta_diversity")
+alpha_diversity_out_dir <- file.path(plots_dir, "alpha_diversity")
+taxonomy_out_dir <- file.path(plots_dir, "taxonomy")
+de_out_dir <- file.path(plots_dir, "da")
+
+abundance_out_dir <- file.path(de_out_dir, "differential_abundance")
+volcano_out_dir <- file.path(de_out_dir, "volcano")
+
+# List of all directory variables
+out_dirs <- list(plots_dir, beta_diversity_out_dir, alpha_diversity_out_dir, taxonomy_out_dir, de_out_dir, abundance_out_dir, volcano_out_dir)
+
+# Loop through each directory path to check and create if necessary
+for (dir_path in out_dirs) {
+ if (!dir.exists(dir_path)) {
+ dir.create(dir_path, recursive = TRUE)
+ }
+}
+
+# Read in processed data
+runsheet <- as.data.frame(read.table(file = runsheet_file,
+ header = TRUE, sep = ","))
+row.names(runsheet) <- runsheet$'Sample.Name'
+runsheet$'Sample.Name' <- NULL
+
+count_tab <- read.table(file = counts,
+ header = TRUE, row.names = 1, sep = "\t")
+tax_tab <- read.table(file = taxonomy,
+ header = TRUE, row.names = 1, sep = "\t")
+# Use only samples listed in sample_info, which should correspond to the file names
+sample_names <- readLines(sample_info)
+deseq2_sample_names <- make.names(sample_names, unique = TRUE)
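+# make.names() converts the sample IDs into syntactically valid R names so they can be matched against the make.names-adjusted runsheet basenames and count table columns below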
+
+
+# Check if the runsheet uses links instead of local file paths
+# (any read1 path starting with http(s):// or pointing at genelab-data.ndc.nasa.gov is treated as a link)
+uses_links <- any(grepl("^(http|https)://|genelab-data.ndc.nasa.gov", runsheet[[read1_path_colname]]))
+if (uses_links) {
+ # Use rownames as basenames if links are used
+ runsheet$basename <- rownames(runsheet)
+} else {
+ # Remove extensions from filenames in runsheet for local file paths
+ runsheet$basename <- mapply(remove_suffix, runsheet[[read1_path_colname]], runsheet[[raw_R1_suffix_colname]])
+}
+
+# Make the basenames DESeq2 compatible, add temporary s_ prefix to fix bugs caused by basenames starting w/ number
+runsheet$basename <- paste0("s_", runsheet$basename)
+runsheet$basename <- make.names(runsheet$basename, unique = TRUE)
+runsheet$basename <- sub("^s_", "", runsheet$basename)
+
+# Subset runsheet and count tab to only include samples in sample_info
+runsheet <- runsheet[runsheet$basename %in% deseq2_sample_names, ]
+count_tab <- count_tab[, colnames(count_tab) %in% runsheet$basename]
+
+# Order runsheet based on the groups column
+runsheet <- runsheet[order(runsheet[[groups_colname]]), ]
+
+# Reorder count_tab columns to match the order in the runsheet
+count_tab <- count_tab[, runsheet$basename]
+
+# Rename runsheet row names
+rownames(runsheet) <- runsheet$basename
+
+if (!identical(rownames(runsheet), colnames(count_tab))) {
+ stop("The read file names in the runsheet do not match the colnames of count_tab.")
+}
+
+# Keep only ASVs with at least 1 count
+count_tab <- count_tab[rowSums(count_tab) > 0, ]
+count_tab_vst <- count_tab
+
+# If every ASV still has a 0 somewhere in its row, add a +1 pseudocount so the VST can be computed; not ideal, but it fixes VST for sparse count tables
+if (all(apply(count_tab_vst, 1, function(row) any(row == 0)))) {
+ count_tab_vst <- count_tab_vst + 1
+}
+
+# Create VST normalized counts matrix
+deseq_counts <- DESeqDataSetFromMatrix(countData = count_tab_vst,
+ colData = runsheet,
+ design = ~1)
+deseq_counts_vst <- varianceStabilizingTransformation(deseq_counts)
+vst_trans_count_tab <- assay(deseq_counts_vst)
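+# vst_trans_count_tab is used below for the Euclidean-distance dendrogram, the PCoA ordination, and PERMANOVA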
+
+
+###########################################
+# Create plots, save them in output dirs #
+###########################################
+
+# Add colors to runsheet
+num_colors <- length(unique(runsheet[[groups_colname]]))
+
+# List of RColorBrewer Palette lengths
+RColorBrewer_Palette_lengths <- c(Accent=8, Dark2=8, Paired=12, Pastel1=9, Pastel2=8, Set1=9, Set2=8, Set3=12)
+
+# Check if number of colors exceeds the limit of the selected RColorBrewer_Palette palette
+if (num_colors > RColorBrewer_Palette_lengths[RColorBrewer_Palette]) {
+ # If so, create a custom palette with more colors
+ custom_palette <- colorRampPalette(brewer.pal(RColorBrewer_Palette_lengths[RColorBrewer_Palette], RColorBrewer_Palette))(num_colors)
+ colors <- custom_palette
+} else {
+ # Else just use the standard RColorBrewer_Palette palette
+ colors <- brewer.pal(num_colors, RColorBrewer_Palette)
+}
+
+
+
+group_colors <- setNames(colors, unique(runsheet[[groups_colname]]))
+runsheet <- runsheet %>%
+ mutate(!!color_colname := group_colors[.data[[groups_colname]]])
+
+
+########
+## Save original par settings
+## Par may be temporarily changed for plotting purposes and reset once the plotting is done
+original_par <- par(no.readonly = TRUE)
+options(preferRaster=TRUE) # use Raster when possible to avoid antialiasing artifacts in images
+
+
+width_in_inches <- 11.1
+height_in_inches <- 8.33
+dpi <- 300
+width_in_pixels <- width_in_inches * dpi
+height_in_pixels <- height_in_inches * dpi
+
+
+# Hierarchical Clustering
+sample_info_tab <- runsheet[, c(groups_colname, color_colname)]
+
+# Add short group names and legend text column to sample_info
+sample_info_tab$short_groups <- as.integer(factor(sample_info_tab[[groups_colname]], levels = unique(sample_info_tab[[groups_colname]])))
+
+group_levels <- unique(sample_info_tab[[groups_colname]])
+short_group_labels <- sprintf("%d: %s", seq_along(group_levels), group_levels)
+names(short_group_labels) <- group_levels
+sample_info_tab$short_group_labels <- short_group_labels[sample_info_tab[[groups_colname]]]
+colors_vector <- unique(setNames(sample_info_tab[[color_colname]], sample_info_tab$short_group_labels))
+
+euc_dist <- dist(t(vst_trans_count_tab))
+euc_clust <- hclust(d = euc_dist, method = "ward.D2")
+
+# Color the dendrogram sample labels by group using dendextend
+euc_dend <- as.dendrogram(euc_clust, h = .1)
+dend_cols <- sample_info_tab[[color_colname]][order.dendrogram(euc_dend)]
+labels_colors(euc_dend) <- dend_cols
+
+
+##########
+default_cex = 1
+# Lower cex if over 40 samples to prevent names from crashing on plot
+default_cex <- adjust_cex(length(rownames(sample_info_tab)))
+
+
+# Set for 11x8 plot margins, else try ggdendrogram
+space_available <- height_in_inches/5.4
+
+longest_name <- rownames(sample_info_tab)[which.max(nchar(rownames(sample_info_tab)))]
+
+calculate_max_cex <- function(longest_name, space_avail) {
+ # Define weights for lower case letters, periods, else
+ lower_case_weight <- 0.10
+ other_char_weight <- 0.15
+ dot_weight <- 0.02 # Weight for the period character
+
+ # Calculate weights in longest sample name
+ char_weights <- sapply(strsplit(longest_name, "")[[1]], function(char) {
+ if (char == ".") {
+ return(dot_weight)
+ } else if (grepl("[a-z]", char)) {
+ return(lower_case_weight)
+ } else {
+ return(other_char_weight)
+ }
+ })
+
+ average_weight <- mean(char_weights)
+
+ # Calculate the maximum cex that fits the space using the average weight
+ n = nchar(longest_name)
+ max_cex <- space_avail / (n * average_weight)
+
+ return(max_cex)
+}
+
+max_cex <- calculate_max_cex(longest_name, space_available)
+dendro_cex <- min(max_cex, default_cex)
+
+
+legend_groups <- unique(sample_info_tab$groups)
+legend_colors <- unique(sample_info_tab$color)
+num_unique_groups <- length(legend_groups)
+legend_cex <- ifelse(num_unique_groups > 5, 1 / (num_unique_groups / 5), 1)
+
+png(file.path(beta_diversity_out_dir, paste0(output_prefix, "dendrogram_by_group", assay_suffix, ".png")),
+ width = width_in_pixels,
+ height = height_in_pixels,
+ res = dpi)
+par(mar = c(10.5, 4.1, 0.6 , 2.1))
+euc_dend %>% set("labels_cex", dendro_cex) %>% plot(ylab = "VST Euc. dist.")
+par(xpd=TRUE)
+legend("bottom", inset = c(0, -.34), legend = legend_groups, fill = legend_colors, bty = 'n', cex = legend_cex)
+dev.off()
+par(original_par)
+
+
+
+
+
+
+# making a phyloseq object with our transformed table
+vst_count_phy <- otu_table(object = vst_trans_count_tab, taxa_are_rows = TRUE)
+sample_info_tab_phy <- sample_data(sample_info_tab)
+vst_physeq <- phyloseq(vst_count_phy, sample_info_tab_phy)
+vst_physeq
+
+# generating a PCoA with phyloseq
+vst_pcoa <- ordinate(physeq = vst_physeq, method = "PCoA", distance = "euclidean")
+eigen_vals <- vst_pcoa$values$Eigenvalues
+
+# Calculate the percentage of variance
+percent_variance <- eigen_vals / sum(eigen_vals) * 100
+
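+# Test homogeneity of group dispersions (betadisper + ANOVA), then run PERMANOVA (adonis2) on the Euclidean distances;
+# the resulting R2 and Pr(>F) values are annotated on the PCoA plots below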
+betadisper(d = euc_dist, group = sample_info_tab$groups) %>% anova()
+adonis_res <- adonis2(formula = euc_dist ~ sample_info_tab$groups)
+r2_value <- adonis_res$R2[1]
+prf_value <- adonis_res$`Pr(>F)`[1]
+
+label_PC1 <- sprintf("PC1 [%.1f%%]", percent_variance[1])
+label_PC2 <- sprintf("PC2 [%.1f%%]", percent_variance[2])
+
+
+# Save unlabeled PCoA plot
+ordination_plot_u <- plot_ordination(vst_physeq, vst_pcoa, color = "groups") +
+ geom_point(size = 1) +
+ labs(
+ x = label_PC1,
+ y = label_PC2,
+ col = "Groups"
+ ) +
+ coord_fixed(sqrt(eigen_vals[2]/eigen_vals[1])) +
+ scale_color_manual(values = unique(sample_info_tab[[color_colname]][order(sample_info_tab[[groups_colname]])]),
+ labels = unique(sample_info_tab$short_group_labels[order(sample_info_tab[[groups_colname]])])) +
+ theme_bw() + theme(legend.position = "bottom", text = element_text(size = 15, ),
+ legend.direction = "vertical",
+ legend.justification = "center",
+ legend.box.just = "center",
+ legend.title.align = 0.5) +
+ annotate("text", x = Inf, y = -Inf, label = paste("R2:", toString(round(r2_value, 3))), hjust = 1.1, vjust = -2, size = 4)+
+ annotate("text", x = Inf, y = -Inf, label = paste("Pr(>F)", toString(round(prf_value,4))), hjust = 1.1, vjust = -0.5, size = 4)+ ggtitle("PCoA")
+ggsave(filename=file.path(beta_diversity_out_dir, paste0(output_prefix, "PCoA_without_labels", assay_suffix, ".png")), plot=ordination_plot_u, width = 11.1, height = 8.33, dpi = 300)
+# Save labeled PCoA plot
+ordination_plot <- plot_ordination(vst_physeq, vst_pcoa, color = "groups") +
+ geom_point(size = 1) +
+ labs(
+ col = "Groups",
+ x = label_PC1,
+ y = label_PC2
+ ) +
+ geom_text(aes(label = rownames(sample_info_tab)), show.legend = FALSE, hjust = 0.3, vjust = -0.4, size = 4) +
+ coord_fixed(sqrt(eigen_vals[2]/eigen_vals[1])) +
+ scale_color_manual(values = unique(sample_info_tab[[color_colname]][order(sample_info_tab[[groups_colname]])]),
+ labels = unique(sample_info_tab$short_group_labels[order(sample_info_tab[[groups_colname]])])) +
+ theme_bw() + theme(legend.position = "bottom", text = element_text(size = 15, ),
+ legend.direction = "vertical",
+ legend.justification = "center",
+ legend.box.just = "center",
+ legend.title.align = 0.5) +
+ annotate("text", x = Inf, y = -Inf, label = paste("R2:", toString(round(r2_value, 3))), hjust = 1.1, vjust = -2, size = 4)+
+ annotate("text", x = Inf, y = -Inf, label = paste("Pr(>F)", toString(round(prf_value,4))), hjust = 1.1, vjust = -0.5, size = 4)+ ggtitle("PCoA")
+ggsave(filename=file.path(beta_diversity_out_dir, paste0(output_prefix, "PCoA_w_labels", assay_suffix, ".png")), plot=ordination_plot, width = 11.1, height = 8.33, dpi = 300)
+########################
+
+#4. Alpha diversity
+
+# 4a. Rarefaction curves
+
+p <- rarecurve(x = t(count_tab), step = 100, col = sample_info_tab[[color_colname]],
+ lwd = 2, ylab = "ASVs", label = FALSE, tidy = TRUE)
+
+sample_info_tab_names <- tibble::rownames_to_column(sample_info_tab, var = "Site")
+p <- p %>%
+ left_join(sample_info_tab_names, by = "Site")
+
+rareplot <- ggplot(p, aes(x = Sample, y = Species, group = Site, color = groups)) +
+ geom_line() +
+ scale_color_manual(values = unique(sample_info_tab[[color_colname]][order(sample_info_tab[[groups_colname]])]),
+ labels = unique(sample_info_tab$short_group_labels[order(sample_info_tab[[groups_colname]])]),
+ breaks = unique(sample_info_tab[[groups_colname]])) +
+ labs(x = "Number of Sequences", y = "Number of ASVs", col = "Groups") +
+ theme_bw() +
+ theme(legend.position = "bottom",
+ text = element_text(size = 15),
+ legend.direction = "vertical",
+ legend.justification = "center",
+ legend.box.just = "center",
+ legend.title.align = 0.5,
+ panel.grid.major = element_blank(),
+ panel.grid.minor = element_blank(),
+ plot.margin = margin(t = 10, r = 20, b = 10, l = 10, unit = "pt")) +
+ guides(color = guide_legend(title = "Groups"))
+ggsave(filename = file.path(alpha_diversity_out_dir, paste0(output_prefix, "rarefaction_curves", assay_suffix, ".png")), plot=rareplot, width = 8.33, height = 8.33, dpi = 300)
+
+# 4b. Richness and diversity estimates
+
+# create a phyloseq object similar to how we did above in step 3B, only this time also including our taxonomy table:
+count_tab_phy <- otu_table(count_tab, taxa_are_rows = TRUE)
+tax_tab_phy <- tax_table(as.matrix(tax_tab))
+ASV_physeq <- phyloseq(count_tab_phy, tax_tab_phy, sample_info_tab_phy)
+
+
+calculate_text_size <- function(num_samples, start_samples = 25, min_size = 3) {
+ max_size = 11 # Maximum size for up to start_samples
+ slope = -0.15
+
+ if (num_samples <= start_samples) {
+ return(max_size)
+ } else {
+ # Calculate the current size with the hardcoded slope
+ current_size = max_size + slope * (num_samples - start_samples)
+
+ # Ensure the size doesn't go below the minimum
+ return(max(current_size, min_size))
+ }
+}
+
+richness_sample_label_size <- calculate_text_size(length(rownames(sample_info_tab)))
+
+richness_plot <- plot_richness(ASV_physeq, color = "groups", measures = c("Chao1", "Shannon")) +
+ scale_color_manual(values = unique(sample_info_tab[[color_colname]][order(sample_info_tab[[groups_colname]])]),
+ labels = unique(sample_info_tab$short_group_labels[order(sample_info_tab[[groups_colname]])])) +
+ theme_bw() +labs(x = "Samples",
+ color = "Groups") +
+ theme(
+ text = element_text(size = 15),
+ legend.position = "bottom",
+ legend.direction = "vertical",
+ legend.justification = "center",
+ legend.box.just = "center",
+ legend.title.align = 0.5,
+ axis.text.x = element_text(angle = 90,
+ size = richness_sample_label_size,
+ vjust = 0.5, # Vertically center the text
+ hjust = 1)
+ )
+ggsave(filename = file.path(alpha_diversity_out_dir, paste0(output_prefix, "richness_and_diversity_estimates_by_sample", assay_suffix, ".png")), plot=richness_plot, width = 11.1, height = 8.33, dpi = 300)
+
+richness_by_group <- plot_richness(ASV_physeq, x = "groups", color = "groups", measures = c("Chao1", "Shannon")) +
+ scale_color_manual(values = unique(sample_info_tab[[color_colname]][order(sample_info_tab[[groups_colname]])]),
+ labels = unique(sample_info_tab$short_group_labels[order(sample_info_tab[[groups_colname]])])) +
+ scale_x_discrete(labels = unique(sample_info_tab$short_groups[order(sample_info_tab[[groups_colname]])])) +
+ theme_bw() + labs(color = "Groups",
+ x = "Groups") +
+ theme(
+ text = element_text(size = 15),
+ legend.position = "bottom",
+ legend.direction = "vertical",
+ legend.justification = "center",
+ legend.box.just = "center",
+ legend.title.align = 0.5,
+ legend.title = element_blank()
+ )
+ggsave(filename = file.path(alpha_diversity_out_dir, paste0(output_prefix, "richness_and_diversity_estimates_by_group", assay_suffix, ".png")), plot=richness_by_group, width = 11.1, height = 8.33, dpi = 300)
+
+# Extract the group-color legend from the labeled PCoA plot and also save it as its own image
+legend <- g_legend(ordination_plot)
+grid.newpage()
+grid.draw(legend)
+legend_filename <- file.path(plots_dir, paste0(output_prefix, "color_legend", assay_suffix, ".png"))
+increment <- ifelse(length(unique(sample_info_tab$groups)) > 9, ceiling((length(unique(sample_info_tab$groups)) - 9) / 3), 0)
+legend_height <- 3 + increment
+ggsave(legend_filename, plot = legend, device = "png", width = 11.1, height = legend_height, dpi = 300)
+
+
+# 5. Taxonomic summaries
+# Calculate a new plot height if the legend is taller than ~1/4 of the default height
+
+# Get height of legend
+heights_in_cm <- sapply(legend$heights, function(h) {
+ if (is.null(h)) {
+ # Assign 0 to null heights
+ return(unit(0, "cm"))
+ } else {
+ # Convert the unit to centimeters
+ return(convertUnit(h, "cm", valueOnly = TRUE))
+ }
+})
+legend_height_in_cm <- sum(heights_in_cm)
+legend_height_in_inches <- legend_height_in_cm / 2.54
+
+taxonomy_plots_height <- width_in_inches
+required_height_in_inches <- 4 * legend_height_in_inches + 0.5
+if (taxonomy_plots_height < required_height_in_inches) {
+ # Increase the plot height to make sure the legend fits
+ taxonomy_plots_height <- required_height_in_inches
+}
+
+proportions_physeq <- transform_sample_counts(ASV_physeq, function(ASV) ASV / sum(ASV))
+proportions_physeq@sam_data$short_groups <- as.character(proportions_physeq@sam_data$short_groups)
+
+relative_phyla <- plot_bar(proportions_physeq, x = "short_groups", fill = "phylum") +
+ theme_bw() + theme(text = element_text(size = 9)) + labs(x = "Groups")
+plot_layout <- grid.layout(nrow = 2, heights = unit(c(3, 1), "null"))
+grid.newpage()
+pushViewport(viewport(layout = plot_layout))
+print(relative_phyla, vp = viewport(layout.pos.row = 1))
+pushViewport(viewport(layout.pos.row = 2))
+grid.draw(legend)
+upViewport(0)
+grid_image <- grid.grab()
+ggsave(filename = file.path(taxonomy_out_dir, paste0(output_prefix, "relative_phyla", assay_suffix, ".png")), grid_image, width = height_in_inches, height = taxonomy_plots_height, dpi = 500)
+
+relative_classes <- plot_bar(proportions_physeq, x = "short_groups", fill = "class") +
+ theme_bw() + theme(text = element_text(size = 9)) + labs(x = "Groups")
+plot_layout <- grid.layout(nrow = 2, heights = unit(c(3, 1), "null"))
+grid.newpage()
+pushViewport(viewport(layout = plot_layout))
+print(relative_classes, vp = viewport(layout.pos.row = 1))
+pushViewport(viewport(layout.pos.row = 2))
+grid.draw(legend)
+upViewport(0)
+grid_image <- grid.grab()
+ggsave(filename = file.path(taxonomy_out_dir, paste0(output_prefix, "relative_classes", assay_suffix, ".png")), plot=grid_image, width = height_in_inches, height = taxonomy_plots_height, dpi = 500)
+
+
+# Samplewise taxonomy
+
+proportions_physeq <- transform_sample_counts(ASV_physeq, function(ASV) ASV / sum(ASV))
+# Calculate the number of samples from the count_tab
+num_samples <- ncol(count_tab)
+# Scaling factor: 1x plot width at or below 40 samples, increasing linearly to 5x at 200 samples
+scaling_factor <- (num_samples - 40) / (200 - 40) * (5 - 1) + 1
+scaling_factor <- max(1, min(scaling_factor, 5))
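+# e.g. 40 samples -> 1x, 120 samples -> 3x, 200 or more samples -> 5x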
+
+# samplewise phyla
+samplewise_phylum <- plot_bar(proportions_physeq, fill = "phylum") +
+ theme_bw() +
+ theme(text = element_text(size = 16),
+ axis.text.x = element_text(angle = 90, vjust = 0.5, hjust = 0.5, size = 6)) # Rotate and resize x-axis labels
+
+ggsave(filename = file.path(taxonomy_out_dir, paste0(output_prefix, "samplewise_relative_phyla", assay_suffix, ".png")),
+ plot = samplewise_phylum,
+ width = height_in_inches * scaling_factor,
+ height = taxonomy_plots_height)
+
+samplewise_classes <- plot_bar(proportions_physeq, fill = "class") +
+ theme_bw() +
+ theme(text = element_text(size = 16),
+ axis.text.x = element_text(angle = 90, vjust = 0.5, hjust = 0.5, size = 6)) # Rotate and resize x-axis labels
+
+ggsave(filename = file.path(taxonomy_out_dir, paste0(output_prefix, "samplewise_relative_classes", assay_suffix, ".png")),
+ plot = samplewise_classes,
+ width = height_in_inches * scaling_factor,
+ height = taxonomy_plots_height)
+
+# 6. Statistical testing for differences
+
+#### pairwise comparisons
+unique_groups <- unique(runsheet$groups)
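+# Convert the phyloseq object into a DESeq2 dataset, modeling counts by group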
+deseq_obj <- phyloseq_to_deseq2(physeq = ASV_physeq, design = ~groups)
+
+# Add a pseudocount of 1 to every count if any sample column sums to zero
+if (sum(colSums(counts(deseq_obj)) == 0) > 0) {
+ count_data <- counts(deseq_obj) + 1
+
+ count_data <- as.matrix(apply(count_data, 2, as.integer))
+ rownames(count_data) <- rownames(counts(deseq_obj))
+ colnames(count_data) <- colnames(counts(deseq_obj))
+ counts(deseq_obj) <- count_data
+}
+# https://rdrr.io/bioc/phyloseq/src/inst/doc/phyloseq-mixture-models.R
+deseq_modeled <- tryCatch({
+ # Attempt to run DESeq
+ DESeq(deseq_obj)
+}, error = function(e) {
+ message("Error encountered in DESeq, applying alternative method for size factor estimation...")
+
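+ # DESeq2's default median-of-ratios size factor estimation fails when every feature contains at least one zero count;
+ # computing geometric means over only the positive counts is the workaround documented in the phyloseq link above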
+ # Define the geometric mean function
+ gm_mean = function(x, na.rm=TRUE) {
+ exp(sum(log(x[x > 0]), na.rm=na.rm) / length(x))
+ }
+ geoMeans = apply(counts(deseq_obj), 1, gm_mean)
+
+ # Apply the alternative size factor estimation method
+ deseq_obj <- estimateSizeFactors(deseq_obj, geoMeans=geoMeans)
+
+ # Call DESeq again with alternative geom mean size est
+ DESeq(deseq_obj)
+})
+
+# Save the size-factor-normalized counts table; individual group comparison results are written per contrast below
+
+write.table(counts(deseq_modeled, normalized=TRUE), file = file.path(de_out_dir, paste0(output_prefix, "normalized_counts", assay_suffix, ".tsv")), sep="\t", row.names=TRUE, quote=FALSE)
+# Run the group1 vs group2 DESeq2 contrast, save a volcano plot, and write the results table
+plot_comparison <- function(group1, group2) {
+ plot_width_inches = 11.1
+ plot_height_inches = 8.33
+
+ deseq_res <- results(deseq_modeled, contrast = c("groups", group1, group2))
+ norm_tab <- counts(deseq_modeled, normalized = TRUE) %>% data.frame()
+
+ volcano_data <- as.data.frame(deseq_res)
+
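+ # 0.1 matches DESeq2's default alpha (adjusted p-value cutoff) used by results()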
+ p_val <- 0.1
+ volcano_data <- volcano_data[!is.na(volcano_data$padj), ]
+ volcano_data$significant <- volcano_data$padj <= p_val #also logfc cutoff?
+
+ ######Long x-axis label adjustments##########
+ x_label <- paste0("Log2 Fold Change\n(", group1, " vs ", group2, ")")
+ label_length <- nchar(x_label)
+ max_allowed_label_length = plot_width_inches * 10
+
+ # Reconstruct the x-axis label with extra line breaks if it is too long
+ if (label_length > max_allowed_label_length){
+ x_label <- paste("Log2 Fold Change\n\n(", group1, "\n vs \n", group2, ")", sep="")
+ }
+ #######################################
+
+ # ASVs promoted in space on right, reduced on left
+ p <- ggplot(volcano_data, aes(x=log2FoldChange, y=-log10(padj), color=significant)) +
+ geom_point(alpha=0.7, size=2) +
+ scale_color_manual(values=c("black", "red"), labels=c(paste0("padj > ", p_val), paste0("padj \u2264 ", p_val))) +
+ theme_bw() +
+ labs(title="Volcano Plot",
+ x=x_label,
+ y="-Log10 Adjusted P-value",
+ color=paste0("")) +
+ theme(legend.position="top")
+
+ # label points and plot
+ top_points <- volcano_data %>%
+ arrange(padj) %>%
+ filter(significant) %>%
+ head(10)
+
+ volcano_plot <- p + geom_text_repel(data=top_points, aes(label=row.names(top_points)), size=3)
+ ggsave(filename=file.path(volcano_out_dir, paste0(output_prefix, "volcano_", gsub(" ", "_", group1), "_vs_", gsub(" ", "_", group2), ".png")),
+ plot=volcano_plot,
+ width = plot_width_inches, height = plot_height_inches, dpi = 300)
+
+ write.csv(deseq_res, file = file.path(abundance_out_dir, paste0(output_prefix, gsub(" ", "_", group1), "_vs_", gsub(" ", "_", group2), ".csv")))
+}
+
+
+# setting up pairwise comparisons and running
+comparisons <- expand.grid(group1 = unique_groups, group2 = unique_groups)
+comparisons <- subset(comparisons, group1 != group2)
+
+apply(comparisons, 1, function(pair) plot_comparison(pair['group1'], pair['group2']))
+
+
+dev.off()
diff --git a/Amplicon/Illumina/Workflow_Documentation/SW_AmpIllumina-B/workflow_code/visualizations/README.md b/Amplicon/Illumina/Workflow_Documentation/SW_AmpIllumina-B/workflow_code/visualizations/README.md
new file mode 100644
index 00000000..c036ac54
--- /dev/null
+++ b/Amplicon/Illumina/Workflow_Documentation/SW_AmpIllumina-B/workflow_code/visualizations/README.md
@@ -0,0 +1,100 @@
+# SW_AmpIllumina-B Visualization Script Information and Usage Instructions
+
+
+## General info
+The documentation for this script and its outputs can be found in steps 6-10 of the [GL-DPPD-7104-B.md](https://github.com/nasa/GeneLab_Data_Processing/blob/master/Amplicon/Illumina/Pipeline_GL-DPPD-7104_Versions/GL-DPPD-7104-B.md) pipeline document. This script is automatically executed as an optional step of the [SW_AmpIllumina-B](../../) Snakemake workflow when the `run_workflow.py` argument, `--visualizations TRUE`, is used. Alternatively, the script can be executed independently as detailed below.
+
+
+
+---
+
+## Utilizing the script
+
+
+- [1. Set up the execution environment](#1-set-up-the-execution-environment)
+- [2. Run the visualization script manually](#2-run-the-visualization-script-manually)
+
+
+
+___
+
+### 1. Set up the execution environment
+
+The script should be executed from a [conda](https://docs.conda.io/en/latest/) environment created using the [R_visualizations.yaml](R_visualizations.yaml) environment file.
+> If you do not have conda installed, an introduction to conda with installation help and links to other resources can be found [here at Happy Belly Bioinformatics](https://astrobiomike.github.io/unix/conda-intro).
+
+Download the [R_visualizations.yaml](R_visualizations.yaml) environment file and the [Illumina-R-visualizations.R](Illumina-R-visualizations.R) script by running the following commands:
+
+```
+curl -LO https://raw.githubusercontent.com/nasa/GeneLab_Data_Processing/dev2-amplicon-add-runsheet-visualizations/Amplicon/Illumina/Workflow_Documentation/SW_AmpIllumina-B/workflow_code/visualizations/R_visualizations.yaml
+
+curl -LO https://raw.githubusercontent.com/nasa/GeneLab_Data_Processing/dev2-amplicon-add-runsheet-visualizations/Amplicon/Illumina/Workflow_Documentation/SW_AmpIllumina-B/workflow_code/visualizations/Illumina-R-visualizations.R
+```
+
+Next, create the AmpSeqVisualizations environment by running the following command:
+
+```
+conda env create -f R_visualizations.yaml -n AmpSeqVisualizations
+```
+
+Then activate the environment as follows:
+
+```
+conda activate AmpSeqVisualizations
+```
+
+
+
+
+___
+
+### 2. Run the visualization script manually
+
+The [Illumina-R-visualizations.R](./Illumina-R-visualizations.R) script can be executed from the command line by providing `runsheet_file`, `sample_info`, `counts`, `taxonomy`, `plots_dir`, `output_prefix`, and `assay_suffix` as positional arguments, in their respective order.
+
+The example command below shows how to execute the script with the following parameters:
+ * runsheet_file: /path/to/runsheet.csv
+ * sample_info: /path/to/unique-sample-IDs.txt
+ * counts: /path/to/counts_GLAmpSeq.tsv
+ * taxonomy: /path/to/taxonomy_GLAmpSeq.tsv
+ * plots_dir: /path/to/Plots/
+ * output_prefix: my_prefix_
+ * assay_suffix: _GL_Ampseq
+
+```bash
+Rscript /path/to/visualizations/Illumina-R-visualizations.R /path/to/runsheet.csv /path/to/unique-sample-IDs.txt /path/to/counts_GLAmpSeq.tsv /path/to/taxonomy_GLAmpSeq.tsv /path/to/Plots/ "my_prefix_" "_GL_Ampseq"
+```
+
+Additionally, the `RColorBrewer_Palette` variable can be modified in the script. This variable determines the color palette from the RColorBrewer package that is applied to the plots.
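+
+For illustration, below is a minimal sketch of how a palette name is typically expanded into per-group colors with RColorBrewer; the object names (`num_groups`, `group_colors`) are hypothetical and not necessarily those used in the script:
+
+```R
+library(RColorBrewer)
+
+RColorBrewer_Palette <- "Set1"   # palette name; edit this variable in the script
+num_groups <- 5                  # hypothetical number of experimental groups
+
+# brewer.pal() requires at least 3 colors, so request max(3, num_groups) and keep the first num_groups
+group_colors <- brewer.pal(n = max(3, num_groups), name = RColorBrewer_Palette)[seq_len(num_groups)]
+```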
+
+**Parameter Definitions:**
+* `runsheet_file` – specifies the table containing sample metadata required for processing
+* `sample_info` – specifies the text file containing the IDs of each sample used, required for running the SW_AmpIllumina workflow
+* `counts` – specifies the ASV counts table
+* `taxonomy` – specifies the taxonomy table
+* `plots_dir` – specifies the path where output files will be saved
+* `output_prefix` – specifies a string that is prepended to the output file names. Default: ""
+* `assay_suffix` – specifies a string that is appended to the output file names. Default: "_GLAmpSeq"
+* `RColorBrewer_Palette` – specifies the RColorBrewer palette that will be used for coloring in the plots. Options include "Set1", "Accent", "Dark2", "Paired", "Pastel1", "Pastel2", "Set2", and "Set3". Default: "Set1"
+
+**Input Data:**
+* *runsheet.csv (output from [GL-DPPD-7104-B step 6a](https://github.com/nasa/GeneLab_Data_Processing/blob/master/Amplicon/Illumina/Pipeline_GL-DPPD-7104_Versions/GL-DPPD-7104-B.md#6a-create-sample-runsheet))
+* unique-sample-IDs.txt (output from [run_workflow.py](https://github.com/nasa/GeneLab_Data_Processing/blob/master/Amplicon/Illumina/Pipeline_GL-DPPD-7104_Versions/GL-DPPD-7104-B.md#5-additional-output-files))
+* counts_GLAmpSeq.tsv (output from [GL-DPPD-7104-B step 5g](https://github.com/nasa/GeneLab_Data_Processing/blob/master/Amplicon/Illumina/Pipeline_GL-DPPD-7104_Versions/GL-DPPD-7104-B.md#5g-generating-and-writing-standard-outputs))
+* taxonomy_GLAmpSeq.tsv (output from [GL-DPPD-7104-B step 5g](https://github.com/nasa/GeneLab_Data_Processing/blob/master/Amplicon/Illumina/Pipeline_GL-DPPD-7104_Versions/GL-DPPD-7104-B.md#5g-generating-and-writing-standard-outputs))
+
+**Output Data:**
+* **{output_prefix}dendrogram_by_group{assay_suffix}.png** (dendrogram of Euclidean distance-based hierarchical clustering of the samples, colored by experimental group)
+* **{output_prefix}rarefaction_curves{assay_suffix}.png** (Rarefaction curves plot for all samples)
+* **{output_prefix}richness_and_diversity_estimates_by_sample{assay_suffix}.png** (Richness and diversity estimates plot for all samples)
+* **{output_prefix}richness_and_diversity_estimates_by_group{assay_suffix}.png** (Richness and diversity estimates plot for all groups)
+* **{output_prefix}relative_phyla{assay_suffix}.png** (taxonomic summaries plot based on phyla, for all groups)
+* **{output_prefix}relative_classes{assay_suffix}.png** (taxonomic summaries plot based on class, for all groups)
+* **{output_prefix}samplewise_relative_phyla{assay_suffix}.png** (taxonomic summaries plot based on phyla, for all samples)
+* **{output_prefix}samplewise_relative_classes{assay_suffix}.png** (taxonomic summaries plot based on class, for all samples)
+* **{output_prefix}PCoA_w_labels{assay_suffix}.png** (Principal Coordinates Analysis plot of VST-transformed ASV counts, with sample labels)
+* **{output_prefix}PCoA_without_labels{assay_suffix}.png** (Principal Coordinates Analysis plot of VST-transformed ASV counts, without sample labels)
+* **{output_prefix}normalized_counts{assay_suffix}.tsv** (size factor normalized ASV counts table)
+* **{output_prefix}group1_vs_group2.csv** (differential abundance tables for all pairwise contrasts of groups)
+* **{output_prefix}volcano_group1_vs_group2.png** (volcano plots for all pairwise contrasts of groups)
+* **{output_prefix}color_legend{assay_suffix}.png** (color legend for all groups)
diff --git a/Amplicon/Illumina/Workflow_Documentation/SW_AmpIllumina-B/workflow_code/visualizations/R_visualizations.yaml b/Amplicon/Illumina/Workflow_Documentation/SW_AmpIllumina-B/workflow_code/visualizations/R_visualizations.yaml
new file mode 100644
index 00000000..07c14e60
--- /dev/null
+++ b/Amplicon/Illumina/Workflow_Documentation/SW_AmpIllumina-B/workflow_code/visualizations/R_visualizations.yaml
@@ -0,0 +1,14 @@
+channels:
+ - conda-forge
+ - bioconda
+ - defaults
+dependencies:
+ - r-base=4.3.2
+ - r-vegan=2.6.4
+ - r-tidyverse=2.0.0
+ - r-dendextend=1.17.1
+ - r-ggrepel=0.9.4
+ - r-dplyr=1.1.3
+ - r-rcolorbrewer=1.1.3
+ - bioconductor-deseq2=1.40.2
+ - bioconductor-phyloseq=1.44.0
diff --git a/Amplicon/README.md b/Amplicon/README.md
index 98ee91c8..44b77f85 100644
--- a/Amplicon/README.md
+++ b/Amplicon/README.md
@@ -3,7 +3,7 @@
---
-
+
---
@@ -19,12 +19,12 @@
## Licenses
-The software for the Amplicon Seq pipelines is released under the [NASA Open Source Agreement (NOSA) Version 1.3](../Licenses/Amplicon_and_Metagenomics_NOSA_License.pdf).
+The software for the Amplicon Seq pipelines is released under the [NASA Open Source Agreement (NOSA) Version 1.3](https://github.com/nasa/GeneLab_AmpliconSeq_Workflow/blob/main/License/Amplicon_NOSA_License.pdf).
### 3rd Party Software Licenses
-Licenses for the 3rd party open source software utilized in the Amplicon Seq pipelines can be found in the [3rd_Party_Licenses sub-directory](../3rd_Party_Licenses/Amplicon_and_Metagenomics_3rd_Party_Software.md) on the GeneLab_Data_Processing GitHub repository landing page.
+Licenses for the 3rd party open source software utilized in the Amplicon Seq pipelines can be found in the [3rd_Party_Licenses sub-directory](https://github.com/nasa/GeneLab_AmpliconSeq_Workflow/tree/main/License/3rd_Party_Licenses/README.md) of the GeneLab_AmpliconSeq_Workflow GitHub repository.
diff --git a/Amplicon/images/GL-amplicon-overview.pdf b/Amplicon/images/GL-amplicon-overview.pdf
index aef7d543..74215d61 100644
Binary files a/Amplicon/images/GL-amplicon-overview.pdf and b/Amplicon/images/GL-amplicon-overview.pdf differ
diff --git a/Amplicon/images/GL-amplicon-overview.png b/Amplicon/images/GL-amplicon-overview.png
index 326fb5e1..2a49b2c6 100644
Binary files a/Amplicon/images/GL-amplicon-overview.png and b/Amplicon/images/GL-amplicon-overview.png differ
diff --git a/Amplicon/images/GL-amplicon-subwayplot.pdf b/Amplicon/images/GL-amplicon-subwayplot.pdf
new file mode 100644
index 00000000..d97f4372
Binary files /dev/null and b/Amplicon/images/GL-amplicon-subwayplot.pdf differ
diff --git a/Amplicon/images/GL-amplicon-subwayplot.png b/Amplicon/images/GL-amplicon-subwayplot.png
new file mode 100644
index 00000000..cb0dc7f6
Binary files /dev/null and b/Amplicon/images/GL-amplicon-subwayplot.png differ
diff --git a/Licenses/Amplicon_and_Metagenomics_NOSA_License.pdf b/Licenses/Metagenomics_NOSA_License.pdf
similarity index 100%
rename from Licenses/Amplicon_and_Metagenomics_NOSA_License.pdf
rename to Licenses/Metagenomics_NOSA_License.pdf
diff --git a/Metagenomics/README.md b/Metagenomics/README.md
index bdc0aa53..ebfd4a0d 100644
--- a/Metagenomics/README.md
+++ b/Metagenomics/README.md
@@ -14,12 +14,12 @@
## Licenses
-The software for the Metagenomics pipelines is released under the [NASA Open Source Agreement (NOSA) Version 1.3](../Licenses/Amplicon_and_Metagenomics_NOSA_License.pdf).
+The software for the Metagenomics pipelines is released under the [NASA Open Source Agreement (NOSA) Version 1.3](../Licenses/Metagenomics_NOSA_License.pdf).
### 3rd Party Software Licenses
-Licenses for the 3rd party open source software utilized in the Metagenomics pipelines can be found in the [3rd_Party_Licenses sub-directory](../3rd_Party_Licenses/Amplicon_and_Metagenomics_3rd_Party_Software.md) on the GeneLab_Data_Processing GitHub repository landing page.
+Licenses for the 3rd party open source software utilized in the Metagenomics pipelines can be found in the [3rd_Party_Licenses sub-directory](../3rd_Party_Licenses/Metagenomics_3rd_Party_Software.md) on the GeneLab_Data_Processing GitHub repository landing page.
diff --git a/README.md b/README.md
index 83c8769d..e23f668b 100644
--- a/README.md
+++ b/README.md
@@ -4,7 +4,7 @@
# GeneLab_Data_Processing
## About
-The [NASA GeneLab](https://genelab.nasa.gov/) Data Processing team and [Analysis Working Group](https://osdr.nasa.gov/bio/awg/about.html) members have created standard pipelines for processing omics data from spaceflight and space-relevant experiments. This repository contains the processing pipelines that have been standardized to date for the assay types indicated below. Each subdirectory in this repository holds current and previous pipeline versions for the respective assay type, including detailed descriptions and processing instructions as well as the exact processing commands used to generate processed data for datasets hosted in the [Open Science Data Repository (OSDR)](https://osdr.nasa.gov/bio/repo/).
+The [NASA GeneLab](https://www.nasa.gov/osdr-genelab-about/) Data Processing team and [Analysis Working Group](https://www.nasa.gov/osdr-open-science-analysis-working-groups/) members have created standard pipelines for processing omics data from spaceflight and space-relevant experiments. This repository contains the processing pipelines that have been standardized to date for the assay types indicated below. Each subdirectory in this repository holds current and previous pipeline versions for the respective assay type, including detailed descriptions and processing instructions as well as the exact processing commands used to generate processed data for datasets hosted in the [Open Science Data Repository (OSDR)](https://osdr.nasa.gov/bio/repo/).
---
@@ -29,21 +29,28 @@ Click on an assay type below for data processing information.
---
## Usage
-We encourage all investigators working with space-relevant omics data to process their data using the standard pipelines described here when possible. Anyone planning to publish analyses derived from [GeneLab processed data](https://genelab-data.ndc.nasa.gov/genelab/projects) may refer to this repository for data processing methods. If you have omics data from a spaceflight or space-relevant experiment, you can submit your data to GeneLab through our [submission portal](https://genelab-data.ndc.nasa.gov/geode-sso-login/).
+We encourage all investigators working with space-relevant omics data to process their data using the standard pipelines described here when possible. Anyone planning to publish analyses derived from [GeneLab processed data](https://genelab-data.ndc.nasa.gov/genelab/projects) may refer to this repository for data processing methods. If you have omics data from a spaceflight or space-relevant experiment, you can submit your data to GeneLab through our [submission portal](https://www.nasa.gov/osdr-submission-portal/).
---
## Licenses
The software for each pipeline is released under the NASA Open Source Agreement (NOSA) Version 1.3
-- [Amplicon License](Licenses/Amplicon_and_Metagenomics_NOSA_License.pdf)
-- [Metagenomics License](Licenses/Amplicon_and_Metagenomics_NOSA_License.pdf)
+- [Amplicon License](https://github.com/nasa/GeneLab_AmpliconSeq_Workflow/blob/main/License/Amplicon_NOSA_License.pdf)
+- [Metagenomics License](./Licenses/Metagenomics_NOSA_License.pdf)
+- [Methyl-Seq License](./Licenses/Methylation_Sequencing_NOSA_License.pdf)
+- [RNAseq License](./Licenses/RNA_Sequencing_NOSA_License.pdf)
+- [Microarray License](./Licenses/Microarray_GPL-3.0_with_Additional_Requirements_License.pdf)
### 3rd Party Software Licenses
Licenses for the 3rd party open source software utilized for each pipeline can be found by clicking the respective pipeline link below:
-- [Amplicon License](3rd_Party_Licenses/Amplicon_and_Metagenomics_3rd_Party_Software.md)
-- [Metagenomics License](3rd_Party_Licenses/Amplicon_and_Metagenomics_3rd_Party_Software.md)
+- [Amplicon Licenses](https://github.com/nasa/GeneLab_AmpliconSeq_Workflow/blob/main/License/3rd_Party_Licenses/README.md)
+- [Metagenomics Licenses](./3rd_Party_Licenses/Metagenomics_3rd_Party_Software.md)
+- [Methyl-Seq Licenses](./3rd_Party_Licenses/Methyl-Seq_3rd_Party_Software.md)
+- [RNAseq Licenses](./3rd_Party_Licenses/RNAseq_3rd_Party_Software.md)
+- [Microarray - Agilent 1-channel Licenses](./3rd_Party_Licenses/Microarray_Agilent_1_Channel_3rd_Party_Software.md)
+- [Microarray - Affymetrix Licenses](./3rd_Party_Licenses/Microarray_Affymetrix_3rd_Party_Software.md)
---