diff --git a/Metagenomics/README.md b/Metagenomics/README.md index ebfd4a0d..e5f6d41f 100644 --- a/Metagenomics/README.md +++ b/Metagenomics/README.md @@ -5,7 +5,7 @@ ## Select a specific pipeline for more info: * [Estimating host reads](Estimate_host_reads_in_raw_data) -* [Removing human reads](Remove_human_reads_from_raw_data) +* [Removing host reads](Remove_host_reads) * [Illumina](Illumina)
diff --git a/Metagenomics/Remove_human_reads_from_raw_data/Pipeline_GL-DPPD-7105_Versions/GL-DPPD-7105-A.md b/Metagenomics/Remove_host_reads/Pipeline_GL-DPPD-7105_Versions/GL-DPPD-7105-A.md similarity index 100% rename from Metagenomics/Remove_human_reads_from_raw_data/Pipeline_GL-DPPD-7105_Versions/GL-DPPD-7105-A.md rename to Metagenomics/Remove_host_reads/Pipeline_GL-DPPD-7105_Versions/GL-DPPD-7105-A.md diff --git a/Metagenomics/Remove_host_reads/Pipeline_GL-DPPD-7105_Versions/GL-DPPD-7105-B.md b/Metagenomics/Remove_host_reads/Pipeline_GL-DPPD-7105_Versions/GL-DPPD-7105-B.md new file mode 100644 index 00000000..e58e8c05 --- /dev/null +++ b/Metagenomics/Remove_host_reads/Pipeline_GL-DPPD-7105_Versions/GL-DPPD-7105-B.md @@ -0,0 +1,186 @@ +# GeneLab removal of human reads from metagenomics datasets + +> **It is NASA's policy that any human reads are to be removed from metagenomics datasets prior to being hosted in the [Open Science Data Repository (OSDR)](https://osdr.nasa.gov/bio/repo/). As such, all metagenomics datasets are screened against a human reference-genome [kraken2](https://github.com/DerrickWood/kraken2/wiki) database. 
This document holds an overview and some example commands of how GeneLab performs this.** + +--- + +**Date:** November X, 2025 +**Revision:** B +**Document Number:** GL-DPPD-7105-B + +**Submitted by:** +Jihan Yehia (GeneLab Data Processing Team) + +**Approved by:** +Samrawit Gebre (OSDR Project Manager) +Danielle Lopez (OSDR Deputy Project Manager) +Jonathan Galazka (OSDR Project Scientist) +Amanda Saravia-Butler (GeneLab Science Lead) +Barbara Novak (GeneLab Data Processing Lead) + +## Updates from previous revision +* Updated kraken2 from version 2.1.1 to 2.1.6 +* In [Step 1](#1-build-kraken2-database), used kraken2's `k2` wrapper script for `download-taxonomy` because the script supports HTTPS download as mentioned [here](https://github.com/DerrickWood/kraken2/blob/master/docs/MANUAL.markdown#introducing-k2) + +--- + +# Table of contents + +- [**Software used**](#software-used) +- [**General processing overview with example commands**](#general-processing-overview-with-example-commands) + - [**1. Build kraken2 database**](#1-build-kraken2-database) + - [**2. Filter out human-classified reads**](#2-filter-out-human-classified-reads) + - [Example if paired-end reads](#example-if-paired-end-reads) + - [Example if single-end reads](#example-if-single-end-reads) + - [**3. Generate a kraken2 summary report**](#3-generate-a-kraken2-summary-report) + +--- + +# Software used + +|Program|Version*|Relevant Links| +|:------|:-----:|------:| +|kraken2|`kraken2 -v`|[https://github.com/DerrickWood/kraken2/wiki](https://github.com/DerrickWood/kraken2/wiki)| + +> \* Exact versions utilized for a given dataset are available along with the processing commands for each specific dataset (exact versions may change over time as the software and databases are updated). + +--- + +# General processing overview with example commands + +> Output files listed in **bold** below are included with each Metagenomics dataset in the [Open Science Data Repository (OSDR)](https://osdr.nasa.gov/bio/repo/). 
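Since the version table above queries the tool with `kraken2 -v`, the exact version used for a run can be captured for the processing record with a small sketch like the following (guarded so it degrades gracefully on systems where kraken2 is not installed):

```shell
# Sketch: record the kraken2 version string used for a processing run.
# The fallback message only appears when kraken2 is not on PATH.
if command -v kraken2 >/dev/null 2>&1; then
    k2_version=$(kraken2 -v 2>&1 | head -n 1)
else
    k2_version="kraken2 not on PATH"
fi
echo "kraken2 version used: ${k2_version}"
```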
+ +## 1. Build kraken2 database +The database is built from the appropriate host genome. This example is done with the human genome ([GRCh38.p13 | GCF_000001405.39](https://www.ncbi.nlm.nih.gov/assembly/GCF_000001405.39)). +> **Note:** It is recommended to use NCBI sequences with kraken2 because sequences not downloaded from NCBI may require explicit assignment of taxonomy information before they can be used to build the database, as mentioned [here](https://github.com/DerrickWood/kraken2/blob/master/docs/MANUAL.markdown). + +```bash +# downloading and decompressing reference genome +wget -q https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/001/405/GCF_000001405.39_GRCh38.p13/GCF_000001405.39_GRCh38.p13_genomic.fna.gz +gunzip GCF_000001405.39_GRCh38.p13_genomic.fna.gz + + +# building kraken2 database +k2 download-taxonomy --db kraken2-human-db/ +kraken2-build --add-to-library GCF_000001405.39_GRCh38.p13_genomic.fna --no-masking --db kraken2-human-db/ +kraken2-build --build --db kraken2-human-db/ --threads 30 --no-masking +kraken2-build --clean --db kraken2-human-db/ +``` + +**Parameter Definitions:** + +* `download-taxonomy` - downloads taxonomic mapping information via the [k2 wrapper script](https://github.com/DerrickWood/kraken2/blob/master/docs/MANUAL.markdown#introducing-k2) +* `--add-to-library` - adds the fasta file to the library of sequences being included +* `--db` - specifies the directory we are putting the database in +* `--threads` - specifies the number of threads to use +* `--no-masking` - prevents [masking](https://github.com/DerrickWood/kraken2/wiki/Manual#masking-of-low-complexity-sequences) of low-complexity sequences +* `--build` - specifies to construct the kraken2-formatted database +* `--clean` - specifies to remove unnecessary intermediate files + +**Input data:** + +* None + +**Output data:** + +* kraken2 database files (hash.k2d, opts.k2d, and taxo.k2d) + +--- + +## 2. 
Filter out human-classified reads + +### Example if paired-end reads + +```bash +kraken2 --db kraken2-human-db --gzip-compressed --threads 4 --use-names --paired \ + --output sample-1-kraken2-output.txt --report sample-1-kraken2-report.tsv \ + --unclassified-out sample-1_R#.fastq sample-1-R1.fq.gz sample-1-R2.fq.gz + +# renaming and gzipping output files +mv sample-1_R_1.fastq sample-1_R1_HRremoved_raw.fastq && gzip sample-1_R1_HRremoved_raw.fastq +mv sample-1_R_2.fastq sample-1_R2_HRremoved_raw.fastq && gzip sample-1_R2_HRremoved_raw.fastq +``` + +**Parameter Definitions:** + +* `--db` - specifies the directory holding the kraken2 database files created in step 1 +* `--gzip-compressed` - specifies the input fastq files are gzip-compressed +* `--threads` - specifies the number of threads to use +* `--use-names` - specifies adding taxa names in addition to taxids +* `--paired` - specifies input reads are paired-end +* `--output` - specifies the name of the kraken2 read-based output file (one line per read) +* `--report` - specifies the name of the kraken2 report output file (one line per taxa, with number of reads assigned to it) +* `--unclassified-out` - name of output files of reads that were not classified (the `#` symbol gets replaced with "_1" and "_2" in the output file names) +* last two positional arguments are the input read files + +**Input data:** + +* sample-1-R1.fq.gz (gzipped forward-reads fastq file) +* sample-1-R2.fq.gz (gzipped reverse-reads fastq file) + +**Output data:** + +* sample-1-kraken2-output.txt (kraken2 read-based output file (one line per read)) +* sample-1-kraken2-report.tsv (kraken2 report output file (one line per taxa, with number of reads assigned to it)) +* **sample-1_R1_HRremoved_raw.fastq.gz** (host-read removed, gzipped forward-reads fastq file) +* **sample-1_R2_HRremoved_raw.fastq.gz** (host-read removed, gzipped reverse-reads fastq file) + +### Example if single-end reads + +```bash +kraken2 --db kraken2-human-db 
--gzip-compressed --threads 4 --use-names \ + --output sample-1-kraken2-output.txt --report sample-1-kraken2-report.tsv \ + --unclassified-out sample-1_HRremoved_raw.fastq sample-1.fq.gz + +# gzipping output file +gzip sample-1_HRremoved_raw.fastq +``` + +**Parameter Definitions:** + +* `--db` - specifies the directory holding the kraken2 database files created in step 1 +* `--gzip-compressed` - specifies the input fastq files are gzip-compressed +* `--threads` - specifies the number of threads to use +* `--use-names` - specifies adding taxa names in addition to taxids +* `--output` - specifies the name of the kraken2 read-based output file (one line per read) +* `--report` - specifies the name of the kraken2 report output file (one line per taxa, with number of reads assigned to it) +* `--unclassified-out` - name of output files of reads that were not classified +* last positional argument is the input read file + +**Input data:** + +* sample-1.fq.gz (gzipped reads fastq file) + +**Output data:** + +* sample-1-kraken2-output.txt (kraken2 read-based output file (one line per read)) +* sample-1-kraken2-report.tsv (kraken2 report output file (one line per taxa, with number of reads assigned to it)) +* **sample-1_HRremoved_raw.fastq.gz** (host-read removed, gzipped reads fastq file) + +--- + +## 3. Generate a kraken2 summary report +Utilizes a Unix-like command-line. 
+ +```bash +total_fragments=$(wc -l sample-1-kraken2-output.txt | sed 's/^ *//' | cut -f 1 -d " ") + +fragments_retained=$(grep -w -m 1 "unclassified" sample-1-kraken2-report.tsv | cut -f 2) + +perc_removed=$(printf "%.2f\n" $(echo "scale=4; 100 - ${fragments_retained} / ${total_fragments} * 100" | bc -l)) + +cat <( printf "Sample_ID\tTotal_fragments_before\tTotal_fragments_after\tPercent_host_reads_removed\n" ) \ + <( printf "Sample-1\t${total_fragments}\t${fragments_retained}\t${perc_removed}\n" ) > Human-read-removal-summary.tsv +``` + +**Input data:** + +* sample-1-kraken2-output.txt (kraken2 read-based output file (one line per read)) +* sample-1-kraken2-report.tsv (kraken2 report output file (one line per taxa, with number of reads assigned to it)) + +**Output data:** + +* Human-read-removal-summary.tsv (a tab-separated file with 4 columns: "Sample_ID", "Total_fragments_before", "Total_fragments_after", "Percent_host_reads_removed") +* *Note: The percent human reads removed from each sample is provided in the assay table on the [Open Science Data Repository (OSDR)](https://osdr.nasa.gov/bio/repo/).* + +--- diff --git a/Metagenomics/Remove_human_reads_from_raw_data/Pipeline_GL-DPPD-7105_Versions/GL-DPPD-7105.md b/Metagenomics/Remove_host_reads/Pipeline_GL-DPPD-7105_Versions/GL-DPPD-7105.md similarity index 100% rename from Metagenomics/Remove_human_reads_from_raw_data/Pipeline_GL-DPPD-7105_Versions/GL-DPPD-7105.md rename to Metagenomics/Remove_host_reads/Pipeline_GL-DPPD-7105_Versions/GL-DPPD-7105.md diff --git a/Metagenomics/Remove_host_reads/README.md b/Metagenomics/Remove_host_reads/README.md new file mode 100644 index 00000000..6ba12a62 --- /dev/null +++ b/Metagenomics/Remove_host_reads/README.md @@ -0,0 +1,24 @@ +# GeneLab pipeline for removing host reads in metagenomics sequencing data + +> **It is NASA's policy that any human reads are to be removed from metagenomics datasets prior to being hosted in the [Open Science Data Repository 
(OSDR)](https://osdr.nasa.gov/bio/repo/). As such, the document [`GL-DPPD-7105-B.md`](Pipeline_GL-DPPD-7105_Versions/GL-DPPD-7105-B.md) holds an overview and example commands for how GeneLab identifies and removes human DNA in metagenomics sequencing datasets. See the [Repository Links](#repository-links) descriptions below for more information. The percentage of human reads removed and a GeneLab human read removal summary is provided for each GLDS dataset in the [Open Science Data Repository (OSDR)](https://osdr.nasa.gov/bio/repo/).** +> +>**Because host-read removal is broadly needed for metagenomics processing, the original _MGRemoveHumanReads_ has been expanded to include removal of other host DNA and renamed _MGRemoveHostReads_. The current pipeline supports removal of human reads by default as well as reads from any other host organism relevant to the dataset.** +> +> Note: The exact human read identification and removal commands as well as pipeline version used for specific GLDS datasets can be found in the *_processing_info.zip file under "Files" for each respective GLDS dataset in the [Open Science Data Repository (OSDR)](https://osdr.nasa.gov/bio/repo/). +--- + +## Repository Links + +* [**Pipeline_GL-DPPD-7105_Versions**](Pipeline_GL-DPPD-7105_Versions) + + - Contains the versions documentation of both the current GeneLab pipeline for identifying and removing host reads in metagenomics sequencing data (MGRemoveHostReads) and the previous GeneLab pipeline dedicated to identifying and removing human reads only (MGRemoveHumanReads) + +* [**Workflow_Documentation**](Workflow_Documentation) + + - Contains instructions for installing and running the current GeneLab MGRemoveHostReads workflow and the previous GeneLab MGRemoveHumanReads workflow + +--- + +**Developed and maintained by:** +Michael D. 
Lee (Mike.Lee@nasa.gov) +Jihan Yehia (jihan.yehia@nasa.gov) diff --git a/Metagenomics/Remove_host_reads/Workflow_Documentation/NF_MGRemoveHostReads/CHANGELOG.md b/Metagenomics/Remove_host_reads/Workflow_Documentation/NF_MGRemoveHostReads/CHANGELOG.md new file mode 100644 index 00000000..72788287 --- /dev/null +++ b/Metagenomics/Remove_host_reads/Workflow_Documentation/NF_MGRemoveHostReads/CHANGELOG.md @@ -0,0 +1,29 @@ +# Workflow change log + +All notable changes to this project will be documented in this file. + +The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.0.0/), +and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0.html). + + +## [1.0.0](https://github.com/nasa/GeneLab_Data_Processing/tree/master/Metagenomics/Remove_host_reads/Workflow_Documentation/NF_MGRemoveHostReads) + +### Changed +- Expand to support removal of host reads beyond human samples, forming the basis of the current MGRemoveHostReads workflow +- Update to the latest pipeline version [GL-DPPD-7105-B](../../Pipeline_GL-DPPD-7105_Versions/GL-DPPD-7105-B.md) +of the GeneLab Remove-Host-Reads consensus processing pipeline. +- Pipeline implementation as a Nextflow workflow [NF_MGRemoveHostReads](./) rather than Snakemake as in +previous workflow versions. + +### Added +- Build kraken2 database from scratch using host organism's information pulled from [hosts.csv](workflow_code/assets/hosts.csv) +- Create protocol.txt as an output file describing workflow methods + +### Removed +- kraken2-human-db/ no longer automatically downloaded to run with the workflow. It can now be explicitly set or built from scratch in case it doesn't exist. + +
+ +--- + +> ***Note:** Change log of the Snakemake workflow (SW_MGRemoveHumanReads) that is associated with the previous version of the GeneLab Remove-Host-Reads Pipeline [GL-DPPD-7105](../../Pipeline_GL-DPPD-7105_Versions/GL-DPPD-7105-A.md) can be found [here](../SW_MGRemoveHumanReads/CHANGELOG.md)* \ No newline at end of file diff --git a/Metagenomics/Remove_host_reads/Workflow_Documentation/NF_MGRemoveHostReads/README.md b/Metagenomics/Remove_host_reads/Workflow_Documentation/NF_MGRemoveHostReads/README.md new file mode 100644 index 00000000..e786dae7 --- /dev/null +++ b/Metagenomics/Remove_host_reads/Workflow_Documentation/NF_MGRemoveHostReads/README.md @@ -0,0 +1,202 @@ +# NF_MGRemoveHostReads Workflow Information and Usage Instructions + + +## General workflow info +The current GeneLab Host Identification and Removal pipeline for metagenomics sequencing (MGRemoveHostReads), [GL-DPPD-7105-B.md](../../Pipeline_GL-DPPD-7105_Versions/GL-DPPD-7105-B.md), is implemented as a [Nextflow](https://www.nextflow.io/docs/stable/index.html) DSL2 workflow and utilizes [Singularity](https://docs.sylabs.io/guides/3.10/user-guide/introduction.html) containers or [conda](https://docs.conda.io/en/latest/) environments to install/run all tools. This workflow (NF_MGRemoveHostReads) is run using the command line interface (CLI) of any unix-based system. While knowledge of creating or modifying Nextflow workflows is not required to run the workflow as-is, the [Nextflow documentation](https://www.nextflow.io/docs/stable/index.html) is a useful resource for users who wish to modify and/or extend the workflow. + +
+ +## Utilizing the Workflow + +1. [Install Nextflow, Singularity, and Conda](#1-install-nextflow-singularity-and-conda) + 1a. [Install Nextflow and Conda](#1a-install-nextflow-and-conda) + 1b. [Install Singularity](#1b-install-singularity) + +2. [Download the Workflow Files](#2-download-the-workflow-files) + +3. [Fetch Singularity Images](#3-fetch-singularity-images) + +4. [Run the Workflow](#4-run-the-workflow) + 4a. [Start with a sample ID list as input](#4a-start-with-a-sample-id-list-as-input) + 4b. [Modify parameters and compute resources in the Nextflow config file](#4b-modify-parameters-and-compute-resources-in-the-nextflow-config-file) + +5. [Workflow Outputs](#5-workflow-outputs) + 5a. [Main outputs](#5a-main-outputs) + 5b. [Resource logs](#5b-resource-logs) + 
+ +--- + +### 1. Install Nextflow, Singularity, and Conda + +#### 1a. Install Nextflow and Conda + +Nextflow can be installed either through the [Anaconda bioconda channel](https://anaconda.org/bioconda/nextflow) or as documented on the [Nextflow documentation page](https://www.nextflow.io/docs/latest/getstarted.html). + +> Note: If you wish to install conda, we recommend installing a Miniforge version appropriate for your system, as documented on the [conda-forge website](https://conda-forge.org/download/), where you can find basic binaries for most systems. More detailed miniforge documentation is available in the [miniforge github repository](https://github.com/conda-forge/miniforge). +> +> Once conda is installed on your system, you can install the latest version of Nextflow by running the following commands: +> +> ```bash +> conda install -c bioconda nextflow +> nextflow self-update +> ``` +> You may also install [mamba](https://mamba.readthedocs.io/en/latest/index.html) first which is a faster implementation of conda and can be used as a drop-in replacement: +> ```bash +> conda install -c conda-forge mamba +> ``` + +
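As a quick sanity check (a sketch, not part of the documented pipeline), you can confirm Nextflow is available before continuing:

```shell
# Sketch: confirm nextflow is on PATH and capture its version string.
# The fallback message only appears when nextflow is not installed.
if command -v nextflow >/dev/null 2>&1; then
    nf_version=$(nextflow -v 2>&1 | head -n 1)
else
    nf_version="nextflow not on PATH"
fi
echo "${nf_version}"
```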
+ +#### 1b. Install Singularity + +Singularity is a container platform that allows usage of containerized software. This enables the GeneLab workflow to retrieve and use all software required for processing without the need to install the software directly on the user's system. + +We recommend installing Singularity on a system wide level as per the associated [documentation](https://docs.sylabs.io/guides/3.10/admin-guide/admin_quickstart.html). + +> Note: Singularity is also available through the [Anaconda conda-forge channel](https://anaconda.org/conda-forge/singularity). + +> Note: Alternatively, Docker can be used in place of Singularity. To get started with Docker, see the [Docker CE installation documentation](https://docs.docker.com/engine/install/). + +
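If opting for Docker, the kraken2 container can be pre-pulled in a similar way to the Singularity image fetch shown in step 3 below. This is a sketch that assumes the same biocontainers image tag used for Singularity also works under Docker:

```shell
# Sketch: pre-pull the kraken2 image with Docker (assumed image tag).
# Guarded so it does nothing on systems without Docker installed.
image="quay.io/biocontainers/kraken2:2.1.6--pl5321h077b44d_0"
if command -v docker >/dev/null 2>&1; then
    docker pull "${image}"
else
    echo "docker not found; skipping pull of ${image}"
fi
```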
+ +--- + +### 2. Download the Workflow Files + +All files required for utilizing the NF_MGRemoveHostReads GeneLab workflow for removing host reads from metagenomics sequencing data are in the [workflow_code](workflow_code) directory. To get a copy of the latest *NF_MGRemoveHostReads* version onto your system, download the code as a zip file from the release page and unzip it by running the following commands: + +```bash +wget <URL-to-NF_MGRemoveHostReads_1.0.0.zip-from-the-release-page> +unzip NF_MGRemoveHostReads_1.0.0.zip && cd NF_MGRemoveHostReads_1.0.0 +``` + 
+ +--- + +### 3. Fetch Singularity Images + +Although Nextflow can fetch Singularity images from a url, doing so may cause issues as detailed [here](https://github.com/nextflow-io/nextflow/issues/1210). + +To avoid such issues, the required Singularity images can be manually fetched as follows before running the workflow: + +```bash +mkdir -p singularity +cd singularity + +# Pull required containers +singularity pull kraken2_2.1.6.img docker://quay.io/biocontainers/kraken2:2.1.6--pl5321h077b44d_0 + +cd .. +``` + +Once complete, a `singularity` folder containing the Singularity images will be created. Run the following command to export this folder as a Nextflow configuration environment variable to ensure Nextflow can locate the fetched images: + +```bash +export NXF_SINGULARITY_CACHEDIR=$(pwd)/singularity +``` + +
+ +--- + +### 4. Run the Workflow + +> ***Note:** All the commands in this step assume that the workflow will be run from within the `NF_MGRemoveHostReads_1.0.0` directory that was downloaded in [step 2](#2-download-the-workflow-files) above. They may also be run from a different location by providing the full path to the main.nf workflow file in the `NF_MGRemoveHostReads_1.0.0` directory.* + + +This workflow can be run by providing the path to a text file containing a single-column list of unique sample identifiers, an example of which is shown [here](workflow_code/unique-sample-IDs.txt), along with the path to the input data (raw reads of samples). + +It also requires setting the root directory where kraken2 reference databases are (or will be) stored. The workflow assumes databases follow the naming convention `kraken2-<host>-db`. If a database for a specified host is not found in the provided root directory, the workflow automatically builds one from scratch and saves it in the same directory using that naming convention. + +When the workflow builds a kraken2 database from scratch, it is important to ensure that the host organism's details are present in the hosts.csv table [here](workflow_code/assets/hosts.csv). If not, they should be appended to the table in the following format: `name,species,refseq_ID,genome_build,FASTA_URL`. + +Alternatively, a pre-built database can be manually downloaded and unpacked into the root directory, provided it follows the same naming convention. An example is available in the [reference database info page](https://github.com/nasa/GeneLab_Data_Processing/blob/master/Metagenomics/Remove_host_reads/Workflow_Documentation/SW_MGRemoveHumanReads-A/reference-database-info.md), which describes how the human database was generated for a previous version of this workflow and how to obtain it for reuse. + +> Note: Nextflow commands use both single hyphen arguments (e.g.
-profile) that denote general Nextflow arguments and double hyphen arguments (e.g. --host) that denote workflow-specific parameters. Take care to use the proper number of hyphens for each argument. + 
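As a sketch of the database lookup described above (the root directory and host name below are hypothetical placeholders, not workflow defaults), the expected database path is simply the root directory joined with the naming convention:

```shell
# Hypothetical sketch of the kraken2 database naming convention.
ref_dbs_dir="/data/ref-dbs"
host="mouse"    # must match a `name` entry in workflow_code/assets/hosts.csv

db_dir="${ref_dbs_dir}/kraken2-${host}-db"

if [ -d "${db_dir}" ]; then
    echo "using existing database: ${db_dir}"
else
    echo "no database at ${db_dir}; the workflow would build one there"
fi
```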
+ +#### 4a. Start with a sample ID list as input + +```bash +nextflow run main.nf \ + -resume \ + -profile singularity \ + --ref_dbs_Dir <path/to/reference-databases-dir> \ + --sample_id_list unique_sample_ids.txt \ + --reads_dir <path/to/raw-reads-dir> +``` + 
+ + +**Required Parameters:** + +* `main.nf` - Instructs Nextflow to run the NF_MGRemoveHostReads workflow. If running in a directory other than `NF_MGRemoveHostReads_1.0.0`, replace with the full path to the NF_MGRemoveHostReads main.nf workflow file. +* `-resume` - Resumes workflow execution using previously cached results +* `-profile` – Specifies the configuration profile(s) to load (multiple options can be provided as a comma-separated list) + * Software environment profile options (choose one): + * `singularity` - instructs Nextflow to use Singularity container environments + * `docker` - instructs Nextflow to use Docker container environments + * `conda` - instructs Nextflow to use conda environments via the conda package manager + * `mamba` - instructs Nextflow to use conda environments via the mamba package manager + * Other option (can be combined with the software environment option above using a comma, e.g. `-profile slurm,singularity`): + * `slurm` - instructs Nextflow to use the [Slurm cluster management and job scheduling system](https://slurm.schedmd.com/overview.html) to schedule and run the jobs on a Slurm HPC cluster +* `--ref_dbs_Dir` - Specifies the path to the directory where kraken2 databases are or will be stored +* `--sample_id_list` - path to a single-column file with unique sample identifiers (type: string, default: null) + > *Note: An example of this file is provided in the [workflow_code](workflow_code) directory [here](workflow_code/unique-sample-IDs.txt).* +* `--reads_dir` - path to raw reads directory (type: string, default: null) + + +**Optional Parameters:** +> *Note: See `nextflow run -h` and [Nextflow's CLI run command documentation](https://nextflow.io/docs/latest/cli.html#run) for more options and details on how to run Nextflow.* + + +* `--is_single` - whether data is single-end (type: boolean, default: false) +* `--single_suffix` - raw reads suffix that follows the unique part of sample names (type: string, default: 
"_raw.fastq.gz") +* `--R1_suffix` - raw forward reads suffix that follows the unique part of sample names (type: string, default: "_R1_raw.fastq.gz") +* `--R2_suffix` - raw reverse reads suffix that follows the unique part of sample names (type: string, default: "_R2_raw.fastq.gz") +* `--single_out_suffix` - host-removed reads suffix that follows the unique part of sample names (type: string, default: "_HRremoved_raw.fastq.gz") +* `--R1_out_suffix` - host-removed forward reads suffix that follows the unique part of sample names (type: string, default: "_R1_HRremoved_raw.fastq.gz") +* `--R2_out_suffix` - host-removed reverse reads suffix that follows the unique part of sample names (type: string, default: "_R2_HRremoved_raw.fastq.gz") +* `--host` - host organism, should match the entry provided under `name` column in [hosts.csv](workflow_code/assets/hosts.csv) (type: string, default: "human") + +
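To make the suffix parameters concrete, here is a sketch assuming the workflow simply concatenates each sample ID with the relevant suffix, as the parameter descriptions suggest (`sample-1` is a made-up ID):

```shell
# Hypothetical example of how sample IDs and suffix parameters combine.
sample="sample-1"                             # made-up sample ID
R1_suffix="_R1_raw.fastq.gz"                  # default --R1_suffix
R1_out_suffix="_R1_HRremoved_raw.fastq.gz"    # default --R1_out_suffix

in_file="${sample}${R1_suffix}"       # file expected inside --reads_dir
out_file="${sample}${R1_out_suffix}"  # host-removed file the workflow writes

echo "${in_file} -> ${out_file}"
```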
+ +#### 4b. Modify parameters and compute resources in the Nextflow config file + +Alternatively, all parameters and workflow resources can be specified directly in the [nextflow.config](./workflow_code/nextflow.config) file. For detailed instructions on how to modify and set parameters in the config file, please see the [documentation here](https://www.nextflow.io/docs/latest/config.html). + +Once you've downloaded the workflow template, you can modify the parameters in the `params` scope and the cpu/memory requirements in the `process` scope of your downloaded [nextflow.config](workflow_code/nextflow.config) file as needed to match the dataset you want to process and the system you're using for processing. + 
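For example, a minimal `params` block in nextflow.config might look like the following sketch (all values are hypothetical placeholders; parameter names are those listed in step 4a above):

```groovy
params {
    ref_dbs_Dir    = "/data/ref-dbs"           // hypothetical path
    sample_id_list = "unique_sample_ids.txt"
    reads_dir      = "/data/raw-reads"         // hypothetical path
    host           = "human"                   // default host
    is_single      = false                     // paired-end data
}
```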
+ +--- + +### 5. Workflow Outputs + +#### 5a. Main Outputs + +The outputs from this pipeline are documented in the [GL-DPPD-7105-B](https://github.com/nasa/GeneLab_Data_Processing/blob/master/Metagenomics/Remove_host_reads/Pipeline_GL-DPPD-7105_Versions/GL-DPPD-7105-B.md) processing protocol. + + +The workflow also outputs the following: + - processing_info/protocol.txt (a text file describing the methods used by the workflow) + +#### 5b. Resource Logs + +Standard Nextflow resource usage logs are also produced as follows: + +**Nextflow Resource Usage Logs** + - processing_info/execution_report_{timestamp}.html (an html report that includes metrics about the workflow execution including computational resources and exact workflow process commands) + - processing_info/execution_timeline_{timestamp}.html (an html timeline for all processes executed in the workflow) + - processing_info/execution_trace_{timestamp}.txt (an execution tracing file that contains information about each process executed in the workflow, including: submission time, start time, completion time, cpu and memory used, machine-readable output) + +> Further details about these logs can also be found in [this Nextflow documentation page](https://www.nextflow.io/docs/latest/tracing.html#execution-report). + 
+ +--- \ No newline at end of file diff --git a/Metagenomics/Remove_host_reads/Workflow_Documentation/README.md b/Metagenomics/Remove_host_reads/Workflow_Documentation/README.md new file mode 100644 index 00000000..f511791f --- /dev/null +++ b/Metagenomics/Remove_host_reads/Workflow_Documentation/README.md @@ -0,0 +1,14 @@ +# GeneLab Workflow Information for Removing Host Reads in Metagenomics Seq Data + +> **GeneLab has wrapped each step of the pipeline for removing human reads in metagenomics sequencing data (MGRemoveHumanReads), starting with pipeline version A, into a workflow. This workflow has since been expanded to support removal of host reads beyond human samples, forming the basis of the current MGRemoveHostReads workflow. The table below lists (and links to) the previous MGRemoveHumanReads version as well as the expanded MGRemoveHostReads versions and the corresponding workflow subdirectories, with the current implementation indicated. Each workflow subdirectory contains information about the workflow along with instructions for installation and usage. 
Exact workflow run info and the MGRemoveHumanReads or MGRemoveHostReads version used to process specific datasets that have been released are provided with their processed data in the [Open Science Data Repository (OSDR)](https://osdr.nasa.gov/bio/repo/).** + +## Pipeline Versions and Corresponding Workflows + +|Pipeline Version|Current Workflow Version (for respective pipeline version)|Nextflow Version| +|:---------------|:---------------------------------------------------------|:---------------| +|*[GL-DPPD-7105-B.md](../Pipeline_GL-DPPD-7105_Versions/GL-DPPD-7105-B.md)|[NF_MGRemoveHostReads_1.0.0](NF_MGRemoveHostReads)|25.04.6| +|[GL-DPPD-7105-A.md](../Pipeline_GL-DPPD-7105_Versions/GL-DPPD-7105-A.md)|[SW_MGRemoveHumanReads_1.0.0](SW_MGRemoveHumanReads-A)|N/A (Snakemake v7.26.0)| + +*Current GeneLab Pipeline/Workflow Implementation + +> See the [workflow change log](NF_MGRemoveHostReads/CHANGELOG.md) to access previous workflow versions and view all changes associated with each version update. 
diff --git a/Metagenomics/Remove_human_reads_from_raw_data/Workflow_Documentation/SW_MGRemoveHumanReads-A/CHANGELOG.md b/Metagenomics/Remove_host_reads/Workflow_Documentation/SW_MGRemoveHumanReads-A/CHANGELOG.md
similarity index 100%
rename from Metagenomics/Remove_human_reads_from_raw_data/Workflow_Documentation/SW_MGRemoveHumanReads-A/CHANGELOG.md
rename to Metagenomics/Remove_host_reads/Workflow_Documentation/SW_MGRemoveHumanReads-A/CHANGELOG.md
diff --git a/Metagenomics/Remove_human_reads_from_raw_data/Workflow_Documentation/SW_MGRemoveHumanReads-A/README.md b/Metagenomics/Remove_host_reads/Workflow_Documentation/SW_MGRemoveHumanReads-A/README.md
similarity index 100%
rename from Metagenomics/Remove_human_reads_from_raw_data/Workflow_Documentation/SW_MGRemoveHumanReads-A/README.md
rename to Metagenomics/Remove_host_reads/Workflow_Documentation/SW_MGRemoveHumanReads-A/README.md
diff --git a/Metagenomics/Remove_human_reads_from_raw_data/Workflow_Documentation/SW_MGRemoveHumanReads-A/reference-database-info.md b/Metagenomics/Remove_host_reads/Workflow_Documentation/SW_MGRemoveHumanReads-A/reference-database-info.md
similarity index 100%
rename from Metagenomics/Remove_human_reads_from_raw_data/Workflow_Documentation/SW_MGRemoveHumanReads-A/reference-database-info.md
rename to Metagenomics/Remove_host_reads/Workflow_Documentation/SW_MGRemoveHumanReads-A/reference-database-info.md
diff --git a/Metagenomics/Remove_human_reads_from_raw_data/Workflow_Documentation/SW_MGRemoveHumanReads-A/workflow_code/Snakefile b/Metagenomics/Remove_host_reads/Workflow_Documentation/SW_MGRemoveHumanReads-A/workflow_code/Snakefile
similarity index 100%
rename from Metagenomics/Remove_human_reads_from_raw_data/Workflow_Documentation/SW_MGRemoveHumanReads-A/workflow_code/Snakefile
rename to Metagenomics/Remove_host_reads/Workflow_Documentation/SW_MGRemoveHumanReads-A/workflow_code/Snakefile
diff --git a/Metagenomics/Remove_human_reads_from_raw_data/Workflow_Documentation/SW_MGRemoveHumanReads-A/workflow_code/config.yaml b/Metagenomics/Remove_host_reads/Workflow_Documentation/SW_MGRemoveHumanReads-A/workflow_code/config.yaml
similarity index 100%
rename from Metagenomics/Remove_human_reads_from_raw_data/Workflow_Documentation/SW_MGRemoveHumanReads-A/workflow_code/config.yaml
rename to Metagenomics/Remove_host_reads/Workflow_Documentation/SW_MGRemoveHumanReads-A/workflow_code/config.yaml
diff --git a/Metagenomics/Remove_human_reads_from_raw_data/Workflow_Documentation/SW_MGRemoveHumanReads-A/workflow_code/envs/kraken2.yaml b/Metagenomics/Remove_host_reads/Workflow_Documentation/SW_MGRemoveHumanReads-A/workflow_code/envs/kraken2.yaml
similarity index 100%
rename from Metagenomics/Remove_human_reads_from_raw_data/Workflow_Documentation/SW_MGRemoveHumanReads-A/workflow_code/envs/kraken2.yaml
rename to Metagenomics/Remove_host_reads/Workflow_Documentation/SW_MGRemoveHumanReads-A/workflow_code/envs/kraken2.yaml
diff --git a/Metagenomics/Remove_human_reads_from_raw_data/Workflow_Documentation/SW_MGRemoveHumanReads-A/workflow_code/example-reads/Sample-1_R1.fastq.gz b/Metagenomics/Remove_host_reads/Workflow_Documentation/SW_MGRemoveHumanReads-A/workflow_code/example-reads/Sample-1_R1.fastq.gz
similarity index 100%
rename from Metagenomics/Remove_human_reads_from_raw_data/Workflow_Documentation/SW_MGRemoveHumanReads-A/workflow_code/example-reads/Sample-1_R1.fastq.gz
rename to Metagenomics/Remove_host_reads/Workflow_Documentation/SW_MGRemoveHumanReads-A/workflow_code/example-reads/Sample-1_R1.fastq.gz
diff --git a/Metagenomics/Remove_human_reads_from_raw_data/Workflow_Documentation/SW_MGRemoveHumanReads-A/workflow_code/example-reads/Sample-1_R2.fastq.gz b/Metagenomics/Remove_host_reads/Workflow_Documentation/SW_MGRemoveHumanReads-A/workflow_code/example-reads/Sample-1_R2.fastq.gz
similarity index 100%
rename from Metagenomics/Remove_human_reads_from_raw_data/Workflow_Documentation/SW_MGRemoveHumanReads-A/workflow_code/example-reads/Sample-1_R2.fastq.gz
rename to Metagenomics/Remove_host_reads/Workflow_Documentation/SW_MGRemoveHumanReads-A/workflow_code/example-reads/Sample-1_R2.fastq.gz
diff --git a/Metagenomics/Remove_human_reads_from_raw_data/Workflow_Documentation/SW_MGRemoveHumanReads-A/workflow_code/unique-sample-IDs.txt b/Metagenomics/Remove_host_reads/Workflow_Documentation/SW_MGRemoveHumanReads-A/workflow_code/unique-sample-IDs.txt
similarity index 100%
rename from Metagenomics/Remove_human_reads_from_raw_data/Workflow_Documentation/SW_MGRemoveHumanReads-A/workflow_code/unique-sample-IDs.txt
rename to Metagenomics/Remove_host_reads/Workflow_Documentation/SW_MGRemoveHumanReads-A/workflow_code/unique-sample-IDs.txt
diff --git a/Metagenomics/Remove_human_reads_from_raw_data/README.md b/Metagenomics/Remove_human_reads_from_raw_data/README.md
deleted file mode 100644
index 211db8ed..00000000
--- a/Metagenomics/Remove_human_reads_from_raw_data/README.md
+++ /dev/null
@@ -1,22 +0,0 @@
-# GeneLab pipeline for removing human reads in Illumina metagenomics sequencing data
-
-> **It is NASA's policy that any human reads are to be removed from metagenomics datasets prior to being hosted in the [Open Science Data Repository (OSDR)](https://osdr.nasa.gov/bio/repo/). As such, the document [`GL-DPPD-7105-A.md`](Pipeline_GL-DPPD-7105_Versions/GL-DPPD-7105-A.md) holds an overview and example commands for how GeneLab identifies and removes human DNA in Illumina metagenomics sequencing datasets. See the [Repository Links](#repository-links) descriptions below for more information. The percentage of human reads removed and a GeneLab human read removal summary is provided for each GLDS dataset in the [Open Science Data Repository (OSDR)](https://osdr.nasa.gov/bio/repo/).**
->
-> Note: The exact human read identification and removal commands and MGRemoveHumanReads version used for specific GLDS datasets can be found in the *_processing_info.zip file under "Files" for each respective GLDS dataset in the [Open Science Data Repository (OSDR)](https://osdr.nasa.gov/bio/repo/).
-
----
-
-## Repository Links
-
-* [**Pipeline_GL-DPPD-7105_Versions**](Pipeline_GL-DPPD-7105_Versions)
-
-  - Contains the current and previous GeneLab pipeline for identifying and removing human reads in Illumina metagenomics sequencing data (MGRemoveHumanReads) versions documentation
-
-* [**Workflow_Documentation**](Workflow_Documentation)
-
-  - Contains instructions for installing and running the GeneLab MGRemoveHumanReads workflow
-
----
-
-**Developed and maintained by:**
-Michael D. Lee (Mike.Lee@nasa.gov)
diff --git a/Metagenomics/Remove_human_reads_from_raw_data/Workflow_Documentation/README.md b/Metagenomics/Remove_human_reads_from_raw_data/Workflow_Documentation/README.md
deleted file mode 100644
index 56d474d6..00000000
--- a/Metagenomics/Remove_human_reads_from_raw_data/Workflow_Documentation/README.md
+++ /dev/null
@@ -1,13 +0,0 @@
-# GeneLab Workflow Information for Removing Human Reads in Illumina Metagenomics Seq Data
-
-> **GeneLab has wrapped each step of the removing human reads in Illumina metagenomics sequencing data pipeline (MGRemoveHumanReads), starting with pipeline version A, into a workflow. The table below lists (and links to) each MGRemoveHumanReads version and the corresponding workflow subdirectory, the current MGRemoveHumanReads pipeline/workflow implementation is indicated. The workflow subdirectory contains information about the workflow along with instructions for installation and usage. Exact workflow run info and MGRemoveHumanReads version used to process specific datasets that have been released are provided with their processed data in the [Open Science Data Repository (OSDR)](https://osdr.nasa.gov/bio/repo/).**
-
-## MGRemoveHumanReads Pipeline Version and Corresponding Workflow
-
-|Pipeline Version|Current Workflow Version (for respective pipeline version)|
-|:---------------|:---------------------------------------------------------|
-|*[GL-DPPD-7105-A.md](../Pipeline_GL-DPPD-7105_Versions/GL-DPPD-7105-A.md)|[1.0.0](SW_MGRemoveHumanReads-A)|
-
-*Current GeneLab Pipeline/Workflow Implementation
-
-> See the [workflow change log](SW_MGRemoveHumanReads-A/CHANGELOG.md) to access previous workflow versions and view all changes associated with each version update.