Merged
62 commits
95b29a7
first commit
olabiyi May 1, 2024
ca431fc
Added Documentation
olabiyi May 16, 2024
d4fcd12
fixed params.conda
olabiyi May 16, 2024
f239d0e
Fixed paths and executor issues
olabiyi May 16, 2024
2bb8049
Limited queue size
olabiyi May 17, 2024
ea51ab3
Enabled retrial of failed processes
olabiyi May 17, 2024
9f07478
Edited container engine variable
olabiyi May 20, 2024
deea523
Handled pyfastx fasta index error
olabiyi May 21, 2024
6e68d0e
Now accepts GLDS accessions
olabiyi Jun 24, 2024
e409a28
fix GTDBTK download
olabiyi Jun 24, 2024
72b0601
Fixed humann utilities mounting issue
olabiyi Jun 25, 2024
a4df49d
Edited software collation
olabiyi Jun 25, 2024
d2cfbdf
Updated GTDBTK
olabiyi Jun 27, 2024
5b5cfa3
Added README.md
olabiyi Jun 28, 2024
f299748
Fixed extra header issue
olabiyi Jun 28, 2024
d119d4c
Fixed GTDBTK taxonomy variable issue
olabiyi Jul 1, 2024
9b79149
Changed default database location
olabiyi Jul 1, 2024
6409cbf
Updated versions and README
olabiyi Aug 27, 2024
c750589
fixed README typo
olabiyi Aug 27, 2024
c6c929a
fixed README typo
olabiyi Aug 27, 2024
54effb8
fixed README typo
olabiyi Aug 27, 2024
255744d
Merge pull request #101 from olabiyi/DEV_Metagenomics_Illumina_NF_con…
asaravia-butler Aug 27, 2024
8e36cc8
Updating header and correcting tool version typos
asaravia-butler Aug 28, 2024
b9e234b
Adding updates from previous version
asaravia-butler Aug 28, 2024
42b723a
Update GL-DPPD-7107-A.md
bnovak32 Sep 5, 2024
88739e0
Merge pull request #116 from bnovak32/DEV_Metagenomics_Illumina_NF_co…
asaravia-butler Sep 6, 2024
9a232e6
Updating change description
asaravia-butler Sep 6, 2024
9dfa6fa
Added post-processing workflow
olabiyi Sep 10, 2024
0279deb
Typo fixes
asaravia-butler Sep 11, 2024
ba86dc5
Typo fixes
asaravia-butler Sep 11, 2024
28843a3
Update GL-DPPD-7107-A.md
bnovak32 Sep 11, 2024
5adb13f
Merge pull request #121 from bnovak32/patch-4
asaravia-butler Sep 11, 2024
34b63a7
Merge branch 'nasa:DEV_Metagenomics_Illumina_NF_conversion' into DEV_…
olabiyi Sep 24, 2024
ecb72e8
Merge pull request #117 from olabiyi/DEV_Metagenomics_Illumina_NF_con…
asaravia-butler Oct 22, 2024
0c85354
Adding notes.
asaravia-butler Oct 23, 2024
e5110cb
Typo fix
asaravia-butler Oct 23, 2024
c3ca4ed
Adding missing pipeline info
asaravia-butler Oct 23, 2024
d2f33dd
Version info updates
asaravia-butler Oct 23, 2024
11d1dbd
Formatting update
asaravia-butler Oct 23, 2024
cfdd9a7
renaming NF_MGIllumina to NF_MGIllumina-A
asaravia-butler Oct 23, 2024
e5c4fe6
Edited README and accession parameter
olabiyi Oct 23, 2024
da5d9db
minor README edit
olabiyi Oct 23, 2024
f34870c
Merge pull request #127 from olabiyi/DEV_Metagenomics_Illumina_NF_con…
asaravia-butler Oct 23, 2024
b03f9ad
renamed accession parameter
olabiyi Oct 23, 2024
1aeeb73
Updatign signature matrix
asaravia-butler Oct 24, 2024
59b1653
Updating nextflow version
asaravia-butler Oct 24, 2024
54c5bd8
Formatting updates
asaravia-butler Oct 24, 2024
0dcec9b
Formatting updates
asaravia-butler Oct 24, 2024
ca89a74
Merge branch 'nasa:DEV_Metagenomics_Illumina_NF_conversion' into DEV_…
olabiyi Oct 30, 2024
311800e
Fixed read mapping bug
olabiyi Nov 2, 2024
e91d7a5
Fixed typos and no assemblies produced bug
olabiyi Nov 4, 2024
703f469
Merge pull request #129 from olabiyi/DEV_Metagenomics_Illumina_NF_con…
asaravia-butler Nov 5, 2024
d81c2c0
Update GL-DPPD-7113.md
bnovak32 Apr 4, 2025
e6fb470
Update GL-DPPD-7113.md
bnovak32 Apr 17, 2025
7dfbb47
Merge pull request #145 from nasa/GL-DPPD-7113-patch-1
asaravia-butler Apr 18, 2025
620598f
Metagenomics Illumina Nextflow conversion (#134)
olabiyi May 1, 2025
2a16111
Updated documentation (#150)
bnovak32 May 3, 2025
f9dec7c
[DEV_Metagenomics_Illumina] Minor documentation updates (#151)
bnovak32 May 8, 2025
eee1c4c
revert permissions on SW_MGIllumina scripts
bnovak32 May 8, 2025
296943b
Merge pull request #152 from nasa/DEV_Metagenomics_Illumina_NF_conver…
bnovak32 May 8, 2025
f3da4eb
Merge pull request #154 from nasa/DEV_RNAseq_vG
bnovak32 Jun 9, 2025
3355181
RNASeq workflow 2.0.1 patch update
bnovak32 Jul 2, 2025
1,216 changes: 1,216 additions & 0 deletions Metagenomics/Illumina/Pipeline_GL-DPPD-7107_Versions/GL-DPPD-7107-A.md

Large diffs are not rendered by default.

4 changes: 2 additions & 2 deletions Metagenomics/Illumina/README.md
@@ -1,7 +1,7 @@

# GeneLab bioinformatics processing pipeline for Illumina metagenomics sequencing data

> **The document [`GL-DPPD-7107.md`](Pipeline_GL-DPPD-7107_Versions/GL-DPPD-7107.md) holds an overview and example commands for how GeneLab processes Illumina metagenomics sequencing datasets. See the [Repository Links](#repository-links) descriptions below for more information. Processed data output files and processing code are provided for each GLDS dataset in the [Open Science Data Repository (OSDR)](https://osdr.nasa.gov/bio/repo/).**
> **The document [`GL-DPPD-7107-A.md`](Pipeline_GL-DPPD-7107_Versions/GL-DPPD-7107-A.md) holds an overview and example commands for how GeneLab processes Illumina metagenomics sequencing datasets. See the [Repository Links](#repository-links) descriptions below for more information. Processed data output files and processing code are provided for each GLDS dataset in the [Open Science Data Repository (OSDR)](https://osdr.nasa.gov/bio/repo/).**
>
> Note: The exact processing commands and MGIllumina version used for specific GLDS datasets can be found in the *_processing_info.zip file under "Files" for each respective GLDS dataset in the [Open Science Data Repository (OSDR)](https://osdr.nasa.gov/bio/repo/).

@@ -26,4 +26,4 @@
---

**Developed and maintained by:**
Michael D. Lee ([email protected])
Michael D. Lee ([email protected]) and Olabiyi A. Obayomi ([email protected])
@@ -0,0 +1,28 @@
# Workflow change log

All notable changes to this project will be documented in this file.

The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.0.0/),
and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0.html).


## [1.0.0](https://github.com/nasa/GeneLab_Data_Processing/tree/NF_MGIllumina_1.0.0/Metagenomics/Illumina/Workflow_Documentation/NF_MGIllumina)

### Changed
- Update to the latest pipeline version [GL-DPPD-7107-A](../../Pipeline_GL-DPPD-7107_Versions/GL-DPPD-7107-A.md)
of the GeneLab Metagenomics consensus processing pipeline.
- Pipeline implementation as a Nextflow workflow [NF_MGIllumina](./) rather than Snakemake as in
previous workflow versions.
- Run checkm separately on each bin and combine results to improve performance

### Fixed
- Allow explicit specification of the humann3 database location ([#62](https://github.com/nasa/GeneLab_Data_Processing/issues/62))
- Package bin and MAG fasta files into per-sample zip archives ([#76](https://github.com/nasa/GeneLab_Data_Processing/issues/76))

<BR>

---

> ***Note:** All previous workflow changes were associated with the previous version of the GeneLab Metagenomics Pipeline
[GL-DPPD-7107](../../Pipeline_GL-DPPD-7107_Versions/GL-DPPD-7107.md) and can be found in the
[change log of the Snakemake workflow (SW_MGIllumina)](../SW_MGIllumina/CHANGELOG.md).*
228 changes: 228 additions & 0 deletions Metagenomics/Illumina/Workflow_Documentation/NF_MGIllumina/README.md
@@ -0,0 +1,228 @@
# Workflow Information and Usage Instructions

## General Workflow Info

### Implementation Tools

The current GeneLab Illumina metagenomics sequencing data processing pipeline (MGIllumina-A), [GL-DPPD-7107-A.md](../../Pipeline_GL-DPPD-7107_Versions/GL-DPPD-7107-A.md), is implemented as a [Nextflow](https://nextflow.io/) DSL2 workflow and utilizes [Singularity](https://docs.sylabs.io/guides/3.10/user-guide/introduction.html) containers, [Docker](https://docs.docker.com/get-started/) containers, or [conda](https://docs.conda.io/en/latest/) environments to install/run all tools. This workflow is run using the command line interface (CLI) of any unix-based system. While knowledge of creating workflows in Nextflow is not required to run the workflow as is, [the Nextflow documentation](https://nextflow.io/docs/latest/index.html) is a useful resource for users who want to modify and/or extend this workflow.

> **Note on reference databases**
> Many reference databases are relied upon throughout this workflow. They will be installed and set up automatically the first time the workflow is run. Altogether, once installed and unpacked, they take up about 340 GB of storage, but they may require up to 500 GB during installation and initial unpacking, so be sure there is enough room on your system before running the workflow.
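
As a rough pre-flight check for the storage requirement above, the following sketch compares the free space of a target directory against the 500 GB peak. The `DB_DIR` variable and the helper function names are hypothetical and are not part of the workflow; point `DB_DIR` at wherever you intend to store the databases.

```bash
# Sketch only: DB_DIR and the helper names are assumptions, not workflow settings.

# Print the free space (in whole GB) on the filesystem holding a directory.
free_gb() {
    # -P forces one line per filesystem; -k reports 1024-byte blocks.
    df -Pk "$1" | awk 'NR == 2 { printf "%d\n", $4 / 1024 / 1024 }'
}

# Succeed only when the directory has at least the requested number of GB free.
has_enough_space() {
    local dir="$1" required_gb="$2"
    [ "$(free_gb "$dir")" -ge "$required_gb" ]
}

REQUIRED_GB=500            # peak usage during installation, per the note above
DB_DIR="${DB_DIR:-$PWD}"   # hypothetical: directory that will hold the databases

if has_enough_space "$DB_DIR" "$REQUIRED_GB"; then
    echo "OK: at least ${REQUIRED_GB} GB free in ${DB_DIR}"
else
    echo "WARNING: only $(free_gb "$DB_DIR") GB free in ${DB_DIR}; ${REQUIRED_GB} GB recommended"
fi
```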

<br>

## Utilizing the Workflow

1. [Installing Nextflow, Singularity, and conda](#1-installing-nextflow-singularity-and-conda)
1a. [Install Nextflow and conda](#1a-install-nextflow-and-conda)
1b. [Install Singularity](#1b-install-singularity)
2. [Download the workflow files](#2-download-the-workflow-files)
3. [Fetch Singularity Images](#3-fetch-singularity-images)
4. [Run the workflow](#4-run-the-workflow)
4a. [Approach 1: Start with an OSD or GLDS accession as input](#4a-approach-1-start-with-an-osd-or-glds-accession-as-input)
4b. [Approach 2: Start with a runsheet csv file as input](#4b-approach-2-start-with-a-runsheet-csv-file-as-input)
4c. [Modify parameters and compute resources in the Nextflow config file](#4c-modify-parameters-and-compute-resources-in-the-nextflow-config-file)
5. [Workflow outputs](#5-workflow-outputs)
5a. [Main outputs](#5a-main-outputs)
5b. [Resource logs](#5b-resource-logs)
6. [Post Processing](#6-post-processing)

<br>

---

### 1. Installing Nextflow, Singularity, and conda

#### 1a. Install Nextflow and conda

Nextflow can be installed either through [Anaconda](https://anaconda.org/bioconda/nextflow) or as documented on the [Nextflow documentation page](https://www.nextflow.io/docs/latest/getstarted.html).

> Note: If you want to install Anaconda, we recommend installing a Miniconda version appropriate for your system, as instructed by [Happy Belly Bioinformatics](https://astrobiomike.github.io/unix/conda-intro#getting-and-installing-conda).
>
> Once conda is installed on your system, you can install the latest version of Nextflow by running the following commands:
>
> ```bash
> conda install -c bioconda nextflow
> nextflow self-update
> ```
> You may also install [mamba](https://mamba.readthedocs.io/en/latest/index.html) first, which is a faster implementation of conda and can be used as a drop-in replacement:
> ```bash
> conda install -c conda-forge mamba
> mamba install -c bioconda nextflow
> nextflow self-update
> ```

<br>

#### 1b. Install Singularity

Singularity is a container platform that allows usage of containerized software. This enables the GeneLab workflow to retrieve and use all software required for processing without the need to install the software directly on the user's system.

We recommend installing Singularity on a system wide level as per the associated [documentation](https://docs.sylabs.io/guides/3.10/admin-guide/admin_quickstart.html).

> Note: Singularity is also available through [Anaconda](https://anaconda.org/conda-forge/singularity).

> Note: Alternatively, Docker can be used in place of Singularity. To get started with Docker, see the [Docker CE installation documentation](https://docs.docker.com/engine/install/).
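
Before moving on, it can help to confirm that the tools are actually on your `PATH`. The sketch below is one way to do that; `missing_tools` is a hypothetical helper, and the tool list assumes the Singularity route (swap in `docker` or `conda` if you plan to use those profiles instead).

```bash
# Sketch only: missing_tools is an illustrative helper, not part of the workflow.

# Print each given command name that cannot be found on PATH.
missing_tools() {
    for tool in "$@"; do
        command -v "$tool" >/dev/null 2>&1 || echo "$tool"
    done
}

missing=$(missing_tools nextflow singularity)
if [ -z "$missing" ]; then
    # Both tools were found, so report their versions.
    nextflow -version
    singularity --version
else
    echo "Not found on PATH:" $missing
fi
```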

<br>

---

### 2. Download the workflow files

All files required to run the NF_MGIllumina GeneLab workflow on Illumina metagenomics data are in the [workflow_code](workflow_code) directory. To get a copy of the latest *NF_MGIllumina* version onto your system, download the code as a zip file from the release page and unzip it by running the following commands:

```bash
wget https://github.com/nasa/GeneLab_Data_Processing/releases/download/NF_MGIllumina_1.0.0/NF_MGIllumina_1.0.0.zip
unzip NF_MGIllumina_1.0.0.zip && cd NF_MGIllumina_1.0.0
```

<br>

---

### 3. Fetch Singularity Images

Although Nextflow can fetch Singularity images from a url, doing so may cause issues as detailed [here](https://github.com/nextflow-io/nextflow/issues/1210).

To avoid this issue, run the following command to fetch the Singularity images prior to running the NF_MGIllumina workflow:

> Note: This command should be run from within the `NF_MGIllumina_1.0.0` directory that was downloaded in [step 2](#2-download-the-workflow-files) above.

```bash
bash ./bin/prepull_singularity.sh nextflow.config
```

Once complete, a `singularity` folder containing the Singularity images will be created. Run the following command to export this folder as a Nextflow configuration environment variable to ensure Nextflow can locate the fetched images:

```bash
export NXF_SINGULARITY_CACHEDIR=$(pwd)/singularity
```
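
A quick sanity check after exporting the variable is to confirm that the folder exists and is not empty. This is a minimal sketch; `count_cached_images` is a hypothetical helper, and it simply counts files rather than assuming any particular image file extension.

```bash
# Sketch only: count_cached_images is an illustrative helper, not part of the workflow.

# Count regular files directly inside the image cache directory (0 if it does not exist).
count_cached_images() {
    local dir="$1"
    [ -d "$dir" ] || { echo 0; return; }
    find "$dir" -maxdepth 1 -type f | wc -l | tr -d ' '
}

# Fall back to ./singularity when the variable has not been exported yet.
CACHE_DIR="${NXF_SINGULARITY_CACHEDIR:-$PWD/singularity}"
echo "Image cache: ${CACHE_DIR} ($(count_cached_images "$CACHE_DIR") files)"
```

Note that the `export` only lasts for the current shell session, so it needs to be repeated (or added to your shell startup file) in any new session used to run the workflow.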

<br>

---

### 4. Run the Workflow

> ***Note:** All the commands in this step must be run from within the `NF_MGIllumina_1.0.0` directory that was downloaded in [step 2](#2-download-the-workflow-files) above.*

For options and detailed help on how to run the workflow, run the following command:

```bash
nextflow run main.nf --help
```

> Note: Nextflow commands use both single hyphen arguments (e.g. -help) that denote general Nextflow
arguments and double hyphen arguments (e.g. --input_file) that denote workflow specific parameters.
Take care to use the proper number of hyphens for each argument.

<br>

#### 4a. Approach 1: Start with an OSD or GLDS accession as input

```bash
nextflow run main.nf -resume -profile singularity --accession OSD-574
```

<br>

#### 4b. Approach 2: Start with a runsheet csv file as input

```bash
nextflow run main.nf -resume -profile singularity --input_file PE_file.csv
```

<br>

**Required Parameters For All Approaches:**

* `run main.nf` - Instructs Nextflow to run the NF_MGIllumina workflow

* `-resume` - Resumes workflow execution using previously cached results

* `-profile` - Specifies the configuration profile(s) to load (multiple options can be provided as a comma-separated list)
   * Software environment profile options (choose one):
      * `singularity` - instructs Nextflow to use Singularity container environments
      * `docker` - instructs Nextflow to use Docker container environments
      * `conda` - instructs Nextflow to use conda environments via the conda package manager. By default, Nextflow will create environments at runtime using the yaml files in the [workflow_code/envs](workflow_code/envs/) folder. You can change this behavior by using the `--conda_*` workflow parameters or by editing the [nextflow.config](workflow_code/nextflow.config) file to specify a centralized conda environments directory via the `conda.cacheDir` parameter.
      * `mamba` - instructs Nextflow to use conda environments via the mamba package manager.
   * Other option (can be combined with the software environment option above):
      * `slurm` - instructs Nextflow to use the [Slurm cluster management and job scheduling system](https://slurm.schedmd.com/overview.html) to schedule and run the jobs on a Slurm HPC cluster.

* `--accession` – A GeneLab (GLDS) or OSD accession number, e.g. OSD-574.
> *Required only if you would like to download and process data directly from OSDR*

* `--input_file` – A single-end or paired-end runsheet csv file containing assay metadata for each sample, including sample_id, forward, reverse, and/or paired. Please see the [runsheet documentation](./examples/runsheet) in this repository for examples on how to format this file.
> *Required only if `--accession` is not passed as an argument*

<br>

> See `nextflow run -h` and [Nextflow's CLI run command documentation](https://nextflow.io/docs/latest/cli.html#run) for more options and details on how to run Nextflow.
> For additional information on editing the `nextflow.config` file, see [Step 4c](#4c-modify-parameters-and-compute-resources-in-the-nextflow-config-file) below.
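
To illustrate how the pieces above fit together (a comma-separated profile list plus exactly one of `--accession` or `--input_file`), here is a small sketch. `build_run_command` is a hypothetical helper that only prints the command; it guesses the input type from a `.csv` suffix, which is an assumption made for this example.

```bash
# Sketch only: build_run_command is illustrative and just prints a command line.
# Usage: build_run_command <comma-separated-profiles> <accession-or-runsheet.csv>
build_run_command() {
    local profiles="$1" input="$2"
    case "$input" in
        # A .csv argument is treated as a runsheet (assumption for this example).
        *.csv) echo "nextflow run main.nf -resume -profile ${profiles} --input_file ${input}" ;;
        # Anything else is treated as an OSD or GLDS accession.
        *)     echo "nextflow run main.nf -resume -profile ${profiles} --accession ${input}" ;;
    esac
}

# Slurm scheduling combined with Singularity containers, starting from an accession.
build_run_command slurm,singularity OSD-574
# Docker containers, starting from a local runsheet.
build_run_command docker PE_file.csv
```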


<br>

#### 4c. Modify parameters and compute resources in the Nextflow config file

Additionally, all parameters and workflow resources can be directly specified in the [nextflow.config](./workflow_code/nextflow.config) file. For detailed instructions on how to modify and set parameters in the config file, please see the [documentation here](https://www.nextflow.io/docs/latest/config.html).

Once you've downloaded the workflow template, you can modify the parameters in the `params` scope and the cpus/memory requirements in the `process` scope of your downloaded [nextflow.config](workflow_code/nextflow.config) file as needed to match the study you want to process and the system you are using for processing.
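
If you would rather not edit the shipped file, Nextflow also accepts an additional configuration file through its `-c` option. The sketch below writes a small override file; the file name `custom.config` and all values are illustrative assumptions, not recommendations from the workflow authors.

```bash
# Sketch only: the file name and every value below are illustrative assumptions.
cat > custom.config <<'EOF'
// Hypothetical overrides; adjust to your own system.
process {
    cpus   = 4
    memory = '16 GB'
}

executor {
    queueSize = 10
}
EOF

echo "Wrote custom.config; pass it to Nextflow with: -c custom.config"
```

Keep in mind that generic `process` defaults like these are overridden by any process-specific settings (for example `withName` or `withLabel` selectors), so check the downloaded `nextflow.config` to see which values actually apply to a given step.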

<br>

---

### 5. Workflow outputs

#### 5a. Main outputs

> Note: The outputs from the GeneLab Illumina metagenomics sequencing data processing pipeline workflow are documented in the [GL-DPPD-7107-A.md](../../Pipeline_GL-DPPD-7107_Versions/GL-DPPD-7107-A.md) processing protocol.

#### 5b. Resource logs

Standard Nextflow resource usage logs are also produced as follows:

**Nextflow Resource Usage Logs**
- Output:
  - Resource_Usage/execution_report_{timestamp}.html (an html report that includes metrics about the workflow execution including computational resources and exact workflow process commands)
  - Resource_Usage/execution_timeline_{timestamp}.html (an html timeline for all processes executed in the workflow)
  - Resource_Usage/execution_trace_{timestamp}.txt (an execution tracing file that contains information about each process executed in the workflow, including: submission time, start time, completion time, cpu and memory used, machine-readable output)

> Further details about these logs can also be found in [this Nextflow documentation page](https://www.nextflow.io/docs/latest/tracing.html#execution-report).
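
Because the trace file is tab-separated text, it can be scanned from the shell. The sketch below lists tasks that did not finish cleanly; `unfinished_tasks` is a hypothetical helper, and it assumes the trace includes the `name` and `status` columns that Nextflow writes by default (a workflow can customize the trace fields, so check the header of your own file).

```bash
# Sketch only: unfinished_tasks is illustrative and assumes name/status columns exist.

# Print the names of tasks whose status is neither COMPLETED nor CACHED.
# Columns are located by header name, so their position does not matter.
unfinished_tasks() {
    awk -F'\t' '
        NR == 1 {
            for (i = 1; i <= NF; i++) col[$i] = i
            if (!("name" in col) || !("status" in col)) {
                print "trace file lacks name/status columns" > "/dev/stderr"
                exit 1
            }
            next
        }
        $(col["status"]) != "COMPLETED" && $(col["status"]) != "CACHED" { print $(col["name"]) }
    ' "$1"
}

# Scan every trace file found in the Resource_Usage folder, if any.
for trace in Resource_Usage/execution_trace_*.txt; do
    if [ -e "$trace" ]; then
        unfinished_tasks "$trace"
    fi
done
```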

<br>

---

### 6. Post Processing

The post-processing workflow generates a README file, a protocols file, an md5sums
table, and a file association table suitable for uploading to OSDR.

For options and detailed help on how to run the post-processing workflow, run the following command:

```bash
nextflow run post_processing.nf --help
```

To generate the post-processing files after running the main processing workflow successfully, modify and set the parameters in [post_processing.config](workflow_code/post_processing.config), then run the following command:

```bash
nextflow -C post_processing.config run post_processing.nf -resume -profile singularity
```

The outputs of the post-processing workflow are described below:

**Post processing workflow**
- Output:
  - Post_processing/FastQC_Outputs/filtered_multiqc_GLmetagenomics_report.zip (Filtered sequence multiqc report with paths purged)
  - Post_processing/FastQC_Outputs/raw_multiqc_GLmetagenomics_report.zip (Raw sequence multiqc report with paths purged)
  - Post_processing/<GLDS_accession>_-associated-file-names.tsv (File association table for curation)
  - Post_processing/<GLDS_accession>_metagenomics-validation.log (Automated verification and validation log file)
  - Post_processing/processed_md5sum_GLmetagenomics.tsv (md5sums for the files to be released on OSDR)
  - Post_processing/processing_info_GLmetagenomics.zip (Zip file containing all files used to run the workflow and required logs with paths purged)
  - Post_processing/protocol.txt (File describing the methods used by the workflow)
  - Post_processing/README_GLmetagenomics.txt (README file listing and describing the outputs of the workflow)

@@ -0,0 +1,23 @@
# Runsheet File Specification

## Description

* The runsheet is a comma-separated file that contains the metadata required for processing
metagenomics sequence datasets through the GeneLab Illumina metagenomics sequencing data
processing pipeline (MGIllumina).


## Examples

1. Runsheet for an example [paired-end dataset](paired_end_dataset/PE_file.csv)
2. Runsheet for an example [single-end dataset](single_end_dataset/SE_file.csv)


## Required columns

| Column Name | Type | Description | Example |
|:------------|:-----|:------------|:--------|
| sample_id | string | Unique sample name, added as a prefix to sample-specific processed data output files. Should not include spaces or special characters. | RR23_FCS_FLT_F1 |
| forward | string (local path) | Location of the raw reads file. For paired-end data, this specifies the forward reads fastq.gz file. | /my/data/sample1_R1_HRremoved_raw.fastq.gz |
| reverse | string (local path) | Location of the raw reads file. For paired-end data, this specifies the reverse reads fastq.gz file. For single-end data, this column should be omitted. | /my/data/sample1_R2_HRremoved_raw.fastq.gz |
| paired | bool | Set to True if the samples were sequenced as paired-end. If set to False, samples are assumed to be single-end. | False |
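
To make the table concrete, the sketch below writes a small paired-end runsheet and runs a basic consistency check on it. The sample names, file paths, column order, and the `check_runsheet` helper are all assumptions made for illustration; the linked example files remain the reference for the exact layout.

```bash
# Sketch only: sample names, paths, and column order are illustrative assumptions.
cat > example_runsheet.csv <<'EOF'
sample_id,forward,reverse,paired
sample1,/my/data/sample1_R1_HRremoved_raw.fastq.gz,/my/data/sample1_R2_HRremoved_raw.fastq.gz,True
sample2,/my/data/sample2_R1_HRremoved_raw.fastq.gz,/my/data/sample2_R2_HRremoved_raw.fastq.gz,True
EOF

# Succeed when every row has as many fields as the header
# and no sample_id appears more than once.
check_runsheet() {
    awk -F',' '
        NR == 1    { n = NF; next }
        NF != n    { printf "line %d: expected %d fields, found %d\n", NR, n, NF; bad = 1 }
        seen[$1]++ { printf "line %d: duplicate sample_id %s\n", NR, $1; bad = 1 }
        END        { exit (bad ? 1 : 0) }
    ' "$1"
}

check_runsheet example_runsheet.csv && echo "example_runsheet.csv looks consistent"
```

A check like this only catches structural problems (missing fields, repeated sample names); it does not verify that the fastq.gz files exist at the listed paths.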