1 change: 1 addition & 0 deletions 3rd_Party_Licenses/Metagenomics_3rd_Party_Software.md
@@ -22,3 +22,4 @@
|Snakemake|[The MIT License (MIT) Copyright (c) 2012-2019 Johannes Köster](Metagenomics_3rd_Party_Software_Licenses/License_Snakemake_6.7.0_documentation.pdf)|[https://snakemake.readthedocs.io/en/stable/project_info/license.html](https://snakemake.readthedocs.io/en/stable/project_info/license.html)|Copyright (c) 2012-2019 Johannes Köster <[email protected]> Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:|
|genelab-utils|[GNU GENERAL PUBLIC LICENSE Version 3, 29 June 2007](Metagenomics_3rd_Party_Software_Licenses/genelab-utils_LICENSE.pdf)|[https://github.com/AstrobioMike/GeneLab-utils/blob/main/LICENSE](https://github.com/AstrobioMike/GeneLab-utils/blob/main/LICENSE)|Copyright (C) 2007 Free Software Foundation, Inc <http://fsf.org/> Everyone is permitted to copy and distribute verbatim copies of this license document, but changing it is not allowed.|
|R|[GNU GENERAL PUBLIC LICENSE Version 2, June 1991, and Version 3, 29 June 2007](Metagenomics_3rd_Party_Software_Licenses/R_GPL-2_and_GPL-3_LICENSES.pdf)|[https://www.r-project.org/Licenses/](https://www.r-project.org/Licenses/)|Version 2: Copyright (C) 1989, 1991 Free Software Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA Everyone is permitted to copy and distribute verbatim copies of this license document, but changing it is not allowed.; Version 3: Copyright (C) 2007 Free Software Foundation, Inc. http://fsf.org/ Everyone is permitted to copy and distribute verbatim copies of this license document, but changing it is not allowed.|
|Nextflow|[Apache License Version 2.0, January 2004](Metagenomics_3rd_Party_Software_Licenses/Nextflow_LICENSE.pdf)|[https://github.com/nextflow-io/nextflow/blob/master/COPYING](https://github.com/nextflow-io/nextflow/blob/master/COPYING)| Grant of Copyright License. Subject to the terms and conditions of this License, each Contributor hereby grants to You a perpetual, worldwide, non-exclusive, no-charge, royalty-free, irrevocable copyright license to reproduce, prepare Derivative Works of, publicly display, publicly perform, sublicense, and distribute the Work and such Derivative Works in Source or Object form.|
Binary file not shown.
@@ -0,0 +1,169 @@
# GeneLab tracking of host reads in metagenomic datasets

> **In order to provide an estimate of host DNA in metagenomic datasets that are sequenced from host-derived samples, datasets are screened against an appropriate reference genome using [kraken2](https://github.com/DerrickWood/kraken2/wiki). Reads are not removed from the dataset, but the percentage of detected host reads is reported.**

---

**Date:** November X, 2025
**Revision:** A
**Document Number:** GL-DPPD-7109

**Submitted by:**
Jihan Yehia (GeneLab Data Processing Team)

**Approved by:**
Samrawit Gebre (OSDR Project Manager)
Danielle Lopez (OSDR Deputy Project Manager)
Jonathan Galazka (OSDR Project Scientist)
Amanda Saravia-Butler (GeneLab Science Lead)
Barbara Novak (GeneLab Data Processing Lead)

## Updates from previous revision
* Updated kraken2 from version 2.1.1 to 2.1.6
* In [Step 1](#1-build-kraken2-database-of-host-genome), used kraken2's `k2` wrapper script for `download-taxonomy` because it supports HTTPS downloads, as described [here](https://github.com/DerrickWood/kraken2/blob/master/docs/MANUAL.markdown#introducing-k2)

---

# Table of contents

- [**Software used**](#software-used)
- [**General processing overview with example commands**](#general-processing-overview-with-example-commands)
  - [**1. Build kraken2 database of host genome**](#1-build-kraken2-database-of-host-genome)
  - [**2. Identify host-classified reads**](#2-identify-host-classified-reads)
    - [Example if paired-end reads](#example-if-paired-end-reads)
    - [Example if single-end reads](#example-if-single-end-reads)
  - [**3. Generate a summary report**](#3-generate-a-summary-report)

---

# Software used

|Program|Version|Relevant Links|
|:------|:-----:|------:|
|kraken2|2.1.6|[https://github.com/DerrickWood/kraken2/wiki](https://github.com/DerrickWood/kraken2/wiki)|

---

# General processing overview with example commands

## 1. Build kraken2 database of host genome
This depends on the appropriate host genome. This example is done with the mouse genome ([GRCm39 | GCF_000001635.27](https://www.ncbi.nlm.nih.gov/assembly/GCF_000001635.27)).
> **Note:** It is recommended to download reference sequences from NCBI for use with kraken2, because sequences not downloaded from NCBI may require explicit assignment of taxonomy information before they can be used to build the database, as mentioned [here](https://github.com/DerrickWood/kraken2/blob/master/docs/MANUAL.markdown).

```bash
# downloading and decompressing reference genome
wget -q https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/001/635/GCF_000001635.27_GRCm39/GCF_000001635.27_GRCm39_genomic.fna.gz
gunzip GCF_000001635.27_GRCm39_genomic.fna.gz


# building kraken2 database
k2 download-taxonomy --db kraken2-mouse-db/
kraken2-build --add-to-library GCF_000001635.27_GRCm39_genomic.fna --no-masking --db kraken2-mouse-db/
kraken2-build --build --db kraken2-mouse-db/ --threads 30 --no-masking
kraken2-build --clean --db kraken2-mouse-db/
```

**Parameter Definitions:**

* `download-taxonomy` - downloads taxonomic mapping information via [k2 wrapper script](https://github.com/DerrickWood/kraken2/blob/master/docs/MANUAL.markdown#introducing-k2)
* `--add-to-library` - adds the fasta file to the library of sequences being included
* `--db` - specifies the directory we are putting the database in
* `--threads` - specifies the number of threads to use
* `--no-masking` - prevents [masking](https://github.com/DerrickWood/kraken2/wiki/Manual#masking-of-low-complexity-sequences) of low-complexity sequences
* `--build` - specifies to construct kraken2-formatted database
* `--clean` - specifies to remove unnecessary intermediate files

**Input data:**

* None

**Output data:**

* kraken2 database files (hash.k2d, opts.k2d, and taxo.k2d)
* reference genome used (*.fna)
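
Before moving on to classification, it can be useful to confirm that the build produced the three database files listed above. The following is a minimal sketch, not part of the GeneLab pipeline; the function name `check_kraken2_db` is hypothetical, and the check only tests that each file exists and is non-empty.

```bash
# Hypothetical helper (an assumption, not part of the pipeline): succeeds only
# if all three kraken2 database files named above exist and are non-empty.
check_kraken2_db() {
    local db_dir="$1"
    local f
    for f in hash.k2d opts.k2d taxo.k2d; do
        if [ ! -s "${db_dir}/${f}" ]; then
            echo "missing or empty: ${db_dir}/${f}" >&2
            return 1
        fi
    done
    echo "kraken2 database looks complete: ${db_dir}"
}
```

For the example above, this could be called as `check_kraken2_db kraken2-mouse-db`.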

---

## 2. Identify host-classified reads

### Example if paired-end reads

```bash
kraken2 --db kraken2-mouse-db --gzip-compressed --threads 4 --use-names --paired \
--output sample-1-kraken2-output.txt --report sample-1-kraken2-report.tsv Sample-1_R1.fastq.gz Sample-1_R2.fastq.gz
```

**Parameter Definitions:**

* `--db` - specifies the directory holding the kraken2 database files created in step 1
* `--gzip-compressed` - specifies the input fastq files are gzip-compressed
* `--threads` - specifies the number of threads to use
* `--use-names` - specifies adding taxa names in addition to taxids
* `--paired` - specifies input reads are paired-end
* `--output` - specifies the name of the kraken2 read-based output file (one line per read)
* `--report` - specifies the name of the kraken2 report output file (one line per taxon, with the number of reads assigned to it)
* last two positional arguments are the input read files

**Input data:**

* Sample-1_R1.fastq.gz (gzipped forward-reads fastq file)
* Sample-1_R2.fastq.gz (gzipped reverse-reads fastq file)

**Output data:**

* sample-1-kraken2-output.txt (kraken2 read-based output file (one line per read))
* sample-1-kraken2-report.tsv (kraken2 report output file (one line per taxon, with the number of reads assigned to it))
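
The report file can be inspected with standard command-line tools. The sketch below assumes the standard six-column, tab-separated kraken2 report layout (percentage, fragments in the clade, fragments assigned directly, rank code, NCBI taxid, name); it does not apply if additional report options change the columns. The function name `clade_fragments` is hypothetical, and taxid 10090 (*Mus musculus*) is used only as an illustration for the mouse example.

```bash
# Hypothetical helper (an assumption, not part of the pipeline): prints the
# number of fragments assigned to the clade rooted at the given NCBI taxid,
# or 0 if that taxid does not appear in the report.
# Assumes the standard six-column report: percent, clade count, direct count,
# rank code, taxid, name.
clade_fragments() {
    local report="$1" taxid="$2"
    awk -F'\t' -v taxid="$taxid" '
        $5 == taxid { print $2; found = 1 }
        END { if (!found) print 0 }' "$report"
}
```

For example, `clade_fragments sample-1-kraken2-report.tsv 10090` would print the number of fragments placed at or below the mouse species node.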

### Example if single-end reads

```bash
kraken2 --db kraken2-mouse-db --gzip-compressed --threads 4 --use-names \
--output sample-1-kraken2-output.txt --report sample-1-kraken2-report.tsv Sample-1.fastq.gz
```

**Parameter Definitions:**

* `--db` - specifies the directory holding the kraken2 database files created in step 1
* `--gzip-compressed` - specifies the input fastq files are gzip-compressed
* `--threads` - specifies the number of threads to use
* `--use-names` - specifies adding taxa names in addition to taxids
* `--output` - specifies the name of the kraken2 read-based output file (one line per read)
* `--report` - specifies the name of the kraken2 report output file (one line per taxon, with the number of reads assigned to it)
* last positional argument is the input read file

**Input data:**

* Sample-1.fastq.gz (gzipped reads fastq file)

**Output data:**

* sample-1-kraken2-output.txt (kraken2 read-based output file (one line per read))
* sample-1-kraken2-report.tsv (kraken2 report output file (one line per taxon, with the number of reads assigned to it))
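
The paired-end and single-end commands above differ only in the `--paired` flag and the number of input read files. As a minimal sketch of that difference (not part of the GeneLab pipeline), the hypothetical function `build_kraken2_cmd` below prints the command for a sample without running it, adding `--paired` only when two read files are given. The output file naming follows the examples above; everything else about the function is an assumption.

```bash
# Hypothetical helper (an assumption, not part of the pipeline): prints, but
# does not run, the kraken2 command for one sample.
# Usage: build_kraken2_cmd <sample_id> <db_dir> <reads> [<reverse_reads>]
build_kraken2_cmd() {
    local sample="$1" db="$2"
    shift 2
    local paired_flag=""
    # two read files means paired-end input, so --paired is added
    if [ "$#" -eq 2 ]; then
        paired_flag=" --paired"
    fi
    echo "kraken2 --db ${db} --gzip-compressed --threads 4 --use-names${paired_flag} --output ${sample}-kraken2-output.txt --report ${sample}-kraken2-report.tsv $*"
}
```

Printing the command first makes it easy to review what would be run for each sample, for example `build_kraken2_cmd sample-1 kraken2-mouse-db Sample-1_R1.fastq.gz Sample-1_R2.fastq.gz`.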

---

## 3. Generate a summary report
The commands below assume a Unix-like command-line environment.

```bash
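# counting all fragments (the kraken2 output file has one line per read, or per read pair if paired-end)
total_fragments=$(wc -l sample-1-kraken2-output.txt | sed 's/^ *//' | cut -f 1 -d " ")

# counting fragments classified as host (lines whose first column is "C")
fragments_classified=$(grep -w -c "^C" sample-1-kraken2-output.txt)

# calculating the percent of fragments classified as host, reported to 2 decimal places
perc_host=$(printf "%.2f\n" $(echo "scale=4; ${fragments_classified} / ${total_fragments} * 100" | bc -l))

# writing the header and the sample's row to the summary table
cat <( printf "Sample_ID\tTotal_fragments\tTotal_host_fragments\tPercent_host\n" ) \
<( printf "Sample-1\t${total_fragments}\t${fragments_classified}\t${perc_host}\n" ) > Host-read-count-summary.tsv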
```

**Input data:**

* sample-1-kraken2-output.txt (kraken2 read-based output file (one line per read))
* sample-1-kraken2-report.tsv (kraken2 report output file (one line per taxa, with number of reads assigned to it))

**Output data:**

* Host-read-count-summary.tsv (a tab-separated file with 4 columns: "Sample\_ID", "Total\_fragments", "Total\_host\_fragments", "Percent\_host")

*Note: The percent host reads estimated for each sample is provided in the assay table on the [Open Science Data Repository (OSDR)](https://osdr.nasa.gov/bio/repo/).*
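
The commands above summarize a single sample. As a minimal sketch of how the same four columns could be produced for several samples, the hypothetical functions below use `awk` instead of `wc`, `grep`, and `bc`. They are not part of the GeneLab pipeline: the function names and the assumption that each file is named `<sample_id>-kraken2-output.txt` are illustrative. Because `awk` rounds while the `bc` command above truncates the intermediate ratio, the last digit of the percentage may differ.

```bash
# Hypothetical helper (an assumption, not part of the pipeline): prints one
# summary row for a sample from its kraken2 per-read output file, where the
# first column is "C" (classified) or "U" (unclassified).
host_summary_row() {
    local sample_id="$1" kraken2_output="$2"
    awk -F'\t' -v id="$sample_id" '
        { total++ }
        $1 == "C" { classified++ }
        END {
            # guard against an empty file to avoid dividing by zero
            pct = (total > 0) ? 100 * classified / total : 0
            printf "%s\t%d\t%d\t%.2f\n", id, total, classified, pct
        }' "$kraken2_output"
}

# Hypothetical helper: prints the header and one row per sample ID, assuming
# each file is named <sample_id>-kraken2-output.txt in the current directory.
host_summary_table() {
    printf "Sample_ID\tTotal_fragments\tTotal_host_fragments\tPercent_host\n"
    local sample_id
    for sample_id in "$@"; do
        host_summary_row "$sample_id" "${sample_id}-kraken2-output.txt"
    done
}
```

A combined table could then be written with, for example, `host_summary_table sample-1 sample-2 > Host-read-count-summary.tsv`.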

---
9 changes: 5 additions & 4 deletions Metagenomics/Estimate_host_reads_in_raw_data/README.md
@@ -1,6 +1,6 @@
# GeneLab pipeline for estimating host reads in Illumina metagenomics sequencing data
# GeneLab pipeline for estimating host reads in metagenomics sequencing data

> **The document [`GL-DPPD-7109.md`](Pipeline_GL-DPPD-7109_Versions/GL-DPPD-7109.md) holds an overview and example commands for how GeneLab identifies and provides an estimate of host DNA in Illumina metagenomics sequencing datasets that are sequenced from host-derived samples. See the [Repository Links](#repository-links) descriptions below for more information. The percentage of detected host reads and a GeneLab host read estimation summary are provided for each GLDS dataset in the [Open Science Data Repository (OSDR)](https://osdr.nasa.gov/bio/repo/).**
> **The document [`GL-DPPD-7109-A.md`](Pipeline_GL-DPPD-7109_Versions/GL-DPPD-7109-A.md) holds an overview and example commands for how GeneLab identifies and provides an estimate of host DNA in metagenomics sequencing datasets that are sequenced from host-derived samples. See the [Repository Links](#repository-links) descriptions below for more information. The percentage of detected host reads and a GeneLab host read estimation summary are provided for each GLDS dataset in the [Open Science Data Repository (OSDR)](https://osdr.nasa.gov/bio/repo/).**
>
> Note: The exact host read identification commands and MGEstHostReads version used for specific GLDS datasets can be found in the *_processing_info.tar file under "Study Files" for each respective GLDS dataset in the [Open Science Data Repository (OSDR)](https://osdr.nasa.gov/bio/repo/).

@@ -10,7 +10,7 @@

* [**Pipeline_GL-DPPD-7109_Versions**](Pipeline_GL-DPPD-7109_Versions)

- Contains the current and previous GeneLab pipeline for identifying host reads in Illumina metagenomics sequencing data (MGEstHostReads) versions documentation
- Contains the current and previous GeneLab pipeline for identifying host reads in metagenomics sequencing data (MGEstHostReads) versions documentation

* [**Workflow_Documentation**](Workflow_Documentation)

@@ -19,4 +19,5 @@
---

**Developed and maintained by:**
Michael D. Lee ([email protected])
Michael D. Lee ([email protected])
Jihan Yehia ([email protected])
@@ -0,0 +1,29 @@
# Workflow change log

All notable changes to this project will be documented in this file.

The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.0.0/),
and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0.html).


## [1.0.0](https://github.com/nasa/GeneLab_Data_Processing/tree/master/Metagenomics/Estimate_host_reads_in_raw_data/Workflow_Documentation/NF_MGEstHostReads)

### Changed
- Updated to the latest pipeline version [GL-DPPD-7109-A](../../Pipeline_GL-DPPD-7109_Versions/GL-DPPD-7109-A.md)
of the GeneLab Estimate-Host-Reads consensus processing pipeline.
- Implemented the pipeline as a Nextflow workflow [NF_MGEstHostReads](./) rather than as a Snakemake workflow, as in
previous workflow versions.

### Added
- Option to pull a dataset from OSDR using dp_tools
- Build kraken2 database from scratch using host organism's information pulled from [hosts.csv](workflow_code/assets/hosts.csv)
- Create protocol.txt as an output file describing workflow methods

### Removed
- kraken2-mouse-db/ no longer needed as part of the workflow files (can now be explicitly set or built from scratch in case it doesn't exist)

<BR>

---

> ***Note:** The change log of the Snakemake workflow (SW_MGEstHostReads) that is associated with the previous version of the GeneLab Estimate-Host-Reads Pipeline [GL-DPPD-7109](../../Pipeline_GL-DPPD-7109_Versions/GL-DPPD-7109.md) can be found [here](../SW_MGEstHostReads/CHANGELOG.md).*