Commit da7b999

Merge pull request #1388 from egreenberg7/dev
New module: Kraken2/Bracken on Unaligned Sequences for Contamination Detection
2 parents 0b4125d + 02f65ab commit da7b999

34 files changed, with 1430 additions and 201 deletions

CHANGELOG.md

Lines changed: 28 additions & 0 deletions
@@ -7,8 +7,36 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
 
 ### Enhancements & fixes
 
+- [PR #1388](https://github.com/nf-core/rnaseq/pull/1351) - Adding Kraken2/Bracken on unaligned reads as an additional quality control step to detect sample contamination
 - [PR #1186](https://github.com/nf-core/rnaseq/pull/1186) - Bump pipeline version to 3.16.0dev
 
+### Parameters
+
+| Old parameter | New parameter |
+| ------------- | --------------------------- |
+| | `--contaminant_screening` |
+| | `--kraken_db` |
+| | `--save_kraken_assignments` |
+| | `--save_kraken_unassigned` |
+| | `--bracken_precision` |
+
+> **NB:** Parameter has been **updated** if both old and new parameter information is present.
+> **NB:** Parameter has been **added** if just the new parameter information is present.
+> **NB:** Parameter has been **removed** if new parameter information isn't present.
+
+### Software dependencies
+
+| Dependency | Old version | New version |
+| ---------- | ----------- | ----------- |
+| `Kraken2` | ----------- | 2.1.3 |
+| `Bracken` | ----------- | 2.9 |
+
+> **NB:** Dependency has been **updated** if both old and new version information is present.
+>
+> **NB:** Dependency has been **added** if just the new version information is present.
+>
+> **NB:** Dependency has been **removed** if new version information isn't present.
+
 ## [[3.15.1](https://github.com/nf-core/rnaseq/releases/tag/3.15.1)] - 2024-09-16
 
 ### Enhancements & fixes

CITATIONS.md

Lines changed: 8 additions & 0 deletions
@@ -16,6 +16,10 @@
 
 > Quinlan AR, Hall IM. BEDTools: a flexible suite of utilities for comparing genomic features. Bioinformatics. 2010 Mar 15;26(6):841-2. doi: 10.1093/bioinformatics/btq033. Epub 2010 Jan 28. PubMed PMID: 20110278; PubMed Central PMCID: PMC2832824.
 
+- [Bracken](https://doi.org/10.7717/peerj-cs.104)
+
+> Lu, J., Breitwieser, F. P., Thielen, P., & Salzberg, S. L. (2017). Bracken: estimating species abundance in metagenomics data. PeerJ. Computer Science, 3(e104), e104. https://doi.org/10.7717/peerj-cs.104
+
 - [fastp](https://www.ncbi.nlm.nih.gov/pubmed/30423086/)
 
 > Chen S, Zhou Y, Chen Y, Gu J. fastp: an ultra-fast all-in-one FASTQ preprocessor. Bioinformatics. 2018 Sep 1;34(17):i884-i890. doi: 10.1093/bioinformatics/bty560. PubMed PMID: 30423086; PubMed Central PMCID: PMC6129281.
@@ -38,6 +42,10 @@
 
 > Kim D, Paggi JM, Park C, Bennett C, Salzberg SL. Graph-based genome alignment and genotyping with HISAT2 and HISAT-genotype. Nat Biotechnol. 2019 Aug;37(8):907-915. doi: 10.1038/s41587-019-0201-4. Epub 2019 Aug 2. PubMed PMID: 31375807.
 
+- [Kraken2](https://doi.org/10.1186/s13059-019-1891-0)
+
+> Wood, D. E., Lu, J., & Langmead, B. (2019). Improved metagenomic analysis with Kraken 2. Genome Biology, 20(1), 257. https://doi.org/10.1186/s13059-019-1891-0
+
 - [MultiQC](https://pubmed.ncbi.nlm.nih.gov/27312411/)
 
 > Ewels P, Magnusson M, Lundin S, Käller M. MultiQC: summarize analysis results for multiple tools and samples in a single report. Bioinformatics. 2016 Oct 1;32(19):3047-8. doi: 10.1093/bioinformatics/btw354. Epub 2016 Jun 16. PubMed PMID: 27312411; PubMed Central PMCID: PMC5039924.

README.md

Lines changed: 1 addition & 0 deletions
@@ -46,6 +46,7 @@
 3. [`dupRadar`](https://bioconductor.org/packages/release/bioc/html/dupRadar.html)
 4. [`Preseq`](http://smithlabresearch.org/software/preseq/)
 5. [`DESeq2`](https://bioconductor.org/packages/release/bioc/html/DESeq2.html)
+6. [`Kraken2`](https://ccb.jhu.edu/software/kraken2/) -> [`Bracken`](https://ccb.jhu.edu/software/bracken/) on unaligned sequences; _optional_
 15. Pseudoalignment and quantification ([`Salmon`](https://combine-lab.github.io/salmon/) or [`Kallisto`](https://pachterlab.github.io/kallisto/); _optional_)
 16. Present QC for raw read, alignment, gene biotype, sample similarity, and strand-specificity checks ([`MultiQC`](http://multiqc.info/), [`R`](https://www.r-project.org/))
 

docs/images/bracken-top-n-plot.png

54.7 KB
10.1 KB

docs/images/nf-core-rnaseq_metro_map_grey.svg

Lines changed: 235 additions & 179 deletions

docs/output.md

Lines changed: 21 additions & 1 deletion
@@ -40,6 +40,7 @@ The pipeline is built using [Nextflow](https://www.nextflow.io/) and processes d
 - [Preseq](#preseq) - Estimation of library complexity
 - [featureCounts](#featurecounts) - Read counting relative to gene biotype
 - [DESeq2](#deseq2) - PCA plot and sample pairwise distance heatmap and dendrogram
+- [Kraken2/Bracken](#kraken2bracken) - Taxonomic classification of unaligned reads
 - [MultiQC](#multiqc) - Present QC for raw reads, alignment, read counting and sample similarity
 - [Pseudoalignment and quantification](#pseudoalignment-and-quantification)
 - [Salmon](#pseudoalignment) - Wicked fast gene and isoform quantification relative to the transcriptome
@@ -656,6 +657,25 @@ The plot on the left hand side shows the standard PC plot - notice the variable
 
 <p align="center"><img src="images/mqc_deseq2_clustering.png" alt="MultiQC - DESeq2 sample similarity plot" width="600"></p>
 
+### Kraken2/Bracken
+
+<details markdown="1">
+<summary>Output files</summary>
+
+- `<ALIGNER>/contaminants/kraken2/kraken_reports`
+  - `*.kraken2.report.txt`: Classification of unaligned reads in the Kraken report format. See the [kraken2 manual](https://github.com/DerrickWood/kraken2/wiki/Manual#output-formats) for more details.
+  - `*.classified*.fastq.gz`: If `--save_kraken_assignments` is set, a FASTQ file for each sample in which each classified read is annotated with its taxonomic identification from Kraken2.
+  - `*.unclassified*.fastq.gz`: If `--save_kraken_unassigned` is set, a FASTQ file containing all reads that were not classified by Kraken2.
+- `<ALIGNER>/contaminants/bracken/`
+  - `*.kraken2.report_bracken.txt`: Kraken-style reports of the Bracken abundance estimates. See the [kraken2 manual](https://github.com/DerrickWood/kraken2/wiki/Manual#output-formats) for more details.
+  - `*.tsv`: Summary of the estimated reads for each taxon at the given classification level and the corrections made to the Kraken2 counts.
+
+</details>
+
+[Kraken2](https://ccb.jhu.edu/software/kraken2/) is a taxonomic classification tool that uses k-mer matches paired with a lowest common ancestor (LCA) algorithm to classify reads. [Bracken](https://ccb.jhu.edu/software/bracken/) is a statistical method that generates abundance estimates from the Kraken2 output. Both are run on unaligned sequences to detect potential sample contamination. MultiQC reports the top 5 taxa detected at the classification level used for Bracken, with toggles available for higher taxonomic levels. If Bracken is skipped, MultiQC will report the top 5 species detected by Kraken2.
+
+![MultiQC - Bracken top species plot](images/bracken-top-n-plot.png)
+
 ### MultiQC
 
 <details markdown="1">
@@ -675,7 +695,7 @@ Results generated by MultiQC collate pipeline QC from supported tools i.e. FastQ
 
 ### Pseudoalignment
 
-The principal output files are the same between Salmon and Kallsto:
+The principal output files are the same between Salmon and Kallisto:
 
 <details markdown="1">
 <summary>Output files</summary>
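
The `*.kraken2.report.txt` and `*.kraken2.report_bracken.txt` files described in the docs/output.md changes above can be inspected outside MultiQC. The following is a minimal sketch, not part of the pipeline, that assumes the standard six-column, tab-separated Kraken report layout (percentage, clade reads, directly assigned reads, rank code, NCBI taxonomy ID, indented name). The function names `parse_kraken_report` and `top_taxa` are hypothetical, and the read counts in the example report are made up for illustration.

```python
from typing import NamedTuple


class Taxon(NamedTuple):
    percent: float     # percentage of reads in the clade rooted at this taxon
    clade_reads: int   # number of reads in the clade rooted at this taxon
    rank: str          # rank code, e.g. "S" for species, "G" for genus
    taxid: str         # NCBI taxonomy ID
    name: str          # scientific name with report indentation stripped


def parse_kraken_report(text):
    """Parse a standard six-column Kraken-style report (assumed layout)."""
    taxa = []
    for line in text.splitlines():
        if not line.strip():
            continue
        fields = line.split("\t")
        if len(fields) != 6:
            raise ValueError(f"expected 6 tab-separated columns, got {len(fields)}")
        percent, clade_reads, _direct_reads, rank, taxid, name = fields
        taxa.append(
            Taxon(float(percent), int(clade_reads), rank.strip(), taxid.strip(), name.strip())
        )
    return taxa


def top_taxa(taxa, rank="S", n=5):
    """Return the n taxa at the given rank with the most clade-level reads."""
    at_rank = [t for t in taxa if t.rank == rank]
    return sorted(at_rank, key=lambda t: t.clade_reads, reverse=True)[:n]


# Hypothetical report content; the counts are invented for this example.
EXAMPLE_REPORT = "\n".join([
    "10.00\t100\t100\tU\t0\tunclassified",
    "90.00\t900\t5\tR\t1\troot",
    "60.00\t600\t600\tS\t9606\t  Homo sapiens",
    "25.00\t250\t250\tS\t562\t  Escherichia coli",
    "5.00\t50\t50\tS\t4932\t  Saccharomyces cerevisiae",
])

if __name__ == "__main__":
    for taxon in top_taxa(parse_kraken_report(EXAMPLE_REPORT), rank="S", n=5):
        print(f"{taxon.name}\t{taxon.clade_reads}\t{taxon.percent:.2f}%")
```

Reports written with extra minimizer columns would not match this layout, so the sketch raises an error rather than guessing at the column order.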

docs/usage.md

Lines changed: 8 additions & 0 deletions
@@ -296,6 +296,14 @@ Notes:
 
 By default, the input GTF file will be filtered to ensure that sequence names correspond to those in the genome fasta file, and to remove rows with empty transcript identifiers. Filtering can be bypassed completely where you are confident it is not necessary, using the `--skip_gtf_filter` parameter. If you just want to skip the 'transcript_id' checking component of the GTF filtering script used in the pipeline this can be disabled specifically using the `--skip_gtf_transcript_filter` parameter.
 
+## Contamination screening options
+
+The pipeline provides the option to scan unaligned reads for contamination from other species using [Kraken2](https://ccb.jhu.edu/software/kraken2/), with the possibility of applying corrections from [Bracken](https://ccb.jhu.edu/software/bracken/). Since running Bracken is not computationally expensive, we recommend always using it to refine the abundance estimates generated by Kraken2.
+
+It is important to note that the accuracy of Kraken2 is [highly dependent on the database](https://doi.org/10.1099/mgen.0.000949) used. Specifically, it is [crucial](https://doi.org/10.1128/mbio.01607-23) to ensure that the host genome is included in the database. If you are particularly concerned about certain contaminants, it may be beneficial to use a smaller, more focused database containing primarily those contaminants instead of the full standard database. Various pre-built databases [are available for download](https://benlangmead.github.io/aws-indexes/k2), and instructions for building a custom database can be found in the [Kraken2 documentation](https://github.com/DerrickWood/kraken2/blob/master/docs/MANUAL.markdown). Additionally, genomes of contaminants detected in previous sequencing experiments are available on the [OpenContami website](https://openlooper.hgc.jp/opencontami/help/help_oct.php).
+
+While Kraken2 is capable of detecting low-abundance contaminants in a sample, false positives can occur. Therefore, if only a very small number of reads from a contaminating species are detected, these results should be interpreted with caution.
+
 ## Running the pipeline
 
 The typical command for running the pipeline is as follows:
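
As a rough illustration of the caution advised in the contamination screening notes in this docs/usage.md change, the sketch below separates taxa with very few estimated reads from better-supported ones. It is not part of the pipeline: the function name `split_by_read_support`, the `min_reads` cutoff of 10, and the example read counts are all assumptions made for this example, and an appropriate cutoff would depend on sequencing depth and the database used.

```python
def split_by_read_support(estimates, min_reads=10):
    """Split {taxon: estimated_reads} into (supported, low_support) dicts.

    min_reads is an arbitrary illustrative cutoff, not a pipeline default.
    """
    supported = {}
    low_support = {}
    for taxon, reads in estimates.items():
        # Taxa below the cutoff are kept, but reported separately.
        target = supported if reads >= min_reads else low_support
        target[taxon] = reads
    return supported, low_support


# Made-up estimates for a single sample.
example_estimates = {
    "Homo sapiens": 48210,
    "Escherichia coli": 1375,
    "Ralstonia pickettii": 4,
}

supported, low_support = split_by_read_support(example_estimates)
print("Supported:", sorted(supported))
print("Interpret with caution:", sorted(low_support))
```

Keeping the low-support taxa in a separate dictionary, rather than discarding them, leaves the decision about whether they are false positives to the person reviewing the sample.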

modules.json

Lines changed: 12 additions & 1 deletion
@@ -15,6 +15,11 @@
     "git_sha": "06c8865e36741e05ad32ef70ab3fac127486af48",
     "installed_by": ["modules"]
 },
+"bracken/bracken": {
+    "branch": "master",
+    "git_sha": "c214fad97b328eb6d6233f779be9ba44814a9136",
+    "installed_by": ["modules"]
+},
 "cat/fastq": {
     "branch": "master",
     "git_sha": "06c8865e36741e05ad32ef70ab3fac127486af48",
@@ -68,7 +73,8 @@
 "hisat2/align": {
     "branch": "master",
     "git_sha": "ad30f90cfc383dfaa505771d24f9e292c53157ab",
-    "installed_by": ["fastq_align_hisat2"]
+    "installed_by": ["fastq_align_hisat2"],
+    "patch": "modules/nf-core/hisat2/align/hisat2-align.diff"
 },
 "hisat2/build": {
     "branch": "master",
@@ -90,6 +96,11 @@
     "git_sha": "06c8865e36741e05ad32ef70ab3fac127486af48",
     "installed_by": ["modules", "quantify_pseudo_alignment"]
 },
+"kraken2/kraken2": {
+    "branch": "master",
+    "git_sha": "a13d5d945742a60bbef6e5c177e81cda540f75dc",
+    "installed_by": ["modules"]
+},
 "multiqc": {
     "branch": "master",
     "git_sha": "06c8865e36741e05ad32ef70ab3fac127486af48",

modules/nf-core/bracken/bracken/environment.yml

Lines changed: 7 additions & 0 deletions