Skip to content

Commit 8256cc6

Browse files
authored
Merge pull request #830 from drpatelh/updates
Fixes for v3.8 release
2 parents f0eaed4 + ac2fcae commit 8256cc6

File tree

17 files changed

+484
-21
lines changed

17 files changed

+484
-21
lines changed

.prettierignore

Lines changed: 1 addition & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -6,4 +6,4 @@ results/
66
.DS_Store
77
testing/
88
testing*
9-
*.pyc
9+
*.pyc

CHANGELOG.md

Lines changed: 14 additions & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -3,14 +3,27 @@
33
The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.0.0/)
44
and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0.html).
55

6-
## [Unpublished Version / DEV]
6+
## [[3.8](https://github.com/nf-core/rnaseq/releases/tag/3.8)] - 2022-05-25
7+
8+
### :warning: Major enhancements
9+
10+
Fixed quite a well hidden bug in the UMI processing mode of the pipeline when using `--with_umi --aligner star_salmon` as reported by [Lars Roed Ingerslev](https://github.com/lars-work-sund). Paired-end BAM files were not appropriately name sorted after `umi_tools dedup` which ultimately resulted in incorrect reading and quantification with Salmon. If you have used previous versions of the pipeline to analyse paired-end UMI data it will need to be reprocessed using this version of the pipeline. See [#828](https://github.com/nf-core/rnaseq/issues/828) for more context.
711

812
### Enhancements & fixes
913

14+
- [[#824](https://github.com/nf-core/rnaseq/issues/824)] - Add explicit docs for usage of featureCounts in the pipeline
15+
- [[#825](https://github.com/nf-core/rnaseq/issues/825)] - Pipeline fails due to trimming related removal of all reads from a sample
16+
- [[#827](https://github.com/nf-core/rnaseq/issues/827)] - Control generation of --output-stats when running umi-tools dedup
17+
- [[#828](https://github.com/nf-core/rnaseq/issues/828)] - Filter BAM output of UMI-tools dedup before passing to Salmon quant
1018
- Updated pipeline template to [nf-core/tools 2.4.1](https://github.com/nf-core/tools/releases/tag/2.4.1)
1119

1220
### Parameters
1321

22+
| Old parameter | New parameter |
23+
| ------------- | ------------------------ |
24+
| | `--min_trimmed_reads` |
25+
| | `--umitools_dedup_stats` |
26+
1427
## [[3.7](https://github.com/nf-core/rnaseq/releases/tag/3.7)] - 2022-05-03
1528

1629
### :warning: Major enhancements

LICENSE

Lines changed: 1 addition & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -1,6 +1,6 @@
11
MIT License
22

3-
Copyright (c) Phil Ewels, Rickard Hammarén
3+
Copyright (c) Harshil Patel, Phil Ewels, Rickard Hammarén
44

55
Permission is hereby granted, free of charge, to any person obtaining a copy
66
of this software and associated documentation files (the "Software"), to deal

assets/multiqc_config.yml

Lines changed: 10 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -28,6 +28,7 @@ run_modules:
2828

2929
# Order of modules
3030
top_modules:
31+
- "fail_trimmed_samples"
3132
- "fail_mapped_samples"
3233
- "fail_strand_check"
3334
- "star_rsem_deseq2_pca"
@@ -140,6 +141,15 @@ sp:
140141

141142
# See https://github.com/ewels/MultiQC_TestData/blob/master/data/custom_content/with_config/table_headerconfig/multiqc_config.yaml
142143
custom_data:
144+
fail_trimmed_samples:
145+
section_name: "WARNING: Fail Trimming Check"
146+
description: "List of samples that failed the minimum trimmed reads threshold specified via the '--min_trimmed_reads' parameter, and hence were ignored for the downstream processing steps."
147+
plot_type: "table"
148+
pconfig:
149+
id: "fail_trimmed_samples_table"
150+
table_title: "Samples failed trimming threshold"
151+
namespace: "Samples failed trimming threshold"
152+
format: "{:.0f}"
143153
fail_mapped_samples:
144154
section_name: "WARNING: Fail Alignment Check"
145155
description: "List of samples that failed the STAR minimum mapped reads threshold specified via the '--min_mapped_reads' parameter, and hence were ignored for the downstream processing steps."

bin/prepare-for-rsem.py

Lines changed: 270 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,270 @@
1+
#!/usr/bin/env python3
2+
3+
'''
4+
==============================================================================
5+
Credits
6+
==============================================================================
7+
8+
This script is a clone of the "prepare-for-rsem.py" script written by
9+
Ian Sudbury, Tom Smith and other contributors to the UMI-tools package:
10+
https://github.com/CGATOxford/UMI-tools
11+
12+
It has been included here to address problems encountered with
13+
Salmon quant and RSEM as discussed in the issue below:
14+
https://github.com/CGATOxford/UMI-tools/issues/465
15+
16+
When the "umi_tools prepare-for-rsem" command becomes available in an official
17+
UMI-tools release this script will be replaced and deprecated.
18+
19+
Commit:
20+
https://github.com/CGATOxford/UMI-tools/blob/bf8608d6a172c5ca0dcf33c126b4e23429177a72/umi_tools/prepare-for-rsem.py
21+
22+
==============================================================================
23+
prepare_for_rsem - make the output from dedup or group compatible with RSEM
24+
===============================================================================
25+
The SAM format specification states that the mnext and mpos fields should point
26+
to the primary alignment of a read's mate. However, not all aligners adhere to
27+
this standard. In addition, the RSEM software requires that the mate of a read1
28+
appears directly after it in its input BAM. This requires that there is exactly
29+
one read1 alignment for every read2 and vice versa.
30+
31+
In general (except in a few edge cases) UMI tools outputs only the read2 to that
32+
corresponds to the read specified in the mnext and mpos positions of a selected
33+
read1, and only outputs this read once, even if multiple read1s point to it.
34+
This makes UMI-tools outputs incompatible with RSEM. This script takes the output
35+
from dedup or groups and ensures that each read1 has exactly one read2 (and vice
36+
versa), that read2 always appears directly after read1,and that pairs point to
37+
each other (note this is technically not valid SAM format). Copy any specified
38+
tags from read1 to read2 if they are present (by default, UG and BX, the unique
39+
group and correct UMI tags added by _group_)
40+
41+
Input must to name sorted.
42+
43+
44+
https://raw.githubusercontent.com/CGATOxford/UMI-tools/master/LICENSE
45+
46+
'''
47+
48+
from umi_tools import Utilities as U
49+
from collections import defaultdict, Counter
50+
import pysam
51+
import sys
52+
53+
usage = '''
54+
prepare_for_rsem - make output from dedup or group compatible with RSEM
55+
56+
Usage: umi_tools prepare_for_rsem [OPTIONS] [--stdin=IN_BAM] [--stdout=OUT_BAM]
57+
58+
note: If --stdout is ommited, standard out is output. To
59+
generate a valid BAM file on standard out, please
60+
redirect log with --log=LOGFILE or --log2stderr '''
61+
62+
def chunk_bam(bamfile):
63+
'''Take in a iterator of pysam.AlignmentSegment entries and yeild
64+
lists of reads that all share the same name'''
65+
66+
last_query_name = None
67+
output_buffer = list()
68+
69+
for read in bamfile:
70+
71+
if last_query_name is not None and last_query_name != read.query_name:
72+
yield(output_buffer)
73+
output_buffer = list()
74+
75+
last_query_name = read.query_name
76+
output_buffer.append(read)
77+
78+
yield (output_buffer)
79+
80+
def copy_tags(tags, read1, read2):
81+
'''Given a list of tags, copies the values of these tags from read1
82+
to read2, if the tag is set'''
83+
84+
for tag in tags:
85+
86+
try:
87+
read1_tag = read1.get_tag(tag, with_value_type=True)
88+
read2.set_tag(tag, value=read1_tag[0], value_type=read1_tag[1])
89+
except KeyError:
90+
pass
91+
92+
return(read2)
93+
94+
def pick_mate(read, template_dict, mate_key):
95+
'''Find the mate of read in the template dict using key. It will retreieve
96+
all reads at that key, and then scan to pick the one that refers to _read_
97+
as it's mate. If there is no such read, it picks a first one it comes to'''
98+
99+
mate = None
100+
101+
# get a list of secondary reads at the correct alignment position
102+
potential_mates = template_dict[not read.is_read1][mate_key]
103+
104+
# search through one at a time to find a read that points to the current read
105+
# as its mate.
106+
for candidate_mate in potential_mates:
107+
if candidate_mate.next_reference_name == read.reference_name and \
108+
candidate_mate.next_reference_start == read.pos:
109+
mate = candidate_mate
110+
111+
# if no such read is found, then pick any old secondary alignment at that position
112+
# note: this happens when UMI-tools outputs the wrong read as something's pair.
113+
if mate is None and len(potential_mates) >0:
114+
mate = potential_mates[0]
115+
116+
return mate
117+
118+
def main(argv=None):
119+
120+
if argv is None:
121+
argv = sys.argv
122+
123+
# setup command line parser
124+
parser = U.OptionParser(version="%prog version: $Id$",
125+
usage=usage,
126+
description=globals()["__doc__"])
127+
group = U.OptionGroup(parser, "RSEM preparation specific options")
128+
129+
group.add_option("--tags", dest="tags", type="string",
130+
default="UG,BX",
131+
help="Comma-seperated list of tags to transfer from read1 to read2")
132+
group.add_option("--sam", dest="sam", action="store_true",
133+
default=False,
134+
help="input and output SAM rather than BAM")
135+
136+
parser.add_option_group(group)
137+
138+
# add common options (-h/--help, ...) and parse command line
139+
(options, args) = U.Start(parser, argv=argv,
140+
add_group_dedup_options=False,
141+
add_umi_grouping_options=False,
142+
add_sam_options=False)
143+
144+
skipped_stats = Counter()
145+
146+
if options.stdin != sys.stdin:
147+
in_name = options.stdin.name
148+
options.stdin.close()
149+
else:
150+
in_name = "-"
151+
152+
if options.sam:
153+
mode = ""
154+
else:
155+
mode = "b"
156+
157+
inbam = pysam.AlignmentFile(in_name, "r"+mode)
158+
159+
if options.stdout != sys.stdout:
160+
out_name = options.stdout.name
161+
options.stdout.close()
162+
else:
163+
out_name = "-"
164+
165+
outbam = pysam.AlignmentFile(out_name, "w" + mode, template = inbam)
166+
167+
options.tags = options.tags.split(",")
168+
169+
for template in chunk_bam(inbam):
170+
171+
assert len(set(r.query_name for r in template)) == 1
172+
current_template = {True: defaultdict(list),
173+
False: defaultdict(list)}
174+
175+
for read in template:
176+
key = (read.reference_name, read.pos, not read.is_secondary)
177+
current_template[read.is_read1][key].append(read)
178+
179+
output = set()
180+
181+
for read in template:
182+
183+
mate = None
184+
185+
# if this read is a non_primary alignment, we first want to check if it has a mate
186+
# with the non-primary alignment flag set.
187+
188+
mate_key_primary = ( True)
189+
mate_key_secondary = (read.next_reference_name, read.next_reference_start, False)
190+
191+
# First look for a read that has the same primary/secondary status
192+
# as read (i.e. secondary mate for secondary read, and primary mate
193+
# for primary read)
194+
mate_key = (read.next_reference_name, read.next_reference_start,
195+
read.is_secondary)
196+
mate = pick_mate(read, current_template, mate_key)
197+
198+
# If none was found then look for the opposite (primary mate of secondary
199+
# read or seconadary mate of primary read)
200+
if mate is None:
201+
mate_key = (read.next_reference_name, read.next_reference_start,
202+
not read.is_secondary)
203+
mate = pick_mate(read, current_template, mate_key)
204+
205+
# If we still don't have a mate, then their can't be one?
206+
if mate is None:
207+
skipped_stats["no_mate"] += 1
208+
U.warn("Alignment {} has no mate -- skipped".format(
209+
"\t".join(map(str, [read.query_name, read.flag, read.reference_name, int(read.pos)]))
210+
))
211+
continue
212+
213+
# because we might want to make changes to the read, but not have those changes reflected
214+
# if we need the read again,we copy the read. This is only way I can find to do this.
215+
read = pysam.AlignedSegment().from_dict(read.to_dict(), read.header)
216+
mate = pysam.AlignedSegment().from_dict(mate.to_dict(), read.header)
217+
218+
# Make it so that if our read is secondary, the mate is also secondary. We don't make the
219+
# mate primary if the read is primary because we would otherwise end up with mulitple
220+
# primary alignments.
221+
if read.is_secondary:
222+
mate.is_secondary = True
223+
224+
# In a situation where there is already one mate for each read, then we will come across
225+
# each pair twice - once when we scan read1 and once when we scan read2. Thus we need
226+
# to make sure we don't output something already output.
227+
if read.is_read1:
228+
229+
mate = copy_tags(options.tags, read, mate)
230+
output_key = str(read) + str(mate)
231+
232+
if output_key not in output:
233+
output.add(output_key)
234+
outbam.write(read)
235+
outbam.write(mate)
236+
skipped_stats["pairs_output"] += 1
237+
238+
elif read.is_read2:
239+
240+
read = copy_tags(options.tags, mate, read)
241+
output_key = str(mate) + str(read)
242+
243+
if output_key not in output:
244+
output.add(output_key)
245+
outbam.write(mate)
246+
outbam.write(read)
247+
skipped_stats["pairs_output"] += 1
248+
249+
else:
250+
skipped_stats["skipped_not_read12"] += 1
251+
U.warn("Alignment {} is neither read1 nor read2 -- skipped".format(
252+
"\t".join(map(str, [read.query_name, read.flag, read.reference_name, int(read.pos)]))
253+
))
254+
continue
255+
256+
if not out_name == "-":
257+
outbam.close()
258+
259+
U.info("Total pairs output: {}, Pairs skipped - no mates: {},"
260+
" Pairs skipped - not read1 or 2: {}".format(
261+
skipped_stats["pairs_output"],
262+
skipped_stats["no_mate"],
263+
skipped_stats["skipped_not_read12"]))
264+
U.Stop()
265+
266+
if __name__ == "__main__":
267+
sys.exit(main(sys.argv))
268+
269+
270+

conf/modules.config

Lines changed: 19 additions & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -204,6 +204,13 @@ if (!params.skip_trimming) {
204204
]
205205
]
206206
}
207+
208+
withName: 'MULTIQC_TSV_FAIL_TRIMMED' {
209+
publishDir = [
210+
path: { "${params.outdir}/multiqc" },
211+
enabled: false
212+
]
213+
}
207214
}
208215
}
209216

@@ -362,6 +369,7 @@ if (!params.skip_alignment) {
362369
if (params.with_umi && ['star_salmon','hisat2'].contains(params.aligner)) {
363370
process {
364371
withName: '.*:DEDUP_UMI_UMITOOLS_GENOME:UMITOOLS_DEDUP' {
372+
ext.args = { meta.single_end ? '' : '--unpaired-reads=discard --chimeric-pairs=discard' }
365373
ext.prefix = { "${meta.id}.umi_dedup.sorted" }
366374
publishDir = [
367375
[
@@ -560,11 +568,20 @@ if (!params.skip_alignment && params.aligner == 'star_salmon') {
560568
ext.args = '-n'
561569
ext.prefix = { "${meta.id}.umi_dedup.transcriptome" }
562570
publishDir = [
563-
path: { "${params.outdir}/samtools" },
571+
path: { "${params.outdir}/${params.aligner}" },
564572
enabled: false
565573
]
566574
}
567575

576+
withName: 'NFCORE_RNASEQ:RNASEQ:UMITOOLS_PREPAREFORRSEM' {
577+
ext.prefix = { "${meta.id}.umi_dedup.transcriptome.filtered" }
578+
publishDir = [
579+
path: { "${params.outdir}/${params.aligner}/umitools/log" },
580+
mode: params.publish_dir_mode,
581+
pattern: '*.log'
582+
]
583+
}
584+
568585
withName: 'NFCORE_RNASEQ:RNASEQ:BAM_SORT_SAMTOOLS:SAMTOOLS_SORT' {
569586
ext.prefix = { "${meta.id}.transcriptome.sorted" }
570587
publishDir = [
@@ -588,6 +605,7 @@ if (!params.skip_alignment && params.aligner == 'star_salmon') {
588605
}
589606

590607
withName: '.*:DEDUP_UMI_UMITOOLS_TRANSCRIPTOME:UMITOOLS_DEDUP' {
608+
ext.args = { meta.single_end ? '' : '--unpaired-reads=discard --chimeric-pairs=discard' }
591609
ext.prefix = { "${meta.id}.umi_dedup.transcriptome.sorted" }
592610
publishDir = [
593611
path: { "${params.outdir}/${params.aligner}/umitools" },

0 commit comments

Comments
 (0)