Commit ab9df9a

Merge pull request #1106 from nf-core/kallisto_quant
Kallisto quantification
2 parents f189c95 + 5ad263a commit ab9df9a

41 files changed: +1361 -494 lines changed

.github/workflows/ci.yml

Lines changed: 10 additions & 7 deletions
@@ -232,16 +232,19 @@ jobs:
         run: |
           nextflow run ${GITHUB_WORKSPACE} -profile test_cache,docker --aligner hisat2 ${{ matrix.parameters }} --outdir ./results --test_data_base ${{ github.workspace }}/test-datasets/
 
-  salmon:
-    name: Test Salmon with workflow parameters
+  pseudo:
+    name: Test Pseudoaligners with workflow parameters
     if: ${{ (github.event_name != 'push' || (github.event_name == 'push' && github.repository == 'nf-core/rnaseq')) && !contains(github.event.head_commit.message, '[ci fast]') }}
     runs-on: ubuntu-latest
     strategy:
       matrix:
         parameters:
-          - "--skip_qc"
-          - "--skip_alignment --skip_pseudo_alignment"
-          - "--salmon_index false --transcript_fasta false"
+          - "--pseudo_aligner salmon --skip_qc"
+          - "--pseudo_aligner salmon --skip_alignment --skip_pseudo_alignment"
+          - "--pseudo_aligner salmon --salmon_index false --transcript_fasta false"
+          - "--pseudo_aligner kallisto --skip_qc"
+          - "--pseudo_aligner kallisto --skip_alignment --skip_pseudo_alignment"
+          - "--pseudo_aligner kallisto --kallisto_index false --transcript_fasta false"
     steps:
       - name: Check out pipeline code
         uses: actions/checkout@v2
@@ -280,6 +283,6 @@ jobs:
           wget -qO- get.nextflow.io | bash
           sudo mv nextflow /usr/local/bin/
 
-      - name: Run pipeline with Salmon and various parameters
+      - name: Run pipeline with Salmon or Kallisto and various parameters
         run: |
-          nextflow run ${GITHUB_WORKSPACE} -profile test_cache,docker --pseudo_aligner salmon ${{ matrix.parameters }} --outdir ./results --test_data_base ${{ github.workspace }}/test-datasets/
+          nextflow run ${GITHUB_WORKSPACE} -profile test_cache,docker ${{ matrix.parameters }} --outdir ./results --test_data_base ${{ github.workspace }}/test-datasets/
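
Review note: the `pseudo` job moves `--pseudo_aligner` out of the fixed run command and into each matrix entry, so a single job exercises both Salmon and Kallisto. The sketch below only illustrates how the six matrix entries would be substituted into the command template. The helper name `build_commands` and the placeholder paths are assumptions for illustration and are not part of the workflow.

```python
from typing import List

# Matrix entries copied from the updated ci.yml hunk above.
MATRIX_PARAMETERS = [
    "--pseudo_aligner salmon --skip_qc",
    "--pseudo_aligner salmon --skip_alignment --skip_pseudo_alignment",
    "--pseudo_aligner salmon --salmon_index false --transcript_fasta false",
    "--pseudo_aligner kallisto --skip_qc",
    "--pseudo_aligner kallisto --skip_alignment --skip_pseudo_alignment",
    "--pseudo_aligner kallisto --kallisto_index false --transcript_fasta false",
]

# Command template mirroring the "run" step; {ws} and {td} stand in for the
# GitHub Actions expressions and are hypothetical placeholders.
TEMPLATE = (
    "nextflow run {ws} -profile test_cache,docker {params} "
    "--outdir ./results --test_data_base {td}/test-datasets/"
)


def build_commands(workspace: str, test_data: str) -> List[str]:
    """Return one command line per matrix entry (illustrative helper)."""
    return [
        TEMPLATE.format(ws=workspace, params=params, td=test_data)
        for params in MATRIX_PARAMETERS
    ]


if __name__ == "__main__":
    for command in build_commands("/path/to/pipeline", "/path/to/workspace"):
        print(command)
```

Because the aligner now travels with each entry, adding another pseudoaligner would only require new matrix lines rather than a new job.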

CHANGELOG.md

Lines changed: 7 additions & 3 deletions
@@ -10,6 +10,7 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
 Special thanks to the following for their contributions to the release:
 
 - [Adam Talbot](https://github.com/adamrtalbot)
+- [Jonathan Manning](https://github.com/pinin4fjords)
 - [Júlia Mir Pedrol](https://github.com/mirpedrol)
 - [Matthias Zepper](https://github.com/MatthiasZepper)
 - [Maxime Garcia](https://github.com/maxulysse)
@@ -28,13 +29,16 @@ Thank you to everyone else that has contributed by reporting bugs, enhancements
 - [PR #1083](https://github.com/nf-core/rnaseq/pull/1083) - Move local modules and subworkflows to subfolders
 - [PR #1088](https://github.com/nf-core/rnaseq/pull/1088) - Updates contributing and code of conduct documents with nf-core template 2.10
 - [PR #1091](https://github.com/nf-core/rnaseq/pull/1091) - Reorganise parameters in schema for better usability
+- [PR #1106](https://github.com/nf-core/rnaseq/pull/1106) - Kallisto quantification
+- [PR #1106](https://github.com/nf-core/rnaseq/pull/1106) - MultiQC [version bump](https://github.com/nf-core/rnaseq/pull/1106/commits/aebad067a10a45510a2b421da852cb436ae65fd8)
+- [#1050](https://github.com/nf-core/rnaseq/issues/1050) - Provide custom prefix/suffix for summary files to avoid overwriting
 
 ### Software dependencies
 
 | Dependency              | Old version | New version |
 | ----------------------- | ----------- | ----------- |
 | `fastqc`                | 0.11.9      | 0.12.1      |
-| `multiqc`               | 1.14        | 1.15        |
+| `multiqc`               | 1.14        | 1.17        |
 | `ucsc-bedgraphtobigwig` | 377         | 445         |
 
 > **NB:** Dependency has been **updated** if both old and new version information is present.
@@ -61,7 +65,7 @@ Thank you to everyone else that has contributed by reporting bugs, enhancements
 ### Enhancements & fixes
 
 - [[#1011](https://github.com/nf-core/rnaseq/issues/1011)] - FastQ files from UMI-tools not being passed to fastp
-- [[#1018](https://github.com/nf-core/rnaseq/issues/1018)] - Ability to skip both alignment and pseudo-alignment to only run pre-processing QC steps.
+- [[#1018](https://github.com/nf-core/rnaseq/issues/1018)] - Ability to skip both alignment and pseudoalignment to only run pre-processing QC steps.
 - [PR #1016](https://github.com/nf-core/rnaseq/pull/1016) - Updated pipeline template to [nf-core/tools 2.8](https://github.com/nf-core/tools/releases/tag/2.8)
 - [PR #1025](https://github.com/nf-core/fetchngs/pull/1025) - Add `public_aws_ecr.config` to source mulled containers when using `public.ecr.aws` Docker Biocontainer registry
 - [PR #1038](https://github.com/nf-core/rnaseq/pull/1038) - Updated error log for count values when supplying `--additional_fasta`
@@ -809,7 +813,7 @@ Major novel changes include:
 - Added options to skip several steps
   - Skip trimming using `--skipTrimming`
   - Skip BiotypeQC using `--skipBiotypeQC`
-  - Skip Alignment using `--skipAlignment` to only use pseudo-alignment using Salmon
+  - Skip Alignment using `--skipAlignment` to only use pseudoalignment using Salmon
 
 ### Documentation updates
 

README.md

Lines changed: 1 addition & 1 deletion
@@ -39,7 +39,7 @@
     3. [`dupRadar`](https://bioconductor.org/packages/release/bioc/html/dupRadar.html)
     4. [`Preseq`](http://smithlabresearch.org/software/preseq/)
     5. [`DESeq2`](https://bioconductor.org/packages/release/bioc/html/DESeq2.html)
-15. Pseudo-alignment and quantification ([`Salmon`](https://combine-lab.github.io/salmon/); _optional_)
+15. Pseudoalignment and quantification ([`Salmon`](https://combine-lab.github.io/salmon/) or ['Kallisto'](https://pachterlab.github.io/kallisto/); _optional_)
 16. Present QC for raw read, alignment, gene biotype, sample similarity, and strand-specificity checks ([`MultiQC`](http://multiqc.info/), [`R`](https://www.r-project.org/))
 
 > **Note**

assets/multiqc_config.yml

Lines changed: 2 additions & 0 deletions
@@ -23,6 +23,7 @@ run_modules:
   - hisat2
   - rsem
   - salmon
+  - kallisto
   - samtools
   - picard
   - preseq
@@ -66,6 +67,7 @@ extra_fn_clean_exts:
   - ".umi_dedup"
   - "_val"
   - ".markdup"
+  - "_primary"
 
 # Customise the module search patterns to speed up execution time
 #  - Skip module sub-tools that we are not interested in
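
Review note: the new `"_primary"` entry extends the list of suffixes that MultiQC strips when it derives sample names from file names. The sketch below is a simplified stand-in for that cleaning step, assuming the default truncating behaviour in which a name is cut at the first occurrence of a listed pattern. The function `clean_sample_name` and the sample names are hypothetical, and this is not MultiQC's own code.

```python
from typing import Sequence

# Patterns visible in the extra_fn_clean_exts hunk above.
EXTRA_FN_CLEAN_EXTS = (".umi_dedup", "_val", ".markdup", "_primary")


def clean_sample_name(name: str, patterns: Sequence[str] = EXTRA_FN_CLEAN_EXTS) -> str:
    """Truncate a sample name at the first occurrence of each pattern.

    Assumption: mimics a truncate-style cleaner; everything from the
    matched pattern onwards is dropped.
    """
    for pattern in patterns:
        name = name.split(pattern, 1)[0]
    return name


if __name__ == "__main__":
    for raw in ("SAMPLE1_primary", "SAMPLE1.markdup.sorted", "SAMPLE2"):
        print(raw, "->", clean_sample_name(raw))
```

Under this assumption, a file named with a `_primary` suffix would be reported under the same sample name as its other outputs instead of as a separate sample.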

bin/salmon_tx2gene.py

Lines changed: 0 additions & 89 deletions
This file was deleted.

bin/salmon_summarizedexperiment.r renamed to bin/summarizedexperiment.r

Lines changed: 4 additions & 4 deletions
@@ -2,18 +2,18 @@
 
 library(SummarizedExperiment)
 
-## Create SummarizedExperiment (se) object from Salmon counts
+## Create SummarizedExperiment (se) object from counts
 
 args <- commandArgs(trailingOnly = TRUE)
-if (length(args) < 2) {
-    stop("Usage: salmon_se.r <coldata> <counts> <tpm>", call. = FALSE)
+if (length(args) < 3) {
+    stop("Usage: summarizedexperiment.r <coldata> <counts> <tpm> <tx2gene>", call. = FALSE)
 }
 
 coldata <- args[1]
 counts_fn <- args[2]
 tpm_fn <- args[3]
+tx2gene <- args[4]
 
-tx2gene <- "salmon_tx2gene.tsv"
 info <- file.info(tx2gene)
 if (info$size == 0) {
     tx2gene <- NULL

bin/tx2gene.py

Lines changed: 160 additions & 0 deletions
@@ -0,0 +1,160 @@
+#!/usr/bin/env python
+import logging
+import argparse
+import glob
+import os
+from collections import Counter, defaultdict, OrderedDict
+from collections.abc import Set
+from typing import Dict
+
+# Configure logging
+logging.basicConfig(format="%(name)s - %(asctime)s %(levelname)s: %(message)s")
+logger = logging.getLogger(__name__)
+logger.setLevel(logging.INFO)
+
+
+def read_top_transcripts(quant_dir: str, file_pattern: str) -> Set[str]:
+    """
+    Read the top 100 transcripts from the quantification file.
+
+    Parameters:
+    quant_dir (str): Directory where quantification files are located.
+    file_pattern (str): Pattern to match quantification files.
+
+    Returns:
+    set: A set containing the top 100 transcripts.
+    """
+    try:
+        # Find the quantification file within the directory
+        quant_file_path = glob.glob(os.path.join(quant_dir, "*", file_pattern))[0]
+        with open(quant_file_path, "r") as file_handle:
+            # Read the file and extract the top 100 transcripts
+            return {line.split()[0] for i, line in enumerate(file_handle) if i > 0 and i <= 100}
+    except IndexError:
+        # Log an error and raise a FileNotFoundError if the quant file does not exist
+        logger.error("No quantification files found.")
+        raise FileNotFoundError("Quantification file not found.")
+
+
+def discover_transcript_attribute(gtf_file: str, transcripts: Set[str]) -> str:
+    """
+    Discover the attribute in the GTF that corresponds to transcripts, prioritizing 'transcript_id'.
+
+    Parameters:
+    gtf_file (str): Path to the GTF file.
+    transcripts (Set[str]): A set of transcripts to match in the GTF file.
+
+    Returns:
+    str: The attribute name that corresponds to transcripts in the GTF file.
+    """
+    votes = Counter()
+    with open(gtf_file) as inh:
+        # Read GTF file, skipping header lines
+        for line in filter(lambda x: not x.startswith("#"), inh):
+            cols = line.split("\t")
+            # Parse attribute column and update votes for each attribute found
+            attributes = dict(item.strip().split(" ", 1) for item in cols[8].split(";") if item.strip())
+            votes.update(key for key, value in attributes.items() if value.strip('"') in transcripts)
+
+    if not votes:
+        # Log a warning if no matching attribute is found
+        logger.warning("No attribute in GTF matching transcripts")
+        return ""
+
+    # Check if 'transcript_id' is among the attributes with the highest votes
+    if "transcript_id" in votes and votes["transcript_id"] == max(votes.values()):
+        logger.info("Attribute 'transcript_id' corresponds to transcripts.")
+        return "transcript_id"
+
+    # If 'transcript_id' isn't the highest, determine the most common attribute that matches the transcripts
+    attribute, _ = votes.most_common(1)[0]
+    logger.info(f"Attribute '{attribute}' corresponds to transcripts.")
+    return attribute
+
+
+def parse_attributes(attributes_text: str) -> Dict[str, str]:
+    """
+    Parse the attributes column of a GTF file.
+
+    :param attributes_text: The attributes column as a string.
+    :return: A dictionary of the attributes.
+    """
+    # Split the attributes string by semicolon and strip whitespace
+    attributes = attributes_text.strip().split(";")
+    attr_dict = OrderedDict()
+
+    # Iterate over each attribute pair
+    for attribute in attributes:
+        # Split the attribute into key and value, ensuring there are two parts
+        parts = attribute.strip().split(" ", 1)
+        if len(parts) == 2:
+            key, value = parts
+            # Remove any double quotes from the value
+            value = value.replace('"', "")
+            attr_dict[key] = value
+
+    return attr_dict
+
+
+def map_transcripts_to_gene(
+    quant_type: str, gtf_file: str, quant_dir: str, gene_id: str, extra_id_field: str, output_file: str
+) -> bool:
+    """
+    Map transcripts to gene names and write the output to a file.
+
+    Parameters:
+    quant_type (str): The quantification method used (e.g., 'salmon').
+    gtf_file (str): Path to the GTF file.
+    quant_dir (str): Directory where quantification files are located.
+    gene_id (str): The gene ID attribute in the GTF file.
+    extra_id_field (str): Additional ID field in the GTF file.
+    output_file (str): The output file path.
+
+    Returns:
+    bool: True if the operation was successful, False otherwise.
+    """
+    # Read the top transcripts based on quantification type
+    transcripts = read_top_transcripts(quant_dir, "quant.sf" if quant_type == "salmon" else "abundance.tsv")
+    # Discover the attribute that corresponds to transcripts in the GTF
+    transcript_attribute = discover_transcript_attribute(gtf_file, transcripts)
+
+    if not transcript_attribute:
+        # If no attribute is found, return False
+        return False
+
+    # Open GTF and output file to write the mappings
+    # Initialize the set to track seen combinations
+    seen = set()
+
+    with open(gtf_file) as inh, open(output_file, "w") as output_handle:
+        # Parse each line of the GTF, mapping transcripts to genes
+        for line in filter(lambda x: not x.startswith("#"), inh):
+            cols = line.split("\t")
+            attr_dict = parse_attributes(cols[8])
+            if gene_id in attr_dict and transcript_attribute in attr_dict:
+                # Create a unique identifier for the transcript-gene combination
+                transcript_gene_pair = (attr_dict[transcript_attribute], attr_dict[gene_id])
+
+                # Check if the combination has already been seen
+                if transcript_gene_pair not in seen:
+                    # If it's a new combination, write it to the output and add to the seen set
+                    extra_id = attr_dict.get(extra_id_field, attr_dict[gene_id])
+                    output_handle.write(f"{attr_dict[transcript_attribute]}\t{attr_dict[gene_id]}\t{extra_id}\n")
+                    seen.add(transcript_gene_pair)
+
+    return True
+
+
+# Main function to parse arguments and call the mapping function
+if __name__ == "__main__":
+    parser = argparse.ArgumentParser(description="Map transcripts to gene names for tximport.")
+    parser.add_argument("--quant_type", type=str, help="Quantification type", default="salmon")
+    parser.add_argument("--gtf", type=str, help="GTF file", required=True)
+    parser.add_argument("--quants", type=str, help="Output of quantification", required=True)
+    parser.add_argument("--id", type=str, help="Gene ID in the GTF file", required=True)
+    parser.add_argument("--extra", type=str, help="Extra ID in the GTF file")
+    parser.add_argument("-o", "--output", dest="output", default="tx2gene.tsv", type=str, help="File with output")
+
+    args = parser.parse_args()
+    if not map_transcripts_to_gene(args.quant_type, args.gtf, args.quants, args.id, args.extra, args.output):
+        logger.error("Failed to map transcripts to genes.")
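
Review note: `bin/tx2gene.py` works in two passes. It first lets each GTF attribute "vote" whenever its value matches a quantified transcript name, preferring `transcript_id` when it ties for the most votes, and then writes de-duplicated transcript/gene/extra rows. The sketch below replays that logic on a few in-memory lines so it can be run without files. The GTF lines, the identifiers `T1`, `G1`, `Abc`, and the helper names `vote_for_transcript_attribute` and `tx2gene_rows` are made up for illustration; only the parsing and voting rules are taken from the script above.

```python
from collections import Counter
from typing import Dict, Iterable, List, Set, Tuple

# Hypothetical GTF records (nine tab-separated columns, attributes last).
GTF_LINES = [
    'chr1\tsrc\ttranscript\t1\t100\t.\t+\t.\tgene_id "G1"; transcript_id "T1"; gene_name "Abc";',
    'chr1\tsrc\texon\t1\t50\t.\t+\t.\tgene_id "G1"; transcript_id "T1"; gene_name "Abc";',
    'chr1\tsrc\ttranscript\t200\t300\t.\t-\t.\tgene_id "G2"; transcript_id "T2";',
]


def parse_attributes(attributes_text: str) -> Dict[str, str]:
    """Split 'key "value"; key "value";' into a dict, dropping the quotes."""
    attr_dict = {}
    for attribute in attributes_text.strip().split(";"):
        parts = attribute.strip().split(" ", 1)
        if len(parts) == 2:
            attr_dict[parts[0]] = parts[1].replace('"', "")
    return attr_dict


def vote_for_transcript_attribute(lines: Iterable[str], transcripts: Set[str]) -> str:
    """Return the attribute whose values most often match the transcript names."""
    votes = Counter()
    for line in lines:
        attrs = parse_attributes(line.split("\t")[8])
        votes.update(key for key, value in attrs.items() if value in transcripts)
    if not votes:
        return ""
    # 'transcript_id' wins whenever it ties for the highest count.
    if "transcript_id" in votes and votes["transcript_id"] == max(votes.values()):
        return "transcript_id"
    return votes.most_common(1)[0][0]


def tx2gene_rows(
    lines: Iterable[str], transcript_attr: str, gene_id: str = "gene_id", extra: str = "gene_name"
) -> List[Tuple[str, str, str]]:
    """Collect unique (transcript, gene, extra) rows; extra falls back to the gene ID."""
    seen = set()
    rows = []
    for line in lines:
        attrs = parse_attributes(line.split("\t")[8])
        if gene_id in attrs and transcript_attr in attrs:
            pair = (attrs[transcript_attr], attrs[gene_id])
            if pair not in seen:
                seen.add(pair)
                rows.append((pair[0], pair[1], attrs.get(extra, attrs[gene_id])))
    return rows


if __name__ == "__main__":
    attribute = vote_for_transcript_attribute(GTF_LINES, {"T1", "T2"})
    for row in tx2gene_rows(GTF_LINES, attribute):
        print("\t".join(row))
```

The voting step is what lets the same script handle GTFs whose transcript names live in a non-standard attribute, while the `seen` set keeps repeated exon records from producing duplicate rows.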
