Commit 949570f

Merge pull request #1 from mapo9/querynator_module
Querynator functionality added
2 parents 1277e9b + 3902517 commit 949570f

48 files changed: 1229 additions and 539 deletions

.gitignore

Lines changed: 2 additions & 0 deletions
@@ -6,3 +6,5 @@ results/
 testing/
 testing*
 *.pyc
+dev_test
+workflows/test_module

CHANGELOG.md

Lines changed: 14 additions & 0 deletions
@@ -3,6 +3,20 @@
 The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.0.0/)
 and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0.html).
 
+## [0.1.0](https://github.com/qbic-pipelines/variantmtb/releases/tag/0.1.0) - Paris-Roubaix
+
+### `Added`
+
+- [#1](https://github.com/qbic-pipelines/variantmtb/pull/1) - Query to CGI & CIViC. Creation of a comprehensive HTML report.
+
+
+### `Fixed`
+
+### `Dependencies`
+
+### `Deprecated`
+
+
 ## v1.0dev - [date]
 
 Initial release of nf-core/variantmtb, created with the [nf-core](https://nf-co.re/) template.

CITATIONS.md

Lines changed: 9 additions & 4 deletions
@@ -10,10 +10,15 @@
 
 ## Pipeline tools
 
-- [FastQC](https://www.bioinformatics.babraham.ac.uk/projects/fastqc/)
-
-- [MultiQC](https://pubmed.ncbi.nlm.nih.gov/27312411/)
-> Ewels P, Magnusson M, Lundin S, Käller M. MultiQC: summarize analysis results for multiple tools and samples in a single report. Bioinformatics. 2016 Oct 1;32(19):3047-8. doi: 10.1093/bioinformatics/btw354. Epub 2016 Jun 16. PubMed PMID: 27312411; PubMed Central PMCID: PMC5039924.
+- [Tabix](http://www.htslib.org/doc/tabix.html)
+- [bcftools norm](https://samtools.github.io/bcftools/bcftools.html#norm)
+
+- [CGI](https://www.cancergenomeinterpreter.org/home)
+> Tamborero, D., Rubio-Perez, C., Deu-Pons, J., Schroeder, M. P., Vivancos, A., Rovira, A., ... & Lopez-Bigas, N. (2018). Cancer Genome Interpreter annotates the biological and clinical relevance of tumor alterations. Genome medicine, 10, 1-8.
+- [CIViC](https://civicdb.org/welcome)
+> Griffith, M., Spies, N. C., Krysiak, K., McMichael, J. F., Coffman, A. C., Danos, A. M., ... & Griffith, O. L. (2017). CIViC is a community knowledgebase for expert crowdsourcing the clinical interpretation of variants in cancer. Nature genetics, 49(2), 170-174.
+- [CIViCpy](https://docs.civicpy.org/en/latest/)
+> Wagner, A. H., Kiwala, S., Coffman, A. C., McMichael, J. F., Cotto, K. C., Mooney, T. B., ... & Griffith, M. (2020). CIViCpy: a python software development and analysis toolkit for the CIViC knowledgebase. JCO Clinical Cancer Informatics, 4, 245-253.
 
 ## Software packaging/containerisation tools
 

README.md

Lines changed: 15 additions & 9 deletions
@@ -1,4 +1,5 @@
-# ![nf-core/variantmtb](docs/images/nf-core-variantmtb_logo_light.png#gh-light-mode-only) ![nf-core/variantmtb](docs/images/nf-core-variantmtb_logo_dark.png#gh-dark-mode-only)
+# ![nf-core/variantmtb](docs/images/nf-core-variantmtb_logo_light.png#gh-light-mode-only)
+<!-- ![nf-core/variantmtb](docs/images/nf-core-variantmtb_logo_dark.png#gh-dark-mode-only) -->
 
 [![GitHub Actions CI Status](https://github.com/nf-core/variantmtb/workflows/nf-core%20CI/badge.svg)](https://github.com/nf-core/variantmtb/actions?query=workflow%3A%22nf-core+CI%22)
 [![GitHub Actions Linting Status](https://github.com/nf-core/variantmtb/workflows/nf-core%20linting/badge.svg)](https://github.com/nf-core/variantmtb/actions?query=workflow%3A%22nf-core+linting%22)
@@ -19,22 +20,27 @@
 
 <!-- TODO nf-core: Write a 1-2 sentence summary of what data the pipeline is for and what it does -->
 
-**nf-core/variantmtb** is a bioinformatics best-practice analysis pipeline for querying variant databases to investigate the biological and predictive relevance of tumor variants.
+**qbic-pipelines/variantmtb** is a bioinformatics best-practice analysis pipeline for querying variant databases to investigate the diagnostic, prognostic and predictive relevance of tumor variants.
 
 The pipeline is built using [Nextflow](https://www.nextflow.io), a workflow tool to run tasks across multiple compute infrastructures in a very portable manner. It uses Docker/Singularity containers making installation trivial and results highly reproducible. The [Nextflow DSL2](https://www.nextflow.io/docs/latest/dsl2.html) implementation of this pipeline uses one container per process which makes it much easier to maintain and update software dependencies. Where possible, these processes have been submitted to and installed from [nf-core/modules](https://github.com/nf-core/modules) in order to make them available to all nf-core pipelines, and to everyone within the Nextflow community!
 
 <!-- TODO nf-core: Add full-sized test dataset and amend the paragraph below if applicable -->
 
-On release, automated continuous integration tests run the pipeline on a full-sized dataset on the AWS cloud infrastructure. This ensures that the pipeline runs on AWS, has sensible resource allocation defaults set to run on real-world datasets, and permits the persistent storage of results to benchmark between pipeline releases and other analysis sources. The results obtained from the full-sized test can be viewed on the [nf-core website](https://nf-co.re/variantmtb/results).
+<!-- On release, automated continuous integration tests run the pipeline on a full-sized dataset on the AWS cloud infrastructure. This ensures that the pipeline runs on AWS, has sensible resource allocation defaults set to run on real-world datasets, and permits the persistent storage of results to benchmark between pipeline releases and other analysis sources. The results obtained from the full-sized test can be viewed on the [nf-core website](https://nf-co.re/variantmtb/results). -->
+
+<p align="center">
+<img title="variantMTB workflow" src="docs/images/variantMTB_workflow.png" width=70%>
+</p>
 
 ## Pipeline summary
 
 <!-- TODO nf-core: Fill in short bullet-pointed list of the default steps in the pipeline -->
 
-1. Filter for variants that [PASS](http://samtools.github.io/bcftools/bcftools.html)
-2. Query [Clinvar](https://www.ncbi.nlm.nih.gov/clinvar/)
-3. Query [Oncokb](https://www.oncokb.org/)
-4. Query [Civic](https://civicdb.org/variants/home)
+1. Normalize variants [bcftools norm](https://www.htslib.org/doc/1.0/bcftools.html#norm)
+2. Index VCF file [tabix](http://www.htslib.org/doc/tabix.html)
+3. Query [CGI](https://www.cancergenomeinterpreter.org/home)
+4. Query [CIViC](https://civicdb.org/variants/home)
+5. Categorize variants and create a comprehensive HTML report
 
 ## Quick Start
 
@@ -60,7 +66,7 @@ On release, automated continuous integration tests run the pipeline on a full-si
 <!-- TODO nf-core: Update the example "typical command" below used to run the pipeline -->
 
 ```console
-nextflow run nf-core/variantmtb --input samplesheet.csv --outdir <OUTDIR> --genome GRCh37 -profile <docker/singularity/podman/shifter/charliecloud/conda/institute>
+nextflow run qbic-pipelines/variantmtb -r dev --input samplesheet.csv --outdir <OUTDIR> --genome GRCh38 -profile <docker/singularity/podman/shifter/charliecloud/conda/institute>
 ```
 
 ## Documentation
6672
## Documentation
@@ -69,7 +75,7 @@ The nf-core/variantmtb pipeline comes with documentation about the pipeline [usa
 
 ## Credits
 
-nf-core/variantmtb was originally written by SusiJo.
+nf-core/variantmtb was originally started by SusiJo and mainly developed by mapo9.
 
 We thank the following people for their extensive assistance in the development of this pipeline:
 

assets/samplesheet.csv

Lines changed: 4 additions & 3 deletions
@@ -1,3 +1,4 @@
-sample,fastq_1,fastq_2
-SAMPLE_PAIRED_END,/path/to/fastq/files/AEG588A1_S1_L002_R1_001.fastq.gz,/path/to/fastq/files/AEG588A1_S1_L002_R2_001.fastq.gz
-SAMPLE_SINGLE_END,/path/to/fastq/files/AEG588A4_S4_L003_R1_001.fastq.gz,
+sample,filename,genome,filetype
+sample_1,path/to/file_1.vcf,GRCh38,mutations
+sample_2,path/to/file_2.vcf,GRCh38,mutations
+sample_3,path/to/file_3.vcf,GRCh38,mutations
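
Review note: the samplesheet now carries four columns instead of FASTQ paths. As a rough illustration of how a consumer might read the new layout, the sketch below parses the example rows with the standard `csv` module and checks the header first. The helper name `read_samplesheet` and the inline `SAMPLESHEET` string are assumptions made for this example; they are not part of the commit.

```python
import csv
import io

# Column names taken from the new assets/samplesheet.csv header.
REQUIRED_COLUMNS = {"sample", "filename", "genome", "filetype"}

# Inline copy of the example samplesheet from this diff (for illustration only).
SAMPLESHEET = """\
sample,filename,genome,filetype
sample_1,path/to/file_1.vcf,GRCh38,mutations
sample_2,path/to/file_2.vcf,GRCh38,mutations
sample_3,path/to/file_3.vcf,GRCh38,mutations
"""


def read_samplesheet(handle):
    """Return the samplesheet rows as dicts after checking the header.

    Hypothetical helper, not the pipeline's own reader.
    """
    reader = csv.DictReader(handle)
    missing = REQUIRED_COLUMNS - set(reader.fieldnames or [])
    if missing:
        raise ValueError(f"Samplesheet is missing columns: {', '.join(sorted(missing))}")
    return list(reader)


rows = read_samplesheet(io.StringIO(SAMPLESHEET))
for row in rows:
    print(row["sample"], row["filename"], row["genome"], row["filetype"])
```

Checking the header before iterating means a samplesheet in the old `sample,fastq_1,fastq_2` layout fails with one clear message rather than a `KeyError` on the first row.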

bin/check_samplesheet.py

Lines changed: 61 additions & 16 deletions
@@ -28,12 +28,30 @@ class RowChecker:
     VALID_FORMATS = (
         ".vcf",
         ".vcf.gz",
+        ".tsv",
+        ".ext"
+    )
+
+    VALID_GENOMES = (
+        "hg19",
+        "GRCh37",
+        "hg38",
+        "GRCh38"
+    )
+
+    VALID_FILETYPES = (
+        "mutations",
+        "cnas",
+        "translocations"
     )
 
     def __init__(
         self,
         sample_col="sample",
-        first_col="vcf",
+        filename_col="filename",
+        genome_col="genome",
+        filetype_col="filetype",
+
         **kwargs,
     ):
         """
"""
@@ -42,13 +60,20 @@ def __init__(
         Args:
             sample_col (str): The name of the column that contains the sample name
                 (default "sample").
-            first_col (str): The name of the column that contains the first (or only)
-                VCF file path (default "vcf").
+            filename_col (str): The name of the column that contains the input file path.
+            genome_col (str): The name of the column that contains the reference genome
+                (default "genome").
+            filetype_col (str): The name of the column that contains the type of the input file
+                (default "filetype").
+
+
 
         """
         super().__init__(**kwargs)
         self._sample_col = sample_col
-        self._first_col = first_col
+        self._filename_col = filename_col
+        self._genome_col = genome_col
+        self._filetype_col = filetype_col
         self._seen = set()
         self.modified = []
 
@@ -62,8 +87,8 @@ def validate_and_transform(self, row):
 
         """
         self._validate_sample(row)
-        self._validate_first(row)
-        self._seen.add((row[self._sample_col], row[self._first_col]))
+        self._validate_entries(row)
+        self._seen.add((row[self._sample_col], row[self._filename_col]))
         self.modified.append(row)
 
     def _validate_sample(self, row):
@@ -72,18 +97,38 @@ def _validate_sample(self, row):
         # Sanitize samples slightly.
         row[self._sample_col] = row[self._sample_col].replace(" ", "_")
 
-    def _validate_first(self, row):
-        """Assert that the first VCF entry is non-empty and has the right format."""
-        assert len(row[self._first_col]) > 0, "At least the first VCF file is required."
-        self._validate_vcf_format(row[self._first_col])
+    def _validate_entries(self, row):
+        """
+        Assert that the input file entry is non-empty and has the right format.
+        Assert that a supported reference genome is given.
+        Assert that a supported filetype is provided.
+        """
+        assert len(row[self._filename_col]) > 0, "At least the first VCF file is required."
+        self._validate_file_format(row[self._filename_col])
+        self._validate_genome(row[self._genome_col])
+        self._validate_filetype(row[self._filetype_col])
 
-    def _validate_vcf_format(self, filename):
+    def _validate_file_format(self, filename):
         """Assert that a given filename has one of the expected VCF extensions."""
         assert any(filename.endswith(extension) for extension in self.VALID_FORMATS), (
             f"The VCF file has an unrecognized extension: {filename}\n"
             f"It should be one of: {', '.join(self.VALID_FORMATS)}"
         )
 
+    def _validate_genome(self, genome_name):
+        """Assert that the given reference genome is compatible with the pipeline."""
+        assert any(genome_name == genome for genome in self.VALID_GENOMES), (
+            f"The provided reference genome is not supported: {genome_name}\n"
+            f"It should be one of: {', '.join(self.VALID_GENOMES)}"
+        )
+
+    def _validate_filetype(self, file_type):
+        """Assert that the given filetype is supported by the pipeline."""
+        assert any(file_type == f_t for f_t in self.VALID_FILETYPES), (
+            f"The provided filetype is not supported: {file_type}\n"
+            f"It should be one of: {', '.join(self.VALID_FILETYPES)}"
+        )
+
     def validate_unique_samples(self):
         """
         Assert that the combination of sample name and VCF filename is unique.
@@ -155,16 +200,16 @@ def check_samplesheet(file_in, file_out):
         This function checks that the samplesheet follows the following structure,
         see also the `viral recon samplesheet`_::
 
-            sample,vcf
-            SAMPLE1,SAMPLE1.vcf.gz
-            SAMPLE2,SAMPLE2.vcf.gz
-            SAMPLE3,SAMPLE3.vcf.gz
+            sample,filename,genome,filetype
+            SAMPLE1,SAMPLE1.vcf.gz,hg19,mutations
+            SAMPLE2,SAMPLE2.tsv,GRCh37,translocations
+            SAMPLE3,SAMPLE3.vcf,hg19,mutations
 
     .. _viral recon samplesheet:
         https://raw.githubusercontent.com/nf-core/test-datasets/viralrecon/samplesheet/samplesheet_test_illumina_amplicon.csv
 
     """
-    required_columns = {"sample", "vcf"}
+    required_columns = {"sample", "filename", "genome", "filetype"}
     # See https://docs.python.org/3.9/library/csv.html#id3 to read up on `newline=""`.
     with file_in.open(newline="") as in_handle:
         reader = csv.DictReader(in_handle, dialect=sniff_format(in_handle))
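
Review note: the new validators rely on `assert`, which Python removes when run with `-O`, so the checks would silently disappear in optimized mode. The sketch below shows one possible alternative that raises `ValueError` and uses plain membership tests. The three tuples are copied from the diff; the standalone function `validate_entries` and the hard-coded column names are assumptions for this example, not the commit's `RowChecker` API.

```python
# Allowed values copied from the RowChecker class attributes in this diff.
VALID_FORMATS = (".vcf", ".vcf.gz", ".tsv", ".ext")
VALID_GENOMES = ("hg19", "GRCh37", "hg38", "GRCh38")
VALID_FILETYPES = ("mutations", "cnas", "translocations")


def validate_entries(row):
    """Raise ValueError if a samplesheet row holds an unsupported value.

    Hypothetical standalone variant of the checks in `_validate_entries`.
    """
    filename = row["filename"]
    if not filename:
        raise ValueError("An input file is required.")
    # str.endswith accepts a tuple of suffixes, so no explicit loop is needed.
    if not filename.endswith(VALID_FORMATS):
        raise ValueError(
            f"The input file has an unrecognized extension: {filename}\n"
            f"It should be one of: {', '.join(VALID_FORMATS)}"
        )
    if row["genome"] not in VALID_GENOMES:
        raise ValueError(
            f"The provided reference genome is not supported: {row['genome']}\n"
            f"It should be one of: {', '.join(VALID_GENOMES)}"
        )
    if row["filetype"] not in VALID_FILETYPES:
        raise ValueError(
            f"The provided filetype is not supported: {row['filetype']}\n"
            f"It should be one of: {', '.join(VALID_FILETYPES)}"
        )


# A row matching the docstring example passes without raising.
validate_entries({"filename": "SAMPLE1.vcf.gz", "genome": "hg19", "filetype": "mutations"})
```

Raising an exception keeps the validation active regardless of interpreter flags, and `in` reads more directly than `any(x == y for y in ...)` while behaving the same for these string tuples.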

conf/base.config

Lines changed: 0 additions & 2 deletions
@@ -10,7 +10,6 @@
 
 process {
 
-    // TODO nf-core: Check the defaults for all processes
     cpus = { check_max( 1 * task.attempt, 'cpus' ) }
     memory = { check_max( 6.GB * task.attempt, 'memory' ) }
     time = { check_max( 4.h * task.attempt, 'time' ) }
@@ -24,7 +23,6 @@ process {
     // These labels are used and recognised by default in DSL2 files hosted on nf-core/modules.
     // If possible, it would be nice to keep the same label naming convention when
     // adding in your local modules too.
-    // TODO nf-core: Customise requirements for specific processes.
     // See https://www.nextflow.io/docs/latest/config.html#config-process-selectors
     withLabel:process_low {
         cpus = { check_max( 2 * task.attempt, 'cpus' ) }
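
Review note: the defaults left in place scale each request with `task.attempt`. The arithmetic can be sketched as below, assuming `check_max` simply caps the request at a configured ceiling, which is how the nf-core template helper is generally understood to behave. The Python `check_max` here is a stand-in written for this example, and the ceilings (16 CPUs, 128 GB, 240 h) are made-up values, not settings from this commit.

```python
def check_max(requested, maximum):
    """Cap a requested resource at a ceiling (hypothetical stand-in for the config helper)."""
    return min(requested, maximum)


def default_resources(attempt, max_cpus=16, max_memory_gb=128, max_time_h=240):
    """Mirror the base defaults: 1 CPU, 6 GB and 4 h, each multiplied by the attempt number.

    The ceilings are assumed values for illustration only.
    """
    return {
        "cpus": check_max(1 * attempt, max_cpus),
        "memory_gb": check_max(6 * attempt, max_memory_gb),
        "time_h": check_max(4 * attempt, max_time_h),
    }


for attempt in (1, 2, 3):
    print(attempt, default_resources(attempt))
```

Under these assumptions a task that fails and is retried asks for proportionally more of each resource on every attempt until it reaches the ceiling.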

conf/modules.config

Lines changed: 40 additions & 19 deletions
@@ -18,26 +18,11 @@ process {
         saveAs: { filename -> filename.equals('versions.yml') ? null : filename }
     ]
 
-
-    withName: 'BCFTOOLS_VIEW' {
-        ext.args = "-f PASS"
-        ext.prefix = { "${meta.id}.pass" }
-        publishDir = [
-            path: { "${params.outdir}/bcftools/pass" },
-            mode: params.publish_dir_mode,
-            saveAs: { filename -> filename.equals('versions.yml') ? null : filename }
-        ]
-    }
-
-    withName: 'BCFTOOLS_SPLITVEP' {
-        // [%AF] pastes allele frequencies of all samples contained in a vcf without quotes
-        // Normal sample AF: 0.01 Tumor sample AF: 0.019 is printed as 0.010.019
-        ext.args = "-f '%CHROM %POS %ID %REF %ALT [%AF] %IMPACT %Gene %SYMBOL %Consequence %SIFT %PolyPhen %HGVSc %HGVSp %RefSeq %Existing_variation %CLIN_SIG\n' --duplicate"
-        ext.prefix = { "${meta.id}.split_vep" }
+    withName: 'BCFTOOLS_NORM' {
+        ext.args = "--output-type z -a --atom-overlaps ."
+        ext.prefix = { "${meta.id}.normalized" }
         publishDir = [
-            path: { "${params.outdir}/bcftools/split_vep" },
-            mode: params.publish_dir_mode,
-            saveAs: { filename -> filename.equals('versions.yml') ? null : filename }
+            enabled: false
         ]
     }
 

@@ -56,5 +41,41 @@ process {
             pattern: '*_versions.yml'
         ]
     }
+
+    withName: QUERYNATOR_CGIAPI {
+        publishDir = [
+            path: { "${params.outdir}/${meta.id}" },
+            mode: params.publish_dir_mode,
+            pattern: '*'
+        ]
+    }
+
+    withName: QUERYNATOR_CIVICAPI {
+        publishDir = [
+            path: { "${params.outdir}/${meta.id}" },
+            mode: params.publish_dir_mode,
+            pattern: '*'
+        ]
+    }
+
+    withName: QUERYNATOR_CREATEREPORT {
+        publishDir = [
+            path: { "${params.outdir}/${meta.id}" },
+            mode: params.publish_dir_mode,
+            pattern: '*'
+        ]
+    }
+
+    withName: TABIX_TABIX {
+        publishDir = [
+            enabled: false
+        ]
+    }
+
+    withName: TABIX_BGZIPTABIX {
+        publishDir = [
+            enabled: false
+        ]
+    }
 
 }

conf/test.config

Lines changed: 5 additions & 1 deletion
@@ -23,5 +23,9 @@ params {
     input = "${projectDir}/tests/csv/input.csv"
 
     // Genome references
-    genome = 'hg38'
+    genome = 'GRCh37'
+    fasta = "s3://ngi-igenomes/igenomes/Homo_sapiens/Ensembl/GRCh37/Sequence/WholeGenomeFasta/genome.fa"
+
+    // mandatory flags
+    databases = 'civic'
 }

conf/test_full.config

Lines changed: 0 additions & 24 deletions
This file was deleted.
