## _gvanno_ - workflow for functional and clinical annotation of germline nucleotide variants
2
+
3
+
### Contents
4
+
5
+
- [Overview](#overview)
6
+
- [News](#news)
7
+
- [Annotation resources](#annotation-resources)
8
+
- [Getting started](#getting-started)
9
+
- [Contact](#contact)
2
10
3
11
### Overview
4
12
5
13
The germline variant annotator (*gvanno*) is a simple software package intended for analysis and interpretation of human DNA variants of germline origin. Variants and genes are annotated with disease-related and functional associations from a wide range of sources (see below). Technically, the workflow is built with [Docker](https://www.docker.com), but it can also be installed through the [Singularity](https://sylabs.io/docs/) framework.
6
14
7
15
*gvanno* accepts query files encoded in the VCF format, and can analyze both SNVs and short InDels. The workflow relies heavily upon [Ensembl’s Variant Effect Predictor (VEP)](http://www.ensembl.org/info/docs/tools/vep/index.html) and [vcfanno](https://github.com/brentp/vcfanno). It produces an annotated VCF file and a file of tab-separated values (.tsv), the latter listing all annotations per variant record.
8
16
9
-
#### Annotation resources included in _gvanno_ - 1.3.2
10
-
11
-
* [VEP](http://www.ensembl.org/info/docs/tools/vep/index.html) - Variant Effect Predictor v100.2 (GENCODE v34/v19 as the gene reference dataset)
12
-
* [dbNSFP](https://sites.google.com/site/jpopgen/dbNSFP) - Database of non-synonymous functional predictions (v4.1, June 2020)
13
-
* [gnomAD](http://gnomad.broadinstitute.org/) - Germline variant frequencies exome-wide (release 2.1, October 2018) - from VEP
14
-
* [dbSNP](http://www.ncbi.nlm.nih.gov/SNP/) - Database of short genetic variants (build 153) - from VEP
* [ClinVar](http://www.ncbi.nlm.nih.gov/clinvar/) - Database of clinically related variants (June 2020)
17
-
* [DisGeNET](http://www.disgenet.org) - Database of gene-disease associations (v7.0, May 2020)
18
-
* [Open Targets Platform](https://targetvalidation.org) - Target-disease and target-drug associations (2020_06, June 2020)
19
-
* [UniProt/SwissProt KnowledgeBase](http://www.uniprot.org) - Resource on protein sequence and functional information (2020_03, June 2020)
20
-
* [Pfam](http://pfam.xfam.org) - Database of protein families and domains (v33.1, May 2020)
21
-
* [NHGRI-EBI GWAS Catalog](https://www.ebi.ac.uk/gwas/home) - Catalog of published genome-wide association studies (June 13th 2020)
22
-
23
17
### News
18
+
19
+
* September 29th 2020 - **1.4.0 release**
20
+
* Data updates (ClinVar, UniProt, GWAS Catalog, Open Targets Platform)
21
+
* Software updates (VEP 101)
22
+
* Configuration through a TOML file has been removed; all configuration options are now passed as optional arguments to the main Python script (`gvanno.py`)
24
23
* June 30th 2020 - **1.3.2 release**
25
24
* Data updates (ClinVar, UniProt, GWAS Catalog, Open Targets Platform, Pfam, dbNSFP)
26
25
* Using GENCODE v34 as the correct transcript assembly for grch38 (see [issue](https://github.com/Ensembl/ensembl-vep/issues/749))
@@ -33,21 +32,28 @@ The germline variant annotator (*gvanno*) is a simple, software package intended
33
32
* November 22nd 2019 - **1.1.0 release**
34
33
* Ability to install and run workflow using [Singularity](https://sylabs.io/docs/), excellent contribution by [@oskarvid](https://github.com/oskarvid), see step 1.1 in _Getting Started_
35
34
* Data and software updates (ClinVar, UniProt, VEP)
36
-
* July 10th 2019 - **1.0.0 release**
37
-
* Docker image update - VEP v97 (GENCODE 31/19)
38
-
* Data bundle updates: ClinVar, UniProt, GWAS catalog
* [ClinVar](http://www.ncbi.nlm.nih.gov/clinvar/) - Database of clinically related variants (August 2020)
45
+
* [DisGeNET](http://www.disgenet.org) - Database of gene-disease associations (v7.0, May 2020)
46
+
* [Open Targets Platform](https://targetvalidation.org) - Target-disease and target-drug associations (2020_09, September 2020)
47
+
* [UniProt/SwissProt KnowledgeBase](http://www.uniprot.org) - Resource on protein sequence and functional information (2020_04, August 2020)
48
+
* [Pfam](http://pfam.xfam.org) - Database of protein families and domains (v33.1, May 2020)
49
+
* [NHGRI-EBI GWAS Catalog](https://www.ebi.ac.uk/gwas/home) - Catalog of published genome-wide association studies (September 9th 2020)
50
+
43
51
44
52
### Getting started
45
53
46
54
#### STEP 0: Python
47
55
48
-
An installation of Python (version _3.6_) is required to run *gvanno*. Check that Python is installed by typing `python --version` in your terminal window. In addition, a [Python library](https://github.com/uiri/toml) for parsing configuration files encoded with [TOML](https://github.com/toml-lang/toml) is needed. To install, simply run the following command:
49
-
50
-
pip install toml
56
+
An installation of Python (version _3.6_) is required to run *gvanno*. Check that Python is installed by typing `python --version` in your terminal window.
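
The same check can also be made from within Python. The following is a minimal sketch, not part of *gvanno*: `python_is_supported` is a hypothetical helper, and it assumes that 3.6 acts as a lower bound (the text above only names version 3.6).

```python
import sys


def python_is_supported(version_info=None, minimum=(3, 6)):
    """Return True if the given interpreter version is at least `minimum`.

    Assumption: 3.6 is treated as a lower bound, not as an exact requirement.
    """
    if version_info is None:
        version_info = sys.version_info
    # Compare only (major, minor); patch level is ignored.
    return tuple(version_info[:2]) >= tuple(minimum)


if __name__ == "__main__":
    print("Python", sys.version.split()[0], "supported:", python_is_supported())
```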
51
57
52
58
#### STEP 1: Installation of Docker
53
59
@@ -74,15 +80,15 @@ An installation of Python (version _3.6_) is required to run *gvanno*. Check tha
74
80
75
81
#### STEP 2: Download *gvanno* and data bundle
76
82
77
-
1. Download and unpack the [latest software release (1.3.2)](https://github.com/sigven/gvanno/releases/tag/v1.3.2)
83
+
1. Download and unpack the [latest software release (1.4.0)](https://github.com/sigven/gvanno/releases/tag/v1.4.0)
78
84
2. Download and unpack the assembly-specific data bundle in the gvanno directory
79
-
* [grch37 data bundle](https://drive.google.com/file/d/1XJT8sSngl5T3HHQK2CZtZuwXX3rouEYg/) (approx 16Gb)
80
-
* [grch38 data bundle](https://drive.google.com/file/d/1M6gioFzvt6XOqRDTx4UXYD5sIVOH55IY) (approx 17Gb)
85
+
* [grch37 data bundle](https://drive.google.com/file/d/1VnABjA3ZCJLlQxhQKcIGaC17MD0kItVd) (approx 16Gb)
86
+
* [grch38 data bundle](https://drive.google.com/file/d/13fbKtAFzcUGDnPfruzgK43PvAKiFc8XL/) (approx 17Gb)
81
87
* *Unpacking*: `gzip -dc gvanno.databundle.grch37.YYYYMMDD.tgz | tar xvf -`
82
88
83
89
A _data/_ folder within the _gvanno-X.X_ software folder should now have been produced
84
-
3. Pull the [gvanno Docker image (1.3.2)](https://hub.docker.com/r/sigven/gvanno/) from DockerHub (approx 1.9Gb):
@@ -92,42 +98,65 @@ The *gvanno* workflow accepts a single input file:
92
98
93
99
We __strongly__ recommend that the input VCF is compressed and indexed using [bgzip](http://www.htslib.org/doc/tabix.html) and [tabix](http://www.htslib.org/doc/tabix.html). NOTE: If the input VCF contains multi-allelic sites, these will be subject to [decomposition](http://genome.sph.umich.edu/wiki/Vt#Decompose).
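
A minimal sketch of the compression and indexing step, assuming `bgzip` and `tabix` are installed and on the `PATH`. The helper name `compress_and_index_commands` and the file name `example.vcf` are hypothetical; the function only builds the command lines and does not execute anything.

```python
from pathlib import Path


def compress_and_index_commands(vcf_path):
    """Build bgzip and tabix command lines for a plain-text VCF file.

    Returns (bgzip_argv, tabix_argv, compressed_path). Nothing is run here.
    """
    vcf = Path(vcf_path)
    compressed = vcf.with_name(vcf.name + ".gz")
    # `bgzip -c` writes the compressed stream to stdout, leaving the input intact.
    bgzip_argv = ["bgzip", "-c", str(vcf)]
    # `tabix -p vcf` builds a .tbi index next to the compressed file.
    tabix_argv = ["tabix", "-p", "vcf", str(compressed)]
    return bgzip_argv, tabix_argv, compressed


if __name__ == "__main__":
    bgzip_argv, tabix_argv, compressed = compress_and_index_commands("example.vcf")
    # Shell equivalent: redirect bgzip's stdout into the .gz file, then index it.
    print(" ".join(bgzip_argv), ">", compressed)
    print(" ".join(tabix_argv))
```

To actually run the commands, the two argument lists could be passed to `subprocess.run`, with the output of the first redirected into the `.gz` file.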
94
100
95
-
#### STEP 4: *gvanno* configuration
96
-
97
-
A few elements of the workflow can be configured using the *gvanno* configuration file (i.e. **gvanno.toml**), encoded in [TOML](https://github.com/toml-lang/toml) (an easy to read file format).
98
-
99
-
* Prediction of loss-of-function variants using VEP's LOFTEE plugin can be turned on in the configuration file (`lof_prediction = true`). Do note that this frequently increases the run time for VEP significantly.
100
-
101
101
#### STEP 5: Run example
102
102
103
103
Run the workflow with **gvanno.py**, which takes the following arguments and options:
sample_id Sample identifier - prefix for output files
118
-
--container Run gvanno with docker or singularity
119
-
120
-
optional arguments:
121
-
-h, --help show this help message and exit
122
-
--force_overwrite The script will fail with an error if the output file already exists. Force the overwrite of existing result files by using this flag
123
-
--version show program's version number and exit
124
-
--no_vcf_validate Skip validation of input VCF with Ensembl's vcf-validator
105
+
usage:
106
+
gvanno.py -h [options]
107
+
--query_vcf QUERY_VCF
108
+
--gvanno_dir GVANNO_DIR
109
+
--output_dir OUTPUT_DIR
110
+
--genome_assembly grch37|grch38
111
+
--sample_id SAMPLE_ID
112
+
--container docker|singularity
113
+
114
+
gvanno - workflow for functional and clinical annotation of germline nucleotide variants
115
+
116
+
Required arguments:
117
+
--query_vcf QUERY_VCF
118
+
VCF input file with germline query variants (SNVs/InDels).
119
+
--gvanno_dir GVANNO_DIR
120
+
Directory that contains the gvanno data bundle, e.g. ~/gvanno-1.4.0
121
+
--output_dir OUTPUT_DIR
122
+
Output directory
123
+
--genome_assembly {grch37,grch38}
124
+
Genome assembly build: grch37 or grch38
125
+
--container {docker,singularity}
126
+
Run gvanno with docker or singularity
127
+
--sample_id SAMPLE_ID
128
+
Sample identifier - prefix for output files
129
+
130
+
Optional arguments:
131
+
--force_overwrite By default, the script will fail with an error if any output file already exists.
132
+
You can force the overwrite of existing result files by using this flag, default: False
133
+
--version show program's version number and exit
134
+
--no_vcf_validate Skip validation of input VCF with Ensembl's vcf-validator, default: False
135
+
--lof_prediction Predict loss-of-function variants with Loftee plugin in Variant Effect Predictor (VEP), default: False
136
+
--vep_n_forks VEP_N_FORKS
137
+
Number of forks for Variant Effect Predictor (VEP) processing, default: 4
138
+
--vep_buffer_size VEP_BUFFER_SIZE
139
+
Variant buffer size (variants read into memory simultaneously) for Variant Effect Predictor (VEP) processing
140
+
- set lower to reduce memory usage, default: 5000
141
+
--vep_pick_order VEP_PICK_ORDER
142
+
Comma-separated string of ordered transcript properties for primary variant pick in
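
As an illustration of how the arguments listed above fit together, the sketch below assembles a `gvanno.py` command line. It uses only flags that appear in the usage text; the helper `build_gvanno_command` and all paths and sample names are hypothetical, and the command is built but not executed.

```python
def build_gvanno_command(query_vcf, gvanno_dir, output_dir, genome_assembly,
                         container, sample_id, force_overwrite=False,
                         lof_prediction=False, vep_n_forks=None):
    """Assemble an argument list for gvanno.py from the documented options."""
    if genome_assembly not in ("grch37", "grch38"):
        raise ValueError("genome_assembly must be 'grch37' or 'grch38'")
    if container not in ("docker", "singularity"):
        raise ValueError("container must be 'docker' or 'singularity'")

    argv = [
        "python", "gvanno.py",
        "--query_vcf", query_vcf,
        "--gvanno_dir", gvanno_dir,
        "--output_dir", output_dir,
        "--genome_assembly", genome_assembly,
        "--container", container,
        "--sample_id", sample_id,
    ]
    # Optional flags are appended only when requested.
    if force_overwrite:
        argv.append("--force_overwrite")
    if lof_prediction:
        argv.append("--lof_prediction")
    if vep_n_forks is not None:
        argv += ["--vep_n_forks", str(vep_n_forks)]
    return argv


if __name__ == "__main__":
    # Hypothetical paths and sample identifier, for illustration only.
    cmd = build_gvanno_command("example.vcf.gz", "/home/user/gvanno-1.4.0",
                               "/home/user/results", "grch37", "docker",
                               "example", force_overwrite=True)
    print(" ".join(cmd))
```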