Commit 9e27751

1.4.0 release
1 parent 8382336 commit 9e27751

5 files changed

+147
-188
lines changed

README.md

Lines changed: 88 additions & 59 deletions
@@ -1,26 +1,25 @@
1-
## _gvanno_ - *g*ermline *v*ariant *anno*tator
1+
## _gvanno_ - workflow for functional and clinical annotation of germline nucleotide variants
2+
3+
### Contents
4+
5+
- [Overview](#overview)
6+
- [News](#news)
7+
- [Annotation resources](#annotation-resources)
8+
- [Getting started](#getting-started)
9+
- [Contact](#contact)
210

311
### Overview
412

513
The germline variant annotator (*gvanno*) is a simple, software package intended for analysis and interpretation of human DNA variants of germline origin. Variants and genes are annotated with disease-related and functional associations from a wide range of sources (see below). Technically, the workflow is built with the [Docker](https://www.docker.com) technology, but it can also be installed through the [Singularity](https://sylabs.io/docs/) framework.
614

715
*gvanno* accepts query files encoded in the VCF format, and can analyze both SNVs and short InDels. The workflow relies heavily upon [Ensembl’s Variant Effect Predictor (VEP)](http://www.ensembl.org/info/docs/tools/vep/index.html), and [vcfanno](https://github.com/brentp/vcfanno). It produces an annotated VCF file and a file of tab-separated values (.tsv), the latter listing all annotations per variant record.
816

9-
#### Annotation resources included in _gvanno_ - 1.3.2
10-
11-
* [VEP](http://www.ensembl.org/info/docs/tools/vep/index.html) - Variant Effect Predictor v100.2 (GENCODE v34/v19 as the gene reference dataset)
12-
* [dBNSFP](https://sites.google.com/site/jpopgen/dbNSFP) - Database of non-synonymous functional predictions (v4.1, June 2020)
13-
* [gnomAD](http://gnomad.broadinstitute.org/) - Germline variant frequencies exome-wide (release 2.1, October 2018) - from VEP
14-
* [dbSNP](http://www.ncbi.nlm.nih.gov/SNP/) - Database of short genetic variants (build 153) - from VEP
15-
* [1000 Genomes Project - phase3](ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/release/20130502/) - Germline variant frequencies genome-wide (May 2013) - from VEP
16-
* [ClinVar](http://www.ncbi.nlm.nih.gov/clinvar/) - Database of clinically related variants (June 2020)
17-
* [DisGeNET](http://www.disgenet.org) - Database of gene-disease associations (v7.0, May 2020)
18-
* [Open Targets Platform](https://targetvalidation.org) - Target-disease and target-drug associations (2020_06, June 2020)
19-
* [UniProt/SwissProt KnowledgeBase](http://www.uniprot.org) - Resource on protein sequence and functional information (2020_03, June 2020)
20-
* [Pfam](http://pfam.xfam.org) - Database of protein families and domains (v33.1, May 2020)
21-
* [NHGRI-EBI GWAS Catalog](https://www.ebi.ac.uk/gwas/home) - Catalog of published genome-wide association studies (June 13th 2020)
22-
2317
### News
18+
19+
* September 29th 2020 - **1.4.0 release**
20+
* Data updates (ClinVar, UniProt, GWAS Catalog, Open Targets Platform)
21+
* Software updates (VEP 101)
22+
* Configuration through a TOML file has been removed - all configuration options are now passed as optional arguments to the main Python script (`gvanno.py`)
2423
* June 30th 2020 - **1.3.2 release**
2524
* Data updates (ClinVar, UniProt, GWAS Catalog, Open Targets Platform, Pfam, dbNSFP)
2625
* Using GENCODE v34 as the correct transcript assembly for grch38 (see [issue](https://github.com/Ensembl/ensembl-vep/issues/749))
@@ -33,21 +32,28 @@ The germline variant annotator (*gvanno*) is a simple, software package intended
3332
* November 22nd 2019 - **1.1.0 release**
3433
* Ability to install and run workflow using [Singularity](https://sylabs.io/docs/), excellent contribution by [@oskarvid](https://github.com/oskarvid), see step 1.1 in _Getting Started_
3534
* Data and software updates (ClinVar, UniProt, VEP)
36-
* July 10th 2019 - **1.0.0 release**
37-
* Docker image update - VEP v97 (GENCODE 31/19)
38-
* Data bundle updates: ClinVar, UniProt, GWAS catalog
39-
* May 21st 2019 - **0.9.0 release**
40-
* Data bundle updates: ClinVar, UniProt
41-
* Adding gene-disease associations from [Open Targets Platform](https://targetvalidation.org),([Carvalho-Silva et. al, NAR, 2019](https://www.ncbi.nlm.nih.gov/pubmed/30462303))
42-
* Moved *vcf-validation* configuration to command-line option
35+
36+
37+
### Annotation resources
38+
39+
* [VEP](http://www.ensembl.org/info/docs/tools/vep/index.html) - Variant Effect Predictor v101 (GENCODE v35/v19 as the gene reference dataset)
40+
* [dbNSFP](https://sites.google.com/site/jpopgen/dbNSFP) - Database of non-synonymous functional predictions (v4.1, June 2020)
41+
* [gnomAD](http://gnomad.broadinstitute.org/) - Germline variant frequencies exome-wide (release 2.1, October 2018) - from VEP
42+
* [dbSNP](http://www.ncbi.nlm.nih.gov/SNP/) - Database of short genetic variants (build 153) - from VEP
43+
* [1000 Genomes Project - phase3](ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/release/20130502/) - Germline variant frequencies genome-wide (May 2013) - from VEP
44+
* [ClinVar](http://www.ncbi.nlm.nih.gov/clinvar/) - Database of clinically related variants (August 2020)
45+
* [DisGeNET](http://www.disgenet.org) - Database of gene-disease associations (v7.0, May 2020)
46+
* [Open Targets Platform](https://targetvalidation.org) - Target-disease and target-drug associations (2020_09, September 2020)
47+
* [UniProt/SwissProt KnowledgeBase](http://www.uniprot.org) - Resource on protein sequence and functional information (2020_04, August 2020)
48+
* [Pfam](http://pfam.xfam.org) - Database of protein families and domains (v33.1, May 2020)
49+
* [NHGRI-EBI GWAS Catalog](https://www.ebi.ac.uk/gwas/home) - Catalog of published genome-wide association studies (September 9th 2020)
50+
4351

4452
### Getting started
4553

4654
#### STEP 0: Python
4755

48-
An installation of Python (version _3.6_) is required to run *gvanno*. Check that Python is installed by typing `python --version` in your terminal window. In addition, a [Python library](https://github.com/uiri/toml) for parsing configuration files encoded with [TOML](https://github.com/toml-lang/toml) is needed. To install, simply run the following command:
49-
50-
pip install toml
56+
An installation of Python (version _3.6_) is required to run *gvanno*. Check that Python is installed by typing `python --version` in your terminal window.
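The version check can also be scripted. The following is a minimal sketch, assuming that any Python release from 3.6 onwards is acceptable (the text above only names 3.6); the helper name `python_is_supported` is hypothetical and not part of *gvanno*.

```python
import sys

# Minimum version taken from the README text (Python 3.6).
# Treating later releases as acceptable is an assumption.
MIN_VERSION = (3, 6)


def python_is_supported(version_info=sys.version_info, minimum=MIN_VERSION):
    """Return True if the (major, minor) version is at least `minimum`."""
    return tuple(version_info[:2]) >= minimum


if __name__ == "__main__":
    major, minor = sys.version_info[:2]
    if python_is_supported():
        print("Python %d.%d detected: OK" % (major, minor))
    else:
        print("Python %d.%d detected: 3.6 or later is needed" % (major, minor))
```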
5157

5258
#### STEP 1: Installation of Docker
5359

@@ -74,15 +80,15 @@ An installation of Python (version _3.6_) is required to run *gvanno*. Check tha
7480

7581
#### STEP 2: Download *gvanno* and data bundle
7682

77-
1. Download and unpack the [latest software release (1.3.2)](https://github.com/sigven/gvanno/releases/tag/v1.3.2)
83+
1. Download and unpack the [latest software release (1.4.0)](https://github.com/sigven/gvanno/releases/tag/v1.4.0)
7884
2. Download and unpack the assembly-specific data bundle in the gvanno directory
79-
* [grch37 data bundle](https://drive.google.com/file/d/1XJT8sSngl5T3HHQK2CZtZuwXX3rouEYg/) (approx 16Gb)
80-
* [grch38 data bundle](https://drive.google.com/file/d/1M6gioFzvt6XOqRDTx4UXYD5sIVOH55IY) (approx 17Gb)
85+
* [grch37 data bundle](https://drive.google.com/file/d/1VnABjA3ZCJLlQxhQKcIGaC17MD0kItVd) (approx 16Gb)
86+
* [grch38 data bundle](https://drive.google.com/file/d/13fbKtAFzcUGDnPfruzgK43PvAKiFc8XL/) (approx 17Gb)
8187
* *Unpacking*: `gzip -dc gvanno.databundle.grch37.YYYYMMDD.tgz | tar xvf -`
8288

8389
A _data/_ folder within the _gvanno-X.X_ software folder should now have been produced
84-
3. Pull the [gvanno Docker image (1.3.2)](https://hub.docker.com/r/sigven/gvanno/) from DockerHub (approx 1.9Gb):
85-
* `docker pull sigven/gvanno:1.3.2` (gvanno annotation engine)
90+
3. Pull the [gvanno Docker image (1.4.0)](https://hub.docker.com/r/sigven/gvanno/) from DockerHub (approx 1.9Gb):
91+
* `docker pull sigven/gvanno:1.4.0` (gvanno annotation engine)
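On systems where `gzip` and `tar` are not at hand, the unpacking in step 2 can be sketched with Python's standard `tarfile` module. This is an illustration only: the function name `unpack_bundle` is hypothetical, and it assumes the bundle is a gzipped tar archive with a top-level _data/_ folder, as described above.

```python
import tarfile
from pathlib import Path


def unpack_bundle(bundle_path, gvanno_dir):
    """Extract a gzipped tar data bundle into the gvanno directory.

    Returns the path where the data/ folder is expected to appear
    (assumption: the archive has data/ as its top-level folder).
    """
    target = Path(gvanno_dir)
    target.mkdir(parents=True, exist_ok=True)
    # "r:gz" reads a gzip-compressed tar archive (.tgz / .tar.gz)
    with tarfile.open(str(bundle_path), mode="r:gz") as archive:
        archive.extractall(path=str(target))
    return target / "data"
```

Only unpack archives downloaded from the links given above, since `extractall` writes whatever paths the archive contains.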
8692

8793
#### STEP 3: Input preprocessing
8894

@@ -92,42 +98,65 @@ The *gvanno* workflow accepts a single input file:
9298

9399
We __strongly__ recommend that the input VCF is compressed and indexed using [bgzip](http://www.htslib.org/doc/tabix.html) and [tabix](http://www.htslib.org/doc/tabix.html). NOTE: If the input VCF contains multi-allelic sites, these will be subject to [decomposition](http://genome.sph.umich.edu/wiki/Vt#Decompose).
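Compression and indexing could be scripted roughly as follows. This is a sketch that assumes `bgzip` and `tabix` (from htslib) are installed and on the `PATH`; the function names are hypothetical and not part of *gvanno*.

```python
import shutil
import subprocess


def compression_commands(vcf_path):
    """Return the bgzip and tabix command lines for an uncompressed VCF."""
    return [
        # bgzip replaces vcf_path with vcf_path + ".gz"
        ["bgzip", vcf_path],
        # tabix writes the index next to the compressed file (.tbi)
        ["tabix", "-p", "vcf", vcf_path + ".gz"],
    ]


def compress_and_index(vcf_path):
    """Run bgzip and tabix in sequence; raise if either tool is missing."""
    for tool in ("bgzip", "tabix"):
        if shutil.which(tool) is None:
            raise RuntimeError(tool + " not found on PATH")
    for cmd in compression_commands(vcf_path):
        subprocess.run(cmd, check=True)
    return vcf_path + ".gz"
```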
94100

95-
#### STEP 4: *gvanno* configuration
96-
97-
A few elements of the workflow can be figured using the *gvanno* configuration file (i.e. **gvanno.toml**), encoded in [TOML](https://github.com/toml-lang/toml) (an easy to read file format).
98-
99-
* Prediction of loss-of-function variants using VEP's LOFTEE plugin can be turned on in the configuration file (`lof_prediction = true`). Do note that this frequently increases the run time for VEP significantly.
100-
101101
#### STEP 5: Run example
102102

103103
Run the workflow with **gvanno.py**, which takes the following arguments and options:
104104

105-
usage: gvanno.py [options] <QUERY_VCF> <GVANNO_DIR> <OUTPUT_DIR> <GENOME_ASSEMBLY>
106-
<CONFIG_FILE> <SAMPLE_ID> --container <docker|singularity>
107-
108-
Germline variant annotation (gvanno) workflow for clinical and functional interpretation
109-
of germline nucleotide variants
110-
111-
positional arguments:
112-
query_vcf VCF input file with germline query variants (SNVs/InDels)
113-
gvanno_dir gvanno base directory with accompanying data directory, e.g. ~/gvanno-1.3.2
114-
output_dir Output directory
115-
{grch37,grch38} grch37 or grch38
116-
configuration_file gvanno configuration file (TOML format)
117-
sample_id Sample identifier - prefix for output files
118-
--container Run gvanno with docker or singularity
119-
120-
optional arguments:
121-
-h, --help show this help message and exit
122-
--force_overwrite The script will fail with an error if the output file already exists. Force the overwrite of existing result files by using this flag
123-
--version show program's version number and exit
124-
--no_vcf_validate Skip validation of input VCF with Ensembl's vcf-validator
105+
usage:
106+
gvanno.py -h [options]
107+
--query_vcf QUERY_VCF
108+
--gvanno_dir GVANNO_DIR
109+
--output_dir OUTPUT_DIR
110+
--genome_assembly grch37|grch38
111+
--sample_id SAMPLE_ID
112+
--container docker|singularity
113+
114+
gvanno - workflow for functional and clinical annotation of germline nucleotide variants
115+
116+
Required arguments:
117+
--query_vcf QUERY_VCF
118+
VCF input file with germline query variants (SNVs/InDels).
119+
--gvanno_dir GVANNO_DIR
120+
Directory that contains the gvanno data bundle, e.g. ~/gvanno-1.4.0
121+
--output_dir OUTPUT_DIR
122+
Output directory
123+
--genome_assembly {grch37,grch38}
124+
Genome assembly build: grch37 or grch38
125+
--container {docker,singularity}
126+
Run gvanno with docker or singularity
127+
--sample_id SAMPLE_ID
128+
Sample identifier - prefix for output files
129+
130+
Optional arguments:
131+
--force_overwrite By default, the script will fail with an error if any output file already exists.
132+
You can force the overwrite of existing result files by using this flag, default: False
133+
--version show program's version number and exit
134+
--no_vcf_validate Skip validation of input VCF with Ensembl's vcf-validator, default: False
135+
--lof_prediction Predict loss-of-function variants with Loftee plugin in Variant Effect Predictor (VEP), default: False
136+
--vep_n_forks VEP_N_FORKS
137+
Number of forks for Variant Effect Predictor (VEP) processing, default: 4
138+
--vep_buffer_size VEP_BUFFER_SIZE
139+
Variant buffer size (variants read into memory simultaneously) for Variant Effect Predictor (VEP) processing
140+
- set lower to reduce memory usage, default: 5000
141+
--vep_pick_order VEP_PICK_ORDER
142+
Comma-separated string of ordered transcript properties for primary variant pick in
143+
Variant Effect Predictor (VEP) processing, default: canonical,appris,biotype,ccds,rank,tsl,length,mane
144+
--vep_skip_intergenic
145+
Skip intergenic variants in Variant Effect Predictor (VEP) processing, default: False
146+
--vcfanno_n_processes VCFANNO_N_PROCESSES
147+
Number of processes for vcfanno processing (see https://github.com/brentp/vcfanno#-p), default: 4
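For readers who want to see how such an interface maps onto `argparse`, here is a hypothetical reconstruction based only on the usage text above. It is not the actual `gvanno.py` source; option names and defaults are copied from the help output, everything else (grouping, types, the `build_parser` name) is an assumption.

```python
import argparse


def build_parser():
    """Sketch of a parser mirroring the documented gvanno.py options."""
    parser = argparse.ArgumentParser(prog="gvanno.py")

    required = parser.add_argument_group("Required arguments")
    required.add_argument("--query_vcf", required=True)
    required.add_argument("--gvanno_dir", required=True)
    required.add_argument("--output_dir", required=True)
    required.add_argument("--genome_assembly", required=True,
                          choices=["grch37", "grch38"])
    required.add_argument("--container", required=True,
                          choices=["docker", "singularity"])
    required.add_argument("--sample_id", required=True)

    optional = parser.add_argument_group("Optional arguments")
    # Boolean flags default to False, as stated in the help text
    optional.add_argument("--force_overwrite", action="store_true")
    optional.add_argument("--no_vcf_validate", action="store_true")
    optional.add_argument("--lof_prediction", action="store_true")
    optional.add_argument("--vep_skip_intergenic", action="store_true")
    # Numeric and string options with the documented defaults
    optional.add_argument("--vep_n_forks", type=int, default=4)
    optional.add_argument("--vep_buffer_size", type=int, default=5000)
    optional.add_argument(
        "--vep_pick_order",
        default="canonical,appris,biotype,ccds,rank,tsl,length,mane")
    optional.add_argument("--vcfanno_n_processes", type=int, default=4)
    return parser
```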
125148

126149

127150
The _examples_ folder contains an example VCF file. Analysis of the example VCF can be performed by the following command:
128151

129-
`python ~/gvanno-1.3.2/gvanno.py ~/gvanno-1.3.2/examples/example.grch37.vcf.gz --container docker`
130-
` ~/gvanno-1.3.2 ~/gvanno-1.3.2/examples grch37 ~/gvanno-1.3.2/gvanno.toml example`
152+
python ~/gvanno-1.4.0/gvanno.py
153+
--query_vcf ~/gvanno-1.4.0/examples/example.grch37.vcf.gz
154+
--gvanno_dir ~/gvanno-1.4.0
155+
--output_dir ~/gvanno-1.4.0
156+
--sample_id example
157+
--genome_assembly grch37
158+
--container docker
159+
--force_overwrite
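When the workflow is launched from another script, the same command can be assembled as an argument list, which avoids shell line-continuation issues. The sketch below uses only the flags shown in the example; the helper `build_gvanno_command` is hypothetical, and defaulting the output directory to the gvanno directory simply mirrors the example rather than any documented behaviour.

```python
import shlex


def build_gvanno_command(gvanno_dir, query_vcf, sample_id,
                         genome_assembly="grch37", container="docker",
                         output_dir=None, force_overwrite=True):
    """Return the gvanno.py invocation as a list of arguments."""
    if output_dir is None:
        # Mirrors the example, where --output_dir equals --gvanno_dir
        output_dir = gvanno_dir
    cmd = [
        "python", gvanno_dir.rstrip("/") + "/gvanno.py",
        "--query_vcf", query_vcf,
        "--gvanno_dir", gvanno_dir,
        "--output_dir", output_dir,
        "--sample_id", sample_id,
        "--genome_assembly", genome_assembly,
        "--container", container,
    ]
    if force_overwrite:
        cmd.append("--force_overwrite")
    return cmd


def as_shell_line(cmd):
    """Quote each argument so the list can be pasted into a shell."""
    return " ".join(shlex.quote(part) for part in cmd)
```

The resulting list can be passed to `subprocess.run(cmd, check=True)`; paths containing `~` should be expanded first with `os.path.expanduser`, since no shell is involved.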
131160

132161
This command will run the Docker-based *gvanno* workflow and produce the following output files in the directory given by `--output_dir` (here _~/gvanno-1.4.0_):
133162

@@ -142,4 +171,4 @@ Documentation of the various variant and gene annotations should be interrogated
142171

143172
### Contact
144173

145-
sigven@ifi.uio.no
174+
sigven AT ifi.uio.no
