Commit 67a4935

Merge pull request #14 from sigven/dev
1.5.0 release
2 parents 904160d + 77e31e8 commit 67a4935

13 files changed: +94 -72 lines changed

README.md

Lines changed: 25 additions & 30 deletions
Original file line number | Diff line number | Diff line change
@@ -15,34 +15,29 @@ The germline variant annotator (*gvanno*) is a software package intended for ana
1515
*gvanno* accepts query files encoded in the VCF format, and can analyze both SNVs and short InDels. The workflow relies heavily upon [Ensembl’s Variant Effect Predictor (VEP)](http://www.ensembl.org/info/docs/tools/vep/index.html) and [vcfanno](https://github.com/brentp/vcfanno). It produces an annotated VCF file and a file of tab-separated values (.tsv), the latter listing all annotations per variant record. Note that if your input VCF contains data (genotypes) from multiple samples (i.e. a multisample VCF), the output TSV file will contain one line/record __per sample variant__.
1616

1717
### News
18+
* September 24th 2022 - **1.5.0 release**
19+
* Data updates: ClinVar, GENCODE, GWAS catalog, CancerMine, Open Targets Platform
20+
* Software updates: VEP 107
21+
* Excluded UniProt KB from annotation tracks
1822
* December 21st 2021 - **1.4.4 release**
1923
* Data updates: ClinVar, GWAS catalog, CancerMine, UniProt KB, Open Targets Platform
2024
* Software updates: VEP (v105)
2125
* August 25th 2021 - **1.4.3 release**
2226
* Data updates: ClinVar, GWAS catalog, CancerMine, UniProt, Open Targets Platform
23-
* May 24th 2021 - **1.4.2 release**
24-
* Software update (VEP 104)
25-
* Data updates: ClinVar, GWAS catalog, CancerMine, Pfam, dbNSFP, UniProt
26-
* Two new options added:
27-
* `--vep_regulatory` - annotates variants for overlap with regulatory regions (details below)
28-
* `--docker-uid` - set Docker user id
29-
* New variant annotations for enhanced non-coding interpretation:
30-
* _REGULATORY_ANNOTATION_ : A comma-separated list of regulatory annotations from VEP's `--regulatory` option, i.e. __TF_binding_site__, overlap with __enhancer/promoter/open_chromatin__, __CTCF_binding_site__ etc. Included when the `--vep_regulatory` option is turned on in gvanno.
31-
* _NCER_PERCENTILE_: A genome-wide percentile rank score from the ncER algorithm (**n**on-**c**oding **E**ssential **R**egulation), [Wells et al., Nat Comm. (2019)](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6868241/).
32-
33-
### Annotation resources (v1.4.4)
34-
35-
* [VEP](http://www.ensembl.org/info/docs/tools/vep/index.html) - Variant Effect Predictor v105 (GENCODE v39/v19 as the gene reference dataset)
27+
28+
### Annotation resources (v1.5.0)
29+
30+
* [VEP](http://www.ensembl.org/info/docs/tools/vep/index.html) - Variant Effect Predictor v107 (GENCODE v41/v19 as the gene reference dataset)
3631
* [dBNSFP](https://sites.google.com/site/jpopgen/dbNSFP) - Database of non-synonymous functional predictions (v4.2, March 2021)
3732
* [gnomAD](http://gnomad.broadinstitute.org/) - Germline variant frequencies exome-wide (release 2.1, October 2018) - from VEP
3833
* [dbSNP](http://www.ncbi.nlm.nih.gov/SNP/) - Database of short genetic variants (build 154) - from VEP
3934
* [1000 Genomes Project - phase3](ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/release/20130502/) - Germline variant frequencies genome-wide (May 2013) - from VEP
40-
* [ClinVar](http://www.ncbi.nlm.nih.gov/clinvar/) - Database of variants related to human health/disease phenotypes (December 2021)
41-
* [CancerMine](http://bionlp.bcgsc.ca/cancermine/) - literature-mined database of drivers, oncogenes and tumor suppressors in cancer (version 41, December 2021)
42-
* [Open Targets Platform](https://targetvalidation.org) - Target-disease and target-drug associations (2021_11, Nocember 2021)
43-
* [UniProt/SwissProt KnowledgeBase](http://www.uniprot.org) - Resource on protein sequence and functional information (2021_04, November 2021)
35+
* [ClinVar](http://www.ncbi.nlm.nih.gov/clinvar/) - Database of variants related to human health/disease phenotypes (September 2022)
36+
* [CancerMine](http://bionlp.bcgsc.ca/cancermine/) - literature-mined database of drivers, oncogenes and tumor suppressors in cancer (version 47, July 2022)
37+
* [Open Targets Platform](https://targetvalidation.org) - Target-disease and target-drug associations (2022_06, June 2022)
4438
* [Pfam](http://pfam.xfam.org) - Database of protein families and domains (v35.0, November 2021)
45-
* [NHGRI-EBI GWAS Catalog](https://www.ebi.ac.uk/gwas/home) - Catalog of published genome-wide association studies (December 7th 2021)
39+
* [Mutation hotspots](cancerhotspots.org) - Database of mutation hotspots in cancer
40+
* [NHGRI-EBI GWAS Catalog](https://www.ebi.ac.uk/gwas/home) - Catalog of published genome-wide association studies (August 26th 2022)
4641

4742

4843
### Getting started
@@ -76,17 +71,17 @@ An installation of Python (version >=_3.6_) is required to run *gvanno*. Check t
7671

7772
#### STEP 2: Download *gvanno* and data bundle
7873

79-
1. [Download the latest version](https://github.com/sigven/gvanno/releases/tag/v1.4.4) (gvanno run script, v1.4.4)
74+
1. [Download the latest version](https://github.com/sigven/gvanno/releases/tag/v1.5.0) (gvanno run script, v1.5.0)
8075
2. Download (preferably using `wget`) and unpack the latest assembly-specific data bundle in the gvanno directory
81-
* [grch37 data bundle](http://insilico.hpc.uio.no/pcgr/gvanno/gvanno.databundle.grch37.20211221.tgz) (approx 18Gb)
82-
* [grch38 data bundle](http://insilico.hpc.uio.no/pcgr/gvanno/gvanno.databundle.grch38.20211221.tgz) (approx 20Gb)
76+
* [grch37 data bundle](http://insilico.hpc.uio.no/pcgr/gvanno/gvanno.databundle.grch37.20220921.tgz) (approx 20Gb)
77+
* [grch38 data bundle](http://insilico.hpc.uio.no/pcgr/gvanno/gvanno.databundle.grch38.20220921.tgz) (approx 28Gb)
8378
* Example commands:
84-
* `wget http://insilico.hpc.uio.no/pcgr/gvanno/gvanno.databundle.grch37.20211221.tgz`
79+
* `wget http://insilico.hpc.uio.no/pcgr/gvanno/gvanno.databundle.grch37.20220921.tgz`
8580
* `gzip -dc gvanno.databundle.grch37.YYYYMMDD.tgz | tar xvf -`
8681

87-
A _data/_ folder within the _gvanno-1.4.4_ software folder should now have been produced
88-
3. Pull the [gvanno Docker image (1.4.4)](https://hub.docker.com/r/sigven/gvanno/) from DockerHub (approx 2.2Gb):
89-
* `docker pull sigven/gvanno:1.4.4` (gvanno annotation engine)
82+
A _data/_ folder within the _gvanno-1.5.0_ software folder should now have been produced
83+
3. Pull the [gvanno Docker image (1.5.0)](https://hub.docker.com/r/sigven/gvanno/) from DockerHub (approx 2.2Gb):
84+
* `docker pull sigven/gvanno:1.5.0` (gvanno annotation engine)
9085

9186
#### STEP 3: Input preprocessing
9287

@@ -115,7 +110,7 @@ Run the workflow with **gvanno.py**, which takes the following arguments and opt
115110
--query_vcf QUERY_VCF
116111
VCF input file with germline query variants (SNVs/InDels).
117112
--gvanno_dir GVANNO_DIR
118-
Directory that contains the gvanno data bundle, e.g. ~/gvanno-1.4.4
113+
Directory that contains the gvanno data bundle, e.g. ~/gvanno-1.5.0
119114
--output_dir OUTPUT_DIR
120115
Output directory
121116
--genome_assembly {grch37,grch38}
@@ -152,10 +147,10 @@ Run the workflow with **gvanno.py**, which takes the following arguments and opt
152147

153148
The _examples_ folder contains an example VCF file. Analysis of the example VCF can be performed by the following command:
154149

155-
python ~/gvanno-1.4.4/gvanno.py
156-
--query_vcf ~/gvanno-1.4.4/examples/example.grch37.vcf.gz
157-
--gvanno_dir ~/gvanno-1.4.4
158-
--output_dir ~/gvanno-1.4.4
150+
python ~/gvanno-1.5.0/gvanno.py
151+
--query_vcf ~/gvanno-1.5.0/examples/example.grch37.vcf.gz
152+
--gvanno_dir ~/gvanno-1.5.0
153+
--output_dir ~/gvanno-1.5.0
159154
--sample_id example
160155
--genome_assembly grch37
161156
--container docker
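
The example invocation above can also be assembled programmatically. The following is a minimal sketch, assuming gvanno 1.5.0 has been unpacked to `~/gvanno-1.5.0` as in the README. The helper name `build_example_command` is hypothetical and not part of gvanno; it only builds the argument list shown in this hunk and does not run anything.

```python
import os
import shlex


def build_example_command(gvanno_dir="~/gvanno-1.5.0"):
    """Return the README's example gvanno invocation as an argument list.

    Hypothetical helper for illustration; gvanno itself ships no such function.
    """
    root = os.path.expanduser(gvanno_dir)
    return [
        "python", os.path.join(root, "gvanno.py"),
        "--query_vcf", os.path.join(root, "examples", "example.grch37.vcf.gz"),
        "--gvanno_dir", root,
        "--output_dir", root,
        "--sample_id", "example",
        "--genome_assembly", "grch37",
        "--container", "docker",
    ]


if __name__ == "__main__":
    # Print a copy-pasteable shell command; the list could instead be
    # handed to subprocess.run() to execute the workflow.
    print(shlex.join(build_example_command()))
```

Keeping the arguments in a list avoids shell-quoting problems if the installation path contains spaces.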

data-raw/RELEASE_NOTES

Lines changed: 8 additions & 9 deletions
Original file line numberDiff line numberDiff line change
@@ -1,14 +1,13 @@
1-
##GVANNO_SOFTWARE_VERSION = 1.4.4
2-
##GVANNO_DB_VERSION = 20211221
1+
##GVANNO_SOFTWARE_VERSION = 1.5.0
2+
##GVANNO_DB_VERSION = 20220913
33
pfam = v35.0 (November 2021)
44
ncER = v1.0 (March 2019)
5-
uniprot = release 2021_04
6-
corum = release 3.0 (20180903)
5+
uniprot = release 2022_03
76
onekg = phase 3 (20130502)
8-
dbsnp = build 154/153
7+
dbsnp = build 154
98
dbnsfp = v4.2 (April 2021)
109
gnomad = r2.1 (October 2018)
11-
gwas = December 2021 (20211207)
12-
clinvar = December 2021 (20211130)
13-
opentargets = 2021_11
14-
gencode = 39/19
10+
gwas = August 2022 (20220826)
11+
clinvar = September 2022 (20220831)
12+
opentargets = 2022_06
13+
gencode = 41/19
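
The RELEASE_NOTES layout above (`##`-prefixed header fields followed by `key = value` resource lines) is straightforward to read back. Below is a minimal sketch of such a reader, with the layout inferred from this diff rather than from gvanno's own code. The function name `parse_release_notes` is hypothetical. Note that this commit sets `GVANNO_DB_VERSION` to 20220913 here but to 20220921 in `gvanno.py`; whether gvanno compares the two at run time is not shown in this diff.

```python
def parse_release_notes(text):
    """Parse RELEASE_NOTES-style text into (header, resources) dictionaries.

    Lines starting with '##' are treated as header fields; all other
    'key = value' lines are treated as resource versions. The layout is
    inferred from the diff, not taken from gvanno's own parser.
    """
    header, resources = {}, {}
    for line in text.splitlines():
        line = line.strip()
        if "=" not in line:
            continue
        key, value = (part.strip() for part in line.split("=", 1))
        if key.startswith("##"):
            header[key[2:]] = value
        else:
            resources[key] = value
    return header, resources


# Excerpt of the new RELEASE_NOTES content from this commit.
NOTES = """\
##GVANNO_SOFTWARE_VERSION = 1.5.0
##GVANNO_DB_VERSION = 20220913
pfam = v35.0 (November 2021)
gencode = 41/19
"""

header, resources = parse_release_notes(NOTES)

# DB_VERSION string as set in gvanno.py in this same commit.
script_db_version = "GVANNO_DB_VERSION = 20220921"
notes_db_version = "GVANNO_DB_VERSION = " + header["GVANNO_DB_VERSION"]
versions_match = notes_db_version == script_db_version
```

With the values from this commit, `versions_match` is `False`, which makes the date mismatch between the two files easy to spot before a release.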

examples/example.grch37.vcf.gz

1.58 KB
Binary file not shown.

examples/example.grch37.vcf.gz.tbi

-2.48 KB
Binary file not shown.

gvanno.py

Lines changed: 11 additions & 7 deletions
Original file line numberDiff line numberDiff line change
@@ -11,10 +11,10 @@
1111
import platform
1212
from argparse import RawTextHelpFormatter
1313

14-
GVANNO_VERSION = '1.4.4'
15-
DB_VERSION = 'GVANNO_DB_VERSION = 20211221'
16-
VEP_VERSION = '105'
17-
GENCODE_VERSION = 'v39'
14+
GVANNO_VERSION = '1.5.0'
15+
DB_VERSION = 'GVANNO_DB_VERSION = 20220921'
16+
VEP_VERSION = '107'
17+
GENCODE_VERSION = 'v41'
1818
VEP_ASSEMBLY = "GRCh38"
1919
DOCKER_IMAGE_VERSION = 'sigven/gvanno:' + str(GVANNO_VERSION)
2020

@@ -294,6 +294,9 @@ def run_gvanno(arg_dict, host_directories):
294294
container_command_run2 = container_command_run2 + " -W /workdir/output " + 'src/gvanno.sif' + " sh -c \""
295295
docker_command_run_end = '\"'
296296

297+
#logger.info(container_command_run1)
298+
#logger.info(container_command_run2)
299+
297300
## GVANNO|start - Log key information about sample, options and assembly
298301
logger = getlogger("gvanno-start")
299302
logger.info("--- Germline variant annotation (gvanno) workflow ----")
@@ -306,6 +309,7 @@ def run_gvanno(arg_dict, host_directories):
306309
logger.info("STEP 0: Validate input data")
307310
vcf_validate_command = str(container_command_run1) + "gvanno_validate_input.py " + str(data_dir) + " " + str(input_vcf_docker) + " " + \
308311
str(vcf_validation) + " " + str(arg_dict['genome_assembly']) + docker_command_run_end
312+
#logger.info(vcf_validate_command)
309313

310314
check_subprocess(vcf_validate_command)
311315
logger.info('Finished')
@@ -378,10 +382,10 @@ def run_gvanno(arg_dict, host_directories):
378382
## GVANNO|vcfanno - annotate VCF against a number of variant annotation resources
379383
print()
380384
logger = getlogger('gvanno-vcfanno')
381-
logger.info("STEP 2: Clinical/functional variant annotations with gvanno-vcfanno (ClinVar, ncER, dbNSFP, GWAS catalog, UniProtKB, cancerhotspots.org)")
385+
logger.info("STEP 2: Clinical/functional variant annotations with gvanno-vcfanno (ClinVar, ncER, dbNSFP, GWAS catalog, cancerhotspots.org)")
382386
logger.info('vcfanno configuration - number of processes (-p): ' + str(arg_dict['vcfanno_n_processes']))
383387
gvanno_vcfanno_command = str(container_command_run2) + "gvanno_vcfanno.py --num_processes " + str(arg_dict['vcfanno_n_processes']) + \
384-
" --dbnsfp --clinvar --ncer --uniprot --gvanno_xref --gwas --cancer_hotspots " + str(vep_vcf) + ".gz " + str(vep_vcfanno_vcf) + \
388+
" --dbnsfp --clinvar --ncer --gvanno_xref --gwas --cancer_hotspots " + str(vep_vcf) + ".gz " + str(vep_vcfanno_vcf) + \
385389
" " + os.path.join(data_dir, "data", str(arg_dict['genome_assembly'])) + docker_command_run_end
386390
check_subprocess(gvanno_vcfanno_command)
387391
logger.info("Finished")
@@ -412,7 +416,7 @@ def run_gvanno(arg_dict, host_directories):
412416
print()
413417
## GVANNO|vcf2tsv - convert VCF to TSV with https://github.com/sigven/vcf2tsv
414418
logger = getlogger("gvanno-vcf2tsv")
415-
logger.info("STEP 4: Converting VCF to TSV with https://github.com/sigven/vcf2tsv")
419+
logger.info("STEP 4: Converting VCF to TSV with https://github.com/sigven/vcf2tsvpy")
416420
gvanno_vcf2tsv_command_pass = str(container_command_run2) + "vcf2tsv.py " + str(output_pass_vcf) + " --compress " + str(output_pass_tsv) + docker_command_run_end
417421
gvanno_vcf2tsv_command_all = str(container_command_run2) + "vcf2tsv.py " + str(output_vcf) + " --compress --keep_rejected " + str(output_tsv) + docker_command_run_end
418422
logger.info("Conversion of VCF variant data to records of tab-separated values - PASS variants only")
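
The STEP 2 hunk in this file drops `--uniprot` by editing a hard-coded flag string. As a sketch of an alternative, the flags could be derived from a list of track names, so that enabling or disabling a track is a one-item change. The names `VCFANNO_TRACKS` and `vcfanno_flags` are hypothetical and do not appear in gvanno; only the flag names themselves are taken from the diff.

```python
# Annotation tracks passed to gvanno_vcfanno.py after this commit
# (the 'uniprot' track has been removed).
VCFANNO_TRACKS = ["dbnsfp", "clinvar", "ncer", "gvanno_xref", "gwas", "cancer_hotspots"]


def vcfanno_flags(tracks):
    """Turn track names into the space-separated '--flag' string used in the command."""
    return " ".join("--" + track for track in tracks)


flags = vcfanno_flags(VCFANNO_TRACKS)
```

The resulting string matches the flag sequence in the updated command: `--dbnsfp --clinvar --ncer --gvanno_xref --gwas --cancer_hotspots`.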

src/Dockerfile

Lines changed: 5 additions & 3 deletions
Original file line numberDiff line numberDiff line change
@@ -22,7 +22,7 @@ RUN apt-get update && apt-get -y install \
2222
ENV OPT /opt/vep
2323
ENV OPT_SRC $OPT/src
2424
ENV HTSLIB_DIR $OPT_SRC/htslib
25-
ENV BRANCH release/105
25+
ENV BRANCH release/107
2626

2727
# Working directory
2828
WORKDIR $OPT_SRC
@@ -166,7 +166,7 @@ ENV PERL5LIB $PERL5LIB_TMP
166166
WORKDIR /
167167
ADD loftee_1.0.3.tgz $OPT/src/ensembl-vep/modules
168168
ADD UTRannotator.tgz $OPT/src/ensembl-vep/modules
169-
RUN wget -q "https://raw.githubusercontent.com/Ensembl/VEP_plugins/release/105/NearestExonJB.pm" -O $OPT/src/ensembl-vep/modules/NearestExonJB.pm
169+
RUN wget -q "https://raw.githubusercontent.com/Ensembl/VEP_plugins/release/107/NearestExonJB.pm" -O $OPT/src/ensembl-vep/modules/NearestExonJB.pm
170170

171171

172172
# Final steps
@@ -250,7 +250,9 @@ RUN rm miniconda.sh
250250

251251
# update conda & install vt
252252
RUN /conda/bin/conda update conda
253+
#RUN /conda/bin/conda update python
253254
RUN /conda/bin/conda install -c bioconda vt
255+
#RUN /conda/bin/conda install -c bioconda vcf2tsvpy
254256

255257
## Clean Up
256258
RUN apt-get clean autoclean
@@ -271,7 +273,7 @@ RUN rm -rf $HOME/src/ensembl-vep/t/
271273
RUN rm -f $HOME/src/v335_base.tar.gz
272274
RUN rm -f $HOME/src/release-1-6-924.zip
273275
RUN rm -rf /samtools-1.10.tar.bz2
274-
RUN rm -f /conda/bin/python
276+
#RUN rm -f /conda/bin/python
275277

276278
ADD gvanno.tgz /
277279
ENV PATH=$PATH:/conda/bin:/gvanno

src/buildDocker.sh

Lines changed: 1 addition & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -4,5 +4,5 @@ cp /Users/sigven/research/software/vcf2tsv/vcf2tsv.py gvanno/
44
tar czvfh gvanno.tgz gvanno/
55
echo "Build the Docker Image"
66
TAG=`date "+%Y%m%d"`
7-
docker build --no-cache -t sigven/gvanno:$TAG --rm=true .
7+
docker build -t sigven/gvanno:$TAG --rm=true .
88

src/gvanno.tgz

-47 Bytes
Binary file not shown.

src/gvanno/gvanno_summarise.py

Lines changed: 2 additions & 2 deletions
Original file line numberDiff line numberDiff line change
@@ -33,8 +33,8 @@ def extend_vcf_annotations(query_vcf, gvanno_db_directory, lof_prediction = 0, r
3333
"""
3434

3535
## read VEP and PCGR tags to be appended to VCF file
36-
vcf_infotags_meta = annoutils.read_infotag_file(os.path.join(gvanno_db_directory,'gvanno_infotags.tsv'))
37-
gvanno_xref_map = annoutils.read_genexref_namemap(os.path.join(gvanno_db_directory,'gvanno_xref', 'gvanno_xref.namemap.tsv'))
36+
vcf_infotags_meta = annoutils.read_infotag_file(logger, os.path.join(gvanno_db_directory,'gvanno_infotags.tsv'))
37+
gvanno_xref_map = annoutils.read_genexref_namemap(logger, os.path.join(gvanno_db_directory,'gvanno_xref', 'gvanno_xref_namemap.tsv'))
3838
out_vcf = re.sub(r'\.vcf(\.gz){0,}$','.annotated.vcf',query_vcf)
3939

4040
meta_vep_dbnsfp_info = annoutils.vep_dbnsfp_meta_vcf(query_vcf, vcf_infotags_meta)
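
The `re.sub` call in the context lines above derives the output file name from the query VCF name. The short sketch below isolates that expression to show how it behaves; the wrapper name `annotated_vcf_name` is hypothetical, while the pattern is copied from the diff.

```python
import re


def annotated_vcf_name(query_vcf):
    """Mirror the re.sub call in extend_vcf_annotations: replace a trailing
    '.vcf' (optionally followed by '.gz') with '.annotated.vcf'."""
    return re.sub(r'\.vcf(\.gz){0,}$', '.annotated.vcf', query_vcf)


plain = annotated_vcf_name("sample.vcf")
gzipped = annotated_vcf_name("sample.vcf.gz")
other = annotated_vcf_name("sample.txt")
```

Both `sample.vcf` and `sample.vcf.gz` map to `sample.annotated.vcf` (the `.gz` suffix is dropped), and names without a trailing `.vcf` are returned unchanged. The quantifier `{0,}` is equivalent to `*`.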

src/gvanno/gvanno_validate_input.py

Lines changed: 1 addition & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -34,7 +34,7 @@ def check_existing_vcf_info_tags(input_vcf, gvanno_directory, genome_assembly, l
3434
If any coinciding tags, an error will be returned
3535
"""
3636

37-
gvanno_infotags_desc = annoutils.read_infotag_file(os.path.join(gvanno_directory,'data',genome_assembly,'gvanno_infotags.tsv'))
37+
gvanno_infotags_desc = annoutils.read_infotag_file(logger, os.path.join(gvanno_directory,'data',genome_assembly,'gvanno_infotags.tsv'))
3838

3939
vcf = VCF(input_vcf)
4040
logger.info('Checking if existing INFO tags of query VCF file coincide with gvanno INFO tags')
