Commit 9f48483

Merge pull request #15 from sigven/dev
Coding only
2 parents 67a4935 + 60681c7 commit 9f48483

4 files changed: 35 additions, 24 deletions

README.md

Lines changed: 14 additions & 10 deletions
@@ -15,6 +15,8 @@ The germline variant annotator (*gvanno*) is a software package intended for ana
 *gvanno* accepts query files encoded in the VCF format, and can analyze both SNVs and short InDels. The workflow relies heavily upon [Ensembl’s Variant Effect Predictor (VEP)](http://www.ensembl.org/info/docs/tools/vep/index.html), and [vcfanno](https://github.com/brentp/vcfanno). It produces an annotated VCF file and a file of tab-separated values (.tsv), the latter listing all annotations pr. variant record. Note that if your input VCF contains data (genotypes) from multiple samples (i.e. a multisample VCF), the output TSV file will contain one line/record __per sample variant__.
 
 ### News
+* September 26th 2022 - **1.5.1 release**
+  * Added option `--vep_coding_only` - only report variants that fall into coding regions of transcripts (VEP option `--coding_only`)
 * September 24th 2022 - **1.5.0 release**
   * Data updates: ClinVar, GENCODE GWAS catalog, CancerMine, Open Targets Platform
   * Software updates: VEP 107
@@ -25,7 +27,7 @@ The germline variant annotator (*gvanno*) is a software package intended for ana
 * August 25th 2021 - **1.4.3 release**
   * Data updates: ClinVar, GWAS catalog, CancerMine, UniProt, Open Targets Platform
 
-### Annotation resources (v1.5.0)
+### Annotation resources (v1.5.1)
 
 * [VEP](http://www.ensembl.org/info/docs/tools/vep/index.html) - Variant Effect Predictor v107 (GENCODE v41/v19 as the gene reference dataset)
 * [dBNSFP](https://sites.google.com/site/jpopgen/dbNSFP) - Database of non-synonymous functional predictions (v4.2, March 2021)
@@ -71,17 +73,17 @@ An installation of Python (version >=_3.6_) is required to run *gvanno*. Check t
 
 #### STEP 2: Download *gvanno* and data bundle
 
-1. [Download the latest version](https://github.com/sigven/gvanno/releases/tag/v1.5.0) (gvanno run script, v1.5.0)
+1. [Download the latest version](https://github.com/sigven/gvanno/releases/tag/v1.5.1) (gvanno run script, v1.5.1)
 2. Download (preferably using `wget`) and unpack the latest assembly-specific data bundle in the gvanno directory
    * [grch37 data bundle](http://insilico.hpc.uio.no/pcgr/gvanno/gvanno.databundle.grch37.20220921.tgz) (approx 20Gb)
    * [grch38 data bundle](http://insilico.hpc.uio.no/pcgr/gvanno/gvanno.databundle.grch38.20220921.tgz) (approx 28Gb)
    * Example commands:
      * `wget http://insilico.hpc.uio.no/pcgr/gvanno/gvanno.databundle.grch37.20220921.tgz`
      * `gzip -dc gvanno.databundle.grch37.YYYYMMDD.tgz | tar xvf -`
 
-   A _data/_ folder within the _gvanno-1.5.0_ software folder should now have been produced
-3. Pull the [gvanno Docker image (1.5.0)](https://hub.docker.com/r/sigven/gvanno/) from DockerHub (approx 2.2Gb):
-   * `docker pull sigven/gvanno:1.5.0` (gvanno annotation engine)
+   A _data/_ folder within the _gvanno-1.5.1_ software folder should now have been produced
+3. Pull the [gvanno Docker image (1.5.1)](https://hub.docker.com/r/sigven/gvanno/) from DockerHub (approx 2.2Gb):
+   * `docker pull sigven/gvanno:1.5.1` (gvanno annotation engine)
 
 #### STEP 3: Input preprocessing
 
@@ -110,7 +112,7 @@ Run the workflow with **gvanno.py**, which takes the following arguments and opt
   --query_vcf QUERY_VCF
         VCF input file with germline query variants (SNVs/InDels).
   --gvanno_dir GVANNO_DIR
-        Directory that contains the gvanno data bundle, e.g. ~/gvanno-1.5.0
+        Directory that contains the gvanno data bundle, e.g. ~/gvanno-1.5.1
   --output_dir OUTPUT_DIR
         Output directory
   --genome_assembly {grch37,grch38}
@@ -134,6 +136,8 @@ Run the workflow with **gvanno.py**, which takes the following arguments and opt
         Variant Effect Predictor (VEP) processing, default: canonical,appris,biotype,ccds,rank,tsl,length,mane
   --vep_skip_intergenic
         Skip intergenic variants in Variant Effect Predictor (VEP) processing, default: False
+  --vep_coding_only
+        Only report variants falling into coding regions of transcripts (VEP), default: False
 
 Other optional arguments:
   --force_overwrite By default, the script will fail with an error if any output file already exists.
@@ -147,10 +151,10 @@ Run the workflow with **gvanno.py**, which takes the following arguments and opt
 
 The _examples_ folder contains an example VCF file. Analysis of the example VCF can be performed by the following command:
 
-    python ~/gvanno-1.5.0/gvanno.py
-    --query_vcf ~/gvanno-1.5.0/examples/example.grch37.vcf.gz
-    --gvanno_dir ~/gvanno-1.5.0
-    --output_dir ~/gvanno-1.5.0
+    python ~/gvanno-1.5.1/gvanno.py
+    --query_vcf ~/gvanno-1.5.1/examples/example.grch37.vcf.gz
+    --gvanno_dir ~/gvanno-1.5.1
+    --output_dir ~/gvanno-1.5.1
     --sample_id example
     --genome_assembly grch37
     --container docker
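
The README example can be combined with the option introduced in this release. A minimal sketch that assembles such a command line in Python, assuming the install path `~/gvanno-1.5.1` from the README example and only the arguments visible in the hunk above; it prints the command and does not execute anything:

```python
# Assumed install location, taken from the README example
gvanno_dir = "~/gvanno-1.5.1"

cmd = [
    "python", f"{gvanno_dir}/gvanno.py",
    "--query_vcf", f"{gvanno_dir}/examples/example.grch37.vcf.gz",
    "--gvanno_dir", gvanno_dir,
    "--output_dir", gvanno_dir,
    "--sample_id", "example",
    "--genome_assembly", "grch37",
    "--container", "docker",
    "--vep_coding_only",  # new in 1.5.1: restrict VEP output to coding regions
]

# Print the command for copy/paste into a shell; nothing is run here
print(" ".join(cmd))
```

Without `--vep_coding_only` the list reproduces the README example unchanged.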

gvanno.py

Lines changed: 14 additions & 10 deletions
@@ -11,7 +11,7 @@
 import platform
 from argparse import RawTextHelpFormatter
 
-GVANNO_VERSION = '1.5.0'
+GVANNO_VERSION = '1.5.1'
 DB_VERSION = 'GVANNO_DB_VERSION = 20220921'
 VEP_VERSION = '107'
 GENCODE_VERSION = 'v41'
@@ -37,16 +37,17 @@ def __main__():
    optional.add_argument('--version', action='version', version='%(prog)s ' + str(GVANNO_VERSION))
    optional.add_argument('--no_vcf_validate', action = "store_true",help="Skip validation of input VCF with Ensembl's vcf-validator, default: %(default)s")
    optional.add_argument('--docker_uid', dest = 'docker_user_id', help = 'Docker user ID. default is the host system user ID. If you are experiencing permission errors, try setting this up to root (`--docker-uid root`)')
-   optional_vep.add_argument('--vep_regulatory', action='store_true', help = 'Enable Variant Effect Predictor (VEP) to look for overlap with regulatory regions (option --regulatory in VEP).')
-   optional_vep.add_argument('--vep_gencode_all', action='store_true', help = 'Consider all GENCODE transcripts with Variant Effect Predictor (VEP) (option --gencode_basic in VEP is used by default in gvanno).')
+   optional_vep.add_argument('--vep_regulatory', action='store_true', help = 'Enable VEP to look for overlap with regulatory regions (option --regulatory in VEP).')
+   optional_vep.add_argument('--vep_gencode_all', action='store_true', help = 'Consider all GENCODE transcripts with VEP (option --gencode_basic in VEP is used by default in gvanno).')
    optional_vep.add_argument('--vep_lof_prediction', action = "store_true", help = "Predict loss-of-function variants with Loftee plugin " + \
-      "in Variant Effect Predictor (VEP), default: %(default)s")
-   optional_vep.add_argument('--vep_n_forks', default = 4, help="Number of forks for Variant Effect Predictor (VEP) processing, default: %(default)s")
+      "in VEP, default: %(default)s")
+   optional_vep.add_argument('--vep_n_forks', default = 4, help="Number of forks for VEP processing, default: %(default)s")
    optional_vep.add_argument('--vep_buffer_size', default = 500, help="Variant buffer size (variants read into memory simultaneously) " + \
-      "for Variant Effect Predictor (VEP) processing\n- set lower to reduce memory usage, higher to increase speed, default: %(default)s")
+      "for VEP processing\n- set lower to reduce memory usage, higher to increase speed, default: %(default)s")
    optional_vep.add_argument('--vep_pick_order', default = "canonical,appris,biotype,ccds,rank,tsl,length,mane", help="Comma-separated string " + \
-      "of ordered transcript properties for primary variant pick in\nVariant Effect Predictor (VEP) processing, default: %(default)s")
-   optional_vep.add_argument('--vep_skip_intergenic', action = "store_true", help="Skip intergenic variants in Variant Effect Predictor (VEP) processing, default: %(default)s")
+      "of ordered transcript properties for primary variant pick in\nVEP processing, default: %(default)s")
+   optional_vep.add_argument('--vep_skip_intergenic', action = "store_true", help="Skip intergenic variants (VEP), default: %(default)s")
+   optional_vep.add_argument('--vep_coding_only', action = "store_true", help="Only return consequences that fall in the coding regions of transcripts (VEP), default: %(default)s")
    optional.add_argument('--vcfanno_n_processes', default = 4, help="Number of processes for vcfanno " + \
       "processing (see https://github.com/brentp/vcfanno#-p), default: %(default)s")
    required.add_argument('--query_vcf', help='VCF input file with germline query variants (SNVs/InDels).', required = True)
@@ -347,6 +348,8 @@ def run_gvanno(arg_dict, host_directories):
    gencode_set_in_use = "GENCODE - basic transcript set (--gencode_basic)"
    if arg_dict['vep_skip_intergenic'] == 1:
       vep_options = vep_options + " --no_intergenic"
+   if arg_dict['vep_coding_only'] == 1:
+      vep_options = vep_options + " --coding_only"
    if arg_dict['vep_regulatory'] == 1:
       vep_options = vep_options + " --regulatory"
    if arg_dict['vep_lof_prediction'] == 1:
@@ -362,13 +365,14 @@ def run_gvanno(arg_dict, host_directories):
    ## GVANNO|VEP - run consequence annotation with Variant Effect Predictor
    logger = getlogger('gvanno-vep')
    print()
-   logger.info("STEP 1: Basic variant annotation with Variant Effect Predictor (" + str(VEP_VERSION) + ", GENCODE " + str(GENCODE_VERSION) + ", " + str(arg_dict['genome_assembly']) + ")")
-   logger.info("VEP configuration - one primary consequence block pr. alternative allele (--flack_pick_allele)")
+   logger.info("STEP 1: Basic variant annotation with Variant Effect Predictor (v" + str(VEP_VERSION) + ", GENCODE " + str(GENCODE_VERSION) + ", " + str(arg_dict['genome_assembly']) + ")")
+   logger.info("VEP configuration - one primary consequence block pr. alternative allele (--flag_pick_allele)")
    logger.info("VEP configuration - transcript pick order: " + str(arg_dict['vep_pick_order']))
    logger.info("VEP configuration - transcript pick order: See more at https://www.ensembl.org/info/docs/tools/vep/script/vep_other.html#pick_options")
    logger.info("VEP configuration - GENCODE set: " + str(gencode_set_in_use))
    logger.info("VEP configuration - buffer size: " + str(arg_dict['vep_buffer_size']))
    logger.info("VEP configuration - skip intergenic: " + str(arg_dict['vep_skip_intergenic']))
+   logger.info("VEP configuration - coding only: " + str(arg_dict['vep_coding_only']))
    logger.info("VEP configuration - look for overlap with regulatory regions: " + str(arg_dict['vep_regulatory']))
    logger.info("VEP configuration - number of forks: " + str(arg_dict['vep_n_forks']))
    logger.info("VEP configuration - loss-of-function prediction: " + str(arg_dict['vep_lof_prediction']))
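
The two parts of this change in `gvanno.py` fit together as follows: `--vep_coding_only` is registered as a `store_true` flag, so it defaults to `False`, and when it is set the string ` --coding_only` is appended to the VEP options. A minimal, self-contained sketch of that flow, under stated assumptions: the helper names `build_parser` and `extend_vep_options` and the base option string are hypothetical and are not taken from gvanno.

```python
import argparse


def build_parser():
    # Hypothetical, trimmed-down parser holding only the two VEP filter flags
    parser = argparse.ArgumentParser(prog="gvanno_sketch")
    optional_vep = parser.add_argument_group("VEP optional arguments")
    optional_vep.add_argument(
        "--vep_skip_intergenic", action="store_true",
        help="Skip intergenic variants (VEP), default: %(default)s")
    optional_vep.add_argument(
        "--vep_coding_only", action="store_true",
        help="Only return consequences that fall in the coding regions "
             "of transcripts (VEP), default: %(default)s")
    return parser


def extend_vep_options(vep_options, arg_dict):
    # Append one VEP flag per enabled option (flag names as they appear in the diff)
    flag_map = [
        ("vep_skip_intergenic", "--no_intergenic"),
        ("vep_coding_only", "--coding_only"),
    ]
    for key, vep_flag in flag_map:
        if arg_dict.get(key):
            vep_options += " " + vep_flag
    return vep_options


# Placeholder base options, chosen only for illustration
base_options = "--vcf --quiet"

default_args = vars(build_parser().parse_args([]))
coding_args = vars(build_parser().parse_args(["--vep_coding_only"]))

print(extend_vep_options(base_options, default_args))  # --vcf --quiet
print(extend_vep_options(base_options, coding_args))   # --vcf --quiet --coding_only
```

Because `True == 1` in Python, the `arg_dict['vep_coding_only'] == 1` comparison in the diff behaves the same as the truthiness check used in this sketch.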

src/gvanno.tgz

52 Bytes
Binary file not shown.

src/gvanno/gvanno_summarise.py

Lines changed: 7 additions & 4 deletions
@@ -58,6 +58,7 @@ def extend_vcf_annotations(query_vcf, gvanno_db_directory, lof_prediction = 0, r
    w = Writer(out_vcf, vcf)
    current_chrom = None
    num_chromosome_records_processed = 0
+   num_records_filtered = 0
 
    vcf_info_element_types = {}
    for e in vcf.header_iter():
@@ -77,10 +78,11 @@ def extend_vcf_annotations(query_vcf, gvanno_db_directory, lof_prediction = 0, r
          current_chrom = str(rec.CHROM)
          num_chromosome_records_processed = 0
       if rec.INFO.get('CSQ') is None:
-         alt_allele = ','.join(rec.ALT)
-         pos = rec.start + 1
-         variant_id = 'g.' + str(rec.CHROM) + ':' + str(pos) + str(rec.REF) + '>' + alt_allele
-         logger.warning('Variant record ' + str(variant_id) + ' does not have CSQ tag from Variant Effect Predictor (vep_skip_intergenic in config set to true?) - variant will be skipped')
+         num_records_filtered = num_records_filtered + 1
+         #alt_allele = ','.join(rec.ALT)
+         #pos = rec.start + 1
+         #variant_id = 'g.' + str(rec.CHROM) + ':' + str(pos) + str(rec.REF) + '>' + alt_allele
+         #logger.warning('Variant record ' + str(variant_id) + ' does not have CSQ tag from Variant Effect Predictor (--vep_skip_intergenic or --vep_coding_only turned ON?) - variant will be skipped')
          continue
       num_chromosome_records_processed += 1
       gvanno_xref = annoutils.make_transcript_xref_map(rec, gvanno_xref_map, xref_tag = "GVANNO_XREF")
@@ -116,6 +118,7 @@ def extend_vcf_annotations(query_vcf, gvanno_db_directory, lof_prediction = 0, r
    w.close()
    logger.info('Completed summary of functional annotations for ' + str(num_chromosome_records_processed) + ' variants on chromosome ' + str(current_chrom))
    vcf.close()
+   logger.info("Number of variant calls filtered by VEP (No CSQ tag, '--vep_coding_only' / '--vep_skip_intergenic'): " + str(num_records_filtered))
 
    if os.path.exists(out_vcf):
       if os.path.getsize(out_vcf) > 0:
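
The change in `gvanno_summarise.py` replaces one warning per skipped record with a single counter that is reported once at the end. A minimal sketch of that pattern, assuming plain dictionaries as stand-ins for the VCF record objects so that it runs without an input file; the function name `split_by_csq` and the example records are hypothetical:

```python
def split_by_csq(records):
    # Each record is a plain dict standing in for a VCF record;
    # rec["INFO"].get("CSQ") mirrors rec.INFO.get('CSQ') in the diff
    kept = []
    num_records_filtered = 0
    for rec in records:
        if rec["INFO"].get("CSQ") is None:
            # No VEP consequence block: count the record and skip it
            num_records_filtered += 1
            continue
        kept.append(rec)
    return kept, num_records_filtered


# Made-up records for illustration only
records = [
    {"CHROM": "1", "POS": 1000, "INFO": {"CSQ": "A|missense_variant"}},
    {"CHROM": "1", "POS": 2000, "INFO": {}},
    {"CHROM": "2", "POS": 3000, "INFO": {"CSQ": "T|synonymous_variant"}},
]

kept, n_filtered = split_by_csq(records)
print("Number of variant calls filtered by VEP (no CSQ tag): " + str(n_filtered))
```

Reporting a single total keeps the log short when `--vep_coding_only` or `--vep_skip_intergenic` removes many records, at the cost of no longer naming the individual variants that were skipped.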
