Annotating or Filtering Variants with Other VCFs

David A. Parry edited this page Jul 9, 2020 · 2 revisions

VCF Filtering/Annotation

Because VCF/BCF files are the de facto standard for storing genetic variants, VASE is designed to annotate and filter variants using information in other VCFs without converting them to another file or database format. This means VCFs downloaded directly from gnomAD or dbSNP should work with VASE without any conversion steps, as long as your input VCF/BCF file uses the same genome version as your gnomAD/dbSNP VCF. It also means that you can use your own VCFs (e.g. from a control cohort) to perform filtering.

VCF Filtering Requirements

To use a VCF for filtering or annotation (--vcf_filter) it must be either a bgzip-compressed and tabix-indexed VCF or an indexed BCF.

Your input (--input) VCF does not need to be compressed or sorted, but a sorted input VCF is highly recommended because it allows VASE to use a much faster look-up algorithm.
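As a quick pre-flight check, the requirement above can be sketched in shell. The helper names has_index and check_filter_vcf are hypothetical and not part of VASE, and the extension test is only a heuristic (a plain gzip file named .vcf.gz would still pass it). The comments note the usual htslib/bcftools commands for producing a suitable file, assuming those tools are installed.

```shell
# Typical preparation (assumes htslib and bcftools are installed):
#   bgzip controls.vcf               # writes controls.vcf.gz
#   tabix -p vcf controls.vcf.gz     # writes controls.vcf.gz.tbi
#   bcftools index controls.bcf      # writes controls.bcf.csi

# Hypothetical helper: succeed if a tabix (.tbi) or CSI (.csi) index sits next to the file.
has_index() {
    [ -e "$1.tbi" ] || [ -e "$1.csi" ]
}

# Hypothetical helper: report whether a file looks usable with --vcf_filter.
check_filter_vcf() {
    case "$1" in
        *.vcf.gz|*.vcf.bgz|*.bcf) ;;
        *) echo "$1: expected a bgzip-compressed VCF or a BCF"; return 1 ;;
    esac
    if has_index "$1"; then
        echo "$1: ok"
    else
        echo "$1: no index found"
        return 1
    fi
}
```

For example, check_filter_vcf controls.vcf.gz prints "controls.vcf.gz: ok" only when controls.vcf.gz.tbi or controls.vcf.gz.csi exists alongside it.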

Custom VCF Filtering

VASE provides convenience methods for filtering using gnomAD, dbSNP or ClinVar data (see below). To provide your own VCF filter for filtering or annotating your input you can use the --vcf_filter argument as follows:

--vcf_filter <your_data.vcf.gz>,<label>[,<CUSTOM_INFO_FIELD1>,<CUSTOM_INFO_FIELD2>]

VASE will automatically read the AF and AN fields in the given VCF(s) and add these annotations to your VCF. If the --freq or --min_freq arguments are given, the AF annotation will be used to filter variants based on their allele frequency in the given VCF(s). You must provide a label for your VCF(s) by adding a comma and label after the filename. VASE will then annotate AF and AN fields from your VCF(s) in the output as VASE_label_AF and VASE_label_AN.

If your VCF(s) provided to --vcf_filter have other custom INFO fields you want to annotate matching variants with, you can provide these separated by commas after the label.
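To make the file,label,FIELD layout concrete, here is a small shell sketch that splits one --vcf_filter value and prints the AF/AN annotation names described above. The function describe_spec is hypothetical and only illustrates the naming convention; it is not how VASE itself parses the argument.

```shell
# Hypothetical helper: split "file,label[,FIELD...]" and show the resulting annotation names.
describe_spec() {
    spec=$1
    file=${spec%%,*}      # text before the first comma
    rest=${spec#*,}       # text after the first comma
    if [ "$rest" = "$spec" ]; then
        echo "missing label in: $spec"
        return 1
    fi
    label=${rest%%,*}     # label is the second comma-separated item
    echo "file=$file"
    echo "annotations=VASE_${label}_AF,VASE_${label}_AN"
    case "$rest" in
        *,*) echo "custom_fields=${rest#*,}" ;;   # anything after the label
    esac
}

describe_spec custom.vcf.gz,Custom1,FOO,BAR
```

Calling it with controls.vcf.gz,ControlCohort instead prints only the file and annotation lines, since no custom INFO fields are requested.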

In the example below, we provide two VCFs to be used as VCF filters. The first contains variants from a control cohort and will be used to add AF and AN data labelled in the output as VASE_ControlCohort_AF and VASE_ControlCohort_AN. For the second VCF we want to add data from custom INFO fields called FOO and BAR to matching variants in our input.

vase -i input.bcf --vcf_filter controls.vcf.gz,ControlCohort custom.vcf.gz,Custom1,FOO,BAR -o annotated.bcf

If you also want to output only variants with an allele frequency (AF) < 1% in these VCFs:

vase -i input.bcf --vcf_filter controls.vcf.gz,ControlCohort custom.vcf.gz,Custom1,FOO,BAR --freq 0.01 -o filtered.bcf

If you want to filter on these AF annotations in a later run, you can use the --info_filters option as follows:

vase -i annotated.bcf --info_filters "VASE_ControlCohort_AF<0.01"

Special Case VCF Filters

VASE contains pre-set VCF filter options for some commonly used datasets. Furthermore, if a VCF is pre-annotated with these types of VCFs, VASE can read those annotations on subsequent runs without having to provide these VCFs again (this behaviour can be overridden by the --ignore_existing_annotations option if desired). So, you may want to annotate a VCF with gnomAD allele frequencies once so that you can later perform faster filtering at different allele frequencies. For example, after annotating your VCF, you might want to perform a recessive segregation analysis using a 1% allele frequency threshold and a dominant segregation analysis at a stricter 0.1% frequency threshold.
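A minimal sketch of that annotate-once, filter-many workflow, using only options described on this page. The run wrapper, the annotate_then_filter function and the output file names are illustrative assumptions; by default the wrapper only prints the commands, so set DRY_RUN=0 to actually execute them. Segregation options are deliberately left out.

```shell
# Hypothetical wrapper: print the command unless DRY_RUN=0 is set.
run() {
    if [ "${DRY_RUN:-1}" = 1 ]; then
        echo "$*"
    else
        "$@"
    fi
}

annotate_then_filter() {
    # Annotate once with gnomAD allele frequencies.
    run vase -i input.bcf \
        -g gnomad.exomes.r2.1.1.sites.vcf.bgz gnomad.genomes.r2.1.1.sites.vcf.bgz \
        -o annotated.bcf
    # Reuse the existing annotations at a 1% and then a 0.1% threshold.
    for freq in 0.01 0.001; do
        run vase -i annotated.bcf --freq "$freq" -o "filtered_${freq}.bcf"
    done
}

annotate_then_filter
```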

The special case VCF filters are detailed below.

gnomAD

Use the -g/--gnomad option to point VASE to gnomAD VCFs. By default, VASE annotates variants in your input that match variants in any of the given gnomAD VCFs with allele frequencies for the following subpopulations: AFR, AMR, EAS, FIN, NFE and SAS. These defaults were chosen because each has a large number of samples in gnomAD v2.1, so allele frequency filtering can be performed at strict thresholds with relative confidence using these populations. You can choose a different set of populations by passing the desired three-letter codes to the --gnomad_pops argument. For reference, the three-letter codes include:

afr    African-American/African ancestry
ami    Amish ancestry
sas    South Asian ancestry
amr    Latino ancestry
eas    East Asian ancestry
nfe    Non-Finnish European ancestry
fin    Finnish ancestry
asj    Ashkenazi Jewish ancestry
oth    Other ancestry

If the --freq argument is used when running VASE, VASE will filter variants if they are found in the gnomAD VCFs and have a frequency equal to or greater than the given value in any of the populations supplied to --gnomad_pops or the default populations if the --gnomad_pops argument is not used.
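The rule above (a variant is removed when its frequency is equal to or greater than the threshold in any selected population) can be illustrated with a small shell sketch. The function would_be_filtered is a hypothetical helper, not a VASE command, and the frequencies passed to it are made-up values; awk is used only for the floating-point comparison.

```shell
# Hypothetical helper: succeed (exit 0) if any given allele frequency is >= the threshold.
# Usage: would_be_filtered THRESHOLD AF [AF...]
would_be_filtered() {
    threshold=$1
    shift
    for af in "$@"; do
        # awk exits 0 when af >= threshold; "+ 0" forces a numeric comparison.
        if awk -v af="$af" -v t="$threshold" 'BEGIN { exit !(af + 0 >= t + 0) }'; then
            return 0
        fi
    done
    return 1
}

# Made-up per-population frequencies for one variant, checked against --freq 0.01.
if would_be_filtered 0.01 0.0001 0.02 0.003; then
    echo "variant would be filtered"
else
    echo "variant would be kept"
fi
```

Here one population has a frequency of 0.02, so the variant counts as filtered even though the other two populations are below the threshold.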

Additionally, the --max_gnomad_homozygotes argument is a special case where variants can be filtered if any of these populations have samples that are homozygous (or hemizygous) for matching variants. For example, if you are attempting to identify variants causing a rare recessive condition you might want to add --max_gnomad_homozygotes 1 to remove any variant found in the homozygous state in gnomAD.

dbSNP

VCFs from dbSNP contain allele frequencies from 1000 Genomes and TOPMED which can be parsed and used for annotating or filtering. Use the --dbsnp argument to provide dbSNP VCFs. The --freq argument can be used to filter variants with an allele frequency equal to or greater than this value in the 1000 Genomes or TOPMED annotations in your dbSNP VCF.

The --build and --max_build arguments also provide the option to filter variants that were present in given versions of dbSNP. For example, --build 129 would filter variants if they were present in dbSNP v129 or earlier. However, while filtering based on dbSNP versions was a useful method for filtering out likely common variants before the availability of data from large projects such as gnomAD and TOPMED, this method is not generally recommended except for backwards compatibility with older analyses.

ClinVar

VCFs from ClinVar can also be used as a special case for annotating variants. These VCFs have information on whether an allele has been annotated as pathogenic, likely pathogenic, uncertain significance, likely benign or benign in ClinVar. In addition to adding these annotations for matching variants to your VCF, you can provide the --clinvar_path flag to tell VASE to retain variants that are labelled as likely pathogenic or pathogenic in ClinVar even if they otherwise fail allele frequency filters.

Outputting Matching/Novel Variants

VASE provides two options, --filter_known and --filter_novel, for removing variants that are present in or absent from other VCFs, respectively. For example, if we wanted to remove any variant present in another VCF:

vase -i input.bcf --vcf_filter other.vcf.gz,Other --filter_known -o novel.bcf

Or to output only variants that are present in another VCF:

vase -i input.bcf --vcf_filter other.vcf.gz,Other --filter_novel -o known.bcf

If we only wanted to output variants that are relatively common (5% or higher) in gnomAD populations:

vase -i input.bcf \
-g gnomad.exomes.r2.1.1.sites.vcf.bgz gnomad.genomes.r2.1.1.sites.vcf.bgz \
--min_freq 0.05 \
--filter_novel \
-o common_vars.bcf

Like the --freq and --min_freq arguments, --filter_known and --filter_novel will by default identify previously annotated gnomAD and dbSNP information and filter accordingly. So, the above example will also work for a file that has previously been annotated with gnomAD information:

#annotate gnomAD frequency information
vase -i input.bcf \
-g gnomad.exomes.r2.1.1.sites.vcf.bgz gnomad.genomes.r2.1.1.sites.vcf.bgz \
-o annotated.bcf

#filter pre-annotated VCF
vase -i annotated.bcf \
--min_freq 0.05 \
--filter_novel \
-o common_vars.bcf
