Variant-calling-using-GATK4

Variant calling using GATK4 for a single sample

Objectives

Take raw DNA sequencing reads and perform variant calling to produce a variant list using GATK4.
Perform basic exploration of variants.

Description

This analysis runs through the GATK4 best practices workflow for variant calling. The workflow starts with pairs of sequencing reads and performs a series of steps to determine a set of genetic variants.

Data: Illumina HiSeq paired-end (2×100 bp) chr20 reads in FASTQ format. Source: Sample NA12878 from IGSR Reference: Grch38/Hg38 GATK bundle of reference

Tools and installation

The following tools are used in this analysis. Most of these tools are installed using Homebrew:

fastqc

brew install fastqc

multiqc

pip install multiqc

samtools 1.19

brew install samtools

bwa 0.7.17

brew install bwa

picard 3.1.1

Install picard.jar and most recent Java version using brew

brew install openjdk@17

GATK 4.1.3.0

This analysis is performed using GATK docker interactively:

docker pull broadinstitute/gatk:4.1.3.0

bcftools

brew install bcftools

Data preprocessing

Data preprocessing includes read trimming, alignment, sorting by coordinate, and marking duplicates.

Quality control checks on the raw fastq data

FastQC (Fast Quality Control) is designed to assess the quality of high-throughput sequencing data. Its main purpose is to provide a comprehensive analysis of various quality-related metrics and characteristics of raw sequencing data. Here are the primary purposes and features of FastQC:

Per-Base Sequence Quality
Per-Sequence Quality Scores
Sequence Length Distribution
GC Content Distribution
Sequence Duplication Levels
Overrepresented Sequences
Adapter Content
Kmer Content
Quality per Base Position

Run fastqc on both read 1 and read 2 fastq files:

fastqc NA12878.chr20.region_1.fastq.gz -o fatsqc_results/

MultiQC is a tool used to aggregate, analyze, and visualize quality control (QC) results from multiple samples. It creates a unified report and interactive HTML reports to view QC metrics of both NA12878.chr20.region_1.fastq.gz and NA12878.chr20.region_2.fastq.gz files.

multiqc fatsqc_results/

Results: multiqc_report.html report depicts good sequence/read quality with good quality scores, GC content and sequence length distribution.

Map raw mapped reads to reference genome

Map the raw sequencing data to the Homo sapiens genome assembly GRCh38 (hg38) using the BWA-MEM algorithms. Use Samtools to convert the output to the BAM format.

bwa mem -M -t 2 -R "@RG\tID:SRR622461.7\tSM:NA12878\tLB:ERR194147\tPL:ILLUMINA" \
reference/Homo_sapiens_assembly38.fasta \
NA12878.chr20.region_1.fastq.gz \
NA12878.chr20.region_2.fastq.gz | \
samtools view -b -h -o NA12878.bam -

At the end of this step you should have a file called NA12878.bam.

Sort SAM/BAM

The alignment file NA12878.bam is sorted using the Picard tools.

java -jar -Xmx7g picard.jar SortSam -I NA12878.bam -O NA12878.sort.bam

The above command will create a coordinate sorted BAM file and an index (.bai) file.

Mark duplicate reads

The aim of this step is to locate and tag duplicate reads in the BAM file.

java -jar -Xmx7g picard.jar MarkDuplicates -I NA12878.sort.bam -O NA12878.sort.dup.bam -M marked_dup_metrics.txt

Base quality recalibration

The last step of pre-processing mapped reads is the base quality score recalibration (BQSR) stage. The GATK tools detects systematic errors made by the sequencing machine while estimating the accuracy of each base. The systematic errors can have various sources ranging from technical machine errors to the variability in the sequencing chemical reactions.

We need the GATK tool on its docker from here on. After pulling the docker image as mentioned in the installation step, start a GATK container interactively by mounting your directory with required files:

docker run -v /Users/ramaiahr/Documents/wgs:/gatk/my_data -it broadinstitute/gatk:4.1.3.0

The following two step BQSR process applies machine learning to model the possible errors and adjust the base quality scores accordingly.

gatk BaseRecalibrator -I my_data/NA12878.sort.dup.bam \
-R my_data/reference/Homo_sapiens_assembly38.fasta \
--known-sites my_data/reference/dbsnp_146.hg38.vcf.gz \
-O recal_data.table

gatk ApplyBQSR -I my_data/NA12878.sort.dup.bam \
-R my_data/reference/Homo_sapiens_assembly38.fasta \
--bqsr-recal-file my_data/recal_data.table \
-O my_data/NA12878.sort.dup.bqsr.bam

We now have a pre-processed BAM file (NA12878.sort.dup.bqsr.bam) ready for variant calling. 3. The command below uses Picard to generate QC metrics. Run multiQC to aggregate it with fastq data and produce an HTML report.

java -jar -Xmx7g picard.jar CollectMultipleMetrics \
-I my_data/NA12878.sort.dup.bqsr.bam \
-R my_data/reference/Homo_sapiens_assembly38.fasta \
-O my_data/NA12878.sort.dup.bqsr.CollectMultipleMetrics

Variant Calling using HaplotypeCaller

HaplotypeCaller is the focal tool within GATK4 to simultaneously call germline SNVs and small Indels using local de novo assembly of haplotype regions. It is developed by the Broad Institute and is widely used in variant calling workflows.

Briefly, the HaplotypeCaller works by:

Identify active regions or regions with evidence of variations.
Re-asssemble the active regions.
Re-align active region reads to the assembled regions to identify allele.

gatk HaplotypeCaller -I my_data/NA12878.sort.dup.bqsr.bam \
-R my_data/reference/Homo_sapiens_assembly38.fasta \
-L chr20 \
-O my_data/NA12878.vcf.gz

The output will be a VCF (Variant Call Format) file containing the called variants.

GATK's SelectVariants tool is used to filter and select specific subsets of variants from a Variant Call Format (VCF) file based on certain criteria. This tool allows you to tailor your variant dataset to include only the variants that meet specified conditions, facilitating downstream analyses and focusing on relevant portions of the data.

To create separate VCF files for SNPs (Single Nucleotide Polymorphisms) and INDELS, using the following commands:

gatk SelectVariants -V my_data/NA12878.vcf.gz -R my_data/reference/Homo_sapiens_assembly38.fasta --select-type-to-include SNP -O my_data/output/raw_snps.vcf

gatk SelectVariants -V my_data/NA12878.vcf.gz -R my_data/reference/Homo_sapiens_assembly38.fasta --select-type-to-include INDEL -O my_data/output/raw_indel.vcf

Results: Totally, 7517 variants were recorded. 2 diferent vcf files with 6055 SNP variants and 1452 INDEL variants respectively were created.

Filter and prepare analysis ready variants

Consider the following method to filter a single sample VCF file. Here we will go through the Convolutional Neural Net based protocol to annotate and filter the VCF file.

This is a two step process:

(i) CNNScoreVariants will annotate the variant with pre-computed single-sample derived model scores in the INFO field CNN_1D (the neural network performs convolutions over the reference sequence surrounding the variant and combines those features with a multilayer perceptron on the variant annotations).

gatk CNNScoreVariants -V my_data/NA12878.vcf.gz \
-R my_data/reference/Homo_sapiens_assembly38.fasta \
-O my_data/filtered_output/NA12878.cnns.vcf \
--disable-avx-check

FilterVariantTranches takes as input the percent sensitivities (0-100) to known sites to apply the filter. Variants with scores higher than for e.g. 99th percentile of variants in the resources pass through the filter and will have PASS in the filter. Others will have a filter values like ‘CNN_1D_INDEL_Tranche_99.40_100.00’ or ‘CNN_1D_SNP_Tranche_99.95_100.00’.

gatk FilterVariantTranches -V my_data/filtered_output/NA12878.cnns.vcf \
--resource my_data/reference/hapmap_3.3.hg38.vcf.gz \
--resource my_data/reference/Mills_and_1000G_gold_standard.indels.hg38.vcf.gz \
--info-key CNN_1D --snp-tranche 99.95 --indel-tranche 99.4 \
-O my_data/filtered_output/NA12878.cnns.cnnfilter.vcf

Additional filtering The VariantFiltration tools is designed for hard-filtering variant calls based on custom quality criteria such as sequencing depth, mapping quality etc. The two parameters are the filter-name and filter-expression. The parameter filter-name is the name of the filter to be used in the FILTER column if the expression in filter-expression is true. In the example below, if the sequencing depth at the variant site (VCF field DP) is less than 10, the FILTER field will be populated with the value ‘Low_depth10’. Users can add multiple filter expression/name combinations.

gatk VariantFiltration -V my_data/NA12878.vcf.gz \
-R my_data/reference/Homo_sapiens_assembly38.fasta \
-O my_data/NA12878.vqsr.varfilter.vcf \
--filter-name "Low_depth10" \
--filter-expression "DP < 10"

Exporting variant data and visualisation

enerating a report for a VCF (Variant Call Format) file, especially in its compressed form (VCF.gz), typically involves using tools that can provide summary statistics, metrics, and other relevant information about the variants in the file.

Generate a VCF report. The report includes various metrics such as transition/transversion ratios, Ti/Tv ratios, and more.

bcftools stats NA12878.vcf.gz > final/report.txt

This GATK4 tool extracts fields of interest from each record in a VCF file. VariantsToTable can extract field from both the INFO and FORMAT columns in the VCF file.

gatk VariantsToTable -V my_data/NA12878.vqsr.varfilter.vcf \
-R my_data/reference/Homo_sapiens_assembly38.fasta \
-F CHROM -F POS -F FILTER -F TYPE -GF AD -GF DP --show-filtered \
-O my_data/NA12878.vqsr.varfilter.pass.tsv

We can also create a subset of variants to report.

bcftools view NA12878.vcf.gz chr20:3822018-3999324 | bgzip -c > filtered_output/subset.vcf.gz

View these vcf reports using IGV, a popular genome browser that allows you to visualize and explore genomic data, including VCF files.

Name		Name	Last commit message	Last commit date
Latest commit History 32 Commits
QC_metrics		QC_metrics
output		output
MultiQC Report.html		MultiQC Report.html
NA12878.vcf		NA12878.vcf
NA12878.vcf.gz		NA12878.vcf.gz
NA12878.vcf.gz.tbi		NA12878.vcf.gz.tbi
NA12878.vcf.idx		NA12878.vcf.idx
NA12878.vqsr.vcf		NA12878.vqsr.vcf
NA12878.vqsr.vcf.idx		NA12878.vqsr.vcf.idx
README.md		README.md

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Variant-calling-using-GATK4

Objectives

Description

Tools and installation

Data preprocessing

Base quality recalibration

Variant Calling using HaplotypeCaller

Filter and prepare analysis ready variants

Exporting variant data and visualisation

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

ReshmaRamaiah10/Variant-calling-using-GATK4

Folders and files

Latest commit

History

Repository files navigation

Variant-calling-using-GATK4

Objectives

Description

Tools and installation

Data preprocessing

Base quality recalibration

Variant Calling using HaplotypeCaller

Filter and prepare analysis ready variants

Exporting variant data and visualisation

About

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages