A pipeline to impute human SNP data to 1000 genomes efficiently by parallelising across a compute cluster.
Provided here is simply a collection of scripts in bash and R that knit together a two-stage imputation process:

- Stage one uses `hapi-ur` to haplotype the target data (Williams AL, Patterson N, Glessner J, Hakonarson H, and Reich D. Phasing of Many Thousands of Genotyped Samples. American Journal of Human Genetics 2012;91(2):238-251.)
- Stage two uses `impute2` to impute to the 1000 genomes reference (Howie BN, Donnelly P, and Marchini J. A Flexible and Accurate Genotype Imputation Method for the Next Generation of Genome-Wide Association Studies. PLoS Genetics 2009;5(6):e1000529.)
It performs the following processes:
- Alignment of target data to reference data
- Haplotyping
- Imputing
- Converting to best-guess genotypes
- Filtering steps
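For orientation, here are the main commands in the order they are run; each step is described in detail below.

```
./strand_align.sh      # align alleles to the forward strand
qsub ref_align.sh      # align target data to the reference, split by chromosome
qsub hap.sh            # haplotype each chromosome three times with hapi-ur
qsub imp.sh            # vote on haplotypes and set up imputation chunks
./submit_imp.sh        # impute each 5Mb chunk with impute2
qsub stitch_plink.sh   # stitch chunks back into whole chromosomes
qsub filter.sh         # filter best-guess genotypes on MAF and imputation quality
```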
Runtime is fast thanks to some great software that is freely available (see below). I typically have around 400 cores available, and for e.g. 1000 individuals the whole pipeline completes in a few hours. Runtime scales roughly linearly with sample size.
To run the pipeline you will need:

- A compute cluster using `SGE`
- Your genotype data (22 autosomes), QC'd, in binary plink format, referred to as the target data set
- The downloads (including a reference data set) listed in the instructions below
- `R`, `awk`, `bash`, `git`, a text editor, etc.

The pipeline produces:

- Haplotyped target and imputed data in `impute2` format
- Dosage imputed data in `impute2` format
- 'Best-guess' imputed data in binary `plink` format
- 'Best-guess' imputed data, filtered for MAF and imputation quality, in binary `plink` format
Imputation is a big, slow, ugly, long-winded, hand-wavey, unpleasant process. In setting up this pipeline I have used plenty of scripts, programmes, and data found in various corners of the internet, and these have made the whole task much, much easier. Most of these resources have been used without permission from the original authors. If any of the authors are angry about this then let me know and I will take it down!
Here is a list of resources that I have used:
- `hapi-ur` developed by Amy Williams
- `impute2` developed by Bryan Howie
- `plink` developed by Shaun Purcell
- Parts of the `GermLine` software developed by Itsik Pe'er
- Some of the `BEAGLE` utilities, written by Brian and Sharon Browning
- `liftOver` developed at UCSC
- Reference data hosted by the developers of `impute2`
- Strand alignment data files, produced and hosted by Will Rayner
- Strand alignment script developed by Neil Robertson
- The `plyr` library in `R`, written by Hadley Wickham
- Help and guidelines from the `MaCH` and `minimac` Imputation Cookbook, developed by Goncalo Abecasis
The pipeline was developed by Gibran Hemani in the Complex Trait Genomics Group at the University of Queensland (Diamantina Institute and Queensland Brain Institute). Valuable help was provided by members of Peter Visscher's and Naomi Wray's groups, and by Paul Leo from Matt Brown's lab.
- First, clone this repository:

```
git clone https://github.com/explodecomputer/imputePipe.git
```

- Then `cp` your raw data in binary plink format to `data/target`.
- Download the strand alignment files for your data's chip from Will Rayner's page and unzip.
- Download the chain file for the SNP chip's build from UCSC (most likely you will need hg18 to hg19, which is included in this repo).
- Download and unarchive the reference data from the impute2 website, e.g.

```
wget http://mathgen.stats.ox.ac.uk/impute/ALL_1000G_phase1integrated_v3_impute.tgz
tar xzvf ALL_1000G_phase1integrated_v3_impute.tgz
```
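For reference, the chain file is used with UCSC's `liftOver` roughly as follows. This is only an illustration with made-up BED file names; `ref_align.sh` runs the conversion for you:

```
# Convert hg18 positions to hg19 with UCSC liftOver. The input/output
# file names here are hypothetical; ref_align.sh handles this step.
liftOver target_hg18.bed hg18ToHg19.over.chain target_hg19.bed unmapped.bed
```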
The `parameter.sh` file has all the options required for the imputation process. It should be fairly self-explanatory: just change the file names and options listed in the section marked `To be edited by user`.
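As a rough illustration (the variable names here are hypothetical, not necessarily those used in `parameter.sh`), the user-editable section might look something like this:

```
# To be edited by user -- all names and values below are illustrative
rawdata="data/target/mystudy"   # stem of the QC'd binary plink target data
chunksize=5000000               # imputation chunk size (5Mb)
overlap=250000                  # overlap between adjacent chunks (250kb)
maf=0.01                        # MAF threshold for the final filtering step
info=0.8                        # imputation quality threshold for filtering
```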
This is a two-step process.

- First, convert all alleles to be on the forward strand:

```
./strand_align.sh
```

- Second, convert the map to hg19, update SNP names and positions, remove SNPs that are not present in the reference data, and split the data into separate chromosomes. By running

```
qsub ref_align.sh
```

the script will be submitted to SGE and executed on all chromosomes in parallel. Alternatively you can run

```
./ref_align.sh <chr>
```

and the script will work on chromosome `<chr>` only.
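If SGE is not available you could loop over the autosomes in the front end yourself. Here is a minimal sketch, assuming the script takes the chromosome number as its only argument:

```
# Run the reference alignment serially on all 22 autosomes
# (slow, but needs no scheduler).
for chr in $(seq 1 22); do
    ./ref_align.sh ${chr}
done
```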
The output from this will be binary `plink` files for each chromosome, located in the `data/target` directory.
This uses Amy Williams' excellent haplotyping programme `hapi-ur`. We perform the haplotyping three times on each chromosome:

```
qsub hap.sh
```

and then vote on the most common outcome at each position to make the final haplotypes:

```
qsub imp.sh
```

This also creates a new SGE submit script for each chromosome, where each chromosome has been split into 5Mb chunks with 250kb overlapping regions (these options can be amended in the `parameter.sh` file).

Both scripts can also be run on a specified chromosome in the front end:

```
./hap.sh <chr>
./imp.sh <chr>
```

which might be useful for testing that everything is working.
The output from this will be three haplotype file sets for each chromosome, as well as a final, democratically elected (!) file set in `impute2` format, located in the `data/haplotypes` directory.
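To make the voting step concrete, here is a toy sketch of a majority vote (not the actual `imp.sh` implementation; the input file name and layout are made up for illustration):

```
# Toy majority vote across three hapi-ur runs. Input: one line per
# position with three allele calls, one per run. Output: the call that
# appears at least twice, falling back to run 1 if all three disagree.
awk '{
    if ($1 == $2 || $1 == $3)
        print $1
    else if ($2 == $3)
        print $2
    else
        print $1
}' three_run_calls.txt
```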
Most likely the lengthiest and most memory-demanding stage. By running

```
./submit_imp.sh
```

the per-chromosome scripts spawned in the last step will be submitted to SGE.
With large sample sizes (e.g. >10k individuals) my cluster will occasionally kill a particular chunk. Should this happen, it is safe to run the submit script in its entirety again at the end: it will not overwrite anything that is already completed, and only the incomplete chunks will be rerun.
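The resume behaviour boils down to a check like the following (a minimal sketch; the chunk count, output file name, and per-chunk submit script are hypothetical):

```
# Skip chunks whose output already exists; resubmit the rest.
nchunks=30                                   # hypothetical chunk count
for i in $(seq 1 ${nchunks}); do
    if [ -s "chunk${i}.imputed.gz" ]; then   # hypothetical output name
        echo "chunk ${i} already complete, skipping"
    else
        qsub -v CHUNK=${i} impute_chunk.sh   # hypothetical submit script
    fi
done
```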
This script will perform the imputation using impute2, and then convert the dosage output to best-guess genotypes in binary plink format.
Again, to test that it is working you can simply run the submit script in the front end for a particular chunk of a chromosome, e.g.

```
cd data/imputed/chr22
./submit_imp.sh 4
```

will run the 4th 5Mb chunk of chromosome 22.
The outputs from this script will be imputed dosages, haplotypes, and best-guess genotypes for each chromosome, broken into 5Mb chunks. These will be located in `data/imputed`.
This will stitch together the 5Mb chunks for each chromosome:

```
qsub stitch_plink.sh
```

Again, a single chromosome can be executed in the front end by running:

```
./stitch_plink.sh <chr>
```
Imputed data for entire chromosomes in:

- Dosages (`impute2` format)
- Haplotypes (`impute2` format)
- Best-guess genotypes (binary `plink` format)
- `impute2` info files
The final stage is to filter on MAF and imputation quality. The thresholds can be amended in the `parameter.sh` file.

```
qsub filter.sh
```

or

```
./filter.sh <chr>
```
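Conceptually the filtering amounts to something like this (an illustrative sketch, not the actual `filter.sh`; thresholds and file names are examples):

```
# Drop SNPs with a low impute2 info score, then apply a MAF filter with
# plink. In a standard impute2 info file, column 2 is the SNP ID and
# column 5 is the info score.
awk 'NR > 1 && $5 < 0.8 {print $2}' chr22.info > lowinfo.snps
plink --bfile chr22 --exclude lowinfo.snps --maf 0.01 --make-bed --out chr22_filtered
```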
The output will be filtered best-guess genotypes (in binary `plink` format) for each chromosome.
This pipeline works for me. I use it regularly, and I thought it was a good idea to share it given that I am using so much stuff that has been shared by others.
I have never tried it on another cluster, and I imagine that some of the parameters will have to be customised for different cluster setups.
It is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.