Tim Morris - Designed the pipeline and led development and implementation.
Gemma Shireby - Designed the pipeline and led curation of genotype data.
Georg Otto - Contributed to pipeline design including feature suggestions.
David Bann - Provided code annotations and documentation support.
Liam Wright - Checked PGIs and designed PGI visualisations.
For queries and to report errors, please contact Tim Morris.
This pipeline was designed to run on the British cohort studies managed by the Centre for Longitudinal Studies. Further information about the cohorts can be found on the CLS website and in the CLS genomics cohort profile paper.
The CLS genomics GitHub website contains detailed information about the genotyping, imputation and quality control of the underlying genetic resources used in this pipeline.
This pipeline generates Polygenic Indices (PGIs) across multiple cohorts using GWAS summary statistics and cohort genotype data. It was designed to run on the CLS cohorts but can readily be applied to other cohort studies. The pipeline performs the following steps:
- Prepares a lookup file for fast build checking.
- Harmonises SNPs across all input cohorts to ensure consistent variant sets for PGI calculation.
- Reformats GWAS summary statistics, checking the genome build and performing liftover to hg38 where necessary.
- Computes PGIs using a clumping and thresholding approach as applied in PRSice2.
- Cleans up large intermediary files to save disk space.
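The harmonisation step above can be illustrated with a minimal sketch. It assumes that "harmonised" means the set of variant IDs present in every cohort's .bim file; the pipeline's own script may apply further checks (for example on alleles or positions). The function name `shared_snps` is hypothetical.

```shell
#!/usr/bin/env bash
# Sketch only: list variant IDs shared by every cohort's PLINK .bim file.
# Column 2 of a .bim file holds the variant ID.
shared_snps() {
    # Usage: shared_snps cohort1.bim cohort2.bim [...]
    local tmp_shared tmp_next bim
    tmp_shared=$(mktemp)
    tmp_next=$(mktemp)
    # Start with the sorted, de-duplicated IDs of the first cohort.
    awk '{print $2}' "$1" | LC_ALL=C sort -u > "$tmp_shared"
    shift
    for bim in "$@"; do
        # comm -12 keeps only lines that appear in both sorted inputs.
        awk '{print $2}' "$bim" | LC_ALL=C sort -u \
            | LC_ALL=C comm -12 "$tmp_shared" - > "$tmp_next"
        mv "$tmp_next" "$tmp_shared"
    done
    cat "$tmp_shared"
    rm -f "$tmp_shared" "$tmp_next"
}
```

The resulting list could then be passed to PLINK's `--extract` option to restrict each cohort to the shared variants.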
The pipeline requires the following software packages to be installed:
- PLINK v1.9.
- PRSice2.
- LiftOver.
- R (with required packages).
- Standard Unix tools (awk, sort, comm, etc.).
The pipeline also requires the following input data:
- Genotype data for each cohort in PLINK binary format (.bed/.bim/.fam).
- GWAS summary statistics for each trait.
- dbSNP reference file relevant to the genome build of your cohort genotype data (here, hg38).
- LiftOver chain file relevant to the genome build of your cohort genotype data (here, hg19 to hg38).
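Before running the pipeline, it may help to confirm that the required tools are on your PATH. The sketch below is not part of the pipeline; `check_tools` is a hypothetical helper, and the executable names in the example call are assumptions that will vary by installation.

```shell
#!/usr/bin/env bash
# Sketch only: report any required command that cannot be found on PATH.
check_tools() {
    # Usage: check_tools tool1 tool2 [...]
    local missing=0 tool
    for tool in "$@"; do
        if ! command -v "$tool" >/dev/null 2>&1; then
            echo "Missing: $tool" >&2
            missing=1
        fi
    done
    # Returns 0 when every tool was found, 1 otherwise.
    return "$missing"
}

# Example call (executable names are assumptions; adjust to your system):
# check_tools plink PRSice_linux liftOver Rscript awk sort comm
```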
Pull the scripts from GitHub:
git clone https://github.com/CLS-Data/CLS_PGI_repository.git
Navigate to the script directory:
cd pgi_pipeline
Edit the config template to map the correct paths for the pipeline:
cp scripts/config_template.sh scripts/config.sh
nano scripts/config.sh
GWAS summary statistics files for each trait need to be placed in trait-specific directories with the below structure. The naming of the trait-specific directories is up to the analyst; we recommend brevity.
|--- reference/
|--- summary_statistics/
    |--- [trait]/
        |--- gwas_summary_statistics.txt.gz
Phenotypes need to be specified in the below phenotype_list.txt file in the reference folder.
|--- reference/
    |--- phenotype_list.txt
The phenotype_list.txt file must contain one row for each trait, with comma-separated values for the folder name, the summary statistics file name and the trait name. For example:
alcohol,DrinksPerWeek.txt.gz,drinksperweek
bmi,BMI_summary_stats.txt.gz,bmi
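A malformed row or a misnamed file is easy to miss, so a quick check of phenotype_list.txt before submitting jobs may be useful. The sketch below is not part of the pipeline. It assumes each summary statistics file sits at `<summary statistics directory>/<folder name>/<sumstats file>`, and `validate_phenotype_list` is a hypothetical helper.

```shell
#!/usr/bin/env bash
# Sketch only: check that each row has exactly three comma-separated fields
# and that the named summary statistics file exists.
validate_phenotype_list() {
    # Usage: validate_phenotype_list reference/phenotype_list.txt summary_statistics
    local list="$1" sumstats_dir="$2" status=0
    local folder file trait extra
    while IFS=, read -r folder file trait extra || [ -n "$folder" ]; do
        # Skip blank lines.
        if [ -z "$folder$file$trait" ]; then
            continue
        fi
        if [ -z "$folder" ] || [ -z "$file" ] || [ -z "$trait" ] || [ -n "$extra" ]; then
            echo "Malformed row: $folder,$file,$trait" >&2
            status=1
        elif [ ! -f "$sumstats_dir/$folder/$file" ]; then
            echo "Missing file: $sumstats_dir/$folder/$file" >&2
            status=1
        fi
    done < "$list"
    # Returns 0 when every row passed, 1 otherwise.
    return "$status"
}
```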
The dbSNP lookup file relevant to the genome build of your cohort genotype data needs to be downloaded to the following location:
|--- reference/
qsub scripts/x1.genome_build_check_prep.sh scripts/config.sh
This script prepares reference files for genome build checking.
qsub scripts/x2.harmonise_snps.sh
This script creates harmonised SNP lists across all cohorts for consistent PGI calculation.
qsub scripts/x3.run_reformat_sum_stats.sh
This script reformats GWAS summary statistics and handles genome build conversion where needed.
The submission script needs to be run once for each cohort.
qsub scripts/x4.submit_pgi_pipeline.sh scripts/config.sh [cohort] results/cross_cohort/[cohort]/[cohort]_cross_cohort
This script generates the PGIs for each cohort from the cleaned genotype data and GWAS summary statistics.
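Because the submission script is run once per cohort, a small loop can save retyping the command. The sketch below is not part of the pipeline: `submit_all_cohorts` and the cohort names in the example call are hypothetical, and the output prefix simply follows the pattern in the command above. By default it only prints the commands; setting `SUBMIT=qsub` would submit them.

```shell
#!/usr/bin/env bash
# Sketch only: build one x4 submission command per cohort.
submit_all_cohorts() {
    # Usage: submit_all_cohorts cohort1 cohort2 [...]
    # SUBMIT defaults to "echo qsub", so commands are printed, not submitted.
    local submit="${SUBMIT:-echo qsub}" cohort
    for cohort in "$@"; do
        $submit scripts/x4.submit_pgi_pipeline.sh scripts/config.sh "$cohort" \
            "results/cross_cohort/${cohort}/${cohort}_cross_cohort"
    done
}

# Example dry run (cohort names are placeholders):
# submit_all_cohorts cohortA cohortB
# Example real submission:
# SUBMIT=qsub submit_all_cohorts cohortA cohortB
```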
PRSice creates a .valid file for each trait containing the SNPs used to estimate the PGI; these files can be very large. Analysts may wish to use this (optional) script to remove .valid files over a pre-specified size.
qsub scripts/x5.rm_valid.sh
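To see which .valid files would exceed a given size before removing anything, a listing along the following lines may help. This is an independent sketch, not the logic of x5.rm_valid.sh; `list_large_valid` is a hypothetical helper, the directory to search is passed as an argument, and the `M` size suffix assumes GNU or BSD find.

```shell
#!/usr/bin/env bash
# Sketch only: list .valid files larger than a threshold given in MiB.
list_large_valid() {
    # Usage: list_large_valid <directory> <threshold in MiB>
    # find rounds file sizes up to whole MiB before comparing.
    find "$1" -type f -name '*.valid' -size +"$2"M
}

# Example call (directory and threshold are placeholders):
# list_large_valid results/pgi 500
```

Reviewing the listing first, and deleting only afterwards, avoids removing files that are still needed.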
The pipeline will output all derived files to the following folders:
|--- results/
    |--- cross-cohort/
        |--- [cohort]
        |--- snplists
    |--- pgi/
        |--- [cohort]
            |--- [trait]
    |--- summary_statistics
        |--- [trait]
Generated PGIs can be found for each cohort in results/pgi/[cohort]/[trait]/, with the cleaned genotype files and shared snplist in results/cross-cohort/, and the cleaned summary statistics in results/summary_statistics/[trait].
- Missing files: Check that all paths in scripts/config.sh are correct.
- Software: Ensure all required software is installed and paths are correct.
- Permissions: Check file permissions, script execution rights and cluster job submission rights.
- Memory: The pipeline was designed to run on cohorts with sample sizes of N<10,000 genotyped individuals. For cohorts with larger sample sizes you may need to assign more memory in the scripts.
Check logs for detailed error messages.
For the initial run, the scripts must be run sequentially in the following order:
scripts/x1.genome_build_check_prep.sh
scripts/x2.harmonise_snps.sh
scripts/x3.run_reformat_sum_stats.sh
scripts/x4.submit_pgi_pipeline.sh
scripts/x5.rm_valid.sh
After the pipeline has been run once, scripts 3-5 can be run again for additional traits.
If you use this pipeline, we ask that you cite the UK Data Service documentation that describes the PGIs generated from the pipeline.