- This tutorial will cover the following steps, all including the autosomes and the X chromosome:
- Quality control of genotype data
- Imputation of genotype data
- Genome-wide association study (GWAS)
- Genotype data is provided to use with this tutorial, or you can run through it using your own genotype data
-
Clone this repository to get a local copy of all files
-
In your terminal or HPC environment navigate to the location you would like to download a copy of all the files in this tutorial
-
Then run the following code:
git clone https://github.com/jodithea/Tutorial_GWAS_including_X_chromosome.git- You will now have a copy of all of the directories and files from this tutorial
- For this tutorial you will need the following software
- Plink v1.9
- R (v4.5.0 is used in this tutorial)
- Make sure the R package 'tidyverse' is installed
- vcftools (v0.1.16 used in this tutorial)
- bcftools (v1.22 used in this tutorial)
- GCTA (v1.94.1 used in this tutorial)
- Follow through the directories and files in chronological order
- Markdown files (files ending with .md) contain instructions and documentation
- You can view these files directly in the GitHub web interface
- Or, if you have a local clone of the repository, you can view them in the terminal using cat, e.g.:
cat README.md- Scripts (i.e. files ending with '.sh') are shell scripts that can be submitted to your HPC environment
- Again you can view these files directly in the GitHub web interface or in your local clone of the repository by using cat
- All scripts in this repository follow a similar layout:
- Shebang and Scheduler Header
- Every script starts with a shebang (#!/bin/bash) to indicate it is a bash script, followed by HPC scheduler directives
- The schedular directives are written for PBS, for example:
#!/bin/bash
#PBS -l walltime=00:10:00
#PBS -l mem=1GB
#PBS -l ncpus=1If you use SLURM instead of PBS, you can replace these lines with SLURM directives, for example:
#!/bin/bash
#SBATCH --time=00:10:00
#SBATCH --mem=1GB
#SBATCH --cpus-per-task=1To do this replacement, in your local clone of the repository, open the script with an editor, for example:
nano 01_Split_X_chromosome.shThen edit the scheduler directives in the file as needed
-
Script Description
- A short comment explaining what the script does, for example:
# This script splits the X chromosome into non-pseudoautosomal region (nPAR) (coded as CHR 23) and pseudoautosomal region (PAR) (coded as CHR 25)-
Environment
- This section loads any required modules or software, for example:
### Environment ###
module load plink/1.90b7-
Preamble / User Variables
- This section defines any directories, file paths, or parameters used in the script
- You need to edit this section of the script as appropriate
- For example, in each script the 'directory' variable is defined. Open the script using 'nano' (or a similar editor) and set the path to the location of where your local clone of the repository is
### Preamble ###
# Update to point to the location where you are doing this tutorial
directory=/path/Tutorial_GWAS_including_X_chromosome/-
Commands / Submission Section
- This section contains the actual commands that perform the work
- For example:
### Submit script ###
cd ${directory}
# Split chromosome X
plink --bfile 01_Genotype_Data/Genotypes_chrX \
--split-x b37 \
--make-bed \
--out 01_Genotype_Data/Genotypes_chrX_split- After viewing the script and making any edits as appropriate, submit the script
For PBS, submit using qsub, e.g.:
qsub 01_Split_X_chromosome.sh
For SLURM (and after you have edited the scheduler directives appropriately), submit using sbatch, e.g.:
sbatch 01_Split_X_chromosome.sh
- You can check the status of your jobs:
For PBS, using qstat:
qstat -u your_usernameFor SLURM, using squeue:
squeue -u your_username-
Some scripts in this tutorial use job arrays, one job per chromosome:
- 03_Imputation/04_Convert_and_filter_imputed_genotype_data.sh
- 04_Ancestry_Checks/02_Tidy_1000G_data.sh
- 04_Ancestry_Checks/03_Merge_1000G_with_tutorial_data_and_prune.sh
-
The scripts are written using PBS job scheduling
- The PBS scheduler directive uses
#PBS -Jto define the range of array jobs, for example:
#!/bin/bash
#PBS -l walltime=01:00:00
#PBS -l mem=80GB
#PBS -l ncpus=1
#PBS -J 1-22- Within the script, the array index for the current task is accessed using PBS_ARRAY_INDEX, for example:
chr=${PBS_ARRAY_INDEX}- Another example from this tutorial:
files=(${directory}03_Imputation/Genotype_imputation_results/chr_*.zip)
chr=$(basename "${files[$PBS_ARRAY_INDEX]}" .zip)-
If your HPC system uses SLURM, you must modify these scripts
-
Open the script using 'nano' (or a similar editor) and replace the PBS array directive with:
#!/bin/bash
#SBATCH --time=01:00:00
#SBATCH --mem=80GB
#SBATCH --cpus-per-task=1
#SBATCH --array=1-22- And inside the script replace PBS_ARRAY_INDEX with the SLURM equivalent:
chr=${SLURM_ARRAY_TASK_ID}files=(${directory}03_Imputation/Genotype_imputation_results/chr_*.zip)
chr=$(basename "${files[$SLURM_ARRAY_TASK_ID]}" .zip)