recombination-landscape

Preparing VCF and running ReLERNN

Optimally, you have many whole genome sequence data for individuals from a single panmictic population mapping to a closely related reference genome (conspecific is best).

1. For each indiviudal, run haplotypecaller using GATK to produce a GVCF file

#for gpu compatible devices I use Parabricks

pbrun fq2bam \
  --ref Leiomod.fa \
  --in-fq Leiomadagascarensis-ran38059_S10_L004_R1_001.fastq.gz Leiomadagascarensis-ran38059_S10_L004_R2_001.fastq.gz \
  --out-bam LM_ran38059.bam

pbrun haplotypecaller \
  --ref Leiomod.fa \
  --in-bam LM_ran38059.bam \
  --out-variants LM_ran38059.g.vcf --gvcf

2. Combine all with CombinGVCFs. Make sure the name of the sample is in the GVCFs

while IFS= read -r name; do
  # Your command here, using $line for the current line
  echo "Processing line: $name"
  bgzip ${name}.g.vcf
  tabix ${name}.g.vcf.gz
  echo ${name} > ${name}
  bcftools reheader -s $name -o ${name}.rename.g.vcf.gz ${name}.g.vcf.gz
  tabix -f ${name}.rename.g.vcf.gz
done < leionames.txt

java -Dsamjdk.use_async_io_read_samtools=false -Dsamjdk.use_async_io_write_samtools=true -Dsamjdk.use_async_io_write_tribble=false -Dsamjdk.compression_level=2 -jar /mendel-nas1/ddebaun/gatk/build/libs/gatk-package-4.2.5.0-17-gb14f810-SNAPSHOT-local.jar CombineGVCFs -R $reference -V LLM_RAN39039.rename.g.vcf.gz -V LLM_RAX12322.rename.g.vcf.gz -V LLM_RAX12546.rename.g.vcf.gz -V LLM_RAX5299.rename.g.vcf.gz -V LLM_ran38059.rename.g.vcf.gz -V LLM_rax12515.rename.g.vcf.gz -O Leiomod_combined.g.vcf.gz

3. Run GenotypeGVCF

java -Dsamjdk.use_async_io_read_samtools=false -Dsamjdk.use_async_io_write_samtools=true -Dsamjdk.use_async_io_write_tribble=false -Dsamjdk.compression_level=2 -jar /mendel-nas1/ddebaun/gatk/build/libs/gatk-package-4.2.5.0-17-gb14f810-SNAPSHOT-local.jar GenotypeGVCFs -R $reference -V Leiomod_combined.g.vcf.gz -O Leiomod_allsites.vcf.gz --include-non-variant-sites

4. Filter VCF file

First, get all biallelic SNPs

vcf_in="${name}.vcf"
filtered_vcf="${name}_snps.vcf"
bcftools view \
  -f .,PASS \
  -v snps \
  --min-alleles 2 --max-alleles 2 \
  $vcf_in \
  -Ov -o $filtered_vcf

For samples with low depth and low genotype quality, convert these to missing.

masked_vcf="${name}_masked.vcf.gz"
bcftools +setGT $filtered_vcf -- -t q -n . -i 'FMT/DP < 5 || FMT/GQ < 30' \
| bcftools view -Oz -o $masked_vcf

ReLERNN learns from the pattern of polymorphism across haplotypes. Very few samples (or sites present in only 1–2 samples) give weak signal and increase noise. Here I require no more than 33% of samples are missing. For 6 taxa, requires 4 are present at a site.

filtered_vcf_2="${name}_snps_relernn.vcf.gz"
bcftools view \
       -i "F_MISSING <= 0.33" \
       $masked_vcf \
        -Oz -o $filtered_vcf_2

Additional quality control filtering of SNPs

tabix ${filtered_vcf_2}

java -Dsamjdk.use_async_io_read_samtools=false -Dsamjdk.use_async_io_write_samtools=true -Dsamjdk.use_async_io_write_tribble=false -Dsamjdk.compression_level=2 -jar /mendel-nas1/ddebaun/gatk/build/libs/gatk-package-4.2.5.0-17-gb14f810-SNAPSHOT-local.jar VariantFiltration \
-R $reference \
-V ${filtered_vcf_2}.gz \
--filter-expression "QD < 2.0" --filter-name "QD2" \
--filter-expression "FS > 60.0" --filter-name "FS60" \
--filter-expression "MQ < 40.0" --filter-name "MQ40" \
--filter-expression "SOR > 3.0" --filter-name "SOR3" \
--filter-expression "ExcessHet > 54.69" --filter-name "ExcessHet54.69" \
--filter-expression "MQRankSum < -12.5" --filter-name "MQRankSum-12.5" \
--filter-expression "ReadPosRankSum < -8.0" --filter-name "ReadPosRankSum-8" \
-O ${name}_filtered.vcf.gz

bcftools view  -f 'PASS' ${name}_filtered.vcf.gz  -Ov -o ${name}_filtered_PASS.vcf

5. Create mask file

Some sites of the reference genome may never be mapped to across your samples. Here, I create a mask of these regions based on depth and genotype quality statistics

vcftools --vcf Leiomod_allsites.vcf.gz --site-mean-depth --out Leiomod_depth
bcftools query -f '%CHROM\t%POS\t%POS[\t%GQ]\n' Leiomod_allsites.vcf.gz \
  | awk '{
      sum=0; n=0;
      for (i=4; i<=NF; i++) {
          if ($i != "." && $i != "") { sum+=$i; n++ }
      }
      if (n>0) {
          mean=sum/n;
          print $1, $2-1, $2, mean
      }
  }' OFS='\t' > Leiomod_site_meanGQ.bed

bcftools query -f '%CHROM\t%POS\t%POS[\t%RGQ]\n' Leiomod_allsites.vcf.gz \
  | awk '{
      sum=0; n=0;
      for (i=4; i<=NF; i++) {
          if ($i != "." && $i != "") { sum+=$i; n++ }
      }
      if (n>0) {
          mean=sum/n;
          print $1, $2-1, $2, mean
      }
  }' OFS='\t' > Leiomod_site_meanRGQ.bed

cat Leiomod_site_meanGQ.bed Leiomod_site_meanRGQ.bed | sort -k 1,1 -k2,2n | bedtools merge > Leiomod_lowgq.bed
awk '$4 < 5 {print $1"\t"$2-1"\t"$2}' Leiomod_depth.ldepth.mean > Leiomod_lowdepth.bed

cat Leiomod_lowdepth.bed Leiomod_lowgq.bed | sort -k 1,1 -k2,2n | bedtools merge > Leiomod.mask

6. Create genome file

samtools faidx ../Leiomod.fa
cut -f1,2 ../Leiomod.fa.fai > Leiomod.chrom.sizes
awk '{print $1"\t0\t"$2}' Leiomod.chrom.sizes > Leiomod.chrom.bed

7. Run ReLERNN

Simulating based on the VCF

name="Leiomod"
DIR=${name}_results
mkdir $DIR
VCF=${name}_ReLERNN.vcf
GENOME=Leiomod.chrom.bed
MU="2.4e-9" #snake mu from green et al.
ReLERNN_SIMULATE --vcf ${VCF} --genome ${GENOME} --unphased --projectDir ${DIR} --assumedMu ${MU} --assumedGenTime 2 --nCPU $SLURM_NTASKS_PER_NODE --mask ${name}.mask

Training, Prediction, and Bootstrapping

ReLERNN_TRAIN --projectDir ${DIR}
ReLERNN_PREDICT --vcf ${VCF} --projectDir ${DIR} --minSites 150 --unphased 
ReLERNN_BSCORRECT --projectDir ${DIR} --nCPU $SLURM_NTASKS_PER_NODE --nSlice 100 --nReps 1000

Name		Name	Last commit message	Last commit date
Latest commit History 5 Commits
Installing ReLERNN.txt		Installing ReLERNN.txt
README.md		README.md

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

recombination-landscape

1. For each indiviudal, run haplotypecaller using GATK to produce a GVCF file

2. Combine all with CombinGVCFs. Make sure the name of the sample is in the GVCFs

3. Run GenotypeGVCF

4. Filter VCF file

5. Create mask file

6. Create genome file

7. Run ReLERNN

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Folders and files

Latest commit

History

Repository files navigation

recombination-landscape

1. For each indiviudal, run haplotypecaller using GATK to produce a GVCF file

2. Combine all with CombinGVCFs. Make sure the name of the sample is in the GVCFs

3. Run GenotypeGVCF

4. Filter VCF file

5. Create mask file

6. Create genome file

7. Run ReLERNN

About

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Packages