-
The analaysis here is for data mapped and finalized with three mappers: bwa mem, bowtie2 and novoalign
-
See associated scripts for steps up to final bams and creating .sync files
/ /
2) Run Fst on windows for each pairwise comparision of sequenced data and calculate average Fst across three mappers
/ /
3) per SNP logistic regression for each treatment by generation averaged for Novoalign, Bwa-mem and bowtie2
/ /
4) Average estimates of selection coefficient at each position for selection and control lineages for three mappers
/ /
/ /
/ /
-
For one mapper: Have a diretory with all the .final.bam files created and .mpileup /.sync files created using these .bam files
-
The analysis (up to step 5) will be generic for one mapper (Novoalign) but completed similarily for other chosen mappers
-
need to know the order of the .sync files: will be based on the order of the .bam files read into .sync
-
Some scripts require the .sync file split into chromosomes: script below shows method of splitting
Script: novo_split_sync2chromosomes.sh
Script: novo_Pi_pileups.sh
ex.
samtools mpileup -B -Q 0 -f ${ref_genome} ${input}/${base}_merge_novo_final_realigned.bam > ${output}/${base}.pileup
Flags:
-
B -- disable BAQ (base alignment quality) computation, helps to stop false SNPs passing through due to misalignment
-
Q -- minimum base quality (already filtered for 20, default is 13, just set to 0 and not worry about it)
-
f -- path to reference sequence
Script: novo_tajima_pi.sh
ex.
perl ${popoolation}/Variance-sliding.pl --input ${input}/${base}.pileup --output ${output}/${base}.pi --measure pi --window-size 10000 --step-size 10000 --min-count 2 --min-coverage 4 --max-coverage 400 --min-qual 20 --pool-size 120 --fastq-type sanger --snp-output ${output}/${base}.snps --min-covered-fraction 0.5
Flags:
- input -- input pileup file
- output -- output file with Tajima's Pi calculated
- measure [pi] -- Options include Tajima's Pi or Wattersons Theta or Tajima's D along chromosomes using a sliding window approach
- window-size [10000] -- size of the sliding window
- step-size [10000] -- how far to move along with chromosome (if step size smaller, windows will overlap)
- min-count [2] -- minimum allele count
- min-coverage [4] -- minimum coverage (not important if subsampling done..)
- max-coverage [400] --maximum coverage
- min-qual [20] -- minimum base quality (already filtered for 20 multiple times)
- pool-size [120] -- number of chromosomes (So double the number of individuals per pool)
- fastq-type [sanger] -- depending on the encoding of the fastq files
- min-covered-fraction [0.5] -- minimum percentage of sites having sufficient coverage in the given window -- 0.5 from example
On local machine, this R function can run each .pi file to output a plot
Script: Pi_PlotFunction.R
ex.
Pi_PlotFunction('FILE.pi', "Plot Title Details")
Questions
Currently, Pi files for Novoalign and Bowtie2 generated.
-
Should these be averaged (with bwa-mem that can be generated easily enough)
-
is one mapper enough for this
-
Do we want to show trends in nucleotide diversity (other Pi plots) or just focus on ancestoral Pi (overlay plots)
Script: Novo_Fst.sh
ex.
perl ${fst} --input ${novo_mpileup}/novo_episodic_main.sync --output ${novo_fst}/novo_episodic_main.fst --min-count 6 --min-coverage 10 --max-coverage 250 --min-covered-fraction 1 --window-size 500 --step-size 500 --pool-size 120
Flags:
- input -- input sync file
- output -- output file with Fst calculated
- window-size [500] -- size of the window
- step-size [500] -- distance to move along chromosome
- min-count [6] -- minimum allele count
- min-coverage [10] -- minimum coverage
- max-coverage [250] --maximum coverage
- pool-size [120] -- double pooled size (diploid)
- min-covered-fraction [1] -- minimum percentage of sites having sufficient coverage in the given window
Script: novo_Fst_Split_Comparisons.R
R script that will split the .fst file into many .csv files with each comparison of interest (can choose the necessary ones from here)
ScriptFST_combine3mappers.R
Will take the split comparisons, and combine those specified into one FST file with the average Fst between the three mappers
-
average Fst of three mappers and average between replicates
-
comparison betweeen control and selection lines
Generation 38: meanFst for F38
Generation 77: meanFst for F77
Long Script:* novo_regression_model_LONGSCRIPT.sh
This script will break the chromosomal .sync files into smaller managable pieces and run through multiple R scripts while removing intermediates:
R script to covert sync to Count data: Sync_to_counts.R
Creates a file with the counts for the major and minor frequency (based on ancestor) that can run through the model
R script for running the model for each position along the chromosome: Counts_to_model.R
In long script: this is set up to work in parallel, having each chromosome running at the same time (6 instances running over 11 sections)
NOTE: Script needs to be changed to run faster/ more efficiently (not done here b/c already completed)
Basic Model at each positon (tmp2):
modlist_2[[i]] <-
glm(cbind(Major_count, Minor_count) ~ Treatment*Generation,
data = tmp2, family = "binomial")
R script to combine all the split chromosome pieces back into one chromosome: Combine_chromo.R
Recreates one chromosomal file
R script to combine three mappers into one file model_combine3mappers.R
Combines each of BWA-mem, Bowtie2 and Novoalign files into one file (keeping all information)
R script to write files with coeffefficent of interest model_3mappersTxG.R
This script (choosing Treatment by Generation effect) keeps positions that are present in all three files (i.e position needs to be mapped three times)
P.adjust?, Positions?,etc.
Plots
Treatment x Generation -log10(meanP-value) for model output: NEED TO CHANGE FOR P.adjust!

4) estimates of selection coefficient at each position for selection and control lineages using poolSeq R package:
Script: poolseq_SelectionCoefficientEstimate.sh
This script will break the file into two treatment .sync files, break apart these .sync files (smaller sized files), and run through a R script to run poolSeq Package.
Rscript: Running poolseq poolSeq_selectionCoeff.R
Note: to run, check poolseq is available, if not, source all PoolSeq scripts available from Taus git page.
Also, to run with modified Sync files (Change is spacing), need a personal read.sync function: --> modifed script == read.sync_personal_function.R
Rscript: combining CSV files: combinePoolseqCSV.R
Ex. For individual Chromosomes: combine_poolseq_individual_Chromo.R
Plot: overlay (red = control) over selection coefficients of Selections:

Notes:
-
need to edit to make more efficient to run (not taking 3 days per section)
-
poolSeq may not work on certain versions of R; but can bring in Taus poolSeq scripts in manually and source (done in this script)
-
may need to ensure updated packages and install if necessary (i.e matrixStats_0.53.0 installed for this reason)
-
breaking the .sync file causes changes in structure, so a modified read_sync function is used (in R script; Taus_ReadSync.R))
First draft of plots
Full Chromosome:
-- Numbers/ peaks do not line up perfectly (product of how they are plotted/ measured (windows vs. positions)
Fst window and ranges in data frame:
### Adjust the FST output (FDR) and keep positons that have an Fst value after adjusting
require(data.table)
require(tidyverse)
### Read in the data (final generation) with comparison between Control and Selection lines:
XC2 <- fread('combined_fst_1:3.csv')
CX2 <- fread('combined_fst_2:4.csv')
### One data frame:
ddat <- rbind(CX2, XC2)
### Mean Fst b/w two replicates:
ddat2 <- ddat %>%
group_by(chr, window) %>%
summarise(meanFst = (mean(meanFst)))
### Fdr correction: Adjust the calculated Fst values with a false discovery rate:
ddat2$adjustFst <- p.adjust(ddat2$meanFst, method = 'fdr')
### Remove positions with an Fst of 0
ddat2 <- ddat2[-which(ddat2$adjustFst==0),]
### The range of the 500 bp window (window == center)
ddat2$minWindow <- ddat2$window -250
ddat2$maxWindow <- ddat2$window +250
### need to sort positions based on chromosome (for trajectories):
fst_X <- ddat2[which(ddat2$chr=='X'),]
fst_2L <- ddat2[which(ddat2$chr=='2L'),]
fst_2R <- ddat2[which(ddat2$chr=='2R'),]
fst_3L <- ddat2[which(ddat2$chr=='3L'),]
fst_3R <- ddat2[which(ddat2$chr=='3R'),]
fst_4 <- ddat2[which(ddat2$chr=='4'),]
### create positional data frame (to be sources later:
posfst_X <- as.data.frame(cbind(fst_X$window, fst_X$minWindow, fst_X$maxWindow))
posfst_2L <- as.data.frame(cbind(fst_2L$window, fst_2L$minWindow, fst_2L$maxWindow))
posfst_2R <- as.data.frame(cbind(fst_2R$window, fst_2R$minWindow, fst_2R$maxWindow))
posfst_3L <- as.data.frame(cbind(fst_3L$window, fst_3L$minWindow, fst_3L$maxWindow))
posfst_3R <- as.data.frame(cbind(fst_3R$window, fst_3R$minWindow, fst_3R$maxWindow))
posfst_4 <- as.data.frame(cbind(fst_4$window, fst_4$minWindow, fst_4$maxWindow))
### Keep only position data frame for sourcing:
rm(list=ls()[! ls() %in% c('posfst_X', "posfst_2L", 'posfst_2R', 'posfst_3R', 'posfst_3L','posfst_4')])
Positions with significant change after FDR adjustments:
### Finding Positions of interest from Model:
### Packages:
require(dplyr)
require(ggplot2)
require(data.table)
### Read in each Chromosomal Data:
ddatX <- fread('../Data/X_Chromosome_TxG.csv', h=T)
chr_X <- ddatX %>%
group_by(position, chr, Effects) %>%
summarise(mean_p = (mean(p.value)))
rm(ddatX)
a <- nrow(chr_X)
chr_X$number <- 1:a
ddat2L <- fread('../Data/2L_Chromosome_TxG.csv', h=T)
chr_2L <- ddat2L %>%
group_by(position, chr, Effects) %>%
summarise(mean_p = (mean(p.value)))
rm(ddat2L)
b <- nrow(chr_2L)
chr_2L$number <- (a+1):(a+b)
ddat2R <- fread('../Data/2R_Chromosome_TxG.csv', h=T)
chr_2R <- ddat2R %>%
group_by(position, chr, Effects) %>%
summarise(mean_p = (mean(p.value)))
rm(ddat2R)
c <- nrow(chr_2R)
chr_2R$number <- (a+b+1):(a+b+c)
ddat3L <- fread('../Data/3L_Chromosome_TxG.csv', h=T)
chr_3L <- ddat3L %>%
group_by(position, chr, Effects) %>%
summarise(mean_p = (mean(p.value)))
rm(ddat3L)
d <- nrow(chr_3L)
chr_3L$number <- (a+b+c+1):(a+b+c+d)
ddat3R <- fread('../Data/3R_Chromosome_TxG.csv', h=T)
chr_3R <- ddat3R %>%
group_by(position, chr, Effects) %>%
summarise(mean_p = (mean(p.value)))
rm(ddat3R)
e <- nrow(chr_3R)
chr_3R$number <- (a+b+c+d+1):(a+b+c+d+e)
ddat4 <- fread('../Data/4_Chromosome_TxG.csv', h=T)
chr_4 <- ddat4 %>%
group_by(position, chr, Effects) %>%
summarise(mean_p = (mean(p.value)))
rm(ddat4)
f <- nrow(chr_4)
chr_4$number <- (a+b+c+d+e+1):(a+b+c+d+e+f)
chr_4$chr <- as.character(chr_4$chr)
### One large data frame:
CHROMOs <- rbind(chr_X, chr_2L, chr_2R, chr_3L, chr_3R, chr_4)
### Fdr correction: adjust p values with false discrovey rate correction:
CHROMOs$adjustP <- p.adjust(CHROMOs$mean_p, method = 'fdr')
### Filter slightly to make it easier to wotk with:
CHROMOs_3 <- CHROMOs[which(CHROMOs$adjustP<0.9),]
# Split into chromosomes and keep significant postions:
ddat2_X <- CHROMOs_3[which(CHROMOs_3$chr=='X' & CHROMOs_3$adjustP<0.05),]
pos_X <- ddat2_X$position
ddat2_2L <- CHROMOs_3[which(CHROMOs_3$chr=='2L' & CHROMOs_3$adjustP<0.05),]
pos_2L <- ddat2_2L$position
ddat2_2R <- CHROMOs_3[which(CHROMOs_3$chr=='2R' & CHROMOs_3$adjustP<0.05),]
pos_2R <- ddat2_2R$position
ddat2_3L <- CHROMOs_3[which(CHROMOs_3$chr=='3L' & CHROMOs_3$adjustP<0.05),]
pos_3L <- ddat2_3L$position
ddat2_3R <- CHROMOs_3[which(CHROMOs_3$chr=='3R' & CHROMOs_3$adjustP<0.05),]
pos_3R <- ddat2_3R$position
ddat2_4 <- CHROMOs_3[which(CHROMOs_3$chr=='4' & CHROMOs_3$adjustP<0.05),]
pos_4 <- ddat2_4$position
### Remove all but positions:
rm(list=ls()[! ls() %in% c('pos_X','pos_2L','pos_2R','pos_3L','pos_3R','pos_4')])
Positions showing significant selection coefficents in selection but not controls:
### Must be done for each chromosome seperatly:
###----- 2R -----###
novo_sel_2R <- fread('novo_episodic_2R_Sel.csv')
novo_con_2R <- fread('novo_episodic_2R_Con.csv')
bwa_sel_2R <- fread('bwa_episodic_2R_Sel.csv')
bwa_con_2R <- fread('bwa_episodic_2R_Con.csv')
column.names <- c('selcoef', 'pval', 'pos', 'chr')
colnames(novo_sel_2R) <- column.names
colnames(novo_con_2R) <- column.names
colnames(bwa_sel_2R) <- column.names
colnames(bwa_con_2R) <- column.names
novo_sel_2R <- na.omit(novo_sel_2R)
novo_con_2R <- na.omit(novo_con_2R)
bwa_sel_2R <- na.omit(bwa_sel_2R)
bwa_con_2R <- na.omit(bwa_con_2R)
novo_sel_2R$map <- 'Novo'
novo_con_2R$map <- 'Novo'
bwa_sel_2R$map <- 'bwa'
bwa_con_2R$map <- 'bwa'
novo_sel_2R$Treatment <- 'Sel'
novo_con_2R$Treatment <- 'Con'
bwa_sel_2R$Treatment <- 'Sel'
bwa_con_2R$Treatment <- 'Con'
Xcx_2R <- rbind(novo_con_2R, novo_sel_2R, bwa_con_2R, bwa_sel_2R)
###----- 2R -----###
### Repeat this for all chromosomes (changing instances of 2R to chr) to create one large data frame:
Xcx <- rbind(Xcx_2R, Xcx_2L)
### Make sure that all 2 (3) mappers have the position for chr/treatment:
CXC <- Xcx %>%
group_by(chr, Treatment, pos) %>%
mutate(count = n())
XCV <- CXC[which(CXC$count==2),]
### Get the mean selection coefficent and the least significant pvalue (max(pval)) between mappers:
Zxc <- XCV %>%
group_by(chr, Treatment, pos) %>%
summarise(meanSelCoef = (mean(selcoef)),
pval_max=max(pval))
### Adjust the p-value left with false discovery rate and keep significant postions:
Zxc$adjustP <- p.adjust(Zxc$pval_max, method = 'fdr')
Zxc$sig <- ifelse(Zxc$adjustP<0.05, "<0.05", ">0.05")
Zxc_sig <- Zxc[which(Zxc$sig=='<0.05'),]
### Check if there are positons that are significant and present for both control and selection (and keep only positions that are only significant for selection:
Zxc_count <- Zxc_sig %>%
group_by(chr, pos) %>%
mutate(count = n())
Zxc_count2 <- Zxc_count[which(Zxc_count$count==1),]
Zxc_count2 <- Zxc_count2[which(Zxc_count2$Treatment=='Sel'),]
### Positions by chromosome:
poolseq_X <- Zxc_count2[which(Zxc_count2$chr=='X'),]
pool_pos_X <- poolseq_X$pos
poolseq_2L <- Zxc_count2[which(Zxc_count2$chr=='2L'),]
pool_pos_2L <- poolseq_2L$pos
poolseq_2R <- Zxc_count2[which(Zxc_count2$chr=='2R'),]
pool_pos_2R <- poolseq_2R$pos
poolseq_3L <- Zxc_count2[which(Zxc_count2$chr=='3L'),]
pool_pos_3L <- poolseq_3L$pos
poolseq_3R <- Zxc_count2[which(Zxc_count2$chr=='3R'),]
pool_pos_3R <- poolseq_3R$pos
poolseq_4 <- Zxc_count2[which(Zxc_count2$chr=='4'),]
pool_pos_4 <- poolseq_4$pos
### keep only positions:
rm(list=ls()[! ls() %in% c('pool_pos_2L', 'pool_pos_2R', 'pool_pos_3L', 'pool_pos_3R', 'pool_pos_4', 'pool_pos_X')])
Ex. with BWA -mem output sync files
Rscript: extract_sig_Chromo_positions.R
running rscript:
Rscript extract_sig_Chromo_positions.R


