These scripts allow the user to convert a whole genome alignment produced by Progressive Cactus in HAL format into gene trees for non-overlapping sliding windows of the genome. Shell scripts are designed for HPC clusters with PBS and/or SLURM schedulers but can be easily adapted.
For MAF conversion, choose a reference genome to use as a coordinate system.
halStats $name/$HAL --genomes > $name/genomes.txt
for genomename in $(<$name/genomes.txt); do
# Check if the entry does NOT start with "Anc"
if [[ ! $genomename =~ ^Anc ]]; then
echo "Processing genome: $genomename"
hal2fasta "$name/$HAL" "$genomename" > $name/${genomename}.fa
sed -i "/^>/s/^>/&${genomename}./" $name/${genomename}.fa
fi
done
cactus-hal2maf ./${name}/js_${name} ./${name}/$HAL ./${name}/${name}_hal.taf.gz --refGenome $reference --chunkSize 1000000 --dupeMode single --batchMemory 50G --maxMemory 100G --binariesMode local --batchSystem single_machine --cleanWorkDir onSuccess --workDir ./temp --batchCores 2 --index --noAncestors --maxCores 60 --batchParallelHal2maf 1
Create a bed file including all windows of interest. 2splitmafs.sh takes all 10kb non-overlapping sliding windows of the reference genome. taffy view extracts the region from the maf file.
taffy view --inputFile $TAF --outputFile $WINDOWi_MAF_PATH --maf --region $REFERENCE_GENOME_NAME"."$REGIONi
Here I also set a threshold for missing data, removing windows with more than 5000bp of missing data on average across all indiviudals.
#Convert maf2fasta
mafToFastaStitcher --maf $file --seqs $fastas --outMfa ${OUT_DIR}/${label}.mfa --breakpointPenalty 5 --interstitialSequence 20
#Check amount of missingness
samtools faidx ${OUT_DIR}/${label}.mfa
awk '{print $1 "\t0\t" $2}' "${OUT_DIR}/${label}.mfa.fai" > "$OUT_DIR/${label}.bed"
bedtools nuc -fi $OUT_DIR/${label}.mfa -bed $OUT_DIR/${label}.bed > $OUT_DIR/${label}_res.txt
average=$(awk 'NR > 2 {sum += ($10 + $11); count++} END {if (count > 0) print sum / count; else print 0}' "$OUT_DIR/${label}_res.txt")
if (( $(echo "$average < 5000" | bc -l) )); then
echo "$label : iqtree"
mv $OUT_DIR/${label}.mfa $iqtree_dir
fi
IQTree will pick the most likely model of sequence evolution for each window and infer the gene tree using that. 4iqtree.sh also includes some scripts for rooting trees.
iqtree2 -S $iqtree_dir --prefix ${name}_${CHROM}_${window_size}_${test_id} -T 90 -safe -redo