Skip to content

dylandebaun/cactus2trees

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

10 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

Cactus2Trees

These scripts allow the user to convert a whole genome alignment produced by Progressive Cactus in HAL format into gene trees for non-overlapping sliding windows of the genome. Shell scripts are designed for HPC clusters with PBS and/or SLURM schedulers but can be easily adapted.

1. Obtain fasta and MAF files from HAL

For MAF conversion, choose a reference genome to use as a coordinate system.

halStats $name/$HAL --genomes > $name/genomes.txt

for genomename in $(<$name/genomes.txt); do
    # Check if the entry does NOT start with "Anc"
    if [[ ! $genomename =~ ^Anc ]]; then
        echo "Processing genome: $genomename"
        hal2fasta "$name/$HAL" "$genomename" > $name/${genomename}.fa
        sed -i "/^>/s/^>/&${genomename}./" $name/${genomename}.fa
    fi
done

cactus-hal2maf ./${name}/js_${name} ./${name}/$HAL ./${name}/${name}_hal.taf.gz --refGenome $reference --chunkSize 1000000 --dupeMode single  --batchMemory 50G --maxMemory 100G  --binariesMode local --batchSystem single_machine  --cleanWorkDir onSuccess --workDir ./temp --batchCores 2 --index --noAncestors --maxCores 60 --batchParallelHal2maf 1

2. Split MAF files into windows

Create a bed file including all windows of interest. 2splitmafs.sh takes all 10kb non-overlapping sliding windows of the reference genome. taffy view extracts the region from the maf file.

taffy view --inputFile $TAF --outputFile $WINDOWi_MAF_PATH --maf --region $REFERENCE_GENOME_NAME"."$REGIONi

3. Convert MAF to Fasta for small windows

Here I also set a threshold for missing data, removing windows with more than 5000bp of missing data on average across all indiviudals.

#Convert maf2fasta
mafToFastaStitcher --maf $file --seqs $fastas --outMfa ${OUT_DIR}/${label}.mfa --breakpointPenalty 5 --interstitialSequence 20

#Check amount of missingness
samtools faidx ${OUT_DIR}/${label}.mfa
awk '{print $1 "\t0\t" $2}' "${OUT_DIR}/${label}.mfa.fai" > "$OUT_DIR/${label}.bed"
bedtools nuc -fi $OUT_DIR/${label}.mfa -bed $OUT_DIR/${label}.bed > $OUT_DIR/${label}_res.txt
average=$(awk 'NR > 2 {sum += ($10 + $11); count++} END {if (count > 0) print sum / count; else print 0}' "$OUT_DIR/${label}_res.txt")
if (( $(echo "$average < 5000" | bc -l) )); then
echo "$label : iqtree"
mv $OUT_DIR/${label}.mfa $iqtree_dir
fi

4. Infer gene trees with IQTree

IQTree will pick the most likely model of sequence evolution for each window and infer the gene tree using that. 4iqtree.sh also includes some scripts for rooting trees.

iqtree2 -S $iqtree_dir  --prefix ${name}_${CHROM}_${window_size}_${test_id} -T 90 -safe -redo

About

Convert whole genome alignments to locus tree data sets

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors