Cactus2Trees

These scripts convert a whole-genome alignment produced by Progressive Cactus in HAL format into gene trees for non-overlapping sliding windows of the genome. The shell scripts are designed for HPC clusters with PBS and/or SLURM schedulers but can be easily adapted to other environments.

1. Obtain fasta and MAF files from HAL

For MAF conversion, choose a reference genome to use as a coordinate system.

halStats $name/$HAL --genomes > $name/genomes.txt

for genomename in $(<$name/genomes.txt); do
    # Check if the entry does NOT start with "Anc"
    if [[ ! $genomename =~ ^Anc ]]; then
        echo "Processing genome: $genomename"
        hal2fasta "$name/$HAL" "$genomename" > $name/${genomename}.fa
        sed -i "/^>/s/^>/&${genomename}./" $name/${genomename}.fa
    fi
done
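The sed call in the loop prefixes every FASTA header with the genome name, giving names of the form genome.contig, the same pattern used later for the --region argument. A minimal sketch of that renaming on a toy file follows; the genome name, contig names and temporary paths are hypothetical, and the output is written to a new file rather than edited in place so the snippet also runs with a non-GNU sed.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Hypothetical genome name and toy FASTA standing in for hal2fasta output
genomename="speciesA"
workdir=$(mktemp -d)
printf '>chr1\nACGT\n>chr2\nGGCC\n' > "$workdir/raw.fa"

# Same substitution as in the loop above: on header lines only, keep the
# leading ">" (the "&") and append "<genome>." directly after it
sed "/^>/s/^>/&${genomename}./" "$workdir/raw.fa" > "$workdir/${genomename}.fa"

# Show the renamed headers; sequence lines are left unchanged
headers=$(grep '^>' "$workdir/${genomename}.fa")
echo "$headers"
```

Running this on a copy of a real FASTA before the full loop is a quick way to confirm that the headers come out as expected.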

cactus-hal2maf ./${name}/js_${name} ./${name}/$HAL ./${name}/${name}_hal.taf.gz \
    --refGenome $reference --chunkSize 1000000 --dupeMode single \
    --batchMemory 50G --maxMemory 100G \
    --binariesMode local --batchSystem single_machine \
    --cleanWorkDir onSuccess --workDir ./temp \
    --batchCores 2 --index --noAncestors --maxCores 60 --batchParallelHal2maf 1

2. Split MAF files into windows

Create a BED file listing all windows of interest. 2splitmaf.sh uses all 10 kb non-overlapping sliding windows of the reference genome. taffy view then extracts each region from the indexed TAF file and writes it out as MAF.

taffy view --inputFile $TAF --outputFile $WINDOWi_MAF_PATH --maf --region $REFERENCE_GENOME_NAME"."$REGIONi
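One way to build such a window list, sketched here with plain awk, is to tile each reference contig from a .fai-style index (name and length in the first two columns). This is an illustration, not the contents of 2splitmaf.sh: the file names and contig lengths are hypothetical, the last window of each contig is simply truncated, and the contig:start-end region format with 0-based starts is an assumption that should be checked against the taffy view help text.

```shell
#!/usr/bin/env bash
set -euo pipefail

window=10000   # window size in bp (10 kb, as described above)

workdir=$(mktemp -d)
# Hypothetical .fai-style index: contig name <TAB> contig length
printf 'chr1\t25000\nchr2\t8000\n' > "$workdir/ref.fa.fai"

# Tile each contig with non-overlapping windows as 0-based, half-open
# BED intervals; the final window is cut off at the contig end
awk -v w="$window" 'BEGIN { OFS = "\t" } {
    for (s = 0; s < $2; s += w) {
        e = (s + w < $2) ? s + w : $2
        print $1, s, e
    }
}' "$workdir/ref.fa.fai" > "$workdir/windows.bed"

# Turn each interval into a region string (assumed form: contig:start-end)
regions=$(awk '{ print $1 ":" $2 "-" $3 }' "$workdir/windows.bed")
echo "$regions"
```

Each region string can then be appended to the reference genome name and a dot, as in the taffy view command above. Whether short trailing windows should be kept or dropped is a design choice; this sketch keeps them.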

3. Convert MAF to Fasta for small windows

Here I also set a threshold for missing data: windows with an average of 5000 bp or more of missing data (N or other non-ACGT characters, as counted by bedtools nuc) across all individuals are removed.

# Convert MAF to FASTA
mafToFastaStitcher --maf $file --seqs $fastas --outMfa ${OUT_DIR}/${label}.mfa --breakpointPenalty 5 --interstitialSequence 20

# Check amount of missing data
samtools faidx "${OUT_DIR}/${label}.mfa"
# One BED interval per sequence, spanning its full length
awk '{print $1 "\t0\t" $2}' "${OUT_DIR}/${label}.mfa.fai" > "${OUT_DIR}/${label}.bed"
bedtools nuc -fi "${OUT_DIR}/${label}.mfa" -bed "${OUT_DIR}/${label}.bed" > "${OUT_DIR}/${label}_res.txt"
# Columns 10 and 11 are num_N and num_oth; NR > 2 skips the first two lines of the bedtools nuc output
average=$(awk 'NR > 2 {sum += ($10 + $11); count++} END {if (count > 0) print sum / count; else print 0}' "${OUT_DIR}/${label}_res.txt")
# Keep the window for IQ-TREE only if the average is below 5000 bp
if (( $(echo "$average < 5000" | bc -l) )); then
    echo "$label : iqtree"
    mv "${OUT_DIR}/${label}.mfa" "$iqtree_dir"
fi
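The filter above boils down to: count the missing characters in each sequence, average over sequences, and keep the window if the average is below the threshold. The sketch below reproduces that logic with awk alone on a toy alignment so it can be tried without samtools or bedtools. The sequence names and file paths are hypothetical, every character other than A, C, G or T (in either case) is treated as missing, and, unlike the NR > 2 filter above, all sequences are included in the average.

```shell
#!/usr/bin/env bash
set -euo pipefail

threshold=5000   # same cutoff as in the step above
workdir=$(mktemp -d)

# Hypothetical toy alignment; gaps and Ns stand for missing data
printf '>ref.chr1\nACGTACGT\n>sp1.chr1\nAC--NNGT\n>sp2.chr1\n--ACGTAC\n' > "$workdir/toy.mfa"

# Count non-ACGT characters over all sequence lines and divide by the
# number of sequences (header lines) to get the per-sequence average
average=$(awk '
    /^>/ { n++; next }
    {
        seq = toupper($0)
        gsub(/[ACGT]/, "", seq)     # what remains is counted as missing
        missing += length(seq)
    }
    END { if (n > 0) print missing / n; else print 0 }
' "$workdir/toy.mfa")

# Numeric comparison done in awk, so bc is not required here
if awk -v a="$average" -v t="$threshold" 'BEGIN { exit !(a < t) }'; then
    decision="keep"
else
    decision="drop"
fi
echo "average missing per sequence: $average -> $decision"
```

For the toy input the three sequences carry 0, 4 and 2 missing characters, so the average is 2 and the window is kept.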

4. Infer gene trees with IQTree

IQTree will select the best-fitting model of sequence evolution for each window and infer the gene tree under that model. 4iqtree.sh also includes some scripts for rooting the trees.

iqtree2 -S $iqtree_dir  --prefix ${name}_${CHROM}_${window_size}_${test_id} -T 90 -safe -redo
