Pipeline to construct species phylogenies from genome assemblies or variant call data (VCF).
flowchart TD
%% ----- INPUT -----
subgraph INPUT["Mode 1: BUSCO-based"]
A_fa["Genome assemblies<br/>(FASTA)"]
A_vcf["Per-sample VCFs + reference genome"]
A_gatk["Pseudo-genome assembly<br/>GATK FastaAlternateReferenceMaker"]
end
%% ----- BUSCO -----
subgraph BUSCO["Ortholog extraction"]
B_busco["Single-copy orthologs<br/>BUSCO"]
end
%% ----- PREPROCESSING -----
subgraph PREP["Sequence processing"]
C_aln["Multiple alignment<br/>MAFFT / Muscle / PRANK"]
C_flt["Alignment filtering<br/>ClipKIT / GBlocks / TrimAl"]
end
%% ----- TREE APPROACH -----
subgraph TREE["Multispecies coalescent approach"]
D_gt["Gene tree inference<br/>IQTree"]
D_ast["Phylogenetic inference<br/>Astral-IV"]
end
%% ----- CONCAT -----
subgraph CONCAT["Supermatrix approach"]
E_cat["Concat alignment"]
E_vcf["Concat alignment<br/>vcf2phylip.py"]
E_phy["Phylogenetic inference<br/>IQTree / MrBayes / PHYLIP / RAxML-NG / RapidNJ"]
end
%% ----- MODE 2 -----
M2["Mode 2: Multi-sample VCF<br/>(If vcf2phylip: True)"]
%% ----- EDGES: MODE 1 -----
A_fa --> B_busco
A_vcf --> A_gatk
A_gatk --> B_busco
B_busco --> C_aln
C_aln --> C_flt
C_flt --> D_gt
D_gt --> D_ast
C_flt --> E_cat
E_cat --> E_phy
%% ----- EDGES: MODE 2 -----
M2 --> E_vcf
E_vcf --> E_phy
%% ----- STYLE -----
classDef input fill:#e8f4ff,stroke:#2b7cd3,stroke-width:1px
classDef process fill:#eaf7ea,stroke:#2f9e44,stroke-width:1px
classDef phylo fill:#fff4e6,stroke:#e67700,stroke-width:1px
class A_fa,A_vcf,M2 input
class A_gatk,B_busco,C_aln,C_flt,E_cat,E_vcf process
class D_gt,D_ast,E_phy phylo
- Ortholog extraction: BUSCO
- VCF-based reconstruction: GATK FastaAlternateReferenceMaker, vcf2phylip
- Alignment: MAFFT, MUSCLE, PRANK
- Trimming: ClipKIT, TrimAl, GBlocks
- Phylogenetic tree construction: IQTree, MrBayes, ASTRAL-IV, RapidNJ, PHYLIP, RAxML-NG
- Visualization: Etetoolkit, Matplotlib
Clone the repository or download the latest release:
git clone https://github.com/tomarovsky/BuscoClade.gitBuscoClade has two running modes:
The main mode. Accepts genome assemblies in FASTA format, optionally supplemented with VCF-reconstructed pseudo-genomes. Both input types can be used simultaneously.
Option A — FASTA assemblies: Place genome assemblies into the genomes/ directory. The file prefix is used as the sample name in the output phylogeny. Supported extensions: .fasta, .fna, .fa, and their gzipped versions (.fasta.gz, .fna.gz, .fa.gz).
Option B — per-sample VCFs + reference genome: If you have per-sample VCFs, the pipeline reconstructs pseudo-genome assemblies using GATK FastaAlternateReferenceMaker, then feeds them into the standard BUSCO workflow alongside any FASTA assemblies from Option A.
Place per-sample VCF files and the corresponding reference genome together into a subdirectory under input/vcf_reconstruct/. Each subdirectory is processed independently, which allows reconstructing pseudo-genomes against different references in a single run:
input/
genomes/
Species1.fasta
Species2.fasta.gz
vcf_reconstruct/
project_hg38/ # one reference per directory
reference.fasta
SampleA.vcf.gz
SampleB.vcf.gz
project_mm39/ # another reference
reference.fasta
SampleC.vcf.gz
The directory name is used only for organization — the VCF file prefix determines the sample name in the output phylogeny. No additional config changes are needed; the pipeline detects subdirectories automatically.
An alternative shortcut that builds a concatenated phylip alignment directly from a multi-sample VCF using vcf2phylip.py, bypassing BUSCO and sequence alignment entirely.
To use this mode:
- Place exactly one multi-sample
.vcf.gzfile intoinput/vcf2phylip/. - Enable the mode in the config:
vcf2phylip: True.
input/
vcf2phylip/
all_samples.vcf.gz # exactly one multi-sample VCF
The two modes are mutually exclusive: when vcf2phylip: True is set, only Mode 2 runs. To use the BUSCO-based pipeline (Mode 1), keep vcf2phylip: False.
Modify config/default.yaml (recommended: copy it and pass with --configfile). The config has four sections:
Enable or disable tools and modes:
vcf2phylip: False # set True to enable Mode 2 (vcf2phylip)
quastcore: True # assembly statistics
alignment: "mafft" # 'mafft', 'muscle' or 'prank'
filtration: "clipkit" # 'clipkit', 'trimal' or 'gblocks'
iqtree: True
astral: True
rapidnj: True
phylip: True
raxml: True
mrbayes: False # recommended to run GPU-compiled version separately
draw_phylotrees: TrueKey parameters to configure before running:
VCF-based reconstruction (Mode 1B):
gatk_path: Path to the GATK directory (e.g."$TOOLS/gatk-4.6.2.0/").
BUSCO:
busco_dataset_path: Path to a pre-downloaded OrthoDB dataset (e.g."$TOOLS/busco_datasets/mammalia_odb12/").busco_options: Use"--offline"to run without internet access.busco_mode: Typically"genome".busco_blacklist: Path to a file with BUSCO IDs to exclude (optional).
Alignment (parameters passed directly to the chosen tool):
prank_params,mafft_params,muscle_params
Filtration:
gblocks_params,trimal_params
Phylogenetic inference:
iqtree_params: e.g."-keep-ident -m TESTNEW -bb 1000". Add-o 'OUTGROUP'to set an outgroup.astral_params: e.g."--support 2". Add--root 'OUTGROUP'to set an outgroup.raxml_params: e.g."--model GTR+G --bs-trees 100".rapidnj_params: e.g."-b 1000".phylip_dnadist_params: Use"D\n"for Kimura 2-parameter model, or""for F84 (default).phylip_neighbor_params: Use"N"for UPGMA, or""for NJ (default).mrbayes_params,mrbayes_block: MrBayes configuration block file and extra parameters.
Visualization:
tree_visualization_params: Specify outgroup as"--outgroup OUTGROUP".
Input and output paths are defined here. The defaults are:
# Input
genome_dir: "input/genomes/"
vcf_reconstruct_dir: "input/vcf_reconstruct/"
vcf2phylip_dir: "input/vcf2phylip/"
# Output (all under results/)
output_dir: "results/"It is recommended to leave the directory structure unchanged.
Per-tool Slurm settings: partition (*_queue), threads (*_threads), memory in MB (*_mem_mb), and runtime (*_time). Adjust these to match your cluster configuration. Note that BUSCO and PRANK are the most time-consuming steps and may require generous time limits (default: 150h and 100h respectively).
Install Snakemake:
mamba create -c conda-forge -c bioconda -c nodefaults -n snakemake snakemake snakemake-executor-plugin-cluster-generic
mamba activate snakemakeDry run to preview all steps:
snakemake --profile profile/slurm/ --configfile config/default.yaml --dry-runRemove --dry-run to start the actual run.
Move genome assemblies (or create empty placeholder files) into genomes/, then place BUSCO output directories under results/busco/. Expected structure for Ailurus_fulgens.fasta:
results/
busco/
Ailurus_fulgens/
busco_sequences/
fragmented_busco_sequences/
multi_copy_busco_sequences/
single_copy_busco_sequences/
hmmer_output/
logs/
metaeuk_output/
full_table_Ailurus_fulgens.tsv
missing_busco_list_Ailurus_fulgens.tsv
short_summary_Ailurus_fulgens.txt
short_summary.json
short_summary.specific.mammalia_odb10.Ailurus_fulgens.json
short_summary.specific.mammalia_odb10.Ailurus_fulgens.txt
Please email: andrey.tomarovsky@gmail.com for questions or feedback.
