Snakemake workflow: BuscoClade

Description

Pipeline to construct species phylogenies from genome assemblies or variant call data (VCF).

Workflow

flowchart TD

%% ----- INPUT -----
subgraph INPUT["Mode 1: BUSCO-based"]
A_fa["Genome assemblies<br/>(FASTA)"]
A_vcf["Per-sample VCFs + reference genome"]
A_gatk["Pseudo-genome assembly<br/>GATK FastaAlternateReferenceMaker"]
end

%% ----- BUSCO -----
subgraph BUSCO["Ortholog extraction"]
B_busco["Single-copy orthologs<br/>BUSCO"]
end

%% ----- PREPROCESSING -----
subgraph PREP["Sequence processing"]
C_aln["Multiple alignment<br/>MAFFT / Muscle / PRANK"]
C_flt["Alignment filtering<br/>ClipKIT / GBlocks / TrimAl"]
end

%% ----- TREE APPROACH -----
subgraph TREE["Multispecies coalescent approach"]
D_gt["Gene tree inference<br/>IQTree"]
D_ast["Phylogenetic inference<br/>Astral-IV"]
end

%% ----- CONCAT -----
subgraph CONCAT["Supermatrix approach"]
E_cat["Concat alignment"]
E_vcf["Concat alignment<br/>vcf2phylip.py"]
E_phy["Phylogenetic inference<br/>IQTree / MrBayes / PHYLIP / RAxML-NG / RapidNJ"]
end

%% ----- MODE 2 -----
M2["Mode 2: Multi-sample VCF<br/>(If vcf2phylip: True)"]

%% ----- EDGES: MODE 1 -----
A_fa --> B_busco
A_vcf --> A_gatk
A_gatk --> B_busco
B_busco --> C_aln
C_aln --> C_flt
C_flt --> D_gt
D_gt --> D_ast
C_flt --> E_cat
E_cat --> E_phy

%% ----- EDGES: MODE 2 -----
M2 --> E_vcf
E_vcf --> E_phy

%% ----- STYLE -----
classDef input fill:#e8f4ff,stroke:#2b7cd3,stroke-width:1px
classDef process fill:#eaf7ea,stroke:#2f9e44,stroke-width:1px
classDef phylo fill:#fff4e6,stroke:#e67700,stroke-width:1px

class A_fa,A_vcf,M2 input
class A_gatk,B_busco,C_aln,C_flt,E_cat,E_vcf process
class D_gt,D_ast,E_phy phylo

Ortholog extraction: BUSCO
VCF-based reconstruction: GATK FastaAlternateReferenceMaker, vcf2phylip
Alignment: MAFFT, MUSCLE, PRANK
Trimming: ClipKIT, TrimAl, GBlocks
Phylogenetic tree construction: IQTree, MrBayes, ASTRAL-IV, RapidNJ, PHYLIP, RAxML-NG
Visualization: Etetoolkit, Matplotlib

Usage

Step 1. Deploy workflow

Clone the repository or download the latest release:

git clone https://github.com/tomarovsky/BuscoClade.git

Step 2. Prepare input data

BuscoClade has two running modes:

Mode 1: BUSCO-based phylogeny (default)

The main mode. Accepts genome assemblies in FASTA format, optionally supplemented with VCF-reconstructed pseudo-genomes. Both input types can be used simultaneously.

Option A — FASTA assemblies: Place genome assemblies into the genomes/ directory. The file prefix is used as the sample name in the output phylogeny. Supported extensions: .fasta, .fna, .fa, and their gzipped versions (.fasta.gz, .fna.gz, .fa.gz).

Option B — per-sample VCFs + reference genome: If you have per-sample VCFs, the pipeline reconstructs pseudo-genome assemblies using GATK FastaAlternateReferenceMaker, then feeds them into the standard BUSCO workflow alongside any FASTA assemblies from Option A.

Place per-sample VCF files and the corresponding reference genome together into a subdirectory under input/vcf_reconstruct/. Each subdirectory is processed independently, which allows reconstructing pseudo-genomes against different references in a single run:

input/
    genomes/
        Species1.fasta
        Species2.fasta.gz
    vcf_reconstruct/
        project_hg38/               # one reference per directory
            reference.fasta
            SampleA.vcf.gz
            SampleB.vcf.gz
        project_mm39/               # another reference
            reference.fasta
            SampleC.vcf.gz

The directory name is used only for organization — the VCF file prefix determines the sample name in the output phylogeny. No additional config changes are needed; the pipeline detects subdirectories automatically.

Mode 2: vcf2phylip (optional, off by default)

An alternative shortcut that builds a concatenated phylip alignment directly from a multi-sample VCF using vcf2phylip.py, bypassing BUSCO and sequence alignment entirely.

To use this mode:

Place exactly one multi-sample .vcf.gz file into input/vcf2phylip/.
Enable the mode in the config: vcf2phylip: True.

input/
    vcf2phylip/
        all_samples.vcf.gz    # exactly one multi-sample VCF

The two modes are mutually exclusive: when vcf2phylip: True is set, only Mode 2 runs. To use the BUSCO-based pipeline (Mode 1), keep vcf2phylip: False.

Step 3. Configure workflow

Modify config/default.yaml (recommended: copy it and pass with --configfile). The config has four sections:

Pipeline configuration

Enable or disable tools and modes:

vcf2phylip: False       # set True to enable Mode 2 (vcf2phylip)
quastcore: True         # assembly statistics

alignment: "mafft"      # 'mafft', 'muscle' or 'prank'
filtration: "clipkit"   # 'clipkit', 'trimal' or 'gblocks'

iqtree: True
astral: True
rapidnj: True
phylip: True
raxml: True
mrbayes: False          # recommended to run GPU-compiled version separately

draw_phylotrees: True

Tool parameters

Key parameters to configure before running:

VCF-based reconstruction (Mode 1B):

gatk_path: Path to the GATK directory (e.g. "$TOOLS/gatk-4.6.2.0/").

BUSCO:

busco_dataset_path: Path to a pre-downloaded OrthoDB dataset (e.g. "$TOOLS/busco_datasets/mammalia_odb12/").
busco_options: Use "--offline" to run without internet access.
busco_mode: Typically "genome".
busco_blacklist: Path to a file with BUSCO IDs to exclude (optional).

Alignment (parameters passed directly to the chosen tool):

prank_params, mafft_params, muscle_params

Filtration:

gblocks_params, trimal_params

Phylogenetic inference:

iqtree_params: e.g. "-keep-ident -m TESTNEW -bb 1000". Add -o 'OUTGROUP' to set an outgroup.
astral_params: e.g. "--support 2". Add --root 'OUTGROUP' to set an outgroup.
raxml_params: e.g. "--model GTR+G --bs-trees 100".
rapidnj_params: e.g. "-b 1000".
phylip_dnadist_params: Use "D\n" for Kimura 2-parameter model, or "" for F84 (default).
phylip_neighbor_params: Use "N" for UPGMA, or "" for NJ (default).
mrbayes_params, mrbayes_block: MrBayes configuration block file and extra parameters.

Visualization:

tree_visualization_params: Specify outgroup as "--outgroup OUTGROUP".

Directory structure

Input and output paths are defined here. The defaults are:

# Input
genome_dir:          "input/genomes/"
vcf_reconstruct_dir: "input/vcf_reconstruct/"
vcf2phylip_dir:      "input/vcf2phylip/"

# Output (all under results/)
output_dir:          "results/"

It is recommended to leave the directory structure unchanged.

Resources

Per-tool Slurm settings: partition (*_queue), threads (*_threads), memory in MB (*_mem_mb), and runtime (*_time). Adjust these to match your cluster configuration. Note that BUSCO and PRANK are the most time-consuming steps and may require generous time limits (default: 150h and 100h respectively).

Step 4. Execute workflow

Install Snakemake:

mamba create -c conda-forge -c bioconda -c nodefaults -n snakemake snakemake snakemake-executor-plugin-cluster-generic
mamba activate snakemake

Dry run to preview all steps:

snakemake --profile profile/slurm/ --configfile config/default.yaml --dry-run

Remove --dry-run to start the actual run.

Advanced usage

Starting from completed BUSCO results

Move genome assemblies (or create empty placeholder files) into genomes/, then place BUSCO output directories under results/busco/. Expected structure for Ailurus_fulgens.fasta:

results/
    busco/
        Ailurus_fulgens/
            busco_sequences/
                fragmented_busco_sequences/
                multi_copy_busco_sequences/
                single_copy_busco_sequences/
            hmmer_output/
            logs/
            metaeuk_output/
            full_table_Ailurus_fulgens.tsv
            missing_busco_list_Ailurus_fulgens.tsv
            short_summary_Ailurus_fulgens.txt
            short_summary.json
            short_summary.specific.mammalia_odb10.Ailurus_fulgens.json
            short_summary.specific.mammalia_odb10.Ailurus_fulgens.txt

Contact

Please email: andrey.tomarovsky@gmail.com for questions or feedback.

Name		Name	Last commit message	Last commit date
Latest commit History 236 Commits
config		config
input		input
profile		profile
resources		resources
results		results
workflow		workflow
.gitignore		.gitignore
LICENSE		LICENSE
README.md		README.md
Snakefile		Snakefile
logo.png		logo.png

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Snakemake workflow: BuscoClade

Description

Workflow

Usage

Step 1. Deploy workflow

Step 2. Prepare input data

Mode 1: BUSCO-based phylogeny (default)

Mode 2: vcf2phylip (optional, off by default)

Step 3. Configure workflow

Pipeline configuration

Tool parameters

Directory structure

Resources

Step 4. Execute workflow

Advanced usage

Starting from completed BUSCO results

Contact

About

Uh oh!

Releases 11

Packages

Uh oh!

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

Snakemake workflow: BuscoClade

Description

Workflow

Usage

Step 1. Deploy workflow

Step 2. Prepare input data

Mode 1: BUSCO-based phylogeny (default)

Mode 2: vcf2phylip (optional, off by default)

Step 3. Configure workflow

Pipeline configuration

Tool parameters

Directory structure

Resources

Step 4. Execute workflow

Advanced usage

Starting from completed BUSCO results

Contact

About

Topics

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases 11

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages