kfuku52/csubst

Overview

CSUBST (/si:sʌbst/) is a tool for analyzing Combinatorial SUBSTitutions of codon sequences in phylogenetic trees. Combinatorial substitutions are recurrent substitutions that occur at the same protein site in multiple independent branches. If the independent substitutions result in the same amino acid, they are considered convergent amino acid substitutions. The main features of CSUBST include:

  • Error-corrected rate of protein convergence with null expectation obtained by:
    • Empirical or mechanistic codon substitution model
    • Urn sampling from site-wise substitution frequencies (experimental)
  • Flexible specification of "foreground" lineages and their comparison with neighboring branches
  • Heuristic detection of higher-order convergence involving more than two branches
  • Simulated sequence evolution under specified scenarios of convergent evolution
  • Convergent substitution mapping to protein structure
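
The definition of combinatorial and convergent substitutions given above can be illustrated with a minimal sketch. This is not CSUBST's implementation: the function name, the labels, and the example branch IDs are hypothetical, and the branches are assumed to be mutually independent (no branch is an ancestor of another).

```python
from itertools import combinations


def classify_branch_pairs(site_states):
    """Classify branch pairs at one protein site.

    site_states maps a branch ID to (parent_amino_acid, child_amino_acid).
    Branches are assumed to be mutually independent; this sketch does not
    inspect the tree to verify that.
    """
    # Keep only branches on which the amino acid actually changed.
    substituted = {b: pc for b, pc in site_states.items() if pc[0] != pc[1]}
    labels = {}
    for b1, b2 in combinations(sorted(substituted), 2):
        # Same derived state on both branches -> convergent substitution.
        same_derived = substituted[b1][1] == substituted[b2][1]
        labels[(b1, b2)] = "convergent" if same_derived else "combinatorial only"
    return labels


# Hypothetical site: branches 23 and 51 both end in serine (S),
# branch 40 changes to threonine (T), and branch 7 is unchanged.
example = {23: ("A", "S"), 51: ("G", "S"), 7: ("A", "A"), 40: ("A", "T")}
for pair, label in classify_branch_pairs(example).items():
    print(pair, label)
```

Only the pair (23, 51) shares a derived amino acid, so it is the only pair labeled convergent. Branch 7 carries no substitution and does not enter any pair.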

Input files

CSUBST takes as inputs:

  • Newick file for the rooted tree
  • FASTA file for the multiple sequence alignment of in-frame coding sequences
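
As a rough illustration of what "in-frame" means for the alignment, the sketch below checks that all sequences have the same length and that this length is a multiple of three. It is a standalone example with hypothetical function names, not the logic used by csubst doctor.

```python
def read_fasta(text):
    """Parse FASTA-formatted text into a dict of header -> sequence."""
    records = {}
    name = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            name = line[1:].strip()
            records[name] = ""
        elif name is not None:
            records[name] += line
    return records


def check_codon_alignment(records):
    """Return a list of problems; an empty list means the basic checks passed."""
    problems = []
    lengths = {len(seq) for seq in records.values()}
    if len(lengths) > 1:
        problems.append("sequences differ in length: %s" % sorted(lengths))
    for name, seq in records.items():
        if len(seq) % 3 != 0:
            problems.append("%s: length %d is not a multiple of 3" % (name, len(seq)))
    return problems


# Toy alignment: sequence "c" is one nucleotide short.
toy = ">a\nATGAAA\n>b\nATGAA-\n>c\nATGAA\n"
for problem in check_codon_alignment(read_fasta(toy)):
    print(problem)
```

The check is deliberately shallow: it does not verify the reading frame biologically, detect internal stop codons, or compare sequence names against the tree.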

Installation

CSUBST runs on Python 3. Installation via bioconda is recommended because it handles all dependencies automatically. Installation with pip is also supported, but in that case IQ-TREE and a few Python packages must be installed separately.

IQ-TREE compatibility: CSUBST supports IQ-TREE 2.x and 3.x outputs. In some IQ-TREE 3 codon runs, the .iqtree report does not print codon pi(...) entries. In that case, CSUBST estimates empirical codon frequencies from the input alignment as normalized codon counts, matching IQ-TREE's "State frequencies: (empirical counts from alignment)" convention [Minh et al., 2020]. Ambiguous IUPAC nucleotide symbols are expanded over all compatible codons with equal weights [Cornish-Bowden, 1985].
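
The idea of normalized codon counts with equal-weight expansion of ambiguous symbols can be sketched as follows. This is an illustration, not CSUBST's code. The function name is hypothetical, and three choices are assumptions of this sketch: the standard genetic code, exclusion of stop codons from the expansion, and skipping codons that contain gaps or other non-IUPAC characters.

```python
from collections import Counter
from itertools import product

# IUPAC nucleotide codes and the bases each one is compatible with.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T", "U": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}
STOP_CODONS = {"TAA", "TAG", "TGA"}  # assumption: standard genetic code


def empirical_codon_freqs(sequences):
    """Estimate codon frequencies from in-frame sequences by normalized counts."""
    counts = Counter()
    for seq in sequences:
        seq = seq.upper()
        for i in range(0, len(seq) - len(seq) % 3, 3):
            codon = seq[i:i + 3]
            if any(base not in IUPAC for base in codon):
                continue  # assumption: skip codons with gaps or unknown symbols
            # Expand ambiguity codes into every compatible sense codon.
            compatible = ["".join(p) for p in product(*(IUPAC[b] for b in codon))]
            compatible = [c for c in compatible if c not in STOP_CODONS]
            if not compatible:
                continue
            weight = 1.0 / len(compatible)  # equal weights across compatible codons
            for c in compatible:
                counts[c] += weight
    total = sum(counts.values())
    return {c: n / total for c, n in counts.items()} if total else {}


freqs = empirical_codon_freqs(["ATGAAR", "ATG---"])
for codon in sorted(freqs):
    print(codon, round(freqs[codon], 3))
```

In the toy input, the ambiguous codon AAR contributes half a count each to AAA and AAG, and the all-gap codon is ignored, so ATG ends up with two of the three total counts.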

Option 1: Installation with conda

conda install bioconda::csubst

Option 2: Installation with pip

# IQ-TREE should be installed separately: https://iqtree.github.io/
pip install git+https://github.com/kfuku52/csubst

Test run

# Generate a test dataset
csubst dataset --name PGK

# Run csubst search
csubst search --alignment_file alignment.fa.gz --rooted_tree_file tree.nwk --foreground foreground.txt

Usage

CSUBST provides eight main subcommands:

  • csubst dataset: generate bundled example datasets (e.g., PGK, PEPC).
  • csubst doctor: validate input files, inferred IQ-TREE paths, and optional 3Di settings before heavier runs.
  • csubst benchmark: run csubst search across parameter grids on the same input data and summarize runtime/output metrics.
  • csubst benchmark-plot: collect existing benchmark outputs, compare parameter-wise performance, and write an overview figure.
  • csubst search (legacy alias: csubst analyze): run convergence analysis and output metrics such as omegaC, dNC, and dSC.
  • csubst inspect: summarize branch mappings and inspect ancestral states.
  • csubst sites (legacy alias: csubst site): compute site-wise combinatorial substitutions for selected branch combinations, generate tree + site summary plots, and optionally map sites to protein structures.
  • csubst simulate: simulate codon sequence evolution under user-defined convergent scenarios.

Get available commands and options:

csubst -h
csubst SUBCOMMAND -h

Typical workflow:

# 1) Prepare a toy dataset
csubst dataset --name PGK

# 2) Validate inputs and inferred IQ-TREE paths
csubst doctor \
  --alignment_file alignment.fa.gz \
  --rooted_tree_file tree.nwk \
  --foreground foreground.txt

# 3) Run convergence analysis
csubst search \
  --alignment_file alignment.fa.gz \
  --rooted_tree_file tree.nwk \
  --foreground foreground.txt

# 4) Inspect site-wise convergence for a branch pair (example)
csubst sites \
  --alignment_file alignment.fa.gz \
  --rooted_tree_file tree.nwk \
  --branch_id 23,51

Benchmark multiple search settings on the same input:

csubst benchmark \
  --alignment_file alignment.fa.gz \
  --rooted_tree_file tree.nwk \
  --foreground foreground.txt \
  --benchmark_expectation_methods codon_model,urn \
  --benchmark_asrv_modes each,file \
  --benchmark_pseudocount_modes none,empirical

This writes csubst_benchmark_summary.tsv, csubst_benchmark_summary.json, and per-run logs under csubst_benchmark/.
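
To get a feel for how many search runs such a command implies, the sketch below enumerates the option values from the example above. It assumes the benchmark covers the full Cartesian product of the listed values, which the text here does not state explicitly. The dictionary keys are illustrative labels, not CSUBST option names.

```python
from itertools import product

# Option values copied from the benchmark command above; keys are illustrative.
grid = {
    "expectation_method": ["codon_model", "urn"],
    "asrv_mode": ["each", "file"],
    "pseudocount_mode": ["none", "empirical"],
}


def expand_grid(grid):
    """Return one dict per combination of the listed option values."""
    keys = list(grid)
    return [dict(zip(keys, values)) for values in product(*(grid[k] for k in keys))]


runs = expand_grid(grid)
print(len(runs), "combinations")
for run in runs:
    print(run)
```

Under that assumption, two values for each of three options give eight combinations, so the benchmark runtime grows multiplicatively with every option list that is extended.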

Plot and compare existing benchmark outputs:

csubst benchmark-plot \
  --benchmark_dir . \
  --benchmark_plot_format pdf

This writes csubst_benchmark_plot_summary.tsv, csubst_benchmark_plot_summary.json, and csubst_benchmark_plot_overview.pdf under csubst_benchmark_plot/.

For advanced settings (foreground formats, higher-order search, structure mapping, simulation parameters), see the CSUBST Wiki.

Citation

Fukushima K, Pollock DD. 2023. Detecting macroevolutionary genotype-phenotype associations using error-corrected rates of protein convergence. Nature Ecology & Evolution 7: 155–170. DOI: 10.1038/s41559-022-01932-7

Licensing

CSUBST is MIT-licensed. See LICENSE for details.
