Skip to content

Preparing CSUBST inputs

Kenji Fukushima edited this page Mar 25, 2026 · 26 revisions

Rooted tree

Phylogenetic analysis

Phylogenetic trees obtained by ML, Bayes, NJ, or other methods can be used as input for CSUBST. Phylogenetic analysis software, such as IQ-TREE, RAxML-NG, and PhyloBayes, usually returns an unrooted Newick tree, so subsequent tree rooting will be necessary.

Tree rooting

Unrooted trees can be rooted by several methods including midpoint rooting, the minimal ancestor deviation (MAD), outgroup rooting, rootstrapping with non-reversible models (e.g., IQ-TREE), and phylogeny reconciliation (e.g., GeneRax and NOTUNG). Reconciliation-based methods are recommended if a species tree is available. NWKIT supports MAD, ND (minimum-variance), midpoint, and outgroup rooting.

In-frame codon alignment

Extracting coding sequences from transcripts

Transcripts (i.e., mRNA) contain both protein-coding sequences (CDSs) and untranslated regions (UTRs), and the former should be extracted for CSUBST. The code below shows an example of how SeqKit and CDSKIT can be combined to extract the longest ORF candidates from a transcriptome assembly by Trinity containing isoforms. Here, cdskit longestorf performs six-frame scanning and returns one longest ORF candidate per transcript. Note that cdskit aggregate -x "-i" collapses isoforms by sequence names.

seqkit rmdup --by-seq "Trinity.fasta" \
| cdskit longestorf --codontable 1 \
| cdskit pad \
| seqkit sort --by-name \
| cdskit aggregate -x "\-i[0-9].*" \
> "longest_orf.fa"

Padding truncated sequences with Ns

The complete CDSs, which start with a start codon and end with a stop codon, are ideal, but truncated CDSs are still analyzable. The code below shows how cdskit pad can make truncated sequences in-frame and how cdskit mask can replace ambiguous/stop codons with NNN.

cdskit pad --codontable 1 --seqfile "longest_orf.fa" \
| cdskit mask --codontable 1 --outfile "longest_orf.inframe.fa"

Generating codon alignments

With unaligned codon sequences and aligned protein sequences as input, in-frame codon alignments can be obtained with cdskit backalign.

cdskit translate \
--seqfile cds.unaligned.fasta \
--outfile pep.unaligned.fasta \
--codontable 1

mafft \
--auto \
--amino \
pep.unaligned.fasta \
> pep.aligned.fasta

cdskit backalign \
--seqfile cds.unaligned.fasta \
--aa_aln pep.aligned.fasta \
--outfile cds.aligned.fasta

Trimming poorly aligned codon sites

CSUBST can save required RAM by excluding codons that do not align well. This processing can be done with TrimAl or ClipKIT. Here is an example with TrimAl, but ClipKIT can also do the equivalent task in combination with cdskit backtrim.

trimal \
-in pep.aligned.untrimmed.fasta \
-backtrans cds.unaligned.untrimmed.fasta \
-out pep.aligned.trimmed.fasta \
-ignorestopcodon \
-automated1

Further trimming non-analyzable sites

Sites that are unambiguous (i.e., not gaps/ambiguous states) only for less than 4 OTUs cannot be analyzed for convergence. Such sites can be removed with cdskit hammer.

cdskit hammer \
--seqfile cds.aligned.trimmed.fasta \
--outfile cds.aligned.further_trimmed.fasta \
--codontable 1 \
--nail 4

Clone this wiki locally