-
Notifications
You must be signed in to change notification settings - Fork 6
Preparing CSUBST inputs
Phylogenetic trees obtained by ML, Bayes, NJ, or other methods can be used as input for CSUBST. Phylogenetic analysis software, such as IQ-TREE, RAxML-NG, and PhyloBayes, usually returns an unrooted Newick tree, so subsequent tree rooting will be necessary.
Unrooted trees can be rooted by several methods including midpoint rooting, the minimal ancestor deviation (MAD), outgroup rooting, rootstrapping with non-reversible models (e.g., IQ-TREE), and phylogeny reconciliation (e.g., GeneRax and NOTUNG). Reconciliation-based methods are recommended if a species tree is available. NWKIT supports MAD, ND (minimum-variance), midpoint, and outgroup rooting.
Transcripts (i.e., mRNA) contain both protein-coding sequences (CDSs) and untranslated regions (UTRs), and the former should be extracted for CSUBST. The code below shows an example of how SeqKit and CDSKIT can be combined to extract the longest ORF candidates from a transcriptome assembly by Trinity containing isoforms. Here, cdskit longestorf performs six-frame scanning and returns one longest ORF candidate per transcript. Note that cdskit aggregate -x "-i" collapses isoforms by sequence names.
seqkit rmdup --by-seq "Trinity.fasta" \
| cdskit longestorf --codontable 1 \
| cdskit pad \
| seqkit sort --by-name \
| cdskit aggregate -x "\-i[0-9].*" \
> "longest_orf.fa"
The complete CDSs, which start with a start codon and end with a stop codon, are ideal, but truncated CDSs are still analyzable. The code below shows how cdskit pad can make truncated sequences in-frame and how cdskit mask can replace ambiguous/stop codons with NNN.
cdskit pad --codontable 1 --seqfile "longest_orf.fa" \
| cdskit mask --codontable 1 --outfile "longest_orf.inframe.fa"
With unaligned codon sequences and aligned protein sequences as input, in-frame codon alignments can be obtained with cdskit backalign.
cdskit translate \
--seqfile cds.unaligned.fasta \
--outfile pep.unaligned.fasta \
--codontable 1
mafft \
--auto \
--amino \
pep.unaligned.fasta \
> pep.aligned.fasta
cdskit backalign \
--seqfile cds.unaligned.fasta \
--aa_aln pep.aligned.fasta \
--outfile cds.aligned.fasta
CSUBST can save required RAM by excluding codons that do not align well. This processing can be done with TrimAl or ClipKIT. Here is an example with TrimAl, but ClipKIT can also do the equivalent task in combination with cdskit backtrim.
trimal \
-in pep.aligned.untrimmed.fasta \
-backtrans cds.unaligned.untrimmed.fasta \
-out pep.aligned.trimmed.fasta \
-ignorestopcodon \
-automated1
Sites that are unambiguous (i.e., not gaps/ambiguous states) only for less than 4 OTUs cannot be analyzed for convergence. Such sites can be removed with cdskit hammer.
cdskit hammer \
--seqfile cds.aligned.trimmed.fasta \
--outfile cds.aligned.further_trimmed.fasta \
--codontable 1 \
--nail 4