sumt computes summary trees and associated statistics from one or more files of phylogenetic trees.
Input trees are typically posterior samples from Bayesian MCMC analyses (BEAST / MrBayes), but could be any collection of plausible trees (e.g. equally parsimonious trees, or bootstrap replicates).
Supported summary tree types:
- Majority-rule consensus (--con)
- Majority-rule consensus + all compatible bipartitions (--all)
- Maximum clade credibility (--mcc)
- Maximum bipartition credibility (--mbc)
- HIPSTR (--hip) and majority-rule HIPSTR (--mrhip)
Branch-length / node-depth options:
- --noblen (topology + support only)
- --biplen (mean bipartition lengths)
- --meandepth (mean clade depths, then derive branch lengths)
- --cadepth (mean MRCA depths, then derive branch lengths; like TreeAnnotator “--height ca”)
Rooting options:
- --rootmid, --rootminvar
- --rootog TAX[,TAX,...] or --rootogfile FILE
- --rootcred to compute root credibility on the output tree
Version 4 is a major release (breaking CLI changes, plus new capabilities).
- Now with multi-processing and computation of median + credible intervals
- For a high-level overview, see What changed (4.0.0).
- For migration help, see Upgrading from 3.x → 4.x.
python3 -m pip install sumt
Upgrade:
python3 -m pip install --upgrade sumt
This README uses small example files that live in the repository under examples/data/.
macOS and most Linux distributions include curl. Windows 10/11 typically includes curl as well.
# BEAST example (NEXUS tree file, 1000 trees, 12 tips)
curl -L -o primate-mtDNA.trees \
https://raw.githubusercontent.com/agormp/sumt/main/examples/data/primate-mtDNA.trees
# MrBayes example (two runs; .t tree files)
curl -L -o mrbayes.1.t \
https://raw.githubusercontent.com/agormp/sumt/main/examples/data/mrbayes.1.t
curl -L -o mrbayes.2.t \
https://raw.githubusercontent.com/agormp/sumt/main/examples/data/mrbayes.2.t
# Alternative without curl, using Python's standard library
python -c "import urllib.request as u; u.urlretrieve('https://raw.githubusercontent.com/agormp/sumt/main/examples/data/primate-mtDNA.trees','primate-mtDNA.trees')"
python -c "import urllib.request as u; u.urlretrieve('https://raw.githubusercontent.com/agormp/sumt/main/examples/data/mrbayes.1.t','mrbayes.1.t')"
python -c "import urllib.request as u; u.urlretrieve('https://raw.githubusercontent.com/agormp/sumt/main/examples/data/mrbayes.2.t','mrbayes.2.t')"
For additional short copy/paste patterns (including at least one example for every option), jump to Examples covering all main options.
For full option documentation, run sumt -h (or sumt --help).
# Majority-rule consensus tree, branch lengths by mean bipartition length, midpoint rooted
sumt --con --biplen --rootmid -b 0.1 primate-mtDNA.trees
This writes (by default) a summary tree file named primate-mtDNA.con (suffix = summary-tree type).
Use --outformat newick if you prefer Newick output.
# 95% central CI for branch lengths; automatic CPU selection (0 = automatic)
sumt --con --biplen --ci 0.95 --cpus 0 primate-mtDNA.trees
- --ci 0.95 computes a central 95% credible interval for each estimated branch length (or node depth)
  - Optional: --cik K adjusts quantile-approximation precision (details below)
- --cpus 0 chooses a default number of worker processes (use --cpus 1 to force single-process, or specify an exact number of processes)
This demonstrates:
- multiple input files
- burn-in via -b (one value for all files; optional comma-separated per-file values)
- optional ASDSF (average standard deviation of split frequencies) computation (-s) as a convergence diagnostic
- optional parallel processing (--cpus)
# Typical: same burn-in for both independent runs
sumt --con --biplen --rootmid -b 0.25 -s --cpus 0 \
mrbayes.1.t mrbayes.2.t
# If you really need different burn-in per file, use comma-separated values:
sumt --con --biplen --rootmid -b 0.25,0.4 -s --cpus 0 \
mrbayes.1.t mrbayes.2.t
# Add 80% and 95% credible intervals on branch lengths
sumt --con --biplen --rootmid -b 0.25 --ci 0.8,0.95 \
mrbayes.1.t mrbayes.2.t
Output files are written with suffixes matching the summary-tree type (.con, .mcc, .mbc, .hip, .mrhip).
Use --basename NAME to control the output prefix; otherwise it uses the stem of the first input file.
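The ASDSF diagnostic requested with -s in the multi-file example can be illustrated with a minimal Python sketch. It is not sumt's implementation: the function name, the toy split labels, the use of the sample standard deviation, and the rule that a split counts if it reaches the minimum frequency in at least one run are all assumptions made for illustration.

```python
from statistics import stdev

def asdsf(run_freqs, min_freq=0.1):
    """Average standard deviation of split frequencies across runs.

    run_freqs: one dict per run, mapping split -> frequency in that run.
    Assumption: a split is included if it reaches min_freq in at least one run.
    """
    all_splits = set().union(*run_freqs)
    sds = []
    for split in all_splits:
        freqs = [run.get(split, 0.0) for run in run_freqs]
        if max(freqs) >= min_freq:
            # Sample standard deviation (n - 1 denominator); an assumption
            sds.append(stdev(freqs))
    return sum(sds) / len(sds) if sds else 0.0

# Toy split frequencies from two hypothetical runs
run1 = {"AB|CDE": 1.00, "ABC|DE": 0.70, "ABD|CE": 0.05}
run2 = {"AB|CDE": 1.00, "ABC|DE": 0.60, "ABE|CD": 0.04}
print(round(asdsf([run1, run2]), 4))
```

Values close to zero indicate that the runs agree on split frequencies; rare splits below the threshold are ignored so that they do not dominate the average.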
In sumt (and the underlying phylotreelib), trees are typically treated as rooted at the bottom with tips at the top.
Accordingly, the code and output use:
- node depth = distance from the tips (leaves) back to a node (i.e., “time before the most recent leaf”)
In other phylogeny software, the same quantity is often called node height (because the root is drawn at the top). I am in the process of changing terminology to match that standard.
All summary trees represent a single “best” topology derived from a set of input trees, with support values (and optionally branch-length / node-depth summaries). If your input trees are not posterior samples (e.g., equally parsimonious trees), the reported supports are empirical frequencies in your set (how often a split/clade occurs), not Bayesian posterior probabilities.
The majority-rule consensus tree (--con) includes all bipartitions (splits) observed in ≥ 50% of the post-burnin trees. Support is typically reported as bipartition frequency (posterior probability under a Bayesian interpretation).
The majority-rule consensus with all compatible bipartitions (--all) starts with the majority-rule consensus tree and then considers additional bipartitions in descending frequency order. A bipartition is added if it is compatible with the current partially resolved tree. This continues until the tree is fully resolved or no more compatible bipartitions remain.
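The two consensus procedures above can be sketched in a few lines of Python. This is an illustrative sketch, not sumt's code: trees are represented as sets of splits, and each split is given as the frozenset of tips on the side that does not contain a fixed reference tip (here E). All names are hypothetical.

```python
from collections import Counter

def compatible(a, b):
    """Splits given as the tip set lacking a fixed reference tip are
    compatible if they are disjoint or one is nested in the other."""
    return a.isdisjoint(b) or a <= b or b <= a

def consensus_splits(trees, add_all_compatible=False):
    """trees: list of sets of splits. Returns {split: frequency}."""
    n = len(trees)
    counts = Counter(split for tree in trees for split in tree)
    # Majority rule, using the >= 50% threshold stated in the text
    # (a strict > 50% is what guarantees mutual compatibility at exact ties)
    chosen = {s: c / n for s, c in counts.items() if c / n >= 0.5}
    if add_all_compatible:
        # Greedy pass over the remaining splits, most frequent first
        rest = sorted((s for s in counts if s not in chosen),
                      key=lambda s: (-counts[s], sorted(s)))
        for s in rest:
            if all(compatible(s, t) for t in chosen):
                chosen[s] = counts[s] / n
    return chosen

# Five toy trees on tips A-E (reference tip E), non-trivial splits only
trees = [
    {frozenset("AB"), frozenset("ABC")},
    {frozenset("AB"), frozenset("ABC")},
    {frozenset("AB"), frozenset("ABD")},
    {frozenset("AB"), frozenset("CD")},
    {frozenset("AC"), frozenset("ACD")},
]
majority = consensus_splits(trees)
resolved = consensus_splits(trees, add_all_compatible=True)
```

In this toy data only AB passes the majority threshold; the greedy pass then adds ABC (40%), after which every remaining split conflicts with the tree.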
The maximum clade credibility tree (--mcc) is an observed input tree (not a newly constructed consensus topology) that maximizes the product of clade frequencies (equivalently, the sum of log clade frequencies). This is mainly meaningful for rooted (often clock-like) trees.
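As a rough illustration of the MCC criterion, the sketch below scores each input tree by the sum of log clade frequencies and returns the index of the best one. The representation (each tree as a set of clades, each clade a frozenset of tip names) and the function name are assumptions for the example, not sumt's API.

```python
import math
from collections import Counter

def mcc_index(trees):
    """trees: list of sets of clades (frozensets of tip names).
    Returns the index of the tree with the highest log clade credibility."""
    n = len(trees)
    counts = Counter(clade for tree in trees for clade in tree)

    def log_cred(tree):
        # Sum of logs = log of the product of clade frequencies
        return sum(math.log(counts[c] / n) for c in tree)

    return max(range(n), key=lambda i: log_cred(trees[i]))

# Four toy rooted trees on tips A-D (non-trivial clades only)
trees = [
    {frozenset("AB"), frozenset("ABC")},
    {frozenset("AB"), frozenset("ABC")},
    {frozenset("AB"), frozenset("CD")},
    {frozenset("BC"), frozenset("ABC")},
]
best = mcc_index(trees)
```

Working on the log scale avoids numerical underflow when trees contain many clades with small frequencies.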
The maximum bipartition credibility tree (--mbc) is like MCC, but uses bipartitions instead of clades, and therefore ignores rooting. Two trees can share the same bipartitions but differ in root position; MBC treats them as equivalent.
HIPSTR (Highest Independent Posterior SubTree reconstruction) builds a fully resolved summary tree by choosing, at each internal node, the child-clade pair with the highest combined posterior support. A HIPSTR tree is typically not an observed input tree. (See: HIPSTR: highest independent posterior subtree reconstruction in TreeAnnotator X. Baele et al., Bioinformatics, 41(10), 2025)
- --hip: includes clades even if < 50% (yields a fully resolved tree under the HIPSTR heuristic)
- --mrhip: includes only clades with ≥ 50% support (majority rule)
You always pick exactly one of: --noblen, --biplen, --meandepth, --cadepth.
--noblen computes topology + support only. Branch lengths in the output are set to 0 (or omitted if you choose Newick without lengths).
Use this when:
- you only care about the consensus topology/support, or
- your input trees do not have meaningful lengths (e.g., pure topology samples).
With --biplen, for each branch in the summary tree, sumt identifies the corresponding leaf bipartition (split) and sets the branch length to the mean length of that bipartition across the post-burnin input trees in which it occurs.
This works for unrooted summaries (e.g., --con, --all, --mbc) and does not assume clock-like trees.
With --ci, credible intervals (and median) are computed for branch lengths.
--meandepth is intended for rooted clock-like trees (e.g., time trees).
For each clade in the summary tree, sumt sets the node depth to the mean node depth observed for that exact monophyletic clade, computed only across those input trees where the clade occurs as a monophyletic group.
Branch lengths are then derived from depths (blen = depth(parent) - depth(child)).
Notes:
- This can be based on very few observations for rare clades.
- It may produce negative branch lengths in some cases (a known issue with mean-depth approaches).
With --ci, credible intervals (and median) are computed for node depths.
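The step from depths to branch lengths, and the way a negative length can arise, can be shown with a small Python sketch. The node names and depth values are made up: clade AB is imagined to be rare, so its mean depth (6.5) comes from a few trees and happens to exceed the mean depth of its parent ABC (6.0).

```python
def branch_lengths(depth, parent):
    """depth: node -> summarized depth; parent: node -> parent node
    (the root has no entry). Returns node -> length of the branch above it."""
    return {child: depth[par] - depth[child] for child, par in parent.items()}

# Hypothetical summarized depths for a 4-tip tree
depth = {"root": 10.0, "ABC": 6.0, "AB": 6.5,
         "A": 0.0, "B": 0.0, "C": 0.0, "D": 0.0}
parent = {"ABC": "root", "D": "root", "AB": "ABC",
          "C": "ABC", "A": "AB", "B": "AB"}

blen = branch_lengths(depth, parent)
negative = [node for node, length in blen.items() if length < 0]
print(negative)
```

Checking for negative values in this way is a simple sanity test to run on any tree whose branch lengths were derived from independently summarized node depths.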
--cadepth is also intended for rooted clock-like trees.
For each clade in the summary tree, sumt computes, in every post-burnin input tree, the depth of the MRCA of that clade’s tip set, and then takes the mean of those MRCA depths across all trees.
This corresponds to TreeAnnotator’s “heights ca” approach.
With --ci, credible intervals (and median) are computed for node depths.
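The difference between mean clade depth and mean MRCA depth can be made concrete with a short Python sketch. It assumes a simplified representation in which each tree is a dict mapping every clade (frozenset of tips, including the root clade) to its depth; the function names are hypothetical and do not come from sumt or phylotreelib.

```python
from statistics import mean

def mean_clade_depth(trees, clade):
    """Mean depth over only those trees in which `clade` is monophyletic."""
    return mean(t[clade] for t in trees if clade in t)

def mean_mrca_depth(trees, clade):
    """Mean depth of the MRCA of `clade`'s tips, over all trees."""
    def mrca_depth(t):
        # The MRCA is the smallest clade in the tree containing all the tips
        smallest = min((c for c in t if clade <= c), key=len)
        return t[smallest]
    return mean(mrca_depth(t) for t in trees)

AB, BC, ABC = frozenset("AB"), frozenset("BC"), frozenset("ABC")
trees = [
    {AB: 1.0, ABC: 3.0},
    {AB: 2.0, ABC: 4.0},
    {BC: 1.5, ABC: 5.0},   # AB is not monophyletic here; its MRCA is the root
]
clade_mean = mean_clade_depth(trees, AB)   # uses the first two trees only
mrca_mean = mean_mrca_depth(trees, AB)     # uses all three trees
```

Because the MRCA-based mean uses every tree, a parent's tip set always has an MRCA at least as deep as that of any subset, which is why this approach avoids the negative branch lengths that mean clade depths can produce.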
You can choose at most one of: --rootmid, --rootminvar, --rootog, --rootogfile.
If you do not specify a rooting option:
- For --mcc, the chosen sample tree’s root is retained.
- For --con, --all, and --mbc, the output should be treated as unrooted unless you root it explicitly (e.g. with --rootmid, --rootminvar, or an outgroup).
- For --hip and --mrhip, the output is rooted by construction (HIPSTR is defined on clades / child-clade pairs, i.e. rooted structure). In practice this means the output root reflects the rooted structure present in the input trees; if your input trees are not meaningfully rooted, HIPSTR-style summaries are usually not appropriate.
--rootmid places the root at the midpoint of the tree’s diameter (the longest tip-to-tip path). It is often useful as a quick heuristic when no outgroup is available.
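Midpoint rooting can be sketched as follows. This is a generic illustration of the technique, not sumt's implementation: the tree is a plain adjacency dict with branch lengths, the names are hypothetical, and the diameter is found with the standard two-pass "farthest tip" search.

```python
def path_lengths(adj, start):
    """Distances from `start` to every node, plus predecessor links."""
    dist, prev, stack = {start: 0.0}, {start: None}, [start]
    while stack:
        node = stack.pop()
        for nbr, length in adj[node].items():
            if nbr not in dist:
                dist[nbr] = dist[node] + length
                prev[nbr] = node
                stack.append(nbr)
    return dist, prev

def midpoint(adj):
    """Return (node_a, node_b, offset): the root lies on branch a-b,
    `offset` away from node_a."""
    tips = [n for n in adj if len(adj[n]) == 1]
    # In a tree, the tip farthest from any tip is one end of the diameter
    d0, _ = path_lengths(adj, tips[0])
    end1 = max(tips, key=d0.get)
    d1, prev = path_lengths(adj, end1)
    end2 = max(tips, key=d1.get)
    half = d1[end2] / 2
    # Walk from end2 back toward end1 until the midpoint is crossed
    node = end2
    while d1[prev[node]] > half:
        node = prev[node]
    return prev[node], node, half - d1[prev[node]]

# Unrooted toy tree: ((A:1,B:2)x:1,(C:1,D:4)y)
adj = {
    "A": {"x": 1.0}, "B": {"x": 2.0},
    "x": {"A": 1.0, "B": 2.0, "y": 1.0},
    "y": {"x": 1.0, "C": 1.0, "D": 4.0},
    "C": {"y": 1.0}, "D": {"y": 4.0},
}
root_position = midpoint(adj)
```

Here the longest path runs from D to B (length 7), so the root falls on the branch between D and y, 3.5 units from D.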
--rootminvar chooses a root location that minimizes the variance in root-to-tip distances (aiming for the most “clock-like” rooting). See: Minimum variance rooting of phylogenetic trees and implications for species tree reconstruction. Mai, Sayyari & Mirarab (2017), PLOS ONE 12(8).
--rootog and --rootogfile root the summary tree using an outgroup. The root is placed at the midpoint of the branch separating outgroup and ingroup.
- --rootog takes a comma-separated list of taxa on the command line.
- --rootogfile takes a file with one taxon name per line.
If you provide multiple outgroup taxa, sumt attempts to place the root on the branch separating those taxa from the remaining tips.
--rootcred annotates branches in the output tree with how often the root was observed to fall on that branch among the input trees.
There are two cases:
- If you use an outgroup (--rootog / --rootogfile):
  - For each input tree, sumt identifies the branch where the outgroup attaches.
  - Root credibility for a branch is the fraction of input trees where the outgroup attached there.
  - This works even if the input trees are unrooted, because the outgroup attachment defines a rooting.
- If you do not use an outgroup:
  - sumt assumes the input trees are already rooted (typical for BEAST clock trees).
  - It tracks the observed root location (root bipartition) directly across trees.
Why the “cumulated root credibility” may be < 100%:
- Root credibility is only reported on branches that exist in the final summary topology. If some root locations occur on branches (bipartitions) that are not present in the summary tree, their mass is not represented.
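A small Python sketch shows how root credibility is tallied and why the total can fall short of 100%. The split labels and counts are invented for illustration, and the function is hypothetical rather than part of sumt.

```python
from collections import Counter

def root_credibility(root_splits, summary_splits):
    """Fraction of input trees whose root falls on each summary-tree branch.

    root_splits: the root bipartition observed in each input tree.
    summary_splits: the bipartitions (branches) present in the summary tree.
    """
    n = len(root_splits)
    counts = Counter(root_splits)
    return {s: counts[s] / n for s in summary_splits if counts[s] > 0}

# Ten toy input trees: the root is seen on three different bipartitions
roots = ["AB|CD"] * 6 + ["A|BCD"] * 3 + ["AC|BD"] * 1
# Branches of the summary tree; AC|BD conflicts with AB|CD and is absent
summary = ["AB|CD", "A|BCD", "B|ACD", "C|ABD", "D|ABC"]

cred = root_credibility(roots, summary)
print(cred, sum(cred.values()))
```

The one tree rooted on AC|BD has no matching branch in the summary topology, so the reported credibilities sum to 0.9 rather than 1.0.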
Example:
sumt --con --biplen --rootmid --rootcred primate-mtDNA.trees
When you request --ci, sumt estimates central credible intervals and the median for either:
- branch lengths (when using --biplen), or
- node depths (when using --meandepth or --cadepth)
Implementation note (approximate quantiles):
- Credible intervals are based on estimated quantiles (not exact order statistics).
- To compute exact quantiles you would, in principle, have to store all branch-length (or depth) values across all input trees. That can require large amounts of memory for large analyses.
- Instead, sumt uses a mergeable log-bucket histogram (see QuantileAccumulator in phylotreelib): it does one pass through the trees and only stores counts per bin, not every individual value.
- “Log-bucket” means the bins are spaced by order of magnitude: values around 0.01, 0.1, 1, 10, 100 fall into different magnitude ranges, and within each range sumt subdivides more finely. As a result, bins are narrower for small values and wider for large values (roughly constant relative precision).
- Quantiles are found by walking through bins in numeric order until the cumulative count reaches (for example) 2.5%, 50% (median), or 97.5% of the samples. The quantile is reported as the midpoint of the bin where that cutoff falls. This means the resolution is limited by the bin width.
- The precision is controlled by --cik K, which sets the within-magnitude resolution (2^K sub-bins per magnitude range):
  - higher K results in finer bins (more precision), but more memory/CPU
  - a worst-case relative midpoint error bound is about 2^-(K+1) (e.g. K=7 ≈ 0.39%, K=8 ≈ 0.20%, K=9 ≈ 0.10%)
  - example: with K=7, a reported depth of 1.0 could be off by up to ~0.004 in the worst case from binning alone (and a depth of 10.0 by up to ~0.04, because the bound is relative)
- Default is --cik 7.

- By default, sumt writes branch/node annotations (support, branch lengths, depth summaries, credible intervals, etc.) as NEXUS metacomments on the corresponding branches/nodes in the output tree file. Many tree viewers can display these annotations, and you can also see them directly in the file as bracketed comments.
- Use --nometa to suppress these metacomments if you want a “plain” NEXUS/Newick tree without embedded annotations.
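The log-bucket quantile approximation described in the implementation note can be sketched as a small Python class. This is not the QuantileAccumulator from phylotreelib: the class and method names are hypothetical, only positive values are handled, and magnitude ranges are assumed to be powers of two (which is consistent with the 2^-(K+1) error bound quoted above).

```python
import math
from collections import Counter

class LogBucketHistogram:
    """Approximate quantiles from bin counts instead of stored values."""

    def __init__(self, k=7):
        self.k = k
        self.bins = Counter()   # (exponent, sub-bin index) -> count
        self.n = 0

    def add(self, x):
        # x = m * 2**e with 0.5 <= m < 1 (positive x only in this sketch)
        m, e = math.frexp(x)
        sub = int((m - 0.5) * 2 ** (self.k + 1))   # 0 .. 2**k - 1
        self.bins[(e, sub)] += 1
        self.n += 1

    def merge(self, other):
        # Counts simply add, so partial histograms can be combined
        self.bins.update(other.bins)
        self.n += other.n

    def quantile(self, q):
        target = q * self.n
        seen = 0
        for e, sub in sorted(self.bins):            # numeric order
            seen += self.bins[(e, sub)]
            if seen >= target:
                lo = (0.5 + sub / 2 ** (self.k + 1)) * 2.0 ** e
                width = 2.0 ** e / 2 ** (self.k + 1)
                return lo + width / 2               # bin midpoint

# Two partial histograms (as two workers might build), then merged
values = [i / 100 for i in range(1, 1001)]
h1, h2 = LogBucketHistogram(), LogBucketHistogram()
for v in values[::2]:
    h1.add(v)
for v in values[1::2]:
    h2.add(v)
h1.merge(h2)
median = h1.quantile(0.5)
```

With K=7 the reported median of this toy sample lies within the relative error bound 2^-8 of the exact value 5.0, and the merged histogram stores at most one counter per occupied bin regardless of how many values were added.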
Examples:
# One CI (default precision, --cik 7):
sumt --con --biplen --ci 0.95 primate-mtDNA.trees
# Several CIs:
sumt --con --biplen --ci 0.5,0.8,0.95 primate-mtDNA.trees
# Increase precision of the quantile approximation:
sumt --con --biplen --ci 0.95 --cik 9 primate-mtDNA.trees
Below are short “pattern” examples. They are deliberately small and repetitive so you can copy/paste a working starting point.
# Input format (autodetection usually works; use only if needed)
sumt --informat nexus --con --biplen primate-mtDNA.trees
sumt --informat newick --con --biplen mytrees.newick
# Output format
sumt --outformat newick --con --biplen primate-mtDNA.trees
sumt --outformat nexus --con --biplen primate-mtDNA.trees
# Basename controls output stem
sumt --basename primates_summary --con --biplen primate-mtDNA.trees
# Suppress metacomments in NEXUS output
sumt --con --biplen --nometa primate-mtDNA.trees
# Overwrite without prompting; quiet mode implies -n
sumt -n --con --biplen primate-mtDNA.trees
sumt -q --con --biplen primate-mtDNA.trees
# Verbose tracebacks (useful for debugging)
sumt -v --con --biplen primate-mtDNA.trees

sumt --con --biplen primate-mtDNA.trees
sumt --all --biplen primate-mtDNA.trees
sumt --mcc --meandepth primate-mtDNA.trees
sumt --mbc --biplen primate-mtDNA.trees
sumt --hip --biplen primate-mtDNA.trees
sumt --mrhip --biplen primate-mtDNA.trees

# Topology + support only
sumt --con --noblen primate-mtDNA.trees
# Mean bipartition lengths
sumt --con --biplen primate-mtDNA.trees
# Mean clade depths (clock-like rooted trees)
sumt --mcc --meandepth primate-mtDNA.trees
# Common-ancestor depths (TreeAnnotator-style --height ca)
sumt --mcc --cadepth primate-mtDNA.trees

# Midpoint / min-variance
sumt --con --biplen --rootmid primate-mtDNA.trees
sumt --con --biplen --rootminvar primate-mtDNA.trees
# Outgroup on command line (comma-separated list; not biologically meaningful...)
sumt --con --biplen --rootog Macaca_fuscata,M._mulatta,M._fascicularis,M._sylvanus primate-mtDNA.trees
# Outgroup from file (one taxon per line)
printf "Macaca_fuscata\nM._mulatta\nM._fascicularis\nM._sylvanus\n" > outgroup.txt
sumt --con --biplen --rootogfile outgroup.txt primate-mtDNA.trees
# Root credibility on the output tree
sumt --con --biplen --rootmid --rootcred primate-mtDNA.trees

# Burn-in (one value for all files)
sumt --con --biplen -b 0.25 primate-mtDNA.trees
# Burn-in (one value per file, comma-separated; could be different per file)
sumt --con --biplen -b 0.25,0.25 mrbayes.1.t mrbayes.2.t
# Tree probabilities and credible set of topologies
sumt --con --biplen -t 0.95 primate-mtDNA.trees
# ASDSF across files (+ minimum frequency threshold)
sumt --con --biplen -s -f 0.1 mrbayes.1.t mrbayes.2.t
# Credible intervals with higher quantile-precision
sumt --con --biplen --ci 0.8,0.95 --cik 9 primate-mtDNA.trees
Notes on -t PROB (.trprobs output):
- When you run with -t (e.g. -t 0.95), sumt writes a file named <basename>.trprobs.
- The file is NEXUS format (#NEXUS … begin trees; … end;) containing the most probable topologies up to the requested cumulative probability (a “credible set”).
- Each tree entry is annotated with p (posterior probability of that topology) and P (cumulative probability so far).
- This is most useful for small numbers of taxa. With ~15–20+ taxa, almost every sampled tree is typically unique, so the “credible set” becomes long and less informative even though split/clade supports remain useful.
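How a credible set of topologies is assembled can be sketched in Python. The function below is an illustration under simple assumptions (topologies are represented by hashable labels, and the set stops at the first topology that brings the cumulative probability to the requested level); it does not reproduce sumt's .trprobs writer.

```python
from collections import Counter

def credible_set(topologies, prob=0.95):
    """Return [(topology, p, P)] for the most frequent topologies, stopping
    once the cumulative probability P reaches `prob`."""
    n = len(topologies)
    result, cum_count = [], 0
    for topo, count in Counter(topologies).most_common():
        cum_count += count          # integer counts avoid rounding drift
        result.append((topo, count / n, cum_count / n))
        if cum_count / n >= prob:
            break
    return result

# 100 toy samples over four topologies
sampled = ["T1"] * 60 + ["T2"] * 30 + ["T3"] * 6 + ["T4"] * 4
for topo, p, P in credible_set(sampled, 0.95):
    print(topo, p, P)
```

In this toy sample three topologies suffice to reach 95%; when nearly every sampled tree is unique, each p is close to 1/n and the list grows to almost the full sample.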
sumt can process large tree files faster by splitting work across multiple processes (--cpus).
Internally, each worker processes trees in chunks (--chunksize, default: 250 trees per chunk).
A larger chunk size reduces scheduling/serialization overhead, but can increase peak memory usage and sometimes worsen load balancing across CPUs.
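The chunked processing pattern can be sketched generically in Python: trees are cut into fixed-size chunks, each chunk is summarized independently, and the partial results are merged. The function names and the split-counting worker are hypothetical stand-ins, not sumt internals, and the sketch uses the built-in map so that it runs in a single process.

```python
from collections import Counter
from itertools import islice

def chunks(iterable, size):
    """Yield successive lists of up to `size` items."""
    it = iter(iterable)
    while chunk := list(islice(it, size)):
        yield chunk

def count_splits(chunk):
    """Worker: count splits in one chunk of trees (each tree = set of splits)."""
    return Counter(split for tree in chunk for split in tree)

def summarize(trees, chunksize=250, map_fn=map):
    """Map the worker over chunks and merge the partial counts."""
    total = Counter()
    for partial in map_fn(count_splits, chunks(trees, chunksize)):
        total.update(partial)   # merging is just adding counts
    return total

# 1000 toy trees, each reduced to a single split label
trees = [{"AB|CD"}] * 600 + [{"AC|BD"}] * 400
totals = summarize(trees, chunksize=250)
```

To run the same pattern across processes, a parallel map such as the map method of concurrent.futures.ProcessPoolExecutor could be passed as map_fn (inside an if __name__ == "__main__" guard). Each chunk is then pickled and sent to a worker, which is why larger chunks lower overhead but raise per-worker memory use.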
# Parallel processing
sumt --con --biplen --cpus 8 -b 0.2 mrbayes.1.t mrbayes.2.t
# Force single-process
sumt --con --biplen --cpus 1 -b 0.2 mrbayes.1.t mrbayes.2.t
# Chunk size: larger reduces overhead, may increase memory usage
sumt --con --biplen --cpus 8 --chunksize 500 -b 0.2 mrbayes.1.t mrbayes.2.t
Version 4.0.0 is a major release because it includes intentional breaking CLI changes:
- Removed -i / -w / --autow: input files are now positional, and file weights are no longer supported (each tree counts equally).
- Added credible intervals via --ci.
- Added multiprocessing via --cpus and --chunksize.
- Simplified output formats: --outformat is now newick or nexus; use --nometa to suppress metacomments.
Old (3.x):
sumt ... -i file1.t -i file2.t
# or
sumt ... -w 0.5 file1.t -w 1.0 file2.t
New (4.x):
sumt ... file1.t file2.t
Notes:
- Weighting was removed.
- Filenames are now positional arguments (-i FILENAME is no longer required).
Old (3.x) accepted space-separated lists:
sumt ... -b 0.25 0.4 -i file1.t -i file2.t
New (4.x) uses one value or comma-separated values:
sumt ... -b 0.25,0.4 file1.t file2.t

sumt ... --ci 0.8,0.95

sumt ... --cpus 8 --chunksize 500

If you use sumt in academic work, the simplest option is to cite the GitHub repository (GitHub “Cite this repository” in the sidebar).
- Some combinations (especially --meandepth / --cadepth) assume clock-like, rooted trees.
- Large tree files can be processed efficiently, but for best performance you may want to tune --chunksize and --cpus.