Commit bd56acc

hyanwong authored and jeromekelleher committed
Clarify "dead" leaves
Fixes #339

Update python/tskit/trees.py

Co-authored-by: Jerome Kelleher <[email protected]>
1 parent f13999a commit bd56acc

2 files changed: +63 -10 lines changed

docs/data-model.md

Lines changed: 35 additions & 2 deletions
@@ -997,7 +997,7 @@ position 20 to 40, has a single root. Finally the third tree, from position
 
 #### The virtual root
 
-To access all the roots in a tree, tskit uses a special additional node
+To access all the :attr:`~Tree.roots` in a tree, tskit uses a special additional node
 called the **virtual root**. This is primarily a bookkeeping device, and
 can normally be ignored: it is not plotted in any visualizations and
 does not exist as an independent node in the node table.
@@ -1097,6 +1097,39 @@ print(
 In `tskit`, isolated sample nodes are closely associated with the encoding of
 {ref}`sec_data_model_missing_data`.
 
+
+(sec_data_model_tree_dead_leaves_and_branches)=
+
+### Dead leaves and branches
+
+In a `tskit` tree, a *leaf node* is defined as a node without any children. The
+implications of this turn out to be slightly unintuitive, and so are worth briefly
+documenting here. Firstly, the same node can be a leaf in one tree, and not a leaf
+in the next tree along the tree sequence. Secondly, all isolated nodes must be leaves
+(as by definition they have no children). Thirdly, sample nodes need not be leaves
+(they could be "internal samples"); likewise leaf nodes need not be samples.
+
+Node 7 in the example above provides a good case study. Note that it is a root node with
+at least one child (i.e. not a leaf) in trees 0 and 2; in contrast in tree 1 it is
+isolated. Strictly, because it is isolated in tree 1, it is also a leaf node there,
+although it is not attached to a root, not a sample, and is therefore not plotted. In
+this case, in that tree we can think of node 7 as a "dead leaf" (and we don't normally
+plot dead leaves). In fact, in a large tree sequence of many trees, most ancestral nodes
+will be isolated in any given tree, and therefore most nodes in such a tree will be of
+this sort. However, these dead leaves are excluded from most calculations on trees,
+because algorithms usually traverse the tree by starting at a root and working down,
+or by starting at a sample and working up. Hence when we refer to the leaves of a tree,
+it is usually shorthand for the leaves **on** the tree (that is, attached via branches
+to one of the tree roots). Dead leaves are excluded from this definition.
+
+Note that it is also possible to have trees in which there are "dead branches": that is
+sections of topology which are not accessible from a root, and whose tips are all
+dead leaves. Although valid, this is a relatively unusual state of affairs, and such
+branches are not plotted by the standard {ref}`sec_tskit_viz` methods. The
+{meth}`Tree.nodes` method will not, by default, traverse through dead branches, although
+it can be made to do so by specifying the ID of a dead node as the root for traversal.
+
+
 (sec_data_model_genetic_data)=
 
 ## Encoding genetic variation
@@ -1153,7 +1186,7 @@ outputs the actual allelic state for each sample, defaults to outputting an `N`
 these sites. Therefore where any sample node is isolated, the haplotype will show
 an `N`, indicating the DNA sequence is unknown. This will be so not only in the
 middle of all of the sample genomes, but also at the right hand end of the genome of
-sample 2, as it is the only isolated sample node in the rightmost tree:
+sample 2, as it is an isolated sample node in the rightmost tree:
 
 ```{code-cell} ipython3
 for i, h in enumerate(missing_ts.haplotypes()):

python/tskit/trees.py

Lines changed: 28 additions & 8 deletions
@@ -1002,10 +1002,11 @@ def total_branch_length(self):
 
 As this is defined by a traversal of the tree, technically we
 return the sum of all branch lengths that are reachable from
-roots. Thus, this is the sum of all branches that are ancestral
+roots. Thus, this is the total length of all branches that are connected
 to at least one sample. This distinction is only important
 in tree sequences that contain 'dead branches', i.e., those
-that define topology not ancestral to any samples.
+that define topology that is not connected to a tree root
+(see :ref:`sec_data_model_tree_dead_leaves_and_branches`)
 
 :return: The sum of lengths of branches in this tree.
 :rtype: float
@@ -1374,6 +1375,16 @@ def is_leaf(self, u):
 Returns True if the specified node is a leaf. A node :math:`u` is a
 leaf if it has zero children.
 
+.. note::
+:math:`u` can be any node in the entire tree sequence, including ones
+which are not connected via branches to a root node of the tree (and which
+are therefore not conventionally considered part of the tree). Indeed, if
+there are many trees in the tree sequence, it is common for the majority of
+non-sample nodes to be :meth:`isolated<is_isolated>` in any one
+tree. By the definition above, this method will return ``True`` for such
+a tree when a node of this sort is specified. Such nodes can be thought of
+as "dead leaves", see :ref:`sec_data_model_tree_dead_leaves_and_branches`.
+
 :param int u: The node of interest.
 :return: True if u is a leaf node.
 :rtype: bool
@@ -1383,7 +1394,8 @@ def is_leaf(self, u):
 def is_isolated(self, u):
 """
 Returns True if the specified node is isolated in this tree: that is
-it has no parents and no children. Sample nodes that are isolated
+it has no parents and no children (note that all isolated nodes in the tree
+are therefore also :meth:`leaves<Tree.is_leaf>`). Sample nodes that are isolated
 and have no mutations above them are used to represent
 :ref:`missing data<sec_data_model_missing_data>`.
 
@@ -2004,9 +2016,17 @@ def get_leaves(self, u):
 
 def leaves(self, u=None):
 """
-Returns an iterator over all the leaves in this tree that are
-underneath the specified node. If u is not specified, return all leaves
-in the tree.
+Returns an iterator over all the leaves in this tree that descend from
+the specified node. If :math:`u` is not specified, return all leaves on
+the tree (i.e. all leaves reachable from the tree root(s), see note below).
+
+.. note::
+:math:`u` can be any node in the entire tree sequence, including ones
+which are not connected via branches to a root node of the tree. If
+called on such a node, the iterator will return "dead" leaves
+(see :ref:`sec_data_model_tree_dead_leaves_and_branches`) which cannot
+be reached from a root of this tree. However, dead leaves will never be
+returned if :math:`u` is left unspecified.
 
 :param int u: The node of interest.
 :return: An iterator over all leaves in the subtree rooted at u.
@@ -2309,15 +2329,15 @@ def push(nodes):
 
 def nodes(self, root=None, order="preorder"):
 """
-Returns an iterator over the node IDs reachable from the root(s) in this
+Returns an iterator over the node IDs reachable from the specified node in this
 tree in the specified traversal order.
 
 .. note::
 Unlike the :meth:`TreeSequence.nodes` method, this iterator produces
 integer node IDs, not :class:`Node` objects.
 
 If the ``root`` parameter is not provided or ``None``, iterate over all
-nodes reachable from the roots (see :meth:`Tree.roots` for details
+nodes reachable from the roots (see :attr:`Tree.roots` for details
 on which nodes are considered roots). If the ``root`` parameter
 is provided, only the nodes in the subtree rooted at this node
 (including the specified node) will be iterated over. If the
