diff --git a/docs/ibd.md b/docs/ibd.md index b52e55e261..d52b9d345c 100644 --- a/docs/ibd.md +++ b/docs/ibd.md @@ -20,16 +20,12 @@ kernelspec: # Identity by descent The {meth}`.TreeSequence.ibd_segments` method allows us to compute -segments of identity by descent. +segments of identity by descent along a tree sequence. :::{note} This documentation page is preliminary ::: -:::{todo} -Relate the concept of identity by descent to the MRCA spans in the tree sequence. -::: - ## Examples Let's take a simple tree sequence to illustrate the {meth}`.TreeSequence.ibd_segments` @@ -77,7 +73,26 @@ coordinate ``[left, right)`` and ancestral node ``a`` iff the most recent common ancestor of the segment ``[left, right)`` in nodes ``u`` and ``v`` is ``a``, and the segment has been inherited along the same genealogical path (ie. it has not been broken by recombination). The -segments returned are the longest possible ones. +definition of a "genealogical path" used here is +the sequence of edges, rather than nodes. +So, for instance, if ``u`` inherits a segment ``[x, z)`` from ``a``, +but that inheritance is represented by two edges, +one spanning ``[x, y)`` and the other spanning ``[y, z)``, +then this represents two genealogical paths, +and any IBD segments would be split at ``y``. +In other words, the method assumes that the end +of an edge represents a recombination, +an assumption that may not reflect how the tree sequence +is used -- see below for more discussion. + +This definition is purely genealogical: it depends only on the tree +sequence topology and node times, and does not inspect allelic +states or mutations. In particular, if we compute the MRCA of ``(u, v)`` +in each tree along the sequence, then (up to the additional refinement +by genealogical path) the IBD segments are those +that share the same ancestor and paths to that +ancestor. Intervals in which ``u`` and ``v`` lie in different roots +have no MRCA and therefore do not contribute IBD segments. Consider the IBD segments that we get from our example tree sequence: @@ -109,7 +124,11 @@ By default this class only stores the high-level summaries of the IBD segments discovered. As we can see in this example, we have a total of six segments and the total span (i.e., the sum lengths of the genomic intervals spanned -by IBD segments) is 30. +by IBD segments) is 30. In this default mode the object does not +store information about individual sample pairs, and methods that +inspect per-pair information (such as indexing with ``[(a, b)]`` or +iterating over the mapping) will raise an +``IdentityPairsNotStoredError``. If required, we can get more detailed information about particular segment pairs and the actual segments using the ``store_pairs`` @@ -148,8 +167,12 @@ segments = ts.ibd_segments(store_pairs=True, store_segments=True) segments[(0, 1)] ``` -The {class}`.IdentitySegmentList` behaves like a Python list, -where each element is an instance of {class}`.IdentitySegment`. +When ``store_segments=True``, the {class}`.IdentitySegmentList` behaves +like a Python list, where each element is an instance of +{class}`.IdentitySegment`. When only ``store_pairs=True`` is specified, +the number of segments and their total span are still available, but +attempting to iterate over the list or access the per-segment arrays +will raise an ``IdentitySegmentsNotStoredError``. :::{warning} The order of segments in an {class}`.IdentitySegmentList` @@ -168,19 +191,24 @@ By default we get the IBD segments between all pairs of {ref}`sample` nodes. #### IBD within a sample set + We can reduce this to pairs within a specific set using the ``within`` argument: - -```{eval-rst} -.. todo:: More detail and better examples here. -``` - ```{code-cell} segments = ts.ibd_segments(within=[0, 2], store_pairs=True) print(list(segments.keys())) ``` +Here we have restricted attention to the samples with node IDs 0 and 2, +so only the pair ``(0, 2)`` appears in the result. In general: + +- ``within`` should be a one-dimensional array-like of node IDs + (typically sample nodes). All unordered pairs from this set are + considered. +- If ``within`` is omitted (the default), all nodes flagged as samples + in the node table are used. + #### IBD between sample sets We can also compute IBD **between** sample sets: @@ -190,6 +218,17 @@ segments = ts.ibd_segments(between=[[0,1], [2]], store_pairs=True) print(list(segments.keys())) ``` +In this example we have two sample sets, ``[0, 1]`` and ``[2]``, so the +identity segments are computed only for pairs in which one sample lies +in the first set and the other lies in the second. More generally: + +- ``between`` should be a list of non-overlapping lists of node IDs. +- All pairs ``(u, v)`` are considered such that ``u`` and ``v`` belong + to different sample sets. + +The ``within`` and ``between`` arguments are mutually exclusive: passing +both at the same time raises a :class:`ValueError`. + :::{seealso} See the {meth}`.TreeSequence.ibd_segments` documentation for more details. @@ -200,6 +239,138 @@ more details. The ``max_time`` and ``min_span`` arguments allow us to constrain the segments that we consider. -```{eval-rst} -.. todo:: Add examples for these arguments. +The ``max_time`` argument specifies an upper bound on the time of the +common ancestor node: only IBD segments whose MRCA node has a time +no greater than ``max_time`` are returned. + +The ``min_span`` argument filters by genomic length: only segments with +span strictly greater than ``min_span`` are included. + +For example, working with ``ts2`` as the following tree sequence: + +```{code-cell} +:tags: [hide-input] + +import io + +nodes = io.StringIO( + """\ + id is_sample time + 0 1 0 + 1 1 0 + 2 0 1 + 3 0 3 + """ +) +edges = io.StringIO( + """\ + left right parent child + 0 4 2 0,1 + 4 10 3 0,1 + """ +) +ts2 = tskit.load_text(nodes=nodes, edges=edges, strict=False) +SVG(ts2.draw_svg()) +``` + +There are two segments: +```{code-cell} +segments = ts2.ibd_segments(store_segments=True) +print("all segments:", list(segments.values())[0]) +``` +... but only the left-hand one is more recent than 2 time units ago: +```{code-cell} +segments_recent = ts2.ibd_segments(max_time=2, store_segments=True) +print("max_time=1.2:", list(segments_recent.values())[0]) +``` +... and only the right-hand one is longer than 5 units. +```{code-cell} + +segments_long = ts2.ibd_segments(min_span=5, store_segments=True) +print("min_span=0.5:", list(segments_long.values())[0]) +``` + +So: the full result contains two IBD segments for the single sample +pair, one inherited via ancestor 2 over ``[0, 4)`` and one via +ancestor 3 over ``[4, 10)``. The ``max_time`` constraint removes the +segment inherited from the older ancestor (time 3), while the +``min_span`` constraint keeps only the longer of the two segments. + +### More on the "pathwise" definition of IBD segments + +We said above that the definition of IBD used by +{meth}`.TreeSequence.ibd_segments` says that a given segment +must be inherited from the MRCA along a single genealogical path, +and that "genealogical paths" are defined *edgewise*. +This can lead to surprising consequences. + +Returning to our example above: +```{code-cell} +:tags: [hide-input] + +SVG(ts.draw_svg()) +``` +there are two IBD segments between ``1`` and ``2``: +```{code-cell} +segments = ts.ibd_segments(within=[1, 2], store_pairs=True) +for pair, value in segments.items(): + print(pair, "::", value) +``` +This might be surprising, because the MRCA of ``1`` and ``2`` +is node ``4`` over the entire sequence. +In fact, some definitions of IBD segments +would have this as a single segment, +because the MRCA does not change, +even if there are distinct genealogical paths. + +The reason this is split into two segments +is because the path from ``4`` to ``2`` changes: +on the left-hand segment ``[0, 2)``, the node ``2`` +inherits from node ``4`` +via node ``3``, while on the right-hand segment ``[2, 10)`` +it inherits from node ``4`` directly. +The tree sequence doesn't say directly whether node ``2`` +also inherits from node ``3`` on the right-hand segment, +so whether or not this should be one IBD segment or two +depends on our interpretation +of what's stored in the tree sequence. +As discussed in +[Fritze et al](https://doi.org/10.1093/genetics/iyaf198), +most tree sequence simulators (at time of writing) +will produce this tree sequence even if node ``2`` +does in fact inherit from ``3`` over the entire sequence. +Using {meth}`.TreeSequence.extend_haplotypes` will +"put the unary nodes back": +```{code-cell} +ets = ts.extend_haplotypes() +SVG(ets.draw_svg()) +``` +and once this is done, there is only a single IBD segment: +```{code-cell} +segments = ets.ibd_segments(within=[1, 2], store_pairs=True) +for pair, value in segments.items(): + print(pair, "::", value) ``` +So, extending haplotypes may produce IBD segments +more in line with theory, if the desired definition if IBD +is the "pathwise" definition. +However, this will also probably introduce erroneous +portions of IBD segments, +so caution is needed. +Another approach would be to merge adjacent segments of IBD +that have the same MRCA. + +Summarizing this section -- +there is a confusing array of possible definitions +of what it means to be "an IBD segment"; +and these may be extracted from a tree sequence +in subtly different ways. +How much of a problem is this? +The answer depends on the precise situation, +but it seems likely that in practice, +differences due to definition are small +relative to errors due to tree sequence inference. +Indeed, empirical haplotype-matching methods +for identifying IBD segments can differ substantially +depending on the values of various hyperparameters. +More work is needed to develop a complete picture. diff --git a/python/tskit/tables.py b/python/tskit/tables.py index d61f864593..ecca366977 100644 --- a/python/tskit/tables.py +++ b/python/tskit/tables.py @@ -2737,8 +2737,8 @@ def assert_equals(self, other, *, ignore_timestamps=False): @dataclasses.dataclass(eq=True, order=True) class IdentitySegment: """ - A single segment of identity spanning a genomic interval for a - a specific ancestor node. + A single segment of identity by descent spanning a genomic interval + for a specific ancestor node. """ left: float @@ -2758,7 +2758,7 @@ def span(self) -> float: class IdentitySegmentList(collections.abc.Iterable, collections.abc.Sized): """ - A summary of identity segments for some pair of samples in a + A summary of identity-by-descent segments for some pair of samples in a :class:`.IdentitySegments` result. If the ``store_segments`` argument has been specified to :meth:`.TreeSequence.ibd_segments`, this class can be treated as a sequence of :class:`.IdentitySegment` objects. @@ -2769,7 +2769,9 @@ class IdentitySegmentList(collections.abc.Iterable, collections.abc.Sized): If ``store_segments`` is False, only the overall summary values such as :attr:`.IdentitySegmentList.total_span` and ``len()`` are - available. + available. Attempting to iterate over the list or access per-segment + arrays (``left``, ``right``, or ``node``) in this case will raise an + ``IdentitySegmentsNotStoredError``. .. warning:: The order of segments within an IdentitySegmentList is arbitrary and may change in the future @@ -2833,7 +2835,7 @@ def node(self): class IdentitySegments(collections.abc.Mapping): """ A class summarising and optionally storing the segments of identity - by state returned by :meth:`.TreeSequence.ibd_segments`. See the + by descent returned by :meth:`.TreeSequence.ibd_segments`. See the :ref:`sec_identity` for more information and examples. Along with the documented methods and attributes, the class supports @@ -2845,9 +2847,10 @@ class IdentitySegments(collections.abc.Mapping): for a given instance of this class are determined by the ``store_pairs`` and ``store_segments`` arguments provided to :meth:`.TreeSequence.ibd_segments`. For example, attempting - to access per-sample pair information if ``store_pairs`` - is False will result in a (hopefully informative) error being - raised. + to access per-sample pair information (such as indexing with + ``[(a, b)]``, iterating over the mapping, or accessing + :attr:`.IdentitySegments.pairs`) if ``store_pairs`` is False will + result in an ``IdentityPairsNotStoredError`` being raised. .. warning:: This class should not be instantiated directly. """ diff --git a/python/tskit/trees.py b/python/tskit/trees.py index 7775231dde..7f342f3ba3 100644 --- a/python/tskit/trees.py +++ b/python/tskit/trees.py @@ -10593,7 +10593,7 @@ def ibd_segments( represented by the tree sequence. :param list within: A list of node IDs defining set of nodes that - we finding IBD segments for. If not specified, this defaults to + we find IBD segments for. If not specified, this defaults to all samples in the tree sequence. :param list[list] between: A list of lists of sample node IDs. Given two sample sets A and B, only IBD segments will be returned such @@ -10608,7 +10608,7 @@ def ibd_segments( segment) is greater than this value will be included. (Default=0) :param bool store_pairs: If True store information separately for each pair of samples ``(a, b)`` that are found to be IBD. Otherwise - store summary information about all sample apirs. (Default=False) + store summary information about all sample pairs. (Default=False) :param bool store_segments: If True store each IBD segment ``(left, right, c)`` and associate it with the corresponding sample pair ``(a, b)``. If True, implies ``store_pairs``.