diff --git a/docs/stats.md b/docs/stats.md index 87a2ed5f83..700a0268e2 100644 --- a/docs/stats.md +++ b/docs/stats.md @@ -684,14 +684,319 @@ and boolean expressions (e.g., {math}`(x > 0)`) are interpreted as 0/1. and {math}`v_j` is the covariance of the trait with the j-th covariate. +(sec_stats_multi_site)= + ## Multi site statistics -:::{todo} -Document statistics that use information about correlation between sites, such as -LdCalculator (and perhaps reference {ref}`sec_identity`). Note that if we have a general -framework which has the same calling conventions as the single site stats, -we can rework the sections above. -::: +(sec_stats_two_locus)= + +### Two-locus statistics + +The {meth}`~TreeSequence.ld_matrix` method provides an interface to +a collection of two-locus statistics with predefined summary functions (see +{ref}`sec_stats_two_locus_summary_functions`) and `site` and `branch` +{ref}`modes `. The LD matrix method differs from other +statistics methods in that it provides a unified API with an argument to +specify different two-locus summaries of the data. It otherwise behaves +similarly to most other functions with respect to `sample_sets` and `indexes`. + +Two-locus statistics can be computed using two modes, either `site` or +`branch`, and these should be interpreted in the same way as these modes in the +single-site statistics. Within this framework, statistics may be either +polarised or unpolarised. For statistics that are polarised, we compute +statistic values for pairs of derived alleles. Unpolarised statistics compute +statistics over all possible alleles, derived and ancestral, and the result in +averaged over statistics computed for all pairs of alleles (see weighting +schemes below). The option for polarisation is not exposed to the user. +Instead, we implement statistics that are polarised where appropriate. + +(sec_stats_two_locus_site)= + +#### Site + +The `site` mode computes two-locus statistics summarized over alleles between +all pairs of specified sites. The default behavior, leaving `sites` +unspecified, will compute a matrix for all pairs of sites, with +one row and column for each site in the tree sequence (i.e., an n x n +matrix where n is the number of sites in the tree sequence). We can also +restrict the output to a subset of sites, either by specifying a single vector +for both rows and columns or a pair of vectors for the row sites and column +sites separately. + +The following computes a matrix of the {math}`r^2` measure of linkage +disequilibrium (LD) computed pairwise between the first 4 sites in the tree +sequence among all samples. In our computations, row sites are used as the row +("left-hand") loci and column sites are used as the column ("right-hand") +locus, and with a single list of sites specified, we obtain a symmetric square +matrix. + +```{code-cell} ipython3 +ld = ts.ld_matrix(sites=[[0, 1, 2, 3]]) +print(ld) +``` + +The following demonstrates how we can specify the row and column sites +independently of each other. We're specifying 3 columns and 2 rows, which +computes a subset of the matrix shown above. + +```{code-cell} ipython3 +ld = ts.ld_matrix(sites=[[0, 1], [1, 2, 3]]) +print(ld) +``` + +Because we allow for two-locus statistics to be computed for multi-allelic +data, we need to be able to combine statistical results from each pair of +alleles into one summary for a pair of sites. We use two implementations for +combining results from multiple alleles: `hap_weighted` and `total_weighted`. +These are statistic-specific and not chosen by the user. + +Briefly, consider a pair of sites with {math}`n` alleles at the first locus and +{math}`m` alleles at the second. Write {math}`f_{i,j}` as the statistic +computed for focal alleles {math}`A_i` and {math}`B_j`, with haplotype weights +{math}`(A_i B_j, A_i b_j, a_i B_j)`, where {math}`a_i` and {math}`b_j` are the +collection of alleles that are not the focal alleles {math}`A_i` or +{math}`B_j`, respectively. Then the weighting schemes are defined as: + +- `hap_weighted`: {math}`\sum_{i=1}^{n}\sum_{j=1}^{m}p(A_{i}B_{j})f_{ij}`, + where {math}`p(A_{i}B_{j})` is the frequency of haplotype {math}`A_{i}B_{j}`. + This method was first introduced in [Karlin + (1981)](https://doi.org/10.1111/j.1469-1809.1981.tb00308.x) and reviewed in + [Zhao (2007)](https://doi.org/10.1017/S0016672307008634). + +- `total_weighted`: {math}`\frac{1}{n m}\sum_{i=1}^{n}\sum_{j=1}^{m}f_{ij}`. + This method assigns equal weight to each of the possible pairs of focal + alleles at the two sites, taking the arithmetic mean of statistics over + focal haplotypes. + +Out of all of the available summary functions, only {math}`r^2` uses +`hap_weighted` normalisation, with the remainder using uniform weighting +(`total_weighted`). + +(sec_stats_two_locus_branch)= + +#### Branch + +The `branch` mode computes expected two-locus statistics between pairs of +trees, conditioned on the marginal topologies and branch lengths of those +trees. The trees for which we compute statistics are specified by positions, +and for a pair of positions we consider all possible haplotypes that could be +generated by a single mutation occurring at the two trees. + +For two trees, one with {math}`n` branches and the other with {math}`m` +branches, there are {math}`nm` possible pairs of branches that may carry the +pair of mutations. For each pair, we compute the two-locus statistic, and then +sum these values weighted by the product of the two branch lengths. Given the +two mutations occur, this accounts for the relative probability that the two +mutations fall on any pair of branches. + +The time complexity of this method is quadratic in the number of samples, +due to the pairwise comparisons of branches from each pair of trees. +By default, this method computes +a symmetric matrix for all pairs of trees, with rows and columns representing +each tree in the tree sequence. Similar to the site method, we can restrict the +output to a subset of trees, either by specifying a vector of positions or +a pair of vectors for row and column positions separately. To select a specific +tree, the specified positions must land in the tree span (`[start, end)`). + +In the following, we compute a matrix of expected {math}`r^2` within and +between the first 4 trees in the tree sequence. The tree breakpoints are +a convenient way to specify those first four trees. + +```{code-cell} ipython3 +ld = ts.ld_matrix(mode="branch", positions=[ts.breakpoints(as_array=True)[0:4]]) +print(ld) +``` + +Again, we can specify the row and column trees separately. + +```{code-cell} ipython3 +breakpoints = ts.breakpoints(as_array=True) +ld = ts.ld_matrix(mode="branch", positions=[breakpoints[[0]], breakpoints[0:4]]) +print(ld) +``` + +(sec_stats_two_locus_sample_sets)= + +#### Sample Sets + +The two-locus API provides a mechanism by which to subset the samples under +consideration, providing the ability to compute a separate LD matrix for each +sample set or an LD matrix between sample sets. Output dimensions are handled +in the same manner as the rest of the stats API (see +{ref}`sec_stats_output_dimensions`). The two-locus API differs from the rest of +the stats API in that we handle one-way and two-way statistics in the same +function call. + +To compute a two-way two-locus statistic, the `index` argument must be +provided. The statistics are selected in the same way (with the `stat` +argument), but we provide a restricted set of two-way statistics (see +{ref}`sec_stats_two_locus_summary_functions_two_way`). The dimension-dropping +rules for the result follow the rest of the tskit stats API in that a single list +or tuple will produce a single two-dimensional matrix, while list of these +will produce a three-dimensional array, with the outer dimension of length +equal to the length of the list. + +For concreteness, we would expect the following dimensions with the specified +`sample_sets` and `indexes` arguments. + +``` +# one-way +ts.ld_matrix(sample_sets=None) -> 2 dimensions +ts.ld_matrix(sample_sets=[0, 1, 2, 3]) -> 2 dimensions +ts.ld_matrix(sample_sets=[[0, 1, 2, 3]]) -> 3 dimensions +# two-way +ts.ld_matrix(sample_sets=[[0, 1, 2, 3], [4, 5, 6, 7]], indexes=(0, 1)) -> 2 dimensions +ts.ld_matrix(sample_sets=[[0, 1, 2, 3], [4, 5, 6, 7]], indexes=[(0, 1)]) -> 3 dimensions +``` + +By computing the LD statistic for a subset of samples in the tree sequence, it +is possible that statistics are computed over pairs of sites that are not +variable within the sample set. This can result in zero- or nan-valued statistics +as outputs. To avoid this, the tree sequence may first be simplified to the +sample subset, which will remove sites that are invariant within that subset: + +```{code-cell} ipython3 +ts = msprime.sim_ancestry( + 20, + population_size=10000, + sequence_length=1000, + recombination_rate=2e-8, + random_seed=12) +ts = msprime.sim_mutations(ts, rate=2e-8, random_seed=12) + +ld_all = ts.ld_matrix(mode="site") +print(ld_all) +``` + +Considering a subset of samples: + +```{code-cell} ipython3 +ld_sub = ts.ld_matrix(mode="site", sample_sets=range(8)) +print(ld_sub) +``` + +Simplifying and computing the LD matrix avoids computing {math}`r^2` for +invariant sites in the subsample: + +```{code-cell} ipython3 +ts_simp = ts.simplify(samples=range(8), filter_sites=True) +ld_sub = ts_simp.ld_matrix(mode="site", sample_sets=range(8)) +print(ld_sub) +``` + +Any remaining sites that produce {math}`r^2` with a nan values are those with +a mutation that occurs above the root of the tree, so that all samples in the +subset carry the derived allele (so that its allele frequency is 1). Also note +that under certain mutation models, back mutations may create sites that are +invariant in a sample, and this will also result in nan values observed in the +LD matrix. + +(sec_stats_two_locus_sample_one_way_stats)= + +##### One-way Statistics + +One-way statistics are summaries of two loci in a single sample set, using a +triple of haplotype counts {math}`\{w_{Ab}, w_{aB}, w_{AB}\}` and the size of +the sample set {math}`n`, where the capitalized and lowercase letters in our notation +represent alternate alleles. + +(sec_stats_two_locus_sample_two_way_stats)= + +##### Two-way Statistics + +Two-way statistics are summaries of haplotype counts between two sample sets, +which operate on the three haplotype counts (as in one-way stats, above) +computed from each sample set, indexed by `(i, j)`. These statistics take on +a different meaning from their one-way counterparts. For instance {math}`D_i +D_j` can be thought of as the covariance between two loci when computed for +a haplotype in a single sample set. + +Only a subset of our summary functions are two-way statistics (see +{ref}`sec_two_locus_summary_functions_two_way`). Note that the unbiased two-way +statistics expect non-overlapping sample sets, and we do not make any +assertions about the sample sets and assume that `i` and `j` represent disjoint +sets of samples (see also the note in {meth}`~TreeSequence.divergence`). + +(sec_stats_two_locus_summary_functions)= + +#### Summary Functions + +(sec_stats_two_locus_summary_functions_one_way)= + +##### One-way + +The two-locus summary functions all take haplotype counts and sample set size as +input. Each of our summary functions has the signature +{math}`f(w_{Ab}, w_{aB}, w_{AB}, n)`, converting to haplotype frequencies +{math}`\{p_{Ab}, p_{aB}, p_{AB}\}` where appropriate by dividing by {math}`n`. + +`D` +: {math}`f(w_{Ab}, w_{aB}, w_{AB}, n) = p_{ab} - p_{a}p_{b}` + + This statistic is polarised, as the unpolarised result, which averages over + allele labelings, is zero. Uses the `total` weighting method. + +`D_prime` +: {math}`f(w_{Ab}, w_{aB}, w_{AB}, n) = \frac{D}{D_{\max}}`, + + where {math} + \min\{p_A (1-p_B), p_B (1-p_B)\} & \textrm{if }D>=0 \\ + \min\{p_A p_B, (1-p_B) (1-p_B)\} & \textrm{otherwise} + \end{cases}``` + + and {math}`D` is defined above. Polarised, `total` weighted. + +`D2` +: {math}`f(w_{Ab}, w_{aB}, w_{AB}, n) = (p_{ab} - p_{a} p_{b})^2` + + Unpolarised, `total` weighted. + +`Dz` +: {math}`f(w_{Ab}, w_{aB}, w_{AB}, n) = D (1 - 2 p_{a})(1 - 2p_{b})`, + + where {math}`D` is defined above. Unpolarised, `total` weighted. + +`pi2` +: {math}`f(w_{Ab}, w_{aB}, w_{AB}, n) = p_{a}p_{b}(1-p_{a})(1-p_{b})` + + Unpolarised, `total` weighted. + +`r` +: {math}`f(w_{Ab}, w_{aB}, w_{AB}, n) = \frac{D}{\sqrt{p_{a}p_{b}(1-p_{a})(1-p_{b})}}`, + + where {math}`D` is defined above. Polarised, `total` weighted. + +`r2` +: {math}`f(w_{Ab}, w_{aB}, w_{AB}, n) = \frac{D^{2}}{p_{a}p_{b}(1-p_{a})(1-p_{b})}`, + + where {math}`D` is defined above. Unpolarised, `haplotype` weighted. + +Unbiased two-locus statistics from the Hill-Robertson (1968) system are +computed from haplotype counts. Definitions of these unbiased estimators can +be found in [Ragsdale and Gravel +(2020)](https://doi.org/10.1093/molbev/msz265). They require at least 4 samples +to be valid and are specified as `stat="D2_unbiased"`, `"Dz_unbiased"`, or +`"pi2_unbiased"`. + +(sec_two_locus_summary_functions_two_way)= + +(sec_stats_two_locus_summary_functions_two_way)= + +##### Two-way + +Two-way statistics are indexed by sample sets {math}`i, j` and compute values +using haplotype counts within pairs of sample sets. + +`D2` +: {math}`f(w_{Ab}, w_{aB}, w_{AB}, n) = D_i * D_j`, + +where {math}`D` is defined above. + +`r2` +: {math}`f(w_{Ab}, w_{aB}, w_{AB}, n) = \frac{(p_{AB_i} - (p_{A_i} p_{B_i})) (p_{AB_j} - (p_{A_j} p_{B_j}))}{\sqrt{p_{A_i} (1 - p_{A_i}) p_{B_i} (1 - p_{B_i})}\sqrt{p_{A_j} (1 - p_{A_j}) p_{B_j} (1 - p_{B_j})}}` + +And `D2_unbiased`, which can be found in [Ragsdale and Gravel +(2020)](https://doi.org/10.1093/molbev/msz265). (sec_stats_notes)= diff --git a/python/CHANGELOG.rst b/python/CHANGELOG.rst index 916412cc81..bc9b6217da 100644 --- a/python/CHANGELOG.rst +++ b/python/CHANGELOG.rst @@ -7,6 +7,12 @@ In development - Add ``json+struct`` metadata codec that allows storing binary data using a struct schema alongside JSON metadata. (:user:`benjeffery`, :pr:`3306`) +**Features** + +- Add ``TreeSequence.ld_matrix`` stats method and documentation, for computing + two-locus statistics in site and branch mode. + (:user:`lkirk`, :user:`apragsdale`, :pr:`3416`) + -------------------- [1.0.2] - 2026-03-06 -------------------- diff --git a/python/tskit/trees.py b/python/tskit/trees.py index 45d2da59e0..2f0d9e9089 100644 --- a/python/tskit/trees.py +++ b/python/tskit/trees.py @@ -10930,12 +10930,92 @@ def impute_unknown_mutations_time( def ld_matrix( self, sample_sets=None, - sites=None, - positions=None, mode="site", stat="r2", + sites=None, + positions=None, indexes=None, ): + r""" + + Returns a matrix of the specified two-locus statistic (default + :math:`r^2`) computed from sample allelic states or branch lengths. + The resulting linkage disequilibrium (LD) matrix represents either the + two-locus statistic as computed between all pairs of specified + ``sites`` ("site" mode, producing a num_sites-by-num_sites sized + matrix), or as computed from the branch structures at marginal trees + between pairs of trees at all specified ``positions`` ("branch" mode, + producing a num_positions-by-num_positions sized matrix). + + In the site mode, the sites under consideration can be restricted using + the ``sites`` argument. Sites can be passed as a list of lists, + specifying the ``[[row_sites], [col_sites]]``, resulting in a + rectangular matrix, or by specifying a single list of ``[sites]``, in + which a square matrix will be produced (see + :ref:`sec_stats_two_locus_site` for examples). + + Similarly, in the branch mode, the ``positions`` argument specifies + loci for which the expectation for the two-locus statistic is computed + over pairs of trees at those positions. LD stats are computed between + trees whose ``[start, end)`` contains the given position (such that + repeats of trees are possible). Similar to the site mode, a nested list + of row and column positions can be specified separately (resulting in a + rectangular matrix) or a single list of a specified positions results + in a square matrix (see :ref:`sec_stats_two_locus_branch` for + examples). + + Some LD statistics are defined for two sample sets instead of within a + single set of samples. If the ``indexes`` argument is specified, at + least two sample sets must also be specified. ``indexes`` specifies the + indexes of the sample sets in the ``sample_sets`` list + between which to compute LD. + + For more on how the ``indexes`` and ``sample_sets`` interact with the + output dimensions, see the :ref:`sec_stats_two_locus_sample_sets` + section. + + **Available Stats** (use ``Stat Name`` in the ``stat`` keyword + argument). + + ======================= ========== ================ ============== + Stat Polarised Multi Sample Set Stat Name + ======================= ========== ================ ============== + :math:`r^2` n y "r2" + :math:`r` y n "r" + :math:`D^2` n y "D2" + :math:`D` y n "D" + :math:`D'` y n "D_prime" + :math:`D_z` n n "Dz" + :math:`\pi_2` n n "pi2" + :math:`\widehat{D^2}` n y "D2_unbiased" + :math:`\widehat{D_z}` n n "Dz_unbiased" + :math:`\widehat{\pi_2}` n n "pi2_unbiased" + ======================= ========== ================ ============== + + :param list sample_sets: A list of lists of sample Node IDs, specifying + the groups of nodes to compute the statistic with. Defaults to all + samples grouped by population. + :param str mode: A string giving the "type" of the statistic to be + computed. Defaults to "site", can be "site" or "branch". + :param str stat: A string giving the selected two-locus statistic to + compute. Defaults to "r2". + :param list sites: A list of sites over which to compute LD. Can be + specified as a list of lists to control the row and column sites. + Only applicable in site mode. Specify as + ``[[row_sites], [col_sites]]`` or ``[all_sites]``. + Defaults to all sites. + :param list positions: A list of genomic positions where expected LD is + computed. Only applicable in branch mode. Can be specified as a list + of lists to control the row and column positions. Specify as + ``[[row_positions], [col_positions]]`` or ``[all_positions]``. + Defaults to the leftmost coordinates of all trees. + :param list indexes: A list of 2-tuples or a single 2-tuple, specifying + the indexes of two sample sets over which to compute a two-way LD + statistic. Only :math:`r^2`, :math:`D^2`, and :math:`\widehat{D^2}` + are implemented for two-way statistics. + :return: A 2D or 3D array of LD matrices. + :rtype: numpy.ndarray + """ one_way_stats = { "D": self._ll_tree_sequence.D_matrix, "D2": self._ll_tree_sequence.D2_matrix,