@@ -684,8 +684,251 @@ and boolean expressions (e.g., {math}`(x > 0)`) are interpreted as 0/1.
684684 and {math}` v_j ` is the covariance of the trait with the j-th covariate.
685685
686686
687+ (sec_stats_multi_site)=
688+
687689## Multi site statistics
688690
691+ (sec_stats_two_locus)=
692+
693+ ### Two-locus statistics
694+
695+ The {meth}` ~TreeSequence.ld_matrix ` method provides an interface to
696+ a collection of two-locus statistics with predefined summary functions (see
697+ {ref}` sec_stats_two_locus_summary_functions ` ) and ` site ` and ` branch `
698+ {ref}` modes <sec_stats_mode> ` . The LD matrix method differs from other
699+ statistics methods in that it provides a unified API with an argument to
700+ specify different two-locus summaries of the data. It otherwise behaves
701+ similarly to most other functions with respect to ` sample_sets ` and ` indexes ` .
702+
703+ Two-locus statistics can be computed using two modes, either ` sites ` or
704+ ` branch ` , and these should be interpreted in the same way as these modes in the
705+ single-site statistics. Site statistics allow for multi-allelic data, while
706+ branch statistics assume an infinite sites model. Within this framework, we
707+ also implement polarisation, but do not expose it to the user, opting to
708+ provide statistics that are polarised where appropriate.
709+
710+ (sec_stats_two_locus_site)=
711+
712+ #### Site
713+
714+ The ` site ` mode computes two-locus statistics summarized over alleles between
715+ all pairs of specified sites. The default behavior, leaving ` sites `
716+ unspecified, this method will compute a matrix for all pairs of sites, with
717+ rows and columns representing each site in the tree sequence (i.e., an n×n
718+ matrix where n is the number of sites in the tree sequence). We can also
719+ restrict the output to a subset of sites, either by specifying a single vector
720+ for both rows and columns or a pair of vectors for the row sites and column
721+ sites separately.
722+
723+ The following computes a matrix of the {math}` r^2 ` measure of linkage
724+ disequilibrium (LD) computed pairwise between the first 4 sites in the tree
725+ sequence among all samples. In our computations, row sites are used as the row
726+ ("left-hand") loci and column sites are used as the column ("right-hand")
727+ locus, and with a single list of sites specified, we obtain a symmetric square
728+ matrix.
729+
730+ ``` {code-cell} ipython3
731+ ld = ts.ld_matrix(sites=[[0, 1, 2, 3]])
732+ print(ld)
733+ ```
734+
735+ The following demonstrates how we can specify the row and column sites
736+ independently of each other. We're specifying 3 columns and 2 rows, which
737+ computes a subset of the matrix shown above.
738+
739+ ``` {code-cell} ipython3
740+ ld = ts.ld_matrix(sites=[[0, 1], [1, 2, 3]])
741+ print(ld)
742+ ```
743+
744+ Because we implement two-locus statistics for multi-allelic data, we require
745+ a method for combining the statistical results from each pair of alleles into
746+ one summary for a pair of sites. There two methods for combining results from
747+ multiple alleles, ` hap_weighted ` and ` total_weighted ` , which are
748+ statistic-specific and not chosen by the user:
749+
750+ (sec_stats_two_locus_branch)=
751+
752+ #### Branch
753+
754+ The ` branch ` mode computes two-locus statistics between pairs of trees. The
755+ resulting statistic is a summary depending on marginal tree topologies and the
756+ branch lengths of the trees at a pair of positions. To perform this
757+ computation, we consider the counts of all possible two-locus haplotypes that
758+ may be generated by possible mutations on each pair of branches between the two
759+ trees, which is summarized using an LD statistic. We then sum each pairwise
760+ comparison weighted by the product of the two branch lengths above the nodes on
761+ each tree.
762+
763+ Similar to the single-site statistics computed in ` branch ` mode, this results
764+ in a statistic that is proportional to the expected statistic under an infinite
765+ sites model (which mutation rate 1), conditioned on the pair of trees.
766+
767+ The time complexity of this method is quadratic, due to the pairwise
768+ comparisons of branches from a pair of trees. By default, this method computes
769+ a symmetric matrix for all pairs of trees, with rows and columns representing
770+ each tree in the tree sequence. Similar to the site method, we can restrict the
771+ output to a subset of trees, either by specifying a vector of positions or
772+ a pair of vectors for row and column positions separately. To select a specific
773+ tree, the specified positions must land in the tree span (` [start, end) ` ).
774+
775+ In the following, we compute a matrix of expected {math}` r^2 ` within and
776+ between the first 4 trees in the tree sequence. The tree breakpoints are
777+ a convenient way to specify those first four trees tree.
778+
779+ ``` {code-cell} ipython3
780+ ld = ts.ld_matrix(mode="branch", positions=[ts.breakpoints(as_array=True)[0:4]])
781+ print(ld)
782+ ```
783+
784+ Again, we can specify the row and column trees separately.
785+
786+ ``` {code-cell} ipython3
787+ breakpoints = ts.breakpoints(as_array=True)
788+ ld = ts.ld_matrix(mode="branch", positions=[breakpoints[[0]], breakpoints[0:4]])
789+ print(ld)
790+ ```
791+
792+ #### Sample Sets
793+
794+ The two-locus API provides a mechanism by which to subset the samples under
795+ consideration, providing the ability to compute a separate LD matrix for each
796+ sample set or an LD matrix between sample sets. Output dimensions are handled
797+ in the same manner as the rest of the stats API (see
798+ {ref}` sec_stats_output_dimensions ` ). The two-locus API differs from the rest of
799+ the stats API in that we handle one-way and two-way statistics in the same
800+ function call.
801+
802+ To compute a two-way two-locus statistic, the ` index ` argument must be
803+ provided. The statistics are selected in the same way (with the ` stat `
804+ argument), but we provide a restricted set of two-way statistics (see
805+ {ref}` sec_stats_two_locus_summary_functions_two_way ` ). The dimension-dropping
806+ rules for the result follow the rest of the tskit stats API in that scalar
807+ indexes will produce a single two-dimensional matrix, while list of indexes
808+ will produce a three-dimensional array, with the outer dimension of length
809+ equal to the length of the list of indexes.
810+
811+ For concreteness, we would expect the following dimensions with the specified
812+ ` sample_sets ` and ` indexes ` arguments.
813+
814+ ```
815+ # one-way
816+ ts.ld_matrix(sample_sets=None) -> 2 dimensions
817+ ts.ld_matrix(sample_sets=[0, 1, 2, 3]) -> 2 dimensions
818+ ts.ld_matrix(sample_sets=[[0, 1, 2, 3]]) -> 3 dimensions
819+ # two-way
820+ ts.ld_matrix(sample_sets=[[0, 1, 2, 3], [4, 5, 6, 7]], indexes=(0, 1)) -> 2 dimensions
821+ ts.ld_matrix(sample_sets=[[0, 1, 2, 3], [4, 5, 6, 7]], indexes=[(0, 1)]) -> 3 dimensions
822+ ```
823+
824+ (sec_stats_two_locus_sample_one_way_stats)=
825+
826+ ##### One-way Statistics
827+
828+ One-way statistics are summaries of two loci in a single sample set, using a
829+ triple of haplotype counts {math}` \{w_{Ab}, w_{aB}, w_{AB}\} ` and the size of
830+ the sample set {math}` n ` , where the capitalized and lowercase letters in our notation
831+ represent alternate alleles.
832+
833+ (sec_stats_two_locus_sample_two_way_stats)=
834+
835+ ##### Two-way Statistics
836+
837+ Two-way statistics are summaries of haplotype counts between two sample sets,
838+ which operate on the three haplotype counts (as in one-way stats, above)
839+ computed from each sample set, indexed by ` (i, j) ` . These statistics take on
840+ a different meaning from their one-way counterparts. For instance {math}`D_i
841+ D_j` can be thought of as the covariance between two loci when computed for
842+ a haplotype in a single sample set.
843+
844+ Only a subset of our summary functions are two-way statistics (see
845+ {ref}` sec_two_locus_summary_functions_two_way ` ). Note that the unbiased two-way
846+ statistics expect non-overlapping sample sets, and we do not make any
847+ assertions about the sample sets and assume that ` i ` and ` j ` represent disjoint
848+ sets of samples (see also the note in {meth}` ~TreeSequence.divergence ` ).
849+
850+ (sec_stats_two_locus_summary_functions)=
851+
852+ #### Summary Functions
853+
854+ (sec_stats_two_locus_summary_functions_one_way)=
855+
856+ ##### One-way
857+
858+ The two-locus summary functions all take haplotype counts and sample set size as
859+ input. Each of our summary functions has the signature
860+ {math}` f(w_{Ab}, w_{aB}, w_{AB}, n) ` , converting to haplotype frequencies
861+ {math}` \{p_{Ab}, p_{aB}, p_{AB}\} ` where appropriate by dividing by {math}` n ` .
862+
863+ ` D `
864+ : {math}` f(w_{Ab}, w_{aB}, w_{AB}, n) = p_{ab} - p_{a}p_{b} `
865+
866+ This statistic is polarised, as the unpolarised result, which averages over
867+ allele labelings, is zero. Uses the ` total ` normalisation method.
868+
869+ ` D_prime `
870+ : {math}` f(w_{Ab}, w_{aB}, w_{AB}, n) = \frac{D}{D_{\max}} `
871+
872+ Where {math}```
873+ D_ {\max} = \begin{cases}
874+ \min\{ p_A (1-p_B), p_B (1-p_B)\} & \textrm{if }D>=0 \\
875+ \min\{ p_A p_B, (1-p_B) (1-p_B)\} & \textrm{otherwise}
876+ \end{cases}```
877+
878+ and {math}` D ` is defined above.
879+
880+ ` D2 `
881+ : {math}` f(w_{Ab}, w_{aB}, w_{AB}, n) = (p_{ab} - p_{a} p_{b})^2 `
882+
883+ Unpolarised, total weighted.
884+
885+ ` Dz `
886+ : {math}` f(w_{Ab}, w_{aB}, w_{AB}, n) = D (1 - 2 p_{a})(1 - 2p_{b}) `
887+
888+ Where {math}` D ` is defined above. Unpolarised, total weighted.
889+
890+ ` pi2 `
891+ : {math}` f(w_{Ab}, w_{aB}, w_{AB}, n) = p_{a}p_{b}(1-p_{a})(1-p_{b}) `
892+
893+ Unpolarised, total weighted.
894+
895+ ` r `
896+ : {math}` f(w_{Ab}, w_{aB}, w_{AB}, n) = \frac{D}{\sqrt{p_{a}p_{b}(1-p_{a})(1-p_{b})}} `
897+
898+ Where {math}` D ` is defined above. Polarised, total weighted.
899+
900+ ` r2 `
901+ : {math}` f(w_{Ab}, w_{aB}, w_{AB}, n) = \frac{D^{2}}{p_{a}p_{b}(1-p_{a})(1-p_{b})} `
902+
903+ Where {math}` D ` is defined above. Unpolarised, haplotype weighted.
904+
905+ Unbiased two-locus statistics from the Hill-Robertson (1968) system are
906+ computed from haplotype counts. Derivations for these unbiased estimators can
907+ be found in [ Ragsdale and Gravel
908+ (2020)] ( https://doi.org/10.1093/molbev/msz265 ) . They require at least 4 samples
909+ to be valid and are specified as ` stat=D2_unbiased ` , ` Dz_unbiased ` , or
910+ ` pi2_unbiased ` .
911+
912+ (sec_two_locus_summary_functions_two_way)=
913+
914+ (sec_stats_two_locus_summary_functions_two_way)=
915+
916+ ##### Two-way
917+
918+ Two-way statistics are indexed by sample sets {math}` i, j ` and compute values
919+ using haplotype counts within pairs of sample sets.
920+
921+ ` D2 `
922+ : {math}` f(w_{Ab}, w_{aB}, w_{AB}, n) = D_i * D_j `
923+
924+ Where {math}` D ` is defined above.
925+
926+ ` r2 `
927+ : {math}` f(w_{Ab}, w_{aB}, w_{AB}, n) = \frac{(p_{AB_i} - (p_{A_i} p_{B_i})) (p_{AB_j} - (p_{A_j} p_{B_j}))}{\sqrt{p_{A_i} (1 - p_{A_i}) p_{B_i} (1 - p_{B_i})}\sqrt{p_{A_j} (1 - p_{A_j}) p_{B_j} (1 - p_{B_j})}} `
928+
929+ And ` D2_unbiased ` , which can be found in [ Ragsdale and Gravel
930+ (2020)] ( https://doi.org/10.1093/molbev/msz265 ) .
931+
689932:::{todo}
690933Document statistics that use information about correlation between sites, such as
691934LdCalculator (and perhaps reference {ref}` sec_identity ` ). Note that if we have a general
0 commit comments