Skip to content

Commit 271bcaa

Browse files
committed
Add, edit, streamline two-locus docs
1 parent 3b01b89 commit 271bcaa

File tree

1 file changed

+243
-0
lines changed

1 file changed

+243
-0
lines changed

docs/stats.md

Lines changed: 243 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -684,8 +684,251 @@ and boolean expressions (e.g., {math}`(x > 0)`) are interpreted as 0/1.
684684
and {math}`v_j` is the covariance of the trait with the j-th covariate.
685685

686686

687+
(sec_stats_multi_site)=
688+
687689
## Multi site statistics
688690

691+
(sec_stats_two_locus)=
692+
693+
### Two-locus statistics
694+
695+
The {meth}`~TreeSequence.ld_matrix` method provides an interface to
696+
a collection of two-locus statistics with predefined summary functions (see
697+
{ref}`sec_stats_two_locus_summary_functions`) and `site` and `branch`
698+
{ref}`modes <sec_stats_mode>`. The LD matrix method differs from other
699+
statistics methods in that it provides a unified API with an argument to
700+
specify different two-locus summaries of the data. It otherwise behaves
701+
similarly to most other functions with respect to `sample_sets` and `indexes`.
702+
703+
Two-locus statistics can be computed using two modes, either `sites` or
704+
`branch`, and these should be interpreted in the same way as these modes in the
705+
single-site statistics. Site statistics allow for multi-allelic data, while
706+
branch statistics assume an infinite sites model. Within this framework, we
707+
also implement polarisation, but do not expose it to the user, opting to
708+
provide statistics that are polarised where appropriate.
709+
710+
(sec_stats_two_locus_site)=
711+
712+
#### Site
713+
714+
The `site` mode computes two-locus statistics summarized over alleles between
715+
all pairs of specified sites. The default behavior, leaving `sites`
716+
unspecified, this method will compute a matrix for all pairs of sites, with
717+
rows and columns representing each site in the tree sequence (i.e., an n×n
718+
matrix where n is the number of sites in the tree sequence). We can also
719+
restrict the output to a subset of sites, either by specifying a single vector
720+
for both rows and columns or a pair of vectors for the row sites and column
721+
sites separately.
722+
723+
The following computes a matrix of the {math}`r^2` measure of linkage
724+
disequilibrium (LD) computed pairwise between the first 4 sites in the tree
725+
sequence among all samples. In our computations, row sites are used as the row
726+
("left-hand") loci and column sites are used as the column ("right-hand")
727+
locus, and with a single list of sites specified, we obtain a symmetric square
728+
matrix.
729+
730+
```{code-cell} ipython3
731+
ld = ts.ld_matrix(sites=[[0, 1, 2, 3]])
732+
print(ld)
733+
```
734+
735+
The following demonstrates how we can specify the row and column sites
736+
independently of each other. We're specifying 3 columns and 2 rows, which
737+
computes a subset of the matrix shown above.
738+
739+
```{code-cell} ipython3
740+
ld = ts.ld_matrix(sites=[[0, 1], [1, 2, 3]])
741+
print(ld)
742+
```
743+
744+
Because we implement two-locus statistics for multi-allelic data, we require
745+
a method for combining the statistical results from each pair of alleles into
746+
one summary for a pair of sites. There two methods for combining results from
747+
multiple alleles, `hap_weighted` and `total_weighted`, which are
748+
statistic-specific and not chosen by the user:
749+
750+
(sec_stats_two_locus_branch)=
751+
752+
#### Branch
753+
754+
The `branch` mode computes two-locus statistics between pairs of trees. The
755+
resulting statistic is a summary depending on marginal tree topologies and the
756+
branch lengths of the trees at a pair of positions. To perform this
757+
computation, we consider the counts of all possible two-locus haplotypes that
758+
may be generated by possible mutations on each pair of branches between the two
759+
trees, which is summarized using an LD statistic. We then sum each pairwise
760+
comparison weighted by the product of the two branch lengths above the nodes on
761+
each tree.
762+
763+
Similar to the single-site statistics computed in `branch` mode, this results
764+
in a statistic that is proportional to the expected statistic under an infinite
765+
sites model (which mutation rate 1), conditioned on the pair of trees.
766+
767+
The time complexity of this method is quadratic, due to the pairwise
768+
comparisons of branches from a pair of trees. By default, this method computes
769+
a symmetric matrix for all pairs of trees, with rows and columns representing
770+
each tree in the tree sequence. Similar to the site method, we can restrict the
771+
output to a subset of trees, either by specifying a vector of positions or
772+
a pair of vectors for row and column positions separately. To select a specific
773+
tree, the specified positions must land in the tree span (`[start, end)`).
774+
775+
In the following, we compute a matrix of expected {math}`r^2` within and
776+
between the first 4 trees in the tree sequence. The tree breakpoints are
777+
a convenient way to specify those first four trees tree.
778+
779+
```{code-cell} ipython3
780+
ld = ts.ld_matrix(mode="branch", positions=[ts.breakpoints(as_array=True)[0:4]])
781+
print(ld)
782+
```
783+
784+
Again, we can specify the row and column trees separately.
785+
786+
```{code-cell} ipython3
787+
breakpoints = ts.breakpoints(as_array=True)
788+
ld = ts.ld_matrix(mode="branch", positions=[breakpoints[[0]], breakpoints[0:4]])
789+
print(ld)
790+
```
791+
792+
#### Sample Sets
793+
794+
The two-locus API provides a mechanism by which to subset the samples under
795+
consideration, providing the ability to compute a separate LD matrix for each
796+
sample set or an LD matrix between sample sets. Output dimensions are handled
797+
in the same manner as the rest of the stats API (see
798+
{ref}`sec_stats_output_dimensions`). The two-locus API differs from the rest of
799+
the stats API in that we handle one-way and two-way statistics in the same
800+
function call.
801+
802+
To compute a two-way two-locus statistic, the `index` argument must be
803+
provided. The statistics are selected in the same way (with the `stat`
804+
argument), but we provide a restricted set of two-way statistics (see
805+
{ref}`sec_stats_two_locus_summary_functions_two_way`). The dimension-dropping
806+
rules for the result follow the rest of the tskit stats API in that scalar
807+
indexes will produce a single two-dimensional matrix, while list of indexes
808+
will produce a three-dimensional array, with the outer dimension of length
809+
equal to the length of the list of indexes.
810+
811+
For concreteness, we would expect the following dimensions with the specified
812+
`sample_sets` and `indexes` arguments.
813+
814+
```
815+
# one-way
816+
ts.ld_matrix(sample_sets=None) -> 2 dimensions
817+
ts.ld_matrix(sample_sets=[0, 1, 2, 3]) -> 2 dimensions
818+
ts.ld_matrix(sample_sets=[[0, 1, 2, 3]]) -> 3 dimensions
819+
# two-way
820+
ts.ld_matrix(sample_sets=[[0, 1, 2, 3], [4, 5, 6, 7]], indexes=(0, 1)) -> 2 dimensions
821+
ts.ld_matrix(sample_sets=[[0, 1, 2, 3], [4, 5, 6, 7]], indexes=[(0, 1)]) -> 3 dimensions
822+
```
823+
824+
(sec_stats_two_locus_sample_one_way_stats)=
825+
826+
##### One-way Statistics
827+
828+
One-way statistics are summaries of two loci in a single sample set, using a
829+
triple of haplotype counts {math}`\{w_{Ab}, w_{aB}, w_{AB}\}` and the size of
830+
the sample set {math}`n`, where the capitalized and lowercase letters in our notation
831+
represent alternate alleles.
832+
833+
(sec_stats_two_locus_sample_two_way_stats)=
834+
835+
##### Two-way Statistics
836+
837+
Two-way statistics are summaries of haplotype counts between two sample sets,
838+
which operate on the three haplotype counts (as in one-way stats, above)
839+
computed from each sample set, indexed by `(i, j)`. These statistics take on
840+
a different meaning from their one-way counterparts. For instance {math}`D_i
841+
D_j` can be thought of as the covariance between two loci when computed for
842+
a haplotype in a single sample set.
843+
844+
Only a subset of our summary functions are two-way statistics (see
845+
{ref}`sec_two_locus_summary_functions_two_way`). Note that the unbiased two-way
846+
statistics expect non-overlapping sample sets, and we do not make any
847+
assertions about the sample sets and assume that `i` and `j` represent disjoint
848+
sets of samples (see also the note in {meth}`~TreeSequence.divergence`).
849+
850+
(sec_stats_two_locus_summary_functions)=
851+
852+
#### Summary Functions
853+
854+
(sec_stats_two_locus_summary_functions_one_way)=
855+
856+
##### One-way
857+
858+
The two-locus summary functions all take haplotype counts and sample set size as
859+
input. Each of our summary functions has the signature
860+
{math}`f(w_{Ab}, w_{aB}, w_{AB}, n)`, converting to haplotype frequencies
861+
{math}`\{p_{Ab}, p_{aB}, p_{AB}\}` where appropriate by dividing by {math}`n`.
862+
863+
`D`
864+
: {math}`f(w_{Ab}, w_{aB}, w_{AB}, n) = p_{ab} - p_{a}p_{b}`
865+
866+
This statistic is polarised, as the unpolarised result, which averages over
867+
allele labelings, is zero. Uses the `total` normalisation method.
868+
869+
`D_prime`
870+
: {math}`f(w_{Ab}, w_{aB}, w_{AB}, n) = \frac{D}{D_{\max}}`
871+
872+
Where {math}```
873+
D_{\max} = \begin{cases}
874+
\min\{p_A (1-p_B), p_B (1-p_B)\} & \textrm{if }D>=0 \\
875+
\min\{p_A p_B, (1-p_B) (1-p_B)\} & \textrm{otherwise}
876+
\end{cases}```
877+
878+
and {math}`D` is defined above.
879+
880+
`D2`
881+
: {math}`f(w_{Ab}, w_{aB}, w_{AB}, n) = (p_{ab} - p_{a} p_{b})^2`
882+
883+
Unpolarised, total weighted.
884+
885+
`Dz`
886+
: {math}`f(w_{Ab}, w_{aB}, w_{AB}, n) = D (1 - 2 p_{a})(1 - 2p_{b})`
887+
888+
Where {math}`D` is defined above. Unpolarised, total weighted.
889+
890+
`pi2`
891+
: {math}`f(w_{Ab}, w_{aB}, w_{AB}, n) = p_{a}p_{b}(1-p_{a})(1-p_{b})`
892+
893+
Unpolarised, total weighted.
894+
895+
`r`
896+
: {math}`f(w_{Ab}, w_{aB}, w_{AB}, n) = \frac{D}{\sqrt{p_{a}p_{b}(1-p_{a})(1-p_{b})}}`
897+
898+
Where {math}`D` is defined above. Polarised, total weighted.
899+
900+
`r2`
901+
: {math}`f(w_{Ab}, w_{aB}, w_{AB}, n) = \frac{D^{2}}{p_{a}p_{b}(1-p_{a})(1-p_{b})}`
902+
903+
Where {math}`D` is defined above. Unpolarised, haplotype weighted.
904+
905+
Unbiased two-locus statistics from the Hill-Robertson (1968) system are
906+
computed from haplotype counts. Derivations for these unbiased estimators can
907+
be found in [Ragsdale and Gravel
908+
(2020)](https://doi.org/10.1093/molbev/msz265). They require at least 4 samples
909+
to be valid and are specified as `stat=D2_unbiased`, `Dz_unbiased`, or
910+
`pi2_unbiased`.
911+
912+
(sec_two_locus_summary_functions_two_way)=
913+
914+
(sec_stats_two_locus_summary_functions_two_way)=
915+
916+
##### Two-way
917+
918+
Two-way statistics are indexed by sample sets {math}`i, j` and compute values
919+
using haplotype counts within pairs of sample sets.
920+
921+
`D2`
922+
: {math}`f(w_{Ab}, w_{aB}, w_{AB}, n) = D_i * D_j`
923+
924+
Where {math}`D` is defined above.
925+
926+
`r2`
927+
: {math}`f(w_{Ab}, w_{aB}, w_{AB}, n) = \frac{(p_{AB_i} - (p_{A_i} p_{B_i})) (p_{AB_j} - (p_{A_j} p_{B_j}))}{\sqrt{p_{A_i} (1 - p_{A_i}) p_{B_i} (1 - p_{B_i})}\sqrt{p_{A_j} (1 - p_{A_j}) p_{B_j} (1 - p_{B_j})}}`
928+
929+
And `D2_unbiased`, which can be found in [Ragsdale and Gravel
930+
(2020)](https://doi.org/10.1093/molbev/msz265).
931+
689932
:::{todo}
690933
Document statistics that use information about correlation between sites, such as
691934
LdCalculator (and perhaps reference {ref}`sec_identity`). Note that if we have a general

0 commit comments

Comments
 (0)