Commit 5e90c6e

petrelharp authored and jeromekelleher committed
more links in the docs
1 parent dd9ddb4 commit 5e90c6e

2 files changed

+158
-48
lines changed

docs/stats.rst

Lines changed: 54 additions & 13 deletions
@@ -70,13 +70,14 @@ Windowing
7070
*********
7171

7272
Each statistic has an argument, ``windows``,
73-
which defines a collection of contiguous windows along the genome.
74-
If ``windows`` is a list of ``n+1`` increasing numbers between 0 and the ``sequence_length``,
75-
then the statistic will be computed separately in each of the ``n`` windows,
73+
which defines a collection of contiguous windows spanning the genome.
74+
``windows`` should be a list of ``n+1`` increasing numbers beginning with 0
75+
and ending with the ``sequence_length``.
76+
The statistic will be computed separately in each of the ``n`` windows,
7677
and the ``k``-th row of the output will report the values of the statistic
7778
in the ``k``-th window, i.e., from (and including) ``windows[k]`` to (but not including) ``windows[k+1]``.
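
As a minimal sketch of the window convention described above (not the library's own code), the following pure-Python helpers check that a ``windows`` list begins with 0, ends with the sequence length, and is strictly increasing, and then find which half-open window ``[windows[k], windows[k+1])`` contains a given position. The names ``check_windows`` and ``window_index`` are hypothetical and chosen only for illustration.

```python
from bisect import bisect_right


def check_windows(windows, sequence_length):
    """Raise ValueError unless windows is a valid list of breakpoints.

    Hypothetical helper: valid means at least two breakpoints, starting
    at 0, ending at sequence_length, and strictly increasing.
    """
    if len(windows) < 2:
        raise ValueError("need at least two breakpoints")
    if windows[0] != 0 or windows[-1] != sequence_length:
        raise ValueError("windows must start at 0 and end at sequence_length")
    if any(left >= right for left, right in zip(windows, windows[1:])):
        raise ValueError("windows must be strictly increasing")


def window_index(position, windows):
    """Return k such that windows[k] <= position < windows[k + 1]."""
    if not windows[0] <= position < windows[-1]:
        raise ValueError("position lies outside the windows")
    return bisect_right(windows, position) - 1


# Three windows on a sequence of length 100: [0, 10), [10, 50), [50, 100).
windows = [0, 10, 50, 100]
check_windows(windows, 100)

# A position equal to a breakpoint belongs to the window that starts there.
indices = [window_index(x, windows) for x in (0, 9.5, 10, 99)]
```

Here ``n + 1 = 4`` breakpoints define ``n = 3`` windows, so a statistic computed with these windows would have three rows of output.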
7879

79-
All windowed statistics by default return **averages** within each of the windows,
80+
Most windowed statistics by default return **averages** within each of the windows,
8081
so the values are comparable between windows, even of different lengths.
8182
(However, shorter windows may be noisier.)
8283
Suppose for instance that you compute some statistic with ``windows = [a, b, c]``
@@ -108,6 +109,13 @@ There are some shortcuts to other useful options:
108109
since the windows are all different sizes you probably want to also pass
109110
``span_normalise=False`` (see below).
110111

112+
113+
.. _sec_general_stats_span_normalise:
114+
115+
+++++++++++++
116+
Normalisation
117+
+++++++++++++
118+
111119
Furthermore, there is an option, ``span_normalise`` (default ``True``),
112120
that if ``False`` returns the **sum** of the relevant statistic across each window rather than the average.
113121
The statistic that is returned by default is an average because we divide by
@@ -206,6 +214,7 @@ Here are some additional special cases:
206214
were that allowed.)
207215

208216

217+
209218
.. _sec_general_stats_output_format:
210219

211220
*************
@@ -343,6 +352,18 @@ regression with other covariates (as in GWAS).
343352
- :meth:`.TreeSequence.trait_covariance`
344353
- :meth:`.TreeSequence.trait_correlation`
345354

355+
------------------
356+
Derived statistics
357+
------------------
358+
359+
The other statistics above all have the property that ``mode="branch"`` and
360+
``mode="site"`` are "dual" in the sense that they are equal, on average, under
361+
a high neutral mutation rate. The following statistics do not have this
362+
property (since both are ratios of statistics that do have this property).
363+
364+
- :meth:`.TreeSequence.Fst`
365+
- :meth:`.TreeSequence.TajimasD`
366+
346367
---------------
347368
General methods
348369
---------------
@@ -355,15 +376,35 @@ using these methods directly, so they should be preferred.
355376
- :meth:`.TreeSequence.general_stat`
356377
- :meth:`.TreeSequence.sample_count_stat`
357378

358-
------------------
359-
Derived statistics
360-
------------------
361379

362-
The other statistics above all have the property that `mode="branch"` and
363-
`mode="site"` are "dual" in the sense that they are equal, on average, under
364-
a high neutral mutation rate. The following statistics do not have this
365-
property (since both are ratios of statistics that do have this property).
380+
.. _sec_general_stats_advanced:
366381

367-
- :meth:`.TreeSequence.Fst`
368-
- :meth:`.TreeSequence.TajimasD`
382+
****************
383+
Advanced methods
384+
****************
385+
386+
The methods :meth:`.TreeSequence.general_stat` and :meth:`.TreeSequence.sample_count_stat`
387+
provide access to the general-purpose algorithm for computing statistics.
388+
Here is a bit more discussion of how to use these.
389+
390+
.. _sec_general_stats_polarisation:
391+
392+
++++++++++++
393+
Polarisation
394+
++++++++++++
395+
396+
Many statistics calculated from genome sequence treat all alleles on equal footing,
397+
as one must without knowledge of the ancestral state and sequence of mutations that produced the data.
398+
Separating out the *ancestral* allele (e.g., as inferred using an outgroup)
399+
is known as *polarisation*.
400+
For instance, in the allele frequency spectrum, a site with alleles at 20% and 80% frequency
401+
is no different than another whose alleles are at 80% and 20%,
402+
unless we know in each case which allele is ancestral,
403+
and so while the unpolarised allele frequency spectrum gives the distribution of frequencies of *all* alleles,
404+
the *polarised* allele frequency spectrum gives the distribution of frequencies of only *derived* alleles.
369405

406+
This concept is extended to more general statistics as follows.
407+
For site statistics, summary functions are applied to the total weight or number of samples
408+
associated with each allele; but if polarised, then the ancestral allele is left out of this sum.
409+
For branch or node statistics, summary functions are applied to the total weight or number of samples
410+
below and above each branch or node; if polarised, then only the weight below is used.
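
The rule above can be illustrated with a small pure-Python sketch. It is not the library's implementation: the function names ``site_summary`` and ``branch_summary``, and the choice of the identity as summary function, are assumptions made only for this example. It uses the 20%/80% site from the allele frequency spectrum discussion, with 10 samples.

```python
def site_summary(allele_counts, ancestral_allele, f, polarised=False):
    """Sum f over the sample counts of each allele at one site.

    Hypothetical helper: if polarised, the ancestral allele is left
    out of the sum, so only derived alleles contribute.
    """
    return sum(
        f(count)
        for allele, count in allele_counts.items()
        if not (polarised and allele == ancestral_allele)
    )


def branch_summary(num_below, num_samples, f, polarised=False):
    """Apply f to the samples below (and, if unpolarised, above) a branch."""
    if polarised:
        return f(num_below)
    return f(num_below) + f(num_samples - num_below)


def identity(x):
    # Illustrative summary function: just the number of samples.
    return x


# One site in 10 samples: allele "A" in 2 samples, allele "T" in 8.
site = {"A": 2, "T": 8}

# Unpolarised: both alleles count, whichever one is ancestral.
unpolarised = site_summary(site, "A", identity)

# Polarised: the answer depends on which allele is ancestral.
derived_if_a_ancestral = site_summary(site, "A", identity, polarised=True)
derived_if_t_ancestral = site_summary(site, "T", identity, polarised=True)

# A branch with 2 of the 10 samples below it.
branch_unpolarised = branch_summary(2, 10, identity)
branch_polarised = branch_summary(2, 10, identity, polarised=True)
```

With the identity summary function, the unpolarised site value is 10 regardless of the ancestral state, while the polarised value is 8 or 2 depending on which allele is ancestral. This is the sense in which polarisation distinguishes the two otherwise equivalent sites.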
