@@ -70,13 +70,14 @@ Windowing
7070*********
7171
7272Each statistic has an argument, ``windows``,
73- which defines a collection of contiguous windows along the genome.
74- If ``windows`` is a list of ``n+1`` increasing numbers between 0 and the ``sequence_length``,
75- then the statistic will be computed separately in each of the ``n`` windows,
73+ which defines a collection of contiguous windows spanning the genome.
74+ ``windows`` should be a list of ``n+1`` increasing numbers beginning with 0
75+ and ending with the ``sequence_length``.
76+ The statistic will be computed separately in each of the ``n`` windows,
7677and the ``k``-th row of the output will report the values of the statistic
7778in the ``k``-th window, i.e., from (and including) ``windows[k]`` to (but not including) ``windows[k+1]``.
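
The half-open window convention described above can be illustrated with a small, library-independent sketch. The helper ``window_index`` and the example positions are hypothetical and are not part of the tskit API; the sketch only shows which window a genomic position falls into under the rule ``windows[k] <= position < windows[k+1]``.

```python
from bisect import bisect_right


def window_index(position, windows):
    """Return k such that windows[k] <= position < windows[k+1].

    Hypothetical helper for illustration only; not a tskit function.
    """
    if not (windows[0] <= position < windows[-1]):
        raise ValueError("position lies outside the windows")
    return bisect_right(windows, position) - 1


# n + 1 = 4 breakpoints starting at 0 and ending at the sequence length (100),
# so there are n = 3 windows: [0, 10), [10, 25) and [25, 100).
windows = [0, 10, 25, 100]
positions = [0, 9.5, 10, 24, 25, 99]  # made-up site positions

counts = [0] * (len(windows) - 1)
for pos in positions:
    counts[window_index(pos, windows)] += 1

print(counts)  # [2, 2, 2]
```

Note that position 10 is counted in the second window, not the first, because the right-hand end of each window is excluded.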
7879
79- All windowed statistics by default return **averages** within each of the windows,
80+ Most windowed statistics by default return **averages** within each of the windows,
8081so the values are comparable between windows, even of different lengths.
8182(However, shorter windows may be noisier.)
8283Suppose for instance that you compute some statistic with ``windows = [a, b, c]``
@@ -108,6 +109,13 @@ There are some shortcuts to other useful options:
108109 since the windows are all different sizes you probably want to also pass
109110 ``span_normalise=False`` (see below).
110111
112+
113+ .. _sec_general_stats_span_normalise:
114+
115+ +++++++++++++
116+ Normalisation
117+ +++++++++++++
118+
111119Furthermore, there is an option, ``span_normalise`` (default ``True``),
112120that if ``False`` returns the **sum** of the relevant statistic across each window rather than the average.
113121The statistic that is returned by default is an average because we divide by
@@ -206,6 +214,7 @@ Here are some additional special cases:
206214 were that allowed.)
207215
208216
217+
209218.. _sec_general_stats_output_format:
210219
211220*************
@@ -343,6 +352,18 @@ regression with other covariates (as in GWAS).
343352- :meth:`.TreeSequence.trait_covariance`
344353- :meth:`.TreeSequence.trait_correlation`
345354
355+ ------------------
356+ Derived statistics
357+ ------------------
358+
359+ The other statistics above all have the property that `mode="branch"` and
360+ `mode="site"` are "dual" in the sense that they are equal, on average, under
361+ a high neutral mutation rate. The following statistics do not have this
362+ property (since both are ratios of statistics that do have this property).
363+
364+ - :meth:`.TreeSequence.Fst`
365+ - :meth:`.TreeSequence.TajimasD`
366+
346367---------------
347368General methods
348369---------------
@@ -355,15 +376,35 @@ using these methods directly, so they should be preferred.
355376- :meth:`.TreeSequence.general_stat`
356377- :meth:`.TreeSequence.sample_count_stat`
357378
358- ------------------
359- Derived statistics
360- ------------------
361379
362- The other statistics above all have the property that `mode="branch"` and
363- `mode="site"` are "dual" in the sense that they are equal, on average, under
364- a high neutral mutation rate. The following statistics do not have this
365- property (since both are ratios of statistics that do have this property).
380+ .. _sec_general_stats_advanced:
366381
367- - :meth:`.TreeSequence.Fst`
368- - :meth:`.TreeSequence.TajimasD`
382+ ****************
383+ Advanced methods
384+ ****************
385+
386+ The methods :meth:`.TreeSequence.general_stat` and :meth:`.TreeSequence.sample_count_stat`
387+ provide access to the general-purpose algorithm for computing statistics.
388+ Here is a bit more discussion of how to use these.
389+
390+ .. _sec_general_stats_polarisation:
391+
392+ ++++++++++++
393+ Polarisation
394+ ++++++++++++
395+
396+ Many statistics calculated from genome sequence treat all alleles on equal footing,
397+ as one must without knowledge of the ancestral state and sequence of mutations that produced the data.
398+ Separating out the *ancestral* allele (e.g., as inferred using an outgroup)
399+ is known as *polarisation*.
400+ For instance, in the allele frequency spectrum, a site with alleles at 20% and 80% frequency
401+ is no different from another whose alleles are at 80% and 20%,
402+ unless we know in each case which allele is ancestral.
403+ So while the unpolarised allele frequency spectrum gives the distribution of frequencies of *all* alleles,
404+ the *polarised* allele frequency spectrum gives the distribution of frequencies of only *derived* alleles.
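
The 20%/80% example above can be made concrete with a short plain-Python sketch. It is only an illustration of the idea, not tskit's algorithm: the function ``allele_count_spectrum`` and the two sites are hypothetical, and a real implementation may tally the unpolarised spectrum differently (for instance, by folding it).

```python
from collections import Counter


def allele_count_spectrum(sites, polarised):
    """Tally how many alleles are carried by each number of samples.

    `sites` is a list of (ancestral_allele, {allele: sample_count}) pairs.
    If `polarised` is true, the ancestral allele at each site is skipped,
    so only derived alleles are counted. Hypothetical helper, for
    illustration only.
    """
    spectrum = Counter()
    for ancestral, allele_counts in sites:
        for allele, count in allele_counts.items():
            if polarised and allele == ancestral:
                continue
            spectrum[count] += 1
    return spectrum


# Two made-up sites among 10 samples: the derived allele is at 20% at the
# first site and at 80% at the second.
sites = [
    ("A", {"A": 8, "T": 2}),
    ("G", {"G": 2, "C": 8}),
]

unpolarised = allele_count_spectrum(sites, polarised=False)
polarised = allele_count_spectrum(sites, polarised=True)

print(dict(unpolarised))  # both sites contribute one allele at 2 and one at 8
print(dict(polarised))    # one derived allele at 2, one derived allele at 8
```

Without polarisation the two sites contribute identically to the spectrum; with it, they are distinguished by which allele is derived.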
369405
406+ This concept is extended to more general statistics as follows.
407+ For site statistics, summary functions are applied to the total weight or number of samples
408+ associated with each allele; but if polarised, then the ancestral allele is left out of this sum.
409+ For branch or node statistics, summary functions are applied to the total weight or number of samples
410+ below and above each branch or node; if polarised, then only the weight below is used.
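
The following sketch restates the two rules above in plain Python, under the assumption that weights are simple sample counts. The helpers ``site_contribution`` and ``branch_contribution`` and the summary function ``f`` are hypothetical names chosen for illustration; they are not part of tskit, and the sketch omits details such as branch lengths and normalisation.

```python
def site_contribution(allele_counts, ancestral, f, polarised):
    """Sum f over the sample counts of the alleles at one site.

    If polarised, the ancestral allele is left out of the sum.
    """
    return sum(
        f(count)
        for allele, count in allele_counts.items()
        if not (polarised and allele == ancestral)
    )


def branch_contribution(below, total, f, polarised):
    """Apply f to the number of samples below a branch and, unless
    polarised, also to the number of samples above it (total - below)."""
    value = f(below)
    if not polarised:
        value += f(total - below)
    return value


def f(x):
    # Hypothetical, deliberately asymmetric summary function.
    return x * x


n = 10  # assumed number of samples

site = {"A": 8, "T": 2}  # "A" is taken to be ancestral
print(site_contribution(site, "A", f, polarised=False))  # 68 = 64 + 4
print(site_contribution(site, "A", f, polarised=True))   # 4

print(branch_contribution(3, n, f, polarised=False))  # 58 = 9 + 49
print(branch_contribution(3, n, f, polarised=True))   # 9
```

An asymmetric ``f`` is used so that the polarised and unpolarised values differ by more than a constant factor.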