@@ -125,7 +125,8 @@ shared by many statistics, which we describe in detail in the following subsecti
125125 What section(s) of the genome are we interested in?
126126
127127:ref: `sec_stats_span_normalise `
128- Should we normalise information in windows by their span along the genome?
128+ Should the statistic calculated for each window be normalised by the length
129+ (i.e., the span) of that window?
129130
130131The statistics functions are highly efficient and are based where possible
131132on numpy arrays. Each of these statistics will return the results as a numpy
@@ -283,12 +284,12 @@ in the ``k``-th window, i.e., from (and including) ``windows[k]`` to (but not in
283284Most windowed statistics by default return **averages ** within each of the windows,
284285so the values are comparable between windows, even of different lengths.
285286(However, shorter windows may be noisier.)
286- Suppose for instance that you compute some statistic with ``windows = [a, b, c ] ``
287- for some valid positions ``a < b < c ``,
287+ Suppose for instance that you compute some statistic with ``windows = [0, a, b ] ``
288+ for some valid positions ``0 < a < b ``,
288289and get an output array ``S `` with two rows.
289- Then, computing the same statistic with ``windows = [a, c ] ``
290+ Then, computing the same statistic with ``windows = [0, b ] ``
290291would be equivalent to averaging the rows of ``S ``,
291- obtaining ``((b - a ) * S[0] + (c - b ) * S[1]) / (c - a ) ``.
292+ obtaining ``((a - 0 ) * S[0] + (b - a ) * S[1]) / (b - 0 ) ``.
292293
293294There are some shortcuts to other useful options:
294295
@@ -319,35 +320,34 @@ There are some shortcuts to other useful options:
319320Span normalise
320321++++++++++++++
321322
322- In addition to windowing there is an option, ``span_normalise `` (default ``True ``),
323- that if ``False `` returns the **sum ** of the relevant statistic across each window rather than the average.
324- The statistic that is returned by default is an average because we divide by
325- rather than normalizing (i.e., dividing) by the length of the window.
326- As above, if the statistic ``S `` was computed with ``span_normalise=False ``,
327- then the value obtained with ``windows = [a, c] `` would be equal to ``S[0] + S[1] ``.
323+ In addition to windowing there is an option, ``span_normalise `` (which defaults to ``True ``),
324+ All the primary statistics defined here are *sums * across locations in the genome:
325+ something is computed for each position, and these values are added up across all positions in each window.
326+ Whether the total length of the window is then taken into account is determined by the option ``span_normalise ``:
327+ if it is ``True `` (the default), the sum for each window is converted into an *average *,
328+ by dividing by the window's length (i.e., its *span *).
329+ Otherwise, the sum itself is returned.
330+ The default is ``span_normalise=True ``,
331+ because this makes the values comparable across windows of different sizes.
332+ To make this more concrete: :meth: `pairwise sequence divergence <.TreeSequence.divergence> `
333+ between two samples with ``mode="site" `` is the density of sites that differ between the samples;
334+ this is computed for each window by counting up the number of sites
335+ at which the two differ, and dividing by the total length of the window.
336+ If we wanted the number of sites at which the two differed in each window,
337+ we'd calculate divergence with ``span_normalise=False ``.
338+
339+ Following on from above, suppose we computed the statistic ``S `` with
340+ ``windows = [0, a, b] `` and ``span_normalise=True ``,
341+ and then computed ``T `` in just the same way except with ``span_normalize=False ``.
342+ Then ``S[0] `` would be equal to ``T[0] / a `` and ``S[1] = T[1] / (b - a) ``.
343+ Furthermore, the value obtained with ``windows = [0, b] `` would be equal to ``T[0] + T[1] ``.
328344However, you probably usually want the (default) normalized version:
329345don't get unnormalised values unless you're sure that's what you want.
330346The exception is when computing a site statistic with ``windows = "sites" ``:
331347this case, computes a statistic with the pattern of genotypes at each site,
332348and normalising would divide these statistics by the distance to the previous variant site
333349(probably not what you want to do).
334350
335- To explain normalization a bit more:
336- a good way to think about these statistics in general
337- is that they all have a way of summarizing something **locally **,
338- i.e., at each point along the genome,
339- and this summary is then **averaged ** across each window.
340- For instance, pairwise sequence divergence between two samples
341- is the density of sites that differ between them;
342- this is computed for each window by counting up the number of sites
343- at which the two differ, and dividing by the total length of the window.
344- Branch statistics do just the same thing,
345- except that we average over **all ** locations on the sequence,
346- not just the locations of mutations.
347- So, usually "divergence" gives us the average number of differing sites
348- per unit of genome length; but if we set ``span_normalise=False ``
349- then we'd just obtain the number of differing sites per window.
350-
351351And, a final note about "length": in tree sequences produced by ``msprime ``
352352coordinates along the sequence are **continuous **,
353353so the "lengths" used here may not correspond to distance along the genome in (say) base pairs.
@@ -506,16 +506,16 @@ and boolean expressions (e.g., :math:`(x > 0)`) are interpreted as 0/1.
506506 :math: `f(x_1 , x_2 ) = \frac {x_1 (n_2 - x_2 ) (n_2 - x_2 - 1 )}{n_1 n_2 (n_2 - 1 )}`
507507
508508``f2 ``
509- :math: `f(x_1 , x_2 ) = \frac {x_1 (x_1 - 1 ) (n_2 - x_2 ) (n_2 - x_2 - 1 )}{n_1 (n_1 - 1 ) n_2 (n_2 - 1 )}`
509+ :math: `f(x_1 , x_2 ) = \frac {x_1 (x_1 - 1 ) (n_2 - x_2 ) (n_2 - x_2 - 1 )}{n_1 (n_1 - 1 ) n_2 (n_2 - 1 )} - \frac {x_ 1 (n_ 1 - x_ 1 ) (n_ 2 - x_ 2 ) x_ 2 }{n_ 1 (n_ 1 - 1 ) n_ 2 (n_ 2 - 1 )} `
510510
511511``Y3 ``
512512 :math: `f(x_1 , x_2 , x_3 ) = \frac {x_1 (n_2 - x_2 ) (n_3 - x_3 )}{n_1 n_2 n_3 }`
513513
514514``f3 ``
515- :math: `f(x_1 , x_2 , x_3 ) = \frac {x_1 (x_1 - 1 ) (n_2 - x_2 ) (n_3 - x_3 )}{n_1 (n_1 - 1 ) n_2 n_3 }`
515+ :math: `f(x_1 , x_2 , x_3 ) = \frac {x_1 (x_1 - 1 ) (n_2 - x_2 ) (n_3 - x_3 )}{n_1 (n_1 - 1 ) n_2 n_3 } - \frac {x_ 1 (n_ 1 - x_ 1 ) (n_ 2 - x_ 2 ) x_ 3 }{n_ 1 (n_ 1 - 1 ) n_ 2 n_ 3 } `
516516
517517``f4 ``
518- :math: `f(x_1 , x_2 , x_3 , x_4 ) = \frac {x_1 x_3 (n_2 - x_2 ) (n_4 - x_4 )}{n_1 n_2 n_3 n_4 }`
518+ :math: `f(x_1 , x_2 , x_3 , x_4 ) = \frac {x_1 x_3 (n_2 - x_2 ) (n_4 - x_4 )}{n_1 n_2 n_3 n_4 } - \frac {x_ 1 x_ 4 (n_ 2 - x_ 2 ) (n_ 3 - x_ 3 )}{n_ 1 n_ 2 n_ 3 n_ 4 } `
519519
520520``trait_covariance ``
521521 :math: `f(w) = \frac {w^2 }{2 (n-1 )^2 }`,
0 commit comments