Skip to content

Commit 80b2fea

Browse files
Merge pull request #359 from petrelharp/general_stat_error
better error value on general stat
2 parents 2f58af6 + cb72b5c commit 80b2fea

File tree

3 files changed

+39
-32
lines changed

3 files changed

+39
-32
lines changed

docs/stats.rst

Lines changed: 30 additions & 30 deletions
Original file line numberDiff line numberDiff line change
@@ -125,7 +125,8 @@ shared by many statistics, which we describe in detail in the following subsecti
125125
What section(s) of the genome are we interested in?
126126

127127
:ref:`sec_stats_span_normalise`
128-
Should we normalise information in windows by their span along the genome?
128+
Should the statistic calculated for each window be normalised by the length
129+
(i.e., the span) of that window?
129130

130131
The statistics functions are highly efficient and are based where possible
131132
on numpy arrays. Each of these statistics will return the results as a numpy
@@ -283,12 +284,12 @@ in the ``k``-th window, i.e., from (and including) ``windows[k]`` to (but not in
283284
Most windowed statistics by default return **averages** within each of the windows,
284285
so the values are comparable between windows, even of different lengths.
285286
(However, shorter windows may be noisier.)
286-
Suppose for instance that you compute some statistic with ``windows = [a, b, c]``
287-
for some valid positions ``a < b < c``,
287+
Suppose for instance that you compute some statistic with ``windows = [0, a, b]``
288+
for some valid positions ``0 < a < b``,
288289
and get an output array ``S`` with two rows.
289-
Then, computing the same statistic with ``windows = [a, c]``
290+
Then, computing the same statistic with ``windows = [0, b]``
290291
would be equivalent to averaging the rows of ``S``,
291-
obtaining ``((b - a) * S[0] + (c - b) * S[1]) / (c - a)``.
292+
obtaining ``((a - 0) * S[0] + (b - a) * S[1]) / (b - 0)``.
292293

293294
There are some shortcuts to other useful options:
294295

@@ -319,35 +320,34 @@ There are some shortcuts to other useful options:
319320
Span normalise
320321
++++++++++++++
321322

322-
In addition to windowing there is an option, ``span_normalise`` (default ``True``),
323-
that if ``False`` returns the **sum** of the relevant statistic across each window rather than the average.
324-
The statistic that is returned by default is an average because we divide by
325-
rather than normalizing (i.e., dividing) by the length of the window.
326-
As above, if the statistic ``S`` was computed with ``span_normalise=False``,
327-
then the value obtained with ``windows = [a, c]`` would be equal to ``S[0] + S[1]``.
323+
In addition to windowing there is an option, ``span_normalise`` (which defaults to ``True``),
324+
All the primary statistics defined here are *sums* across locations in the genome:
325+
something is computed for each position, and these values are added up across all positions in each window.
326+
Whether the total length of the window is then taken into account is determined by the option ``span_normalise``:
327+
if it is ``True`` (the default), the sum for each window is converted into an *average*,
328+
by dividing by the window's length (i.e., its *span*).
329+
Otherwise, the sum itself is returned.
330+
The default is ``span_normalise=True``,
331+
because this makes the values comparable across windows of different sizes.
332+
To make this more concrete: :meth:`pairwise sequence divergence <.TreeSequence.divergence>`
333+
between two samples with ``mode="site"`` is the density of sites that differ between the samples;
334+
this is computed for each window by counting up the number of sites
335+
at which the two differ, and dividing by the total length of the window.
336+
If we wanted the number of sites at which the two differed in each window,
337+
we'd calculate divergence with ``span_normalise=False``.
338+
339+
Following on from above, suppose we computed the statistic ``S`` with
340+
``windows = [0, a, b]`` and ``span_normalise=True``,
341+
and then computed ``T`` in just the same way except with ``span_normalize=False``.
342+
Then ``S[0]`` would be equal to ``T[0] / a`` and ``S[1] = T[1] / (b - a)``.
343+
Furthermore, the value obtained with ``windows = [0, b]`` would be equal to ``T[0] + T[1]``.
328344
However, you probably usually want the (default) normalized version:
329345
don't get unnormalised values unless you're sure that's what you want.
330346
The exception is when computing a site statistic with ``windows = "sites"``:
331347
this case, computes a statistic with the pattern of genotypes at each site,
332348
and normalising would divide these statistics by the distance to the previous variant site
333349
(probably not what you want to do).
334350

335-
To explain normalization a bit more:
336-
a good way to think about these statistics in general
337-
is that they all have a way of summarizing something **locally**,
338-
i.e., at each point along the genome,
339-
and this summary is then **averaged** across each window.
340-
For instance, pairwise sequence divergence between two samples
341-
is the density of sites that differ between them;
342-
this is computed for each window by counting up the number of sites
343-
at which the two differ, and dividing by the total length of the window.
344-
Branch statistics do just the same thing,
345-
except that we average over **all** locations on the sequence,
346-
not just the locations of mutations.
347-
So, usually "divergence" gives us the average number of differing sites
348-
per unit of genome length; but if we set ``span_normalise=False``
349-
then we'd just obtain the number of differing sites per window.
350-
351351
And, a final note about "length": in tree sequences produced by ``msprime``
352352
coordinates along the sequence are **continuous**,
353353
so the "lengths" used here may not correspond to distance along the genome in (say) base pairs.
@@ -506,16 +506,16 @@ and boolean expressions (e.g., :math:`(x > 0)`) are interpreted as 0/1.
506506
:math:`f(x_1, x_2) = \frac{x_1 (n_2 - x_2) (n_2 - x_2 - 1)}{n_1 n_2 (n_2 - 1)}`
507507

508508
``f2``
509-
:math:`f(x_1, x_2) = \frac{x_1 (x_1 - 1) (n_2 - x_2) (n_2 - x_2 - 1)}{n_1 (n_1 - 1) n_2 (n_2 - 1)}`
509+
:math:`f(x_1, x_2) = \frac{x_1 (x_1 - 1) (n_2 - x_2) (n_2 - x_2 - 1)}{n_1 (n_1 - 1) n_2 (n_2 - 1)} - \frac{x_1 (n_1 - x_1) (n_2 - x_2) x_2}{n_1 (n_1 - 1) n_2 (n_2 - 1)}`
510510

511511
``Y3``
512512
:math:`f(x_1, x_2, x_3) = \frac{x_1 (n_2 - x_2) (n_3 - x_3)}{n_1 n_2 n_3}`
513513

514514
``f3``
515-
:math:`f(x_1, x_2, x_3) = \frac{x_1 (x_1 - 1) (n_2 - x_2) (n_3 - x_3)}{n_1 (n_1 - 1) n_2 n_3}`
515+
:math:`f(x_1, x_2, x_3) = \frac{x_1 (x_1 - 1) (n_2 - x_2) (n_3 - x_3)}{n_1 (n_1 - 1) n_2 n_3} - \frac{x_1 (n_1 - x_1) (n_2 - x_2) x_3}{n_1 (n_1 - 1) n_2 n_3}`
516516

517517
``f4``
518-
:math:`f(x_1, x_2, x_3, x_4) = \frac{x_1 x_3 (n_2 - x_2) (n_4 - x_4)}{n_1 n_2 n_3 n_4}`
518+
:math:`f(x_1, x_2, x_3, x_4) = \frac{x_1 x_3 (n_2 - x_2) (n_4 - x_4)}{n_1 n_2 n_3 n_4} - \frac{x_1 x_4 (n_2 - x_2) (n_3 - x_3)}{n_1 n_2 n_3 n_4}`
519519

520520
``trait_covariance``
521521
:math:`f(w) = \frac{w^2}{2 (n-1)^2}`,

python/tests/test_tree_stats.py

Lines changed: 7 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -3368,6 +3368,13 @@ def test_bad_summary_function(self):
33683368
with self.assertRaises(ValueError):
33693369
ts.general_stat(W, lambda x: np.array([1.0]), 1, windows="sites")
33703370

3371+
def test_nonnumpy_summary_function(self):
3372+
ts = self.get_tree_sequence()
3373+
W = np.ones((ts.num_samples, 3))
3374+
sigma1 = ts.general_stat(W, lambda x: [0.0], 1)
3375+
sigma2 = ts.general_stat(W, lambda x: np.array([0.0]), 1)
3376+
self.assertArrayEqual(sigma1, sigma2)
3377+
33713378

33723379
class TestGeneralBranchStats(StatsTestCase):
33733380
"""

python/tskit/trees.py

Lines changed: 2 additions & 2 deletions
Original file line numberDiff line numberDiff line change
@@ -3554,10 +3554,10 @@ def general_stat(self, W, f, output_dim, windows=None, polarised=False, mode=Non
35543554
total_weights = np.sum(W, axis=0)
35553555
for x in [total_weights, total_weights * 0.0]:
35563556
with np.errstate(invalid='ignore', divide='ignore'):
3557-
fx = f(x)
3557+
fx = np.array(f(x))
35583558
fx[np.isnan(fx)] = 0.0
35593559
if not np.allclose(fx, np.zeros((output_dim, ))):
3560-
raise ValueError("Summary function does not return zero for both"
3560+
raise ValueError("Summary function does not return zero for both "
35613561
"zero weight and total weight.")
35623562
return self.__run_windowed_stat(
35633563
windows, self.ll_tree_sequence.general_stat,

0 commit comments

Comments
 (0)