Skip to content

Commit cd30239

Browse files
authored
Merge pull request #221 from tskit-dev/node_stats
Implement node statistics
2 parents afa6e4c + 8ad1ed0 commit cd30239

File tree

4 files changed

+444
-189
lines changed

4 files changed

+444
-189
lines changed

c/tskit/trees.c

Lines changed: 7 additions & 3 deletions
Original file line numberDiff line numberDiff line change
@@ -1559,13 +1559,17 @@ tsk_treeseq_node_general_stat(tsk_treeseq_t *self,
15591559

15601560
/* Set the initial conditions */
15611561
for (j = 0; j < self->num_samples; j++) {
1562-
u = self->samples[j];
1563-
state_u = GET_2D_ROW(state, state_dim, u);
1562+
// total_weight is used in the next loop
15641563
weight_u = GET_2D_ROW(sample_weights, state_dim, j);
1565-
memcpy(state_u, weight_u, state_dim * sizeof(*state_u));
15661564
for (k = 0; k < state_dim; k++) {
15671565
total_weight[k] += weight_u[k];
15681566
}
1567+
u = self->samples[j];
1568+
state_u = GET_2D_ROW(state, state_dim, u);
1569+
weight_u = GET_2D_ROW(sample_weights, state_dim, j);
1570+
memcpy(state_u, weight_u, state_dim * sizeof(*state_u));
1571+
}
1572+
for (u = 0; u < (tsk_id_t) num_nodes; u++) {
15691573
ret = update_node_summary(u, result_dim, node_summary, state, state_dim,
15701574
f, f_params, total_weight, polarised);
15711575
if (ret != 0) {

docs/stats.rst

Lines changed: 16 additions & 2 deletions
Original file line numberDiff line numberDiff line change
@@ -129,7 +129,7 @@ then we'd just obtain the number of differing sites per window.
129129

130130
And, a final note about "length": in tree sequences produced by ``msprime``
131131
coordinates along the sequence are **continuous**,
132-
so the the "lengths" used here may not correspond to distance along the genome in (say) base pairs.
132+
so the "lengths" used here may not correspond to distance along the genome in (say) base pairs.
133133
For instance, pairwise sequence divergence is usually a number between 0 and 1
134134
because it is the proportion of bases that differ;
135135
this will only be true if length is measured in base pairs
@@ -155,7 +155,7 @@ For instance, nucleotide divergence is defined for a *pair* of groups of samples
155155
so if you wanted to compute pairwise divergences between some groups of samples,
156156
you'd specify these as your ``sample_sets``.
157157
Then, if ``p[i]`` is the derived allele frequency in sample set ``i``,
158-
under the hood we (essentially) compute the divergency between sample sets ``i`` and ``j``
158+
under the hood we (essentially) compute the divergence between sample sets ``i`` and ``j``
159159
by averaging ``p[i] * (1 - p[j]) + (1 - p[i]) * p[j]`` across the genome.
160160

161161
So, what if you
@@ -220,6 +220,20 @@ from ``windows[i]`` to ``windows[i + 1]`` (including the left but not the right
220220

221221
The final dimension of the arrays in other cases is specified by the method.
222222

223+
A note about **default values** and **division by zero**:
224+
Under the hood, statistics computation fills in zeros everywhere, then updates these
225+
(since statistics are all **additive**, this makes sense).
226+
But now suppose that you've got a statistic that returns ``nan``
227+
("not a number") sometimes, like if you're taking the diversity of a sample set with only ``n=1`` sample,
228+
which involves dividing by ``n * (n - 1)``.
229+
Usually, you'll just get ``nan`` everywhere that the division by zero happens.
230+
But there's a couple of caveats.
231+
For ``site`` statistics, any windows without any sites in them never get touched,
232+
so they will have a value of 0.
233+
For ``branch`` statistics, any windows with **no branches** will similarly remain 0.
234+
That said, you should **not** rely on the specific behavior of whether ``0`` or ``nan`` is returned
235+
for "empty" cases like these: it is subject to change.
236+
223237

224238
.. Commenting these out for now as they are duplicates of the methods in the TreeSequence
225239
and sphinx is unhappy.

0 commit comments

Comments
 (0)