@@ -129,7 +129,7 @@ then we'd just obtain the number of differing sites per window.
129129
130130And, a final note about "length": in tree sequences produced by ``msprime ``
131131coordinates along the sequence are **continuous **,
132- so the the "lengths" used here may not correspond to distance along the genome in (say) base pairs.
132+ so the "lengths" used here may not correspond to distance along the genome in (say) base pairs.
133133For instance, pairwise sequence divergence is usually a number between 0 and 1
134134because it is the proportion of bases that differ;
135135this will only be true if length is measured in base pairs
@@ -155,7 +155,7 @@ For instance, nucleotide divergence is defined for a *pair* of groups of samples
155155so if you wanted to compute pairwise divergences between some groups of samples,
156156you'd specify these as your ``sample_sets ``.
157157Then, if ``p[i] `` is the derived allele frequency in sample set ``i ``,
158- under the hood we (essentially) compute the divergency between sample sets ``i `` and ``j ``
158+ under the hood we (essentially) compute the divergence between sample sets ``i `` and ``j ``
159159by averaging ``p[i] * (1 - p[j]) + (1 - p[i]) * p[j] `` across the genome.
160160
161161So, what if you
@@ -220,6 +220,20 @@ from ``windows[i]`` to ``windows[i + 1]`` (including the left but not the right
220220
221221The final dimension of the arrays in other cases is specified by the method.
222222
223+ A note about **default values ** and **division by zero **:
224+ Under the hood, statistics computation fills in zeros everywhere, then updates these
225+ (since statistics are all **additive **, this makes sense).
226+ But now suppose that you've got a statistic that returns ``nan ``
227+ ("not a number") sometimes, like if you're taking the diversity of a sample set with only ``n=1 `` sample,
228+ which involves dividing by ``n * (n - 1) ``.
229+ Usually, you'll just get ``nan `` everywhere that the division by zero happens.
230+ But there's a couple of caveats.
231+ For ``site `` statistics, any windows without any sites in them never get touched,
232+ so they will have a value of 0.
233+ For ``branch `` statistics, any windows with **no branches ** will similarly remain 0.
234+ That said, you should **not ** rely on the specific behavior of whether ``0 `` or ``nan `` is returned
235+ for "empty" cases like these: it is subject to change.
236+
223237
224238.. Commenting these out for now as they are duplicates of the methods in the TreeSequence
225239 and sphinx is unhappy.
0 commit comments