@@ -14,9 +14,8 @@ and of the underlying trees (whose definition does not depend on the genome sequ
1414Furthermore, these statistics have a common interface to easily compute
1515(a) averaged statistics in windows along the genome,
1616and (b) summary statistics of many combinations of sets of samples simultaneously.
17- All methods return numpy arrays,
18- whose rows correspond to the windows along the genome,
19- and whose remaining dimensions are determined by the statistic.
17+ All methods return numpy arrays whose dimensions are
18+ determined by the parameters (see :ref: `sec_general_stats_output_dimensions `).
2019
2120.. warning :: :ref:`sec_data_model_missing_data` is not currently
2221 handled correctly by site statistics defined here, as we always
@@ -70,12 +69,6 @@ e.g., the sites of the SNPs.
7069Windowing
7170*********
7271
73- By default, statistics
74-
75- ``windows = None ``
76- This is the default, and equivalent to passing ``windows = [0.0, ts.sequence_length] ``.
77- The output will still be a two-dimensional array, but with only one row.
78-
7972Each statistic has an argument, ``windows ``,
8073which defines a collection of contiguous windows along the genome.
8174If ``windows `` is a list of ``n+1 `` increasing numbers between 0 and the ``sequence_length ``,
@@ -95,6 +88,14 @@ obtaining ``((b - a) * S[0] + (c - b) * S[1]) / (c - a)``.
9588
9689There are some shortcuts to other useful options:
9790
91+ ``windows = None ``
92+ This is the default and computes statistics in single window over the whole
93+ sequence. As the first returned array contains only a single
94+ value, we drop this dimension as described in the :ref: `output dimensions
95+ <sec_general_stats_output_dimensions>` section. **Note: ** if you really do
96+ want to have an array with a single value as the result, please use
97+ ``windows = [0.0, ts.sequence_length] ``.
98+
9899``windows = "trees" ``
99100 This says that you want statistics computed separately on the portion of the genome
100101 spanned by each tree, so is equivalent to passing ``windows = ts.breakpoints() ``.
@@ -191,7 +192,9 @@ Here are some additional special cases:
191192 If the statistic takes ``k `` inputs for ``k > 1 ``,
192193 and there are exactly ``k `` lists in ``sample_sets ``,
193194 then this will compute just one statistic, and is equivalent to passing
194- ``indexes = [(0, 1, ..., k-1)] ``.
195+ ``indexes = (0, 1, ..., k-1) ``. Note that this also drops the last
196+ dimension of the output, as described in the :ref: `sec_general_stats_output_dimensions `
197+ section.
195198 If there are not exactly ``k `` sample sets, this will throw an error.
196199
197200``k=1 `` does not allow ``indexes ``:
@@ -203,16 +206,16 @@ Here are some additional special cases:
203206 were that allowed.)
204207
205208
206- .. _ sec_general_stats_output :
209+ .. _ sec_general_stats_output_format :
207210
208- ******
209- Output
210- ******
211+ *************
212+ Output format
213+ *************
211214
212215Each of the statistics methods returns a ``numpy `` ndarray.
213216Suppose that the output is name ``out ``.
214- In all cases , the number of rows of the output is equal to the number of windows,
215- so that ``out.shape[0] `` is equal to ``len(windows) - 1 ``
217+ If `` windows `` has been specified , the number of rows of the output is equal to the
218+ number of windows, so that ``out.shape[0] `` is equal to ``len(windows) - 1 ``
216219and ``out[i] `` is an array of statistics describing the portion of the tree sequence
217220from ``windows[i] `` to ``windows[i + 1] `` (including the left but not the right endpoint).
218221
@@ -229,6 +232,9 @@ from ``windows[i]`` to ``windows[i + 1]`` (including the left but not the right
229232
230233The final dimension of the arrays in other cases is specified by the method.
231234
235+ Note, however, that empty dimensions can optionally be dropped,
236+ as described in the :ref: `sec_general_stats_output_dimensions ` section.
237+
232238A note about **default values ** and **division by zero **:
233239Under the hood, statistics computation fills in zeros everywhere, then updates these
234240(since statistics are all **additive **, this makes sense).
@@ -243,6 +249,46 @@ For ``branch`` statistics, any windows with **no branches** will similarly remai
243249That said, you should **not ** rely on the specific behavior of whether ``0 `` or ``nan `` is returned
244250for "empty" cases like these: it is subject to change.
245251
252+ .. _sec_general_stats_output_dimensions :
253+
254+ *****************
255+ Output dimensions
256+ *****************
257+
258+ In the general case, tskit outputs two dimensional (or three dimensional, in the case of node
259+ stats) numpy arrays, as described in the :ref: `sec_general_stats_output_format ` section.
260+ The first dimension corresponds to the window along the genome
261+ such that for some result array ``x ``, ``x[j] `` contains information about the jth window.
262+ The last dimension corresponds to the statistics being computed, so that ``x[j, k] `` is the
263+ value of the kth statistic in the jth window (in the two dimensional case). This is
264+ a powerful and general interface, but in many cases we will not use this full generality
265+ and the extra dimensions in the numpy arrays are inconvenient.
266+
267+ Tskit optionally removes empty dimensions from the output arrays following a few
268+ simple rules.
269+
270+ 1. If ``windows `` is None we are computing over the single window covering the
271+ full sequence. We therefore drop the first dimension of the array.
272+
273+ 2. In one-way stats, if the ``sample_sets `` argument is a 1D array we interpret
274+ this as specifying a single sample set (and therefore a single statistic), and
275+ drop the last dimension of the output array. If ``sample_sets `` is None
276+ (the default), we use the sample set ``ts.samples() ``, invoking
277+ this rule (we therefore drop the last dimension by default).
278+
279+ 3. In k-way stats, if the ``indexes `` argument is a 1D array of length k
280+ we intepret this as specifying a single statistic and drop the last
281+ dimension of the array. If ``indexes `` is None (the default) and
282+ there are k sample sets, we compute the statistic from these sample sets
283+ and drop the last dimension.
284+
285+ Rules 2 and 3 can be summarised by "the dimensions of the input determines
286+ the dimensions of the output". Note that dropping these dimensions is
287+ **optional **: it is always possible to keep the full dimensions of the
288+ output arrays.
289+
290+ Please see the :ref: `tutorial <sec_tutorial_stats >` for examples of the
291+ various output dimension options.
246292
247293********************
248294Available statistics
0 commit comments