Skip to content

Commit 74b1742

Browse files
Documentation for dimension stripping.
1 parent 8e695cc commit 74b1742

File tree

3 files changed

+317
-18
lines changed

3 files changed

+317
-18
lines changed

docs/examples.py

Lines changed: 52 additions & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -175,7 +175,58 @@ def missing_data():
175175
tree = ts.first()
176176
tree.draw("_static/missing_data1.svg")
177177

178+
179+
def stats():
180+
ts = msprime.simulate(
181+
10**4, Ne=10**4, recombination_rate=1e-8, mutation_rate=1e-8, length=10**7,
182+
random_seed=42)
183+
print("num_trees = ", ts.num_trees, ", num_sites = ", ts.num_sites, sep="")
184+
185+
x = ts.diversity()
186+
print("Average diversity per unit sequence length = {:.3G}".format(x))
187+
188+
windows = np.linspace(0, ts.sequence_length, num=5)
189+
x = ts.diversity(windows=windows)
190+
print(windows)
191+
print(x)
192+
193+
A = ts.samples()[:100]
194+
x = ts.diversity(sample_sets=A)
195+
print(x)
196+
197+
B = ts.samples()[100:200]
198+
C = ts.samples()[200:300]
199+
x = ts.diversity(sample_sets=[A, B, C])
200+
print(x)
201+
202+
x = ts.diversity(sample_sets=[A, B, C], windows=windows)
203+
print("shape = ", x.shape)
204+
print(x)
205+
206+
A = ts.samples()[:100]
207+
B = ts.samples()[:100]
208+
x = ts.divergence([A, B])
209+
print(x)
210+
211+
x = ts.divergence([A, B], windows=windows)
212+
print(x)
213+
214+
x = ts.divergence([A, B, C], indexes=[(0, 1), (0, 2)])
215+
print(x)
216+
217+
x = ts.divergence([A, B, C], indexes=(0, 1))
218+
print(x)
219+
220+
x = ts.divergence([A, B, C], indexes=[(0, 1)])
221+
print(x)
222+
223+
x = ts.divergence([A, B, C], indexes=[(0, 1), (0, 2)], windows=windows)
224+
print(x)
225+
178226
# moving_along_tree_sequence()
179227
# parsimony()
180228
# allele_frequency_spectra()
181-
missing_data()
229+
# missing_data()
230+
231+
stats()
232+

docs/stats.rst

Lines changed: 62 additions & 16 deletions
Original file line numberDiff line numberDiff line change
@@ -14,9 +14,8 @@ and of the underlying trees (whose definition does not depend on the genome sequ
1414
Furthermore, these statistics have a common interface to easily compute
1515
(a) averaged statistics in windows along the genome,
1616
and (b) summary statistics of many combinations of sets of samples simultaneously.
17-
All methods return numpy arrays,
18-
whose rows correspond to the windows along the genome,
19-
and whose remaining dimensions are determined by the statistic.
17+
All methods return numpy arrays whose dimensions are
18+
determined by the parameters (see :ref:`sec_general_stats_output_dimensions`).
2019

2120
.. warning:: :ref:`sec_data_model_missing_data` is not currently
2221
handled correctly by site statistics defined here, as we always
@@ -70,12 +69,6 @@ e.g., the sites of the SNPs.
7069
Windowing
7170
*********
7271

73-
By default, statistics
74-
75-
``windows = None``
76-
This is the default, and equivalent to passing ``windows = [0.0, ts.sequence_length]``.
77-
The output will still be a two-dimensional array, but with only one row.
78-
7972
Each statistic has an argument, ``windows``,
8073
which defines a collection of contiguous windows along the genome.
8174
If ``windows`` is a list of ``n+1`` increasing numbers between 0 and the ``sequence_length``,
@@ -95,6 +88,14 @@ obtaining ``((b - a) * S[0] + (c - b) * S[1]) / (c - a)``.
9588

9689
There are some shortcuts to other useful options:
9790

91+
``windows = None``
92+
This is the default and computes statistics in single window over the whole
93+
sequence. As the first returned array contains only a single
94+
value, we drop this dimension as described in the :ref:`output dimensions
95+
<sec_general_stats_output_dimensions>` section. **Note:** if you really do
96+
want to have an array with a single value as the result, please use
97+
``windows = [0.0, ts.sequence_length]``.
98+
9899
``windows = "trees"``
99100
This says that you want statistics computed separately on the portion of the genome
100101
spanned by each tree, so is equivalent to passing ``windows = ts.breakpoints()``.
@@ -191,7 +192,9 @@ Here are some additional special cases:
191192
If the statistic takes ``k`` inputs for ``k > 1``,
192193
and there are exactly ``k`` lists in ``sample_sets``,
193194
then this will compute just one statistic, and is equivalent to passing
194-
``indexes = [(0, 1, ..., k-1)]``.
195+
``indexes = (0, 1, ..., k-1)``. Note that this also drops the last
196+
dimension of the output, as described in the :ref:`sec_general_stats_output_dimensions`
197+
section.
195198
If there are not exactly ``k`` sample sets, this will throw an error.
196199

197200
``k=1`` does not allow ``indexes``:
@@ -203,16 +206,16 @@ Here are some additional special cases:
203206
were that allowed.)
204207

205208

206-
.. _sec_general_stats_output:
209+
.. _sec_general_stats_output_format:
207210

208-
******
209-
Output
210-
******
211+
*************
212+
Output format
213+
*************
211214

212215
Each of the statistics methods returns a ``numpy`` ndarray.
213216
Suppose that the output is name ``out``.
214-
In all cases, the number of rows of the output is equal to the number of windows,
215-
so that ``out.shape[0]`` is equal to ``len(windows) - 1``
217+
If ``windows`` has been specified, the number of rows of the output is equal to the
218+
number of windows, so that ``out.shape[0]`` is equal to ``len(windows) - 1``
216219
and ``out[i]`` is an array of statistics describing the portion of the tree sequence
217220
from ``windows[i]`` to ``windows[i + 1]`` (including the left but not the right endpoint).
218221

@@ -229,6 +232,9 @@ from ``windows[i]`` to ``windows[i + 1]`` (including the left but not the right
229232

230233
The final dimension of the arrays in other cases is specified by the method.
231234

235+
Note, however, that empty dimensions can optionally be dropped,
236+
as described in the :ref:`sec_general_stats_output_dimensions` section.
237+
232238
A note about **default values** and **division by zero**:
233239
Under the hood, statistics computation fills in zeros everywhere, then updates these
234240
(since statistics are all **additive**, this makes sense).
@@ -243,6 +249,46 @@ For ``branch`` statistics, any windows with **no branches** will similarly remai
243249
That said, you should **not** rely on the specific behavior of whether ``0`` or ``nan`` is returned
244250
for "empty" cases like these: it is subject to change.
245251

252+
.. _sec_general_stats_output_dimensions:
253+
254+
*****************
255+
Output dimensions
256+
*****************
257+
258+
In the general case, tskit outputs two dimensional (or three dimensional, in the case of node
259+
stats) numpy arrays, as described in the :ref:`sec_general_stats_output_format` section.
260+
The first dimension corresponds to the window along the genome
261+
such that for some result array ``x``, ``x[j]`` contains information about the jth window.
262+
The last dimension corresponds to the statistics being computed, so that ``x[j, k]`` is the
263+
value of the kth statistic in the jth window (in the two dimensional case). This is
264+
a powerful and general interface, but in many cases we will not use this full generality
265+
and the extra dimensions in the numpy arrays are inconvenient.
266+
267+
Tskit optionally removes empty dimensions from the output arrays following a few
268+
simple rules.
269+
270+
1. If ``windows`` is None we are computing over the single window covering the
271+
full sequence. We therefore drop the first dimension of the array.
272+
273+
2. In one-way stats, if the ``sample_sets`` argument is a 1D array we interpret
274+
this as specifying a single sample set (and therefore a single statistic), and
275+
drop the last dimension of the output array. If ``sample_sets`` is None
276+
(the default), we use the sample set ``ts.samples()``, invoking
277+
this rule (we therefore drop the last dimension by default).
278+
279+
3. In k-way stats, if the ``indexes`` argument is a 1D array of length k
280+
we intepret this as specifying a single statistic and drop the last
281+
dimension of the array. If ``indexes`` is None (the default) and
282+
there are k sample sets, we compute the statistic from these sample sets
283+
and drop the last dimension.
284+
285+
Rules 2 and 3 can be summarised by "the dimensions of the input determines
286+
the dimensions of the output". Note that dropping these dimensions is
287+
**optional**: it is always possible to keep the full dimensions of the
288+
output arrays.
289+
290+
Please see the :ref:`tutorial <sec_tutorial_stats>` for examples of the
291+
various output dimension options.
246292

247293
********************
248294
Available statistics

0 commit comments

Comments
 (0)