Skip to content

Commit 47ec75f

Browse files
Merge pull request #313 from jeromekelleher/default-windows-strip-dims
Strip empty first dimension for default windows.
2 parents 28a56cb + 74b1742 commit 47ec75f

File tree

5 files changed

+746
-143
lines changed

5 files changed

+746
-143
lines changed

docs/examples.py

Lines changed: 52 additions & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -175,7 +175,58 @@ def missing_data():
175175
tree = ts.first()
176176
tree.draw("_static/missing_data1.svg")
177177

178+
179+
def stats():
180+
ts = msprime.simulate(
181+
10**4, Ne=10**4, recombination_rate=1e-8, mutation_rate=1e-8, length=10**7,
182+
random_seed=42)
183+
print("num_trees = ", ts.num_trees, ", num_sites = ", ts.num_sites, sep="")
184+
185+
x = ts.diversity()
186+
print("Average diversity per unit sequence length = {:.3G}".format(x))
187+
188+
windows = np.linspace(0, ts.sequence_length, num=5)
189+
x = ts.diversity(windows=windows)
190+
print(windows)
191+
print(x)
192+
193+
A = ts.samples()[:100]
194+
x = ts.diversity(sample_sets=A)
195+
print(x)
196+
197+
B = ts.samples()[100:200]
198+
C = ts.samples()[200:300]
199+
x = ts.diversity(sample_sets=[A, B, C])
200+
print(x)
201+
202+
x = ts.diversity(sample_sets=[A, B, C], windows=windows)
203+
print("shape = ", x.shape)
204+
print(x)
205+
206+
A = ts.samples()[:100]
207+
B = ts.samples()[:100]
208+
x = ts.divergence([A, B])
209+
print(x)
210+
211+
x = ts.divergence([A, B], windows=windows)
212+
print(x)
213+
214+
x = ts.divergence([A, B, C], indexes=[(0, 1), (0, 2)])
215+
print(x)
216+
217+
x = ts.divergence([A, B, C], indexes=(0, 1))
218+
print(x)
219+
220+
x = ts.divergence([A, B, C], indexes=[(0, 1)])
221+
print(x)
222+
223+
x = ts.divergence([A, B, C], indexes=[(0, 1), (0, 2)], windows=windows)
224+
print(x)
225+
178226
# moving_along_tree_sequence()
179227
# parsimony()
180228
# allele_frequency_spectra()
181-
missing_data()
229+
# missing_data()
230+
231+
stats()
232+

docs/stats.rst

Lines changed: 60 additions & 12 deletions
Original file line numberDiff line numberDiff line change
@@ -14,9 +14,8 @@ and of the underlying trees (whose definition does not depend on the genome sequ
1414
Furthermore, these statistics have a common interface to easily compute
1515
(a) averaged statistics in windows along the genome,
1616
and (b) summary statistics of many combinations of sets of samples simultaneously.
17-
All methods return numpy arrays,
18-
whose rows correspond to the windows along the genome,
19-
and whose remaining dimensions are determined by the statistic.
17+
All methods return numpy arrays whose dimensions are
18+
determined by the parameters (see :ref:`sec_general_stats_output_dimensions`).
2019

2120
.. warning:: :ref:`sec_data_model_missing_data` is not currently
2221
handled correctly by site statistics defined here, as we always
@@ -90,8 +89,12 @@ obtaining ``((b - a) * S[0] + (c - b) * S[1]) / (c - a)``.
9089
There are some shortcuts to other useful options:
9190

9291
``windows = None``
93-
This is the default, and equivalent to passing ``windows = [0.0, ts.sequence_length]``.
94-
The output will still be a two-dimensional array, but with only one row.
92+
This is the default and computes statistics in single window over the whole
93+
sequence. As the first returned array contains only a single
94+
value, we drop this dimension as described in the :ref:`output dimensions
95+
<sec_general_stats_output_dimensions>` section. **Note:** if you really do
96+
want to have an array with a single value as the result, please use
97+
``windows = [0.0, ts.sequence_length]``.
9598

9699
``windows = "trees"``
97100
This says that you want statistics computed separately on the portion of the genome
@@ -189,7 +192,9 @@ Here are some additional special cases:
189192
If the statistic takes ``k`` inputs for ``k > 1``,
190193
and there are exactly ``k`` lists in ``sample_sets``,
191194
then this will compute just one statistic, and is equivalent to passing
192-
``indexes = [(0, 1, ..., k-1)]``.
195+
``indexes = (0, 1, ..., k-1)``. Note that this also drops the last
196+
dimension of the output, as described in the :ref:`sec_general_stats_output_dimensions`
197+
section.
193198
If there are not exactly ``k`` sample sets, this will throw an error.
194199

195200
``k=1`` does not allow ``indexes``:
@@ -201,16 +206,16 @@ Here are some additional special cases:
201206
were that allowed.)
202207

203208

204-
.. _sec_general_stats_output:
209+
.. _sec_general_stats_output_format:
205210

206-
******
207-
Output
208-
******
211+
*************
212+
Output format
213+
*************
209214

210215
Each of the statistics methods returns a ``numpy`` ndarray.
211216
Suppose that the output is name ``out``.
212-
In all cases, the number of rows of the output is equal to the number of windows,
213-
so that ``out.shape[0]`` is equal to ``len(windows) - 1``
217+
If ``windows`` has been specified, the number of rows of the output is equal to the
218+
number of windows, so that ``out.shape[0]`` is equal to ``len(windows) - 1``
214219
and ``out[i]`` is an array of statistics describing the portion of the tree sequence
215220
from ``windows[i]`` to ``windows[i + 1]`` (including the left but not the right endpoint).
216221

@@ -227,6 +232,9 @@ from ``windows[i]`` to ``windows[i + 1]`` (including the left but not the right
227232

228233
The final dimension of the arrays in other cases is specified by the method.
229234

235+
Note, however, that empty dimensions can optionally be dropped,
236+
as described in the :ref:`sec_general_stats_output_dimensions` section.
237+
230238
A note about **default values** and **division by zero**:
231239
Under the hood, statistics computation fills in zeros everywhere, then updates these
232240
(since statistics are all **additive**, this makes sense).
@@ -241,6 +249,46 @@ For ``branch`` statistics, any windows with **no branches** will similarly remai
241249
That said, you should **not** rely on the specific behavior of whether ``0`` or ``nan`` is returned
242250
for "empty" cases like these: it is subject to change.
243251

252+
.. _sec_general_stats_output_dimensions:
253+
254+
*****************
255+
Output dimensions
256+
*****************
257+
258+
In the general case, tskit outputs two dimensional (or three dimensional, in the case of node
259+
stats) numpy arrays, as described in the :ref:`sec_general_stats_output_format` section.
260+
The first dimension corresponds to the window along the genome
261+
such that for some result array ``x``, ``x[j]`` contains information about the jth window.
262+
The last dimension corresponds to the statistics being computed, so that ``x[j, k]`` is the
263+
value of the kth statistic in the jth window (in the two dimensional case). This is
264+
a powerful and general interface, but in many cases we will not use this full generality
265+
and the extra dimensions in the numpy arrays are inconvenient.
266+
267+
Tskit optionally removes empty dimensions from the output arrays following a few
268+
simple rules.
269+
270+
1. If ``windows`` is None we are computing over the single window covering the
271+
full sequence. We therefore drop the first dimension of the array.
272+
273+
2. In one-way stats, if the ``sample_sets`` argument is a 1D array we interpret
274+
this as specifying a single sample set (and therefore a single statistic), and
275+
drop the last dimension of the output array. If ``sample_sets`` is None
276+
(the default), we use the sample set ``ts.samples()``, invoking
277+
this rule (we therefore drop the last dimension by default).
278+
279+
3. In k-way stats, if the ``indexes`` argument is a 1D array of length k
280+
we intepret this as specifying a single statistic and drop the last
281+
dimension of the array. If ``indexes`` is None (the default) and
282+
there are k sample sets, we compute the statistic from these sample sets
283+
and drop the last dimension.
284+
285+
Rules 2 and 3 can be summarised by "the dimensions of the input determines
286+
the dimensions of the output". Note that dropping these dimensions is
287+
**optional**: it is always possible to keep the full dimensions of the
288+
output arrays.
289+
290+
Please see the :ref:`tutorial <sec_tutorial_stats>` for examples of the
291+
various output dimension options.
244292

245293
********************
246294
Available statistics

docs/tutorial.rst

Lines changed: 203 additions & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -821,6 +821,208 @@ to remember that the algorithm is minimising the number of state transitions;
821821
this may not correspond always to what we might consider the most parsimonious
822822
explanation.
823823

824+
.. _sec_tutorial_stats:
825+
826+
********************
827+
Computing statistics
828+
********************
829+
830+
Tskit provides an extensive and flexible interface for computing population
831+
genetic statistics, which is documented in detail in the :ref:`general statistics
832+
<sec_general_stats>` section. This tutorial aims to give a quick overview of
833+
how the APIs work how to use them effectively.
834+
835+
First, lets simulate a tree sequence to work with which has roughly human
836+
parameters for 10 thousand samples and 10Mb chromosomes::
837+
838+
ts = msprime.simulate(
839+
10**4, Ne=10**4, recombination_rate=1e-8, mutation_rate=1e-8, length=10**7,
840+
random_seed=42)
841+
842+
We end up with 36K trees 39K segregating sites. We'd now like to compute some statistics on
843+
this dataset.
844+
845+
++++++++++++++++++
846+
One-way statistics
847+
++++++++++++++++++
848+
849+
We refer to statistics that are defined with respect to a single set of
850+
samples as "one-way". An example of such a statistic is diversity, which
851+
is computed using the :meth:`.TreeSequence.diversity` method::
852+
853+
x = ts.diversity()
854+
print("Average diversity per unit sequence length = {:.3G}".format(x))
855+
856+
[Output]
857+
858+
Average diversity per unit sequence length = 0.000401
859+
860+
This tells the average diversity across the whole sequence and returns a single
861+
number. We'll usually want to compute statistics in
862+
:ref:`windows <sec_general_stats_windowing>` along the genome and we
863+
use the ``windows`` argument to do this::
864+
865+
windows = np.linspace(0, ts.sequence_length, num=5)
866+
x = ts.diversity(windows=windows)
867+
print(windows)
868+
print(x)
869+
870+
[Output]
871+
872+
[ 0. 2500000. 5000000. 7500000. 10000000.]
873+
[0.00041602 0.00039112 0.00041554 0.00038329]
874+
875+
The ``windows`` argument takes a numpy array specifying the breakpoints
876+
along the genome. Here, we use numpy to create four equally spaced windows
877+
of size 2.5 megabases (the windows array contains k + 1 elements to define
878+
k windows). Because we have asked for values in windows, tskit now returns
879+
a numpy array rather than a single value. (See
880+
:ref:`sec_general_stats_output_dimensions` for a full description of how the output
881+
dimensions of statistics are determined by the ``windows`` argument.)
882+
883+
Suppose we wanted to compute diversity within a specific subset of samples.
884+
We can do this using the ``sample_sets`` argument::
885+
886+
A = ts.samples()[:100]
887+
x = ts.diversity(sample_sets=A)
888+
print(x)
889+
890+
[Output]
891+
892+
0.00040166573737371227
893+
894+
Here, we've computed the average diversity within the first hundred samples across
895+
the whole genome. As we've not specified any windows, this is again a single value.
896+
897+
We can also compute diversity in *multiple* sample sets at the same time by providing
898+
a list of sample sets as an argument::
899+
900+
A = ts.samples()[:100]
901+
B = ts.samples()[100:200]
902+
C = ts.samples()[200:300]
903+
x = ts.diversity(sample_sets=[A, B, C])
904+
print(x)
905+
906+
[Output]
907+
908+
[0.00040167 0.00040008 0.00040103]
909+
910+
Because we've computed multiple statistics concurrently, tskit returns a numpy array
911+
of these statistics. We have asked for diversity within three different sample sets,
912+
and tskit therefore returns an array with three values. (In general, the
913+
dimensions of the input determines the dimensions of the output: see
914+
:ref:`sec_general_stats_output_dimensions` for a detailed description of the rules.)
915+
916+
We can also compute multiple statistics in multiple windows::
917+
918+
x = ts.diversity(sample_sets=[A, B, C], windows=windows)
919+
print("shape = ", x.shape)
920+
print(x)
921+
922+
[Output]
923+
924+
shape = (4, 3)
925+
[[0.0004139 0.00041567 0.00041774]
926+
[0.00039148 0.00039152 0.00038997]
927+
[0.00042019 0.00041039 0.00041475]
928+
[0.0003811 0.00038274 0.00038166]]
929+
930+
We have computed diversity within three different sample sets across four
931+
genomic windows, and our output is therefore a 2D numpy array with four
932+
rows and three columns: each row contains the diversity values within
933+
A, B and C for a particular window.
934+
935+
++++++++++++++++++++
936+
Multi-way statistics
937+
++++++++++++++++++++
938+
939+
Many population genetic statistics compare multiple sets of samples to
940+
each other. For example, the :meth:`TreeSequence.divergence` method computes
941+
the divergence between two subsets of samples::
942+
943+
A = ts.samples()[:100]
944+
B = ts.samples()[:100]
945+
x = ts.divergence([A, B])
946+
print(x)
947+
948+
[Output]
949+
950+
0.00039764908000000676
951+
952+
The divergence between two sets of samples A and B is a single number,
953+
and we we again return a single floating point value as the result. We can also
954+
compute this in windows along the genome, as before::
955+
956+
957+
x = ts.divergence([A, B], windows=windows)
958+
print(x)
959+
960+
[Output]
961+
962+
[0.00040976 0.00038756 0.00041599 0.00037728]
963+
964+
965+
Again, as we have defined four genomic windows along the sequence, the result is
966+
numpy array with four values.
967+
968+
A powerful feature of tskit's stats API is that we can compute the divergences
969+
between multiple sets of samples simultaneously using the ``indexes`` argument::
970+
971+
972+
x = ts.divergence([A, B, C], indexes=[(0, 1), (0, 2)])
973+
print(x)
974+
975+
[Output]
976+
977+
[0.00039765 0.00040181]
978+
979+
Here, we've specified three sample sets A, B and C and we've computed the
980+
divergences between A and B, and between A and C. The ``indexes`` argument is used
981+
to specify which pairs of sets we are interested in. In this example
982+
we've computed two different divergence values and the output is therefore
983+
a numpy array of length 2.
984+
985+
As before, we can combine computing multiple statistics in multiple windows
986+
to return a 2D numpy array::
987+
988+
windows = np.linspace(0, ts.sequence_length, num=5)
989+
x = ts.divergence([A, B, C], indexes=[(0, 1), (0, 2)], windows=windows)
990+
print(x)
991+
992+
[Output]
993+
994+
[[0.00040976 0.0004161 ]
995+
[0.00038756 0.00039025]
996+
[0.00041599 0.00041847]
997+
[0.00037728 0.0003824 ]]
998+
999+
Each row again corresponds to a window, which contains the average divergence
1000+
values between the chosen sets.
1001+
1002+
If the ``indexes`` parameter is 1D array, we interpret this as specifying
1003+
a single statistic and remove the empty outer dimension::
1004+
1005+
x = ts.divergence([A, B, C], indexes=(0, 1))
1006+
print(x)
1007+
1008+
[Output]
1009+
1010+
0.00039764908000000676
1011+
1012+
It's important to note that we don't **have** to remove empty dimensions: tskit
1013+
will only do this if you explicitly ask it to. Here, for example, we can keep the
1014+
output as an array with one value if we wish::
1015+
1016+
x = ts.divergence([A, B, C], indexes=[(0, 1)])
1017+
print(x)
1018+
1019+
[Output]
1020+
1021+
[0.00039765]
1022+
1023+
Please see :ref:`sec_general_stats_sample_sets` for a
1024+
full description of the ``sample_sets`` and ``indexes`` arguments.
1025+
8241026
.. _sec_tutorial_afs:
8251027

8261028
************************
@@ -881,7 +1083,7 @@ giving::
8811083
This time, we've asked for the number of sites at each frequency in two
8821084
equal windows. Now we can see that in the first half of the sequence we
8831085
have three sites (compare with the site table above): one singleton,
884-
one doubleton and one tripleton.
1086+
one doubleton and one tripleton.
8851087

8861088
+++++++++++++
8871089
Joint spectra

0 commit comments

Comments
 (0)