@@ -821,6 +821,208 @@ to remember that the algorithm is minimising the number of state transitions;
821821this may not correspond always to what we might consider the most parsimonious
822822explanation.
823823
824+ .. _sec_tutorial_stats :
825+
826+ ********************
827+ Computing statistics
828+ ********************
829+
830+ Tskit provides an extensive and flexible interface for computing population
831+ genetic statistics, which is documented in detail in the :ref: `general statistics
832+ <sec_general_stats>` section. This tutorial aims to give a quick overview of
833+ how the APIs work how to use them effectively.
834+
835+ First, lets simulate a tree sequence to work with which has roughly human
836+ parameters for 10 thousand samples and 10Mb chromosomes::
837+
838+ ts = msprime.simulate(
839+ 10**4, Ne=10**4, recombination_rate=1e-8, mutation_rate=1e-8, length=10**7,
840+ random_seed=42)
841+
842+ We end up with 36K trees 39K segregating sites. We'd now like to compute some statistics on
843+ this dataset.
844+
845+ ++++++++++++++++++
846+ One-way statistics
847+ ++++++++++++++++++
848+
849+ We refer to statistics that are defined with respect to a single set of
850+ samples as "one-way". An example of such a statistic is diversity, which
851+ is computed using the :meth: `.TreeSequence.diversity ` method::
852+
853+ x = ts.diversity()
854+ print("Average diversity per unit sequence length = {:.3G}".format(x))
855+
856+ [Output]
857+
858+ Average diversity per unit sequence length = 0.000401
859+
860+ This tells the average diversity across the whole sequence and returns a single
861+ number. We'll usually want to compute statistics in
862+ :ref: `windows <sec_general_stats_windowing >` along the genome and we
863+ use the ``windows `` argument to do this::
864+
865+ windows = np.linspace(0, ts.sequence_length, num=5)
866+ x = ts.diversity(windows=windows)
867+ print(windows)
868+ print(x)
869+
870+ [Output]
871+
872+ [ 0. 2500000. 5000000. 7500000. 10000000.]
873+ [0.00041602 0.00039112 0.00041554 0.00038329]
874+
875+ The ``windows `` argument takes a numpy array specifying the breakpoints
876+ along the genome. Here, we use numpy to create four equally spaced windows
877+ of size 2.5 megabases (the windows array contains k + 1 elements to define
878+ k windows). Because we have asked for values in windows, tskit now returns
879+ a numpy array rather than a single value. (See
880+ :ref: `sec_general_stats_output_dimensions ` for a full description of how the output
881+ dimensions of statistics are determined by the ``windows `` argument.)
882+
883+ Suppose we wanted to compute diversity within a specific subset of samples.
884+ We can do this using the ``sample_sets `` argument::
885+
886+ A = ts.samples()[:100]
887+ x = ts.diversity(sample_sets=A)
888+ print(x)
889+
890+ [Output]
891+
892+ 0.00040166573737371227
893+
894+ Here, we've computed the average diversity within the first hundred samples across
895+ the whole genome. As we've not specified any windows, this is again a single value.
896+
897+ We can also compute diversity in *multiple * sample sets at the same time by providing
898+ a list of sample sets as an argument::
899+
900+ A = ts.samples()[:100]
901+ B = ts.samples()[100:200]
902+ C = ts.samples()[200:300]
903+ x = ts.diversity(sample_sets=[A, B, C])
904+ print(x)
905+
906+ [Output]
907+
908+ [0.00040167 0.00040008 0.00040103]
909+
910+ Because we've computed multiple statistics concurrently, tskit returns a numpy array
911+ of these statistics. We have asked for diversity within three different sample sets,
912+ and tskit therefore returns an array with three values. (In general, the
913+ dimensions of the input determines the dimensions of the output: see
914+ :ref: `sec_general_stats_output_dimensions ` for a detailed description of the rules.)
915+
916+ We can also compute multiple statistics in multiple windows::
917+
918+ x = ts.diversity(sample_sets=[A, B, C], windows=windows)
919+ print("shape = ", x.shape)
920+ print(x)
921+
922+ [Output]
923+
924+ shape = (4, 3)
925+ [[0.0004139 0.00041567 0.00041774]
926+ [0.00039148 0.00039152 0.00038997]
927+ [0.00042019 0.00041039 0.00041475]
928+ [0.0003811 0.00038274 0.00038166]]
929+
930+ We have computed diversity within three different sample sets across four
931+ genomic windows, and our output is therefore a 2D numpy array with four
932+ rows and three columns: each row contains the diversity values within
933+ A, B and C for a particular window.
934+
935+ ++++++++++++++++++++
936+ Multi-way statistics
937+ ++++++++++++++++++++
938+
939+ Many population genetic statistics compare multiple sets of samples to
940+ each other. For example, the :meth: `TreeSequence.divergence ` method computes
941+ the divergence between two subsets of samples::
942+
943+ A = ts.samples()[:100]
944+ B = ts.samples()[:100]
945+ x = ts.divergence([A, B])
946+ print(x)
947+
948+ [Output]
949+
950+ 0.00039764908000000676
951+
952+ The divergence between two sets of samples A and B is a single number,
953+ and we we again return a single floating point value as the result. We can also
954+ compute this in windows along the genome, as before::
955+
956+
957+ x = ts.divergence([A, B], windows=windows)
958+ print(x)
959+
960+ [Output]
961+
962+ [0.00040976 0.00038756 0.00041599 0.00037728]
963+
964+
965+ Again, as we have defined four genomic windows along the sequence, the result is
966+ numpy array with four values.
967+
968+ A powerful feature of tskit's stats API is that we can compute the divergences
969+ between multiple sets of samples simultaneously using the ``indexes `` argument::
970+
971+
972+ x = ts.divergence([A, B, C], indexes=[(0, 1), (0, 2)])
973+ print(x)
974+
975+ [Output]
976+
977+ [0.00039765 0.00040181]
978+
979+ Here, we've specified three sample sets A, B and C and we've computed the
980+ divergences between A and B, and between A and C. The ``indexes `` argument is used
981+ to specify which pairs of sets we are interested in. In this example
982+ we've computed two different divergence values and the output is therefore
983+ a numpy array of length 2.
984+
985+ As before, we can combine computing multiple statistics in multiple windows
986+ to return a 2D numpy array::
987+
988+ windows = np.linspace(0, ts.sequence_length, num=5)
989+ x = ts.divergence([A, B, C], indexes=[(0, 1), (0, 2)], windows=windows)
990+ print(x)
991+
992+ [Output]
993+
994+ [[0.00040976 0.0004161 ]
995+ [0.00038756 0.00039025]
996+ [0.00041599 0.00041847]
997+ [0.00037728 0.0003824 ]]
998+
999+ Each row again corresponds to a window, which contains the average divergence
1000+ values between the chosen sets.
1001+
1002+ If the ``indexes `` parameter is 1D array, we interpret this as specifying
1003+ a single statistic and remove the empty outer dimension::
1004+
1005+ x = ts.divergence([A, B, C], indexes=(0, 1))
1006+ print(x)
1007+
1008+ [Output]
1009+
1010+ 0.00039764908000000676
1011+
1012+ It's important to note that we don't **have ** to remove empty dimensions: tskit
1013+ will only do this if you explicitly ask it to. Here, for example, we can keep the
1014+ output as an array with one value if we wish::
1015+
1016+ x = ts.divergence([A, B, C], indexes=[(0, 1)])
1017+ print(x)
1018+
1019+ [Output]
1020+
1021+ [0.00039765]
1022+
1023+ Please see :ref: `sec_general_stats_sample_sets ` for a
1024+ full description of the ``sample_sets `` and ``indexes `` arguments.
1025+
8241026.. _sec_tutorial_afs :
8251027
8261028************************
@@ -881,7 +1083,7 @@ giving::
8811083This time, we've asked for the number of sites at each frequency in two
8821084equal windows. Now we can see that in the first half of the sequence we
8831085have three sites (compare with the site table above): one singleton,
884- one doubleton and one tripleton.
1086+ one doubleton and one tripleton.
8851087
8861088+++++++++++++
8871089Joint spectra
0 commit comments