Skip to content

Commit f20e19d

Browse files
committed
README updates
1 parent b9dc2dc commit f20e19d

File tree

1 file changed

+71
-33
lines changed

1 file changed

+71
-33
lines changed

README.rst

Lines changed: 71 additions & 33 deletions
Original file line numberDiff line numberDiff line change
@@ -24,8 +24,7 @@ Quick Start Guide
2424

2525
Depending on sample size and your machine, the fitting procedure
2626
should take between a few minutes and a few hours. The fitted model
27-
will be stored in JSON format in ``analysis/model.final.json``. For
28-
details on the format, see below.
27+
will be stored in JSON format in ``analysis/model.final.json``.
2928

3029
4. Visualize the results using plot_::
3130

@@ -38,11 +37,12 @@ populations; see split_.
3837
Installation instructions
3938
=========================
4039

41-
Installer binaries are available from the `releases page`_. Download the
42-
installer for your platform and then run it using ``bash``. The script
43-
will walk you through the installation process. Be sure to
44-
``source /path/to/smcpp/bin/activate`` before running ``/path/to/smcpp/bin/smc++``
45-
in order to prevent conflicts with your existing Python installation.
40+
Installer binaries are available from the `releases page`_. Download
41+
the installer for your platform and then run it using ``bash``.
42+
The script will walk you through the installation process. You may
43+
need to ``source /path/to/smcpp/bin/activate`` before running
44+
``/path/to/smcpp/bin/smc++`` in order to prevent conflicts with your
45+
existing Python installation.
4646

4747
The installers are based on the Anaconda_ scientific Python distribution.
4848
If Anaconda already exists on your machine, a more efficient way to
@@ -92,13 +92,11 @@ After installing the requirements, SMC++ may be built by running::
9292
Note for OS X users
9393
-------------------
9494
Versions of Clang shipping with Mac OS X do not currently support
95-
OpenMP_. SMC++ will be *much faster* if it runs in multithreaded mode
96-
using OpenMP. For this reason it is recommended that you use ``gcc``
97-
instead. In order to do so, set the ``CC`` and ``CXX`` environment
98-
variables, e.g.::
95+
OpenMP_. In order to build SMC++ on OS X you must use a compiler that
96+
does, such as ``gcc``::
9997

100-
$ export CC=gcc-5 CXX=g++-5
101-
$ CC=gcc-5 CXX=g++-5 python setup.py install
98+
$ brew install gcc
99+
$ CC=gcc-6 CXX=g++-6 python setup.py install
102100

103101
.. _OpenMP: http://openmp.org
104102

@@ -111,7 +109,7 @@ root access on your system, you may wish to install SMC++ inside of a
111109
environment::
112110

113111
$ virtualenv -p python3 <desired location>
114-
$ source <desired location>/bin/active
112+
$ source <desired location>/bin/activate
115113

116114
Then, install SMC++ as described above.
117115

@@ -132,7 +130,7 @@ vcf2smc
132130
-------
133131

134132
This subcommand converts (biallelic, diploid) VCF data to the format
135-
used by SMC++.
133+
used by SMC++.
136134

137135
Required arguments
138136
^^^^^^^^^^^^^^^^^^
@@ -156,7 +154,7 @@ For example, to convert contig ``chr1`` of ``vcf.gz`` using samples
156154

157155
Optional arguments
158156
^^^^^^^^^^^^^^^^^^
159-
- ``-d``. SMC++ relies crucially on the notion of a pair of *distinguished lineages*
157+
- ``-d``: SMC++ relies crucially on the notion of a pair of *distinguished lineages*
160158
(see paper for details on this terminology). The identity of the
161159
distinguished lineages is set using the ``-d`` option, which specifies
162160
the sample(s) which will form the distinguished pair. ``-d`` accepts to
@@ -166,19 +164,64 @@ Optional arguments
166164
167165
$ smc++ vcf2smc -d NA12878 NA12879 vcf.gz chr1.smc.gz chr1 CEU:NA12878,NA12879
168166
169-
Note that "first" and "second" allele have no meaning for unphased data!
167+
Note that "first" and "second" allele have no meaning for unphased data; if your
168+
data are not phased, it only makes sense to specify a single individual
169+
(e.g. ``-d NA12878 NA12878``).
170+
171+
- ``--mask``, ``--m``: This specifies a BED-formatted mask file whose
172+
positions will be marked as missing data (across all samples) in
173+
the outputted SMC++ data set. This can be used to delineate large
174+
uncalled regions (e.g. centromeres) which are often omitted in VCF
175+
files; without additional information provided by ``--mask``, there
176+
is no way to distinguish these missing regions from very long runs
177+
of homozygosity. For finer-grained control of missing data, setting
178+
individual positions and samples to the missing genotype, ``./.``,
179+
also works fine. (The point of ``--mask`` is to save the user the
180+
trouble of emitting millions of rows of missing observations in the
181+
VCF).
182+
183+
- ``--missing-cutoff``, ``-c``: This is an alternative to ``--mask`` which will
184+
automatically treat runs of homozgosity longer than ``-c`` base pairs
185+
as missing. Typically ``-c`` should be set fairly high so as not
186+
to filter out legitimate long runs of homozyous bases, which are
187+
informative about recent demography. This is a fairly crude approach
188+
to filtering and is only recommended for use in cases where using
189+
``--mask`` is not possible.
170190

171-
By varying ``-d`` over the same VCF, the user can create distinct data
172-
sets for estimation. This is useful for forming composite likelihoods.
173-
For example, the following command will create three data sets from
174-
contig ``chr1`` of ``myvcf.gz``, by varying the identity of the distinguished
175-
individual and treating the remaining two samples as "undistinguished":
176-
177-
.. code-block:: bash
178-
179-
for i in {7..9};
180-
do smc++ vcf2smc -d NA1287$i NA1287$i myvcf.gz out.$i.txt chr1 NA12877 NA12878 NA12890;
181-
done
191+
Composite Likelihood
192+
^^^^^^^^^^^^^^^^^^^^
193+
By varying ``-d`` over the same VCF, you can create distinct data
194+
sets for estimation. This is useful for forming composite likelihoods.
195+
For example, the following command will create three data sets from
196+
contig ``chr1`` of ``myvcf.gz``, by varying the identity of the distinguished
197+
individual and treating the remaining two samples as "undistinguished":
198+
199+
.. code-block:: bash
200+
201+
for i in {7..9};
202+
do smc++ vcf2smc -d NA1287$i NA1287$i myvcf.gz out.$i.txt chr1 NA12877 NA12878 NA12890;
203+
done
204+
205+
You can then pass these data sets into estimate_::
206+
207+
$ smc++ estimate -o output/ out.*.txt
208+
209+
SMC++ treats each file ``out.*.txt`` as an independently evolving
210+
sequence (i.e., a chromosome); the likelihood is simply the product
211+
of SMC++ likelihoods over each of the data sets. In the example above
212+
where the data sets are generated from the same chromosome but different
213+
distinguished individuals (different ``-d``), this independence
214+
assumption is violated, leading to a so-called **composite likelihood**.
215+
The advantage of this approach is that it incorporates genealogical
216+
information from additional distinguished individuals into the analysis,
217+
potentially leading to improved estimates.
218+
219+
Since (a portion of) the computational and memory requirements of SMC++
220+
scale linearly with the total analyzed sequence length, it is generally
221+
advisable to composite over a relatively small number of individuals. In
222+
practice we generally use 2-10 individuals, depending on genome length,
223+
sample size, etc., and have found that this leads to improved estimation
224+
without causing significant degeneracy in the likelihood.
182225

183226
Manual conversion
184227
^^^^^^^^^^^^^^^^^
@@ -194,11 +237,6 @@ is::
194237

195238
$ smc++ estimate -o out data.smc.gz
196239

197-
Required arguments
198-
^^^^^^^^^^^^^^^^^^
199-
200-
None.
201-
202240
Recommended arguments
203241
^^^^^^^^^^^^^^^^^^^^^
204242

0 commit comments

Comments
 (0)