@@ -24,8 +24,7 @@ Quick Start Guide
2424
2525 Depending on sample size and your machine, the fitting procedure
2626 should take between a few minutes and a few hours. The fitted model
27- will be stored in JSON format in ``analysis/model.final.json``. For
28- details on the format, see below.
27+ will be stored in JSON format in ``analysis/model.final.json``.
2928
30294. Visualize the results using plot _::
3130
@@ -38,11 +37,12 @@ populations; see split_.
3837Installation instructions
3938=========================
4039
41- Installer binaries are available from the `releases page `_. Download the
42- installer for your platform and then run it using ``bash ``. The script
43- will walk you through the installation process. Be sure to
44- ``source /path/to/smcpp/bin/activate `` before running ``/path/to/smcpp/bin/smc++ ``
45- in order to prevent conflicts with your existing Python installation.
40+ Installer binaries are available from the `releases page `_. Download
41+ the installer for your platform and then run it using ``bash ``.
42+ The script will walk you through the installation process. You may
43+ need to ``source /path/to/smcpp/bin/activate `` before running
44+ ``/path/to/smcpp/bin/smc++ `` in order to prevent conflicts with your
45+ existing Python installation.
4646
4747The installers are based on the Anaconda _ scientific Python distribution.
4848If Anaconda already exists on your machine, a more efficient way to
@@ -92,13 +92,11 @@ After installing the requirements, SMC++ may be built by running::
9292Note for OS X users
9393-------------------
9494Versions of Clang shipping with Mac OS X do not currently support
95- OpenMP _. SMC++ will be *much faster * if it runs in multithreaded mode
96- using OpenMP. For this reason it is recommended that you use ``gcc ``
97- instead. In order to do so, set the ``CC `` and ``CXX `` environment
98- variables, e.g.::
95+ OpenMP _. In order to build SMC++ on OS X you must use a compiler that
96+ does, such as ``gcc ``::
9997
100- $ export CC=gcc-5 CXX=g++-5
101- $ CC=gcc-5 CXX=g++-5 python setup.py install
98+ $ brew install gcc
99+ $ CC=gcc-6 CXX=g++-6 python setup.py install
102100
103101.. _OpenMP : http://openmp.org
104102
@@ -111,7 +109,7 @@ root access on your system, you may wish to install SMC++ inside of a
111109environment::
112110
113111 $ virtualenv -p python3 <desired location>
114- $ source <desired location>/bin/active
112+ $ source <desired location>/bin/activate
115113
116114Then, install SMC++ as described above.
117115
@@ -132,7 +130,7 @@ vcf2smc
132130-------
133131
134132This subcommand converts (biallelic, diploid) VCF data to the format
135- used by SMC++.
133+ used by SMC++.
136134
137135Required arguments
138136^^^^^^^^^^^^^^^^^^
@@ -156,7 +154,7 @@ For example, to convert contig ``chr1`` of ``vcf.gz`` using samples
156154
157155Optional arguments
158156^^^^^^^^^^^^^^^^^^
159- - ``-d ``. SMC++ relies crucially on the notion of a pair of *distinguished lineages *
157+ - ``-d ``: SMC++ relies crucially on the notion of a pair of *distinguished lineages *
160158 (see paper for details on this terminology). The identity of the
161159 distinguished lineages is set using the ``-d `` option, which specifies
162160 the sample(s) which will form the distinguished pair. ``-d `` accepts to
@@ -166,19 +164,64 @@ Optional arguments
166164
167165 $ smc++ vcf2smc -d NA12878 NA12879 vcf.gz chr1.smc.gz chr1 CEU:NA12878,NA12879
168166
169- Note that "first" and "second" allele have no meaning for unphased data!
167+ Note that "first" and "second" allele have no meaning for unphased data; if your
168+ data are not phased, it only makes sense to specify a single individual
169+ (e.g. ``-d NA12878 NA12878 ``).
170+
171+ - ``--mask ``, ``--m ``: This specifies a BED-formatted mask file whose
172+ positions will be marked as missing data (across all samples) in
173+ the outputted SMC++ data set. This can be used to delineate large
174+ uncalled regions (e.g. centromeres) which are often omitted in VCF
175+ files; without additional information provided by ``--mask ``, there
176+ is no way to distinguish these missing regions from very long runs
177+ of homozygosity. For finer-grained control of missing data, setting
178+ individual positions and samples to the missing genotype, ``./. ``,
179+ also works fine. (The point of ``--mask `` is to save the user the
180+ trouble of emitting millions of rows of missing observations in the
181+ VCF).
182+
183+ - ``--missing-cutoff ``, ``-c ``: This is an alternative to ``--mask `` which will
184+ automatically treat runs of homozgosity longer than ``-c `` base pairs
185+ as missing. Typically ``-c `` should be set fairly high so as not
186+ to filter out legitimate long runs of homozyous bases, which are
187+ informative about recent demography. This is a fairly crude approach
188+ to filtering and is only recommended for use in cases where using
189+ ``--mask `` is not possible.
170190
171- By varying ``-d `` over the same VCF, the user can create distinct data
172- sets for estimation. This is useful for forming composite likelihoods.
173- For example, the following command will create three data sets from
174- contig ``chr1 `` of ``myvcf.gz ``, by varying the identity of the distinguished
175- individual and treating the remaining two samples as "undistinguished":
176-
177- .. code-block :: bash
178-
179- for i in {7..9};
180- do smc++ vcf2smc -d NA1287$i NA1287$i myvcf.gz out.$i .txt chr1 NA12877 NA12878 NA12890;
181- done
191+ Composite Likelihood
192+ ^^^^^^^^^^^^^^^^^^^^
193+ By varying ``-d `` over the same VCF, you can create distinct data
194+ sets for estimation. This is useful for forming composite likelihoods.
195+ For example, the following command will create three data sets from
196+ contig ``chr1 `` of ``myvcf.gz ``, by varying the identity of the distinguished
197+ individual and treating the remaining two samples as "undistinguished":
198+
199+ .. code-block :: bash
200+
201+ for i in {7..9};
202+ do smc++ vcf2smc -d NA1287$i NA1287$i myvcf.gz out.$i .txt chr1 NA12877 NA12878 NA12890;
203+ done
204+
205+ You can then pass these data sets into estimate _::
206+
207+ $ smc++ estimate -o output/ out.*.txt
208+
209+ SMC++ treats each file ``out.*.txt `` as an independently evolving
210+ sequence (i.e., a chromosome); the likelihood is simply the product
211+ of SMC++ likelihoods over each of the data sets. In the example above
212+ where the data sets are generated from the same chromosome but different
213+ distinguished individuals (different ``-d ``), this independence
214+ assumption is violated, leading to a so-called **composite likelihood **.
215+ The advantage of this approach is that it incorporates genealogical
216+ information from additional distinguished individuals into the analysis,
217+ potentially leading to improved estimates.
218+
219+ Since (a portion of) the computational and memory requirements of SMC++
220+ scale linearly with the total analyzed sequence length, it is generally
221+ advisable to composite over a relatively small number of individuals. In
222+ practice we generally use 2-10 individuals, depending on genome length,
223+ sample size, etc., and have found that this leads to improved estimation
224+ without causing significant degeneracy in the likelihood.
182225
183226Manual conversion
184227^^^^^^^^^^^^^^^^^
@@ -194,11 +237,6 @@ is::
194237
195238 $ smc++ estimate -o out data.smc.gz
196239
197- Required arguments
198- ^^^^^^^^^^^^^^^^^^
199-
200- None.
201-
202240Recommended arguments
203241^^^^^^^^^^^^^^^^^^^^^
204242
0 commit comments