Merge pull request #150 from markovmodel/revision-beh

cwehmeyer · web-flow · commit 11d616b450cc · 2018-08-31T10:33:34.000+02:00
First draft of a theory section
diff --git a/manuscript/literature.bib b/manuscript/literature.bib
@@ -1,3 +1,14 @@
+@article{husic2017note,
+  title={Note: {MSM} lag time cannot be used for variational model selection},
+  author={Husic, Brooke E and Pande, Vijay S},
+  journal={J. Chem. Phys.},
+  volume={147},
+  number={17},
+  pages={176101},
+  year={2017},
+  publisher={AIP Publishing}
+}
+
 @book{msm-book,
     Title = {An Introduction to Markov State Models and Their Application to Long Timescale Molecular Simulation},
     Publisher = {Springer Netherlands},
diff --git a/manuscript/manuscript.tex b/manuscript/manuscript.tex
@@ -79,16 +79,102 @@ \subsection{Scope}
 
 \section{Prerequisites}
 
-In the following, we summarize the recommended theoretical background knowledge of Markov state modeling for this tutorial.
+In the following, we summarize the recommended theory and background knowledge of Markov state modeling for this tutorial.
 Then, we address the software required to work through the lessons.
 
-\subsection{Background knowledge}
+\subsection{Essential theory}
+\label{sec:theory}
+
+Markov state modeling is a mathematical framework for the analysis of time-series data; it is often, but not exclusively, applied to high-dimensional MD simulation datasets.
+In its standard formulation, the creation of a Markov state model involves decomposing the phase or configuration space occupied by a system into a set of disjoint, discrete states that adhere to the Markov property.
+The Markov property asserts that the dynamics in the state space is ``memoryless'':~in other words, the probability of transitioning from any state $i$ to any other state $j$ after a time $\tau$ is independent of the history of the system before it was in $i$.
+%This lag time must be sufficiently short to resolve the dynamics of interest, but long enough such that the Markovian approximation is appropriate.
+
+In order to create a Markov state model for a dynamical system, each data point in the time series is assigned to a state.
+Given an appropriate lag time, every pairwise transition at that lag time is counted and stored in a count matrix.
+Then, the count matrix is converted to a row-stochastic transition probability matrix, which is defined for the specified lag time.
+To ensure that the transition probability matrix has desirable mathematical properties, detailed balance is enforced when converting the count matrix to the transition probability matrix, which requires that,
+
+\begin{equation}
+\label{eq:balance}
+\pi_i \, p(i \rightarrow j) = \pi_j \, p(j \rightarrow i),
+\end{equation}
+
+\noindent{}where $\pi_i$ represents the stationary (equilibrium) probability of being in state $i$, and $p(i \rightarrow j)$ is the probability of transitioning from state $i$ to state $j$ within one lag time $\tau$.
+% Detailed balance requires that the probability of transitioning from state $i$ to state $j$, conditioned upon the system being in state $i$, is the same as the probability of transitioning from state $j$ to state $i$ conditioned upon the system being in state $j$.
+The requirement of detailed balance indicates that we assume our system is reversible.
+We additionally require that the system is ergodic, which means that every state is accessible from every other state.
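+
+As a minimal worked illustration of the counting and normalization steps (a sketch that ignores the detailed balance constraint; the symbol $c_{ij}(\tau)$ and the numbers below are hypothetical and introduced only for this example), let $c_{ij}(\tau)$ denote the number of observed transitions from state $i$ to state $j$ at lag time $\tau$.
+The simplest row-stochastic estimate is then
+
+\begin{equation}
+p(i \rightarrow j) \approx \frac{c_{ij}(\tau)}{\sum_k c_{ik}(\tau)},
+\end{equation}
+
+\noindent{}so that, for a two-state system with counts $c_{11}=90$, $c_{12}=10$, $c_{21}=20$, and $c_{22}=80$, we obtain $p(1 \rightarrow 2)=0.1$ and $p(2 \rightarrow 1)=0.2$.
+In this example, the state probabilities $2/3$ and $1/3$ for states $1$ and $2$ satisfy Eq.~\ref{eq:balance}, since $\frac{2}{3} \times 0.1 = \frac{1}{3} \times 0.2$.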
+
+When estimating an MSM it is critical to choose a lag time, $\tau$, which is long enough to ensure Markovian dynamics in our state space, but short enough to resolve the dynamics in which we are interested.
+Plotting the implied timescales (ITS) as a function of $\tau$ can be a helpful diagnostic when selecting the MSM lag time~\cite{swope-its}.
+The ITS $t_i$ approximates the decorrelation time of the $i^\textrm{th}$ process and is computed from the eigenvalues $\lambda_i$ of the MSM transition matrix via,
+\begin{equation}
+\label{eq:its}
+t_i = \frac{-\tau}{\ln\left|\lambda_i(\tau)\right|}.
+\end{equation}
+
+\noindent{}When the ITS become approximately constant as a function of the lag time, we say that the timescales have converged, and we choose the smallest lag time at which they have converged in order to maximize the model's temporal resolution.
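+
+As a hypothetical numerical example (the values are chosen for illustration only), an eigenvalue $\lambda_2(\tau)=0.7$ estimated at $\tau=1$~ns gives, via Eq.~\ref{eq:its},
+
+\begin{equation}
+t_2 = \frac{-1~\textrm{ns}}{\ln 0.7} \approx 2.8~\textrm{ns}.
+\end{equation}
+
+\noindent{}If the dynamics were perfectly Markovian, the same process would have $\lambda_2(2\tau)=0.7^2=0.49$ and therefore $t_2 = -2~\textrm{ns}/\ln 0.49 \approx 2.8$~ns, i.e., the ITS would be independent of the lag time.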
+
+Once we have used the ITS to choose the lag time, we can check whether a given transition probability matrix $T(\tau)$ is approximately Markovian using the Chapman-Kolmogorov (CK) test~\cite{msm-jhp}.
+The CK property for a Markovian matrix is,
+
+\begin{equation}
+T(k \tau) = T^k(\tau),
+\end{equation}
+
+\noindent{}where the left-hand side of the equation corresponds to an MSM estimated at lag time $k\tau$, where $k$ is an integer larger than 1, whereas the right-hand side of the equation is our estimated MSM transition probability matrix to the $k^\textrm{th}$ power.
+By assessing how well the approximated transition probability matrix adheres to the CK property, we can validate the appropriateness of the Markovian assumption for the model.
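+
+For instance, assuming a hypothetical two-state transition matrix estimated at lag time $\tau$ (numbers for illustration only), the right-hand side of the CK property for $k=2$ reads
+
+\begin{equation}
+T^2(\tau) = \left(\begin{array}{cc} 0.9 & 0.1 \\ 0.2 & 0.8 \end{array}\right)^2 = \left(\begin{array}{cc} 0.83 & 0.17 \\ 0.34 & 0.66 \end{array}\right),
+\end{equation}
+
+\noindent{}and the CK test asks whether a transition matrix estimated directly at lag time $2\tau$ agrees with these predicted probabilities within the statistical uncertainty.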
+
+Once validated, the transition matrix can be decomposed into eigenvectors and eigenvalues,
+
+\begin{equation}
+\label{eq:transmat}
+T(\tau) \, \psi_i = \lambda_i \psi_i, \qquad \phi_i^\top T(\tau) = \lambda_i \phi_i^\top,
+\end{equation}
+
+\noindent{}where the eigenvalues are indexed in decreasing order.
+The largest eigenvalue, $\lambda_1$, is unique and equal to $1$, and its left eigenvector $\phi_1$ is the stationary distribution of the system.
+The right eigenvector $\psi_1$ is a vector consisting of $1$'s.
+The subsequent eigenvalues $\lambda_{i>1}$ are real with absolute values less than $1$ and correspond to dynamical processes within the system.
+The right eigenvectors $\psi_i$ each represent a dynamical process (for $i>1$), and the coefficients of the eigenvectors represent the flux into and out of the MSM states that characterizes that process.
+The corresponding left eigenvectors $\phi_i$ contain the same information weighted by the stationary distribution.
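+
+As a small sketch of this decomposition, consider a hypothetical two-state transition matrix with rows $(0.9, 0.1)$ and $(0.2, 0.8)$ (values for illustration only).
+Its eigenvalues are $\lambda_1=1$ and $\lambda_2=0.7$, and one possible choice of (unnormalized) eigenvectors is
+
+\begin{equation}
+\phi_1 = \left(\frac{2}{3}, \frac{1}{3}\right), \quad \psi_1 = (1, 1)^\top, \quad \phi_2 = \left(\frac{2}{3}, -\frac{2}{3}\right), \quad \psi_2 = (1, -2)^\top.
+\end{equation}
+
+\noindent{}Here, $\phi_1$ is the stationary distribution, the sign change in $\psi_2$ indicates that the slow process exchanges probability between states $1$ and $2$, and $\phi_2$ is the element-wise product of $\phi_1$ and $\psi_2$.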
+
+The timescale of a given dynamical process is a function of the relevant eigenvalue and the lag time at which the model was defined,
+
+\begin{equation}
+\label{eq:timescales}
+t_i \equiv -\frac{\tau}{\ln\left|\lambda_i(\tau)\right|}.
+\end{equation}
+
+\subsection{MSM construction and the variational approach}
+\label{sec:construction}
+
+The theory described in the previous section required the decomposition of the phase or configuration space occupied by a dynamical system into discrete, disjoint states.
+Starting from the output of an MD simulation of a protein, there are several steps that can be taken to obtain an MSM from the original configuration space:
+
+\begin{itemize}
+	\item Featurization -- The Cartesian coordinates characterizing each frame of the MD trajectory are transformed into an intuitive basis such as the protein's dihedral angles or contact distance pairs.
+	\item Dimensionality reduction -- Optionally, a basis set transformation can be performed that produces a linear (or nonlinear) combination of the features in the previous step.
+	Frequently, time-lagged independent component analysis (TICA)~\cite{tica,tica3,tica2,kinetic-maps} is used to transform the features into a set of slow coordinates.
+	\item Clustering -- This is the step at which the state decomposition occurs.
+	The features or time-lagged independent components (TICs) are grouped into a set of states using a clustering algorithm such as $k$-means.
+	\item Transition matrix approximation -- At this stage, transitions are counted at a pre-specified lag time, and the estimation and validation described in the previous section are performed.
+\end{itemize}
+
+MSM construction clearly involves many choices, such as which features to use and how many states to define.
+In 2013, the variational approach to conformational dynamics (VAC) was derived, which enabled an objective comparison among different state decomposition choices for models built with the same Markovian lag time~\cite{noe-vac}.
+More recently, the more general variational approach to Markov processes (VAMP) has been developed, which extends the approximation and comparison of models to nonreversible dynamics and to basis sets of continuous functions, as opposed to discrete states~\cite{vamp-preprint}.
+The VAMP can thus be used to perform model selection.
+Specifically, we use the VAMP-2 score, which captures the kinetic variance explained by the model.
+However, the MSM lag time cannot be optimized using VAMP and must instead be chosen using the separate validation described above~\cite{husic2017note}.
+
+\subsection{Background knowledge and resources}
 \label{sec:background}
 
-For those unfamiliar with Markov state modeling, ``\emph{Markov State Models: From an Art to a Science}''~\cite{msm-brooke} provides a recent overview, while ``\emph{Markov models of molecular kinetics: Generation and validation}''~\cite{msm-jhp} describes the basic MSM theory and methodology in detail. Additionally, two textbooks exist that focus on computational methods and applications~\cite{msm-book} and mathematical theory~\cite{schuette-sarich-book}.
+For those seeking further resources, ``\emph{Markov State Models: From an Art to a Science}''~\cite{msm-brooke} provides a recent overview, while ``\emph{Markov models of molecular kinetics: Generation and validation}''~\cite{msm-jhp} describes the basic MSM theory and methodology in detail.
+Additionally, two textbooks exist that focus on computational methods and applications~\cite{msm-book} and mathematical theory~\cite{schuette-sarich-book}.
 
 In addition to publications on theory and application of Markov state modeling~\cite{schuette-msm,buchete-msm-2008,noe-tmat-sampling,bowman-msm-2009,noe-folding-pathways,sarich-msm-quality,noe-fingerprints,noe-dy-neut-scatt,Chodera2014,ben-rev-msm,simon-mech-mod-nmr,oom-feliks,simon-amm},
-we also recommend the literature on time-lagged independent component analysis (TICA)~\cite{tica,tica3,tica2,kinetic-maps}, transition path theory (TPT)~\cite{weinan-tpt,metzner-msm-tpt},
+we also recommend the literature on TICA~\cite{tica,tica3,tica2,kinetic-maps}, transition path theory (TPT)~\cite{weinan-tpt,metzner-msm-tpt},
 hidden Markov state models (HMMs)~\cite{noe-proj-hid-msm,hmm-baum-welch-alg,hmm-tutorial}, and variational techniques~\cite{noe-vac,vamp-preprint,gmrq}, as these topics play important roles within the standard MSM workflow.
 
 \subsection{Software/system requirements}
@@ -189,21 +275,22 @@ \subsection{Discretization}
 TICA yields a representation of our molecular simulation data with a reduced dimensionality, which can greatly facilitate the decomposition of our system into the discrete Markovian states necessary for MSM estimation. Here, we use the $k$-means algorithm to segment the four dimensional TICA space into $k=75$ cluster centers. The number of cluster centers has been chosen to optimize the VAMP-2 score in a manner identical to how the feature selection was carried out above, which is shown in the showcase notebook (00).
 
 \subsection{MSM estimation and validation}
-When estimating an MSM it is critical to choose a lag time, $\tau$, which is long enough to ensure Markovian dynamics in our reduced space, but short enough to resolve the dynamics in which we are interested.
-Plotting the implied timescales (ITS) as a function of $\tau$ can be a helpful diagnostic when selecting the MSM lag time~\cite{swope-its}. The ITS $t_i$ approximates the decorrelation time of the $i^\textrm{th}$ process and is computed from the eigenvalues $\lambda_i$ of the MSM transition matrix via
-\begin{equation}
-\label{eq:its}
-t_i = \frac{-\tau}{\ln\left|\lambda_i(\tau)\right|}.
-\end{equation}
+% When estimating an MSM it is critical to choose a lag time, $\tau$, which is long enough to ensure Markovian dynamics in our reduced space, but short enough to resolve the dynamics in which we are interested.
+% Plotting the implied timescales (ITS) as a function of $\tau$ can be a helpful diagnostic when selecting the MSM lag time~\cite{swope-its}. The ITS $t_i$ approximates the decorrelation time of the $i^\textrm{th}$ process and is computed from the eigenvalues $\lambda_i$ of the MSM transition matrix via
+% \begin{equation}
+% \label{eq:its}
+% t_i = \frac{-\tau}{\ln\left|\lambda_i(\tau)\right|}.
+% \end{equation}
 A necessary condition for Markovian dynamics in our reduced space is that the ITS are approximately constant as a function of $\tau$; accordingly, we chose the smallest possible $\tau$ which fulfills this condition within the model uncertainty. The uncertainty bounds are computed using a Bayesian scheme~\cite{ben-rev-msm,noe-tmat-sampling} with $100$ samples.
 In our example, we find that the four slowest ITS converge quickly and are constant within a $95\%$ confidence interval for lag times above $0.5$~ns (Fig.~\ref{fig:io-to-ck}e). Using this lag time we can now estimate a (Bayesian) MSM with $\tau=0.5$~ns. 
 
-To test the validity of our MSM we perform a Chapman-Kolmogorov (CK) test. The CK test compares the right and the left side of the Chapman-Kolmogorov equation
-\begin{equation}
-\label{eq:ck}
-T(k \tau) = T^k(\tau)
-\end{equation}
-where $T$ is the MSM transition matrix. The left-hand side of the equation corresponds to an MSM estimated at lag time $k\tau$, where $k$ is an integer larger than 1, whereas the right-hand side of the equation is our estimated MSM to the $k^\textrm{th}$ power.
+To test the validity of our MSM we perform a Chapman-Kolmogorov (CK) test.
+% The CK test compares the right and the left side of the Chapman-Kolmogorov equation
+% \begin{equation}
+% \label{eq:ck}
+% T(k \tau) = T^k(\tau)
+% \end{equation}
+% where $T$ is the MSM transition matrix. The left-hand side of the equation corresponds to an MSM estimated at lag time $k\tau$, where $k$ is an integer larger than 1, whereas the right-hand side of the equation is our estimated MSM to the $k^\textrm{th}$ power.
 Visualizing the full transition probability matrix $T$ is difficult; we therefore coarse-grain $T$ into a smaller number of metastable states before performing the test.
 An appropriate number of metastable states can be chosen by identifying a relatively large gap in the ITS plot.
 For this analysis, we chose 5 metastable states.