manuscript/manuscript.tex
Lines changed: 50 additions & 44 deletions
@@ -81,7 +81,7 @@

\section{Introduction}

-PyEMMA~\cite{pyemma} (\url{http://emma-project.org}) is a software package for the analysis of molecular dynamics (MD) simulations using Markov state models~\cite{schuette-msm,singhal-msm-naming} (MSMs).
+PyEMMA~\cite{pyemma} (\url{http://emma-project.org}) is a software package for the analysis of molecular dynamics (MD) simulations using Markov state models~\cite{schuette-msm,singhal-msm-naming,noe2007jcp,chodera2007jcp,buchete-msm-2008} (MSMs).
The package is written in Python (\url{http://python.org}), relies heavily on NumPy/SciPy~\cite{numpy,scipy}, and is compatible with the scikit-learn~\cite{sklearn} framework for machine learning.

\subsection{Scope}
@@ -202,17 +202,24 @@ \subsection{Variational approach and TICA}
A commonly used method for dimensionality reduction, TICA, is a particular implementation of the VAC.
To apply TICA, we need to compute instantaneous ($\mathbf{C}(0)$) and time-lagged ($\mathbf{C}(\tau)$) covariance matrices with elements
@@ -223,12 +230,12 @@ \subsection{Variational approach and TICA}
is the total kinetic variance explained by all $n$ features.

If we further scale the independent components $\mathbf{u}_i$ by the corresponding eigenvalues $\lambda_i(\tau)$,
-we obtain a \emph{kinetic map} which is the default behavior in PyEMMA.
+we obtain a \emph{kinetic map}~\cite{kinetic-maps} which is the default behavior in PyEMMA.

-Note, though, that TICA requires the data to be in equilibrium.
-To use TICA with nonequilibrium data, we can either symmetrize the time-lagged covariance matrix $\mathbf{C}(\tau)$
-or apply a Koopman reweighting~\cite{vamp-preprint}.
-For short trajectories and nonequilibrium data we generally recommend to use VAMP~\cite{vamp-preprint}.
+Note, though, that TICA requires the dynamics to be simulated under equilibrium conditions.
+To use TICA with nonequilibrium MD, e.g., subject to external forces,
+or simply to perform dimension reduction on short trajectory data without worrying about reweighting,
+we recommend using VAMP~\cite{vamp-preprint}.
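For readers following this step in PyEMMA itself, a minimal sketch might look as follows; the topology/trajectory file names, the featurization, and the lag times are illustrative placeholders and not taken from the manuscript:

```python
import pyemma

# hypothetical input files; replace with your own topology and trajectories
feat = pyemma.coordinates.featurizer('peptide.pdb')
feat.add_backbone_torsions(cossin=True, periodic=False)
data = pyemma.coordinates.load(['traj-0.xtc', 'traj-1.xtc'], features=feat)

# TICA with kinetic-map scaling (PyEMMA's default); lag time in trajectory steps
tica = pyemma.coordinates.tica(data, lag=5)
tica_output = tica.get_output()

# for short or off-equilibrium trajectories, VAMP is the recommended alternative
vamp = pyemma.coordinates.vamp(data, lag=5)
vamp_output = vamp.get_output()
```

The lag time and the choice of features are system dependent; the VAMP-2 scoring discussed further below is one way to compare candidate featurizations.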

For all these approaches,
dimensionality reduction is performed by projecting the (mean free) features $\tilde{\mathbf{x}}(t)$
@@ -242,7 +249,7 @@ \subsection{Hidden Markov state models}

\begin{figure}
\includegraphics[width=0.48\textwidth]{figure_1}
-\caption{The HMM transition matrix $\tilde{\mathbf{P}}(\tau)$ propagates the hidden state trajectory $\tilde{s}(t)$ (orange circles) and, at each time step $t$, the emission into the observable state $s(t)$ is governed by the emission probabilities $\bm{\chi}\left( s(t) \middle| \tilde{s}(t) \right)$.}
+\caption{The HMM transition matrix $\tilde{\mathbf{P}}(\tau)$ propagates the hidden state trajectory $\tilde{s}(t)$ (orange circles) and, at each time step $t$, the emission into the observable state $s(t)$ (cyan circles) is governed by the emission probabilities $\bm{\chi}\left( s(t) \middle| \tilde{s}(t) \right)$.}
\label{fig:hmm-scheme}
\end{figure}

@@ -252,24 +259,23 @@ \subsection{Hidden Markov state models}
We illustrate this point in Notebook~07.

An alternative, which is much less sensitive to poor discretization,
-is to estimate a hidden Markov model (HMM)~\cite{hmm-baum-welch-alg,hmm-tutorial,jhp-spectral-rate-theory,bhmm-preprint}.
-HMMs are less sensitive to the discretization error as they sidestep the assumption of Markovian dynamics in the discretized space.
+is to estimate a hidden Markov model (HMM)~\cite{hmm-baum-welch-alg,hmm-tutorial,jhp-spectral-rate-theory,noe-proj-hid-msm,bhmm-preprint}.
+HMMs are less sensitive to the discretization error as they sidestep the assumption of Markovian dynamics in the discretized space (illustrated in Fig.~\ref{fig:hmm-scheme}).
Instead, HMMs assume that there is an underlying (hidden) dynamic process which is Markovian
-and gives rise to our observed data, e.g., the discretized trajectories (see Fig.~\ref{fig:hmm-scheme}).
+and gives rise to our observed data, e.g., the discretized ($n$-state) trajectories $s(t)$.
This is a powerful principle as we know that there is indeed an underlying process which is Markovian:
our molecular dynamics trajectories.

-To estimate an HMM, we need a spectral gap after the $m^\textrm{th}$ eigenvalue;
+To estimate an HMM, we need a spectral gap after the $m^\textrm{th}$ timescale;
in practice, a timescale separation of $t_m \geq 2 t_{m+1}$ is sufficient~\cite{pyemma}.
-Then, we can approximate the dynamics in the observed microstates ($\mathbf{P}$) at any lag time $k\tau$ via
-Here, the $\bm{\Pi}=\left[ \pi_1,\dots,\pi_n \right]$ is a diagonal matrix of the $n$ microstates' stationary probabilities,
-$\tilde{\bm{\Pi}}=\left[ \tilde{\pi}_1,\dots,\tilde{\pi}_m \right]$ is a diagonal matrix of the $m<n$ hidden states' stationary probabilities,
-$\tilde{\mathbf{P}}(\tau)$ is a transition matrix between the $m<n$ hidden states at lag time $\tau$,
-and $\bm{\chi}$ is an $m\times n$-dimensional row-stochastic matrix
-where each row encodes the emission probabilities into the $n$ microstates conditioned on being in the corresponding hidden state~\cite{noe-proj-hid-msm}.
+The HMM then consists of a transition matrix $\tilde{\mathbf{P}}(\tau)$ between $m<n$ hidden states
+and a row-stochastic matrix ($\bm{\chi}$) of probabilities $\chi\left( s \middle| \tilde{s} \right)$
+to emit the discrete state $s$ conditional on being in the hidden state $\tilde{s}$.
+
+We can further compute a reversal of the emission matrix $\bm{\chi}\in\mathbb{R}^{m \times n}$:
+the membership matrix $\mathbf{M}\in\mathbb{R}^{n \times m}$ which encodes
+a fuzzy assignment of each of the $n$ observable microstates $s$ to the $m$ hidden states $\tilde{s}$ and,
+thus, defines the \emph{coarse graining} of the microstates.
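As a hedged sketch of how such an estimate could be obtained with PyEMMA (the discrete trajectories `dtrajs`, the lag time, and the number of hidden states are placeholders; attribute names follow PyEMMA 2.5.x and should be checked against the API documentation):

```python
import pyemma

# dtrajs: list of discrete (microstate) trajectories, e.g., from k-means assignment;
# the number of hidden states is chosen from a gap in the implied timescales
hmm = pyemma.msm.estimate_hidden_markov_model(dtrajs, 2, lag=5)

print(hmm.transition_matrix)          # P~(tau) between the m hidden states
print(hmm.observation_probabilities)  # chi: m x n emission probabilities
print(hmm.metastable_memberships)     # M: n x m fuzzy coarse graining of the microstates
```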

An HMM estimation always yields a model with a small number of (hidden) states
where each state is considered to be metastable and,
@@ -311,15 +317,6 @@ \subsection{Software and installation}

\section{PyEMMA tutorials}

-\begin{figure}[bt]
-\includegraphics[width=0.48\textwidth]{figure_2}
-\caption{The PyEMMA workflow: MD trajectories are processed and discretized (first row).
-A Markov state model is estimated from the resulting discrete trajectories and validated (middle row).
-By iterating between data processing and MSM estimation/validation,
-a dynamical model is obtained that can be analyzed (last row).}
-\label{fig:workflowchart}
-\end{figure}
-
This tutorial consists of nine Jupyter notebooks which introduce the basic features of PyEMMA.
The first notebook (00), which we will summarize in the following, showcases the entire estimation,
validation, and analysis workflow for a small example system.
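As an illustration of that estimation/validation/analysis workflow in code, a compressed sketch could look like the following; `tica_output` refers to the dimension-reduced data from the earlier sketch, and all numerical choices (lag times, cluster count, number of metastable sets) are hypothetical and must be validated for the system at hand:

```python
import pyemma

# discretize the dimension-reduced data into microstates
cluster = pyemma.coordinates.cluster_kmeans(tica_output, k=75, max_iter=50)
dtrajs = cluster.dtrajs

# estimation and validation: implied timescales and Chapman-Kolmogorov test
its = pyemma.msm.its(dtrajs, lags=[1, 2, 5, 10, 20], nits=4, errors='bayes')
pyemma.plots.plot_implied_timescales(its)      # pick a lag where the timescales are converged
msm = pyemma.msm.estimate_markov_model(dtrajs, lag=5)
pyemma.plots.plot_cktest(msm.cktest(2))        # CK test with two metastable sets

# analysis
print(msm.timescales()[:3])
print(msm.stationary_distribution)
```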
provide a systematic means to quantitatively compare multiple representations of the simulation data.
In particular, we can use a scalar score obtained using VAMP to directly compare the ability of certain features to capture slow dynamical modes in a particular molecular system.

-\begin{figure}
-\includegraphics{figure_3}
-\caption{Example analysis of the conformational dynamics of a pentapeptide backbone:
-(a)~The Trp-Leu-Ala-Leu-Leu pentapeptide in licorice representation~\cite{vmd}.
-(b)~The VAMP-2 score indicates which of the tested featurizations contains the highest kinetic variance.
-(c)~The sample free energy projected onto the first two time-lagged independent components (ICs) at lag time $\tau=0.5$~ns shows multiple minima and
-(d)~the time series of the first two ICs of the first trajectory show rare jumps.}
-\label{fig:io-to-tica}
-\end{figure}
-
Here, we utilize the VAMP-2 score, which maximizes the kinetic variance contained in the features~\cite{kinetic-maps}.
We should always evaluate the score in a cross-validated manner to ensure that we neither include too few features (under-fitting) nor too many features (over-fitting)~\cite{gmrq,vamp-preprint}.
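A possible cross-validation scheme is to split whole trajectories into training and validation sets and average the score over several random splits; the helper below is a sketch under that assumption (function name, split strategy, and parameters are illustrative, not part of the manuscript), using the VAMP estimator's score method:

```python
import numpy as np
import pyemma

def vamp2_score_cv(data, lag, n_splits=5, dim=10):
    """Cross-validated VAMP-2 score for one featurization (sketch).

    data: list of per-trajectory feature arrays (assumes several trajectories);
    lag: lag time in trajectory steps.
    """
    scores = []
    for _ in range(n_splits):
        # random split of whole trajectories into train/validation sets
        idx = np.random.permutation(len(data))
        split = max(1, len(data) // 2)
        train = [data[i] for i in idx[:split]]
        test = [data[i] for i in idx[split:]]
        vamp = pyemma.coordinates.vamp(train, lag=lag, dim=dim)
        scores.append(vamp.score(test))  # VAMP-2 is the default scoring method
    return np.mean(scores), np.std(scores)

# compare candidate featurizations, e.g., backbone torsions vs. heavy-atom distances:
# mean_score, err = vamp2_score_cv(torsions_data, lag=5)
```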
To choose among three different molecular features reflecting protein structure,
Discrete jumps between the minima can be observed by visualizing the transformation of the first trajectory into these ICs (Fig.~\ref{fig:io-to-tica}d).
We thus assume that our TICA-transformed backbone torsion features describe one or more metastable processes.
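Plots in the style of Fig.~3c/d could be produced along the following lines; this is a sketch that assumes the `tica_output` list from the earlier dimension-reduction snippet:

```python
import numpy as np
import pyemma
import matplotlib.pyplot as plt

# pool all trajectories for the free-energy surface in the first two ICs (cf. Fig. 3c)
ics = np.concatenate(tica_output)
pyemma.plots.plot_free_energy(ics[:, 0], ics[:, 1])

# time series of the first two ICs of the first trajectory (cf. Fig. 3d)
plt.figure()
plt.plot(tica_output[0][:, :2])
plt.xlabel('time step')
plt.ylabel('IC value')
```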

+\begin{figure}
+\includegraphics{figure_3}
+\caption{Example analysis of the conformational dynamics of a pentapeptide backbone:
+(a)~The Trp-Leu-Ala-Leu-Leu pentapeptide in licorice representation~\cite{vmd}.
+(b)~The VAMP-2 score indicates which of the tested featurizations contains the highest kinetic variance.
+(c)~The sample free energy projected onto the first two time-lagged independent components (ICs) at lag time $\tau=0.5$~ns shows multiple minima and
+(d)~the time series of the first two ICs of the first trajectory show rare jumps.}
+\label{fig:io-to-tica}
+\end{figure}
+
\subsection{Discretization}

TICA yields a representation of our molecular simulation data with a reduced dimensionality,