Commit 769c9b0

addressing @franknoe's comments [ci skip]
1 parent 7f662de commit 769c9b0

2 files changed: +62 -44 lines changed

manuscript/literature.bib

Lines changed: 12 additions & 0 deletions

@@ -577,6 +577,18 @@ @article{oom-feliks
   URL = {https://doi.org/10.1063/1.4976518},
   DOI = {10.1063/1.4976518}
 }
+@article{hao-variational-koopman-models,
+  Author = {Hao Wu and Feliks N\"{u}ske and Fabian Paul and Stefan Klus and P{\'{e}}ter Koltai and Frank No{\'{e}}},
+  Title = {Variational Koopman models: Slow collective variables and molecular kinetics from short off-equilibrium simulations},
+  Journal = {J. Chem. Phys.},
+  Year = {2017},
+  Volume = {146},
+  Number = {15},
+  Pages = {154104},
+  Month = {apr},
+  URL = {https://doi.org/10.1063/1.4979344},
+  DOI = {10.1063/1.4979344}
+}
 @article{NoeClementiReview,
   Author = {Frank No{\'{e}} and Cecilia Clementi},
   Title = {Collective variables for the study of long-time kinetics from molecular trajectories: theory and methods},

manuscript/manuscript.tex

Lines changed: 50 additions & 44 deletions

@@ -81,7 +81,7 @@
 
 \section{Introduction}
 
-PyEMMA~\cite{pyemma} (\url{http://emma-project.org}) is a software for the analysis of molecular dynamics (MD) simulations using Markov state models~\cite{schuette-msm,singhal-msm-naming} (MSMs).
+PyEMMA~\cite{pyemma} (\url{http://emma-project.org}) is a software package for the analysis of molecular dynamics (MD) simulations using Markov state models~\cite{schuette-msm,singhal-msm-naming,noe2007jcp,chodera2007jcp,buchete-msm-2008} (MSMs).
 The package is written in Python (\url{http://python.org}), relies heavily on NumPy/SciPy~\cite{numpy,scipy}, and is compatible with the scikit-learn~\cite{sklearn} framework for machine learning.
 
 \subsection{Scope}
@@ -202,17 +202,24 @@ \subsection{Variational approach and TICA}
 A commonly used method for dimensionality reduction, TICA, is a particular implementation of the VAC.
 To apply TICA, we need to compute instantaneous ($\mathbf{C}(0)$) and time-lagged ($\mathbf{C}(\tau)$) covariance matrices with elements
 \begin{eqnarray}
-c_{ij}(0) & = & \left\langle \tilde{x}_i(t) \; \tilde{x}_j(t) \right\rangle_t \\
-c_{ij}(\tau) & = & \left\langle \tilde{x}_i(t) \; \tilde{x}_j(t + \tau) \right\rangle_t,
+c_{ij}(0) & = & \left\langle \tilde{x}_i(t) \; \tilde{x}_j(t) \right\rangle_t \label{eq:c0}\\
+c_{ij}(\tau) & = & \left\langle \tilde{x}_i(t) \; \tilde{x}_j(t + \tau) \right\rangle_t, \label{eq:ct}
 \end{eqnarray}
 where $\tilde{x}_i(t)$ denotes the $i^\textrm{th}$ feature at time $t$ after the mean has been removed.
-Then, we can solve the generalized eigenvalue problem
+By default, PyEMMA estimates (\ref{eq:c0},\ref{eq:ct}) using symmetrization~\cite{tica}.
+This symmetrization induces a significant bias when using non-equilibrium data from short trajectories~\cite{hao-variational-koopman-models}.
+As an alternative, the so-called Koopman reweighting estimator is available, which avoids this bias
+but comes at the cost of a larger variance~\cite{hao-variational-koopman-models}.
+
+After estimating the covariance matrices, TICA solves the generalized eigenvalue problem
 \begin{equation}
 \mathbf{C}(\tau) \, \mathbf{u}_i = \mathbf{C}(0) \, \lambda_i(\tau) \, \mathbf{u}_i \,, \quad i=1,\dots,n,
 \end{equation}
 to obtain independent component directions $\mathbf{u}_i$ which approximate the reaction coordinates of the system,
-where the pairs of eigenvalues and independent components are sorted in descending order,
-and we define a cumulative kinetic variance fraction
+where the pairs of eigenvalues and independent components are sorted in descending order.
+The contribution of each independent component to the kinetics can be measured
+via the kinetic distance~\cite{kinetic-maps},
+which assigns a cumulative variance fraction to the first $d$ independent components:
 \begin{equation}
 c_d = \frac{\sum_{i=2}^d \lambda_i^2(\tau)}{\textrm{TKV}},
 \end{equation}
@@ -223,12 +230,12 @@ \subsection{Variational approach and TICA}
 is the total kinetic variance explained by all $n$ features.
 
 If we further scale the independent components $\mathbf{u}_i$ by the corresponding eigenvalues $\lambda_i(\tau)$,
-we obtain a \emph{kinetic map} which is the default behavior in PyEMMA.
+we obtain a \emph{kinetic map}~\cite{kinetic-maps}, which is the default behavior in PyEMMA.
 
-Note, though, that TICA requires the data to be in equilibrium.
-To use TICA with nonequilibrium data, we can either symmetrize the time-lagged covariance matrix $\mathbf{C}(\tau)$
-or apply a Koopman reweighting~\cite{vamp-preprint}.
-For short trajectories and nonequilibrium data we generally recommend to use VAMP~\cite{vamp-preprint}.
+Note, though, that TICA requires the dynamics to be simulated under equilibrium conditions.
+To use TICA with nonequilibrium MD, e.g., subject to external forces,
+or simply to perform dimension reduction on short trajectory data without worrying about reweighting,
+we recommend using VAMP~\cite{vamp-preprint}.
 
 For all these approaches,
 dimensionality reduction is performed by projecting the (mean free) features $\tilde{\mathbf{x}}(t)$
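The symmetrized covariance estimation and generalized eigenvalue problem discussed in this hunk can be sketched in plain NumPy/SciPy. This is an illustrative toy on synthetic data, not PyEMMA's actual implementation; the AR(1) test signal, lag time, and all parameter values here are assumptions for demonstration only.

```python
import numpy as np
from scipy.linalg import eigh

def tica(X, lag):
    """Minimal TICA sketch: symmetrized covariances + generalized eigenproblem."""
    X = X - X.mean(axis=0)                     # remove the mean, as in the manuscript
    A, B = X[:-lag], X[lag:]
    C0 = 0.5 * (A.T @ A + B.T @ B) / len(A)    # instantaneous covariance C(0)
    Ct = 0.5 * (A.T @ B + B.T @ A) / len(A)    # symmetrized time-lagged covariance C(tau)
    evals, evecs = eigh(Ct, C0)                # solves C(tau) u = lambda C(0) u
    order = np.argsort(evals)[::-1]            # sort eigenpairs in descending order
    return evals[order], evecs[:, order]

rng = np.random.default_rng(0)
n, d = 50000, 3
X = rng.standard_normal((n, d))
for t in range(1, n):                          # coordinate 0 carries a slow AR(1) mode
    X[t, 0] = 0.99 * X[t - 1, 0] + 0.1 * rng.standard_normal()

evals, ics = tica(X, lag=10)
ckv = np.cumsum(evals**2) / np.sum(evals**2)   # cumulative kinetic variance fraction
Y = (X - X.mean(axis=0)) @ ics                 # projection onto the independent components
```

With this toy data, the slow AR(1) coordinate dominates the first independent component while the white-noise dimensions yield near-zero eigenvalues.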
@@ -242,7 +249,7 @@ \subsection{Hidden Markov state models}
 
 \begin{figure}
 \includegraphics[width=0.48\textwidth]{figure_1}
-\caption{The HMM transition matrix $\tilde{\mathbf{P}}(\tau)$ propagates the hidden state trajectory $\tilde{s}(t)$ (orange circles) and, at each time step $t$, the emission into the observable state $s(t)$ is governed by the emission probabilities $\bm{\chi}\left( s(t) \middle| \tilde{s}(t) \right)$.}
+\caption{The HMM transition matrix $\tilde{\mathbf{P}}(\tau)$ propagates the hidden state trajectory $\tilde{s}(t)$ (orange circles) and, at each time step $t$, the emission into the observable state $s(t)$ (cyan circles) is governed by the emission probabilities $\bm{\chi}\left( s(t) \middle| \tilde{s}(t) \right)$.}
 \label{fig:hmm-scheme}
 \end{figure}
 
@@ -252,24 +259,23 @@ \subsection{Hidden Markov state models}
 We illustrate this point in Notebook~07.
 
 An alternative, which is much less sensitive to poor discretization,
-is to estimate a hidden Markov model (HMM)~\cite{hmm-baum-welch-alg,hmm-tutorial,jhp-spectral-rate-theory,bhmm-preprint}.
-HMMs are less sensitive to the discretization error as they sidestep the assumption of Markovian dynamics in the discretized space.
+is to estimate a hidden Markov model (HMM)~\cite{hmm-baum-welch-alg,hmm-tutorial,jhp-spectral-rate-theory,noe-proj-hid-msm,bhmm-preprint}.
+HMMs are less sensitive to the discretization error as they sidestep the assumption of Markovian dynamics in the discretized space (illustrated in Fig.~\ref{fig:hmm-scheme}).
 Instead, HMMs assume that there is an underlying (hidden) dynamic process which is Markovian
-and gives rise to our observed data, e.g., the discretized trajectories (see Fig.~\ref{fig:hmm-scheme}).
+and gives rise to our observed data, e.g., the ($n$~states) discretized trajectories $s(t)$.
 This is a powerful principle as we know that there is indeed an underlying process which is Markovian:
 our molecular dynamics trajectories.
 
-To estimate an HMM, we need a spectral gap after the $m^\textrm{th}$ eigenvalue;
+To estimate an HMM, we need a spectral gap after the $m^\textrm{th}$ timescale;
 in practice, a timescale separation of $t_m \geq 2t_{m+1}$ is sufficient~\cite{pyemma}.
-Then, we can approximate the dynamics in the observed microstates ($\mathbf{P}$) at any lag time $k\tau$ via
-\begin{equation}
-\mathbf{P}(k\tau) \approx \bm{\Pi}^{-1} \bm{\chi}^\top \tilde{\bm{\Pi}} \tilde{\mathbf{P}}^k(\tau) \, \bm{\chi}.
-\end{equation}
-Here, the $\bm{\Pi}=\left[ \pi_1,\dots,\pi_n \right]$ is a diagonal matrix of the $n$ microstates' stationary probabilities,
-$\tilde{\bm{\Pi}}=\left[ \tilde{\pi}_1,\dots,\tilde{\pi}_m \right]$ is a diagonal matrix of the $m<n$ hidden states' stationary probabilities,
-$\tilde{\mathbf{P}}(\tau)$ is a transition matrix between the $m<n$ hidden states at lag time $\tau$,
-and $\bm{\chi}$ is an $m\times n$-dimensional row-stochastic matrix
-where each row encodes the emission probabilities into the $n$ microstates conditioned on being in the corresponding hidden state~\cite{noe-proj-hid-msm}.
+The HMM then consists of a transition matrix $\tilde{\mathbf{P}}(\tau)$ between $m<n$ hidden states
+and a row-stochastic matrix ($\bm{\chi}$) of probabilities $\chi\left( s \middle| \tilde{s} \right)$
+to emit the discrete state $s$ conditional on being in the hidden state $\tilde{s}$.
+
+We can further compute a reversal of the emission matrix $\bm{\chi}\in\mathbb{R}^{m \times n}$:
+the membership matrix $\mathbf{M}\in\mathbb{R}^{n \times m}$, which encodes
+a fuzzy assignment of each of the $n$ observable microstates $s$ to the $m$ hidden states $\tilde{s}$ and,
+thus, defines the \emph{coarse graining} of microstates.
 
 An HMM estimation always yields a model with a small number of (hidden) states
 where each state is considered to be metastable and,
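The hidden-state coarse graining described in this hunk (emission matrix versus membership matrix) can be sketched with a toy example. All numbers below are made up for illustration; only the Bayes-rule reversal of the emissions follows the text above.

```python
import numpy as np

# Toy coarse-graining sketch: m = 2 hidden states, n = 4 observed microstates.
P_hid = np.array([[0.95, 0.05],
                  [0.10, 0.90]])               # hidden transition matrix P~(tau)
chi = np.array([[0.70, 0.25, 0.04, 0.01],      # emission matrix chi (m x n),
                [0.02, 0.08, 0.50, 0.40]])     # each row sums to 1 (row-stochastic)

# stationary distribution of the hidden chain: left eigenvector of P~ for eigenvalue 1
w, V = np.linalg.eig(P_hid.T)
pi_hid = np.real(V[:, np.argmax(np.real(w))])
pi_hid /= pi_hid.sum()

# membership matrix M (n x m), the Bayesian "reversal" of the emissions:
# M[s, i] = chi[i, s] * pi_hid[i] / sum_j chi[j, s] * pi_hid[j]
M = (chi * pi_hid[:, None]).T
M /= M.sum(axis=1, keepdims=True)
```

Each row of `M` fuzzily assigns one observable microstate to the hidden states; microstate 0 belongs almost entirely to hidden state 0 here, microstate 3 to hidden state 1.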
@@ -311,15 +317,6 @@ \subsection{Software and installation}
 
 \section{PyEMMA tutorials}
 
-\begin{figure}[bt]
-\includegraphics[width=0.48\textwidth]{figure_2}
-\caption{The PyEMMA workflow: MD trajectories are processed and discretized (first row).
-A Markov state model is estimated from the resulting discrete trajectories and validated (middle row).
-By iterating between data processing and MSM estimation/validation,
-a dynamical model is obtained that can be analyzed (last row).}
-\label{fig:workflowchart}
-\end{figure}
-
 This tutorial consists of nine Jupyter notebooks which introduce the basic features of PyEMMA.
 The first notebook (00), which we will summarize in the following, showcases the entire estimation,
 validation, and analysis workflow for a small example system.
@@ -355,6 +352,15 @@ \subsection{The PyEMMA workflow}
 we chose to adopt a sequential approach where only the hyper-parameters of the current stage are optimized.
 This approach is not only computationally cheaper but also allows us to discuss the significance of the necessary modeling choices.
 
+\begin{figure}[bt]
+\includegraphics[width=0.48\textwidth]{figure_2}
+\caption{The PyEMMA workflow: MD trajectories are processed and discretized (first row).
+A Markov state model is estimated from the resulting discrete trajectories and validated (middle row).
+By iterating between data processing and MSM estimation/validation,
+a dynamical model is obtained that can be analyzed (last row).}
+\label{fig:workflowchart}
+\end{figure}
+
 \subsection{Feature selection}
 
 In Markov state modeling, our objective is to model the slow dynamics of a molecular process.
@@ -366,16 +372,6 @@ \subsection{Feature selection}
 provide a systematic means to quantitatively compare multiple representations of the simulation data.
 In particular, we can use a scalar score obtained using VAMP to directly compare the ability of certain features to capture slow dynamical modes in a particular molecular system.
 
-\begin{figure}
-\includegraphics{figure_3}
-\caption{Example analysis of the conformational dynamics of a pentapeptide backbone:
-(a)~The Trp-Leu-Ala-Leu-Leu pentapeptide in licorice representation~\cite{vmd}.
-(b)~The VAMP-2 score indicates which of the tested featurizations contains the highest kinetic variance.
-(c)~The sample free energy projected onto the first two time-lagged independent components (ICs) at lag time $\tau=0.5$~ns shows multiple minima and
-(d)~the time series of the first two ICs of the first trajectory show rare jumps.}
-\label{fig:io-to-tica}
-\end{figure}
-
 Here, we utilize the VAMP-2 score, which maximizes the kinetic variance contained in the features~\cite{kinetic-maps}.
 We should always evaluate the score in a cross-validated manner to ensure that we neither include too few features (under-fitting) nor too many features (over-fitting)~\cite{gmrq,vamp-preprint}.
 To choose among three different molecular features reflecting protein structure,
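The idea of ranking featurizations by a VAMP-2 score can be sketched in plain NumPy. This is a toy under stated assumptions, not PyEMMA's implementation: the synthetic features are made up, and the cross-validated evaluation recommended above is deliberately omitted for brevity.

```python
import numpy as np

def vamp2_score(X, lag, eps=1e-10):
    """Toy VAMP-2 score: 1 + squared singular values of the whitened
    time-lagged covariance K = C00^{-1/2} C0t Ctt^{-1/2}."""
    A = X[:-lag] - X[:-lag].mean(axis=0)
    B = X[lag:] - X[lag:].mean(axis=0)
    C00, C0t, Ctt = A.T @ A / len(A), A.T @ B / len(A), B.T @ B / len(B)
    def inv_sqrt(C):                           # symmetric inverse square root
        w, V = np.linalg.eigh(C)
        return V @ np.diag(np.maximum(w, eps) ** -0.5) @ V.T
    K = inv_sqrt(C00) @ C0t @ inv_sqrt(Ctt)
    s = np.linalg.svd(K, compute_uv=False)
    return 1.0 + np.sum(s ** 2)                # the "+1" counts the constant function

rng = np.random.default_rng(1)
n = 50000
slow = np.zeros(n)
for t in range(1, n):                          # one slow AR(1) mode
    slow[t] = 0.99 * slow[t - 1] + 0.1 * rng.standard_normal()

feat_good = np.column_stack([slow, rng.standard_normal(n)])  # contains the slow mode
feat_bad = rng.standard_normal((n, 2))                       # pure white noise

score_good = vamp2_score(feat_good, lag=10)
score_bad = vamp2_score(feat_bad, lag=10)
```

The featurization that contains the slow mode scores markedly higher than the pure-noise one, which is the comparison the manuscript uses to select among candidate features.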
@@ -402,6 +398,16 @@ \subsection{Dimensionality reduction}
 Discrete jumps between the minima can be observed by visualizing the transformation of the first trajectory into these ICs (Fig.~\ref{fig:io-to-tica}d).
 We thus assume that our TICA-transformed backbone torsion features describe one or more metastable processes.
 
+\begin{figure}
+\includegraphics{figure_3}
+\caption{Example analysis of the conformational dynamics of a pentapeptide backbone:
+(a)~The Trp-Leu-Ala-Leu-Leu pentapeptide in licorice representation~\cite{vmd}.
+(b)~The VAMP-2 score indicates which of the tested featurizations contains the highest kinetic variance.
+(c)~The sample free energy projected onto the first two time-lagged independent components (ICs) at lag time $\tau=0.5$~ns shows multiple minima and
+(d)~the time series of the first two ICs of the first trajectory show rare jumps.}
+\label{fig:io-to-tica}
+\end{figure}
+
 \subsection{Discretization}
 
 TICA yields a representation of our molecular simulation data with a reduced dimensionality,
