and the more general variational approach for Markov processes (VAMP)~\cite{vamp-preprint}
372
372
provide a systematic means to quantitatively compare multiple representations of the simulation data.
373
373
In particular, we can use a scalar score obtained using VAMP to directly compare the ability of certain features to capture slow dynamical modes in a particular molecular system.
374
+
In Notebook (01), we present in detail how to extract features from MD datasets and how to systematically compare them.
374
375
375
-
Here, we utilize the VAMP-2 score, which maximizes the kinetic variance contained in the features~\cite{kinetic-maps}.
376
+
Throughout this tutorial, we utilize the VAMP-2 score, which maximizes the kinetic variance contained in the features~\cite{kinetic-maps}.
376
377
We should always evaluate the score in a cross-validated manner to ensure that we include neither too few features (under-fitting) nor too many features (over-fitting)~\cite{gmrq,vamp-preprint}.
377
378
To choose among three different molecular features reflecting protein structure,
378
379
we compute the (cross-validated) VAMP-2 score (Notebook 00).
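As a rough illustration of what such a score measures, the following NumPy-only sketch computes a VAMP-2 score as the squared Frobenius norm of the whitened time-lagged covariance matrix. It is not PyEMMA's implementation: the helper names `inv_sqrt` and `vamp2_score` are hypothetical, the score is evaluated on a single dataset without cross-validation, and conventions (for example, whether the constant function is counted) may differ from what a library reports.

```python
import numpy as np


def inv_sqrt(mat, eps=1e-10):
    """Inverse square root of a symmetric matrix, dropping near-singular directions."""
    vals, vecs = np.linalg.eigh(mat)
    keep = vals > eps
    return (vecs[:, keep] * vals[keep] ** -0.5) @ vecs[:, keep].T


def vamp2_score(data, lag):
    """VAMP-2 score of an (n_frames, n_features) array at a lag given in frames."""
    # Mean-free instantaneous and time-lagged data
    x0 = data[:-lag] - data[:-lag].mean(axis=0)
    xt = data[lag:] - data[lag:].mean(axis=0)
    n = len(x0)
    c00 = x0.T @ x0 / n  # instantaneous covariance
    ctt = xt.T @ xt / n  # covariance of the time-lagged data
    c0t = x0.T @ xt / n  # time-lagged cross-covariance
    koopman = inv_sqrt(c00) @ c0t @ inv_sqrt(ctt)
    # Squared Frobenius norm equals the sum of squared singular values
    return float(np.sum(koopman ** 2))


# Toy comparison (assumed data): a slowly relaxing feature versus white noise
rng = np.random.default_rng(0)
n_frames = 5000
noise = rng.normal(size=(n_frames, 1))
slow = np.zeros((n_frames, 1))
for t in range(1, n_frames):
    slow[t] = 0.99 * slow[t - 1] + 0.1 * noise[t]

score_slow = vamp2_score(slow, lag=10)
score_noise = vamp2_score(noise, lag=10)
```

In this toy setting the slowly relaxing feature should receive the higher score, which is the sense in which the score ranks features by the slow dynamics they capture.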
Discrete jumps between the minima can be observed by visualizing the transformation of the first trajectory into these ICs (Fig.~\ref{fig:io-to-tica}d).
399
400
We thus assume that our TICA-transformed backbone torsion features describe one or more metastable processes.
400
401
402
+
We demonstrate how to apply TICA, suggest how to interpret the projected coordinates, and compare the results to other dimension reduction techniques in Notebook (02).
403
+
401
404
\begin{figure}
402
405
\includegraphics{figure_3}
403
406
\caption{Example analysis of the conformational dynamics of a pentapeptide backbone:
@@ -414,7 +417,8 @@ \subsection{Discretization}
414
417
which can greatly facilitate the decomposition of our system into the discrete Markovian states necessary for MSM estimation.
415
418
Here, we use the $k$-means algorithm to segment the four-dimensional TICA space into $k=75$ cluster centers.
416
419
The number of cluster centers has been chosen to optimize the VAMP-2 score in a manner identical to how the feature selection was carried out above,
417
-
which is shown in the showcase notebook (00).
420
+
which is shown in the showcase Notebook (00).
421
+
A detailed comparison between different clustering techniques is provided in Notebook (02).
418
422
419
423
\subsection{MSM estimation and validation}
420
424
@@ -447,6 +451,8 @@ \subsection{MSM estimation and validation}
447
451
and shows that the MSM we have estimated at lag time $\tau=0.5$~ns indeed predicts the
448
452
long-timescale behavior of our system within error (blue/shaded area).
449
453
454
+
In Notebook (03), we demonstrate in detail how to estimate and validate MSMs with PyEMMA.
455
+
450
456
\subsection{Analyzing the MSM}
451
457
452
458
\begin{figure}
@@ -526,6 +532,8 @@ \subsection{Analyzing the MSM}
526
532
The transition network can be additionally visualized by plotting representative structures of the five metastable states $\mathcal{S}_{(1-5)}$ according to their committor probability (Fig.~\ref{fig:tpt-network}).
527
533
It is easy to see from this depiction that the dominant pathway from $\mathcal{S}_2$ to $\mathcal{S}_4$ proceeds through $\mathcal{S}_5$.
528
534
535
+
More details about (spectral) properties of MSMs and how to analyze them with PyEMMA are discussed in Notebook (04) and Notebook (05).
536
+
529
537
\subsection{Connecting the MSM with experimental data}
530
538
531
539
\begin{figure}
@@ -560,6 +568,8 @@ \subsection{Connecting the MSM with experimental data}
560
568
We see that the predicted relaxation signal has a much larger amplitude for the nonequilibrium initialization,
561
569
making it more likely to be experimentally measurable.
562
570
571
+
In addition to demonstrating the above in detail, Notebook (06) shows how to compute J-couplings and dynamic fingerprints from MSMs.
572
+
563
573
\subsection{Summary}
564
574
565
575
In this section, we have summarized how to conduct an MSM-based analysis of biomolecular dynamics data using PyEMMA.
@@ -587,6 +597,7 @@ \subsection{Modeling large systems}
587
597
we explain how to deal with those in the tutorials (Notebook 00 and 02).
588
598
589
599
More details on how to model complex systems with the techniques presented here are described, e.g., by~\cite{plattner_protein_2015,plattner_complete_2017}.
600
+
We further examine some symptoms that may indicate problematic or difficult datasets, and demonstrate how to deal with them in Notebook (08).
notebooks/01-data-io-and-featurization.ipynb (4 additions, 4 deletions)
@@ -207,7 +207,7 @@
207
207
"metadata": {},
208
208
"source": [
209
209
"In PyEMMA, the `featurizer` is a central object that incorporates the system's topology.\n",
210
-
"We start by creating it object using the topology file.\n",
210
+
"We start by creating it using the topology file.\n",
211
211
"Features are now easily computed by adding the target feature.\n",
212
212
"If no feature is added, the featurizer will extract Cartesian coordinates."
213
213
]
@@ -395,7 +395,7 @@
395
395
"cell_type": "markdown",
396
396
"metadata": {},
397
397
"source": [
398
-
"We note that the distribution in backbone torsion space contains several basins that will be assigned to metastable states in follow-up notebooks.\n",
398
+
"We note that the distribution in backbone torsion space contains several basins that will be assigned to metastable states in follow-up notebooks ([Notebook 05 ➜ 📓](05-pcca-tpt.ipynb), [Notebook 07 ➜ 📓](07-hidden-markov-state-models.ipynb)).\n",
399
399
"Again, the free energy plot only depicts a pseudo free energy surface of the sampled data and was not re-weighted to equilibrium.\n",
400
400
"\n",
401
401
"Let us look at a different featurization example and load the positions of all heavy atoms instead.\n",
@@ -519,7 +519,7 @@
519
519
"Using `load()`, we put the full data into memory.\n",
520
520
"This is possible for all examples in this tutorial.\n",
521
521
"\n",
522
-
"Many real world apllications, though, require more memory than your workstation might provide.\n",
522
+
"Many real world applications, though, require more memory than your workstation might provide.\n",
523
523
"For these cases, you should use the `source()` function:"
Copy file name to clipboardExpand all lines: notebooks/02-dimension-reduction-and-discretization.ipynb
+5-4Lines changed: 5 additions & 4 deletions
Original file line number
Diff line number
Diff line change
@@ -8,8 +8,7 @@
8
8
"\n",
9
9
"<a rel=\"license\" href=\"http://creativecommons.org/licenses/by/4.0/\"><img alt=\"Creative Commons Licence\" style=\"border-width:0\" src=\"https://i.creativecommons.org/l/by/4.0/88x31.png\" title='This work is licensed under a Creative Commons Attribution 4.0 International License.' align=\"right\"/></a>\n",
10
10
"\n",
11
-
"In this notebook, we will cover how to perform dimension reduction and discretization of molecular dynamics data. We also repeat data loading and visualization tasks from the previous notebook\n",
"In this notebook, we will cover how to perform dimension reduction and discretization of molecular dynamics data. We also repeat data loading and visualization tasks from [Notebook 01 ➜ 📓](01-data-io-and-featurization.ipynb).\n",
"⚠️ For large datasets we also offer a mini batch version of $k$-means which has the same semantics as the original method but trains the centers on subsets of your data.\n",
154
153
"This tutorial does not cover this case, but you should keep in mind that $k$-means requires your low dimensional space to fit into your main memory.\n",
155
154
"\n",
155
+
"As the number of cluster centers has to be chosen by the modeler, we will analyze its implications on the MSM analysis in [Notebook 03 ➜ 📓](03-msm-estimation-and-validation.ipynb). A systematic approach to choose this number is proposed in [Notebook 00 ➜ 📓](00-pentapeptide-showcase.ipynb).\n",
156
+
"\n",
156
157
"The main result of a discretization for Markov modeling, however,\n",
157
158
"is not the set of centers but the time series of discrete states.\n",
158
159
"These are accessible via the `dtrajs` attribute of any clustering object:"
@@ -387,7 +388,7 @@
387
388
"cell_type": "markdown",
388
389
"metadata": {},
389
390
"source": [
390
-
"Following the previous example, we perform a $k$-means ($100$ centers, stride of $5$) and a regspace clustering ($0.3$ radians center distance) on the full two-dimensional data set and visualize the obtained centers:"
391
+
"Following the previous example, we perform a $k$-means ($100$ centers, stride of $5$) and a regspace clustering ($0.3$ radians center distance) on the full two-dimensional data set and visualize the obtained centers. In [Notebook 03 ➜ 📓](03-msm-estimation-and-validation.ipynb), we show the effect of different numbers of cluster centers on MSM estimation."
notebooks/03-msm-estimation-and-validation.ipynb (4 additions, 3 deletions)
@@ -240,7 +240,8 @@
240
240
"\n",
241
241
"Please note though that this ITS convergence analysis is based on the assumption that $200$ $k$-means centers are sufficient to discretize the dynamics.\n",
242
242
"In order to study the influence of the clustering on the ITS convergence,\n",
243
-
"we repeat the clustering and ITS convergence analysis for various number of cluster centers:"
243
+
"we repeat the clustering and ITS convergence analysis for various number of cluster centers.\n",
244
+
"For the sake of simplicity, we will restrict ourselves to the $k$-means algorithm; alternative clustering methods are presented in [Notebook 02 ➜ 📓](02-dimension-reduction-and-discretization.ipynb)."
244
245
]
245
246
},
246
247
{
@@ -291,7 +292,7 @@
291
292
"Now, let's continue with the alanine dipeptide system.\n",
292
293
"We estimate an MSM at lag time $10$ ps and, given that we have three slow processes, perform a CK test for four metastable states.\n",
293
294
"\n",
294
-
"⚠️ In general, the number of metastable states is a modeler's choice and will be explained in further notebooks."
295
+
"⚠️ In general, the number of metastable states is a modeler's choice and will be explained in more detail in [Notebook 04 ➜ 📓](04-msm-analysis.ipynb) and [Notebook 07 ➜ 📓](07-hidden-markov-state-models.ipynb)."
notebooks/04-msm-analysis.ipynb (3 additions, 3 deletions)
@@ -237,7 +237,7 @@
237
237
"cell_type": "markdown",
238
238
"metadata": {},
239
239
"source": [
240
-
"The stationary distribution can also be used to correct the `pyemma.plots.plot_free_energy()` function.\n",
240
+
"The stationary distribution can also be used to correct the `pyemma.plots.plot_free_energy()` function that we used to visualize this dataset in [Notebook 01 ➜ 📓](01-data-io-and-featurization.ipynb).\n",
241
241
"This might be necessary if the data points are not sampled from global equilibrium.\n",
242
242
"\n",
243
243
"In this case, we assign the weight of the corresponding discrete state to each data point and pass this information to the plotting function via its `weights` parameter:"
@@ -339,7 +339,7 @@
339
339
"cell_type": "markdown",
340
340
"metadata": {},
341
341
"source": [
342
-
"We now save the model to do more analyses with PCCA++ and TPT in the follow-up notebook:"
342
+
"We now save the model to do more analyses with PCCA++ and TPT in [Notebook 05 ➜ 📓](05-pcca-tpt.ipynb):"
notebooks/05-pcca-tpt.ipynb (6 additions, 4 deletions)
@@ -136,7 +136,7 @@
136
136
"Since PCCA++, in simplified words, does a clustering in eigenvector space and the first eigenvector separated these states already, the nice separation comes to no surprise.\n",
137
137
"\n",
138
138
"It is important to note, though, that PCCA++ in general does not yield a coarse transition matrix.\n",
139
-
"How to obtain this will be covered in the HMM notebook.\n",
139
+
"How to obtain this will be covered in [Notebook 07 ➜ 📓](07-hidden-markov-state-models.ipynb).\n",
140
140
"However, we can compute mean first passage times (MFPTs) and equilibrium probabilities on the metastable sets and extract representative structures.\n",
141
141
"\n",
142
142
"The stationary probability of metastable states can simply be computed by summing over all of its micro-states\n",
@@ -436,7 +436,8 @@
436
436
"Have you noticed how well the metastable state coloring agrees with the eigenvector visualization of the three slowest processes?\n",
437
437
"\n",
438
438
"If we could afford a shorter lag time, we might even be able to resolve more processes and, thus,\n",
439
-
"subdivide the metastable states three (fifth slowest process) and zero (sixth slowest process).\n",
439
+
"subdivide the metastable states three and four.\n",
440
+
"We show how to do this with HMMs in [Notebook 07 ➜ 📓](07-hidden-markov-state-models.ipynb).\n",
440
441
"\n",
441
442
"Now we define a small function to visualize samples of metastable states with NGLView."
442
443
]
@@ -620,7 +621,7 @@
620
621
"#### Exercise 1\n",
621
622
"\n",
622
623
"Define a `featurizer` that loads the heavy atom coordinates and load the data into memory.\n",
623
-
"Also load the TICA object from the previous notebook to transform the featurized data.\n",
624
+
"Also load the TICA object from [Notebook 04 ➜ 📓](04-msm-analysis.ipynb) to transform the featurized data.\n",
624
625
"Further, the estimated MSM, Bayesian MSM, and Cluster objects should be loaded from disk. "
notebooks/07-hidden-markov-state-models.ipynb (3 additions, 3 deletions)
@@ -684,7 +684,7 @@
684
684
"metadata": {},
685
685
"source": [
686
686
"As we see, in addition to the properties described above, HMMs provide the same analysis tools as MSMs.\n",
687
-
"For example, eigenvectors and mean first passage times can be extracted as described in previous notebooks. \n",
687
+
"For example, eigenvectors and mean first passage times can be extracted as described in [Notebook 04 ➜ 📓](04-msm-analysis.ipynb) and [Notebook 05 ➜ 📓](05-pcca-tpt.ipynb). \n",
688
688
"\n",
689
689
"Let us now repeat this approach again for another featurization:\n",
690
690
"we already know that it is possible to resolve six metastable states (five slow processes) using an HMM estimated on a discretization of the backbone torsions.\n",
@@ -914,7 +914,7 @@
914
914
"\n",
915
915
"Let us now repeat the analysis of our pentapeptide using an HMM.\n",
916
916
"\n",
917
-
"We fetch the pentapeptide data set and prepare the discrete trajectories as explained in the showcase notebook.\n",
917
+
"We fetch the pentapeptide data set and prepare the discrete trajectories as explained in [Notebook 00 ➜ 📓](00-pentapeptide-showcase.ipynb).\n",
918
918
"There, we already learned that $5$ metastable states are a good choice for our model.\n",
919
919
"According to our implied timescales plot, we can resolve four processes for up to $2.5$ ns in our data. "
notebooks/08-common-problems.ipynb (5 additions, 5 deletions)
@@ -700,7 +700,7 @@
700
700
"source": [
701
701
"Congratulations, we have estimated a well-validated MSM.\n",
702
702
"The only question remaining is: What does it actually describe?\n",
703
-
"For this, we usually extract representative structures as described in a previous notebook.\n",
703
+
"For this, we usually extract representative structures as described in [Notebook 00 ➜ 📓](00-pentapeptide-showcase.ipynb).\n",
704
704
"We will not do this here but look at the metastable trajectories instead.\n",
705
705
"\n",
706
706
"#### What could be wrong about it?\n",
@@ -781,7 +781,7 @@
781
781
"the TICA lag time was deliberately chosen way too high.\n",
782
782
"That's easy to fix.\n",
783
783
"\n",
784
-
"Let's now have a look at how the metastable trajectories should look like for a decent model such as the one estimated in the previous notebooks.\n",
784
+
"Let's now have a look at how the metastable trajectories should look for a decent model such as the one estimated in [Notebook 05 ➜ 📓](05-pcca-tpt.ipynb).\n",
785
785
"We will take the same input data,\n",
786
786
"do a TICA transform with a realistic lag time of $10$ ps,\n",
787
787
"and coarse grain into $2$ metastable states in order to compare with the example above."
@@ -883,9 +883,9 @@
883
883
"- connected but poorly sampled trajectories and how convergence looks in this case,\n",
884
884
"- ill-conducted TICA analysis and what it yields.\n",
885
885
"\n",
886
-
"The most important message from this tutorial is that histogramsare not a means of identifying metastability or connectedness.\n",
887
-
"One should not forget about the underlying trajectories which should play the role of the ground truth to be modeled.\n",
888
-
"Histograms only help us to understand this ground truth but are not necessarily meaningful."
886
+
"The most important lesson from this tutorial is that histograms, which are usually calculated in a projected space, are not a sufficient means of identifying metastability or connectedness.\n",
887
+
"It is crucial to remember that the underlying trajectories play the role of ground truth for the model. \n",
888
+
"Ultimately, histograms only help us to understand this ground truth but cannot provide a complete picture."