Merge pull request #173 from markovmodel/th_rev

marscher · web-flow · commit 3ab81617e37b · 2018-11-14T15:57:07.000+01:00
linking between NB 01-08 & from manuscript to NB 01-08

[ci skip]
diff --git a/manuscript/manuscript.tex b/manuscript/manuscript.tex
@@ -371,8 +371,9 @@ \subsection{Feature selection}
 and the more general variational approach for Markov processes (VAMP)~\cite{vamp-preprint}
 provide a systematic means to quantitatively compare multiple representations of the simulation data.
 In particular, we can use a scalar score obtained using VAMP to directly compare the ability of certain features to capture slow dynamical modes in a particular molecular system.
+In Notebook (01), we present in detail how to extract features from MD datasets and how to systematically compare them.
 
-Here, we utilize the VAMP-2 score, which maximizes the kinetic variance contained in the features~\cite{kinetic-maps}.
+Throughout this tutorial, we utilize the VAMP-2 score, which maximizes the kinetic variance contained in the features~\cite{kinetic-maps}.
 We should always evaluate the score in a cross-validated manner to ensure that we include neither too few features (under-fitting) nor too many features (over-fitting)~\cite{gmrq,vamp-preprint}.
 To choose among three different molecular features reflecting protein structure,
 we compute the (cross-validated) VAMP-2 score (Notebook 00).
@@ -398,6 +399,8 @@ \subsection{Dimensionality reduction}
 Discrete jumps between the minima can be observed by visualizing the transformation of the first trajectory into these ICs (Fig.~\ref{fig:io-to-tica}d).
 We thus assume that our TICA-transformed backbone torsion features describe one or more metastable processes.
 
+We demonstrate how to apply TICA, suggest how to interpret the projected coordinates, and compare the results to other dimension reduction techniques in Notebook (02).
+
 \begin{figure}
 \includegraphics{figure_3}
 \caption{Example analysis of the conformational dynamics of a pentapeptide backbone:
@@ -414,7 +417,8 @@ \subsection{Discretization}
 which can greatly facilitate the decomposition of our system into the discrete Markovian states necessary for MSM estimation.
 Here, we use the $k$-means algorithm to segment the four dimensional TICA space into $k=75$ cluster centers.
 The number of cluster centers has been chosen to optimize the VAMP-2 score in a manner identical to how the feature selection was carried out above,
-which is shown in the showcase notebook (00).
+which is shown in the showcase Notebook (00).
+A detailed comparison between different clustering techniques is provided in Notebook (02).
 
 \subsection{MSM estimation and validation}
 
@@ -447,6 +451,8 @@ \subsection{MSM estimation and validation}
 and shows that the MSM we have estimated at lag time $\tau=0.5$~ns indeed predicts the
 long-timescale behavior of our system within error (blue/shaded area).
 
+In Notebook (03), we demonstrate in detail how to estimate and validate MSMs with PyEMMA.
+
 \subsection{Analyzing the MSM}
 
 \begin{figure}
@@ -526,6 +532,8 @@ \subsection{Analyzing the MSM}
 The transition network can be additionally visualized by plotting representative structures of the five metastable states $\mathcal{S}_{(1-5)}$ according to their committor probability (Fig.~\ref{fig:tpt-network}).
 It is easy to see from this depiction that the dominant pathway from $\mathcal{S}_2$ to $\mathcal{S}_4$ proceeds through $\mathcal{S}_5$.
 
+More details about (spectral) properties of MSMs and how to analyze them with PyEMMA are discussed in Notebook (04) and Notebook (05).
+
 \subsection{Connecting the MSM with experimental data}
 
 \begin{figure}
@@ -560,6 +568,8 @@ \subsection{Connecting the MSM with experimental data}
 We see that the predicted relaxation signal has a much larger amplitude for the nonequilibrium initialization,
 making it more likely to be experimentally measurable.
 
+Notebook (06) revisits the above in detail and additionally demonstrates how to compute J-couplings and dynamic fingerprints from MSMs.
+
 \subsection{Summary}
 
 In this section, we have summarized how to conduct an MSM-based analysis of biomolecular dynamics data using PyEMMA.
@@ -587,6 +597,7 @@ \subsection{Modeling large systems}
 we explain how to deal with those in the tutorials (Notebook 00 and 02).
 
 More details on how to model complex systems with the techniques presented here are described, e.g., by~\cite{plattner_protein_2015,plattner_complete_2017}.
+In Notebook (08), we further examine symptoms that may indicate problematic or difficult datasets and demonstrate how to deal with them.
 
 \subsection{Advanced Methods}
 
diff --git a/notebooks/00-pentapeptide-showcase.ipynb b/notebooks/00-pentapeptide-showcase.ipynb
@@ -1778,7 +1778,7 @@
    "name": "python",
    "nbconvert_exporter": "python",
    "pygments_lexer": "ipython3",
-   "version": "3.6.6"
+   "version": "3.6.5"
   },
   "toc": {
    "base_numbering": 1,
diff --git a/notebooks/01-data-io-and-featurization.ipynb b/notebooks/01-data-io-and-featurization.ipynb
@@ -207,7 +207,7 @@
    "metadata": {},
    "source": [
     "In PyEMMA, the `featurizer` is a central object that incorporates the system's topology.\n",
-    "We start by creating it object using the topology file.\n",
+    "We start by creating it using the topology file.\n",
     "Features are now easily computed by adding the target feature.\n",
     "If no feature is added, the featurizer will extract Cartesian coordinates."
    ]
@@ -395,7 +395,7 @@
    "cell_type": "markdown",
    "metadata": {},
    "source": [
-    "We note that the distribution in backbone torsion space contains several basins that will be assigned to metastable states in follow-up notebooks.\n",
+    "We note that the distribution in backbone torsion space contains several basins that will be assigned to metastable states in follow-up notebooks ([Notebook 05 ➜ 📓](05-pcca-tpt.ipynb), [Notebook 07 ➜ 📓](07-hidden-markov-state-models.ipynb)).\n",
     "Again, the free energy plot only depicts a pseudo free energy surface of the sampled data and was not re-weighted to equilibrium.\n",
     "\n",
     "Let us look at a different featurization example and load the positions of all heavy atoms instead.\n",
@@ -519,7 +519,7 @@
     "Using `load()`, we put the full data into memory.\n",
     "This is possible for all examples in this tutorial.\n",
     "\n",
-    "Many real world apllications, though, require more memory than your workstation might provide.\n",
+    "Many real world applications, though, require more memory than your workstation might provide.\n",
     "For these cases, you should use the `source()` function:"
    ]
   },
@@ -1086,7 +1086,7 @@
    "name": "python",
    "nbconvert_exporter": "python",
    "pygments_lexer": "ipython3",
-   "version": "3.6.6"
+   "version": "3.6.5"
   },
   "toc": {
    "base_numbering": 1,
diff --git a/notebooks/02-dimension-reduction-and-discretization.ipynb b/notebooks/02-dimension-reduction-and-discretization.ipynb
@@ -8,8 +8,7 @@
     "\n",
     "<a rel=\"license\" href=\"http://creativecommons.org/licenses/by/4.0/\"><img alt=\"Creative Commons Licence\" style=\"border-width:0\" src=\"https://i.creativecommons.org/l/by/4.0/88x31.png\" title='This work is licensed under a Creative Commons Attribution 4.0 International License.' align=\"right\"/></a>\n",
     "\n",
-    "In this notebook, we will cover how to perform dimension reduction and discretization of molecular dynamics data. We also repeat data loading and visualization tasks from the previous notebook\n",
-    "([01 ➜ 📓](01-data-io-and-featurization.ipynb)).\n",
+    "In this notebook, we will cover how to perform dimension reduction and discretization of molecular dynamics data. We also repeat data loading and visualization tasks from [Notebook 01 ➜ 📓](01-data-io-and-featurization.ipynb).\n",
     "\n",
     "Maintainers: [@cwehmeyer](https://github.com/cwehmeyer), [@marscher](https://github.com/marscher), [@thempel](https://github.com/thempel), [@psolsson](https://github.com/psolsson)\n",
     "\n",
@@ -153,6 +152,8 @@
     "⚠️ For large datasets we also offer a mini batch version of $k$-means which has the same semantics as the original method but trains the centers on subsets of your data.\n",
     "This tutorial does not cover this case, but you should keep in mind that $k$-means requires your low dimensional space to fit into your main memory.\n",
     "\n",
+    "As the number of cluster centers has to be chosen by the modeler, we will analyze its impact on the MSM analysis in [Notebook 03 ➜ 📓](03-msm-estimation-and-validation.ipynb). A systematic approach for choosing this number is proposed in [Notebook 00 ➜ 📓](00-pentapeptide-showcase.ipynb).\n",
+    "\n",
     "The main result of a discretization for Markov modeling, however,\n",
     "is not the set of centers but the time series of discrete states.\n",
     "These are accessible via the `dtrajs` attribute of any clustering object:"
@@ -387,7 +388,7 @@
    "cell_type": "markdown",
    "metadata": {},
    "source": [
-    "Following the previous example, we perform a $k$-means ($100$ centers, stride of $5$) and a regspace clustering ($0.3$ radians center distance) on the full two-dimensional data set and visualize the obtained centers:"
+    "Following the previous example, we perform a $k$-means ($100$ centers, stride of $5$) and a regspace clustering ($0.3$ radians center distance) on the full two-dimensional data set and visualize the obtained centers. In [Notebook 03 ➜ 📓](03-msm-estimation-and-validation.ipynb), we show the effect of different numbers of cluster centers on MSM estimation."
    ]
   },
   {
@@ -1338,7 +1339,7 @@
    "name": "python",
    "nbconvert_exporter": "python",
    "pygments_lexer": "ipython3",
-   "version": "3.6.6"
+   "version": "3.6.5"
   },
   "toc": {
    "base_numbering": 1,
diff --git a/notebooks/03-msm-estimation-and-validation.ipynb b/notebooks/03-msm-estimation-and-validation.ipynb
@@ -240,7 +240,8 @@
     "\n",
     "Please note though that this ITS convergence analysis is based on the assumption that $200$ $k$-means centers are sufficient to discretize the dynamics.\n",
     "In order to study the influence of the clustering on the ITS convergence,\n",
-    "we repeat the clustering and ITS convergence analysis for various number of cluster centers:"
+    "we repeat the clustering and ITS convergence analysis for various numbers of cluster centers.\n",
+    "For the sake of simplicity, we will restrict ourselves to the $k$-means algorithm; alternative clustering methods are presented in [Notebook 02 ➜ 📓](02-dimension-reduction-and-discretization.ipynb)."
    ]
   },
   {
@@ -291,7 +292,7 @@
     "Now, let's continue with the alanine dipeptide system.\n",
     "We estimate an MSM at lag time $10$ ps and, given that we have three slow processes, perform a CK test for four metastable states.\n",
     "\n",
-    "⚠️ In general, the number of metastable states is a modeler's choice and will be explained in further notebooks."
+    "⚠️ In general, the number of metastable states is a modeler's choice and will be explained in more detail in [Notebook 04 ➜ 📓](04-msm-analysis.ipynb) and [Notebook 07 ➜ 📓](07-hidden-markov-state-models.ipynb)."
    ]
   },
   {
@@ -621,7 +622,7 @@
    "name": "python",
    "nbconvert_exporter": "python",
    "pygments_lexer": "ipython3",
-   "version": "3.6.6"
+   "version": "3.6.5"
   },
   "toc": {
    "base_numbering": 1,
diff --git a/notebooks/04-msm-analysis.ipynb b/notebooks/04-msm-analysis.ipynb
@@ -237,7 +237,7 @@
    "cell_type": "markdown",
    "metadata": {},
    "source": [
-    "The stationary distribution can also be used to correct the `pyemma.plots.plot_free_energy()` function.\n",
+    "The stationary distribution can also be used to correct the `pyemma.plots.plot_free_energy()` function that we used to visualize this dataset in [Notebook 01 ➜ 📓](01-data-io-and-featurization.ipynb).\n",
     "This might be necessary if the data points are not sampled from global equilibrium.\n",
     "\n",
     "In this case, we assign the weight of the corresponding discrete state to each data point and pass this information to the plotting function via its `weights` parameter:"
@@ -339,7 +339,7 @@
    "cell_type": "markdown",
    "metadata": {},
    "source": [
-    "We now save the model to do more analyses with PCCA++ and TPT in the follow-up notebook:"
+    "We now save the model to do more analyses with PCCA++ and TPT in [Notebook 05 ➜ 📓](05-pcca-tpt.ipynb):"
    ]
   },
   {
@@ -841,7 +841,7 @@
    "name": "python",
    "nbconvert_exporter": "python",
    "pygments_lexer": "ipython3",
-   "version": "3.6.6"
+   "version": "3.6.5"
   },
   "toc": {
    "base_numbering": 1,
diff --git a/notebooks/05-pcca-tpt.ipynb b/notebooks/05-pcca-tpt.ipynb
@@ -136,7 +136,7 @@
     "Since PCCA++, in simplified words, does a clustering in eigenvector space and the first eigenvector separated these states already, the nice separation comes to no surprise.\n",
     "\n",
     "It is important to note, though, that PCCA++ in general does not yield a coarse transition matrix.\n",
-    "How to obtain this will be covered in the HMM notebook.\n",
+    "How to obtain this will be covered in [Notebook 07 ➜ 📓](07-hidden-markov-state-models.ipynb).\n",
     "However, we can compute mean first passage times (MFPTs) and equilibrium probabilities on the metastable sets and extract representative structures.\n",
     "\n",
     "The stationary probability of metastable states can simply be computed by summing over all of its micro-states\n",
@@ -436,7 +436,8 @@
     "Have you noticed how well the metastable state coloring agrees with the eigenvector visualization of the three slowest processes?\n",
     "\n",
     "If we could afford a shorter lag time, we might even be able to resolve more processes and, thus,\n",
-    "subdivide the metastable states three (fifth slowest process) and zero (sixth slowest process).\n",
+    "subdivide the metastable states three and four.\n",
+    "We show how to do this with HMMs in [Notebook 07 ➜ 📓](07-hidden-markov-state-models.ipynb).\n",
     "\n",
     "Now we define a small function to visualize samples of metastable states with NGLView."
    ]
@@ -620,7 +621,7 @@
     "#### Exercise 1\n",
     "\n",
     "Define a `featurizer` that loads the heavy atom coordinates and load the data into memory.\n",
-    "Also load the TICA object from the previous notebook to transform the featurized data.\n",
+    "Also load the TICA object from [Notebook 04 ➜ 📓](04-msm-analysis.ipynb) to transform the featurized data.\n",
     "Further, the estimated MSM, Bayesian MSM, and Cluster objects should be loaded from disk. "
    ]
   },
@@ -695,6 +696,7 @@
    "cell_type": "code",
    "execution_count": null,
    "metadata": {
+    "scrolled": true,
     "solution2": "hidden",
     "solution2_first": true
    },
@@ -993,7 +995,7 @@
    "name": "python",
    "nbconvert_exporter": "python",
    "pygments_lexer": "ipython3",
-   "version": "3.6.6"
+   "version": "3.6.5"
   },
   "toc": {
    "base_numbering": 1,
diff --git a/notebooks/06-expectations-and-observables.ipynb b/notebooks/06-expectations-and-observables.ipynb
@@ -966,7 +966,7 @@
    "name": "python",
    "nbconvert_exporter": "python",
    "pygments_lexer": "ipython3",
-   "version": "3.6.6"
+   "version": "3.6.5"
   },
   "toc": {
    "base_numbering": 1,
diff --git a/notebooks/07-hidden-markov-state-models.ipynb b/notebooks/07-hidden-markov-state-models.ipynb
@@ -684,7 +684,7 @@
    "metadata": {},
    "source": [
     "As we see, in addition to the properties described above, HMMs provide the same analysis tools as MSMs.\n",
-    "For example, eigenvectors and mean first passage times can be extracted as described in previous notebooks. \n",
+    "For example, eigenvectors and mean first passage times can be extracted as described in [Notebook 04 ➜ 📓](04-msm-analysis.ipynb) and [Notebook 05 ➜ 📓](05-pcca-tpt.ipynb). \n",
     "\n",
     "Let us now repeat this approach again for another featurization:\n",
     "we already know that it is possible to resolve six metastable states (five slow processes) using an HMM estimated on a discretization of the backbone torsions.\n",
@@ -914,7 +914,7 @@
     "\n",
     "Let us now repeat the analysis of our pentapeptide using an HMM.\n",
     "\n",
-    "We fetch the pentapeptide data set and prepare the discrete trajectories as explained in the showcase notebook.\n",
+    "We fetch the pentapeptide data set and prepare the discrete trajectories as explained in [Notebook 00 ➜ 📓](00-pentapeptide-showcase.ipynb).\n",
     "There, we already learned that $5$ metastable states are a good choice for our model.\n",
     "According to our implied timescales plot, we can resolve four processes for up to $2.5$ ns in our data. "
    ]
@@ -1253,7 +1253,7 @@
    "name": "python",
    "nbconvert_exporter": "python",
    "pygments_lexer": "ipython3",
-   "version": "3.6.6"
+   "version": "3.6.5"
   },
   "toc": {
    "base_numbering": 1,
diff --git a/notebooks/08-common-problems.ipynb b/notebooks/08-common-problems.ipynb
@@ -700,7 +700,7 @@
    "source": [
     "Congratulations, we have estimated a well-validated MSM.\n",
     "The only question remaining is: What does it actually describe?\n",
-    "For this, we usually extract representative structures as described in a previous notebook.\n",
+    "For this, we usually extract representative structures as described in [Notebook 00 ➜ 📓](00-pentapeptide-showcase.ipynb).\n",
     "We will not do this here but look at the metastable trajectories instead.\n",
     "\n",
     "#### What could be wrong about it?\n",
@@ -781,7 +781,7 @@
     "the TICA lag time was deliberately chosen way too high.\n",
     "That's easy to fix.\n",
     "\n",
-    "Let's now have a look at how the metastable trajectories should look like for a decent model such as the one estimated in the previous notebooks.\n",
+    "Let's now have a look at how the metastable trajectories should look for a decent model such as the one estimated in [Notebook 05 ➜ 📓](05-pcca-tpt.ipynb).\n",
     "We will take the same input data,\n",
     "do a TICA transform with a realistic lag time of $10$ ps,\n",
     "and coarse grain into $2$ metastable states in order to compare with the example above."
@@ -883,9 +883,9 @@
     "- connected but poorly sampled trajectories and how convergence looks in this case,\n",
     "- ill-conducted TICA analysis and what it yields.\n",
     "\n",
-    "The most important message from this tutorial is that histograms are not a means of identifying metastability or connectedness.\n",
-    "One should not forget about the underlying trajectories which should play the role of the ground truth to be modeled.\n",
-    "Histograms only help us to understand this ground truth but are not necessarily meaningful."
+    "The most important lesson from this tutorial is that histograms, which are usually calculated in a projected space, are not a sufficient means of identifying metastability or connectedness.\n",
+    "It is crucial to remember that the underlying trajectories play the role of ground truth for the model. \n",
+    "Ultimately, histograms only help us to understand this ground truth but cannot provide a complete picture."
    ]
   }
  ],
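
Note on the reworded summary in notebooks/08-common-problems.ipynb: the point that histograms alone cannot identify metastability can be illustrated with a minimal sketch. It is not part of this commit; the two discrete trajectories and both helper names are hypothetical and use only the Python standard library, not PyEMMA.

```python
from collections import Counter


def histogram(dtraj):
    """Count how often each discrete state is visited."""
    return Counter(dtraj)


def count_transitions(dtraj, lag=1):
    """Count frame pairs separated by `lag` steps whose states differ."""
    return sum(1 for a, b in zip(dtraj[:-lag], dtraj[lag:]) if a != b)


# Two hypothetical two-state trajectories with 100 frames each.
rare = [0] * 50 + [1] * 50  # one single crossing between long-lived states
fast = [0, 1] * 50          # hops between the states at every frame

# Both trajectories visit each state 50 times, so their histograms agree ...
print(histogram(rare) == histogram(fast))

# ... but only the trajectories themselves reveal the different kinetics.
print(count_transitions(rare), count_transitions(fast))
```

Both trajectories produce the same histogram, yet one contains a single transition and the other changes state at every step, which is why the summary asks the reader to treat the underlying trajectories as the ground truth.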