Commit e60d0f4

Merge pull request #165 from markovmodel/revision-cw
Revision cw
2 parents d3e0f9f + c71ec37 commit e60d0f4

File tree

3 files changed: +57 -17 lines changed

notebooks/03-msm-estimation-and-validation.ipynb

Lines changed: 17 additions & 4 deletions
@@ -85,7 +85,11 @@
 "cell_type": "markdown",
 "metadata": {},
 "source": [
-"The first step after obtaining the discretized dynamics is finding a suitable lag time. The systematic approach is to estimate MSMs at various lag times and observe how the implied timescales (ITSs) of these models behave. To this aim, PyEMMA provides the `its()` function which we use to track the first three (`nits=3`) implied timescales:"
+"The first step after obtaining the discretized dynamics is finding a suitable lag time.\n",
+"The systematic approach is to estimate MSMs at various lag times and observe how the implied timescales (ITSs) of these models behave.\n",
+"In particular, we are looking for lag time ranges in which the implied timescales are constant\n",
+"(i.e., lag time independent as described in the manuscript in Section 2.1).\n",
+"To this aim, PyEMMA provides the `its()` function which we use to track the first three (`nits=3`) implied timescales:"
 ]
 },
 {
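For context, the `its()` call referenced in the added text can be sketched as follows; the discretized trajectories `dtrajs` and the list of lag times are illustrative assumptions, not part of this commit:

```python
import pyemma

# dtrajs: list of integer arrays holding the discretized trajectories (assumed to exist)
its = pyemma.msm.its(dtrajs, lags=[1, 2, 3, 5, 7, 10], nits=3, errors='bayes')

# plot the first three implied timescales against the MSM lag time
pyemma.plots.plot_implied_timescales(its)
```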
@@ -117,11 +121,20 @@
 "cell_type": "markdown",
 "metadata": {},
 "source": [
-"The above plot tells us that there is one resolved process with an ITS of approximately $8.5$ steps (blue) which is largely invariant to the MSM lag time. The other two ITSs (green, red) are smaller than the lag time (black line, grey-shaded area); they correspond to processes which are faster than the lag time and, thus, are not resolved.\n",
+"The above plot tells us that there is one resolved process with an ITS of approximately $8.5$ steps (blue) which is largely invariant to the MSM lag time.\n",
+"The other two ITSs (green, red) are smaller than the lag time (black line, grey-shaded area);\n",
+"they correspond to processes which are faster than the lag time and, thus, are not resolved.\n",
+"Since the implied timescales are, like the corresponding eigenvalues, sorted in decreasing order, we know that all other remaining processes must be even faster.\n",
 "\n",
-"As MSMs tend to underestimate the true ITSs, we are looking for a converged maximum in the ITS plot. In our case, any lag time before the slow process (blue line) crosses the lag time threshold (black line) would work. To maximize the kinetic resolution, we choose the lag time $1$ step.\n",
+"As MSMs tend to underestimate the true ITSs, we are looking for a converged maximum in the ITS plot.\n",
+"In our case, any lag time before the slow process (blue line) crosses the lag time threshold (black line) would work.\n",
+"To maximize the kinetic resolution, we choose the lag time $1$ step.\n",
 "\n",
-"To see whether our model satisfies Markovianity, we perform (and visualize) a Chapman-Kolmogorow (CK) test. Since we aim at modeling the dynamics between metastable states rather than between microstates, this will be conducted in the space of metastable states. The latter are identified automatically using PCCA++ (which is explained in a later notebook). We usually choose the number of metastable states according to the implied timescales plot by identifying a gap between the ITS. For a single process, we can assume that there are two metastable states between which the process occurs."
+"To see whether our model satisfies Markovianity, we perform (and visualize) a Chapman-Kolmogorov (CK) test.\n",
+"Since we aim at modeling the dynamics between metastable states rather than between microstates, this will be conducted in the space of metastable states.\n",
+"The latter are identified automatically using PCCA++ (which is explained in [Notebook 05 📓](05-pcca-tpt.ipynb)).\n",
+"We usually choose the number of metastable states according to the implied timescales plot by identifying a gap between the ITS.\n",
+"For a single process, we can assume that there are two metastable states between which the process occurs."
 ]
 },
 {
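A minimal sketch of the estimation and CK test step described above; the lag time ($1$ step) and the two metastable states follow the text, while `dtrajs` is again an assumed input:

```python
import pyemma

# estimate an MSM at the lag time chosen from the ITS plot
msm = pyemma.msm.estimate_markov_model(dtrajs, lag=1)

# Chapman-Kolmogorov test with 2 metastable sets, then visualize the result
cktest = msm.cktest(2)
pyemma.plots.plot_cktest(cktest)
```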

notebooks/04-msm-analysis.ipynb

Lines changed: 24 additions & 7 deletions
@@ -179,11 +179,18 @@
 "cell_type": "markdown",
 "metadata": {},
 "source": [
-"The attribute `msm.pi` tells us, for each discrete state, the absolute probability of observing said state in global equilibrium. Mathematically speaking, the stationary distribution $\\pi$ is the left eigenvector of the transition matrix $\\mathbf{T}$ to the eigenvalue $1$:\n",
+"The attribute `msm.pi` tells us, for each discrete state, the absolute probability of observing said state in global equilibrium.\n",
+"Mathematically speaking, the stationary distribution $\\pi$ is the left eigenvector of the transition matrix $\\mathbf{T}$ to the eigenvalue $1$:\n",
 "\n",
 "$$\\pi^\\top \\mathbf{T} = \\pi^\\top.$$\n",
 "\n",
-"We can use the stationary distribution to, e.g., visualize the weight of the dicrete states and, thus, to highlight which areas of our feature space are most probable. Here, we show all data points in a two dimensional scatter plot and color/weight them according to their discrete state membership:"
+"Please note that $\\pi$ is fundamentally different from a normalized histogram of states:\n",
+"for the histogram of states to accurately describe the stationary distribution, the data needs to be sampled from global equilibrium, i.e., the data points need to be statistically independent.\n",
+"The MSM approach, on the other hand, only requires local equilibrium, i.e., statistical independence of state transitions.\n",
+"Thus, the MSM approach requires a much weaker and, in practice, much easier to satisfy condition than simply counting state visits.\n",
+"\n",
+"We can use the stationary distribution to, e.g., visualize the weight of the discrete states and, thus, to highlight which areas of our feature space are most probable.\n",
+"Here, we show all data points in a two-dimensional scatter plot and color/weight them according to their discrete state membership:"
 ]
 },
 {
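The added note contrasts $\pi$ with a normalized histogram of state visits; a short sketch of that comparison, assuming an estimated `msm` and the discretized trajectories `dtrajs` from the notebook:

```python
import numpy as np

# verify that pi is a left eigenvector of T with eigenvalue 1 (pi^T T = pi^T)
assert np.allclose(msm.transition_matrix.T.dot(msm.pi), msm.pi)

# normalized histogram of state visits; this only matches the stationary
# distribution if the data were sampled from global equilibrium
counts = np.bincount(np.concatenate(dtrajs))[msm.active_set]
histogram = counts / counts.sum()

print(msm.pi)      # MSM estimate, requires only local equilibrium
print(histogram)   # generally differs from msm.pi for non-equilibrium sampling
```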
@@ -236,7 +243,9 @@
 "cell_type": "markdown",
 "metadata": {},
 "source": [
-"We will see further uses of the stationary distribution later. But for now, we continue the analysis of our model by visualizing its (right) eigenvectors. First, we notice that the first right eigenvector is a constant $1$."
+"We will see further uses of the stationary distribution later.\n",
+"But for now, we continue the analysis of our model by visualizing its (right) eigenvectors, which encode the dynamical processes.\n",
+"First, we notice that the first right eigenvector is a constant $1$."
 ]
 },
 {
@@ -281,11 +290,15 @@
 "cell_type": "markdown",
 "metadata": {},
 "source": [
-"The right eigenvectors can be used to visualize the processes governed by the corresponding implied timescales. The first right eigenvector (always) is $(1,\\dots,1)^\\top$ for an MSM transition matrix and it corresponds to the stationary process (infinite implied timescale).\n",
+"The right eigenvectors can be used to visualize the processes governed by the corresponding implied timescales.\n",
+"The first right eigenvector is (always) $(1,\\dots,1)^\\top$ for an MSM transition matrix and it corresponds to the stationary process (infinite implied timescale).\n",
 "\n",
-"The second right eigenvector corresponds to the slowest process; its entries are negative for one group of discrete states and positive for the other group. This tells us that the slowest process happens between these two groups and that the process relaxes on the slowest ITS ($\\approx 8.5$ steps).\n",
+"The second right eigenvector corresponds to the slowest process;\n",
+"its entries are negative for one group of discrete states and positive for the other group.\n",
+"This tells us that the slowest process happens between these two groups and that the process relaxes on the slowest ITS ($\\approx 8.5$ steps).\n",
 "\n",
-"The third and fourth eigenvectors show a larger spread of values and no clear grouping. In combination with the ITS convergence plot, we can safely assume that these eigenvectors contain just noise and do not indicate any resolved processes.\n",
+"The third and fourth eigenvectors show a larger spread of values and no clear grouping.\n",
+"In combination with the ITS convergence plot, we can safely assume that these eigenvectors contain just noise and do not indicate any resolved processes.\n",
 "\n",
 "We then continue to validate our MSM with a CK test for $2$ metastable states which are already indicated by the second right eigenvector."
 ]
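A sketch of how the right eigenvectors discussed above can be inspected, assuming the estimated `msm` from earlier in the notebook:

```python
import numpy as np

# first four right eigenvectors (columns) of the transition matrix
eigvec = msm.eigenvectors_right(4)

# the first right eigenvector is constant 1 (stationary process)
print(np.allclose(eigvec[:, 0], 1.0))

# the sign structure of the second eigenvector splits the discrete states
# into the two groups between which the slowest process takes place
group_a = msm.active_set[eigvec[:, 1] > 0]
group_b = msm.active_set[eigvec[:, 1] <= 0]
```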
@@ -423,7 +436,11 @@
 "source": [
 "Again, we have the $(1,\\dots,1)^\\top$ first right eigenvector of the stationary process.\n",
 "\n",
-"The second to fourth right eigenvectors illustrate the three slowest processes.\n",
+"The second to fourth right eigenvectors illustrate the three slowest processes, which are (in that order):\n",
+"\n",
+"- rotation of the $\\Phi$ dihedral,\n",
+"- rotation of the $\\Psi$ dihedral when $\\Phi\\approx-2$ rad, and\n",
+"- rotation of the $\\Psi$ dihedral when $\\Phi\\approx1$ rad.\n",
 "\n",
 "Eigenvectors five, six, and seven indicate further processes which, however, relax faster than the lag time and cannot be resolved clearly.\n",
 "\n",

notebooks/08-common-problems.ipynb

Lines changed: 16 additions & 6 deletions
@@ -18,7 +18,10 @@
 "- you can find the full documentation at [PyEMMA.org](http://www.pyemma.org).\n",
 "---\n",
 "\n",
-"Most problems arise from bad sampling combined with a poor discretization. For estimating a Markov model, it is required to have a connected data set, i.e. we must have observed each process we want to describe in both directions. PyEMMA checks if this requirement is fulfilled, however in certain situations this might be less obvious."
+"Most problems in Markov modeling of MD data arise from bad sampling combined with a poor discretization.\n",
+"For estimating a Markov model, it is required to have a connected data set,\n",
+"i.e., we must have observed each process we want to describe in both directions.\n",
+"PyEMMA checks if this requirement is fulfilled; however, in certain situations this might be less obvious."
 ]
 },
 {
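The added text mentions that PyEMMA checks for connectivity; one way to inspect this on an estimated model is sketched below, with `dtrajs` and the lag time assumed for illustration:

```python
import pyemma

msm = pyemma.msm.estimate_markov_model(dtrajs, lag=1)

# fraction of discrete states and of transition counts inside the largest
# reversibly connected set; values well below 1 hint at connectivity problems
print(msm.active_state_fraction)
print(msm.active_count_fraction)
print(msm.active_set)  # states actually used for estimation
```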
@@ -40,7 +43,8 @@
 "source": [
 "## Case 1: preprocessed, two-dimensional data (toy model)\n",
 "### well-sampled double-well potential\n",
-"Let's again have a look at the double-well potential. Since we are only interested in the problematic situations here, we will simplify our data a bit and work with a 1D projection."
+"Let's again have a look at the double-well potential.\n",
+"Since we are only interested in the problematic situations here, we will simplify our data a bit and work with a 1D projection."
 ]
 },
 {
@@ -97,7 +101,7 @@
 "cell_type": "markdown",
 "metadata": {},
 "source": [
-"We visualize the histogram along with the first part of the trajectory (left panel) and the MSM implied timescales (right panel):"
+"As a reference, we visualize the histogram of this well-sampled trajectory along with the first $200$ steps (left panel) and the MSM implied timescales (right panel):"
 ]
 },
 {
@@ -122,13 +126,19 @@
 "cell_type": "markdown",
 "metadata": {},
 "source": [
-"We see a nice, reversibly connected trajectory. That means we have sampled transitions between the basins in both directions that are correctly resolved by the discretization. As we see from the almost perfect overlay of discrete and continuous trajectory, nearly no discretization error is made. \n",
+"We see a nice, reversibly connected trajectory.\n",
+"That means we have sampled transitions between the basins in both directions that are correctly resolved by the discretization.\n",
+"As we see from the almost perfect overlay of discrete and continuous trajectory, nearly no discretization error is made.\n",
 "\n",
 "### irreversibly connected double-well trajectories\n",
 "\n",
-"In MD simulations, we often face the problem that a process is sampled only in one direction. For example, consider protein-protein binding. The unbinding might take on the order of seconds to minutes and is thus difficult to sample. We will have a look what happens with the MSM in this case. \n",
+"In MD simulations, we often face the problem that a process is sampled only in one direction.\n",
+"For example, consider protein-protein binding.\n",
+"The unbinding might take on the order of seconds to minutes and is thus difficult to sample.\n",
+"We will have a look at what happens with the MSM in this case.\n",
 "\n",
-"Our example are two trajectories sampled from a double-well potential, each started in a different basin. They will be color coded."
+"Our example consists of two trajectories sampled from a double-well potential, each started in a different basin.\n",
+"They will be color coded."
 ]
 },
 {
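To illustrate the irreversible-connectivity problem described above, here is a toy sketch with hypothetical discrete trajectories (not the notebook's data) in which transitions are only observed in one direction:

```python
import numpy as np
import pyemma

# transitions 0 -> 1 are observed but 1 -> 0 never is, so the data set
# is not reversibly connected (hypothetical example data)
dtraj_a = np.array([0, 0, 0, 1, 1, 1])
dtraj_b = np.array([1, 1, 1, 1, 1, 1])

msm = pyemma.msm.estimate_markov_model([dtraj_a, dtraj_b], lag=1)

# estimation is restricted to the largest reversibly connected set,
# which here consists of a single state
print(msm.active_set)
print(msm.active_state_fraction)
```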
