
Commit 9353cd3

Merge pull request #160 from marscher/improvements_nb01
[nb01] Better motivation why to compare different features.
2 parents b18582f + 3ac9cb5 commit 9353cd3


notebooks/01-data-io-and-featurization.ipynb

Lines changed: 144 additions & 52 deletions
@@ -8,7 +8,7 @@
88
"\n",
99
"<a rel=\"license\" href=\"http://creativecommons.org/licenses/by/4.0/\"><img alt=\"Creative Commons Licence\" style=\"border-width:0\" src=\"https://i.creativecommons.org/l/by/4.0/88x31.png\" title='This work is licensed under a Creative Commons Attribution 4.0 International License.' align=\"right\"/></a>\n",
1010
"\n",
11-
"In this notebook, we will cover how to load (and visualize) data with PyEMMA. \n",
11+
"In this notebook, we will cover how to load (and visualize) data with PyEMMA. We are going to extract different features (collective variables) and compare how well these are suited for Markov state model building. Further we will introduce the concept of streaming data, which is mandatory to work with large data sets.\n",
1212
"\n",
1313
"As with the other notebooks, we employ multiple examples. The idea is, first, to highlight the fundamental ideas with a non-physical test system (diffusion in a 2D double-well potential) and, second, to demonstrate real-world applications with molecular dynamics data.\n",
1414
"\n",
@@ -177,7 +177,7 @@
177177
"cell_type": "markdown",
178178
"metadata": {},
179179
"source": [
180-
"In PyEMMA, the `featurizer` is a central object that incorporates the system's topology. We start by creating it object using the topology file. Features are now easily computed by adding the target feature, e.g., with `featurizer.add_backbone_torsions()`. "
180+
"In PyEMMA, the `featurizer` is a central object that incorporates the system's topology. We start by creating it object using the topology file. Features are now easily computed by adding the target feature. If no feature is added, the featurizer will extract Cartesian coordinates. "
181181
]
182182
},
183183
{
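A minimal sketch of the default behaviour described above (not part of this commit; it assumes `pdb` is the topology file defined earlier in the notebook):

feat = pyemma.coordinates.featurizer(pdb)
# nothing added yet: the featurizer falls back to plain Cartesian coordinates,
# so its output dimension should equal 3 * n_atoms (x, y, z for every atom)
print('dimension:', feat.dimension())
print('3 * n_atoms:', 3 * feat.topology.n_atoms)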
@@ -193,7 +193,9 @@
193193
"cell_type": "markdown",
194194
"metadata": {},
195195
"source": [
196-
"Next, we start adding features which we want to extract from the simulation data. Here, we want to load the backbone torsions:"
196+
"Now we pass the featurizer to the load function to extract the Cartesian coordinates from the trajectory files into memory. For real world examples one would prefer the `source` function, because usually one has more data available than main memory in the workstation.\n",
197+
"\n",
198+
"The warning about **plain coordinates** is triggered, because these coordinates will include diffusion as a dynamical process, which might not be what one is interested in. If the molecule of interest has been aligned to a reference prior the analysis, it is fine to use these coordinates, but we will see that there are better choices. "
197199
]
198200
},
199201
{
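The paragraph above assumes the molecule has been aligned to a reference prior to the analysis. As a hedged illustration only (mdtraj is not used in the cells shown in this diff), such a superposition step could look roughly like this:

import mdtraj as md

ref = md.load(pdb)                                    # reference structure from the topology file
# align every trajectory onto the reference before trusting raw Cartesian coordinates
aligned = [md.load(f, top=pdb).superpose(ref) for f in files]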
@@ -202,16 +204,18 @@
202204
"metadata": {},
203205
"outputs": [],
204206
"source": [
205-
"feat.add_backbone_torsions(periodic=False)"
207+
"data = pyemma.coordinates.load(files, features=feat)\n",
208+
"print('type of data:', type(data))\n",
209+
"print('lengths:', len(data))\n",
210+
"print('shape of elements:', data[0].shape)\n",
211+
"print('n_atoms:', feat.topology.n_atoms)"
206212
]
207213
},
208214
{
209215
"cell_type": "markdown",
210216
"metadata": {},
211217
"source": [
212-
"Please note that the structures have been aligned before. Since in that case we loose track of the periodic box, we have to switch off the `periodic` flag for the torsion angle computations.\n",
213-
"\n",
214-
"We can always call the featurizer's `describe()` method to show which features are requested:"
218+
"Next, we start adding features which we want to extract from the simulation data. Here, we want to load the backbone torsions, because these angles are known to describe all flexibility in the system. Since this feature is two dimensional, it is also easier to visualize. Please note that in complex systems it is not trivial to visualize plain input features."
215219
]
216220
},
217221
{
@@ -220,14 +224,17 @@
220224
"metadata": {},
221225
"outputs": [],
222226
"source": [
223-
"print(feat.describe())"
227+
"feat = pyemma.coordinates.featurizer(pdb)\n",
228+
"feat.add_backbone_torsions(periodic=False)"
224229
]
225230
},
226231
{
227232
"cell_type": "markdown",
228233
"metadata": {},
229234
"source": [
230-
"After we have selected all desired features, we can call the `load()` function to load all features into memory or, alternatively, the `source()` function to create a streamed feature reader. For now, we will use `load()`:"
235+
"Please note that the trajectories have been aligned to a reference structure before. Since in that case we loose track of the periodic box, we have to switch off the `periodic` flag for the torsion angle computations. By default PyEMMA assumes your simulation uses periodic boundary conditions.\n",
236+
"\n",
237+
"We can always call the featurizer's `describe()` method to show which features are requested. You might have noticed that you can combine arbitrary features by having multiple calls to `add_` methods of the featurizer."
231238
]
232239
},
233240
{
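As a hedged sketch of the combination idea mentioned above (illustrative only, not a cell from the notebook), torsions and a set of backbone distances could be stacked in one featurizer:

feat = pyemma.coordinates.featurizer(pdb)
feat.add_backbone_torsions(periodic=False)
# hypothetical extra feature: pairwise distances between backbone atoms
feat.add_distances(feat.pairs(feat.select_Backbone()), periodic=False)
print(feat.describe())          # torsions are listed first, then the added distances
print('dimension:', feat.dimension())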
@@ -243,6 +250,22 @@
243250
"print('shape of elements:', data[0].shape)"
244251
]
245252
},
253+
{
254+
"cell_type": "markdown",
255+
"metadata": {},
256+
"source": [
257+
"After we have selected all desired features, we can call the `load()` function to load all features into memory or, alternatively, the `source()` function to create a streamed feature reader. For now, we will use `load()`:"
258+
]
259+
},
260+
{
261+
"cell_type": "code",
262+
"execution_count": null,
263+
"metadata": {},
264+
"outputs": [],
265+
"source": [
266+
"print(feat.describe())"
267+
]
268+
},
246269
{
247270
"cell_type": "markdown",
248271
"metadata": {},
@@ -264,9 +287,11 @@
264287
"cell_type": "markdown",
265288
"metadata": {},
266289
"source": [
267-
"We can now measure the quantity of kinetic variance of the just selected feature by computing a VAMP-2 score <a id=\"ref-1\" href=\"#cite-vamp-preprint\">wu-17</a>. This score gives us information on the kinetic variance contained in the feature. The minimum value of this score is 1, which corresponds to the invariant measure or the observed equilibrium.\n",
290+
"We can now measure the quantity of kinetic variance of the just selected feature by computing a VAMP-2 score <a id=\"ref-1\" href=\"#cite-vamp-preprint\">wu-17</a>. This enables us to distinguish features on how well they might be suited for MSM building. The minimum value of this score is 1, which corresponds to the invariant measure or the observed equilibrium.\n",
268291
"\n",
269-
"With the dimension parameter we specify the amount of dynamic processes that we want to score. This is of importance later on, when we want to compare different input features. If we did not fix this number, we would not have an upper bound for the score."
292+
"\n",
293+
"With the dimension parameter we specify the amount of dynamic processes that we want to score. This is of importance later on, when we want to compare different input features. If we did not fix this number, we would not have an upper bound for the score.\n",
294+
"Please also note that we split our available data into training and test sets, where we leave out the last file in training and then use it as test. This is an important aspect in practice to avoid over-fitting the score."
270295
]
271296
},
272297
{
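The following cell scores a single train/test split. As a hedged sketch only (not part of this commit), the cross-validation idea mentioned above could be written as a loop that holds out each trajectory in turn:

import numpy as np

scores = []
for i in range(len(data)):
    train = [d for j, d in enumerate(data) if j != i]   # leave trajectory i out of training
    v = pyemma.coordinates.vamp(train, dim=2)
    scores.append(v.score(test_data=data[i], score_method='VAMP2'))
print('cross-validated VAMP2 score: {:.2f} +/- {:.2f}'.format(np.mean(scores), np.std(scores)))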
@@ -275,18 +300,17 @@
275300
"metadata": {},
276301
"outputs": [],
277302
"source": [
278-
"score_phi_psi = pyemma.coordinates.vamp(\n",
279-
" data[:-1], dim=2).score(\n",
280-
" test_data=data[-1:],\n",
303+
"score_phi_psi = pyemma.coordinates.vamp(data[:-1], dim=2).score(\n",
304+
" test_data=data[-1],\n",
281305
" score_method='VAMP2')\n",
282-
"print('VAMP2-score backbone torsions: {:f}'.format(score_phi_psi))"
306+
"print('VAMP2-score backbone torsions: {:.2f}'.format(score_phi_psi))"
283307
]
284308
},
285309
{
286310
"cell_type": "markdown",
287311
"metadata": {},
288312
"source": [
289-
"The score of $\\approx1.5$ means that we have the constant of $1$ plus a total contribution of $0.5$ from the other dynamic process.\n",
313+
"The score of $1.5$ means that we have the constant of $1$ plus a total contribution of $0.5$ from the other dynamic processes.\n",
290314
"\n",
291315
"We now use PyEMMA's `plot_density()` and `plot_free_energy()` functions to create Ramachandran plots of our system:"
292316
]
@@ -365,9 +389,8 @@
365389
"metadata": {},
366390
"outputs": [],
367391
"source": [
368-
"score_heavy_atoms = pyemma.coordinates.vamp(\n",
369-
" data[:-1], dim=2).score(\n",
370-
" test_data=data[-1:],\n",
392+
"score_heavy_atoms = pyemma.coordinates.vamp(data[:-1], dim=2).score(\n",
393+
" test_data=data[:-1],\n",
371394
" score_method='VAMP2')\n",
372395
"print('VAMP2-score backbone torsions: {:f}'.format(score_phi_psi))\n",
373396
"print('VAMP2-score xyz: {:f}'.format(score_heavy_atoms))"
@@ -377,9 +400,9 @@
377400
"cell_type": "markdown",
378401
"metadata": {},
379402
"source": [
380-
"As we see, the score for the heavy atom positions is much higher as the one for the $\\phi/\\psi$ torsion angles. The feature with a higher score should be favored for further analysis, because it means that this feature contains more information about slow processes. If you are already digging deeper into your system of interest, you can of course restrict the analysis to a set of features you already know describes your process of interest, regardless of its VAMP score.\n",
403+
"As we see, the score for the heavy atom positions is much higher as the one for the $\\phi/\\psi$ torsion angles. The feature with a higher score should be favored for further analysis, because it means that this feature contains more information about slow processes. If you are already digging deeper into your system of interest, you can of course restrict the analysis to a set of features you already know describes your processes of interest, regardless of its VAMP score.\n",
381404
"\n",
382-
"Another featurization that is interesting especially for proteins is heavy atom distances:"
405+
"Another featurization that is interesting especially for proteins are pairwise heavy atom distances:"
383406
]
384407
},
385408
{
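The cell that builds this feature is not part of the hunk below (only its scoring is); a hedged sketch of what it might look like, assuming `pdb` and `files` are defined as before:

feat = pyemma.coordinates.featurizer(pdb)
heavy = feat.select_Heavy()                          # indices of all non-hydrogen atoms
feat.add_distances(feat.pairs(heavy), periodic=False)
data = pyemma.coordinates.load(files, features=feat)
print('dimension:', feat.dimension())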
@@ -409,37 +432,22 @@
409432
"metadata": {},
410433
"outputs": [],
411434
"source": [
412-
"score_pair_dists_ca = pyemma.coordinates.vamp(\n",
413-
" data[:-1]).score(\n",
414-
" test_data=data[-1:],\n",
415-
" score_method='VAMP2')\n",
416-
"print('VAMP2-score: {:f}'.format(score_pair_dists_ca))"
435+
"score_pair_heavy_atom_dists = pyemma.coordinates.vamp(data[:-1], dim=2).score(\n",
436+
" test_data=data[-1],\n",
437+
" score_method='VAMP2')\n",
438+
"print('VAMP2-score: {:.2f}'.format(score_pair_heavy_atom_dists))"
417439
]
418440
},
419441
{
420442
"cell_type": "markdown",
421443
"metadata": {},
422444
"source": [
423-
"Apparently, the heavy atom distance pairs cover an amount of kinetic variance which is comparable to the coordinates themselves.\n",
445+
"Apparently, the heavy atom distance pairs cover an amount of kinetic variance which is comparable to the coordinates themselves. However we probably would not use much information by using this internal degree of freedom, while avoiding the need to align our trajectories first.\n",
424446
"\n",
425447
"### `load()` versus `source()`\n",
426448
"\n",
427-
"Using `load()`, we put the full data into memory. This is possible for all examples in this tutorial."
428-
]
429-
},
430-
{
431-
"cell_type": "code",
432-
"execution_count": null,
433-
"metadata": {},
434-
"outputs": [],
435-
"source": [
436-
"print(data)"
437-
]
438-
},
439-
{
440-
"cell_type": "markdown",
441-
"metadata": {},
442-
"source": [
449+
"Using `load()`, we put the full data into memory. This is possible for all examples in this tutorial.\n",
450+
"\n",
443451
"Many real world apllications, though, require more memory than your workstation might provide. For these cases, you should use the `source()` function:"
444452
]
445453
},
@@ -449,15 +457,16 @@
449457
"metadata": {},
450458
"outputs": [],
451459
"source": [
452-
"data = pyemma.coordinates.source(files, features=feat)\n",
453-
"print(data)"
460+
"reader = pyemma.coordinates.source(files, features=feat)\n",
461+
"print(reader)"
454462
]
455463
},
456464
{
457465
"cell_type": "markdown",
458466
"metadata": {},
459467
"source": [
460-
"This function allows to stream the data and work on chunks instead of the full set. Most of the functions in the `coordinates` sub-package accept data in memory as well as streamed feature readers. However, some plotting functions require the data to be in memory. To load a (sub-sampled) subset into memory, we can use the `get_output()` method with a stride parameter:"
468+
"This function creates a reader, wich allows to stream the underlying data in chunks instead of the full set. Most of the functions in the `coordinates` sub-package accept data in memory as well as readers.\n",
469+
"However, some plotting functions require the data to be in memory. To load a (sub-sampled) subset into memory, we can use the `get_output()` method with a stride parameter:"
461470
]
462471
},
463472
{
@@ -466,9 +475,9 @@
466475
"metadata": {},
467476
"outputs": [],
468477
"source": [
469-
"data_output = data.get_output(stride=5)\n",
478+
"data_output = reader.get_output(stride=5)\n",
470479
"len(data_output)\n",
471-
"print('number of frames in first file: {}'.format(data.trajectory_length(0)))\n",
480+
"print('number of frames in first file: {}'.format(reader.trajectory_length(0)))\n",
472481
"print('number of frames after striding: {}'.format(len(data_output[0])))"
473482
]
474483
},
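Since most functions in the `coordinates` sub-package also accept readers directly, a hedged sketch (not from the notebook) of streaming a reader into an estimator:

reader = pyemma.coordinates.source(files, features=feat)
# the estimator pulls the data chunk-wise from the reader instead of needing it all in memory
vamp_streamed = pyemma.coordinates.vamp(reader, dim=2)
print(vamp_streamed)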
@@ -504,11 +513,11 @@
504513
"\n",
505514
"Exercise cells come with a button (**Show Solution**) to reveal the solution.\n",
506515
"\n",
507-
"#### Exercise 1: heavy atom distances\n",
516+
"#### Exercise 1a: inverse heavy atom distances\n",
508517
"\n",
509-
"Please fix the following code block such that the distances between all heavy atoms are loaded and visualized.\n",
518+
"Please fix the following code block such that the inverse distances between all heavy atoms are loaded and visualized.\n",
510519
"\n",
511-
"**Hint**: you might find the `add_distances()` method of the featurizer object helpful."
520+
"**Hint**: try to use the auto-complete feature on the feat object to gain some insight. Also take a look at the previous demonstrations."
512521
]
513522
},
514523
{
@@ -531,6 +540,89 @@
531540
"fig.tight_layout()"
532541
]
533542
},
543+
{
544+
"cell_type": "code",
545+
"execution_count": null,
546+
"metadata": {
547+
"solution2": "hidden"
548+
},
549+
"outputs": [],
550+
"source": [
551+
"feat = pyemma.coordinates.featurizer(pdb)\n",
552+
"pairs = feat.pairs(feat.select_Heavy())\n",
553+
"feat.add_inverse_distances(pairs)\n",
554+
"\n",
555+
"data = pyemma.coordinates.load(files, features=feat)\n",
556+
"\n",
557+
"fig, ax = plt.subplots(figsize=(10, 7))\n",
558+
"pyemma.plots.plot_feature_histograms(\n",
559+
" np.concatenate(data), feature_labels=feat, ax=ax)\n",
560+
"\n",
561+
"fig.tight_layout()"
562+
]
563+
},
564+
{
565+
"cell_type": "markdown",
566+
"metadata": {},
567+
"source": [
568+
"#### Excercise 1b: compare the inverse distances feature to the previously computed ones\n",
569+
"\n",
570+
"Compute and discuss a cross-validated VAMP score for the inverse pairwise heavy atom distances and plot the result. What do you observe and which feature would you choose for further processing?"
571+
]
572+
},
573+
{
574+
"cell_type": "code",
575+
"execution_count": null,
576+
"metadata": {
577+
"solution2": "hidden",
578+
"solution2_first": true
579+
},
580+
"outputs": [],
581+
"source": [
582+
"score_inv_dist = pyemma.coordinates. #FIXME\n",
583+
"\n",
584+
"fig, ax = plt.subplots(figsize=(10, 7))\n",
585+
"score_mapping = dict(score_heavy_atoms=score_heavy_atoms,\n",
586+
" score_phi_psi=score_phi_psi,\n",
587+
" score_pair_heavy_atom_dists=score_pair_heavy_atom_dists,\n",
588+
" score_inv_dist=score_inv_dist)\n",
589+
"lbl = []\n",
590+
"for i, (key, value) in enumerate(sorted(score_mapping.items(), key=lambda x: x[1])):\n",
591+
" ax.bar(i, height=value)\n",
592+
" lbl.append(key)\n",
593+
"ax.set_xticks(np.arange(0, len(score_mapping), 1))\n",
594+
"ax.set_xticklabels(lbl)\n",
595+
"fig.tight_layout()"
596+
]
597+
},
598+
{
599+
"cell_type": "code",
600+
"execution_count": null,
601+
"metadata": {
602+
"solution2": "hidden"
603+
},
604+
"outputs": [],
605+
"source": [
606+
"score_inv_dist = pyemma.coordinates.vamp(\n",
607+
" data[:-1], dim=2).score(test_data=data[-1])\n",
608+
"\n",
609+
"fig, ax = plt.subplots(figsize=(10, 7))\n",
610+
"score_mapping = dict(score_heavy_atoms=score_heavy_atoms,\n",
611+
" score_phi_psi=score_phi_psi,\n",
612+
" score_pair_heavy_atom_dists=score_pair_heavy_atom_dists,\n",
613+
" score_inv_dist=score_inv_dist)\n",
614+
"lbl = []\n",
615+
"for i, (key, value) in enumerate(sorted(score_mapping.items(), key=lambda x: x[1])):\n",
616+
" ax.bar(i, height=value)\n",
617+
" lbl.append(key)\n",
618+
"ax.set_xticks(np.arange(0, len(score_mapping), 1))\n",
619+
"ax.set_xticklabels(lbl)\n",
620+
"fig.tight_layout()\n",
621+
"\n",
622+
"# inversing the feature preserves the amount of kinetic variance, we should continue\n",
623+
"# with the pairwise heavy atom distances."
624+
]
625+
},
534626
{
535627
"cell_type": "markdown",
536628
"metadata": {
@@ -922,7 +1014,7 @@
9221014
"name": "python",
9231015
"nbconvert_exporter": "python",
9241016
"pygments_lexer": "ipython3",
925-
"version": "3.6.5"
1017+
"version": "3.6.6"
9261018
},
9271019
"toc": {
9281020
"base_numbering": 1,
