Merge pull request #156 from marscher/review

marscher · web-flow · commit f83b31c05977 · 2018-09-06T11:18:48.000+02:00
Review [ci skip]
diff --git a/manuscript/literature.bib b/manuscript/literature.bib
@@ -677,3 +677,14 @@ @article{banushkina_nonparametric_2015
         year = {2015},
         pages = {184108}
 }
+
+@article{husic-optimized,
+  title={Optimized parameter selection reveals trends in Markov state models for protein folding},
+  author={Husic, Brooke E and McGibbon, Robert T and Sultan, Mohammad M and Pande, Vijay S},
+  journal={J. Chem. Phys.},
+  volume={145},
+  number={19},
+  pages={194103},
+  year={2016},
+  publisher={AIP Publishing}
+}
diff --git a/manuscript/manuscript.tex b/manuscript/manuscript.tex
@@ -241,6 +241,14 @@ \subsection{The PyEMMA workflow}
 We present the results obtained in this notebook, thereby providing an example of how results generated using PyEMMA can be integrated into research publications.
 The figures that will be displayed in the following are created in the showcase notebook (00) and can be easily reproduced.
 
+Note that the modeler has to select hyper-parameters at most stages throughout the workflow.
+This selection must be done carefully as poor choices make it hard, or even impossible, to build a good MSM.
+
+While there exist automated schemes~\cite{husic-optimized} for cross-validated optimization in the full hyper-parameter 
+space, we chose to adopt a sequential approach where only the hyper-parameters of the current stage are optimized. This 
+approach is not only computationally cheaper but also allows us to discuss the significance of the necessary modeling 
+choices.
+
 \subsection{Feature selection}
 
 \begin{figure}
diff --git a/notebooks/00-pentapeptide-showcase.ipynb b/notebooks/00-pentapeptide-showcase.ipynb
@@ -231,9 +231,13 @@
     "\n",
     "### TICA\n",
     "\n",
-    "The goal of the next step is to find a function that maps the usually high-dimensional input space into some lower dimensional space that captures the important dynamics. The recommended way of doing so is a time-lagged independent component analysis (TICA), <a id=\"ref-4\" href=\"#cite-tica2\">molgedey-94</a>, <a id=\"ref-5\" href=\"#cite-tica\">perez-hernandez-13</a>. We perform TICA (with kinetic map scaling) using the lag time obtained from the VAMP-2 score. \n",
+    "The goal of the next step is to find a function that maps the usually high-dimensional input space into some lower-dimensional space that captures the important dynamics. The recommended way of doing so is a time-lagged independent component analysis (TICA), <a id=\"ref-4\" href=\"#cite-tica2\">molgedey-94</a>, <a id=\"ref-5\" href=\"#cite-tica\">perez-hernandez-13</a>. We perform TICA (with kinetic map scaling) using the lag time obtained from the VAMP-2 score.\n",
     "\n",
-    "Please note that the general `PyEMMA` API is consistant for all estimators. By calling the TICA estimator with the data (`tica = pyemma.coordinates.tica(torsions_data)`), the estimation is done and an estimator instance returned (`tica`); this object contains all the information about the specific transformation. For small systems, we can access the transformed data by calling `tica.get_output()`. For large systems, we recommend to pass the `tica` object itself into the subsequent stages, e.g., clustering."
+    "With the default parameters of `tica()`, we keep as many dimensions as necessary to preserve $95\\%$ of the kinetic variance. By default, `tica()` also applies kinetic map scaling.\n",
+    "This scaling ensures that Euclidean distances in the projected space approximate kinetic distances,\n",
+    "which is beneficial during the subsequent discretization.\n",
+    "\n",
+    "Please note that the general `PyEMMA` API is consistent for all estimators. By calling the TICA estimator with the data (`tica = pyemma.coordinates.tica(torsions_data)`), the estimation is done and an estimator instance is returned (`tica`); this object contains all the information about the specific transformation. For small systems, we can access the transformed data by calling `tica.get_output()`. For large systems, we recommend passing the `tica` object itself into the subsequent stages, e.g., clustering, in order to avoid loading all transformed data into memory."
    ]
   },
   {
@@ -1636,7 +1640,7 @@
    "name": "python",
    "nbconvert_exporter": "python",
    "pygments_lexer": "ipython3",
-   "version": "3.6.3"
+   "version": "3.6.6"
   },
   "toc": {
    "base_numbering": 1,
diff --git a/notebooks/01-data-io-and-featurization.ipynb b/notebooks/01-data-io-and-featurization.ipynb
@@ -279,7 +279,7 @@
     "    data[:-1], dim=2).score(\n",
     "        test_data=data[-1:],\n",
     "        score_method='VAMP2')\n",
-    "print('VAMP2-score: {:f}'.format(score_phi_psi))"
+    "print('VAMP2-score backbone torsions: {:f}'.format(score_phi_psi))"
    ]
   },
   {
@@ -369,14 +369,15 @@
     "    data[:-1], dim=2).score(\n",
     "        test_data=data[-1:],\n",
     "        score_method='VAMP2')\n",
-    "print('VAMP2-score: {:f}'.format(score_heavy_atoms))"
+    "print('VAMP2-score backbone torsions: {:f}'.format(score_phi_psi))\n",
+    "print('VAMP2-score xyz: {:f}'.format(score_heavy_atoms))"
    ]
   },
   {
    "cell_type": "markdown",
    "metadata": {},
    "source": [
-    "As we see, the score for the heavy atom positions is much higher as the one for the $\\phi/\\psi$ torsion angles. We will learn later what this means.\n",
+    "As we see, the score for the heavy atom positions is much higher than the one for the $\\phi/\\psi$ torsion angles. The feature with a higher score should be favored for further analysis, because it means that this feature contains more information about slow processes. If you are already digging deeper into your system of interest, you can of course restrict the analysis to a set of features that you already know describes your process of interest, regardless of its VAMP score.\n",
     "\n",
     "Another featurization that is interesting especially for proteins is heavy atom distances:"
    ]
@@ -456,7 +457,7 @@
    "cell_type": "markdown",
    "metadata": {},
    "source": [
-    "This function allows to stream the data and work on chunks instead of the full set. Most of the functions in the `coordinates` sub-package accept data in memory as well as streamed feature readers. However, some plotting functions require the data to be in memory. To load a (strided) subset into memory, we can use the `get_output()` method with a stride parameter:"
+    "This function allows us to stream the data and work on chunks instead of the full set. Most of the functions in the `coordinates` sub-package accept data in memory as well as streamed feature readers. However, some plotting functions require the data to be in memory. To load a (sub-sampled) subset into memory, we can use the `get_output()` method with a stride parameter:"
    ]
   },
   {
diff --git a/pyemma_tutorials/cli.py b/pyemma_tutorials/cli.py
@@ -20,8 +20,10 @@ def main():
 
     _nglview_pip_installed_workaround()
 
-    argv = ['--config=%s' % notebook_cfg, '--config=%s' % notebook_cfg_json]
-    print('arguments:', argv)
+    # extend passed arguments with our config files
+    import sys
+    argv = sys.argv[1:] + ['--config=%s' % notebook_cfg, '--config=%s' % notebook_cfg_json]
+    print('invoking notebook server with arguments:', argv)
     main_(argv=argv)
 
 
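Note on the `cli.py` change: the new code forwards whatever the user typed on the command line and appends the two `--config` flags after it. A minimal sketch of that list construction follows. The helper name `build_argv` and the placeholder file names are hypothetical; the commit itself builds the list inline inside `main()`.

```python
import sys


def build_argv(user_args, config_files):
    """Return the user-supplied arguments followed by one --config flag per file.

    Hypothetical helper for illustration; not part of pyemma_tutorials.
    """
    return list(user_args) + ['--config=%s' % path for path in config_files]


if __name__ == '__main__':
    # Placeholder paths standing in for notebook_cfg and notebook_cfg_json.
    argv = build_argv(sys.argv[1:], ['notebook_cfg.py', 'notebook_cfg.json'])
    print('invoking notebook server with arguments:', argv)
```

Keeping the list construction in a small pure function would make the argument handling easy to check without starting a notebook server.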

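Note on the TICA text added to notebook 00: the statement that the defaults keep as many dimensions as necessary to preserve 95% of the kinetic variance can be illustrated with a small sketch. It assumes that kinetic variance is measured as the cumulative sum of squared eigenvalues, as is usual for kinetic-map-scaled coordinates. The function name `dims_for_kinetic_variance` and the example eigenvalues are hypothetical and not part of PyEMMA, whose implementation may differ in detail.

```python
def dims_for_kinetic_variance(eigenvalues, var_cutoff=0.95):
    """Smallest number of leading components whose squared eigenvalues
    add up to at least `var_cutoff` of the total.

    Assumption: kinetic variance is the sum of squared eigenvalues.
    """
    # Sort squared eigenvalues so the slowest processes come first.
    squared = sorted((ev ** 2 for ev in eigenvalues), reverse=True)
    total = sum(squared)
    if total == 0:
        raise ValueError('need at least one non-zero eigenvalue')
    cumulative = 0.0
    for dim, value in enumerate(squared, start=1):
        cumulative += value
        if cumulative / total >= var_cutoff:
            return dim
    return len(squared)


if __name__ == '__main__':
    # Made-up eigenvalues, for illustration only.
    example = [0.9, 0.8, 0.3, 0.1]
    print('dimensions kept:', dims_for_kinetic_variance(example))
```

With a lower cutoff fewer dimensions are retained, which is the trade-off the modeler controls when overriding the default.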