|
72 | 72 | "cell_type": "markdown", |
73 | 73 | "metadata": {}, |
74 | 74 | "source": [ |
75 | | - "Given the low dimensionality of this data set, we can directly attempt to discretize, e.g. with $k$-means with $100$ centers and a stride of $5$ to reduce the computational effort..." |
| 75 | + "Given the low dimensionality of this data set, we can directly attempt to discretize, e.g. with $k$-means with $100$ centers and a stride of $5$ to reduce the computational effort. In real world examples we also might encounter low dimensional feature spaces, which do not require further dimension reduction techniques to be clustered efficiently." |
76 | 76 | ] |
77 | 77 | }, |
78 | 78 | { |
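
For reference, a minimal sketch of the discretization step described above, assuming PyEMMA's coordinates API; the `data` array is a hypothetical stand-in for the tutorial's low-dimensional data set, and the parameter values mirror the text ($100$ centers, stride $5$):

```python
import numpy as np
import pyemma.coordinates as coor

# Hypothetical low-dimensional toy data standing in for the tutorial's data set.
data = np.random.randn(10000, 2)

# Discretize directly with k-means: 100 centers, stride of 5 to reduce the
# computational effort of estimating the centers.
cluster_kmeans = coor.cluster_kmeans(data, k=100, stride=5, max_iter=50)
print(cluster_kmeans.clustercenters.shape)  # (100, 2)
```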
|
129 | 129 | "cell_type": "markdown", |
130 | 130 | "metadata": {}, |
131 | 131 | "source": [ |
132 | | - "Have you noticed how the $k$-means centers follow the density of the data points while the regspace centers are spread uniformly over the whole area?\n", |
| 132 | + "Have you noticed how the $k$-means centers follow the density of the data points while the regspace centers are spread uniformly over the whole area? \n", |
| 133 | + "\n", |
| 134 | + "If your are only interested in well sampled states, you should use a density based method to discretize. If exploring new states is one of your objectives, it might be of advantage to place states also in rarely observed regions. The latter is especially useful in adaptive sampling approaches, because in the initial phase you want to explore the phase space as much as possible. The downside of placing states in areas of low density is that we will have poor statistics on these states. \n", |
| 135 | + "\n", |
| 136 | + "Another advantage of regular space clustering is that it is very fast in comparison to $k$-means: regspace clustering runs in linear time while $k$-means is superpolynomial in time.\n", |
| 137 | + "\n", |
| 138 | + "For very large datasets we also offer a mini batch version of $k$-means, which has the same semantics as the original method, but trains the centers on subsets of your data. This tutorial does not cover this case, but you should keep in mind, that $k$-means requires your low dimensional space to fit into your main memory.\n", |
133 | 139 | "\n", |
134 | 140 | "The main result of a discretization for Markov modeling, however, is not the set of centers but the time series of discrete states. These are accessible via the `dtrajs` attribute of any clustering object:" |
135 | 141 | ] |
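
A sketch of the points made above (regspace clustering, mini-batch $k$-means, and the `dtrajs` attribute), assuming PyEMMA's coordinates API; `data` and the chosen `dmin`/`k` values are hypothetical placeholders:

```python
import numpy as np
import pyemma.coordinates as coor

data = np.random.randn(10000, 2)  # placeholder for the tutorial's data set

# Regular space clustering: centers are placed so that no two are closer than dmin,
# i.e. they are spread uniformly over the sampled area regardless of density.
cluster_regspace = coor.cluster_regspace(data, dmin=0.3)

# Mini-batch k-means trains the centers on subsets of the data and is meant for
# data sets that are too large for ordinary k-means.
cluster_mbkmeans = coor.cluster_mini_batch_kmeans(data, k=100, max_iter=50)

# The main result for Markov modeling: one discrete (integer) trajectory per input
# trajectory, available via the dtrajs attribute of any clustering object.
dtrajs = cluster_regspace.dtrajs
print(len(dtrajs), dtrajs[0][:10])
```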
|
372 | 378 | "source": [ |
373 | 379 | "Again, notice the difference between $k$-means and regspace clustering.\n", |
374 | 380 | "\n", |
375 | | - "Now, we use a different featurization for the same data set and revisit how to use PCA and TICA.\n", |
| 381 | + "Now, we use a different featurization for the same data set and revisit how to use PCA and TICA. In practice you almost never would like to use PCA as dimension reduction method in MSM building, as it does not preserve kinetic variance. We are showing it here in these exercises to make this point clear.\n", |
376 | 382 | "\n", |
377 | 383 | "#### Exercise 1: data loading \n", |
378 | 384 | "\n", |
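
A minimal sketch contrasting PCA and TICA with PyEMMA, assuming a featurized array `data` (hypothetical here) and an arbitrary lag time; it only illustrates the API calls, not the actual exercise data:

```python
import numpy as np
import pyemma.coordinates as coor

# Placeholder featurized data; in the notebook this comes from a featurizer/reader.
data = np.random.randn(10000, 10)

# PCA maximizes the retained variance but ignores kinetics, which is why it is
# generally not recommended for MSM building.
pca = coor.pca(data, dim=2)

# TICA maximizes autocorrelation at the given lag time and preserves kinetic
# variance, making it the preferred projection for MSMs.
tica = coor.tica(data, lag=10, dim=2)

pca_output = pca.get_output()    # list with one (n_frames, 2) array per trajectory
tica_output = tica.get_output()
```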
|
1180 | 1186 | "name": "python", |
1181 | 1187 | "nbconvert_exporter": "python", |
1182 | 1188 | "pygments_lexer": "ipython3", |
1183 | | - "version": "3.6.5" |
| 1189 | + "version": "3.6.6" |
1184 | 1190 | }, |
1185 | 1191 | "toc": { |
1186 | 1192 | "base_numbering": 1, |
|