add missing citation and finish loose end in NB02 markdown [ci skip]

thempel · thempel · commit 76d7d5244231 · 2018-08-31T11:46:21.000+02:00
diff --git a/manuscript/literature.bib b/manuscript/literature.bib
@@ -642,3 +642,14 @@ @article{plattner_complete_2017
         year = {2017},
         pages = {1005}
 }
+@inproceedings{aggarwal_surprising_2001,
+	series = {Lecture {Notes} in {Computer} {Science}},
+	title = {On the {Surprising} {Behavior} of {Distance} {Metrics} in {High} {Dimensional} {Space}},
+	isbn = {978-3-540-44503-6},
+	booktitle = {Database {Theory} — {ICDT} 2001},
+	publisher = {Springer Berlin Heidelberg},
+	author = {Aggarwal, Charu C. and Hinneburg, Alexander and Keim, Daniel A.},
+	editor = {Van den Bussche, Jan and Vianu, Victor},
+	year = {2001},
+	pages = {420--434},
+}
diff --git a/notebooks/02-dimension-reduction-and-discretization.ipynb b/notebooks/02-dimension-reduction-and-discretization.ipynb
@@ -760,7 +760,7 @@
     "\n",
     "## Case 3: another molecular dynamics data set (pentapeptide)\n",
     "\n",
-    "Before we start to load and discretize the pentapeptide data set, let us discuss what the difficulties with larger protein systems are. The goal of this notebook is to find a state space discretization for MSM estimation. This means that an algorithm such as $k$-means has to be able to find a meaningful state space partitioning. In general, this works better in lower dimensional spaces. The modeler should be aware that a discretization of hundreds of dimensions will most likely yield unsatisfactory results due to the . \n",
+    "Before we start to load and discretize the pentapeptide data set, let us discuss what the difficulties with larger protein systems are. The goal of this notebook is to find a state space discretization for MSM estimation. This means that an algorithm such as $k$-means has to be able to find a meaningful state space partitioning. In general, this works better in lower dimensional spaces because Euclidean distances become less meaningful with increasing dimensionality <a id=\"ref-4\" href=\"#cite-aggarwal_surprising_2001\">aggarwal-01</a>. The modeler should be aware that a discretization of hundreds of dimensions will be computationally expensive and most likely yield unsatisfactory results. \n",
     "\n",
     "The first goal is thus to map the data to a reasonable number of dimensions, e.g. with a smart choice of features and/or by using TICA. Large systems often require significant parts of the kinetic variance to be discarded in order to obtain a balance between capturing as much of the kinetic variance as possible and achieving a reasonable discretization.\n",
     "\n",