Commit 617e34d

Merge branch 'master' into th_rev
2 parents 02fa188 + e5def20
File tree

3 files changed: +77, -6 lines

.circleci/config.yml

Lines changed: 1 addition & 2 deletions

@@ -14,14 +14,13 @@ jobs:
       - run:
           name: conda_config
           command: |
-            conda config --add channels conda-forge
             conda config --set always_yes true
             conda config --set quiet true
       - run: conda install conda-build
       - run: mkdir $NBVAL_OUTPUT
       - run:
           name: build_test
-          command: conda build .
+          command: conda build -c conda-forge .
           no_output_timeout: 20m
       - store_test_results:
           path: ~/junit
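For reference, the net effect of this hunk is that the conda-forge channel is passed per build command rather than added to the global conda configuration. A sketch of the resulting step list, reconstructed from the diff context (the surrounding `jobs:`/`steps:` keys and exact indentation are assumptions, since they are outside the hunk):

```yaml
jobs:
  build:
    steps:
      - run:
          name: conda_config
          command: |
            conda config --set always_yes true
            conda config --set quiet true
      - run: conda install conda-build
      - run: mkdir $NBVAL_OUTPUT
      - run:
          name: build_test
          # channel is now scoped to this one build invocation
          command: conda build -c conda-forge .
          no_output_timeout: 20m
      - store_test_results:
          path: ~/junit
```

Scoping the channel to the build command keeps the CI image's global conda config untouched, so other steps are unaffected by channel priority.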

manuscript/manuscript.tex

Lines changed: 1 addition & 1 deletion

@@ -594,7 +594,7 @@ \subsection{Modeling large systems}
 This problem may be mitigated by choosing a more specific set of features.

 Additional technical challenges for large systems include high demands on memory and computation time;
-we explain how to deal with those in the tutorials.
+we explain how to deal with those in the tutorials (Notebooks 00 and 02).

 More details on how to model complex systems with the techniques presented here are described, e.g., by~\cite{plattner_protein_2015,plattner_complete_2017}.
 We further demonstrate the symptoms of difficult data situations and how to deal with them in Notebook 08.

notebooks/02-dimension-reduction-and-discretization.ipynb

Lines changed: 75 additions & 3 deletions

@@ -348,7 +348,13 @@
    "cell_type": "markdown",
    "metadata": {},
    "source": [
-    "In this simple example, we clearly see a significant correlation between the $y$ component of the input data and the first independent component.\n",
+    "In this simple example, we clearly see a significant correlation between the $y$ component of the input data and the first independent component."
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
     "\n",
     "## Case 2: low-dimensional molecular dynamics data (alanine dipeptide)\n",
     "\n",
@@ -412,8 +418,74 @@
412418
"Now, we use a different featurization for the same data set and revisit how to use PCA, TICA, and VAMP.\n",
413419
"\n",
414420
"⚠️ In practice you almost never would like to use PCA as dimension reduction method in MSM building,\n",
415-
"as it does not preserve kinetic variance. We are showing it here in these exercises to make this point clear.\n",
416-
"\n",
421+
"as it does not preserve kinetic variance. We are showing it here in these exercises to make this point clear."
422+
]
423+
},
424+
{
425+
"cell_type": "markdown",
426+
"metadata": {},
427+
"source": [
428+
"### Streaming memory discretization\n",
429+
"For real world case examples it is often not possible to load entire datasets into main memory. We can perform the whole discretization step without the need of having the dataset fit into memory. Keep in mind that this is not as efficient as loading into memory, because certain calculations (e.g. featurization), will have to be recomputed during iterations."
430+
]
431+
},
432+
{
433+
"cell_type": "code",
434+
"execution_count": null,
435+
"metadata": {},
436+
"outputs": [],
437+
"source": [
438+
"reader = pyemma.coordinates.source(files, top=pdb) # create reader\n",
439+
"reader.featurizer.add_backbone_torsions(periodic=False) # add feature\n",
440+
"tica = pyemma.coordinates.tica(reader) # perform tica on feature space\n",
441+
"cluster = pyemma.coordinates.cluster_mini_batch_kmeans(tica, k=10, batch_size=0.1, max_iter=3) # cluster in tica space\n",
442+
"# get result\n",
443+
"dtrajs = cluster.dtrajs\n",
444+
"print('discrete trajectories:', dtrajs)"
445+
]
446+
},
447+
{
448+
"cell_type": "markdown",
449+
"metadata": {},
450+
"source": [
451+
"We should mention that regular space clustering does not require to load the TICA output into memory, while $k$-means does. Use the minibatch version if your TICA output does not fit memory. Since the minibatch version takes more time to converge, it is therefore desirable to to shrink the TICA output to fit into memory. We split the pipeline for cluster estimation, and re-use the reader to for the assignment of the full dataset."
452+
]
453+
},
454+
{
455+
"cell_type": "code",
456+
"execution_count": null,
457+
"metadata": {},
458+
"outputs": [],
459+
"source": [
460+
"cluster = pyemma.coordinates.cluster_kmeans(tica, k=10, stride=3) # use only 1/3 of the input data to find centers"
461+
]
462+
},
463+
{
464+
"cell_type": "markdown",
465+
"metadata": {},
466+
"source": [
467+
"Have you noticed how fast this converged compared to the minibatch version?\n",
468+
"We can now just obtain the discrete trajectories by accessing the property on the cluster instance.\n",
469+
"This will get all the TICA projected trajectories and assign them to the centers computed on the reduced data set."
470+
]
471+
},
472+
{
473+
"cell_type": "code",
474+
"execution_count": null,
475+
"metadata": {},
476+
"outputs": [],
477+
"source": [
478+
"dtrajs = cluster.dtrajs\n",
479+
"print('Assignment:', dtrajs)\n",
480+
"dtrajs_len = [len(d) for d in dtrajs]\n",
481+
"for dtraj_len, input_len in zip(dtrajs_len, reader.trajectory_lengths()):\n",
482+
" print('Input length:', input_len, '\\tdtraj length:', dtraj_len)"
483+
]
484+
},
485+
{
486+
"cell_type": "markdown",
487+
"metadata": {},
488+
"source": [
417489
"#### Exercise 1: data loading \n",
418490
"\n",
419491
"Load the heavy atoms' positions into memory."
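The estimate-on-a-stride, assign-everything pattern this notebook adds (`cluster_kmeans(tica, k=10, stride=3)` followed by reading `cluster.dtrajs`) can be illustrated without pyemma. The following stdlib-only toy sketch is not the tutorial's code: the `kmeans` and `assign` helpers and the 1-D two-state Gaussian data are invented for illustration. It fits centers on every third frame, then assigns the full trajectory:

```python
import random

def kmeans(points, k, n_iter=50, seed=0):
    """Plain Lloyd k-means on a list of 1-D points (illustrative only)."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)  # initialize centers from the data
    for _ in range(n_iter):
        # assignment step: index of the nearest center for every point
        labels = [min(range(k), key=lambda j: abs(p - centers[j])) for p in points]
        # update step: mean of each cluster (keep old center if cluster is empty)
        for j in range(k):
            members = [p for p, l in zip(points, labels) if l == j]
            if members:
                centers[j] = sum(members) / len(members)
    return centers

def assign(points, centers):
    """Assign every point to its nearest center (the toy 'discrete trajectory')."""
    return [min(range(len(centers)), key=lambda j: abs(p - centers[j]))
            for p in points]

# two well-separated 1-D "metastable states" around 0.0 and 5.0
rng = random.Random(42)
data = ([rng.gauss(0.0, 0.1) for _ in range(300)]
        + [rng.gauss(5.0, 0.1) for _ in range(300)])

centers = kmeans(data[::3], k=2)   # estimate centers on a stride-3 subset only
dtraj = assign(data, centers)      # then assign the *full* dataset
print(len(dtraj), sorted(set(dtraj)))
```

As in the notebook, the discrete trajectory covers every frame even though the centers were fit on a third of the data; with well-separated states, the strided estimate lands on essentially the same centers as a full fit.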
