
Commit e5def20

Merge pull request #172 from marscher/treat_high_mem
added subsection on how to deal with huge datasets
2 parents 70e13b1 + 19b5876 commit e5def20

File tree

3 files changed: +77 −6 lines changed


.circleci/config.yml

Lines changed: 1 addition & 2 deletions

@@ -14,14 +14,13 @@ jobs:
       - run:
           name: conda_config
           command: |
-            conda config --add channels conda-forge
             conda config --set always_yes true
             conda config --set quiet true
       - run: conda install conda-build
       - run: mkdir $NBVAL_OUTPUT
       - run:
           name: build_test
-          command: conda build .
+          command: conda build -c conda-forge .
           no_output_timeout: 20m
       - store_test_results:
           path: ~/junit

manuscript/manuscript.tex

Lines changed: 1 addition & 1 deletion

@@ -584,7 +584,7 @@ \subsection{Modeling large systems}
 This problem may be mitigated by choosing a more specific set of features.

 Additional technical challenges for large systems include high demands on memory and computation time;
-we explain how to deal with those in the tutorials.
+we explain how to deal with those in the tutorials (Notebooks 00 and 02).

 More details on how to model complex systems with the techniques presented here are described, e.g., by~\cite{plattner_protein_2015,plattner_complete_2017}.

notebooks/02-dimension-reduction-and-discretization.ipynb

Lines changed: 75 additions & 3 deletions

@@ -347,7 +347,13 @@
    "cell_type": "markdown",
    "metadata": {},
    "source": [
-    "In this simple example, we clearly see a significant correlation between the $y$ component of the input data and the first independent component.\n",
+    "In this simple example, we clearly see a significant correlation between the $y$ component of the input data and the first independent component."
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
     "\n",
     "## Case 2: low-dimensional molecular dynamics data (alanine dipeptide)\n",
     "\n",
@@ -411,8 +417,74 @@
    "Now, we use a different featurization for the same data set and revisit how to use PCA, TICA, and VAMP.\n",
    "\n",
    "⚠️ In practice you would almost never want to use PCA as a dimension reduction method in MSM building,\n",
-    "as it does not preserve kinetic variance. We are showing it here in these exercises to make this point clear.\n",
-    "\n",
+    "as it does not preserve kinetic variance. We are showing it here in these exercises to make this point clear."
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "### Streaming memory discretization\n",
+    "For real-world examples it is often not possible to load the entire dataset into main memory. We can perform the whole discretization step without the dataset having to fit into memory. Keep in mind that this is not as efficient as working in memory, because certain calculations (e.g., featurization) have to be recomputed during each iteration over the data."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "reader = pyemma.coordinates.source(files, top=pdb)  # create a streaming reader\n",
+    "reader.featurizer.add_backbone_torsions(periodic=False)  # add features\n",
+    "tica = pyemma.coordinates.tica(reader)  # perform TICA on the feature space\n",
+    "cluster = pyemma.coordinates.cluster_mini_batch_kmeans(tica, k=10, batch_size=0.1, max_iter=3)  # cluster in TICA space\n",
+    "# get the result\n",
+    "dtrajs = cluster.dtrajs\n",
+    "print('discrete trajectories:', dtrajs)"
+   ]
+  },
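For contrast, a minimal sketch of the fully in-memory route, assuming the featurized dataset fits into RAM and re-using the `files` and `pdb` variables from the cell above:

import pyemma
feat = pyemma.coordinates.featurizer(pdb)  # featurizer for the same topology
feat.add_backbone_torsions(periodic=False)  # same feature as in the streaming setup
data = pyemma.coordinates.load(files, features=feat)  # featurize everything and keep it in RAM
tica = pyemma.coordinates.tica(data)  # TICA now operates on in-memory arrays
cluster = pyemma.coordinates.cluster_kmeans(tica, k=10)  # plain k-means, no mini-batches needed
dtrajs = cluster.dtrajs

Nothing has to be recomputed here, but peak memory grows with the total number of frames times the feature dimension.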
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "Note that regular space clustering does not require loading the TICA output into memory, while $k$-means does. Use the mini-batch version if your TICA output does not fit into memory. Since the mini-batch version takes more time to converge, it is desirable to shrink the TICA output until it fits into memory, e.g., by striding. Here we split the pipeline for cluster estimation and re-use the reader for the assignment of the full dataset."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "cluster = pyemma.coordinates.cluster_kmeans(tica, k=10, stride=3)  # use only 1/3 of the input data to find centers"
+   ]
+  },
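Regular space clustering, mentioned above as the memory-friendly alternative, also streams its input, so the TICA output never has to be held in memory at once. A minimal sketch; the minimum inter-center distance dmin=0.5 is a placeholder that has to be tuned for the TICA space at hand:

cluster_reg = pyemma.coordinates.cluster_regspace(tica, dmin=0.5)  # streams the data; centers stay at least dmin apart
print('number of centers:', len(cluster_reg.clustercenters))
dtrajs_reg = cluster_reg.dtrajs

Unlike k-means, the number of centers is then controlled indirectly through dmin rather than fixed in advance.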
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "Have you noticed how much faster this converged compared to the mini-batch version?\n",
+    "We can now obtain the discrete trajectories by accessing the corresponding property of the cluster instance.\n",
+    "This takes all TICA-projected trajectories and assigns them to the centers computed on the reduced data set."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "dtrajs = cluster.dtrajs\n",
+    "print('Assignment:', dtrajs)\n",
+    "dtrajs_len = [len(d) for d in dtrajs]\n",
+    "for dtraj_len, input_len in zip(dtrajs_len, reader.trajectory_lengths()):\n",
+    "    print('Input length:', input_len, '\\tdtraj length:', dtraj_len)"
+   ]
+  },
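If you would rather shrink the TICA output explicitly instead of passing a stride to the clustering, the strided projection can be materialized in memory. A minimal sketch; stride=3 mirrors the clustering stride used above:

tica_strided = tica.get_output(stride=3)  # list of arrays, one per trajectory, keeping every 3rd frame
print('frames kept:', sum(len(t) for t in tica_strided))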
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
    "#### Exercise 1: data loading \n",
    "\n",
    "Load the heavy atoms' positions into memory."
