
Commit e5def20

Merge pull request #172 from marscher/treat_high_mem
added subsection on how to deal with huge datasets
2 parents 70e13b1 + 19b5876 commit e5def20

File tree

3 files changed: +77 −6 lines changed


.circleci/config.yml

Lines changed: 1 addition & 2 deletions

@@ -14,14 +14,13 @@ jobs:
       - run:
           name: conda_config
           command: |
-            conda config --add channels conda-forge
             conda config --set always_yes true
             conda config --set quiet true
       - run: conda install conda-build
       - run: mkdir $NBVAL_OUTPUT
       - run:
           name: build_test
-          command: conda build .
+          command: conda build -c conda-forge .
           no_output_timeout: 20m
       - store_test_results:
           path: ~/junit

manuscript/manuscript.tex

Lines changed: 1 addition & 1 deletion

@@ -584,7 +584,7 @@ \subsection{Modeling large systems}
 This problem may be mitigated by choosing a more specific set of features.

 Additional technical challenges for large systems include high demands on memory and computation time;
-we explain how to deal with those in the tutorials.
+we explain how to deal with those in the tutorials (Notebooks 00 and 02).

 More details on how to model complex systems with the techniques presented here are described, e.g., by~\cite{plattner_protein_2015,plattner_complete_2017}.

notebooks/02-dimension-reduction-and-discretization.ipynb

Lines changed: 75 additions & 3 deletions

@@ -347,7 +347,13 @@
    "cell_type": "markdown",
    "metadata": {},
    "source": [
-    "In this simple example, we clearly see a significant correlation between the $y$ component of the input data and the first independent component.\n",
+    "In this simple example, we clearly see a significant correlation between the $y$ component of the input data and the first independent component."
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
     "\n",
     "## Case 2: low-dimensional molecular dynamics data (alanine dipeptide)\n",
     "\n",
@@ -411,8 +417,74 @@
    "Now, we use a different featurization for the same data set and revisit how to use PCA, TICA, and VAMP.\n",
    "\n",
    "⚠️ In practice you would almost never want to use PCA as a dimension reduction method in MSM building,\n",
-    "as it does not preserve kinetic variance. We are showing it here in these exercises to make this point clear.\n",
-    "\n",
+    "as it does not preserve kinetic variance. We are showing it here in these exercises to make this point clear."
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "### Streaming memory discretization\n",
+    "For real-world examples it is often not possible to load the entire dataset into main memory. We can perform the whole discretization step without the dataset having to fit into memory. Keep in mind that this is not as efficient as working in memory, because certain calculations (e.g., featurization) have to be recomputed during each iteration over the data."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "reader = pyemma.coordinates.source(files, top=pdb)  # create a streaming reader\n",
+    "reader.featurizer.add_backbone_torsions(periodic=False)  # add features\n",
+    "tica = pyemma.coordinates.tica(reader)  # perform TICA on the feature space\n",
+    "cluster = pyemma.coordinates.cluster_mini_batch_kmeans(tica, k=10, batch_size=0.1, max_iter=3)  # cluster in TICA space\n",
+    "# get the result\n",
+    "dtrajs = cluster.dtrajs\n",
+    "print('discrete trajectories:', dtrajs)"
+   ]
+  },
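For contrast, a minimal sketch of the fully in-memory route, assuming the featurized dataset fits into RAM and re-using the `files` and `pdb` variables from the cell above:

import pyemma
feat = pyemma.coordinates.featurizer(pdb)  # featurizer for the same topology
feat.add_backbone_torsions(periodic=False)  # same feature as in the streaming setup
data = pyemma.coordinates.load(files, features=feat)  # featurize everything and keep it in RAM
tica = pyemma.coordinates.tica(data)  # TICA now operates on in-memory arrays
cluster = pyemma.coordinates.cluster_kmeans(tica, k=10)  # plain k-means, no mini-batches needed
dtrajs = cluster.dtrajs

Nothing has to be recomputed here, but peak memory grows with the total number of frames times the feature dimension.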
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "Note that regular space clustering does not require loading the TICA output into memory, while $k$-means does. Use the mini-batch version if your TICA output does not fit into memory. Since the mini-batch version takes more time to converge, it is desirable to shrink the TICA output until it fits into memory, e.g., by striding. Here we split the pipeline for cluster estimation and re-use the reader for the assignment of the full dataset."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "cluster = pyemma.coordinates.cluster_kmeans(tica, k=10, stride=3)  # use only 1/3 of the input data to find centers"
+   ]
+  },
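Regular space clustering, mentioned above as the memory-friendly alternative, also streams its input, so the TICA output never has to be held in memory at once. A minimal sketch; the minimum inter-center distance dmin=0.5 is a placeholder that has to be tuned for the TICA space at hand:

cluster_reg = pyemma.coordinates.cluster_regspace(tica, dmin=0.5)  # streams the data; centers stay at least dmin apart
print('number of centers:', len(cluster_reg.clustercenters))
dtrajs_reg = cluster_reg.dtrajs

Unlike k-means, the number of centers is then controlled indirectly through dmin rather than fixed in advance.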
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "Have you noticed how much faster this converged compared to the mini-batch version?\n",
+    "We can now obtain the discrete trajectories by accessing the corresponding property of the cluster instance.\n",
+    "This takes all TICA-projected trajectories and assigns them to the centers computed on the reduced data set."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "dtrajs = cluster.dtrajs\n",
+    "print('Assignment:', dtrajs)\n",
+    "dtrajs_len = [len(d) for d in dtrajs]\n",
+    "for dtraj_len, input_len in zip(dtrajs_len, reader.trajectory_lengths()):\n",
+    "    print('Input length:', input_len, '\\tdtraj length:', dtraj_len)"
+   ]
+  },
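If you would rather shrink the TICA output explicitly instead of passing a stride to the clustering, the strided projection can be materialized in memory. A minimal sketch; stride=3 mirrors the clustering stride used above:

tica_strided = tica.get_output(stride=3)  # list of arrays, one per trajectory, keeping every 3rd frame
print('frames kept:', sum(len(t) for t in tica_strided))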
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
    "#### Exercise 1: data loading \n",
    "\n",
    "Load the heavy atoms' positions into memory."
