Commit 617e34d

Merge branch 'master' into th_rev
2 parents 02fa188 + e5def20
File tree

3 files changed: +77, -6 lines

.circleci/config.yml

Lines changed: 1 addition & 2 deletions

@@ -14,14 +14,13 @@ jobs:
       - run:
           name: conda_config
           command: |
-            conda config --add channels conda-forge
             conda config --set always_yes true
             conda config --set quiet true
       - run: conda install conda-build
       - run: mkdir $NBVAL_OUTPUT
       - run:
           name: build_test
-          command: conda build .
+          command: conda build -c conda-forge .
           no_output_timeout: 20m
       - store_test_results:
           path: ~/junit
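For reference, the net effect of this hunk is that the conda-forge channel is passed per build command rather than added to the global conda configuration. A sketch of the resulting step list, reconstructed from the diff context (the surrounding `jobs:`/`steps:` keys and exact indentation are assumptions, since they are outside the hunk):

```yaml
jobs:
  build:
    steps:
      - run:
          name: conda_config
          command: |
            conda config --set always_yes true
            conda config --set quiet true
      - run: conda install conda-build
      - run: mkdir $NBVAL_OUTPUT
      - run:
          name: build_test
          # channel is now scoped to this one build invocation
          command: conda build -c conda-forge .
          no_output_timeout: 20m
      - store_test_results:
          path: ~/junit
```

Scoping the channel to the build command keeps the CI image's global conda config untouched, so other steps are unaffected by channel priority.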

manuscript/manuscript.tex

Lines changed: 1 addition & 1 deletion

@@ -594,7 +594,7 @@ \subsection{Modeling large systems}
 This problem may be mitigated by choosing a more specific set of features.

 Additional technical challenges for large systems include high demands on memory and computation time;
-we explain how to deal with those in the tutorials.
+we explain how to deal with those in the tutorials (Notebooks 00 and 02).

 More details on how to model complex systems with the techniques presented here are described, e.g., by~\cite{plattner_protein_2015,plattner_complete_2017}.
 We further demonstrate the symptoms of difficult data situations and how to deal with them in Notebook 08.

notebooks/02-dimension-reduction-and-discretization.ipynb

Lines changed: 75 additions & 3 deletions

@@ -348,7 +348,13 @@
    "cell_type": "markdown",
    "metadata": {},
    "source": [
-    "In this simple example, we clearly see a significant correlation between the $y$ component of the input data and the first independent component.\n",
+    "In this simple example, we clearly see a significant correlation between the $y$ component of the input data and the first independent component."
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
     "\n",
     "## Case 2: low-dimensional molecular dynamics data (alanine dipeptide)\n",
     "\n",
@@ -412,8 +418,74 @@
412418
"Now, we use a different featurization for the same data set and revisit how to use PCA, TICA, and VAMP.\n",
413419
"\n",
414420
"⚠️ In practice you almost never would like to use PCA as dimension reduction method in MSM building,\n",
415-
"as it does not preserve kinetic variance. We are showing it here in these exercises to make this point clear.\n",
416-
"\n",
421+
"as it does not preserve kinetic variance. We are showing it here in these exercises to make this point clear."
422+
]
423+
},
424+
{
425+
"cell_type": "markdown",
426+
"metadata": {},
427+
"source": [
428+
"### Streaming memory discretization\n",
429+
"For real world case examples it is often not possible to load entire datasets into main memory. We can perform the whole discretization step without the need of having the dataset fit into memory. Keep in mind that this is not as efficient as loading into memory, because certain calculations (e.g. featurization), will have to be recomputed during iterations."
430+
]
431+
},
432+
{
433+
"cell_type": "code",
434+
"execution_count": null,
435+
"metadata": {},
436+
"outputs": [],
437+
"source": [
438+
"reader = pyemma.coordinates.source(files, top=pdb) # create reader\n",
439+
"reader.featurizer.add_backbone_torsions(periodic=False) # add feature\n",
440+
"tica = pyemma.coordinates.tica(reader) # perform tica on feature space\n",
441+
"cluster = pyemma.coordinates.cluster_mini_batch_kmeans(tica, k=10, batch_size=0.1, max_iter=3) # cluster in tica space\n",
442+
"# get result\n",
443+
"dtrajs = cluster.dtrajs\n",
444+
"print('discrete trajectories:', dtrajs)"
445+
]
446+
},
447+
{
448+
"cell_type": "markdown",
449+
"metadata": {},
450+
"source": [
451+
"We should mention that regular space clustering does not require to load the TICA output into memory, while $k$-means does. Use the minibatch version if your TICA output does not fit memory. Since the minibatch version takes more time to converge, it is therefore desirable to to shrink the TICA output to fit into memory. We split the pipeline for cluster estimation, and re-use the reader to for the assignment of the full dataset."
452+
]
453+
},
454+
{
455+
"cell_type": "code",
456+
"execution_count": null,
457+
"metadata": {},
458+
"outputs": [],
459+
"source": [
460+
"cluster = pyemma.coordinates.cluster_kmeans(tica, k=10, stride=3) # use only 1/3 of the input data to find centers"
461+
]
462+
},
463+
{
464+
"cell_type": "markdown",
465+
"metadata": {},
466+
"source": [
467+
"Have you noticed how fast this converged compared to the minibatch version?\n",
468+
"We can now just obtain the discrete trajectories by accessing the property on the cluster instance.\n",
469+
"This will get all the TICA projected trajectories and assign them to the centers computed on the reduced data set."
470+
]
471+
},
472+
{
473+
"cell_type": "code",
474+
"execution_count": null,
475+
"metadata": {},
476+
"outputs": [],
477+
"source": [
478+
"dtrajs = cluster.dtrajs\n",
479+
"print('Assignment:', dtrajs)\n",
480+
"dtrajs_len = [len(d) for d in dtrajs]\n",
481+
"for dtraj_len, input_len in zip(dtrajs_len, reader.trajectory_lengths()):\n",
482+
" print('Input length:', input_len, '\\tdtraj length:', dtraj_len)"
483+
]
484+
},
485+
{
486+
"cell_type": "markdown",
487+
"metadata": {},
488+
"source": [
417489
"#### Exercise 1: data loading \n",
418490
"\n",
419491
"Load the heavy atoms' positions into memory."
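The estimate-on-a-stride, assign-everything pattern this notebook adds (`cluster_kmeans(tica, k=10, stride=3)` followed by reading `cluster.dtrajs`) can be illustrated without pyemma. The following stdlib-only toy sketch is not the tutorial's code: the `kmeans` and `assign` helpers and the 1-D two-state Gaussian data are invented for illustration. It fits centers on every third frame, then assigns the full trajectory:

```python
import random

def kmeans(points, k, n_iter=50, seed=0):
    """Plain Lloyd k-means on a list of 1-D points (illustrative only)."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)  # initialize centers from the data
    for _ in range(n_iter):
        # assignment step: index of the nearest center for every point
        labels = [min(range(k), key=lambda j: abs(p - centers[j])) for p in points]
        # update step: mean of each cluster (keep old center if cluster is empty)
        for j in range(k):
            members = [p for p, l in zip(points, labels) if l == j]
            if members:
                centers[j] = sum(members) / len(members)
    return centers

def assign(points, centers):
    """Assign every point to its nearest center (the toy 'discrete trajectory')."""
    return [min(range(len(centers)), key=lambda j: abs(p - centers[j]))
            for p in points]

# two well-separated 1-D "metastable states" around 0.0 and 5.0
rng = random.Random(42)
data = ([rng.gauss(0.0, 0.1) for _ in range(300)]
        + [rng.gauss(5.0, 0.1) for _ in range(300)])

centers = kmeans(data[::3], k=2)   # estimate centers on a stride-3 subset only
dtraj = assign(data, centers)      # then assign the *full* dataset
print(len(dtraj), sorted(set(dtraj)))
```

As in the notebook, the discrete trajectory covers every frame even though the centers were fit on a third of the data; with well-separated states, the strided estimate lands on essentially the same centers as a full fit.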
