|
348 | 348 | "cell_type": "markdown", |
349 | 349 | "metadata": {}, |
350 | 350 | "source": [ |
351 | | - "In this simple example, we clearly see a significant correlation between the $y$ component of the input data and the first independent component.\n", |
| 351 | + "In this simple example, we clearly see a significant correlation between the $y$ component of the input data and the first independent component." |
| 352 | + ] |
| 353 | + }, |
| 354 | + { |
| 355 | + "cell_type": "markdown", |
| 356 | + "metadata": {}, |
| 357 | + "source": [ |
352 | 358 | "\n", |
353 | 359 | "## Case 2: low-dimensional molecular dynamics data (alanine dipeptide)\n", |
354 | 360 | "\n", |
|
412 | 418 | "Now, we use a different featurization for the same data set and revisit how to use PCA, TICA, and VAMP.\n", |
413 | 419 | "\n", |
414 | 420 | "⚠️ In practice you would almost never use PCA as a dimension reduction method in MSM building,\n", |
415 | | - "as it does not preserve kinetic variance. We are showing it here in these exercises to make this point clear.\n", |
416 | | - "\n", |
| 421 | + "as it does not preserve kinetic variance. We are showing it here in these exercises to make this point clear." |
| 422 | + ] |
| 423 | + }, |
| 424 | + { |
| 425 | + "cell_type": "markdown", |
| 426 | + "metadata": {}, |
| 427 | + "source": [ |
| 428 | + "### Streaming memory discretization\n", |
| 429 | + "For real-world applications it is often not possible to load the entire dataset into main memory. We can perform the whole discretization step without the dataset having to fit into memory. Keep in mind that this is not as efficient as in-memory processing, because certain calculations (e.g., featurization) have to be recomputed in every iteration." |
| 430 | + ] |
| 431 | + }, |
| 432 | + { |
| 433 | + "cell_type": "code", |
| 434 | + "execution_count": null, |
| 435 | + "metadata": {}, |
| 436 | + "outputs": [], |
| 437 | + "source": [ |
| 438 | + "reader = pyemma.coordinates.source(files, top=pdb) # create reader\n", |
| 439 | + "reader.featurizer.add_backbone_torsions(periodic=False) # add feature\n", |
| 440 | + "tica = pyemma.coordinates.tica(reader) # perform tica on feature space\n", |
| 441 | + "cluster = pyemma.coordinates.cluster_mini_batch_kmeans(tica, k=10, batch_size=0.1, max_iter=3) # cluster in tica space\n", |
| 442 | + "# get result\n", |
| 443 | + "dtrajs = cluster.dtrajs\n", |
| 444 | + "print('discrete trajectories:', dtrajs)" |
| 445 | + ] |
| 446 | + }, |
| 447 | + { |
| 448 | + "cell_type": "markdown", |
| 449 | + "metadata": {}, |
| 450 | + "source": [ |
| 451 | + "We should mention that regular space clustering does not require loading the TICA output into memory, while $k$-means does. Use the minibatch version if your TICA output does not fit into memory. Since the minibatch version takes more time to converge, it is desirable to shrink the TICA output, e.g., by striding, so that it fits into memory. We split the pipeline for cluster estimation and re-use the reader for the assignment of the full dataset." |
| 452 | + ] |
| 453 | + }, |
| 454 | + { |
| 455 | + "cell_type": "code", |
| 456 | + "execution_count": null, |
| 457 | + "metadata": {}, |
| 458 | + "outputs": [], |
| 459 | + "source": [ |
| 460 | + "cluster = pyemma.coordinates.cluster_kmeans(tica, k=10, stride=3) # use only 1/3 of the input data to find centers" |
| 461 | + ] |
| 462 | + }, |
| 463 | + { |
| 464 | + "cell_type": "markdown", |
| 465 | + "metadata": {}, |
| 466 | + "source": [ |
| 467 | + "Have you noticed how fast this converged compared to the minibatch version?\n", |
| 468 | + "We can now obtain the discrete trajectories by accessing the `dtrajs` property on the cluster instance.\n", |
| 469 | + "This will assign all the TICA-projected trajectories to the centers computed on the reduced data set." |
| 470 | + ] |
| 471 | + }, |
| 472 | + { |
| 473 | + "cell_type": "code", |
| 474 | + "execution_count": null, |
| 475 | + "metadata": {}, |
| 476 | + "outputs": [], |
| 477 | + "source": [ |
| 478 | + "dtrajs = cluster.dtrajs\n", |
| 479 | + "print('Assignment:', dtrajs)\n", |
| 480 | + "dtrajs_len = [len(d) for d in dtrajs]\n", |
| 481 | + "for dtraj_len, input_len in zip(dtrajs_len, reader.trajectory_lengths()):\n", |
| 482 | + " print('Input length:', input_len, '\\tdtraj length:', dtraj_len)" |
| 483 | + ] |
| 484 | + }, |
| 485 | + { |
| 486 | + "cell_type": "markdown", |
| 487 | + "metadata": {}, |
| 488 | + "source": [ |
417 | 489 | "#### Exercise 1: data loading \n", |
418 | 490 | "\n", |
419 | 491 | "Load the heavy atoms' positions into memory." |
|