
Unknown MTD bottleneck #467

@chertianser

Describe the bug
I am running crest on relatively large transition-metal complexes (>100 atoms) using GFN2-xTB. Some jobs take days on 48 cores, and the bottleneck is the MTD step.

One example:

 -----------------
 Wall Time Summary
 -----------------
 CREST runtime (total)               4 d, 10 h, 16 min,  1.196 sec
 ------------------------------------------------------------------
 Trial metadynamics (MTD)   ...       20 min, 57.004 sec (  0.329%)
 Metadynamics (MTD)         ...     3809 min, 42.823 sec ( 59.751%)
 Geometry optimization      ...     1244 min, 51.169 sec ( 19.524%)
 Molecular dynamics (MD)    ...     1275 min, 33.403 sec ( 20.006%)
 Genetic crossing (GC)      ...       24 min, 40.559 sec (  0.387%)
 I/O and setup              ...        0 min, 16.238 sec (  0.004%)
 ------------------------------------------------------------------
 * wall-time:     4 d, 10 h, 16 min,  1.196 sec
 *  cpu-time:   107 d, 23 h, 49 min, 24.465 sec
 * ratio c/w:    24.390 speedup
 ------------------------------------------------------------------
 * Total number of energy+grad calls: 4512415
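
As a rough sanity check on the summary above, the following Python sketch converts the printed durations to seconds and derives the cpu/wall ratio, the parallel efficiency on 48 cores, and the average CPU cost per energy+gradient call. The helper name `to_seconds` and the regular expression are illustrative assumptions, not part of crest; the numbers are copied from the summary.

```python
import re

# Matches durations as printed in the wall-time summary, e.g.
# "4 d, 10 h, 16 min,  1.196 sec". The day and hour fields are optional.
_DURATION = re.compile(
    r"(?:(\d+)\s*d,\s*)?(?:(\d+)\s*h,\s*)?(\d+)\s*min,\s*([\d.]+)\s*sec"
)


def to_seconds(text: str) -> float:
    """Convert a duration string from the summary into seconds (hypothetical helper)."""
    match = _DURATION.search(text)
    if match is None:
        raise ValueError(f"no duration found in {text!r}")
    days, hours, minutes, seconds = match.groups()
    return (
        int(days or 0) * 86400
        + int(hours or 0) * 3600
        + int(minutes) * 60
        + float(seconds)
    )


# Values copied from the summary above.
wall = to_seconds("4 d, 10 h, 16 min,  1.196 sec")
cpu = to_seconds("107 d, 23 h, 49 min, 24.465 sec")
n_cores = 48
n_calls = 4512415

speedup = cpu / wall
print(f"cpu/wall ratio:       {speedup:.3f}")
print(f"parallel efficiency:  {speedup / n_cores:.1%}")
print(f"CPU s per E+grad call: {cpu / n_calls:.2f}")
```

With these inputs the ratio reproduces the reported 24.390, which corresponds to roughly 51% parallel efficiency on 48 cores and about 2 CPU seconds per energy+gradient call.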

The individual MTD runs themselves seem to be quick:

*MTD   9 completed successfully ...       15 min,  4.139 sec
*MTD  11 completed successfully ...       15 min, 22.552 sec
*MTD   8 completed successfully ...       18 min, 59.209 sec
*MTD  12 completed successfully ...       27 min, 34.236 sec
*MTD   6 completed successfully ...       31 min, 41.620 sec
*MTD  13 completed successfully ...       32 min, 11.275 sec
*MTD   2 completed successfully ...       38 min, 13.662 sec
*MTD   5 completed successfully ...       39 min, 32.141 sec
*MTD   3 completed successfully ...       40 min, 51.536 sec
*MTD  10 completed successfully ...       51 min,  4.949 sec
*MTD   7 completed successfully ...       57 min, 24.290 sec
*MTD   1 completed successfully ...       57 min, 56.692 sec
*MTD   4 completed successfully ...       59 min, 36.460 sec
*MTD  14 completed successfully ...       16 min,  1.645 sec
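
To put the per-run times next to the 3809 min reported for the MTD step, here is a minimal Python sketch that parses the completion lines listed above and sums them. It assumes the list is a single batch of 14 runs; the full stdout may contain further batches, so the comparison is only indicative. The function name `mtd_minutes` is hypothetical.

```python
import re

# Completion lines copied from the listing above.
LOG = """\
*MTD   9 completed successfully ...       15 min,  4.139 sec
*MTD  11 completed successfully ...       15 min, 22.552 sec
*MTD   8 completed successfully ...       18 min, 59.209 sec
*MTD  12 completed successfully ...       27 min, 34.236 sec
*MTD   6 completed successfully ...       31 min, 41.620 sec
*MTD  13 completed successfully ...       32 min, 11.275 sec
*MTD   2 completed successfully ...       38 min, 13.662 sec
*MTD   5 completed successfully ...       39 min, 32.141 sec
*MTD   3 completed successfully ...       40 min, 51.536 sec
*MTD  10 completed successfully ...       51 min,  4.949 sec
*MTD   7 completed successfully ...       57 min, 24.290 sec
*MTD   1 completed successfully ...       57 min, 56.692 sec
*MTD   4 completed successfully ...       59 min, 36.460 sec
*MTD  14 completed successfully ...       16 min,  1.645 sec
"""

_MTD_LINE = re.compile(
    r"\*MTD\s+(\d+) completed successfully \.\.\.\s+(\d+) min,\s*([\d.]+) sec"
)


def mtd_minutes(log: str) -> dict[int, float]:
    """Map MTD index -> run time in minutes (hypothetical helper)."""
    return {
        int(index): int(minutes) + float(seconds) / 60.0
        for index, minutes, seconds in _MTD_LINE.findall(log)
    }


times = mtd_minutes(LOG)
total = sum(times.values())
longest = max(times.values())
reported_mtd_step = 3809 + 42.823 / 60.0  # minutes, from the wall-time summary

print(f"runs parsed:        {len(times)}")
print(f"sum of run times:   {total:.1f} min")
print(f"longest single run: {longest:.1f} min")
print(f"share of MTD step:  {total / reported_mtd_step:.1%}")
```

Even if these 14 runs had executed strictly one after another, they would add up to about 502 min, well short of the 3809 min attributed to the MTD step, which is why the remaining time looks unaccounted for.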

In particular, when I check the stdout periodically, the jobs typically stall at this point in the MTD stage:

...
========================================
           MTD Simulations done
========================================
 Collecting ensmbles.
CREGEN> running RMSDs ... done.
CREGEN> E lowest :  -192.88698
 45 structures remain within     6.00 kcal/mol window
 init_shake: metal bond            1           2 not constrained
 init_shake: metal bond            1           4 not constrained
 init_shake: metal bond            1          21 not constrained
 init_shake: metal bond            1          61 not constrained
 init_shake: metal bond            2          49 not constrained
 init_shake: metal bond           45          49 not constrained
 init_shake: metal bond           49          51 not constrained

>>> the jobs will stall here <<<
 ===============================================
 Additional regular MDs on lowest 4 conformer(s)
 ===============================================
...

top reports CPU utilization of approximately 6000%, so crest is apparently still running. Structures are also still being added to MD_FILES/crest_trj.*, so crest appears to be doing work despite printing nothing to stdout, but the additions are slow (perhaps 2-3 minutes per addition).
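
The rate at which structures are added could be measured rather than estimated by counting frames in the trajectory files at two points in time. The sketch below assumes the MD_FILES/crest_trj.* files are plain multi-structure XYZ files (atom count, comment line, then one line per atom); that format, and the helper names `count_xyz_frames` and `snapshot`, are assumptions for illustration.

```python
from pathlib import Path


def count_xyz_frames(path) -> int:
    """Count complete frames in a multi-structure XYZ file (assumed format)."""
    frames = 0
    with open(path) as handle:
        while True:
            header = handle.readline()
            if not header:
                break  # end of file
            if not header.strip():
                continue  # tolerate blank lines between frames
            n_atoms = int(header.split()[0])
            # Skip the comment line plus one line per atom.
            for _ in range(n_atoms + 1):
                if not handle.readline():
                    return frames  # last frame is still being written
            frames += 1
    return frames


def snapshot(directory="MD_FILES", pattern="crest_trj.*") -> dict[str, int]:
    """Return {file name: frame count} for every matching trajectory file."""
    return {
        path.name: count_xyz_frames(path)
        for path in sorted(Path(directory).glob(pattern))
    }


if __name__ == "__main__":
    import time

    interval = 600  # seconds between the two snapshots
    before = snapshot()
    time.sleep(interval)
    after = snapshot()
    for name, count in after.items():
        added = count - before.get(name, 0)
        print(f"{name}: {count} frames, +{added} in {interval} s")
```

Run from the crest working directory while a job is stalled, this prints how many frames each trajectory gained during the interval.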

I have removed the structure from the attached stdout log (happy to provide it via DM for testing). I am also considering switching this workflow to --gfn2//gfnff, which seems to run much faster on some other test cases.

stdout.txt

Labels: bug
