Consecutive deepmd-kit runs fail on Summit supercomputer while doing hyperparameter optimization #1430
Replies: 3 comments 2 replies
-
It may be an MPI issue. If you install from conda, we only provide horovod built against MPICH2, and MPICH2 will also be installed into the conda environment. It's unclear whether it supports …
-
What this actually turns out to be, @njzjz, is unique to our way of using it inside a Dask workflow, which calls …
-
We can run horovod just fine with jsrun. The problem is that we would like to use the gloo backend. Our issue is that we need to run deepmd more than one time per jsrun. There is a problem with that for any code that calls MPI_Init.
…On Fri, Jan 21, 2022, 9:17 PM Jinzhe Zeng ***@***.***> wrote:
I think horovod does support jsrun (horovod/horovod#1805, horovod/horovod#1441) but you need to rebuild horovod against the corresponding MPI.
-
Problem Statement
We are using an evolutionary algorithm to optimize
deepmd-kit hyperparameters on Oak Ridge National Laboratory's Summit supercomputer, and are using the water example as a test-bed for our code. We have observed that one run of deepmd-kit completes on a given Summit node and then subsequent deepmd-kit runs hang.
Isolating the Problem
We were able to isolate this problem with a single batch job that runs deepmd-kit in one directory and then immediately tries to run it in a second directory on a copy of the same input.json. Once the process hung, we logged into the Summit node and attached gdb to the running process; the top execution stack frames showed it spinning in a busy wait. We are guessing that the first deepmd-kit invocation didn't do a proper shutdown of horovod, which means a second execution of deepmd-kit starts in a pathological hung state, forever waiting on one or more resources that will never be freed up.
To better show this test case, we run the following script on a single Summit node:
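(Our full batch script is attached below under "Batch submission for reproducibility"; what follows is only a minimal Python sketch of the pattern, assuming first_dir and second_dir each already contain a copy of the same input.json.)

```python
# Sketch of the back-to-back pattern (illustration only, not our actual
# LSF/batch script): run `dp train input.json` first in first_dir and then
# in second_dir, each of which holds a copy of the same input.json.
import subprocess

for workdir in ("first_dir", "second_dir"):
    # On Summit the run in first_dir completes, but the run in second_dir
    # hangs in a busy wait.
    subprocess.run(["dp", "train", "input.json"], cwd=workdir, check=True)
```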
I.e., we just run dp back-to-back in different directories, which emulates how the evolutionary algorithm (EA) works -- the EA has a UUID for each individual representing a particular set of hyperparameter values, so it will create a directory with that UUID as its name and then run dp in that directory with an associated input.json crafted from gene values. In any case, the first dp invocation runs fine in first_dir, but the second dp run gets into that busy-wait state in second_dir and does nothing.
Program execution version and state
This is the output of the first successful run, which shows the software version and invocation state:
Batch submission for reproducibility
I've attached the corresponding batch LSF file used to run our experiment that produced this pathological result: wedged-deepmdkit.lsf.gz
Possible solutions
- Call horovod.shutdown() at the end of training to clean up dangling resources and prepare for the next run
- Bypass horovod somehow and run on a single GPU

We've tried adding a horovod.shutdown() to deepmd-kit/deepmd/train/run_options.py in a new destructor, __del__(), for the RunOptions class, but that didn't appear to get called when training finishes. We are looking into ways of ensuring that shutdown is called, to see if that frees up resources for subsequent invocations; if that fails, we may look into building a Singularity container. However, if someone here understands this problem and has a good solution, we'd love to hear it.
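One direction we may try, purely as a sketch and not a verified fix, is to register the shutdown with atexit instead of relying on a destructor; the snippet below assumes deepmd-kit's TensorFlow Horovod module:

```python
# Sketch only: ensure horovod.shutdown() runs at interpreter exit via atexit,
# instead of relying on RunOptions.__del__, which did not appear to be called
# when training finishes.
import atexit

import horovod.tensorflow as hvd  # assumes deepmd-kit's TensorFlow backend


def _shutdown_horovod():
    # Release Horovod's communication resources so a later dp invocation on
    # the same node (hopefully) does not inherit a wedged state.
    try:
        hvd.shutdown()
    except Exception:
        pass  # nothing to clean up if Horovod was never initialized


atexit.register(_shutdown_horovod)
```

Unlike __del__, whose invocation is not guaranteed, atexit handlers run whenever the interpreter exits normally, which is closer to what we need here.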