Consecutive deepmd-kit runs fail on Summit supercomputer while doing hyperparameter optimization #1430
Replies: 3 comments 2 replies
-
It may be an MPI issue. If you install from conda, we only provide Horovod built against MPICH2, and MPICH2 will also be installed into the conda environment. It's unclear whether it supports …
-
What this actually turns out to be, @njzjz, is unique to our way of using it inside a Dask workflow, which calls …
-
We can run Horovod just fine with jsrun. The problem is that we would like to use the Gloo backend. Our issue is that we need to run deepmd more than once per jsrun, and that is a problem for any code that calls MPI_Init.
…On Fri, Jan 21, 2022, 9:17 PM Jinzhe Zeng ***@***.***> wrote:
I think Horovod does support jsrun (horovod/horovod#1805, horovod/horovod#1441), but you need to rebuild Horovod against the corresponding MPI.
-
Problem Statement
We are using an evolutionary algorithm to optimize `deepmd-kit` hyperparameters on Oak Ridge National Laboratory's Summit supercomputer, and are using the water example as a test-bed for our code. We have observed that one run of `deepmd-kit` happens on a given Summit node, and then subsequent `deepmd-kit` runs hang.
Isolating the Problem
We were able to isolate this problem to a minimal case: we have a single batch job run `deepmd-kit` in one directory and then immediately try to run it in a second directory on a copy of the same `input.json`. Once the process hangs, we logged into the Summit node and attached `gdb` to the running process. There we observed these top execution stack frames:

We are guessing that the first `deepmd-kit` invocation didn't do a proper shutdown of `horovod`, which meant that a second execution of `deepmd-kit` would start in a pathological hung state, forever waiting on one or more resources that will never be freed up. To better show this test case, we run the following script on a single Summit node:
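The script itself is not reproduced in this copy of the thread. As a stand-in, here is a minimal Python sketch of the same back-to-back pattern; the directory names `first_dir`/`second_dir` come from the description of the test case, while the exact `dp train input.json` command line and the `run_training` helper are assumptions for illustration:

```python
import subprocess

def run_training(workdir, cmd=("dp", "train", "input.json")):
    """Run one training invocation with `workdir` as the working directory.

    The default command line is an assumption about how `dp` is invoked
    here; the function returns the child process's exit code.
    """
    result = subprocess.run(cmd, cwd=workdir)
    return result.returncode
```

Calling `run_training("first_dir")` followed immediately by `run_training("second_dir")` mimics the two consecutive invocations: in our experiment the first completes and the second hangs.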
I.e., we just run `dp` back-to-back in different directories, which emulates how the evolutionary algorithm (EA) works: the EA has a UUID for each individual representing a particular set of hyperparameter values, so it creates a directory with that UUID as its name and then runs `dp` in that directory with an associated `input.json` crafted from gene values. In any case, the first `dp` invocation runs fine in `first_dir`, but the second `dp` run gets into that busy-wait state in `second_dir` and does nothing.
Program execution version and state
This is the output of the first successful run to share software version and invocation state:
Batch submission for reproducibility
I've attached the corresponding LSF batch file used to run the experiment that produced this pathological result: wedged-deepmdkit.lsf.gz
Possible solutions
- Call `horovod.shutdown()` at the end of training to clean up dangling resources and prepare for the next run
- Disable `horovod` somehow and run on a single GPU

We've tried adding a `horovod.shutdown()` call to `deepmd-kit/deepmd/train/run_options.py` in a new destructor, `__del__()`, for the `RunOptions` class, but that didn't appear to get called when training finished. We are looking into ways of ensuring that shutdown is called, to see if that frees up resources for subsequent invocations; if that fails, we may look into building a Singularity container. However, if someone here understands the problem and has a good solution, we'd love to hear it.
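One note on the `__del__` approach: CPython does not guarantee that destructors run for objects still referenced at interpreter shutdown, which may explain why the destructor never fired. An `atexit` hook is a more dependable place for process-exit cleanup. Below is a stdlib-only sketch; the `horovod.shutdown()` placement is hypothetical and appears only in a comment, and the demo function simply proves that `atexit` hooks fire at normal interpreter exit:

```python
import subprocess
import sys
import textwrap

# Hypothetical pattern: register the cleanup once at startup instead of
# relying on __del__. With a Horovod install this would be something like:
#
#     atexit.register(horovod.shutdown)
#
# The function below is a stdlib-only demonstration that atexit hooks run
# at normal interpreter exit, unlike __del__, which CPython may never call
# for objects still referenced at shutdown.

def demo_atexit_fires():
    """Run a child interpreter whose atexit hook prints a marker.

    The marker appearing on the child's stdout proves the hook ran
    during interpreter shutdown."""
    child = textwrap.dedent("""
        import atexit
        atexit.register(lambda: print("backend shutdown"))
    """)
    out = subprocess.run([sys.executable, "-c", child],
                         capture_output=True, text=True)
    return out.stdout.strip()
```

If registering the shutdown this way works, it would sidestep the question of when (or whether) `RunOptions.__del__()` is invoked.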