Consecutive deepmd-kit runs fail on Summit supercomputer while doing hyperparameter optimization #1430
Replies: 3 comments 2 replies
-
It may be an MPI issue. If you install from conda, we only provide Horovod built against MPICH2, and MPICH2 will also be installed into the conda environment. It's unclear whether it supports …
-
What this actually turns out to be, @njzjz, is unique to our way of using it inside a Dask workflow, which calls …
-
We can run Horovod just fine with jsrun. The problem is that we would like to use the Gloo backend. Our issue is that we need to run deepmd more than once per jsrun, and that is a problem for any code that calls MPI_Init.
…On Fri, Jan 21, 2022, 9:17 PM Jinzhe Zeng ***@***.***> wrote:
I think Horovod does support jsrun (horovod/horovod#1805, horovod/horovod#1441), but you need to rebuild Horovod against the corresponding MPI.
-
Problem Statement
We are using an evolutionary algorithm to optimize `deepmd-kit` hyperparameters on Oak Ridge National Laboratory's Summit supercomputer, and are using the water example as a test-bed for our code. We have observed that one run of `deepmd-kit` happens on a given Summit node, and then subsequent `deepmd-kit` runs hang.
Isolating the Problem
We were able to isolate this problem to a minimal case: we have a single batch job run `deepmd-kit` in one directory and then immediately try to run it in a second directory on a copy of the same `input.json`. Once the process hangs, we logged into the Summit node and attached `gdb` to the running process. There we observed these top execution stack frames:

We are guessing that the first `deepmd-kit` invocation didn't do a proper shutdown of `horovod`, which meant that a second execution of `deepmd-kit` would start in a pathological hung state, forever waiting on one or more resources that will never be freed up. To better show this test case, we run the following script on a single Summit node:
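The script itself is not reproduced in this copy of the thread. As a stand-in, here is a minimal Python sketch of the same back-to-back pattern; the directory names `first_dir`/`second_dir` come from the description of the test case, while the exact `dp train input.json` command line and the `run_training` helper are assumptions for illustration:

```python
import subprocess

def run_training(workdir, cmd=("dp", "train", "input.json")):
    """Run one training invocation with `workdir` as the working directory.

    The default command line is an assumption about how `dp` is invoked
    here; the function returns the child process's exit code.
    """
    result = subprocess.run(cmd, cwd=workdir)
    return result.returncode
```

Calling `run_training("first_dir")` followed immediately by `run_training("second_dir")` mimics the two consecutive invocations: in our experiment the first completes and the second hangs.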
I.e., we just run `dp` back-to-back in different directories, which emulates how the evolutionary algorithm (EA) works: the EA has a UUID for each individual representing a particular set of hyperparameter values, so it creates a directory with that UUID as its name and then runs `dp` in that directory with an associated `input.json` crafted from gene values. In any case, the first `dp` invocation runs fine in `first_dir`, but the second `dp` run gets into that busy-wait state in `second_dir` and does nothing.
Program execution version and state
This is the output of the first successful run to share software version and invocation state:
Batch submission for reproducibility
I've attached the corresponding LSF batch file used to run the experiment that produced this pathological result: wedged-deepmdkit.lsf.gz
Possible solutions
- Call `horovod.shutdown()` at the end of training to clean up dangling resources and prepare for the next run
- Disable `horovod` somehow and run on a single GPU

We've tried adding a `horovod.shutdown()` call to `deepmd-kit/deepmd/train/run_options.py` in a new destructor, `__del__()`, for the `RunOptions` class, but that didn't appear to get called when training finished. We are looking into ways of ensuring that shutdown is called, to see if that frees up resources for subsequent invocations; if that fails, we may look into building a Singularity container. However, if someone here understands the problem and has a good solution, we'd love to hear it.
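One note on the `__del__` approach: CPython does not guarantee that destructors run for objects still referenced at interpreter shutdown, which may explain why the destructor never fired. An `atexit` hook is a more dependable place for process-exit cleanup. Below is a stdlib-only sketch; the `horovod.shutdown()` placement is hypothetical and appears only in a comment, and the demo function simply proves that `atexit` hooks fire at normal interpreter exit:

```python
import subprocess
import sys
import textwrap

# Hypothetical pattern: register the cleanup once at startup instead of
# relying on __del__. With a Horovod install this would be something like:
#
#     atexit.register(horovod.shutdown)
#
# The function below is a stdlib-only demonstration that atexit hooks run
# at normal interpreter exit, unlike __del__, which CPython may never call
# for objects still referenced at shutdown.

def demo_atexit_fires():
    """Run a child interpreter whose atexit hook prints a marker.

    The marker appearing on the child's stdout proves the hook ran
    during interpreter shutdown."""
    child = textwrap.dedent("""
        import atexit
        atexit.register(lambda: print("backend shutdown"))
    """)
    out = subprocess.run([sys.executable, "-c", child],
                         capture_output=True, text=True)
    return out.stdout.strip()
```

If registering the shutdown this way works, it would sidestep the question of when (or whether) `RunOptions.__del__()` is invoked.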