Hi everyone,
I am trying to use deepmd-kit 2.0.0 to practice parallel training. I am able to train with
CUDA_VISIBLE_DEVICES=0,1 horovodrun -np 1 dp train params.yml
but when I set -np to 2, I get the error below. It seems that both MPI processes are trying to use the same GPU?
(base) [chazeon@exp-7-59 000.1]$ CUDA_VISIBLE_DEVICES=0,1 horovodrun -np 2 dp train params.yml
2021-09-02 12:42:50.054023: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcudart.so.10.1
[1,0]<stderr>:2021-09-02 12:44:11.109143: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcudart.so.10.1
[1,1]<stderr>:2021-09-02 12:44:11.121574: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcudart.so.10.1
[1,0]<stderr>:WARNING:tensorflow:From /expanse/lustre/scratch/chazeon/temp_project/deepmd-kit-2.0.0-cuda10.1_gpu/lib/python3.9/site-packages/tensorflow/python/compat/v2_compat.py:96: disable_resource_variables (from tensorflow.python.ops.variable_scope) is deprecated and will be removed in a future version.
[1,0]<stderr>:Instructions for updating:
[1,0]<stderr>:non-resource variables are not supported in the long term
[1,1]<stderr>:WARNING:tensorflow:From /expanse/lustre/scratch/chazeon/temp_project/deepmd-kit-2.0.0-cuda10.1_gpu/lib/python3.9/site-packages/tensorflow/python/compat/v2_compat.py:96: disable_resource_variables (from tensorflow.python.ops.variable_scope) is deprecated and will be removed in a future version.
[1,1]<stderr>:Instructions for updating:
[1,1]<stderr>:non-resource variables are not supported in the long term
[1,0]<stderr>:WARNING:root:Environment variable KMP_BLOCKTIME is empty. Use the default value 0
[1,0]<stderr>:WARNING:root:Environment variable KMP_AFFINITY is empty. Use the default value granularity=fine,verbose,compact,1,0
[1,1]<stderr>:WARNING:root:Environment variable KMP_BLOCKTIME is empty. Use the default value 0
[1,1]<stderr>:WARNING:root:Environment variable KMP_AFFINITY is empty. Use the default value granularity=fine,verbose,compact,1,0
[1,1]<stderr>:/expanse/lustre/scratch/chazeon/temp_project/deepmd-kit-2.0.0-cuda10.1_gpu/lib/python3.9/site-packages/deepmd/utils/compat.py:316: UserWarning: It seems that you are using a deepmd-kit input of version 1.x.x, which is deprecated. we have converted the input to >2.0.0 compatible, and output it to file input_v2_compat.json
[1,1]<stderr>: warnings.warn(msg)
[1,0]<stderr>:/expanse/lustre/scratch/chazeon/temp_project/deepmd-kit-2.0.0-cuda10.1_gpu/lib/python3.9/site-packages/deepmd/utils/compat.py:316: UserWarning: It seems that you are using a deepmd-kit input of version 1.x.x, which is deprecated. we have converted the input to >2.0.0 compatible, and output it to file input_v2_compat.json
[1,0]<stderr>: warnings.warn(msg)
[1,1]<stderr>:2021-09-02 12:44:43.922975: I tensorflow/core/platform/cpu_feature_guard.cc:142] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations: SSE4.1 SSE4.2 AVX AVX2 AVX512F FMA
[1,1]<stderr>:To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
[1,0]<stderr>:2021-09-02 12:44:43.923073: I tensorflow/core/platform/cpu_feature_guard.cc:142] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations: SSE4.1 SSE4.2 AVX AVX2 AVX512F FMA
[1,0]<stderr>:To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
[1,0]<stderr>:2021-09-02 12:44:43.925676: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcuda.so.1
[1,1]<stderr>:2021-09-02 12:44:43.926085: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcuda.so.1
[1,0]<stderr>:2021-09-02 12:44:44.258726: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1733] Found device 0 with properties:
[1,0]<stderr>:pciBusID: 0000:18:00.0 name: Tesla V100-SXM2-32GB computeCapability: 7.0
[1,0]<stderr>:coreClock: 1.53GHz coreCount: 80 deviceMemorySize: 31.75GiB deviceMemoryBandwidth: 836.37GiB/s
[1,0]<stderr>:2021-09-02 12:44:44.261339: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1733] Found device 1 with properties:
[1,0]<stderr>:pciBusID: 0000:3b:00.0 name: Tesla V100-SXM2-32GB computeCapability: 7.0
[1,0]<stderr>:coreClock: 1.53GHz coreCount: 80 deviceMemorySize: 31.75GiB deviceMemoryBandwidth: 836.37GiB/s
[1,0]<stderr>:2021-09-02 12:44:44.261414: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcudart.so.10.1
[1,1]<stderr>:2021-09-02 12:44:44.270950: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1733] Found device 0 with properties:
[1,1]<stderr>:pciBusID: 0000:18:00.0 name: Tesla V100-SXM2-32GB computeCapability: 7.0
[1,1]<stderr>:coreClock: 1.53GHz coreCount: 80 deviceMemorySize: 31.75GiB deviceMemoryBandwidth: 836.37GiB/s
[1,1]<stderr>:2021-09-02 12:44:44.272435: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1733] Found device 1 with properties:
[1,1]<stderr>:pciBusID: 0000:3b:00.0 name: Tesla V100-SXM2-32GB computeCapability: 7.0
[1,1]<stderr>:coreClock: 1.53GHz coreCount: 80 deviceMemorySize: 31.75GiB deviceMemoryBandwidth: 836.37GiB/s
[1,1]<stderr>:2021-09-02 12:44:44.272498: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcudart.so.10.1
[1,0]<stderr>:2021-09-02 12:44:44.274217: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcublas.so.10
[1,0]<stderr>:2021-09-02 12:44:44.274274: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcublasLt.so.10
[1,0]<stderr>:2021-09-02 12:44:44.276485: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcufft.so.10
[1,1]<stderr>:2021-09-02 12:44:44.278378: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcublas.so.10
[1,1]<stderr>:2021-09-02 12:44:44.278455: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcublasLt.so.10
[1,1]<stderr>:2021-09-02 12:44:44.280158: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcufft.so.10
[1,0]<stderr>:2021-09-02 12:44:44.288279: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcurand.so.10
[1,1]<stderr>:2021-09-02 12:44:44.289146: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcurand.so.10
[1,0]<stderr>:2021-09-02 12:44:44.293728: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcusolver.so.10
[1,0]<stderr>:2021-09-02 12:44:44.296202: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcusparse.so.10
[1,1]<stderr>:2021-09-02 12:44:44.296510: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcusolver.so.10
[1,0]<stderr>:2021-09-02 12:44:44.299719: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcudnn.so.7
[1,0]<stderr>:2021-09-02 12:44:44.304933: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1871] Adding visible gpu devices: 0, 1
[1,0]<stderr>:2021-09-02 12:44:44.304980: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcudart.so.10.1
[1,1]<stderr>:2021-09-02 12:44:44.306878: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcusparse.so.10
[1,1]<stderr>:2021-09-02 12:44:44.313240: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcudnn.so.7
[1,1]<stderr>:2021-09-02 12:44:44.324327: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1871] Adding visible gpu devices: 0, 1
[1,1]<stderr>:2021-09-02 12:44:44.325268: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcudart.so.10.1
[1,0]<stderr>:2021-09-02 12:44:45.801409: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1258] Device interconnect StreamExecutor with strength 1 edge matrix:
[1,0]<stderr>:2021-09-02 12:44:45.801480: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1264] 0 1
[1,0]<stderr>:2021-09-02 12:44:45.801512: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1277] 0: N Y
[1,0]<stderr>:2021-09-02 12:44:45.801525: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1277] 1: Y N
[1,1]<stderr>:2021-09-02 12:44:45.814254: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1258] Device interconnect StreamExecutor with strength 1 edge matrix:
[1,1]<stderr>:2021-09-02 12:44:45.814305: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1264] 0 1
[1,1]<stderr>:2021-09-02 12:44:45.814320: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1277] 0: N Y
[1,1]<stderr>:2021-09-02 12:44:45.814332: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1277] 1: Y N
[1,0]<stderr>:2021-09-02 12:44:45.814715: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1418] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 30555 MB memory) -> physical GPU (device: 0, name: Tesla V100-SXM2-32GB, pci bus id: 0000:18:00.0, compute capability: 7.0)
[1,1]<stderr>:2021-09-02 12:44:45.821670: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1418] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 30555 MB memory) -> physical GPU (device: 0, name: Tesla V100-SXM2-32GB, pci bus id: 0000:18:00.0, compute capability: 7.0)
[1,0]<stderr>:2021-09-02 12:44:45.823658: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1418] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:1 with 30555 MB memory) -> physical GPU (device: 1, name: Tesla V100-SXM2-32GB, pci bus id: 0000:3b:00.0, compute capability: 7.0)
[1,0]<stderr>:2021-09-02 12:44:45.823963: I tensorflow/core/common_runtime/process_util.cc:146] Creating new thread pool with default inter op setting: 2. Tune using inter_op_parallelism_threads for best performance.
[1,1]<stderr>:2021-09-02 12:44:45.825872: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1418] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:1 with 30555 MB memory) -> physical GPU (device: 1, name: Tesla V100-SXM2-32GB, pci bus id: 0000:3b:00.0, compute capability: 7.0)
[1,1]<stderr>:2021-09-02 12:44:45.826946: I tensorflow/core/common_runtime/process_util.cc:146] Creating new thread pool with default inter op setting: 2. Tune using inter_op_parallelism_threads for best performance.
[1,1]<stderr>:2021-09-02 12:44:45.955666: I tensorflow/core/platform/profile_utils/cpu_utils.cc:114] CPU Frequency: 2500000000 Hz
[1,0]<stderr>:2021-09-02 12:44:45.955888: I tensorflow/core/platform/profile_utils/cpu_utils.cc:114] CPU Frequency: 2500000000 Hz
[1,1]<stderr>:OMP: Info #155: KMP_AFFINITY: Initial OS proc set respected: 0
[1,1]<stderr>:OMP: Info #216: KMP_AFFINITY: decoding x2APIC ids.
[1,1]<stderr>:OMP: Info #157: KMP_AFFINITY: 1 available OS procs
[1,1]<stderr>:OMP: Info #158: KMP_AFFINITY: Uniform topology
[1,1]<stderr>:OMP: Info #287: KMP_AFFINITY: topology layer "LL cache" is equivalent to "socket".
[1,1]<stderr>:OMP: Info #287: KMP_AFFINITY: topology layer "L3 cache" is equivalent to "socket".
[1,1]<stderr>:OMP: Info #287: KMP_AFFINITY: topology layer "L2 cache" is equivalent to "core".
[1,1]<stderr>:OMP: Info #287: KMP_AFFINITY: topology layer "L1 cache" is equivalent to "core".
[1,1]<stderr>:OMP: Info #192: KMP_AFFINITY: 1 socket x 1 core/socket x 1 thread/core (1 total cores)
[1,1]<stderr>:OMP: Info #218: KMP_AFFINITY: OS proc to physical thread map:
[1,1]<stderr>:OMP: Info #172: KMP_AFFINITY: OS proc 0 maps to socket 0 core 0 thread 0
[1,1]<stderr>:OMP: Info #254: KMP_AFFINITY: pid 12453 tid 12565 thread 1 bound to OS proc set 0
[1,0]<stderr>:OMP: Info #155: KMP_AFFINITY: Initial OS proc set respected: 0
[1,0]<stderr>:OMP: Info #216: KMP_AFFINITY: decoding x2APIC ids.
[1,0]<stderr>:OMP: Info #157: KMP_AFFINITY: 1 available OS procs
[1,0]<stderr>:OMP: Info #158: KMP_AFFINITY: Uniform topology
[1,0]<stderr>:OMP: Info #287: KMP_AFFINITY: topology layer "LL cache" is equivalent to "socket".
[1,0]<stderr>:OMP: Info #287: KMP_AFFINITY: topology layer "L3 cache" is equivalent to "socket".
[1,0]<stderr>:OMP: Info #287: KMP_AFFINITY: topology layer "L2 cache" is equivalent to "core".
[1,0]<stderr>:OMP: Info #287: KMP_AFFINITY: topology layer "L1 cache" is equivalent to "core".
[1,0]<stderr>:OMP: Info #192: KMP_AFFINITY: 1 socket x 1 core/socket x 1 thread/core (1 total cores)
[1,0]<stderr>:OMP: Info #218: KMP_AFFINITY: OS proc to physical thread map:
[1,0]<stderr>:OMP: Info #172: KMP_AFFINITY: OS proc 0 maps to socket 0 core 0 thread 0
[1,0]<stderr>:OMP: Info #254: KMP_AFFINITY: pid 12452 tid 12561 thread 1 bound to OS proc set 0
[1,1]<stderr>:OMP: Info #254: KMP_AFFINITY: pid 12453 tid 12566 thread 2 bound to OS proc set 0
[1,0]<stderr>:OMP: Info #254: KMP_AFFINITY: pid 12452 tid 12562 thread 2 bound to OS proc set 0
[1,0]<stderr>:DEEPMD INFO training data with min nbor dist: 1.0231442946289506
[1,0]<stderr>:DEEPMD INFO training data with max nbor size: [64, 134, 64]
[1,1]<stderr>:DEEPMD INFO training data with min nbor dist: 1.0231442946289506
[1,1]<stderr>:DEEPMD INFO training data with max nbor size: [64, 134, 64]
[1,1]<stderr>:DEEPMD INFO _____ _____ __ __ _____ _ _ _
[1,1]<stderr>:DEEPMD INFO | __ \ | __ \ | \/ || __ \ | | (_)| |
[1,1]<stderr>:DEEPMD INFO | | | | ___ ___ | |__) || \ / || | | | ______ | | __ _ | |_
[1,1]<stderr>:DEEPMD INFO | | | | / _ \ / _ \| ___/ | |\/| || | | ||______|| |/ /| || __|
[1,1]<stderr>:DEEPMD INFO | |__| || __/| __/| | | | | || |__| | | < | || |_
[1,1]<stderr>:DEEPMD INFO |_____/ \___| \___||_| |_| |_||_____/ |_|\_\|_| \__|
[1,1]<stderr>:DEEPMD INFO Please read and cite:
[1,1]<stderr>:DEEPMD INFO Wang, Zhang, Han and E, Comput.Phys.Comm. 228, 178-184 (2018)
[1,1]<stderr>:DEEPMD INFO installed to: /tmp/pip-req-build-m0vk2rxi/_skbuild/linux-x86_64-3.9/cmake-install
[1,1]<stderr>:DEEPMD INFO source : v2.0.0
[1,1]<stderr>:DEEPMD INFO source brach: HEAD
[1,1]<stderr>:DEEPMD INFO source commit: 1a25414
[1,1]<stderr>:DEEPMD INFO source commit at: 2021-08-28 08:15:38 +0800
[1,1]<stderr>:DEEPMD INFO build float prec: double
[1,1]<stderr>:DEEPMD INFO build with tf inc: /expanse/lustre/scratch/chazeon/temp_project/deepmd-kit-2.0.0-cuda10.1_gpu/lib/python3.9/site-packages/tensorflow/include
[1,1]<stderr>:DEEPMD INFO build with tf lib:
[1,1]<stderr>:DEEPMD INFO ---Summary of the training---------------------------------------
[1,1]<stderr>:DEEPMD INFO running on: exp-7-59
[1,1]<stderr>:DEEPMD INFO computing device: gpu:0
[1,1]<stderr>:DEEPMD INFO CUDA_VISIBLE_DEVICES: 0,1
[1,1]<stderr>:DEEPMD INFO Count of visible GPU: 2
[1,1]<stderr>:DEEPMD INFO num_intra_threads: 0
[1,1]<stderr>:DEEPMD INFO num_inter_threads: 0
[1,1]<stderr>:DEEPMD INFO -----------------------------------------------------------------
[1,0]<stderr>:DEEPMD INFO _____ _____ __ __ _____ _ _ _
[1,0]<stderr>:DEEPMD INFO | __ \ | __ \ | \/ || __ \ | | (_)| |
[1,0]<stderr>:DEEPMD INFO | | | | ___ ___ | |__) || \ / || | | | ______ | | __ _ | |_
[1,0]<stderr>:DEEPMD INFO | | | | / _ \ / _ \| ___/ | |\/| || | | ||______|| |/ /| || __|
[1,0]<stderr>:DEEPMD INFO | |__| || __/| __/| | | | | || |__| | | < | || |_
[1,0]<stderr>:DEEPMD INFO |_____/ \___| \___||_| |_| |_||_____/ |_|\_\|_| \__|
[1,0]<stderr>:DEEPMD INFO Please read and cite:
[1,0]<stderr>:DEEPMD INFO Wang, Zhang, Han and E, Comput.Phys.Comm. 228, 178-184 (2018)
[1,0]<stderr>:DEEPMD INFO installed to: /tmp/pip-req-build-m0vk2rxi/_skbuild/linux-x86_64-3.9/cmake-install
[1,0]<stderr>:DEEPMD INFO source : v2.0.0
[1,0]<stderr>:DEEPMD INFO source brach: HEAD
[1,0]<stderr>:DEEPMD INFO source commit: 1a25414
[1,0]<stderr>:DEEPMD INFO source commit at: 2021-08-28 08:15:38 +0800
[1,0]<stderr>:DEEPMD INFO build float prec: double
[1,0]<stderr>:DEEPMD INFO build with tf inc: /expanse/lustre/scratch/chazeon/temp_project/deepmd-kit-2.0.0-cuda10.1_gpu/lib/python3.9/site-packages/tensorflow/include
[1,0]<stderr>:DEEPMD INFO build with tf lib:
[1,0]<stderr>:DEEPMD INFO ---Summary of the training---------------------------------------
[1,0]<stderr>:DEEPMD INFO running on: exp-7-59
[1,0]<stderr>:DEEPMD INFO computing device: gpu:0
[1,0]<stderr>:DEEPMD INFO CUDA_VISIBLE_DEVICES: 0,1
[1,0]<stderr>:DEEPMD INFO Count of visible GPU: 2
[1,0]<stderr>:DEEPMD INFO num_intra_threads: 0
[1,0]<stderr>:DEEPMD INFO num_inter_threads: 0
[1,0]<stderr>:DEEPMD INFO -----------------------------------------------------------------
[1,0]<stderr>:2021-09-02 12:44:52.760624: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1733] Found device 0 with properties:
[1,0]<stderr>:pciBusID: 0000:18:00.0 name: Tesla V100-SXM2-32GB computeCapability: 7.0
[1,0]<stderr>:coreClock: 1.53GHz coreCount: 80 deviceMemorySize: 31.75GiB deviceMemoryBandwidth: 836.37GiB/s
[1,1]<stderr>:2021-09-02 12:44:52.762073: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1733] Found device 0 with properties:
[1,1]<stderr>:pciBusID: 0000:18:00.0 name: Tesla V100-SXM2-32GB computeCapability: 7.0
[1,1]<stderr>:coreClock: 1.53GHz coreCount: 80 deviceMemorySize: 31.75GiB deviceMemoryBandwidth: 836.37GiB/s
[1,0]<stderr>:2021-09-02 12:44:52.763304: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1733] Found device 1 with properties:
[1,0]<stderr>:pciBusID: 0000:3b:00.0 name: Tesla V100-SXM2-32GB computeCapability: 7.0
[1,0]<stderr>:coreClock: 1.53GHz coreCount: 80 deviceMemorySize: 31.75GiB deviceMemoryBandwidth: 836.37GiB/s
[1,1]<stderr>:2021-09-02 12:44:52.764668: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1733] Found device 1 with properties:
[1,1]<stderr>:pciBusID: 0000:3b:00.0 name: Tesla V100-SXM2-32GB computeCapability: 7.0
[1,1]<stderr>:coreClock: 1.53GHz coreCount: 80 deviceMemorySize: 31.75GiB deviceMemoryBandwidth: 836.37GiB/s
[1,0]<stderr>:2021-09-02 12:44:52.773133: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1871] Adding visible gpu devices: 0, 1
[1,0]<stderr>:2021-09-02 12:44:52.773207: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1258] Device interconnect StreamExecutor with strength 1 edge matrix:
[1,0]<stderr>:2021-09-02 12:44:52.773230: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1264] 0 1
[1,0]<stderr>:2021-09-02 12:44:52.773258: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1277] 0: N Y
[1,0]<stderr>:2021-09-02 12:44:52.773270: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1277] 1: Y N
[1,1]<stderr>:2021-09-02 12:44:52.774708: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1871] Adding visible gpu devices: 0, 1
[1,1]<stderr>:2021-09-02 12:44:52.774759: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1258] Device interconnect StreamExecutor with strength 1 edge matrix:
[1,1]<stderr>:2021-09-02 12:44:52.774774: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1264] 0 1
[1,1]<stderr>:2021-09-02 12:44:52.774795: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1277] 0: N Y
[1,1]<stderr>:2021-09-02 12:44:52.774822: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1277] 1: Y N
[1,1]<stderr>:2021-09-02 12:44:52.781309: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1418] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 30555 MB memory) -> physical GPU (device: 0, name: Tesla V100-SXM2-32GB, pci bus id: 0000:18:00.0, compute capability: 7.0)
[1,0]<stderr>:2021-09-02 12:44:52.782716: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1418] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 30555 MB memory) -> physical GPU (device: 0, name: Tesla V100-SXM2-32GB, pci bus id: 0000:18:00.0, compute capability: 7.0)
[1,0]<stderr>:2021-09-02 12:44:52.784111: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1418] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:1 with 30555 MB memory) -> physical GPU (device: 1, name: Tesla V100-SXM2-32GB, pci bus id: 0000:3b:00.0, compute capability: 7.0)
[1,1]<stderr>:2021-09-02 12:44:52.784288: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1418] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:1 with 30555 MB memory) -> physical GPU (device: 1, name: Tesla V100-SXM2-32GB, pci bus id: 0000:3b:00.0, compute capability: 7.0)
[1,1]<stderr>:DEEPMD INFO ---Summary of DataSystem: training -----------------------------------------------
[1,0]<stderr>:DEEPMD INFO ---Summary of DataSystem: training -----------------------------------------------
[1,1]<stderr>:DEEPMD INFO found 24 system(s):
[1,1]<stderr>:DEEPMD INFO system natoms bch_sz n_bch prob pbc
[1,1]<stderr>:DEEPMD INFO ../data/init/004/V112.0 128 1 40 0.042 T
[1,0]<stderr>:DEEPMD INFO found 24 system(s):
[1,1]<stderr>:DEEPMD INFO ../data/init/004/V114.0 128 1 40 0.042 T
[1,0]<stderr>:DEEPMD INFO system natoms bch_sz n_bch prob pbc
[1,1]<stderr>:DEEPMD INFO ../data/init/004/V110.0 128 1 40 0.042 T
[1,0]<stderr>:DEEPMD INFO ../data/init/004/V112.0 128 1 40 0.042 T
[1,1]<stderr>:DEEPMD INFO ../data/init/004/V116.0 128 1 40 0.042 T
[1,0]<stderr>:DEEPMD INFO ../data/init/004/V114.0 128 1 40 0.042 T
[1,1]<stderr>:DEEPMD INFO ../data/init/003/V112.0 128 1 40 0.042 T
[1,0]<stderr>:DEEPMD INFO ../data/init/004/V110.0 128 1 40 0.042 T
[1,1]<stderr>:DEEPMD INFO ../data/init/003/V110.0 128 1 40 0.042 T
[1,0]<stderr>:DEEPMD INFO ../data/init/004/V116.0 128 1 40 0.042 T
[1,1]<stderr>:DEEPMD INFO ../data/init/003/V114.0 128 1 40 0.042 T
[1,0]<stderr>:DEEPMD INFO ../data/init/003/V112.0 128 1 40 0.042 T
[1,1]<stderr>:DEEPMD INFO ../data/init/003/V116.0 128 1 40 0.042 T
[1,0]<stderr>:DEEPMD INFO ../data/init/003/V110.0 128 1 40 0.042 T
[1,1]<stderr>:DEEPMD INFO ../data/init/001/V116.0 128 1 40 0.042 T
[1,0]<stderr>:DEEPMD INFO ../data/init/003/V114.0 128 1 40 0.042 T
[1,1]<stderr>:DEEPMD INFO ../data/init/001/V110.0 128 1 40 0.042 T
[1,0]<stderr>:DEEPMD INFO ../data/init/003/V116.0 128 1 40 0.042 T
[1,1]<stderr>:DEEPMD INFO ../data/init/001/V114.0 128 1 40 0.042 T
[1,0]<stderr>:DEEPMD INFO ../data/init/001/V116.0 128 1 40 0.042 T
[1,1]<stderr>:DEEPMD INFO ../data/init/001/V112.0 128 1 40 0.042 T
[1,0]<stderr>:DEEPMD INFO ../data/init/001/V110.0 128 1 40 0.042 T
[1,1]<stderr>:DEEPMD INFO ../data/init/002/V110.0 128 1 40 0.042 T
[1,0]<stderr>:DEEPMD INFO ../data/init/001/V114.0 128 1 40 0.042 T
[1,1]<stderr>:DEEPMD INFO ../data/init/002/V116.0 128 1 40 0.042 T
[1,0]<stderr>:DEEPMD INFO ../data/init/001/V112.0 128 1 40 0.042 T
[1,1]<stderr>:DEEPMD INFO ../data/init/002/V112.0 128 1 40 0.042 T
[1,0]<stderr>:DEEPMD INFO ../data/init/002/V110.0 128 1 40 0.042 T
[1,1]<stderr>:DEEPMD INFO ../data/init/002/V114.0 128 1 40 0.042 T
[1,0]<stderr>:DEEPMD INFO ../data/init/002/V116.0 128 1 40 0.042 T
[1,1]<stderr>:DEEPMD INFO ../data/init/005/V44.0 128 1 40 0.042 T
[1,0]<stderr>:DEEPMD INFO ../data/init/002/V112.0 128 1 40 0.042 T
[1,1]<stderr>:DEEPMD INFO ../data/init/005/V46.0 128 1 40 0.042 T
[1,1]<stderr>:DEEPMD INFO ../data/init/005/V36.0 128 1 40 0.042 T
[1,0]<stderr>:DEEPMD INFO ../data/init/002/V114.0 128 1 40 0.042 T
[1,1]<stderr>:DEEPMD INFO ../data/init/005/V48.0 128 1 40 0.042 T
[1,0]<stderr>:DEEPMD INFO ../data/init/005/V44.0 128 1 40 0.042 T
[1,1]<stderr>:DEEPMD INFO ../data/init/005/V38.0 128 1 40 0.042 T
[1,0]<stderr>:DEEPMD INFO ../data/init/005/V46.0 128 1 40 0.042 T
[1,1]<stderr>:DEEPMD INFO ../data/init/005/V42.0 128 1 40 0.042 T
[1,0]<stderr>:DEEPMD INFO ../data/init/005/V36.0 128 1 40 0.042 T
[1,1]<stderr>:DEEPMD INFO ../data/init/005/V50.0 128 1 40 0.042 T
[1,0]<stderr>:DEEPMD INFO ../data/init/005/V48.0 128 1 40 0.042 T
[1,1]<stderr>:DEEPMD INFO ../data/init/005/V40.0 128 1 40 0.042 T
[1,0]<stderr>:DEEPMD INFO ../data/init/005/V38.0 128 1 40 0.042 T
[1,1]<stderr>:DEEPMD INFO --------------------------------------------------------------------------------------
[1,0]<stderr>:DEEPMD INFO ../data/init/005/V42.0 128 1 40 0.042 T
[1,1]<stderr>:DEEPMD INFO training without frame parameter
[1,0]<stderr>:DEEPMD INFO ../data/init/005/V50.0 128 1 40 0.042 T
[1,0]<stderr>:DEEPMD INFO ../data/init/005/V40.0 128 1 40 0.042 T
[1,0]<stderr>:DEEPMD INFO --------------------------------------------------------------------------------------
[1,0]<stderr>:DEEPMD INFO training without frame parameter
[1,0]<stderr>:2021-09-02 12:44:54.392611: I tensorflow/stream_executor/cuda/cuda_driver.cc:789] failed to allocate 29.84G (32039239680 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY: out of memory
[1,0]<stderr>:2021-09-02 12:44:54.393996: I tensorflow/stream_executor/cuda/cuda_driver.cc:789] failed to allocate 26.85G (28835315712 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY: out of memory
[1,0]<stderr>:2021-09-02 12:44:54.395343: I tensorflow/stream_executor/cuda/cuda_driver.cc:789] failed to allocate 24.17G (25951782912 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY: out of memory
[1,0]<stderr>:2021-09-02 12:44:54.396857: I tensorflow/stream_executor/cuda/cuda_driver.cc:789] failed to allocate 21.75G (23356604416 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY: out of memory
[1,0]<stderr>:2021-09-02 12:44:54.398347: I tensorflow/stream_executor/cuda/cuda_driver.cc:789] failed to allocate 19.58G (21020944384 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY: out of memory
[1,0]<stderr>:2021-09-02 12:44:54.399878: I tensorflow/stream_executor/cuda/cuda_driver.cc:789] failed to allocate 17.62G (18918848512 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY: out of memory
[1,0]<stderr>:2021-09-02 12:44:54.401414: I tensorflow/stream_executor/cuda/cuda_driver.cc:789] failed to allocate 15.86G (17026963456 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY: out of memory
[1,0]<stderr>:2021-09-02 12:44:54.402936: I tensorflow/stream_executor/cuda/cuda_driver.cc:789] failed to allocate 14.27G (15324266496 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY: out of memory
[1,0]<stderr>:2021-09-02 12:44:54.406219: I tensorflow/stream_executor/cuda/cuda_driver.cc:789] failed to allocate 12.84G (13791839232 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY: out of memory
[1,0]<stderr>:2021-09-02 12:44:54.409880: I tensorflow/stream_executor/cuda/cuda_driver.cc:789] failed to allocate 11.56G (12412654592 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY: out of memory
[1,0]<stderr>:2021-09-02 12:44:54.412509: I tensorflow/stream_executor/cuda/cuda_driver.cc:789] failed to allocate 10.40G (11171388416 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY: out of memory
[1,0]<stderr>:2021-09-02 12:44:54.416159: I tensorflow/stream_executor/cuda/cuda_driver.cc:789] failed to allocate 9.36G (10054249472 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY: out of memory
[1,0]<stderr>:2021-09-02 12:44:54.418764: I tensorflow/stream_executor/cuda/cuda_driver.cc:789] failed to allocate 8.43G (9048823808 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY: out of memory
[1,0]<stderr>:2021-09-02 12:44:54.422393: I tensorflow/stream_executor/cuda/cuda_driver.cc:789] failed to allocate 7.58G (8143941120 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY: out of memory
[1,0]<stderr>:2021-09-02 12:44:54.424980: I tensorflow/stream_executor/cuda/cuda_driver.cc:789] failed to allocate 6.83G (7329546752 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY: out of memory
[1,0]<stderr>:2021-09-02 12:44:54.428616: I tensorflow/stream_executor/cuda/cuda_driver.cc:789] failed to allocate 6.14G (6596592128 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY: out of memory
[1,0]<stderr>:2021-09-02 12:44:54.431204: I tensorflow/stream_executor/cuda/cuda_driver.cc:789] failed to allocate 5.53G (5936932864 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY: out of memory
[1,0]<stderr>:2021-09-02 12:44:54.433794: I tensorflow/stream_executor/cuda/cuda_driver.cc:789] failed to allocate 4.98G (5343239680 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY: out of memory
[1,0]<stderr>:2021-09-02 12:44:54.437447: I tensorflow/stream_executor/cuda/cuda_driver.cc:789] failed to allocate 4.48G (4808915456 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY: out of memory
[1,0]<stderr>:2021-09-02 12:44:54.440039: I tensorflow/stream_executor/cuda/cuda_driver.cc:789] failed to allocate 4.03G (4328023552 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY: out of memory
[1,0]<stderr>:2021-09-02 12:44:54.443686: I tensorflow/stream_executor/cuda/cuda_driver.cc:789] failed to allocate 3.63G (3895220992 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY: out of memory
[1,0]<stderr>:2021-09-02 12:44:54.446268: I tensorflow/stream_executor/cuda/cuda_driver.cc:789] failed to allocate 3.26G (3505698816 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY: out of memory
[1,0]<stderr>:2021-09-02 12:44:54.449884: I tensorflow/stream_executor/cuda/cuda_driver.cc:789] failed to allocate 2.94G (3155128832 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY: out of memory
[1,0]<stderr>:2021-09-02 12:44:54.452390: I tensorflow/stream_executor/cuda/cuda_driver.cc:789] failed to allocate 2.64G (2839616000 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY: out of memory
[1,0]<stderr>:2021-09-02 12:44:54.454955: I tensorflow/stream_executor/cuda/cuda_driver.cc:789] failed to allocate 2.38G (2555654400 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY: out of memory
[1,0]<stderr>:2021-09-02 12:44:54.458543: I tensorflow/stream_executor/cuda/cuda_driver.cc:789] failed to allocate 2.14G (2300088832 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY: out of memory
[1,0]<stderr>:2021-09-02 12:44:54.461126: I tensorflow/stream_executor/cuda/cuda_driver.cc:789] failed to allocate 1.93G (2070080000 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY: out of memory
[1,0]<stderr>:2021-09-02 12:44:54.464731: I tensorflow/stream_executor/cuda/cuda_driver.cc:789] failed to allocate 1.73G (1863072000 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY: out of memory
[1,0]<stderr>:2021-09-02 12:44:54.467280: I tensorflow/stream_executor/cuda/cuda_driver.cc:789] failed to allocate 1.56G (1676764928 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY: out of memory
[1,0]<stderr>:2021-09-02 12:44:54.470896: I tensorflow/stream_executor/cuda/cuda_driver.cc:789] failed to allocate 1.41G (1509088512 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY: out of memory
[1,0]<stderr>:2021-09-02 12:44:54.473391: I tensorflow/stream_executor/cuda/cuda_driver.cc:789] failed to allocate 1.26G (1358179584 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY: out of memory
[1,0]<stderr>:2021-09-02 12:44:54.475956: I tensorflow/stream_executor/cuda/cuda_driver.cc:789] failed to allocate 1.14G (1222361600 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY: out of memory
[1,0]<stderr>:2021-09-02 12:44:54.479581: I tensorflow/stream_executor/cuda/cuda_driver.cc:789] failed to allocate 1.02G (1100125440 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY: out of memory
[1,0]<stderr>:OMP: Info #254: KMP_AFFINITY: pid 12452 tid 12452 thread 0 bound to OS proc set 0
[1,0]<stderr>:2021-09-02 12:44:55.882946: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1733] Found device 0 with properties:
[1,0]<stderr>:pciBusID: 0000:18:00.0 name: Tesla V100-SXM2-32GB computeCapability: 7.0
[1,0]<stderr>:coreClock: 1.53GHz coreCount: 80 deviceMemorySize: 31.75GiB deviceMemoryBandwidth: 836.37GiB/s
[1,0]<stderr>:2021-09-02 12:44:55.884261: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1733] Found device 1 with properties:
[1,0]<stderr>:pciBusID: 0000:3b:00.0 name: Tesla V100-SXM2-32GB computeCapability: 7.0
[1,0]<stderr>:coreClock: 1.53GHz coreCount: 80 deviceMemorySize: 31.75GiB deviceMemoryBandwidth: 836.37GiB/s
[1,0]<stderr>:2021-09-02 12:44:55.887535: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1871] Adding visible gpu devices: 0, 1
[1,0]<stderr>:2021-09-02 12:44:55.887600: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1258] Device interconnect StreamExecutor with strength 1 edge matrix:
[1,0]<stderr>:2021-09-02 12:44:55.887616: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1264] 0 1
[1,0]<stderr>:2021-09-02 12:44:55.887635: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1277] 0: N Y
[1,0]<stderr>:2021-09-02 12:44:55.887648: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1277] 1: Y N
[1,0]<stderr>:2021-09-02 12:44:55.889699: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1418] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 30555 MB memory) -> physical GPU (device: 0, name: Tesla V100-SXM2-32GB, pci bus id: 0000:18:00.0, compute capability: 7.0)
[1,0]<stderr>:2021-09-02 12:44:55.891118: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1418] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:1 with 30555 MB memory) -> physical GPU (device: 1, name: Tesla V100-SXM2-32GB, pci bus id: 0000:3b:00.0, compute capability: 7.0)
[1,1]<stderr>:OMP: Info #254: KMP_AFFINITY: pid 12453 tid 12453 thread 0 bound to OS proc set 0
[1,1]<stderr>:2021-09-02 12:44:56.427814: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1733] Found device 0 with properties:
[1,1]<stderr>:pciBusID: 0000:18:00.0 name: Tesla V100-SXM2-32GB computeCapability: 7.0
[1,1]<stderr>:coreClock: 1.53GHz coreCount: 80 deviceMemorySize: 31.75GiB deviceMemoryBandwidth: 836.37GiB/s
[1,1]<stderr>:2021-09-02 12:44:56.433542: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1733] Found device 1 with properties:
[1,1]<stderr>:pciBusID: 0000:3b:00.0 name: Tesla V100-SXM2-32GB computeCapability: 7.0
[1,1]<stderr>:coreClock: 1.53GHz coreCount: 80 deviceMemorySize: 31.75GiB deviceMemoryBandwidth: 836.37GiB/s
[1,1]<stderr>:2021-09-02 12:44:56.438922: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1871] Adding visible gpu devices: 0, 1
[1,1]<stderr>:2021-09-02 12:44:56.438972: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1258] Device interconnect StreamExecutor with strength 1 edge matrix:
[1,1]<stderr>:2021-09-02 12:44:56.439854: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1264] 0 1
[1,1]<stderr>:2021-09-02 12:44:56.439871: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1277] 0: N Y
[1,1]<stderr>:2021-09-02 12:44:56.439885: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1277] 1: Y N
[1,1]<stderr>:2021-09-02 12:44:56.447266: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1418] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 30555 MB memory) -> physical GPU (device: 0, name: Tesla V100-SXM2-32GB, pci bus id: 0000:18:00.0, compute capability: 7.0)
[1,1]<stderr>:2021-09-02 12:44:56.449673: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1418] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:1 with 30555 MB memory) -> physical GPU (device: 1, name: Tesla V100-SXM2-32GB, pci bus id: 0000:3b:00.0, compute capability: 7.0)
[1,0]<stderr>:DEEPMD INFO training data with min nbor dist: 1.0231442946289506
[1,0]<stderr>:DEEPMD INFO training data with max nbor size: [64, 134, 64]
[1,0]<stderr>:DEEPMD INFO built lr
[1,1]<stderr>:DEEPMD INFO training data with min nbor dist: 1.0231442946289506
[1,1]<stderr>:DEEPMD INFO training data with max nbor size: [64, 134, 64]
[1,1]<stderr>:DEEPMD INFO built lr
[1,0]<stderr>:DEEPMD INFO built network
[1,1]<stderr>:DEEPMD INFO built network
[1,0]<stderr>:DEEPMD INFO built training
[1,0]<stderr>:2021-09-02 12:45:12.125558: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1733] Found device 0 with properties:
[1,0]<stderr>:pciBusID: 0000:18:00.0 name: Tesla V100-SXM2-32GB computeCapability: 7.0
[1,0]<stderr>:coreClock: 1.53GHz coreCount: 80 deviceMemorySize: 31.75GiB deviceMemoryBandwidth: 836.37GiB/s
[1,0]<stderr>:2021-09-02 12:45:12.127459: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1871] Adding visible gpu devices: 0
[1,0]<stderr>:2021-09-02 12:45:12.127529: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1258] Device interconnect StreamExecutor with strength 1 edge matrix:
[1,0]<stderr>:2021-09-02 12:45:12.127546: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1264] 0
[1,0]<stderr>:2021-09-02 12:45:12.127560: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1277] 0: N
[1,0]<stderr>:2021-09-02 12:45:12.128376: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1418] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 30555 MB memory) -> physical GPU (device: 0, name: Tesla V100-SXM2-32GB, pci bus id: 0000:18:00.0, compute capability: 7.0)
[1,1]<stderr>:DEEPMD INFO built training
[1,1]<stderr>:2021-09-02 12:45:12.196720: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1733] Found device 0 with properties:
[1,1]<stderr>:pciBusID: 0000:18:00.0 name: Tesla V100-SXM2-32GB computeCapability: 7.0
[1,1]<stderr>:coreClock: 1.53GHz coreCount: 80 deviceMemorySize: 31.75GiB deviceMemoryBandwidth: 836.37GiB/s
[1,1]<stderr>:2021-09-02 12:45:12.198628: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1871] Adding visible gpu devices: 0
[1,1]<stderr>:2021-09-02 12:45:12.198704: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1258] Device interconnect StreamExecutor with strength 1 edge matrix:
[1,1]<stderr>:2021-09-02 12:45:12.198729: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1264] 0
[1,1]<stderr>:2021-09-02 12:45:12.198742: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1277] 0: N
[1,1]<stderr>:2021-09-02 12:45:12.200548: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1418] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 30555 MB memory) -> physical GPU (device: 0, name: Tesla V100-SXM2-32GB, pci bus id: 0000:18:00.0, compute capability: 7.0)
[1,0]<stderr>:DEEPMD INFO initialize model from scratch
[1,1]<stderr>:DEEPMD INFO initialize model from scratch
[1,0]<stderr>:DEEPMD INFO start training at lr 5.00e-04 (== 5.00e-04), decay_step 5000, decay_rate 0.763002, final lr will be 1.00e-08
[1,1]<stderr>:DEEPMD INFO start training at lr 5.00e-04 (== 5.00e-04), decay_step 5000, decay_rate 0.763002, final lr will be 1.00e-08
[1,1]<stderr>:2021-09-02 12:45:15.559436: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcublas.so.10
[1,0]<stderr>:2021-09-02 12:45:15.559362: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcublas.so.10
[1,0]<stderr>:2021-09-02 12:45:16.024942: E tensorflow/stream_executor/cuda/cuda_blas.cc:226] failed to create cublas handle: CUBLAS_STATUS_NOT_INITIALIZED
[1,0]<stderr>:2021-09-02 12:45:16.025026: W tensorflow/stream_executor/stream.cc:1455] attempting to perform BLAS operation using StreamExecutor without BLAS support
[1,1]<stderr>:2021-09-02 12:45:16.033066: E tensorflow/stream_executor/cuda/cuda_blas.cc:226] failed to create cublas handle: CUBLAS_STATUS_NOT_INITIALIZED
[1,1]<stderr>:2021-09-02 12:45:16.033134: W tensorflow/stream_executor/stream.cc:1455] attempting to perform BLAS operation using StreamExecutor without BLAS support
[1,0]<stderr>:Traceback (most recent call last):
[1,1]<stderr>:Traceback (most recent call last):
[1,0]<stderr>: File "/expanse/lustre/scratch/chazeon/temp_project/deepmd-kit-2.0.0-cuda10.1_gpu/lib/python3.9/site-packages/tensorflow/python/client/session.py", line 1375, in _do_call
[1,1]<stderr>: File "/expanse/lustre/scratch/chazeon/temp_project/deepmd-kit-2.0.0-cuda10.1_gpu/lib/python3.9/site-packages/tensorflow/python/client/session.py", line 1375, in _do_call
[1,0]<stderr>: return fn(*args)
[1,0]<stderr>: File "/expanse/lustre/scratch/chazeon/temp_project/deepmd-kit-2.0.0-cuda10.1_gpu/lib/python3.9/site-packages/tensorflow/python/client/session.py", line 1359, in _run_fn
[1,0]<stderr>: return self._call_tf_sessionrun(options, feed_dict, fetch_list,
[1,0]<stderr>: File "/expanse/lustre/scratch/chazeon/temp_project/deepmd-kit-2.0.0-cuda10.1_gpu/lib/python3.9/site-packages/tensorflow/python/client/session.py", line 1451, in _call_tf_sessionrun
[1,0]<stderr>: return tf_session.TF_SessionRun_wrapper(self._session, options, feed_dict,
[1,0]<stderr>:tensorflow.python.framework.errors_impl.InternalError: 2 root error(s) found.
[1,0]<stderr>: (0) Internal: Blas xGEMM launch failed : a.shape=[1,8192,1], b.shape=[1,1,25], m=8192, n=25, k=1
[1,0]<stderr>: [[{{node filter_type_1/MatMul_8}}]]
[1,0]<stderr>: [[l2_force_test/_39]]
[1,0]<stderr>: (1) Internal: Blas xGEMM launch failed : a.shape=[1,8192,1], b.shape=[1,1,25], m=8192, n=25, k=1
[1,0]<stderr>: [[{{node filter_type_1/MatMul_8}}]]
[1,0]<stderr>:0 successful operations.
[1,0]<stderr>:0 derived errors ignored.[1,0]<stderr>:
[1,0]<stderr>:
[1,0]<stderr>:During handling of the above exception, another exception occurred:
[1,0]<stderr>:
[1,0]<stderr>:Traceback (most recent call last):
[1,0]<stderr>: File "/expanse/lustre/scratch/chazeon/temp_project/deepmd-kit-2.0.0-cuda10.1_gpu/bin/dp", line 10, in <module>
[1,0]<stderr>: sys.exit(main())
[1,0]<stderr>: File "/expanse/lustre/scratch/chazeon/temp_project/deepmd-kit-2.0.0-cuda10.1_gpu/lib/python3.9/site-packages/deepmd/entrypoints/main.py", line 437, in main
[1,0]<stderr>: train_dp(**dict_args)
[1,0]<stderr>: File "/expanse/lustre/scratch/chazeon/temp_project/deepmd-kit-2.0.0-cuda10.1_gpu/lib/python3.9/site-packages/deepmd/entrypoints/train.py", line 102, in train
[1,0]<stderr>: _do_work(jdata, run_opt, is_compress)
[1,0]<stderr>: File "/expanse/lustre/scratch/chazeon/temp_project/deepmd-kit-2.0.0-cuda10.1_gpu/lib/python3.9/site-packages/deepmd/entrypoints/train.py", line 163, in _do_work
[1,0]<stderr>: model.train(train_data, valid_data)
[1,0]<stderr>: File "/expanse/lustre/scratch/chazeon/temp_project/deepmd-kit-2.0.0-cuda10.1_gpu/lib/python3.9/site-packages/deepmd/train/trainer.py", line 506, in train
[1,0]<stderr>: self.valid_on_the_fly(fp, [train_batch], valid_batches, print_header=True)
[1,0]<stderr>: File "/expanse/lustre/scratch/chazeon/temp_project/deepmd-kit-2.0.0-cuda10.1_gpu/lib/python3.9/site-packages/deepmd/train/trainer.py", line 600, in valid_on_the_fly
[1,0]<stderr>: train_results = self.get_evaluation_results(train_batches)
[1,0]<stderr>: File "/expanse/lustre/scratch/chazeon/temp_project/deepmd-kit-2.0.0-cuda10.1_gpu/lib/python3.9/site-packages/deepmd/train/trainer.py", line 652, in get_evaluation_results
[1,0]<stderr>: results = self.loss.eval(self.sess, feed_dict, natoms)
[1,0]<stderr>: File "/expanse/lustre/scratch/chazeon/temp_project/deepmd-kit-2.0.0-cuda10.1_gpu/lib/python3.9/site-packages/deepmd/loss/ener.py", line 140, in eval
[1,1]<stderr>: return fn(*args)
[1,1]<stderr>: File "/expanse/lustre/scratch/chazeon/temp_project/deepmd-kit-2.0.0-cuda10.1_gpu/lib/python3.9/site-packages/tensorflow/python/client/session.py", line 1359, in _run_fn
[1,1]<stderr>: return self._call_tf_sessionrun(options, feed_dict, fetch_list,
[1,1]<stderr>: File "/expanse/lustre/scratch/chazeon/temp_project/deepmd-kit-2.0.0-cuda10.1_gpu/lib/python3.9/site-packages/tensorflow/python/client/session.py", line 1451, in _call_tf_sessionrun
[1,1]<stderr>: return tf_session.TF_SessionRun_wrapper(self._session, options, feed_dict,
[1,1]<stderr>:tensorflow.python.framework.errors_impl.InternalError: 2 root error(s) found.
[1,1]<stderr>: (0) Internal: Blas xGEMM launch failed : a.shape=[1,8192,1], b.shape=[1,1,25], m=8192, n=25, k=1
[1,1]<stderr>: [[{{node filter_type_1/MatMul_8}}]]
[1,1]<stderr>: [[l2_force_test/_39]]
[1,1]<stderr>: (1) Internal: Blas xGEMM launch failed : a.shape=[1,8192,1], b.shape=[1,1,25], m=8192, n=25, k=1
[1,1]<stderr>: [[{{node filter_type_1/MatMul_8}}]]
[1,1]<stderr>:0 successful operations.
[1,1]<stderr>:0 derived errors ignored.[1,1]<stderr>:
[1,1]<stderr>:
[1,1]<stderr>:During handling of the above exception, another exception occurred:
[1,1]<stderr>:
[1,1]<stderr>:Traceback (most recent call last):
[1,1]<stderr>: File "/expanse/lustre/scratch/chazeon/temp_project/deepmd-kit-2.0.0-cuda10.1_gpu/bin/dp", line 10, in <module>
[1,0]<stderr>: error, error_e, error_f, error_v, error_ae, error_pf = run_sess(sess, run_data, feed_dict=feed_dict)
[1,0]<stderr>: File "/expanse/lustre/scratch/chazeon/temp_project/deepmd-kit-2.0.0-cuda10.1_gpu/lib/python3.9/site-packages/deepmd/utils/sess.py", line 20, in run_sess
[1,1]<stderr>: sys.exit(main())
[1,1]<stderr>: File "/expanse/lustre/scratch/chazeon/temp_project/deepmd-kit-2.0.0-cuda10.1_gpu/lib/python3.9/site-packages/deepmd/entrypoints/main.py", line 437, in main
[1,1]<stderr>: train_dp(**dict_args)
[1,1]<stderr>: File "/expanse/lustre/scratch/chazeon/temp_project/deepmd-kit-2.0.0-cuda10.1_gpu/lib/python3.9/site-packages/deepmd/entrypoints/train.py", line 102, in train
[1,1]<stderr>: _do_work(jdata, run_opt, is_compress)
[1,1]<stderr>: File "/expanse/lustre/scratch/chazeon/temp_project/deepmd-kit-2.0.0-cuda10.1_gpu/lib/python3.9/site-packages/deepmd/entrypoints/train.py", line 163, in _do_work
[1,1]<stderr>: model.train(train_data, valid_data)
[1,1]<stderr>: File "/expanse/lustre/scratch/chazeon/temp_project/deepmd-kit-2.0.0-cuda10.1_gpu/lib/python3.9/site-packages/deepmd/train/trainer.py", line 506, in train
[1,1]<stderr>: self.valid_on_the_fly(fp, [train_batch], valid_batches, print_header=True)
[1,1]<stderr>: File "/expanse/lustre/scratch/chazeon/temp_project/deepmd-kit-2.0.0-cuda10.1_gpu/lib/python3.9/site-packages/deepmd/train/trainer.py", line 600, in valid_on_the_fly
[1,1]<stderr>: train_results = self.get_evaluation_results(train_batches)
[1,1]<stderr>: File "/expanse/lustre/scratch/chazeon/temp_project/deepmd-kit-2.0.0-cuda10.1_gpu/lib/python3.9/site-packages/deepmd/train/trainer.py", line 652, in get_evaluation_results
[1,1]<stderr>: results = self.loss.eval(self.sess, feed_dict, natoms)
[1,1]<stderr>: File "/expanse/lustre/scratch/chazeon/temp_project/deepmd-kit-2.0.0-cuda10.1_gpu/lib/python3.9/site-packages/deepmd/loss/ener.py", line 140, in eval
[1,1]<stderr>: error, error_e, error_f, error_v, error_ae, error_pf = run_sess(sess, run_data, feed_dict=feed_dict)
[1,1]<stderr>: File "/expanse/lustre/scratch/chazeon/temp_project/deepmd-kit-2.0.0-cuda10.1_gpu/lib/python3.9/site-packages/deepmd/utils/sess.py", line 20, in run_sess
[1,0]<stderr>: return sess.run(*args, **kwargs)
[1,0]<stderr>: File "/expanse/lustre/scratch/chazeon/temp_project/deepmd-kit-2.0.0-cuda10.1_gpu/lib/python3.9/site-packages/tensorflow/python/client/session.py", line 967, in run
[1,0]<stderr>: result = self._run(None, fetches, feed_dict, options_ptr,
[1,0]<stderr>: File "/expanse/lustre/scratch/chazeon/temp_project/deepmd-kit-2.0.0-cuda10.1_gpu/lib/python3.9/site-packages/tensorflow/python/client/session.py", line 1190, in _run
[1,0]<stderr>: results = self._do_run(handle, final_targets, final_fetches,
[1,0]<stderr>: File "/expanse/lustre/scratch/chazeon/temp_project/deepmd-kit-2.0.0-cuda10.1_gpu/lib/python3.9/site-packages/tensorflow/python/client/session.py", line 1368, in _do_run
[1,0]<stderr>: return self._do_call(_run_fn, feeds, fetches, targets, options,
[1,0]<stderr>: File "/expanse/lustre/scratch/chazeon/temp_project/deepmd-kit-2.0.0-cuda10.1_gpu/lib/python3.9/site-packages/tensorflow/python/client/session.py", line 1394, in _do_call
[1,0]<stderr>: raise type(e)(node_def, op, message)
[1,0]<stderr>:tensorflow.python.framework.errors_impl.InternalError: 2 root error(s) found.
[1,0]<stderr>: (0) Internal: Blas xGEMM launch failed : a.shape=[1,8192,1], b.shape=[1,1,25], m=8192, n=25, k=1
[1,0]<stderr>: [[node filter_type_1/MatMul_8 (defined at /lib/python3.9/site-packages/deepmd/utils/network.py:176) ]]
[1,0]<stderr>: [[l2_force_test/_39]]
[1,0]<stderr>: (1) Internal: Blas xGEMM launch failed : a.shape=[1,8192,1], b.shape=[1,1,25], m=8192, n=25, k=1
[1,0]<stderr>: [[node filter_type_1/MatMul_8 (defined at /lib/python3.9/site-packages/deepmd/utils/network.py:176) ]]
[1,0]<stderr>:0 successful operations.
[1,0]<stderr>:0 derived errors ignored.
[1,0]<stderr>:
[1,0]<stderr>:Errors may have originated from an input operation.
[1,0]<stderr>:Input Source operations connected to node filter_type_1/MatMul_8:
[1,0]<stderr>: filter_type_1/matrix_1_2/read (defined at /lib/python3.9/site-packages/deepmd/utils/network.py:155)
[1,0]<stderr>: filter_type_1/Reshape_15 (defined at /lib/python3.9/site-packages/deepmd/descriptor/se_a.py:715)
[1,0]<stderr>:
[1,0]<stderr>:Input Source operations connected to node filter_type_1/MatMul_8:
[1,0]<stderr>: filter_type_1/matrix_1_2/read (defined at /lib/python3.9/site-packages/deepmd/utils/network.py:155)
[1,0]<stderr>: filter_type_1/Reshape_15 (defined at /lib/python3.9/site-packages/deepmd/descriptor/se_a.py:715)
[1,0]<stderr>:
[1,0]<stderr>:Original stack trace for 'filter_type_1/MatMul_8':
[1,0]<stderr>: File "/bin/dp", line 10, in <module>
[1,0]<stderr>: sys.exit(main())
[1,0]<stderr>: File "/lib/python3.9/site-packages/deepmd/entrypoints/main.py", line 437, in main
[1,0]<stderr>: train_dp(**dict_args)
[1,0]<stderr>: File "/lib/python3.9/site-packages/deepmd/entrypoints/train.py", line 102, in train
[1,0]<stderr>: _do_work(jdata, run_opt, is_compress)
[1,0]<stderr>: File "/lib/python3.9/site-packages/deepmd/entrypoints/train.py", line 158, in _do_work
[1,0]<stderr>: model.build(train_data, stop_batch)
[1,0]<stderr>: File "/lib/python3.9/site-packages/deepmd/train/trainer.py", line 338, in build
[1,0]<stderr>: self._build_network(data)
[1,0]<stderr>: File "/lib/python3.9/site-packages/deepmd/train/trainer.py", line 362, in _build_network
[1,0]<stderr>: = self.model.build (self.place_holders['coord'],
[1,0]<stderr>: File "/lib/python3.9/site-packages/deepmd/model/ener.py", line 159, in build
[1,0]<stderr>: = self.descrpt.build(coord_,
[1,0]<stderr>: File "/lib/python3.9/site-packages/deepmd/descriptor/se_a.py", line 433, in build
[1,0]<stderr>: self.dout, self.qmat = self._pass_filter(self.descrpt_reshape,
[1,0]<stderr>: File "/lib/python3.9/site-packages/deepmd/descriptor/se_a.py", line 590, in _pass_filter
[1,0]<stderr>: layer, qmat = self._filter(tf.cast(inputs_i, self.filter_precision), type_i, name='filter_type_'+str(type_i)+suffix, natoms=natoms, reuse=reuse, trainable = trainable, activation_fn = self.filter_activation_fn)
[1,0]<stderr>: File "/lib/python3.9/site-packages/deepmd/descriptor/se_a.py", line 793, in _filter
[1,0]<stderr>: ret = self._filter_lower(
[1,0]<stderr>: File "/lib/python3.9/site-packages/deepmd/descriptor/se_a.py", line 732, in _filter_lower
[1,0]<stderr>: xyz_scatter = embedding_net(
[1,0]<stderr>: File "/lib/python3.9/site-packages/deepmd/utils/network.py", line 176, in embedding_net
[1,0]<stderr>: hidden = tf.reshape(activation_fn(tf.matmul(xx, w) + b), [-1, outputs_size[ii]])
[1,0]<stderr>: File "/lib/python3.9/site-packages/tensorflow/python/util/dispatch.py", line 206, in wrapper
[1,0]<stderr>: return target(*args, **kwargs)
[1,0]<stderr>: File "/lib/python3.9/site-packages/tensorflow/python/ops/math_ops.py", line 3489, in matmul
[1,0]<stderr>: return gen_math_ops.mat_mul(
[1,0]<stderr>: File "/lib/python3.9/site-packages/tensorflow/python/ops/gen_math_ops.py", line 5716, in mat_mul
[1,0]<stderr>: _, _, _op, _outputs = _op_def_library._apply_op_helper(
[1,0]<stderr>: File "/lib/python3.9/site-packages/tensorflow/python/framework/op_def_library.py", line 748, in _apply_op_helper
[1,0]<stderr>: op = g._create_op_internal(op_type_name, inputs, dtypes=None,
[1,0]<stderr>: File "/lib/python3.9/site-packages/tensorflow/python/framework/ops.py", line 3557, in _create_op_internal
[1,0]<stderr>: ret = Operation(
[1,0]<stderr>: File "/lib/python3.9/site-packages/tensorflow/python/framework/ops.py", line 2045, in __init__
[1,0]<stderr>: self._traceback = tf_stack.extract_stack_for_node(self._c_op)
[1,0]<stderr>:
[1,1]<stderr>: return sess.run(*args, **kwargs)
[1,1]<stderr>: File "/expanse/lustre/scratch/chazeon/temp_project/deepmd-kit-2.0.0-cuda10.1_gpu/lib/python3.9/site-packages/tensorflow/python/client/session.py", line 967, in run
[1,1]<stderr>: result = self._run(None, fetches, feed_dict, options_ptr,
[1,1]<stderr>: File "/expanse/lustre/scratch/chazeon/temp_project/deepmd-kit-2.0.0-cuda10.1_gpu/lib/python3.9/site-packages/tensorflow/python/client/session.py", line 1190, in _run
[1,1]<stderr>: results = self._do_run(handle, final_targets, final_fetches,
[1,1]<stderr>: File "/expanse/lustre/scratch/chazeon/temp_project/deepmd-kit-2.0.0-cuda10.1_gpu/lib/python3.9/site-packages/tensorflow/python/client/session.py", line 1368, in _do_run
[1,1]<stderr>: return self._do_call(_run_fn, feeds, fetches, targets, options,
[1,1]<stderr>: File "/expanse/lustre/scratch/chazeon/temp_project/deepmd-kit-2.0.0-cuda10.1_gpu/lib/python3.9/site-packages/tensorflow/python/client/session.py", line 1394, in _do_call
[1,1]<stderr>: raise type(e)(node_def, op, message)
[1,1]<stderr>:tensorflow.python.framework.errors_impl.InternalError: 2 root error(s) found.
[1,1]<stderr>: (0) Internal: Blas xGEMM launch failed : a.shape=[1,8192,1], b.shape=[1,1,25], m=8192, n=25, k=1
[1,1]<stderr>: [[node filter_type_1/MatMul_8 (defined at /lib/python3.9/site-packages/deepmd/utils/network.py:176) ]]
[1,1]<stderr>: [[l2_force_test/_39]]
[1,1]<stderr>: (1) Internal: Blas xGEMM launch failed : a.shape=[1,8192,1], b.shape=[1,1,25], m=8192, n=25, k=1
[1,1]<stderr>: [[node filter_type_1/MatMul_8 (defined at /lib/python3.9/site-packages/deepmd/utils/network.py:176) ]]
[1,1]<stderr>:0 successful operations.
[1,1]<stderr>:0 derived errors ignored.
[1,1]<stderr>:
[1,1]<stderr>:Errors may have originated from an input operation.
[1,1]<stderr>:Input Source operations connected to node filter_type_1/MatMul_8:
[1,1]<stderr>: filter_type_1/matrix_1_2/read (defined at /lib/python3.9/site-packages/deepmd/utils/network.py:155)
[1,1]<stderr>: filter_type_1/Reshape_15 (defined at /lib/python3.9/site-packages/deepmd/descriptor/se_a.py:715)
[1,1]<stderr>:
[1,1]<stderr>:Input Source operations connected to node filter_type_1/MatMul_8:
[1,1]<stderr>: filter_type_1/matrix_1_2/read (defined at /lib/python3.9/site-packages/deepmd/utils/network.py:155)
[1,1]<stderr>: filter_type_1/Reshape_15 (defined at /lib/python3.9/site-packages/deepmd/descriptor/se_a.py:715)
[1,1]<stderr>:
[1,1]<stderr>:Original stack trace for 'filter_type_1/MatMul_8':
[1,1]<stderr>: File "/bin/dp", line 10, in <module>
[1,1]<stderr>: sys.exit(main())
[1,1]<stderr>: File "/lib/python3.9/site-packages/deepmd/entrypoints/main.py", line 437, in main
[1,1]<stderr>: train_dp(**dict_args)
[1,1]<stderr>: File "/lib/python3.9/site-packages/deepmd/entrypoints/train.py", line 102, in train
[1,1]<stderr>: _do_work(jdata, run_opt, is_compress)
[1,1]<stderr>: File "/lib/python3.9/site-packages/deepmd/entrypoints/train.py", line 158, in _do_work
[1,1]<stderr>: model.build(train_data, stop_batch)
[1,1]<stderr>: File "/lib/python3.9/site-packages/deepmd/train/trainer.py", line 338, in build
[1,1]<stderr>: self._build_network(data)
[1,1]<stderr>: File "/lib/python3.9/site-packages/deepmd/train/trainer.py", line 362, in _build_network
[1,1]<stderr>: = self.model.build (self.place_holders['coord'],
[1,1]<stderr>: File "/lib/python3.9/site-packages/deepmd/model/ener.py", line 159, in build
[1,1]<stderr>: = self.descrpt.build(coord_,
[1,1]<stderr>: File "/lib/python3.9/site-packages/deepmd/descriptor/se_a.py", line 433, in build
[1,1]<stderr>: self.dout, self.qmat = self._pass_filter(self.descrpt_reshape,
[1,1]<stderr>: File "/lib/python3.9/site-packages/deepmd/descriptor/se_a.py", line 590, in _pass_filter
[1,1]<stderr>: layer, qmat = self._filter(tf.cast(inputs_i, self.filter_precision), type_i, name='filter_type_'+str(type_i)+suffix, natoms=natoms, reuse=reuse, trainable = trainable, activation_fn = self.filter_activation_fn)
[1,1]<stderr>: File "/lib/python3.9/site-packages/deepmd/descriptor/se_a.py", line 793, in _filter
[1,1]<stderr>: ret = self._filter_lower(
[1,1]<stderr>: File "/lib/python3.9/site-packages/deepmd/descriptor/se_a.py", line 732, in _filter_lower
[1,1]<stderr>: xyz_scatter = embedding_net(
[1,1]<stderr>: File "/lib/python3.9/site-packages/deepmd/utils/network.py", line 176, in embedding_net
[1,1]<stderr>: hidden = tf.reshape(activation_fn(tf.matmul(xx, w) + b), [-1, outputs_size[ii]])
[1,1]<stderr>: File "/lib/python3.9/site-packages/tensorflow/python/util/dispatch.py", line 206, in wrapper
[1,1]<stderr>: return target(*args, **kwargs)
[1,1]<stderr>: File "/lib/python3.9/site-packages/tensorflow/python/ops/math_ops.py", line 3489, in matmul
[1,1]<stderr>: return gen_math_ops.mat_mul(
[1,1]<stderr>: File "/lib/python3.9/site-packages/tensorflow/python/ops/gen_math_ops.py", line 5716, in mat_mul
[1,1]<stderr>: _, _, _op, _outputs = _op_def_library._apply_op_helper(
[1,1]<stderr>: File "/lib/python3.9/site-packages/tensorflow/python/framework/op_def_library.py", line 748, in _apply_op_helper
[1,1]<stderr>: op = g._create_op_internal(op_type_name, inputs, dtypes=None,
[1,1]<stderr>: File "/lib/python3.9/site-packages/tensorflow/python/framework/ops.py", line 3557, in _create_op_internal
[1,1]<stderr>: ret = Operation(
[1,1]<stderr>: File "/lib/python3.9/site-packages/tensorflow/python/framework/ops.py", line 2045, in __init__
[1,1]<stderr>: self._traceback = tf_stack.extract_stack_for_node(self._c_op)
[1,1]<stderr>:
--------------------------------------------------------------------------
Primary job terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpirun detected that one or more processes exited with non-zero status, thus causing
the job to be terminated. The first process to do so was:
Process name: [[21147,1],0]
Exit code: 1
--------------------------------------------------------------------------
Here is the nvidia-smi output:
(base) [chazeon@exp-7-59 000.1]$ nvidia-smi
Thu Sep 2 12:54:49 2021
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 460.32.03 Driver Version: 460.32.03 CUDA Version: 11.2 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 Tesla V100-SXM2... On | 00000000:18:00.0 Off | 0 |
| N/A 32C P0 41W / 300W | 0MiB / 32510MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
| 1 Tesla V100-SXM2... On | 00000000:3B:00.0 Off | 0 |
| N/A 30C P0 41W / 300W | 0MiB / 32510MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
| No running processes found |
+-----------------------------------------------------------------------------+
This seems like the situation described on Horovod's troubleshooting page:
"If you notice that your program is running out of GPU memory and multiple processes are being placed on the same GPU, it's likely that your program (or its dependencies) creates a tf.Session that does not use the config that pins a specific GPU."
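For what it's worth, here is a minimal sketch of the per-process GPU pinning pattern that page describes, using horovod.tensorflow and the TF1-style session API (deepmd-kit 2.0.0 builds its graph through the v1 compatibility layer). This is only an illustration of the idea, not the deepmd-kit code path itself; the session that dp train actually creates would need to receive an equivalent config for the two ranks to land on different GPUs.

import horovod.tensorflow as hvd
import tensorflow.compat.v1 as tf

hvd.init()

config = tf.ConfigProto()
config.gpu_options.allow_growth = True
# Restrict this process to the GPU matching its local rank, so rank 0
# uses GPU 0 and rank 1 uses GPU 1, instead of both ranks allocating
# nearly all of GPU 0 (which is what the CUDA_ERROR_OUT_OF_MEMORY
# messages above suggest is happening).
config.gpu_options.visible_device_list = str(hvd.local_rank())

sess = tf.Session(config=config)

Another workaround at the launcher level, assuming an Open MPI backend that exports OMPI_COMM_WORLD_LOCAL_RANK, is to wrap dp in a small script that sets CUDA_VISIBLE_DEVICES to that local rank before dp starts, so each rank only ever sees one device.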