Hi everyone,
I am trying to use deepmd-kit 2.0.0 to practice parallel training. I am able to train with
CUDA_VISIBLE_DEVICES=0,1 horovodrun -np 1 dp train params.yml
but when I set -np to 2, I get the error below. It seems that both MPI processes are trying to use the same GPU?
(base) [chazeon@exp-7-59 000.1]$ CUDA_VISIBLE_DEVICES=0,1 horovodrun -np 2 dp train params.yml
2021-09-02 12:42:50.054023: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcudart.so.10.1
[1,0]<stderr>:2021-09-02 12:44:11.109143: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcudart.so.10.1
[1,1]<stderr>:2021-09-02 12:44:11.121574: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcudart.so.10.1
[1,0]<stderr>:WARNING:tensorflow:From /expanse/lustre/scratch/chazeon/temp_project/deepmd-kit-2.0.0-cuda10.1_gpu/lib/python3.9/site-packages/tensorflow/python/compat/v2_compat.py:96: disable_resource_variables (from tensorflow.python.ops.variable_scope) is deprecated and will be removed in a future version.
[1,0]<stderr>:Instructions for updating:
[1,0]<stderr>:non-resource variables are not supported in the long term
[1,1]<stderr>:WARNING:tensorflow:From /expanse/lustre/scratch/chazeon/temp_project/deepmd-kit-2.0.0-cuda10.1_gpu/lib/python3.9/site-packages/tensorflow/python/compat/v2_compat.py:96: disable_resource_variables (from tensorflow.python.ops.variable_scope) is deprecated and will be removed in a future version.
[1,1]<stderr>:Instructions for updating:
[1,1]<stderr>:non-resource variables are not supported in the long term
[1,0]<stderr>:WARNING:root:Environment variable KMP_BLOCKTIME is empty. Use the default value 0
[1,0]<stderr>:WARNING:root:Environment variable KMP_AFFINITY is empty. Use the default value granularity=fine,verbose,compact,1,0
[1,1]<stderr>:WARNING:root:Environment variable KMP_BLOCKTIME is empty. Use the default value 0
[1,1]<stderr>:WARNING:root:Environment variable KMP_AFFINITY is empty. Use the default value granularity=fine,verbose,compact,1,0
[1,1]<stderr>:/expanse/lustre/scratch/chazeon/temp_project/deepmd-kit-2.0.0-cuda10.1_gpu/lib/python3.9/site-packages/deepmd/utils/compat.py:316: UserWarning: It seems that you are using a deepmd-kit input of version 1.x.x, which is deprecated. we have converted the input to >2.0.0 compatible, and output it to file input_v2_compat.json
[1,1]<stderr>: warnings.warn(msg)
[1,0]<stderr>:/expanse/lustre/scratch/chazeon/temp_project/deepmd-kit-2.0.0-cuda10.1_gpu/lib/python3.9/site-packages/deepmd/utils/compat.py:316: UserWarning: It seems that you are using a deepmd-kit input of version 1.x.x, which is deprecated. we have converted the input to >2.0.0 compatible, and output it to file input_v2_compat.json
[1,0]<stderr>: warnings.warn(msg)
[1,1]<stderr>:2021-09-02 12:44:43.922975: I tensorflow/core/platform/cpu_feature_guard.cc:142] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations: SSE4.1 SSE4.2 AVX AVX2 AVX512F FMA
[1,1]<stderr>:To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
[1,0]<stderr>:2021-09-02 12:44:43.923073: I tensorflow/core/platform/cpu_feature_guard.cc:142] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations: SSE4.1 SSE4.2 AVX AVX2 AVX512F FMA
[1,0]<stderr>:To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
[1,0]<stderr>:2021-09-02 12:44:43.925676: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcuda.so.1
[1,1]<stderr>:2021-09-02 12:44:43.926085: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcuda.so.1
[1,0]<stderr>:2021-09-02 12:44:44.258726: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1733] Found device 0 with properties:
[1,0]<stderr>:pciBusID: 0000:18:00.0 name: Tesla V100-SXM2-32GB computeCapability: 7.0
[1,0]<stderr>:coreClock: 1.53GHz coreCount: 80 deviceMemorySize: 31.75GiB deviceMemoryBandwidth: 836.37GiB/s
[1,0]<stderr>:2021-09-02 12:44:44.261339: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1733] Found device 1 with properties:
[1,0]<stderr>:pciBusID: 0000:3b:00.0 name: Tesla V100-SXM2-32GB computeCapability: 7.0
[1,0]<stderr>:coreClock: 1.53GHz coreCount: 80 deviceMemorySize: 31.75GiB deviceMemoryBandwidth: 836.37GiB/s
[1,0]<stderr>:2021-09-02 12:44:44.261414: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcudart.so.10.1
[1,1]<stderr>:2021-09-02 12:44:44.270950: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1733] Found device 0 with properties:
[1,1]<stderr>:pciBusID: 0000:18:00.0 name: Tesla V100-SXM2-32GB computeCapability: 7.0
[1,1]<stderr>:coreClock: 1.53GHz coreCount: 80 deviceMemorySize: 31.75GiB deviceMemoryBandwidth: 836.37GiB/s
[1,1]<stderr>:2021-09-02 12:44:44.272435: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1733] Found device 1 with properties:
[1,1]<stderr>:pciBusID: 0000:3b:00.0 name: Tesla V100-SXM2-32GB computeCapability: 7.0
[1,1]<stderr>:coreClock: 1.53GHz coreCount: 80 deviceMemorySize: 31.75GiB deviceMemoryBandwidth: 836.37GiB/s
[1,1]<stderr>:2021-09-02 12:44:44.272498: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcudart.so.10.1
[1,0]<stderr>:2021-09-02 12:44:44.274217: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcublas.so.10
[1,0]<stderr>:2021-09-02 12:44:44.274274: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcublasLt.so.10
[1,0]<stderr>:2021-09-02 12:44:44.276485: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcufft.so.10
[1,1]<stderr>:2021-09-02 12:44:44.278378: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcublas.so.10
[1,1]<stderr>:2021-09-02 12:44:44.278455: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcublasLt.so.10
[1,1]<stderr>:2021-09-02 12:44:44.280158: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcufft.so.10
[1,0]<stderr>:2021-09-02 12:44:44.288279: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcurand.so.10
[1,1]<stderr>:2021-09-02 12:44:44.289146: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcurand.so.10
[1,0]<stderr>:2021-09-02 12:44:44.293728: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcusolver.so.10
[1,0]<stderr>:2021-09-02 12:44:44.296202: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcusparse.so.10
[1,1]<stderr>:2021-09-02 12:44:44.296510: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcusolver.so.10
[1,0]<stderr>:2021-09-02 12:44:44.299719: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcudnn.so.7
[1,0]<stderr>:2021-09-02 12:44:44.304933: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1871] Adding visible gpu devices: 0, 1
[1,0]<stderr>:2021-09-02 12:44:44.304980: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcudart.so.10.1
[1,1]<stderr>:2021-09-02 12:44:44.306878: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcusparse.so.10
[1,1]<stderr>:2021-09-02 12:44:44.313240: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcudnn.so.7
[1,1]<stderr>:2021-09-02 12:44:44.324327: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1871] Adding visible gpu devices: 0, 1
[1,1]<stderr>:2021-09-02 12:44:44.325268: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcudart.so.10.1
[1,0]<stderr>:2021-09-02 12:44:45.801409: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1258] Device interconnect StreamExecutor with strength 1 edge matrix:
[1,0]<stderr>:2021-09-02 12:44:45.801480: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1264] 0 1
[1,0]<stderr>:2021-09-02 12:44:45.801512: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1277] 0: N Y
[1,0]<stderr>:2021-09-02 12:44:45.801525: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1277] 1: Y N
[1,1]<stderr>:2021-09-02 12:44:45.814254: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1258] Device interconnect StreamExecutor with strength 1 edge matrix:
[1,1]<stderr>:2021-09-02 12:44:45.814305: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1264] 0 1
[1,1]<stderr>:2021-09-02 12:44:45.814320: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1277] 0: N Y
[1,1]<stderr>:2021-09-02 12:44:45.814332: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1277] 1: Y N
[1,0]<stderr>:2021-09-02 12:44:45.814715: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1418] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 30555 MB memory) -> physical GPU (device: 0, name: Tesla V100-SXM2-32GB, pci bus id: 0000:18:00.0, compute capability: 7.0)
[1,1]<stderr>:2021-09-02 12:44:45.821670: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1418] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 30555 MB memory) -> physical GPU (device: 0, name: Tesla V100-SXM2-32GB, pci bus id: 0000:18:00.0, compute capability: 7.0)
[1,0]<stderr>:2021-09-02 12:44:45.823658: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1418] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:1 with 30555 MB memory) -> physical GPU (device: 1, name: Tesla V100-SXM2-32GB, pci bus id: 0000:3b:00.0, compute capability: 7.0)
[1,0]<stderr>:2021-09-02 12:44:45.823963: I tensorflow/core/common_runtime/process_util.cc:146] Creating new thread pool with default inter op setting: 2. Tune using inter_op_parallelism_threads for best performance.
[1,1]<stderr>:2021-09-02 12:44:45.825872: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1418] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:1 with 30555 MB memory) -> physical GPU (device: 1, name: Tesla V100-SXM2-32GB, pci bus id: 0000:3b:00.0, compute capability: 7.0)
[1,1]<stderr>:2021-09-02 12:44:45.826946: I tensorflow/core/common_runtime/process_util.cc:146] Creating new thread pool with default inter op setting: 2. Tune using inter_op_parallelism_threads for best performance.
[1,1]<stderr>:2021-09-02 12:44:45.955666: I tensorflow/core/platform/profile_utils/cpu_utils.cc:114] CPU Frequency: 2500000000 Hz
[1,0]<stderr>:2021-09-02 12:44:45.955888: I tensorflow/core/platform/profile_utils/cpu_utils.cc:114] CPU Frequency: 2500000000 Hz
[1,1]<stderr>:OMP: Info #155: KMP_AFFINITY: Initial OS proc set respected: 0
[1,1]<stderr>:OMP: Info #216: KMP_AFFINITY: decoding x2APIC ids.
[1,1]<stderr>:OMP: Info #157: KMP_AFFINITY: 1 available OS procs
[1,1]<stderr>:OMP: Info #158: KMP_AFFINITY: Uniform topology
[1,1]<stderr>:OMP: Info #287: KMP_AFFINITY: topology layer "LL cache" is equivalent to "socket".
[1,1]<stderr>:OMP: Info #287: KMP_AFFINITY: topology layer "L3 cache" is equivalent to "socket".
[1,1]<stderr>:OMP: Info #287: KMP_AFFINITY: topology layer "L2 cache" is equivalent to "core".
[1,1]<stderr>:OMP: Info #287: KMP_AFFINITY: topology layer "L1 cache" is equivalent to "core".
[1,1]<stderr>:OMP: Info #192: KMP_AFFINITY: 1 socket x 1 core/socket x 1 thread/core (1 total cores)
[1,1]<stderr>:OMP: Info #218: KMP_AFFINITY: OS proc to physical thread map:
[1,1]<stderr>:OMP: Info #172: KMP_AFFINITY: OS proc 0 maps to socket 0 core 0 thread 0
[1,1]<stderr>:OMP: Info #254: KMP_AFFINITY: pid 12453 tid 12565 thread 1 bound to OS proc set 0
[1,0]<stderr>:OMP: Info #155: KMP_AFFINITY: Initial OS proc set respected: 0
[1,0]<stderr>:OMP: Info #216: KMP_AFFINITY: decoding x2APIC ids.
[1,0]<stderr>:OMP: Info #157: KMP_AFFINITY: 1 available OS procs
[1,0]<stderr>:OMP: Info #158: KMP_AFFINITY: Uniform topology
[1,0]<stderr>:OMP: Info #287: KMP_AFFINITY: topology layer "LL cache" is equivalent to "socket".
[1,0]<stderr>:OMP: Info #287: KMP_AFFINITY: topology layer "L3 cache" is equivalent to "socket".
[1,0]<stderr>:OMP: Info #287: KMP_AFFINITY: topology layer "L2 cache" is equivalent to "core".
[1,0]<stderr>:OMP: Info #287: KMP_AFFINITY: topology layer "L1 cache" is equivalent to "core".
[1,0]<stderr>:OMP: Info #192: KMP_AFFINITY: 1 socket x 1 core/socket x 1 thread/core (1 total cores)
[1,0]<stderr>:OMP: Info #218: KMP_AFFINITY: OS proc to physical thread map:
[1,0]<stderr>:OMP: Info #172: KMP_AFFINITY: OS proc 0 maps to socket 0 core 0 thread 0
[1,0]<stderr>:OMP: Info #254: KMP_AFFINITY: pid 12452 tid 12561 thread 1 bound to OS proc set 0
[1,1]<stderr>:OMP: Info #254: KMP_AFFINITY: pid 12453 tid 12566 thread 2 bound to OS proc set 0
[1,0]<stderr>:OMP: Info #254: KMP_AFFINITY: pid 12452 tid 12562 thread 2 bound to OS proc set 0
[1,0]<stderr>:DEEPMD INFO training data with min nbor dist: 1.0231442946289506
[1,0]<stderr>:DEEPMD INFO training data with max nbor size: [64, 134, 64]
[1,1]<stderr>:DEEPMD INFO training data with min nbor dist: 1.0231442946289506
[1,1]<stderr>:DEEPMD INFO training data with max nbor size: [64, 134, 64]
[1,1]<stderr>:DEEPMD INFO _____ _____ __ __ _____ _ _ _
[1,1]<stderr>:DEEPMD INFO | __ \ | __ \ | \/ || __ \ | | (_)| |
[1,1]<stderr>:DEEPMD INFO | | | | ___ ___ | |__) || \ / || | | | ______ | | __ _ | |_
[1,1]<stderr>:DEEPMD INFO | | | | / _ \ / _ \| ___/ | |\/| || | | ||______|| |/ /| || __|
[1,1]<stderr>:DEEPMD INFO | |__| || __/| __/| | | | | || |__| | | < | || |_
[1,1]<stderr>:DEEPMD INFO |_____/ \___| \___||_| |_| |_||_____/ |_|\_\|_| \__|
[1,1]<stderr>:DEEPMD INFO Please read and cite:
[1,1]<stderr>:DEEPMD INFO Wang, Zhang, Han and E, Comput.Phys.Comm. 228, 178-184 (2018)
[1,1]<stderr>:DEEPMD INFO installed to: /tmp/pip-req-build-m0vk2rxi/_skbuild/linux-x86_64-3.9/cmake-install
[1,1]<stderr>:DEEPMD INFO source : v2.0.0
[1,1]<stderr>:DEEPMD INFO source brach: HEAD
[1,1]<stderr>:DEEPMD INFO source commit: 1a25414
[1,1]<stderr>:DEEPMD INFO source commit at: 2021-08-28 08:15:38 +0800
[1,1]<stderr>:DEEPMD INFO build float prec: double
[1,1]<stderr>:DEEPMD INFO build with tf inc: /expanse/lustre/scratch/chazeon/temp_project/deepmd-kit-2.0.0-cuda10.1_gpu/lib/python3.9/site-packages/tensorflow/include
[1,1]<stderr>:DEEPMD INFO build with tf lib:
[1,1]<stderr>:DEEPMD INFO ---Summary of the training---------------------------------------
[1,1]<stderr>:DEEPMD INFO running on: exp-7-59
[1,1]<stderr>:DEEPMD INFO computing device: gpu:0
[1,1]<stderr>:DEEPMD INFO CUDA_VISIBLE_DEVICES: 0,1
[1,1]<stderr>:DEEPMD INFO Count of visible GPU: 2
[1,1]<stderr>:DEEPMD INFO num_intra_threads: 0
[1,1]<stderr>:DEEPMD INFO num_inter_threads: 0
[1,1]<stderr>:DEEPMD INFO -----------------------------------------------------------------
[1,0]<stderr>:DEEPMD INFO _____ _____ __ __ _____ _ _ _
[1,0]<stderr>:DEEPMD INFO | __ \ | __ \ | \/ || __ \ | | (_)| |
[1,0]<stderr>:DEEPMD INFO | | | | ___ ___ | |__) || \ / || | | | ______ | | __ _ | |_
[1,0]<stderr>:DEEPMD INFO | | | | / _ \ / _ \| ___/ | |\/| || | | ||______|| |/ /| || __|
[1,0]<stderr>:DEEPMD INFO | |__| || __/| __/| | | | | || |__| | | < | || |_
[1,0]<stderr>:DEEPMD INFO |_____/ \___| \___||_| |_| |_||_____/ |_|\_\|_| \__|
[1,0]<stderr>:DEEPMD INFO Please read and cite:
[1,0]<stderr>:DEEPMD INFO Wang, Zhang, Han and E, Comput.Phys.Comm. 228, 178-184 (2018)
[1,0]<stderr>:DEEPMD INFO installed to: /tmp/pip-req-build-m0vk2rxi/_skbuild/linux-x86_64-3.9/cmake-install
[1,0]<stderr>:DEEPMD INFO source : v2.0.0
[1,0]<stderr>:DEEPMD INFO source brach: HEAD
[1,0]<stderr>:DEEPMD INFO source commit: 1a25414
[1,0]<stderr>:DEEPMD INFO source commit at: 2021-08-28 08:15:38 +0800
[1,0]<stderr>:DEEPMD INFO build float prec: double
[1,0]<stderr>:DEEPMD INFO build with tf inc: /expanse/lustre/scratch/chazeon/temp_project/deepmd-kit-2.0.0-cuda10.1_gpu/lib/python3.9/site-packages/tensorflow/include
[1,0]<stderr>:DEEPMD INFO build with tf lib:
[1,0]<stderr>:DEEPMD INFO ---Summary of the training---------------------------------------
[1,0]<stderr>:DEEPMD INFO running on: exp-7-59
[1,0]<stderr>:DEEPMD INFO computing device: gpu:0
[1,0]<stderr>:DEEPMD INFO CUDA_VISIBLE_DEVICES: 0,1
[1,0]<stderr>:DEEPMD INFO Count of visible GPU: 2
[1,0]<stderr>:DEEPMD INFO num_intra_threads: 0
[1,0]<stderr>:DEEPMD INFO num_inter_threads: 0
[1,0]<stderr>:DEEPMD INFO -----------------------------------------------------------------
[1,0]<stderr>:2021-09-02 12:44:52.760624: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1733] Found device 0 with properties:
[1,0]<stderr>:pciBusID: 0000:18:00.0 name: Tesla V100-SXM2-32GB computeCapability: 7.0
[1,0]<stderr>:coreClock: 1.53GHz coreCount: 80 deviceMemorySize: 31.75GiB deviceMemoryBandwidth: 836.37GiB/s
[1,1]<stderr>:2021-09-02 12:44:52.762073: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1733] Found device 0 with properties:
[1,1]<stderr>:pciBusID: 0000:18:00.0 name: Tesla V100-SXM2-32GB computeCapability: 7.0
[1,1]<stderr>:coreClock: 1.53GHz coreCount: 80 deviceMemorySize: 31.75GiB deviceMemoryBandwidth: 836.37GiB/s
[1,0]<stderr>:2021-09-02 12:44:52.763304: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1733] Found device 1 with properties:
[1,0]<stderr>:pciBusID: 0000:3b:00.0 name: Tesla V100-SXM2-32GB computeCapability: 7.0
[1,0]<stderr>:coreClock: 1.53GHz coreCount: 80 deviceMemorySize: 31.75GiB deviceMemoryBandwidth: 836.37GiB/s
[1,1]<stderr>:2021-09-02 12:44:52.764668: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1733] Found device 1 with properties:
[1,1]<stderr>:pciBusID: 0000:3b:00.0 name: Tesla V100-SXM2-32GB computeCapability: 7.0
[1,1]<stderr>:coreClock: 1.53GHz coreCount: 80 deviceMemorySize: 31.75GiB deviceMemoryBandwidth: 836.37GiB/s
[1,0]<stderr>:2021-09-02 12:44:52.773133: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1871] Adding visible gpu devices: 0, 1
[1,0]<stderr>:2021-09-02 12:44:52.773207: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1258] Device interconnect StreamExecutor with strength 1 edge matrix:
[1,0]<stderr>:2021-09-02 12:44:52.773230: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1264] 0 1
[1,0]<stderr>:2021-09-02 12:44:52.773258: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1277] 0: N Y
[1,0]<stderr>:2021-09-02 12:44:52.773270: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1277] 1: Y N
[1,1]<stderr>:2021-09-02 12:44:52.774708: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1871] Adding visible gpu devices: 0, 1
[1,1]<stderr>:2021-09-02 12:44:52.774759: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1258] Device interconnect StreamExecutor with strength 1 edge matrix:
[1,1]<stderr>:2021-09-02 12:44:52.774774: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1264] 0 1
[1,1]<stderr>:2021-09-02 12:44:52.774795: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1277] 0: N Y
[1,1]<stderr>:2021-09-02 12:44:52.774822: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1277] 1: Y N
[1,1]<stderr>:2021-09-02 12:44:52.781309: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1418] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 30555 MB memory) -> physical GPU (device: 0, name: Tesla V100-SXM2-32GB, pci bus id: 0000:18:00.0, compute capability: 7.0)
[1,0]<stderr>:2021-09-02 12:44:52.782716: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1418] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 30555 MB memory) -> physical GPU (device: 0, name: Tesla V100-SXM2-32GB, pci bus id: 0000:18:00.0, compute capability: 7.0)
[1,0]<stderr>:2021-09-02 12:44:52.784111: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1418] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:1 with 30555 MB memory) -> physical GPU (device: 1, name: Tesla V100-SXM2-32GB, pci bus id: 0000:3b:00.0, compute capability: 7.0)
[1,1]<stderr>:2021-09-02 12:44:52.784288: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1418] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:1 with 30555 MB memory) -> physical GPU (device: 1, name: Tesla V100-SXM2-32GB, pci bus id: 0000:3b:00.0, compute capability: 7.0)
[1,1]<stderr>:DEEPMD INFO ---Summary of DataSystem: training -----------------------------------------------
[1,0]<stderr>:DEEPMD INFO ---Summary of DataSystem: training -----------------------------------------------
[1,1]<stderr>:DEEPMD INFO found 24 system(s):
[1,1]<stderr>:DEEPMD INFO system natoms bch_sz n_bch prob pbc
[1,1]<stderr>:DEEPMD INFO ../data/init/004/V112.0 128 1 40 0.042 T
[1,0]<stderr>:DEEPMD INFO found 24 system(s):
[1,1]<stderr>:DEEPMD INFO ../data/init/004/V114.0 128 1 40 0.042 T
[1,0]<stderr>:DEEPMD INFO system natoms bch_sz n_bch prob pbc
[1,1]<stderr>:DEEPMD INFO ../data/init/004/V110.0 128 1 40 0.042 T
[1,0]<stderr>:DEEPMD INFO ../data/init/004/V112.0 128 1 40 0.042 T
[1,1]<stderr>:DEEPMD INFO ../data/init/004/V116.0 128 1 40 0.042 T
[1,0]<stderr>:DEEPMD INFO ../data/init/004/V114.0 128 1 40 0.042 T
[1,1]<stderr>:DEEPMD INFO ../data/init/003/V112.0 128 1 40 0.042 T
[1,0]<stderr>:DEEPMD INFO ../data/init/004/V110.0 128 1 40 0.042 T
[1,1]<stderr>:DEEPMD INFO ../data/init/003/V110.0 128 1 40 0.042 T
[1,0]<stderr>:DEEPMD INFO ../data/init/004/V116.0 128 1 40 0.042 T
[1,1]<stderr>:DEEPMD INFO ../data/init/003/V114.0 128 1 40 0.042 T
[1,0]<stderr>:DEEPMD INFO ../data/init/003/V112.0 128 1 40 0.042 T
[1,1]<stderr>:DEEPMD INFO ../data/init/003/V116.0 128 1 40 0.042 T
[1,0]<stderr>:DEEPMD INFO ../data/init/003/V110.0 128 1 40 0.042 T
[1,1]<stderr>:DEEPMD INFO ../data/init/001/V116.0 128 1 40 0.042 T
[1,0]<stderr>:DEEPMD INFO ../data/init/003/V114.0 128 1 40 0.042 T
[1,1]<stderr>:DEEPMD INFO ../data/init/001/V110.0 128 1 40 0.042 T
[1,0]<stderr>:DEEPMD INFO ../data/init/003/V116.0 128 1 40 0.042 T
[1,1]<stderr>:DEEPMD INFO ../data/init/001/V114.0 128 1 40 0.042 T
[1,0]<stderr>:DEEPMD INFO ../data/init/001/V116.0 128 1 40 0.042 T
[1,1]<stderr>:DEEPMD INFO ../data/init/001/V112.0 128 1 40 0.042 T
[1,0]<stderr>:DEEPMD INFO ../data/init/001/V110.0 128 1 40 0.042 T
[1,1]<stderr>:DEEPMD INFO ../data/init/002/V110.0 128 1 40 0.042 T
[1,0]<stderr>:DEEPMD INFO ../data/init/001/V114.0 128 1 40 0.042 T
[1,1]<stderr>:DEEPMD INFO ../data/init/002/V116.0 128 1 40 0.042 T
[1,0]<stderr>:DEEPMD INFO ../data/init/001/V112.0 128 1 40 0.042 T
[1,1]<stderr>:DEEPMD INFO ../data/init/002/V112.0 128 1 40 0.042 T
[1,0]<stderr>:DEEPMD INFO ../data/init/002/V110.0 128 1 40 0.042 T
[1,1]<stderr>:DEEPMD INFO ../data/init/002/V114.0 128 1 40 0.042 T
[1,0]<stderr>:DEEPMD INFO ../data/init/002/V116.0 128 1 40 0.042 T
[1,1]<stderr>:DEEPMD INFO ../data/init/005/V44.0 128 1 40 0.042 T
[1,0]<stderr>:DEEPMD INFO ../data/init/002/V112.0 128 1 40 0.042 T
[1,1]<stderr>:DEEPMD INFO ../data/init/005/V46.0 128 1 40 0.042 T
[1,1]<stderr>:DEEPMD INFO ../data/init/005/V36.0 128 1 40 0.042 T
[1,0]<stderr>:DEEPMD INFO ../data/init/002/V114.0 128 1 40 0.042 T
[1,1]<stderr>:DEEPMD INFO ../data/init/005/V48.0 128 1 40 0.042 T
[1,0]<stderr>:DEEPMD INFO ../data/init/005/V44.0 128 1 40 0.042 T
[1,1]<stderr>:DEEPMD INFO ../data/init/005/V38.0 128 1 40 0.042 T
[1,0]<stderr>:DEEPMD INFO ../data/init/005/V46.0 128 1 40 0.042 T
[1,1]<stderr>:DEEPMD INFO ../data/init/005/V42.0 128 1 40 0.042 T
[1,0]<stderr>:DEEPMD INFO ../data/init/005/V36.0 128 1 40 0.042 T
[1,1]<stderr>:DEEPMD INFO ../data/init/005/V50.0 128 1 40 0.042 T
[1,0]<stderr>:DEEPMD INFO ../data/init/005/V48.0 128 1 40 0.042 T
[1,1]<stderr>:DEEPMD INFO ../data/init/005/V40.0 128 1 40 0.042 T
[1,0]<stderr>:DEEPMD INFO ../data/init/005/V38.0 128 1 40 0.042 T
[1,1]<stderr>:DEEPMD INFO --------------------------------------------------------------------------------------
[1,0]<stderr>:DEEPMD INFO ../data/init/005/V42.0 128 1 40 0.042 T
[1,1]<stderr>:DEEPMD INFO training without frame parameter
[1,0]<stderr>:DEEPMD INFO ../data/init/005/V50.0 128 1 40 0.042 T
[1,0]<stderr>:DEEPMD INFO ../data/init/005/V40.0 128 1 40 0.042 T
[1,0]<stderr>:DEEPMD INFO --------------------------------------------------------------------------------------
[1,0]<stderr>:DEEPMD INFO training without frame parameter
[1,0]<stderr>:2021-09-02 12:44:54.392611: I tensorflow/stream_executor/cuda/cuda_driver.cc:789] failed to allocate 29.84G (32039239680 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY: out of memory
[1,0]<stderr>:2021-09-02 12:44:54.393996: I tensorflow/stream_executor/cuda/cuda_driver.cc:789] failed to allocate 26.85G (28835315712 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY: out of memory
[1,0]<stderr>:2021-09-02 12:44:54.395343: I tensorflow/stream_executor/cuda/cuda_driver.cc:789] failed to allocate 24.17G (25951782912 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY: out of memory
[1,0]<stderr>:2021-09-02 12:44:54.396857: I tensorflow/stream_executor/cuda/cuda_driver.cc:789] failed to allocate 21.75G (23356604416 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY: out of memory
[1,0]<stderr>:2021-09-02 12:44:54.398347: I tensorflow/stream_executor/cuda/cuda_driver.cc:789] failed to allocate 19.58G (21020944384 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY: out of memory
[1,0]<stderr>:2021-09-02 12:44:54.399878: I tensorflow/stream_executor/cuda/cuda_driver.cc:789] failed to allocate 17.62G (18918848512 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY: out of memory
[1,0]<stderr>:2021-09-02 12:44:54.401414: I tensorflow/stream_executor/cuda/cuda_driver.cc:789] failed to allocate 15.86G (17026963456 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY: out of memory
[1,0]<stderr>:2021-09-02 12:44:54.402936: I tensorflow/stream_executor/cuda/cuda_driver.cc:789] failed to allocate 14.27G (15324266496 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY: out of memory
[1,0]<stderr>:2021-09-02 12:44:54.406219: I tensorflow/stream_executor/cuda/cuda_driver.cc:789] failed to allocate 12.84G (13791839232 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY: out of memory
[1,0]<stderr>:2021-09-02 12:44:54.409880: I tensorflow/stream_executor/cuda/cuda_driver.cc:789] failed to allocate 11.56G (12412654592 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY: out of memory
[1,0]<stderr>:2021-09-02 12:44:54.412509: I tensorflow/stream_executor/cuda/cuda_driver.cc:789] failed to allocate 10.40G (11171388416 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY: out of memory
[1,0]<stderr>:2021-09-02 12:44:54.416159: I tensorflow/stream_executor/cuda/cuda_driver.cc:789] failed to allocate 9.36G (10054249472 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY: out of memory
[1,0]<stderr>:2021-09-02 12:44:54.418764: I tensorflow/stream_executor/cuda/cuda_driver.cc:789] failed to allocate 8.43G (9048823808 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY: out of memory
[1,0]<stderr>:2021-09-02 12:44:54.422393: I tensorflow/stream_executor/cuda/cuda_driver.cc:789] failed to allocate 7.58G (8143941120 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY: out of memory
[1,0]<stderr>:2021-09-02 12:44:54.424980: I tensorflow/stream_executor/cuda/cuda_driver.cc:789] failed to allocate 6.83G (7329546752 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY: out of memory
[1,0]<stderr>:2021-09-02 12:44:54.428616: I tensorflow/stream_executor/cuda/cuda_driver.cc:789] failed to allocate 6.14G (6596592128 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY: out of memory
[1,0]<stderr>:2021-09-02 12:44:54.431204: I tensorflow/stream_executor/cuda/cuda_driver.cc:789] failed to allocate 5.53G (5936932864 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY: out of memory
[1,0]<stderr>:2021-09-02 12:44:54.433794: I tensorflow/stream_executor/cuda/cuda_driver.cc:789] failed to allocate 4.98G (5343239680 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY: out of memory
[1,0]<stderr>:2021-09-02 12:44:54.437447: I tensorflow/stream_executor/cuda/cuda_driver.cc:789] failed to allocate 4.48G (4808915456 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY: out of memory
[1,0]<stderr>:2021-09-02 12:44:54.440039: I tensorflow/stream_executor/cuda/cuda_driver.cc:789] failed to allocate 4.03G (4328023552 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY: out of memory
[1,0]<stderr>:2021-09-02 12:44:54.443686: I tensorflow/stream_executor/cuda/cuda_driver.cc:789] failed to allocate 3.63G (3895220992 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY: out of memory
[1,0]<stderr>:2021-09-02 12:44:54.446268: I tensorflow/stream_executor/cuda/cuda_driver.cc:789] failed to allocate 3.26G (3505698816 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY: out of memory
[1,0]<stderr>:2021-09-02 12:44:54.449884: I tensorflow/stream_executor/cuda/cuda_driver.cc:789] failed to allocate 2.94G (3155128832 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY: out of memory
[1,0]<stderr>:2021-09-02 12:44:54.452390: I tensorflow/stream_executor/cuda/cuda_driver.cc:789] failed to allocate 2.64G (2839616000 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY: out of memory
[1,0]<stderr>:2021-09-02 12:44:54.454955: I tensorflow/stream_executor/cuda/cuda_driver.cc:789] failed to allocate 2.38G (2555654400 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY: out of memory
[1,0]<stderr>:2021-09-02 12:44:54.458543: I tensorflow/stream_executor/cuda/cuda_driver.cc:789] failed to allocate 2.14G (2300088832 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY: out of memory
[1,0]<stderr>:2021-09-02 12:44:54.461126: I tensorflow/stream_executor/cuda/cuda_driver.cc:789] failed to allocate 1.93G (2070080000 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY: out of memory
[1,0]<stderr>:2021-09-02 12:44:54.464731: I tensorflow/stream_executor/cuda/cuda_driver.cc:789] failed to allocate 1.73G (1863072000 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY: out of memory
[1,0]<stderr>:2021-09-02 12:44:54.467280: I tensorflow/stream_executor/cuda/cuda_driver.cc:789] failed to allocate 1.56G (1676764928 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY: out of memory
[1,0]<stderr>:2021-09-02 12:44:54.470896: I tensorflow/stream_executor/cuda/cuda_driver.cc:789] failed to allocate 1.41G (1509088512 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY: out of memory
[1,0]<stderr>:2021-09-02 12:44:54.473391: I tensorflow/stream_executor/cuda/cuda_driver.cc:789] failed to allocate 1.26G (1358179584 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY: out of memory
[1,0]<stderr>:2021-09-02 12:44:54.475956: I tensorflow/stream_executor/cuda/cuda_driver.cc:789] failed to allocate 1.14G (1222361600 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY: out of memory
[1,0]<stderr>:2021-09-02 12:44:54.479581: I tensorflow/stream_executor/cuda/cuda_driver.cc:789] failed to allocate 1.02G (1100125440 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY: out of memory
[1,0]<stderr>:OMP: Info #254: KMP_AFFINITY: pid 12452 tid 12452 thread 0 bound to OS proc set 0
[1,0]<stderr>:2021-09-02 12:44:55.882946: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1733] Found device 0 with properties:
[1,0]<stderr>:pciBusID: 0000:18:00.0 name: Tesla V100-SXM2-32GB computeCapability: 7.0
[1,0]<stderr>:coreClock: 1.53GHz coreCount: 80 deviceMemorySize: 31.75GiB deviceMemoryBandwidth: 836.37GiB/s
[1,0]<stderr>:2021-09-02 12:44:55.884261: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1733] Found device 1 with properties:
[1,0]<stderr>:pciBusID: 0000:3b:00.0 name: Tesla V100-SXM2-32GB computeCapability: 7.0
[1,0]<stderr>:coreClock: 1.53GHz coreCount: 80 deviceMemorySize: 31.75GiB deviceMemoryBandwidth: 836.37GiB/s
[1,0]<stderr>:2021-09-02 12:44:55.887535: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1871] Adding visible gpu devices: 0, 1
[1,0]<stderr>:2021-09-02 12:44:55.887600: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1258] Device interconnect StreamExecutor with strength 1 edge matrix:
[1,0]<stderr>:2021-09-02 12:44:55.887616: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1264] 0 1
[1,0]<stderr>:2021-09-02 12:44:55.887635: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1277] 0: N Y
[1,0]<stderr>:2021-09-02 12:44:55.887648: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1277] 1: Y N
[1,0]<stderr>:2021-09-02 12:44:55.889699: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1418] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 30555 MB memory) -> physical GPU (device: 0, name: Tesla V100-SXM2-32GB, pci bus id: 0000:18:00.0, compute capability: 7.0)
[1,0]<stderr>:2021-09-02 12:44:55.891118: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1418] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:1 with 30555 MB memory) -> physical GPU (device: 1, name: Tesla V100-SXM2-32GB, pci bus id: 0000:3b:00.0, compute capability: 7.0)
[1,1]<stderr>:OMP: Info #254: KMP_AFFINITY: pid 12453 tid 12453 thread 0 bound to OS proc set 0
[1,1]<stderr>:2021-09-02 12:44:56.427814: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1733] Found device 0 with properties:
[1,1]<stderr>:pciBusID: 0000:18:00.0 name: Tesla V100-SXM2-32GB computeCapability: 7.0
[1,1]<stderr>:coreClock: 1.53GHz coreCount: 80 deviceMemorySize: 31.75GiB deviceMemoryBandwidth: 836.37GiB/s
[1,1]<stderr>:2021-09-02 12:44:56.433542: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1733] Found device 1 with properties:
[1,1]<stderr>:pciBusID: 0000:3b:00.0 name: Tesla V100-SXM2-32GB computeCapability: 7.0
[1,1]<stderr>:coreClock: 1.53GHz coreCount: 80 deviceMemorySize: 31.75GiB deviceMemoryBandwidth: 836.37GiB/s
[1,1]<stderr>:2021-09-02 12:44:56.438922: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1871] Adding visible gpu devices: 0, 1
[1,1]<stderr>:2021-09-02 12:44:56.438972: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1258] Device interconnect StreamExecutor with strength 1 edge matrix:
[1,1]<stderr>:2021-09-02 12:44:56.439854: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1264] 0 1
[1,1]<stderr>:2021-09-02 12:44:56.439871: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1277] 0: N Y
[1,1]<stderr>:2021-09-02 12:44:56.439885: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1277] 1: Y N
[1,1]<stderr>:2021-09-02 12:44:56.447266: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1418] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 30555 MB memory) -> physical GPU (device: 0, name: Tesla V100-SXM2-32GB, pci bus id: 0000:18:00.0, compute capability: 7.0)
[1,1]<stderr>:2021-09-02 12:44:56.449673: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1418] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:1 with 30555 MB memory) -> physical GPU (device: 1, name: Tesla V100-SXM2-32GB, pci bus id: 0000:3b:00.0, compute capability: 7.0)
[1,0]<stderr>:DEEPMD INFO training data with min nbor dist: 1.0231442946289506
[1,0]<stderr>:DEEPMD INFO training data with max nbor size: [64, 134, 64]
[1,0]<stderr>:DEEPMD INFO built lr
[1,1]<stderr>:DEEPMD INFO training data with min nbor dist: 1.0231442946289506
[1,1]<stderr>:DEEPMD INFO training data with max nbor size: [64, 134, 64]
[1,1]<stderr>:DEEPMD INFO built lr
[1,0]<stderr>:DEEPMD INFO built network
[1,1]<stderr>:DEEPMD INFO built network
[1,0]<stderr>:DEEPMD INFO built training
[1,0]<stderr>:2021-09-02 12:45:12.125558: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1733] Found device 0 with properties:
[1,0]<stderr>:pciBusID: 0000:18:00.0 name: Tesla V100-SXM2-32GB computeCapability: 7.0
[1,0]<stderr>:coreClock: 1.53GHz coreCount: 80 deviceMemorySize: 31.75GiB deviceMemoryBandwidth: 836.37GiB/s
[1,0]<stderr>:2021-09-02 12:45:12.127459: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1871] Adding visible gpu devices: 0
[1,0]<stderr>:2021-09-02 12:45:12.127529: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1258] Device interconnect StreamExecutor with strength 1 edge matrix:
[1,0]<stderr>:2021-09-02 12:45:12.127546: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1264] 0
[1,0]<stderr>:2021-09-02 12:45:12.127560: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1277] 0: N
[1,0]<stderr>:2021-09-02 12:45:12.128376: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1418] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 30555 MB memory) -> physical GPU (device: 0, name: Tesla V100-SXM2-32GB, pci bus id: 0000:18:00.0, compute capability: 7.0)
[1,1]<stderr>:DEEPMD INFO built training
[1,1]<stderr>:2021-09-02 12:45:12.196720: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1733] Found device 0 with properties:
[1,1]<stderr>:pciBusID: 0000:18:00.0 name: Tesla V100-SXM2-32GB computeCapability: 7.0
[1,1]<stderr>:coreClock: 1.53GHz coreCount: 80 deviceMemorySize: 31.75GiB deviceMemoryBandwidth: 836.37GiB/s
[1,1]<stderr>:2021-09-02 12:45:12.198628: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1871] Adding visible gpu devices: 0
[1,1]<stderr>:2021-09-02 12:45:12.198704: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1258] Device interconnect StreamExecutor with strength 1 edge matrix:
[1,1]<stderr>:2021-09-02 12:45:12.198729: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1264] 0
[1,1]<stderr>:2021-09-02 12:45:12.198742: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1277] 0: N
[1,1]<stderr>:2021-09-02 12:45:12.200548: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1418] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 30555 MB memory) -> physical GPU (device: 0, name: Tesla V100-SXM2-32GB, pci bus id: 0000:18:00.0, compute capability: 7.0)
[1,0]<stderr>:DEEPMD INFO initialize model from scratch
[1,1]<stderr>:DEEPMD INFO initialize model from scratch
[1,0]<stderr>:DEEPMD INFO start training at lr 5.00e-04 (== 5.00e-04), decay_step 5000, decay_rate 0.763002, final lr will be 1.00e-08
[1,1]<stderr>:DEEPMD INFO start training at lr 5.00e-04 (== 5.00e-04), decay_step 5000, decay_rate 0.763002, final lr will be 1.00e-08
[1,1]<stderr>:2021-09-02 12:45:15.559436: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcublas.so.10
[1,0]<stderr>:2021-09-02 12:45:15.559362: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcublas.so.10
[1,0]<stderr>:2021-09-02 12:45:16.024942: E tensorflow/stream_executor/cuda/cuda_blas.cc:226] failed to create cublas handle: CUBLAS_STATUS_NOT_INITIALIZED
[1,0]<stderr>:2021-09-02 12:45:16.025026: W tensorflow/stream_executor/stream.cc:1455] attempting to perform BLAS operation using StreamExecutor without BLAS support
[1,1]<stderr>:2021-09-02 12:45:16.033066: E tensorflow/stream_executor/cuda/cuda_blas.cc:226] failed to create cublas handle: CUBLAS_STATUS_NOT_INITIALIZED
[1,1]<stderr>:2021-09-02 12:45:16.033134: W tensorflow/stream_executor/stream.cc:1455] attempting to perform BLAS operation using StreamExecutor without BLAS support
[1,0]<stderr>:Traceback (most recent call last):
[1,1]<stderr>:Traceback (most recent call last):
[1,0]<stderr>: File "/expanse/lustre/scratch/chazeon/temp_project/deepmd-kit-2.0.0-cuda10.1_gpu/lib/python3.9/site-packages/tensorflow/python/client/session.py", line 1375, in _do_call
[1,1]<stderr>: File "/expanse/lustre/scratch/chazeon/temp_project/deepmd-kit-2.0.0-cuda10.1_gpu/lib/python3.9/site-packages/tensorflow/python/client/session.py", line 1375, in _do_call
[1,0]<stderr>: return fn(*args)
[1,0]<stderr>: File "/expanse/lustre/scratch/chazeon/temp_project/deepmd-kit-2.0.0-cuda10.1_gpu/lib/python3.9/site-packages/tensorflow/python/client/session.py", line 1359, in _run_fn
[1,0]<stderr>: return self._call_tf_sessionrun(options, feed_dict, fetch_list,
[1,0]<stderr>: File "/expanse/lustre/scratch/chazeon/temp_project/deepmd-kit-2.0.0-cuda10.1_gpu/lib/python3.9/site-packages/tensorflow/python/client/session.py", line 1451, in _call_tf_sessionrun
[1,0]<stderr>: return tf_session.TF_SessionRun_wrapper(self._session, options, feed_dict,
[1,0]<stderr>:tensorflow.python.framework.errors_impl.InternalError: 2 root error(s) found.
[1,0]<stderr>: (0) Internal: Blas xGEMM launch failed : a.shape=[1,8192,1], b.shape=[1,1,25], m=8192, n=25, k=1
[1,0]<stderr>: [[{{node filter_type_1/MatMul_8}}]]
[1,0]<stderr>: [[l2_force_test/_39]]
[1,0]<stderr>: (1) Internal: Blas xGEMM launch failed : a.shape=[1,8192,1], b.shape=[1,1,25], m=8192, n=25, k=1
[1,0]<stderr>: [[{{node filter_type_1/MatMul_8}}]]
[1,0]<stderr>:0 successful operations.
[1,0]<stderr>:0 derived errors ignored.[1,0]<stderr>:
[1,0]<stderr>:
[1,0]<stderr>:During handling of the above exception, another exception occurred:
[1,0]<stderr>:
[1,0]<stderr>:Traceback (most recent call last):
[1,0]<stderr>: File "/expanse/lustre/scratch/chazeon/temp_project/deepmd-kit-2.0.0-cuda10.1_gpu/bin/dp", line 10, in <module>
[1,0]<stderr>: sys.exit(main())
[1,0]<stderr>: File "/expanse/lustre/scratch/chazeon/temp_project/deepmd-kit-2.0.0-cuda10.1_gpu/lib/python3.9/site-packages/deepmd/entrypoints/main.py", line 437, in main
[1,0]<stderr>: train_dp(**dict_args)
[1,0]<stderr>: File "/expanse/lustre/scratch/chazeon/temp_project/deepmd-kit-2.0.0-cuda10.1_gpu/lib/python3.9/site-packages/deepmd/entrypoints/train.py", line 102, in train
[1,0]<stderr>: _do_work(jdata, run_opt, is_compress)
[1,0]<stderr>: File "/expanse/lustre/scratch/chazeon/temp_project/deepmd-kit-2.0.0-cuda10.1_gpu/lib/python3.9/site-packages/deepmd/entrypoints/train.py", line 163, in _do_work
[1,0]<stderr>: model.train(train_data, valid_data)
[1,0]<stderr>: File "/expanse/lustre/scratch/chazeon/temp_project/deepmd-kit-2.0.0-cuda10.1_gpu/lib/python3.9/site-packages/deepmd/train/trainer.py", line 506, in train
[1,0]<stderr>: self.valid_on_the_fly(fp, [train_batch], valid_batches, print_header=True)
[1,0]<stderr>: File "/expanse/lustre/scratch/chazeon/temp_project/deepmd-kit-2.0.0-cuda10.1_gpu/lib/python3.9/site-packages/deepmd/train/trainer.py", line 600, in valid_on_the_fly
[1,0]<stderr>: train_results = self.get_evaluation_results(train_batches)
[1,0]<stderr>: File "/expanse/lustre/scratch/chazeon/temp_project/deepmd-kit-2.0.0-cuda10.1_gpu/lib/python3.9/site-packages/deepmd/train/trainer.py", line 652, in get_evaluation_results
[1,0]<stderr>: results = self.loss.eval(self.sess, feed_dict, natoms)
[1,0]<stderr>: File "/expanse/lustre/scratch/chazeon/temp_project/deepmd-kit-2.0.0-cuda10.1_gpu/lib/python3.9/site-packages/deepmd/loss/ener.py", line 140, in eval
[1,1]<stderr>: return fn(*args)
[1,1]<stderr>: File "/expanse/lustre/scratch/chazeon/temp_project/deepmd-kit-2.0.0-cuda10.1_gpu/lib/python3.9/site-packages/tensorflow/python/client/session.py", line 1359, in _run_fn
[1,1]<stderr>: return self._call_tf_sessionrun(options, feed_dict, fetch_list,
[1,1]<stderr>: File "/expanse/lustre/scratch/chazeon/temp_project/deepmd-kit-2.0.0-cuda10.1_gpu/lib/python3.9/site-packages/tensorflow/python/client/session.py", line 1451, in _call_tf_sessionrun
[1,1]<stderr>: return tf_session.TF_SessionRun_wrapper(self._session, options, feed_dict,
[1,1]<stderr>:tensorflow.python.framework.errors_impl.InternalError: 2 root error(s) found.
[1,1]<stderr>: (0) Internal: Blas xGEMM launch failed : a.shape=[1,8192,1], b.shape=[1,1,25], m=8192, n=25, k=1
[1,1]<stderr>: [[{{node filter_type_1/MatMul_8}}]]
[1,1]<stderr>: [[l2_force_test/_39]]
[1,1]<stderr>: (1) Internal: Blas xGEMM launch failed : a.shape=[1,8192,1], b.shape=[1,1,25], m=8192, n=25, k=1
[1,1]<stderr>: [[{{node filter_type_1/MatMul_8}}]]
[1,1]<stderr>:0 successful operations.
[1,1]<stderr>:0 derived errors ignored.[1,1]<stderr>:
[1,1]<stderr>:
[1,1]<stderr>:During handling of the above exception, another exception occurred:
[1,1]<stderr>:
[1,1]<stderr>:Traceback (most recent call last):
[1,1]<stderr>: File "/expanse/lustre/scratch/chazeon/temp_project/deepmd-kit-2.0.0-cuda10.1_gpu/bin/dp", line 10, in <module>
[1,0]<stderr>: error, error_e, error_f, error_v, error_ae, error_pf = run_sess(sess, run_data, feed_dict=feed_dict)
[1,0]<stderr>: File "/expanse/lustre/scratch/chazeon/temp_project/deepmd-kit-2.0.0-cuda10.1_gpu/lib/python3.9/site-packages/deepmd/utils/sess.py", line 20, in run_sess
[1,1]<stderr>: sys.exit(main())
[1,1]<stderr>: File "/expanse/lustre/scratch/chazeon/temp_project/deepmd-kit-2.0.0-cuda10.1_gpu/lib/python3.9/site-packages/deepmd/entrypoints/main.py", line 437, in main
[1,1]<stderr>: train_dp(**dict_args)
[1,1]<stderr>: File "/expanse/lustre/scratch/chazeon/temp_project/deepmd-kit-2.0.0-cuda10.1_gpu/lib/python3.9/site-packages/deepmd/entrypoints/train.py", line 102, in train
[1,1]<stderr>: _do_work(jdata, run_opt, is_compress)
[1,1]<stderr>: File "/expanse/lustre/scratch/chazeon/temp_project/deepmd-kit-2.0.0-cuda10.1_gpu/lib/python3.9/site-packages/deepmd/entrypoints/train.py", line 163, in _do_work
[1,1]<stderr>: model.train(train_data, valid_data)
[1,1]<stderr>: File "/expanse/lustre/scratch/chazeon/temp_project/deepmd-kit-2.0.0-cuda10.1_gpu/lib/python3.9/site-packages/deepmd/train/trainer.py", line 506, in train
[1,1]<stderr>: self.valid_on_the_fly(fp, [train_batch], valid_batches, print_header=True)
[1,1]<stderr>: File "/expanse/lustre/scratch/chazeon/temp_project/deepmd-kit-2.0.0-cuda10.1_gpu/lib/python3.9/site-packages/deepmd/train/trainer.py", line 600, in valid_on_the_fly
[1,1]<stderr>: train_results = self.get_evaluation_results(train_batches)
[1,1]<stderr>: File "/expanse/lustre/scratch/chazeon/temp_project/deepmd-kit-2.0.0-cuda10.1_gpu/lib/python3.9/site-packages/deepmd/train/trainer.py", line 652, in get_evaluation_results
[1,1]<stderr>: results = self.loss.eval(self.sess, feed_dict, natoms)
[1,1]<stderr>: File "/expanse/lustre/scratch/chazeon/temp_project/deepmd-kit-2.0.0-cuda10.1_gpu/lib/python3.9/site-packages/deepmd/loss/ener.py", line 140, in eval
[1,1]<stderr>: error, error_e, error_f, error_v, error_ae, error_pf = run_sess(sess, run_data, feed_dict=feed_dict)
[1,1]<stderr>: File "/expanse/lustre/scratch/chazeon/temp_project/deepmd-kit-2.0.0-cuda10.1_gpu/lib/python3.9/site-packages/deepmd/utils/sess.py", line 20, in run_sess
[1,0]<stderr>: return sess.run(*args, **kwargs)
[1,0]<stderr>: File "/expanse/lustre/scratch/chazeon/temp_project/deepmd-kit-2.0.0-cuda10.1_gpu/lib/python3.9/site-packages/tensorflow/python/client/session.py", line 967, in run
[1,0]<stderr>: result = self._run(None, fetches, feed_dict, options_ptr,
[1,0]<stderr>: File "/expanse/lustre/scratch/chazeon/temp_project/deepmd-kit-2.0.0-cuda10.1_gpu/lib/python3.9/site-packages/tensorflow/python/client/session.py", line 1190, in _run
[1,0]<stderr>: results = self._do_run(handle, final_targets, final_fetches,
[1,0]<stderr>: File "/expanse/lustre/scratch/chazeon/temp_project/deepmd-kit-2.0.0-cuda10.1_gpu/lib/python3.9/site-packages/tensorflow/python/client/session.py", line 1368, in _do_run
[1,0]<stderr>: return self._do_call(_run_fn, feeds, fetches, targets, options,
[1,0]<stderr>: File "/expanse/lustre/scratch/chazeon/temp_project/deepmd-kit-2.0.0-cuda10.1_gpu/lib/python3.9/site-packages/tensorflow/python/client/session.py", line 1394, in _do_call
[1,0]<stderr>: raise type(e)(node_def, op, message)
[1,0]<stderr>:tensorflow.python.framework.errors_impl.InternalError: 2 root error(s) found.
[1,0]<stderr>: (0) Internal: Blas xGEMM launch failed : a.shape=[1,8192,1], b.shape=[1,1,25], m=8192, n=25, k=1
[1,0]<stderr>: [[node filter_type_1/MatMul_8 (defined at /lib/python3.9/site-packages/deepmd/utils/network.py:176) ]]
[1,0]<stderr>: [[l2_force_test/_39]]
[1,0]<stderr>: (1) Internal: Blas xGEMM launch failed : a.shape=[1,8192,1], b.shape=[1,1,25], m=8192, n=25, k=1
[1,0]<stderr>: [[node filter_type_1/MatMul_8 (defined at /lib/python3.9/site-packages/deepmd/utils/network.py:176) ]]
[1,0]<stderr>:0 successful operations.
[1,0]<stderr>:0 derived errors ignored.
[1,0]<stderr>:
[1,0]<stderr>:Errors may have originated from an input operation.
[1,0]<stderr>:Input Source operations connected to node filter_type_1/MatMul_8:
[1,0]<stderr>: filter_type_1/matrix_1_2/read (defined at /lib/python3.9/site-packages/deepmd/utils/network.py:155)
[1,0]<stderr>: filter_type_1/Reshape_15 (defined at /lib/python3.9/site-packages/deepmd/descriptor/se_a.py:715)
[1,0]<stderr>:
[1,0]<stderr>:Input Source operations connected to node filter_type_1/MatMul_8:
[1,0]<stderr>: filter_type_1/matrix_1_2/read (defined at /lib/python3.9/site-packages/deepmd/utils/network.py:155)
[1,0]<stderr>: filter_type_1/Reshape_15 (defined at /lib/python3.9/site-packages/deepmd/descriptor/se_a.py:715)
[1,0]<stderr>:
[1,0]<stderr>:Original stack trace for 'filter_type_1/MatMul_8':
[1,0]<stderr>: File "/bin/dp", line 10, in <module>
[1,0]<stderr>: sys.exit(main())
[1,0]<stderr>: File "/lib/python3.9/site-packages/deepmd/entrypoints/main.py", line 437, in main
[1,0]<stderr>: train_dp(**dict_args)
[1,0]<stderr>: File "/lib/python3.9/site-packages/deepmd/entrypoints/train.py", line 102, in train
[1,0]<stderr>: _do_work(jdata, run_opt, is_compress)
[1,0]<stderr>: File "/lib/python3.9/site-packages/deepmd/entrypoints/train.py", line 158, in _do_work
[1,0]<stderr>: model.build(train_data, stop_batch)
[1,0]<stderr>: File "/lib/python3.9/site-packages/deepmd/train/trainer.py", line 338, in build
[1,0]<stderr>: self._build_network(data)
[1,0]<stderr>: File "/lib/python3.9/site-packages/deepmd/train/trainer.py", line 362, in _build_network
[1,0]<stderr>: = self.model.build (self.place_holders['coord'],
[1,0]<stderr>: File "/lib/python3.9/site-packages/deepmd/model/ener.py", line 159, in build
[1,0]<stderr>: = self.descrpt.build(coord_,
[1,0]<stderr>: File "/lib/python3.9/site-packages/deepmd/descriptor/se_a.py", line 433, in build
[1,0]<stderr>: self.dout, self.qmat = self._pass_filter(self.descrpt_reshape,
[1,0]<stderr>: File "/lib/python3.9/site-packages/deepmd/descriptor/se_a.py", line 590, in _pass_filter
[1,0]<stderr>: layer, qmat = self._filter(tf.cast(inputs_i, self.filter_precision), type_i, name='filter_type_'+str(type_i)+suffix, natoms=natoms, reuse=reuse, trainable = trainable, activation_fn = self.filter_activation_fn)
[1,0]<stderr>: File "/lib/python3.9/site-packages/deepmd/descriptor/se_a.py", line 793, in _filter
[1,0]<stderr>: ret = self._filter_lower(
[1,0]<stderr>: File "/lib/python3.9/site-packages/deepmd/descriptor/se_a.py", line 732, in _filter_lower
[1,0]<stderr>: xyz_scatter = embedding_net(
[1,0]<stderr>: File "/lib/python3.9/site-packages/deepmd/utils/network.py", line 176, in embedding_net
[1,0]<stderr>: hidden = tf.reshape(activation_fn(tf.matmul(xx, w) + b), [-1, outputs_size[ii]])
[1,0]<stderr>: File "/lib/python3.9/site-packages/tensorflow/python/util/dispatch.py", line 206, in wrapper
[1,0]<stderr>: return target(*args, **kwargs)
[1,0]<stderr>: File "/lib/python3.9/site-packages/tensorflow/python/ops/math_ops.py", line 3489, in matmul
[1,0]<stderr>: return gen_math_ops.mat_mul(
[1,0]<stderr>: File "/lib/python3.9/site-packages/tensorflow/python/ops/gen_math_ops.py", line 5716, in mat_mul
[1,0]<stderr>: _, _, _op, _outputs = _op_def_library._apply_op_helper(
[1,0]<stderr>: File "/lib/python3.9/site-packages/tensorflow/python/framework/op_def_library.py", line 748, in _apply_op_helper
[1,0]<stderr>: op = g._create_op_internal(op_type_name, inputs, dtypes=None,
[1,0]<stderr>: File "/lib/python3.9/site-packages/tensorflow/python/framework/ops.py", line 3557, in _create_op_internal
[1,0]<stderr>: ret = Operation(
[1,0]<stderr>: File "/lib/python3.9/site-packages/tensorflow/python/framework/ops.py", line 2045, in __init__
[1,0]<stderr>: self._traceback = tf_stack.extract_stack_for_node(self._c_op)
[1,0]<stderr>:
[1,1]<stderr>: return sess.run(*args, **kwargs)
[1,1]<stderr>: File "/expanse/lustre/scratch/chazeon/temp_project/deepmd-kit-2.0.0-cuda10.1_gpu/lib/python3.9/site-packages/tensorflow/python/client/session.py", line 967, in run
[1,1]<stderr>: result = self._run(None, fetches, feed_dict, options_ptr,
[1,1]<stderr>: File "/expanse/lustre/scratch/chazeon/temp_project/deepmd-kit-2.0.0-cuda10.1_gpu/lib/python3.9/site-packages/tensorflow/python/client/session.py", line 1190, in _run
[1,1]<stderr>: results = self._do_run(handle, final_targets, final_fetches,
[1,1]<stderr>: File "/expanse/lustre/scratch/chazeon/temp_project/deepmd-kit-2.0.0-cuda10.1_gpu/lib/python3.9/site-packages/tensorflow/python/client/session.py", line 1368, in _do_run
[1,1]<stderr>: return self._do_call(_run_fn, feeds, fetches, targets, options,
[1,1]<stderr>: File "/expanse/lustre/scratch/chazeon/temp_project/deepmd-kit-2.0.0-cuda10.1_gpu/lib/python3.9/site-packages/tensorflow/python/client/session.py", line 1394, in _do_call
[1,1]<stderr>: raise type(e)(node_def, op, message)
[1,1]<stderr>:tensorflow.python.framework.errors_impl.InternalError: 2 root error(s) found.
[1,1]<stderr>: (0) Internal: Blas xGEMM launch failed : a.shape=[1,8192,1], b.shape=[1,1,25], m=8192, n=25, k=1
[1,1]<stderr>: [[node filter_type_1/MatMul_8 (defined at /lib/python3.9/site-packages/deepmd/utils/network.py:176) ]]
[1,1]<stderr>: [[l2_force_test/_39]]
[1,1]<stderr>: (1) Internal: Blas xGEMM launch failed : a.shape=[1,8192,1], b.shape=[1,1,25], m=8192, n=25, k=1
[1,1]<stderr>: [[node filter_type_1/MatMul_8 (defined at /lib/python3.9/site-packages/deepmd/utils/network.py:176) ]]
[1,1]<stderr>:0 successful operations.
[1,1]<stderr>:0 derived errors ignored.
[1,1]<stderr>:
[1,1]<stderr>:Errors may have originated from an input operation.
[1,1]<stderr>:Input Source operations connected to node filter_type_1/MatMul_8:
[1,1]<stderr>: filter_type_1/matrix_1_2/read (defined at /lib/python3.9/site-packages/deepmd/utils/network.py:155)
[1,1]<stderr>: filter_type_1/Reshape_15 (defined at /lib/python3.9/site-packages/deepmd/descriptor/se_a.py:715)
[1,1]<stderr>:
[1,1]<stderr>:Input Source operations connected to node filter_type_1/MatMul_8:
[1,1]<stderr>: filter_type_1/matrix_1_2/read (defined at /lib/python3.9/site-packages/deepmd/utils/network.py:155)
[1,1]<stderr>: filter_type_1/Reshape_15 (defined at /lib/python3.9/site-packages/deepmd/descriptor/se_a.py:715)
[1,1]<stderr>:
[1,1]<stderr>:Original stack trace for 'filter_type_1/MatMul_8':
[1,1]<stderr>: File "/bin/dp", line 10, in <module>
[1,1]<stderr>: sys.exit(main())
[1,1]<stderr>: File "/lib/python3.9/site-packages/deepmd/entrypoints/main.py", line 437, in main
[1,1]<stderr>: train_dp(**dict_args)
[1,1]<stderr>: File "/lib/python3.9/site-packages/deepmd/entrypoints/train.py", line 102, in train
[1,1]<stderr>: _do_work(jdata, run_opt, is_compress)
[1,1]<stderr>: File "/lib/python3.9/site-packages/deepmd/entrypoints/train.py", line 158, in _do_work
[1,1]<stderr>: model.build(train_data, stop_batch)
[1,1]<stderr>: File "/lib/python3.9/site-packages/deepmd/train/trainer.py", line 338, in build
[1,1]<stderr>: self._build_network(data)
[1,1]<stderr>: File "/lib/python3.9/site-packages/deepmd/train/trainer.py", line 362, in _build_network
[1,1]<stderr>: = self.model.build (self.place_holders['coord'],
[1,1]<stderr>: File "/lib/python3.9/site-packages/deepmd/model/ener.py", line 159, in build
[1,1]<stderr>: = self.descrpt.build(coord_,
[1,1]<stderr>: File "/lib/python3.9/site-packages/deepmd/descriptor/se_a.py", line 433, in build
[1,1]<stderr>: self.dout, self.qmat = self._pass_filter(self.descrpt_reshape,
[1,1]<stderr>: File "/lib/python3.9/site-packages/deepmd/descriptor/se_a.py", line 590, in _pass_filter
[1,1]<stderr>: layer, qmat = self._filter(tf.cast(inputs_i, self.filter_precision), type_i, name='filter_type_'+str(type_i)+suffix, natoms=natoms, reuse=reuse, trainable = trainable, activation_fn = self.filter_activation_fn)
[1,1]<stderr>: File "/lib/python3.9/site-packages/deepmd/descriptor/se_a.py", line 793, in _filter
[1,1]<stderr>: ret = self._filter_lower(
[1,1]<stderr>: File "/lib/python3.9/site-packages/deepmd/descriptor/se_a.py", line 732, in _filter_lower
[1,1]<stderr>: xyz_scatter = embedding_net(
[1,1]<stderr>: File "/lib/python3.9/site-packages/deepmd/utils/network.py", line 176, in embedding_net
[1,1]<stderr>: hidden = tf.reshape(activation_fn(tf.matmul(xx, w) + b), [-1, outputs_size[ii]])
[1,1]<stderr>: File "/lib/python3.9/site-packages/tensorflow/python/util/dispatch.py", line 206, in wrapper
[1,1]<stderr>: return target(*args, **kwargs)
[1,1]<stderr>: File "/lib/python3.9/site-packages/tensorflow/python/ops/math_ops.py", line 3489, in matmul
[1,1]<stderr>: return gen_math_ops.mat_mul(
[1,1]<stderr>: File "/lib/python3.9/site-packages/tensorflow/python/ops/gen_math_ops.py", line 5716, in mat_mul
[1,1]<stderr>: _, _, _op, _outputs = _op_def_library._apply_op_helper(
[1,1]<stderr>: File "/lib/python3.9/site-packages/tensorflow/python/framework/op_def_library.py", line 748, in _apply_op_helper
[1,1]<stderr>: op = g._create_op_internal(op_type_name, inputs, dtypes=None,
[1,1]<stderr>: File "/lib/python3.9/site-packages/tensorflow/python/framework/ops.py", line 3557, in _create_op_internal
[1,1]<stderr>: ret = Operation(
[1,1]<stderr>: File "/lib/python3.9/site-packages/tensorflow/python/framework/ops.py", line 2045, in __init__
[1,1]<stderr>: self._traceback = tf_stack.extract_stack_for_node(self._c_op)
[1,1]<stderr>:
--------------------------------------------------------------------------
Primary job terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpirun detected that one or more processes exited with non-zero status, thus causing
the job to be terminated. The first process to do so was:
Process name: [[21147,1],0]
Exit code: 1
--------------------------------------------------------------------------
Here is the nvidia-smi output:
(base) [chazeon@exp-7-59 000.1]$ nvidia-smi
Thu Sep 2 12:54:49 2021
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 460.32.03 Driver Version: 460.32.03 CUDA Version: 11.2 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 Tesla V100-SXM2... On | 00000000:18:00.0 Off | 0 |
| N/A 32C P0 41W / 300W | 0MiB / 32510MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
| 1 Tesla V100-SXM2... On | 00000000:3B:00.0 Off | 0 |
| N/A 30C P0 41W / 300W | 0MiB / 32510MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
| No running processes found |
+-----------------------------------------------------------------------------+
This seems like the situation described on Horovod's troubleshooting page:
"If you notice that your program is running out of GPU memory and multiple processes are being placed on the same GPU, it's likely that your program (or its dependencies) creates a tf.Session that does not use the config that pins a specific GPU."
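For what it's worth, here is a minimal sketch of the per-process GPU pinning pattern that page describes, using horovod.tensorflow and the TF1-style session API (deepmd-kit 2.0.0 builds its graph through the v1 compatibility layer). This is only an illustration of the idea, not the deepmd-kit code path itself; the session that dp train actually creates would need to receive an equivalent config for the two ranks to land on different GPUs.

import horovod.tensorflow as hvd
import tensorflow.compat.v1 as tf

hvd.init()

config = tf.ConfigProto()
config.gpu_options.allow_growth = True
# Restrict this process to the GPU matching its local rank, so rank 0
# uses GPU 0 and rank 1 uses GPU 1, instead of both ranks allocating
# nearly all of GPU 0 (which is what the CUDA_ERROR_OUT_OF_MEMORY
# messages above suggest is happening).
config.gpu_options.visible_device_list = str(hvd.local_rank())

sess = tf.Session(config=config)

Another workaround at the launcher level, assuming an Open MPI backend that exports OMPI_COMM_WORLD_LOCAL_RANK, is to wrap dp in a small script that sets CUDA_VISIBLE_DEVICES to that local rank before dp starts, so each rank only ever sees one device.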