-
It seems to me you got an out-of-memory error, so the GPU may not actually be used for the computation.
By the way, it looks strange that your two cards report different available memory.
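One way to see why the allocation fails, assuming nvidia-smi is available on the node (the device index below is only an example):
# check how much memory is free on each card and which processes occupy them
nvidia-smi
# if one card is already occupied by another job, restrict the run to the free one
export CUDA_VISIBLE_DEVICES=1
# optionally let TensorFlow grow its GPU memory on demand instead of pre-allocating
export TF_FORCE_GPU_ALLOW_GROWTH=true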
-
Dear developers,
I have compiled deepmd-kit and LAMMPS using the commands below, but the molecular dynamics (MD) speed is only about 25% of what I get with a direct conda installation. Because I use a modified PLUMED, I had to compile everything myself. I would appreciate your help in identifying and addressing the underlying issue.
Installation commands:
conda create -n cuda11
conda activate cuda11
conda install python==3.11.5
conda install cuda-nvcc
pip install --upgrade pip
pip install nvidia-cudnn-cu11==8.6.0.163 protobuf==4.23.4 tensorflow==2.13.*
#open a new terminal
CUDNN_PATH=$(dirname $(python -c "import nvidia.cudnn;print(nvidia.cudnn.__file__)"))
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/home/usergpu/soft/anaconda/install/envs/cuda11/lib/:$CUDNN_PATH/lib
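To verify that the pip-installed TensorFlow actually finds the CUDA/cuDNN libraries set up above, a quick sanity check using the standard tf.config API (nothing DeePMD-specific is assumed here):
# should list both A800 cards if the GPU build of TensorFlow is working
python -c "import tensorflow as tf; print(tf.config.list_physical_devices('GPU'))"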
## deepmd-kit
tar
cd source
mkdir build
cd build
export PATH=/home/usergpu/soft/cmake/cmake-3.30.0-rc2-linux-x86_64/bin:$PATH
cmake -DUSE_TF_PYTHON_LIBS=TRUE -DCMAKE_INSTALL_PREFIX=/home/usergpu/soft/deepmd-kit/install/ -DTENSORFLOW_ROOT=/home/usergpu/soft/anaconda/install/envs/cuda11/lib/python3.11/site-packages/tensorflow/ ..
make -j12
make install -j12
make lammps
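Note that the cmake line above does not explicitly request the CUDA customized OPs. The DeePMD-kit install-from-source documentation describes a CUDA toolkit option for this; a hedged variant of the same command (the USE_CUDA_TOOLKIT flag is taken from those docs and should be checked against the 2.2.7 documentation, paths unchanged):
# same cmake call, additionally enabling the CUDA customized OPs
cmake -DUSE_TF_PYTHON_LIBS=TRUE -DUSE_CUDA_TOOLKIT=TRUE -DCMAKE_INSTALL_PREFIX=/home/usergpu/soft/deepmd-kit/install/ -DTENSORFLOW_ROOT=/home/usergpu/soft/anaconda/install/envs/cuda11/lib/python3.11/site-packages/tensorflow/ ..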
## lammps
cd lammps-stable_2Aug2023_update2/
cd src/
cp -r /home/usergpu/soft/deepmd-kit/deepmd-kit-2.2.7/source/build/USER-DEEPMD/ .
make yes-kspace
make yes-extra-fix
make yes-user-deepmd
source /home/usergpu/soft/plumed-2.8.1/sourceme.sh
make lib-plumed args='-p /home/usergpu/xyliu/soft/plumed-2.8.1/bilud/ -m shared'
make yes-user-deepmd
make mpi -j 12
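For completeness, the DeePMD-kit warnings in the output below ask for the thread-related environment variables to be tuned; a minimal job-script fragment with placeholder values (see https://deepmd.rtfd.io/parallelism/ for how to choose them):
# placeholder thread settings referenced by the warnings below
export OMP_NUM_THREADS=12
export TF_INTRA_OP_PARALLELISM_THREADS=12
export TF_INTER_OP_PARALLELISM_THREADS=1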
Screen output when I submit a task:
2024-06-18 20:02:33.307208: I tensorflow/core/util/port.cc:110] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable `TF_ENABLE_ONEDNN_OPTS=0`.
2024-06-18 20:02:33.344105: I tensorflow/core/util/port.cc:110] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable `TF_ENABLE_ONEDNN_OPTS=0`.
.DeePMD-kit WARNING: Environmental variable TF_INTRA_OP_PARALLELISM_THREADS is not set. Tune TF_INTRA_OP_PARALLELISM_THREADS for the best performance. See https://deepmd.rtfd.io/parallelism/ for more information.
DeePMD-kit WARNING: Environmental variable TF_INTER_OP_PARALLELISM_THREADS is not set. Tune TF_INTER_OP_PARALLELISM_THREADS for the best performance. See https://deepmd.rtfd.io/parallelism/ for more information.
DeePMD-kit WARNING: Environmental variable OMP_NUM_THREADS is not set. Tune OMP_NUM_THREADS for the best performance. See https://deepmd.rtfd.io/parallelism/ for more information.
DeePMD-kit WARNING: Environmental variable TF_INTRA_OP_PARALLELISM_THREADS is not set. Tune TF_INTRA_OP_PARALLELISM_THREADS for the best performance. See https://deepmd.rtfd.io/parallelism/ for more information.
DeePMD-kit WARNING: Environmental variable TF_INTER_OP_PARALLELISM_THREADS is not set. Tune TF_INTER_OP_PARALLELISM_THREADS for the best performance. See https://deepmd.rtfd.io/parallelism/ for more information.
DeePMD-kit WARNING: Environmental variable OMP_NUM_THREADS is not set. Tune OMP_NUM_THREADS for the best performance. See https://deepmd.rtfd.io/parallelism/ for more information.
2024-06-18 20:02:34.312498: I tensorflow/core/platform/cpu_feature_guard.cc:182] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 AVX512F AVX512_VNNI FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
DeePMD-kit WARNING: Environmental variable TF_INTRA_OP_PARALLELISM_THREADS is not set. Tune TF_INTRA_OP_PARALLELISM_THREADS for the best performance. See https://deepmd.rtfd.io/parallelism/ for more information.
DeePMD-kit WARNING: Environmental variable TF_INTER_OP_PARALLELISM_THREADS is not set. Tune TF_INTER_OP_PARALLELISM_THREADS for the best performance. See https://deepmd.rtfd.io/parallelism/ for more information.
DeePMD-kit WARNING: Environmental variable OMP_NUM_THREADS is not set. Tune OMP_NUM_THREADS for the best performance. See https://deepmd.rtfd.io/parallelism/ for more information.
2024-06-18 20:02:34.333658: I tensorflow/core/platform/cpu_feature_guard.cc:182] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 AVX512F AVX512_VNNI FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
2024-06-18 20:02:35.269270: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1639] Created device /job:localhost/replica:0/task:0/device:GPU:0 with 36881 MB memory: -> device: 0, name: NVIDIA A800 80GB PCIe, pci bus id: 0000:18:00.0, compute capability: 8.0
2024-06-18 20:02:35.276366: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1639] Created device /job:localhost/replica:0/task:0/device:GPU:1 with 77825 MB memory: -> device: 1, name: NVIDIA A800 80GB PCIe, pci bus id: 0000:86:00.0, compute capability: 8.0
2024-06-18 20:02:35.289328: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1639] Created device /job:localhost/replica:0/task:0/device:GPU:0 with 36881 MB memory: -> device: 0, name: NVIDIA A800 80GB PCIe, pci bus id: 0000:18:00.0, compute capability: 8.0
2024-06-18 20:02:35.291170: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1639] Created device /job:localhost/replica:0/task:0/device:GPU:1 with 77825 MB memory: -> device: 1, name: NVIDIA A800 80GB PCIe, pci bus id: 0000:86:00.0, compute capability: 8.0
2024-06-18 20:02:35.325452: I tensorflow/compiler/mlir/mlir_graph_optimization_pass.cc:375] MLIR V1 optimization pass is not enabled
2024-06-18 20:02:35.358886: I tensorflow/compiler/mlir/mlir_graph_optimization_pass.cc:375] MLIR V1 optimization pass is not enabled
2024-06-18 20:02:35.508553: I tensorflow/compiler/xla/stream_executor/cuda/cuda_driver.cc:753] failed to allocate 36.02GiB (38673055744 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY: out of memory
2024-06-18 20:02:35.512437: I tensorflow/compiler/xla/stream_executor/cuda/cuda_driver.cc:753] failed to allocate 32.42GiB (34805747712 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY: out of memory
2024-06-18 20:02:35.516274: I tensorflow/compiler/xla/stream_executor/cuda/cuda_driver.cc:753] failed to allocate 29.17GiB (31325171712 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY: out of memory
2024-06-18 20:02:35.520040: I tensorflow/compiler/xla/stream_executor/cuda/cuda_driver.cc:753] failed to allocate 26.26GiB (28192653312 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY: out of memory
2024-06-18 20:02:35.524024: I tensorflow/compiler/xla/stream_executor/cuda/cuda_driver.cc:753] failed to allocate 23.63GiB (25373386752 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY: out of memory
2024-06-18 20:02:35.528826: I tensorflow/compiler/xla/stream_executor/cuda/cuda_driver.cc:753] failed to allocate 21.27GiB (22836047872 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY: out of memory
2024-06-18 20:02:35.534870: I tensorflow/compiler/xla/stream_executor/cuda/cuda_driver.cc:753] failed to allocate 19.14GiB (20552441856 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY: out of memory
2024-06-18 20:02:35.540422: I tensorflow/compiler/xla/stream_executor/cuda/cuda_driver.cc:753] failed to allocate 17.23GiB (18497198080 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY: out of memory
2024-06-18 20:02:35.544968: I tensorflow/compiler/xla/stream_executor/cuda/cuda_driver.cc:753] failed to allocate 15.50GiB (16647478272 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY: out of memory
2024-06-18 20:02:35.548758: I tensorflow/compiler/xla/stream_executor/cuda/cuda_driver.cc:753] failed to allocate 13.95GiB (14982729728 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY: out of memory
2024-06-18 20:02:35.552603: I tensorflow/compiler/xla/stream_executor/cuda/cuda_driver.cc:753] failed to allocate 12.56GiB (13484455936 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY: out of memory
2024-06-18 20:02:35.556746: I tensorflow/compiler/xla/stream_executor/cuda/cuda_driver.cc:753] failed to allocate 11.30GiB (12136009728 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY: out of memory
2024-06-18 20:02:35.562263: I tensorflow/compiler/xla/stream_executor/cuda/cuda_driver.cc:753] failed to allocate 10.17GiB (10922408960 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY: out of memory
2024-06-18 20:02:35.567612: I tensorflow/compiler/xla/stream_executor/cuda/cuda_driver.cc:753] failed to allocate 9.15GiB (9830167552 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY: out of memory
2024-06-18 20:02:35.571641: I tensorflow/compiler/xla/stream_executor/cuda/cuda_driver.cc:753] failed to allocate 8.24GiB (8847150080 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY: out of memory
2024-06-18 20:02:35.576848: I tensorflow/compiler/xla/stream_executor/cuda/cuda_driver.cc:753] failed to allocate 7.42GiB (7962435072 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY: out of memory
2024-06-18 20:02:35.580606: I tensorflow/compiler/xla/stream_executor/cuda/cuda_driver.cc:753] failed to allocate 6.67GiB (7166191616 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY: out of memory
2024-06-18 20:02:35.584758: I tensorflow/compiler/xla/stream_executor/cuda/cuda_driver.cc:753] failed to allocate 6.01GiB (6449572352 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY: out of memory
2024-06-18 20:02:35.588573: I tensorflow/compiler/xla/stream_executor/cuda/cuda_driver.cc:753] failed to allocate 5.41GiB (5804615168 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY: out of memory
2024-06-18 20:02:35.594112: I tensorflow/compiler/xla/stream_executor/cuda/cuda_driver.cc:753] failed to allocate 4.87GiB (5224153600 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY: out of memory
2024-06-18 20:02:35.599475: I tensorflow/compiler/xla/stream_executor/cuda/cuda_driver.cc:753] failed to allocate 4.38GiB (4701737984 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY: out of memory
2024-06-18 20:02:35.603644: I tensorflow/compiler/xla/stream_executor/cuda/cuda_driver.cc:753] failed to allocate 3.94GiB (4231564032 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY: out of memory
2024-06-18 20:02:35.607448: I tensorflow/compiler/xla/stream_executor/cuda/cuda_driver.cc:753] failed to allocate 3.55GiB (3808407552 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY: out of memory
2024-06-18 20:02:35.611319: I tensorflow/compiler/xla/stream_executor/cuda/cuda_driver.cc:753] failed to allocate 3.19GiB (3427566592 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY: out of memory
2024-06-18 20:02:35.615339: I tensorflow/compiler/xla/stream_executor/cuda/cuda_driver.cc:753] failed to allocate 2.87GiB (3084809728 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY: out of memory
2024-06-18 20:02:35.620897: I tensorflow/compiler/xla/stream_executor/cuda/cuda_driver.cc:753] failed to allocate 2.58GiB (2776328704 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY: out of memory
2024-06-18 20:02:35.626355: I tensorflow/compiler/xla/stream_executor/cuda/cuda_driver.cc:753] failed to allocate 2.33GiB (2498695680 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY: out of memory
2024-06-18 20:02:35.631273: I tensorflow/compiler/xla/stream_executor/cuda/cuda_driver.cc:753] failed to allocate 2.09GiB (2248826112 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY: out of memory
2024-06-18 20:02:35.635073: I tensorflow/compiler/xla/stream_executor/cuda/cuda_driver.cc:753] failed to allocate 1.88GiB (2023943424 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY: out of memory
2024-06-18 20:02:35.638871: I tensorflow/compiler/xla/stream_executor/cuda/cuda_driver.cc:753] failed to allocate 1.70GiB (1821549056 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY: out of memory
2024-06-18 20:02:35.643169: I tensorflow/compiler/xla/stream_executor/cuda/cuda_driver.cc:753] failed to allocate 1.53GiB (1639394048 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY: out of memory
DeePMD-kit WARNING: Environmental variable TF_INTRA_OP_PARALLELISM_THREADS is not set. Tune TF_INTRA_OP_PARALLELISM_THREADS for the best performance. See https://deepmd.rtfd.io/parallelism/ for more information.
DeePMD-kit WARNING: Environmental variable TF_INTER_OP_PARALLELISM_THREADS is not set. Tune TF_INTER_OP_PARALLELISM_THREADS for the best performance. See https://deepmd.rtfd.io/parallelism/ for more information.
DeePMD-kit WARNING: Environmental variable OMP_NUM_THREADS is not set. Tune OMP_NUM_THREADS for the best performance. See https://deepmd.rtfd.io/parallelism/ for more information.
DeePMD-kit WARNING: Environmental variable TF_INTRA_OP_PARALLELISM_THREADS is not set. Tune TF_INTRA_OP_PARALLELISM_THREADS for the best performance. See https://deepmd.rtfd.io/parallelism/ for more information.
DeePMD-kit WARNING: Environmental variable TF_INTER_OP_PARALLELISM_THREADS is not set. Tune TF_INTER_OP_PARALLELISM_THREADS for the best performance. See https://deepmd.rtfd.io/parallelism/ for more information.
DeePMD-kit WARNING: Environmental variable OMP_NUM_THREADS is not set. Tune OMP_NUM_THREADS for the best performance. See https://deepmd.rtfd.io/parallelism/ for more information.
2024-06-18 20:02:36.492733: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1639] Created device /job:localhost/replica:0/task:0/device:GPU:0 with 36881 MB memory: -> device: 0, name: NVIDIA A800 80GB PCIe, pci bus id: 0000:18:00.0, compute capability: 8.0
2024-06-18 20:02:36.494282: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1639] Created device /job:localhost/replica:0/task:0/device:GPU:1 with 77825 MB memory: -> device: 1, name: NVIDIA A800 80GB PCIe, pci bus id: 0000:86:00.0, compute capability: 8.0
2024-06-18 20:02:36.497800: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1639] Created device /job:localhost/replica:0/task:0/device:GPU:0 with 36881 MB memory: -> device: 0, name: NVIDIA A800 80GB PCIe, pci bus id: 0000:18:00.0, compute capability: 8.0
2024-06-18 20:02:36.499288: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1639] Created device /job:localhost/replica:0/task:0/device:GPU:1 with 77825 MB memory: -> device: 1, name: NVIDIA A800 80GB PCIe, pci bus id: 0000:86:00.0, compute capability: 8.0
2024-06-18 20:02:36.532222: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1639] Created device /job:localhost/replica:0/task:0/device:GPU:0 with 36881 MB memory: -> device: 0, name: NVIDIA A800 80GB PCIe, pci bus id: 0000:18:00.0, compute capability: 8.0
2024-06-18 20:02:36.549077: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1639] Created device /job:localhost/replica:0/task:0/device:GPU:1 with 77825 MB memory: -> device: 1, name: NVIDIA A800 80GB PCIe, pci bus id: 0000:86:00.0, compute capability: 8.0
2024-06-18 20:02:36.558621: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1639] Created device /job:localhost/replica:0/task:0/device:GPU:0 with 36881 MB memory: -> device: 0, name: NVIDIA A800 80GB PCIe, pci bus id: 0000:18:00.0, compute capability: 8.0
2024-06-18 20:02:36.560058: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1639] Created device /job:localhost/replica:0/task:0/device:GPU:1 with 77825 MB memory: -> device: 1, name: NVIDIA A800 80GB PCIe, pci bus id: 0000:86:00.0, compute capability: 8.0
log file