Hi,
I'm having trouble running the GPU version of Tinker-HP with Deep-HP enabled on an HPC cluster. I prepared the system according to the instructions: I load the NVIDIA HPC SDK through its modulefile and the GNU compilers from a modulefile provided by the HPC administrators, and I load a conda environment created from the tinkerml.torch.yaml file, modified for the chosen CUDA version. The build through the installation script succeeds with the following settings:
target_arch='gpu' , c_c=80
cuda_ver=11.0
FPA=1 [left as default]
build_plumed=0 [left as default]
build_colvars=0 [left as default]
NN=1
The build completes without error. ‘Normal’ tasks, such as dynamic or analyze, run without a problem. However, when I try to run the ML potential tasks from the ‘examples’ directory, the run fails at library load with this error:
Exception: Fail to load modules with exception: /usr/lib/gcc/x86_64-redhat-linux/8/../../../../lib64/libstdc++.so.6: version `GLIBCXX_3.4.30' not found (required by /users/kdm/adrmalin/.conda/envs/DeepHP-torch/lib/python3.10/site-packages/torch/../../../libtorch_python.so)
It looks like the binaries resolve libstdc++.so.6 from the default /usr/lib64/ location, which is outdated and lacks the required GLIBCXX version, instead of using the library provided by the HPC compiler module. The GNUROOT variable is correctly identified by the install.sh script, though, and as far as I was able to check, all variables inside the makefile are defined correctly by the install script.
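To narrow this down, it can help to compare which GLIBCXX version tags each candidate libstdc++.so.6 actually provides. The following is a minimal sketch, not part of Tinker-HP: it scans the raw bytes of a library file for `GLIBCXX_x.y.z` strings, much like `strings libstdc++.so.6 | grep GLIBCXX` would. The function names are hypothetical, and the paths passed on the command line are whatever your system and compiler module provide.

```python
import re
import sys
from pathlib import Path

# Version tags are stored in the library as plain ASCII, e.g. b"GLIBCXX_3.4.30".
GLIBCXX_RE = re.compile(rb"GLIBCXX_(\d+(?:\.\d+)*)")


def glibcxx_versions(blob: bytes) -> list[tuple[int, ...]]:
    """Return the sorted, de-duplicated GLIBCXX versions found in raw library bytes."""
    found = {
        tuple(int(part) for part in match.group(1).split(b"."))
        for match in GLIBCXX_RE.finditer(blob)
    }
    return sorted(found)


def provides(blob: bytes, required: str) -> bool:
    """True if the library bytes contain the tag GLIBCXX_<required>."""
    wanted = tuple(int(part) for part in required.split("."))
    return wanted in glibcxx_versions(blob)


if __name__ == "__main__":
    # Usage (paths are examples only):
    #   python check_glibcxx.py /usr/lib64/libstdc++.so.6 <module GNU root>/lib64/libstdc++.so.6
    for path in sys.argv[1:]:
        blob = Path(path).read_bytes()
        versions = glibcxx_versions(blob)
        newest = ".".join(map(str, versions[-1])) if versions else "none found"
        print(f"{path}: newest GLIBCXX = {newest}, has 3.4.30 = {provides(blob, '3.4.30')}")
```

If the system library lacks 3.4.30 while the one under the compiler module or the conda environment has it, the problem is the runtime library search order rather than the build itself.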
In my first attempt I built with HPC-SDK 22.7 / CUDA 11.7 / GNU 12.2.0, since this combination of SDK and CUDA version is mentioned as working correctly. I then downgraded to HPC-SDK 22.2 / CUDA 11.0 / GNU 9.3, with the same result. In each case I modified the conda .yaml file to match the CUDA version.
The HPC cluster runs Rocky Linux 8.10, and the GPU nodes contain NVIDIA A100 GPUs.
I also encountered a minor issue with the examples provided in the GitHub package: the Deep-HP_example1 file failed to run with this error:
Error in dispersion neigbor list: max cutoff + buffer should be less than half one edge of the box
dispersion cutoff = 9.000
After adding the keyword disp-cutoff 7, the run instead fails with the GLIBCXX error described above.
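The box-size error above states a simple condition: cutoff plus neighbor-list buffer must be less than half of a box edge. A minimal sketch of that check follows. The buffer value and box edge lengths used in the comments are hypothetical, since the actual values for Deep-HP_example1 are not given here, and taking the shortest edge is an assumption about how the condition is applied.

```python
def neighbor_list_fits(cutoff: float, buffer: float, box_edges: tuple[float, ...]) -> bool:
    """Check cutoff + buffer < half of the shortest box edge (all lengths in Angstrom)."""
    return cutoff + buffer < 0.5 * min(box_edges)


def largest_allowed_cutoff(buffer: float, box_edges: tuple[float, ...]) -> float:
    """Upper bound (exclusive) on the cutoff for a given buffer and box."""
    return 0.5 * min(box_edges) - buffer


if __name__ == "__main__":
    # Hypothetical numbers for illustration only: a 20 A cubic box and a 2 A buffer.
    box = (20.0, 20.0, 20.0)
    buffer = 2.0
    for cutoff in (9.0, 7.0):
        print(cutoff, neighbor_list_fits(cutoff, buffer, box))
    print("cutoff must stay below", largest_allowed_cutoff(buffer, box))
```

With these illustrative numbers a 9 Å cutoff is rejected and a 7 Å cutoff is accepted, which mirrors the behaviour reported here.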
Is there any way to make this work?
EDIT on 01042125:
OK, I made it work by building an Apptainer/Docker container based on Ubuntu 24.04, whose system libraries natively include the necessary GLIBCXX version strings. ML potential tasks launch and finish properly, as far as I can tell.
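To confirm which libstdc++ a Python process has actually loaded, on the host or inside the container, one option is to read the process's memory map on Linux. This is a sketch under the assumption that `/proc/self/maps` is available; the helper name is hypothetical, and `import torch` should be done before the check in the environment being tested.

```python
from pathlib import Path


def loaded_libs(maps_text: str, needle: str = "libstdc++") -> list[str]:
    """Return unique mapped file paths containing `needle`, given /proc/<pid>/maps text."""
    paths: list[str] = []
    for line in maps_text.splitlines():
        # Format: address perms offset dev inode [pathname]
        fields = line.split(None, 5)
        if len(fields) < 6:
            continue  # anonymous mapping, no file path
        path = fields[5].strip()
        if needle in path and path not in paths:
            paths.append(path)
    return paths


if __name__ == "__main__":
    # import torch  # uncomment in the environment under test so its libraries get loaded
    maps = Path("/proc/self/maps")
    if maps.exists():
        print(loaded_libs(maps.read_text()))
    else:
        print("/proc/self/maps not available on this system")
```

A path under /usr/lib64 on the host points to the system library being picked up; a path under the conda environment or the compiler module means the newer library is in use.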