Failure to run ML-enabled example runs

Hi,
I'm facing issues with running tinker-hp GPU with DeepHP enabled on HPC cluster. I prepare system according to instructions - I load Nvidia HPC package through its modulefile and GNU compilers from modulefile provided by HPC administrators. I load conda environment from tinkerml.torch.yaml file modified for chosen CUDA version. Build through installation script is successful with settings as follows:

```
target_arch='gpu' , c_c=80
cuda_ver=11.0
FPA=1 [left as default]
build_plumed=0 [left as default]
build_colvars=0 [left as default]
NN=1
```

Build completes without error. ‘Normal’ tasks, such as dynamic or analyze, run without a problem. When I try to run ML potential tasks from the ‘examples’ directory though, run fails at library load with error:

```
Exception: Fail to load modules with exception: /usr/lib/gcc/x86_64-redhat-linux/8/../../../../lib64/libstdc++.so.6: version `GLIBCXX_3.4.30' not found (required by /users/kdm/adrmalin/.conda/envs/DeepHP-torch/lib/python3.10/site-packages/torch/../../../libtorch_python.so)
```

It looks like binaries link to libstdc++.so library in default /usr/lib64/ location, which is outdated and thus causing error, instead of library provided by HPC compiler module. GNUROOT variable is correctly identified by install.sh script though and as far as I was able to check it – all variables inside makefile are being defined correctly by install script.
I have tried building with HPC-SDK 22.7 / CUDA 11.7 / GNU 12.2.0 in a first attempt, as this combination of SDK and CUDA version is mentioned as  working correctly, then I tried to downgrade to HPC-SDK 22.2 / CUDA 11.0 / GNU 9.3 with the same results. I modified conda .yaml file accordingly with change in CUDA version.
HPC cluster runs on Rocky Linux 8.10 and GPU nodes contain Nvidia A100 GPUs.
Also I’ve encountered a minor issue with examples provided in github package – Deep-HP_example1 file failed to run with error:

```
Error in dispersion neigbor list: max cutoff + buffer should be less than half one edge of the box
  dispersion cutoff =          9.000
```

After adding keyword _disp-cutoff  7_ run fails with abovementioned GLIBCXX error. 
Is there any way to make this work?

EDIT on 01042125:
Ok, I made it work by building apptainer docker based on Ubuntu 24.04, which natively contains necessary GLIBCXX strings in system libraries. ML potential tasks launch and finish properly, I think.


Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Failure to run ML-enabled example runs #26

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Failure to run ML-enabled example runs #26

Description

Metadata

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Issue actions