2 changes: 1 addition & 1 deletion .github/CODEOWNERS
@@ -1,3 +1,3 @@
* @bcumming @msimberg @RMeli
-docs/software/sciapps/cp2k @abussy @RMeli
+docs/software/sciapps/cp2k.md @abussy @RMeli
docs/software/communication @msimberg
93 changes: 69 additions & 24 deletions docs/software/sciapps/cp2k.md
@@ -63,22 +63,22 @@ MPS] daemon so that multiple MPI ranks can use the same GPU.
#!/bin/bash -l

#SBATCH --job-name=cp2k-job
-#SBATCH --time=00:30:00 # (1)
+#SBATCH --time=00:30:00 # (1)!
#SBATCH --nodes=4
#SBATCH --ntasks-per-core=1
-#SBATCH --ntasks-per-node=32 # (2)
-#SBATCH --cpus-per-task=8 # (3)
+#SBATCH --ntasks-per-node=32 # (2)!
+#SBATCH --cpus-per-task=8 # (3)!
#SBATCH --account=<ACCOUNT>
#SBATCH --hint=nomultithread
#SBATCH --hint=exclusive
#SBATCH --no-requeue
#SBATCH --uenv=<CP2K_UENV>
#SBATCH --view=cp2k

-export CUDA_CACHE_PATH="/dev/shm/$USER/cuda_cache" # (5)
-export MPICH_GPU_SUPPORT_ENABLED=1 # (6)
+export CUDA_CACHE_PATH="/dev/shm/$USER/cuda_cache" # (5)!
+export MPICH_GPU_SUPPORT_ENABLED=1 # (6)!
export MPICH_MALLOC_FALLBACK=1
-export OMP_NUM_THREADS=$((SLURM_CPUS_PER_TASK - 1)) # (4)
+export OMP_NUM_THREADS=$((SLURM_CPUS_PER_TASK - 1)) # (4)!

ulimit -s unlimited
srun --cpu-bind=socket ./mps-wrapper.sh cp2k.psmp -i <CP2K_INPUT> -o <CP2K_OUTPUT>
@@ -94,7 +94,7 @@ srun --cpu-bind=socket ./mps-wrapper.sh cp2k.psmp -i <CP2K_INPUT> -o <CP2K_OUTPUT>
for good performance. With [Intel MKL], this is not necessary and one can set `OMP_NUM_THREADS` to
`SLURM_CPUS_PER_TASK`.

-5. [DBCSR] relies on extensive JIT compilation and we store the cache in memory to avoid I/O overhead.
+5. [DBCSR] relies on extensive JIT compilation, and we store the cache in memory to avoid I/O overhead.
This is set by default on the HPC platform, but it's set here explicitly as it's essential to avoid performance degradation.

6. CP2K's dependencies use GPU-aware MPI, which requires enabling support at runtime.
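
The job script above launches CP2K through `./mps-wrapper.sh`, which must start the NVIDIA MPS control daemon once per node so that several MPI ranks can share one GPU. Below is a minimal sketch of such a wrapper for illustration only; the pipe/log directories and the startup delay are assumptions, and the wrapper recommended by the Alps documentation should be preferred:

```bash
#!/bin/bash
# mps-wrapper.sh -- illustrative sketch, not the official CSCS script.
# Start the MPS control daemon once per node, run the wrapped command,
# then shut the daemon down again.
export CUDA_MPS_PIPE_DIRECTORY=/tmp/nvidia-mps   # assumed location
export CUDA_MPS_LOG_DIRECTORY=/tmp/nvidia-log    # assumed location

# Only the first rank on each node starts the daemon.
if [[ "$SLURM_LOCALID" -eq 0 ]]; then
    nvidia-cuda-mps-control -d
fi
sleep 5  # crude wait for the daemon to come up (assumption)

"$@"  # run the actual command, e.g. cp2k.psmp -i <CP2K_INPUT> -o <CP2K_OUTPUT>

# The first rank on each node stops the daemon after the run.
if [[ "$SLURM_LOCALID" -eq 0 ]]; then
    echo quit | nvidia-cuda-mps-control
fi
```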
@@ -308,19 +308,19 @@ On Eiger, a similar sbatch script can be used:
```bash title="run_cp2k.sh"
#!/bin/bash -l
#SBATCH --job-name=cp2k-job
-#SBATCH --time=00:30:00 # (1)
+#SBATCH --time=00:30:00 # (1)!
#SBATCH --nodes=1
#SBATCH --ntasks-per-core=1
-#SBATCH --ntasks-per-node=32 # (2)
-#SBATCH --cpus-per-task=4 # (3)
+#SBATCH --ntasks-per-node=32 # (2)!
+#SBATCH --cpus-per-task=4 # (3)!
#SBATCH --account=<ACCOUNT>
#SBATCH --hint=nomultithread
#SBATCH --hint=exclusive
#SBATCH --constraint=mc
#SBATCH --uenv=<CP2K_UENV>
#SBATCH --view=cp2k

-export OMP_NUM_THREADS=$((SLURM_CPUS_PER_TASK - 1)) # (4)
+export OMP_NUM_THREADS=$((SLURM_CPUS_PER_TASK - 1)) # (4)!

ulimit -s unlimited
srun --cpu-bind=socket cp2k.psmp -i <CP2K_INPUT> -o <CP2K_OUTPUT>
@@ -336,8 +336,6 @@ srun --cpu-bind=socket cp2k.psmp -i <CP2K_INPUT> -o <CP2K_OUTPUT>
for good performance. With [Intel MKL], this is not necessary and one can set `OMP_NUM_THREADS` to
`SLURM_CPUS_PER_TASK`.

-5. [DBCSR] relies on extensive JIT compilation and we store the cache in memory to avoid I/O overhead
-
* Change `<ACCOUNT>` to your project account name
* Change `<CP2K_UENV>` to the name (or path) of the actual CP2K uenv you want to use
* Change `<PATH_TO_CP2K_DATA_DIR>` to the actual path to the CP2K data directory
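
Once these placeholders are filled in, the job can be submitted as usual, for example:

```bash
sbatch run_cp2k.sh
```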
@@ -355,19 +353,26 @@ srun --cpu-bind=socket cp2k.psmp -i <CP2K_INPUT> -o <CP2K_OUTPUT>

## Building CP2K from Source

!!! warning
    The following installation instructions are up-to-date with the latest version of CP2K provided by the uenv.
    That is, they work when manually compiling the CP2K source code corresponding to the CP2K version provided by the uenv.
    **They are not necessarily up-to-date with the latest version of CP2K available on the `master` branch.**

    If you are trying to build CP2K from source, make sure you understand what is different in `master`
    compared to the latest version of CP2K provided by the uenv.

The [CP2K] uenv provides all the dependencies required to build [CP2K] from source, with several optional features
enabled. You can follow these steps to build [CP2K] from source:

```bash
-uenv start --view=develop <CP2K_UENV> # (1)
+uenv start --view=develop <CP2K_UENV> # (1)!

-cd <PATH_TO_CP2K_SOURCE> # (2)
+cd <PATH_TO_CP2K_SOURCE> # (2)!

mkdir build && cd build
CC=mpicc CXX=mpic++ FC=mpifort cmake \
-GNinja \
--DCMAKE_CUDA_HOST_COMPILER=mpicc \ # (3)
+-DCMAKE_CUDA_HOST_COMPILER=mpicc \ # (3)!
-DCP2K_USE_LIBXC=ON \
-DCP2K_USE_LIBINT2=ON \
-DCP2K_USE_SPGLIB=ON \
@@ -378,7 +383,7 @@ CC=mpicc CXX=mpic++ FC=mpifort cmake \
-DCP2K_USE_PLUMED=ON \
-DCP2K_USE_DFTD4=ON \
-DCP2K_USE_DLAF=ON \
--DCP2K_USE_ACCEL=CUDA -DCP2K_WITH_GPU=H100 \ # (4)
+-DCP2K_USE_ACCEL=CUDA -DCP2K_WITH_GPU=H100 \ # (4)!
..

ninja -j 32
@@ -406,10 +411,50 @@ ninja -j 32

See [manual.cp2k.org/CMake] for more details.
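
After a successful build, the resulting binary can be used in place of the uenv-provided `cp2k.psmp` in the sbatch scripts above. A sketch, assuming the CMake setup above places the binary in `build/bin` (verify the location in your build tree):

```bash
srun --cpu-bind=socket ./mps-wrapper.sh <PATH_TO_CP2K_SOURCE>/build/bin/cp2k.psmp -i <CP2K_INPUT> -o <CP2K_OUTPUT>
```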

-### Known issues
+## Known issues

### DLA-Future

The `cp2k/2025.1` uenv provides CP2K with [DLA-Future] support enabled.
The DLA-Future library is initialized even if you don't [explicitly ask to use it](https://manual.cp2k.org/trunk/technologies/eigensolvers/dlaf.html).
This can lead to some surprising warnings and failures described below.
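
For reference, DLA-Future is explicitly requested through the CP2K input file. A minimal sketch, assuming the `PREFERRED_DIAG_LIBRARY` keyword described in the manual page linked above:

```bash
&GLOBAL
  PREFERRED_DIAG_LIBRARY DLAF ! assumed keyword; see the DLA-Future eigensolver manual page
&END GLOBAL
```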

#### `CUSOLVER_STATUS_INTERNAL_ERROR` during initialization

If you are heavily over-subscribing the GPU by running multiple ranks per GPU, you may encounter the following error:

```
created exception: cuSOLVER function returned error code 7 (CUSOLVER_STATUS_INTERNAL_ERROR): pika(bad_function_call)
terminate called after throwing an instance of 'pika::cuda::experimental::cusolver_exception'
what(): cuSOLVER function returned error code 7 (CUSOLVER_STATUS_INTERNAL_ERROR): pika(bad_function_call)
```

The reason is that too many cuSOLVER handles are created.
If you don't need DLA-Future, you can reduce the number of BLAS and LAPACK handles to 1 by setting the following environment variables:

```bash
export DLAF_NUM_GPU_BLAS_HANDLES=1
export DLAF_NUM_GPU_LAPACK_HANDLES=1
```

#### Warning about pika only using one worker thread

When running CP2K with multiple tasks per node and only one core per task, the initialization of DLA-Future may trigger the following warning:

```
The pika runtime will be started with only one worker thread because the
process mask has restricted the available resources to only one thread. If
this is unintentional make sure the process mask contains the resources
you need or use --pika:ignore-process-mask to use all resources. Use
--pika:print-bind to print the thread bindings used by pika.
```

This warning is triggered because [pika](https://pikacpp.org), the runtime used by DLA-Future,
is typically meant to run with more than one worker thread, so a single-thread configuration usually indicates a mistake.
However, if you are not using DLA-Future, the warning is harmless and can be ignored.
The warning cannot be silenced.

-#### DBCSR GPU scaling
+### DBCSR GPU scaling

On the GH200 architecture, it has been observed that the GPU-accelerated version of [DBCSR] does not perform optimally in some cases.
For example, in the `QS/H2O-1024` benchmark above, CP2K does not scale well beyond 2 nodes.
@@ -420,21 +465,21 @@ GPU acceleration on/off with an environment variable:
export DBCSR_RUN_ON_GPU=0
```

-While GPU acceleration is very good on a small number of nodes, the CPU implementation scales better.
-Therefore, for CP2K jobs running on a large number of nodes, it is worth investigating the use of the `DBCSR_RUN_ON_GPU`
+While GPU acceleration is very good on few nodes, the CPU implementation scales better.
+Therefore, for CP2K jobs running on many nodes, it is worth investigating the use of the `DBCSR_RUN_ON_GPU`
environment variable.

-Ssome niche application cases such as the `QS_low_scaling_postHF` benchmarks only run efficiently with the CPU version
+Some niche application cases such as the `QS_low_scaling_postHF` benchmarks only run efficiently with the CPU version
of DBCSR. Generally, if the function `dbcsr_multiply_generic` takes a significant portion of the timing report
(at the end of the CP2K output file), it is worth investigating the effect of the `DBCSR_RUN_ON_GPU` environment variable.
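
A quick way to check is to search the timing report in the output file, for example:

```bash
grep dbcsr_multiply_generic <CP2K_OUTPUT>
```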


-#### CUDA grid backend with high angular momenta basis sets
+### CUDA grid backend with high angular momenta basis sets

The CP2K grid CUDA backend is currently buggy on Alps. Using basis sets with high angular momenta ($l \ge 3$)
results in slow calculations, especially for force calculations with meta-GGA functionals.

-As a workaround, you can you can disable CUDA acceleration fo the grid backend:
+As a workaround, you can disable CUDA acceleration for the grid backend:

```bash
&GLOBAL
  &GRID
    BACKEND CPU ! assumed completion; the snippet was truncated in the diff view
  &END GRID
&END GLOBAL
```