2 changes: 1 addition & 1 deletion .github/CODEOWNERS
@@ -1,3 +1,3 @@
* @bcumming @msimberg @RMeli
-docs/software/sciapps/cp2k @abussy @RMeli
+docs/software/sciapps/cp2k.md @abussy @RMeli
docs/software/communication @msimberg
93 changes: 69 additions & 24 deletions docs/software/sciapps/cp2k.md
@@ -63,22 +63,22 @@ MPS] daemon so that multiple MPI ranks can use the same GPU.
#!/bin/bash -l

#SBATCH --job-name=cp2k-job
-#SBATCH --time=00:30:00 # (1)
+#SBATCH --time=00:30:00 # (1)!
#SBATCH --nodes=4
#SBATCH --ntasks-per-core=1
-#SBATCH --ntasks-per-node=32 # (2)
-#SBATCH --cpus-per-task=8 # (3)
+#SBATCH --ntasks-per-node=32 # (2)!
+#SBATCH --cpus-per-task=8 # (3)!
#SBATCH --account=<ACCOUNT>
#SBATCH --hint=nomultithread
#SBATCH --hint=exclusive
#SBATCH --no-requeue
#SBATCH --uenv=<CP2K_UENV>
#SBATCH --view=cp2k

-export CUDA_CACHE_PATH="/dev/shm/$USER/cuda_cache" # (5)
-export MPICH_GPU_SUPPORT_ENABLED=1 # (6)
+export CUDA_CACHE_PATH="/dev/shm/$USER/cuda_cache" # (5)!
+export MPICH_GPU_SUPPORT_ENABLED=1 # (6)!
export MPICH_MALLOC_FALLBACK=1
-export OMP_NUM_THREADS=$((SLURM_CPUS_PER_TASK - 1)) # (4)
+export OMP_NUM_THREADS=$((SLURM_CPUS_PER_TASK - 1)) # (4)!

ulimit -s unlimited
srun --cpu-bind=socket ./mps-wrapper.sh cp2k.psmp -i <CP2K_INPUT> -o <CP2K_OUTPUT>
@@ -94,7 +94,7 @@ srun --cpu-bind=socket ./mps-wrapper.sh cp2k.psmp -i <CP2K_INPUT> -o <CP2K_OUTPUT>
for good performance. With [Intel MKL], this is not necessary and one can set `OMP_NUM_THREADS` to
`SLURM_CPUS_PER_TASK`.

-5. [DBCSR] relies on extensive JIT compilation and we store the cache in memory to avoid I/O overhead.
+5. [DBCSR] relies on extensive JIT compilation, and we store the cache in memory to avoid I/O overhead.
This is set by default on the HPC platform, but it's set here explicitly as it's essential to avoid performance degradation.

6. CP2K's dependencies use GPU-aware MPI, which requires enabling support at runtime.
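
The job script above launches CP2K through `./mps-wrapper.sh`, which must start the NVIDIA MPS control daemon once per node so that several MPI ranks can share one GPU. Below is a minimal sketch of such a wrapper for illustration only; the pipe/log directories and the startup delay are assumptions, and the wrapper recommended by the Alps documentation should be preferred:

```bash
#!/bin/bash
# mps-wrapper.sh -- illustrative sketch, not the official CSCS script.
# Start the MPS control daemon once per node, run the wrapped command,
# then shut the daemon down again.
export CUDA_MPS_PIPE_DIRECTORY=/tmp/nvidia-mps   # assumed location
export CUDA_MPS_LOG_DIRECTORY=/tmp/nvidia-log    # assumed location

# Only the first rank on each node starts the daemon.
if [[ "$SLURM_LOCALID" -eq 0 ]]; then
    nvidia-cuda-mps-control -d
fi
sleep 5  # crude wait for the daemon to come up (assumption)

"$@"  # run the actual command, e.g. cp2k.psmp -i <CP2K_INPUT> -o <CP2K_OUTPUT>

# The first rank on each node stops the daemon after the run.
if [[ "$SLURM_LOCALID" -eq 0 ]]; then
    echo quit | nvidia-cuda-mps-control
fi
```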
@@ -308,19 +308,19 @@ On Eiger, a similar sbatch script can be used:
```bash title="run_cp2k.sh"
#!/bin/bash -l
#SBATCH --job-name=cp2k-job
-#SBATCH --time=00:30:00 # (1)
+#SBATCH --time=00:30:00 # (1)!
#SBATCH --nodes=1
#SBATCH --ntasks-per-core=1
-#SBATCH --ntasks-per-node=32 # (2)
-#SBATCH --cpus-per-task=4 # (3)
+#SBATCH --ntasks-per-node=32 # (2)!
+#SBATCH --cpus-per-task=4 # (3)!
#SBATCH --account=<ACCOUNT>
#SBATCH --hint=nomultithread
#SBATCH --hint=exclusive
#SBATCH --constraint=mc
#SBATCH --uenv=<CP2K_UENV>
#SBATCH --view=cp2k

-export OMP_NUM_THREADS=$((SLURM_CPUS_PER_TASK - 1)) # (4)
+export OMP_NUM_THREADS=$((SLURM_CPUS_PER_TASK - 1)) # (4)!

ulimit -s unlimited
srun --cpu-bind=socket cp2k.psmp -i <CP2K_INPUT> -o <CP2K_OUTPUT>
@@ -336,8 +336,6 @@ srun --cpu-bind=socket cp2k.psmp -i <CP2K_INPUT> -o <CP2K_OUTPUT>
for good performance. With [Intel MKL], this is not necessary and one can set `OMP_NUM_THREADS` to
`SLURM_CPUS_PER_TASK`.

-5. [DBCSR] relies on extensive JIT compilation and we store the cache in memory to avoid I/O overhead
-
* Change `<ACCOUNT>` to your project account name
* Change `<CP2K_UENV>` to the name (or path) of the actual CP2K uenv you want to use
* Change `<PATH_TO_CP2K_DATA_DIR>` to the actual path to the CP2K data directory
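
Once these placeholders are filled in, the job can be submitted as usual, for example:

```bash
sbatch run_cp2k.sh
```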
@@ -355,19 +353,26 @@ srun --cpu-bind=socket cp2k.psmp -i <CP2K_INPUT> -o <CP2K_OUTPUT>

## Building CP2K from Source

!!! warning
    The following installation instructions are up-to-date with the latest version of CP2K provided by the uenv.
    That is, they work when manually compiling the CP2K source code corresponding to the CP2K version provided by the uenv.
    **They are not necessarily up-to-date with the latest version of CP2K available on the `master` branch.**

    If you are trying to build CP2K from source, make sure you understand what is different in `master`
    compared to the latest version of CP2K provided by the uenv.

The [CP2K] uenv provides all the dependencies required to build [CP2K] from source, with several optional features
enabled. You can follow these steps to build [CP2K] from source:

```bash
-uenv start --view=develop <CP2K_UENV> # (1)
+uenv start --view=develop <CP2K_UENV> # (1)!

-cd <PATH_TO_CP2K_SOURCE> # (2)
+cd <PATH_TO_CP2K_SOURCE> # (2)!

mkdir build && cd build
CC=mpicc CXX=mpic++ FC=mpifort cmake \
-GNinja \
--DCMAKE_CUDA_HOST_COMPILER=mpicc \ # (3)
+-DCMAKE_CUDA_HOST_COMPILER=mpicc \ # (3)!
-DCP2K_USE_LIBXC=ON \
-DCP2K_USE_LIBINT2=ON \
-DCP2K_USE_SPGLIB=ON \
@@ -378,7 +383,7 @@ CC=mpicc CXX=mpic++ FC=mpifort cmake \
-DCP2K_USE_PLUMED=ON \
-DCP2K_USE_DFTD4=ON \
-DCP2K_USE_DLAF=ON \
--DCP2K_USE_ACCEL=CUDA -DCP2K_WITH_GPU=H100 \ # (4)
+-DCP2K_USE_ACCEL=CUDA -DCP2K_WITH_GPU=H100 \ # (4)!
..

ninja -j 32
@@ -406,10 +411,50 @@ ninja -j 32

See [manual.cp2k.org/CMake] for more details.
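
After a successful build, the resulting binary can be used in place of the uenv-provided `cp2k.psmp` in the sbatch scripts above. A sketch, assuming the CMake setup above places the binary in `build/bin` (verify the location in your build tree):

```bash
srun --cpu-bind=socket ./mps-wrapper.sh <PATH_TO_CP2K_SOURCE>/build/bin/cp2k.psmp -i <CP2K_INPUT> -o <CP2K_OUTPUT>
```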

-### Known issues
+## Known issues

### DLA-Future

The `cp2k/2025.1` uenv provides CP2K with [DLA-Future] support enabled.
The DLA-Future library is initialized even if you don't [explicitly ask to use it](https://manual.cp2k.org/trunk/technologies/eigensolvers/dlaf.html).
This can lead to some surprising warnings and failures described below.
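
For reference, DLA-Future is explicitly requested through the CP2K input file. A minimal sketch, assuming the `PREFERRED_DIAG_LIBRARY` keyword described in the manual page linked above:

```bash
&GLOBAL
  PREFERRED_DIAG_LIBRARY DLAF ! assumed keyword; see the DLA-Future eigensolver manual page
&END GLOBAL
```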

#### `CUSOLVER_STATUS_INTERNAL_ERROR` during initialization

If you are heavily over-subscribing the GPU by running multiple ranks per GPU, you may encounter the following error:

```
created exception: cuSOLVER function returned error code 7 (CUSOLVER_STATUS_INTERNAL_ERROR): pika(bad_function_call)
terminate called after throwing an instance of 'pika::cuda::experimental::cusolver_exception'
what(): cuSOLVER function returned error code 7 (CUSOLVER_STATUS_INTERNAL_ERROR): pika(bad_function_call)
```

The reason is that too many cuSOLVER handles are created.
If you don't need DLA-Future, you can reduce the number of BLAS and LAPACK handles to 1 by setting the following environment variables:

```bash
export DLAF_NUM_GPU_BLAS_HANDLES=1
export DLAF_NUM_GPU_LAPACK_HANDLES=1
```

#### Warning about pika only using one worker thread

When running CP2K with multiple tasks per node and only one core per task, the initialization of DLA-Future may trigger the following warning:

```
The pika runtime will be started with only one worker thread because the
process mask has restricted the available resources to only one thread. If
this is unintentional make sure the process mask contains the resources
you need or use --pika:ignore-process-mask to use all resources. Use
--pika:print-bind to print the thread bindings used by pika.
```

This warning is triggered because [pika](https://pikacpp.org), the runtime used by DLA-Future,
is typically meant to run with more than one worker thread, so a single-thread configuration usually indicates a mistake.
However, if you are not using DLA-Future, the warning is harmless and can be ignored.
The warning cannot be silenced.

-#### DBCSR GPU scaling
+### DBCSR GPU scaling

On the GH200 architecture, it has been observed that the GPU-accelerated version of [DBCSR] does not perform optimally in some cases.
For example, in the `QS/H2O-1024` benchmark above, CP2K does not scale well beyond 2 nodes.
@@ -420,21 +465,21 @@ GPU acceleration on/off with an environment variable:
export DBCSR_RUN_ON_GPU=0
```

-While GPU acceleration is very good on a small number of nodes, the CPU implementation scales better.
-Therefore, for CP2K jobs running on a large number of nodes, it is worth investigating the use of the `DBCSR_RUN_ON_GPU`
+While GPU acceleration is very good on few nodes, the CPU implementation scales better.
+Therefore, for CP2K jobs running on many nodes, it is worth investigating the use of the `DBCSR_RUN_ON_GPU`
environment variable.

-Ssome niche application cases such as the `QS_low_scaling_postHF` benchmarks only run efficiently with the CPU version
+Some niche application cases such as the `QS_low_scaling_postHF` benchmarks only run efficiently with the CPU version
of DBCSR. Generally, if the function `dbcsr_multiply_generic` takes a significant portion of the timing report
(at the end of the CP2K output file), it is worth investigating the effect of the `DBCSR_RUN_ON_GPU` environment variable.
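
A quick way to check is to search the timing report in the output file, for example:

```bash
grep dbcsr_multiply_generic <CP2K_OUTPUT>
```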


-#### CUDA grid backend with high angular momenta basis sets
+### CUDA grid backend with high angular momenta basis sets

The CP2K grid CUDA backend is currently buggy on Alps. Using basis sets with high angular momenta ($l \ge 3$)
results in slow calculations, especially for force calculations with meta-GGA functionals.

-As a workaround, you can you can disable CUDA acceleration fo the grid backend:
+As a workaround, you can disable CUDA acceleration for the grid backend:

```bash
&GLOBAL
  &GRID
    BACKEND CPU ! assumed completion; the snippet was truncated in the diff view
  &END GRID
&END GLOBAL
```