diff --git a/.github/CODEOWNERS b/.github/CODEOWNERS
index 10d13999..2e73858a 100644
--- a/.github/CODEOWNERS
+++ b/.github/CODEOWNERS
@@ -1,3 +1,3 @@
 * @bcumming @msimberg @RMeli
-docs/software/sciapps/cp2k @abussy @RMeli
+docs/software/sciapps/cp2k.md @abussy @RMeli
 docs/software/communication @msimberg
diff --git a/docs/software/sciapps/cp2k.md b/docs/software/sciapps/cp2k.md
index 30f32aee..eae595de 100644
--- a/docs/software/sciapps/cp2k.md
+++ b/docs/software/sciapps/cp2k.md
@@ -63,11 +63,11 @@ MPS] daemon so that multiple MPI ranks can use the same GPU.
 #!/bin/bash -l
 #SBATCH --job-name=cp2k-job
-#SBATCH --time=00:30:00 # (1)
+#SBATCH --time=00:30:00 # (1)!
 #SBATCH --nodes=4
 #SBATCH --ntasks-per-core=1
-#SBATCH --ntasks-per-node=32 # (2)
-#SBATCH --cpus-per-task=8 # (3)
+#SBATCH --ntasks-per-node=32 # (2)!
+#SBATCH --cpus-per-task=8 # (3)!
 #SBATCH --account=<ACCOUNT>
 #SBATCH --hint=nomultithread
 #SBATCH --hint=exclusive
@@ -75,10 +75,10 @@ MPS] daemon so that multiple MPI ranks can use the same GPU.
 #SBATCH --uenv=<CP2K_UENV>
 #SBATCH --view=cp2k

-export CUDA_CACHE_PATH="/dev/shm/$USER/cuda_cache" # (5)
-export MPICH_GPU_SUPPORT_ENABLED=1 # (6)
+export CUDA_CACHE_PATH="/dev/shm/$USER/cuda_cache" # (5)!
+export MPICH_GPU_SUPPORT_ENABLED=1 # (6)!
 export MPICH_MALLOC_FALLBACK=1
-export OMP_NUM_THREADS=$((SLURM_CPUS_PER_TASK - 1)) # (4)
+export OMP_NUM_THREADS=$((SLURM_CPUS_PER_TASK - 1)) # (4)!

 ulimit -s unlimited
 srun --cpu-bind=socket ./mps-wrapper.sh cp2k.psmp -i <CP2K_INPUT> -o <CP2K_OUTPUT>
@@ -94,7 +94,7 @@ srun --cpu-bind=socket ./mps-wrapper.sh cp2k.psmp -i <CP2K_INPUT> -o <CP2K_OUTPUT>
 #SBATCH --hint=nomultithread
 #SBATCH --hint=exclusive
@@ -320,7 +320,7 @@ On Eiger, a similar sbatch script can be used:
 #SBATCH --uenv=<CP2K_UENV>
 #SBATCH --view=cp2k

-export OMP_NUM_THREADS=$((SLURM_CPUS_PER_TASK - 1)) # (4)
+export OMP_NUM_THREADS=$((SLURM_CPUS_PER_TASK - 1)) # (4)!

 ulimit -s unlimited
 srun --cpu-bind=socket cp2k.psmp -i <CP2K_INPUT> -o <CP2K_OUTPUT>
@@ -336,8 +336,6 @@ srun --cpu-bind=socket cp2k.psmp -i <CP2K_INPUT> -o <CP2K_OUTPUT>
    for good performance. With [Intel MKL], this is not necessary and one can set
    `OMP_NUM_THREADS` to `SLURM_CPUS_PER_TASK`.
-5. [DBCSR] relies on extensive JIT compilation and we store the cache in memory to avoid I/O overhead
-
 * Change `<ACCOUNT>` to your project account name
 * Change `<CP2K_UENV>` to the name (or path) of the actual CP2K uenv you want to use
 * Change `<PATH_TO_CP2K_DATA_DIR>` to the actual path to the CP2K data directory
@@ -355,19 +353,26 @@ srun --cpu-bind=socket cp2k.psmp -i <CP2K_INPUT> -o <CP2K_OUTPUT>

 ## Building CP2K from Source

+!!! warning
+    The following installation instructions are up-to-date with the latest version of CP2K provided by the uenv.
+    That is, they work when manually compiling the CP2K source code corresponding to the CP2K version provided by the uenv.
+    **They are not necessarily up-to-date with the latest version of CP2K available on the `master` branch.**
+
+    If you are trying to build CP2K from source, make sure you understand what is different in `master`
+    compared to the latest version of CP2K provided by the uenv.
+
 The [CP2K] uenv provides all the dependencies required to build [CP2K] from source, with several
 optional features enabled. You can follow these steps to build [CP2K] from source:

 ```bash
-uenv start <CP2K_UENV> --view=develop # (1)
+uenv start <CP2K_UENV> --view=develop # (1)!

-cd <PATH_TO_CP2K> # (2)
+cd <PATH_TO_CP2K> # (2)!
 mkdir build && cd build

 CC=mpicc CXX=mpic++ FC=mpifort cmake \
   -GNinja \
-  -DCMAKE_CUDA_HOST_COMPILER=mpicc \ # (3)
+  -DCMAKE_CUDA_HOST_COMPILER=mpicc \ # (3)!
   -DCP2K_USE_LIBXC=ON \
   -DCP2K_USE_LIBINT2=ON \
   -DCP2K_USE_SPGLIB=ON \
@@ -378,7 +383,7 @@ CC=mpicc CXX=mpic++ FC=mpifort cmake \
   -DCP2K_USE_PLUMED=ON \
   -DCP2K_USE_DFTD4=ON \
   -DCP2K_USE_DLAF=ON \
-  -DCP2K_USE_ACCEL=CUDA -DCP2K_WITH_GPU=H100 \ # (4)
+  -DCP2K_USE_ACCEL=CUDA -DCP2K_WITH_GPU=H100 \ # (4)!
   ..

 ninja -j 32
@@ -406,10 +411,50 @@ ninja -j 32
 See [manual.cp2k.org/CMake] for more details.

-### Known issues
+## Known issues
+
+### DLA-Future
+
+The `cp2k/2025.1` uenv provides CP2K with [DLA-Future] support enabled.
+The DLA-Future library is initialized even if you don't [explicitly ask to use it](https://manual.cp2k.org/trunk/technologies/eigensolvers/dlaf.html).
+This can lead to some surprising warnings and failures, described below.
+
+#### `CUSOLVER_STATUS_INTERNAL_ERROR` during initialization
+
+If you are heavily over-subscribing the GPU by running multiple ranks per GPU, you may encounter the following error:
+
+```
+created exception: cuSOLVER function returned error code 7 (CUSOLVER_STATUS_INTERNAL_ERROR): pika(bad_function_call)
+terminate called after throwing an instance of 'pika::cuda::experimental::cusolver_exception'
+what(): cuSOLVER function returned error code 7 (CUSOLVER_STATUS_INTERNAL_ERROR): pika(bad_function_call)
+```
+
+The reason is that too many cuSOLVER handles are created.
+If you don't need DLA-Future, you can manually set the number of BLAS and LAPACK handles to 1 by exporting the following environment variables:
+
+```bash
+export DLAF_NUM_GPU_BLAS_HANDLES=1
+export DLAF_NUM_GPU_LAPACK_HANDLES=1
+```
+
+#### Warning about pika only using one worker thread
+
+When running CP2K with multiple tasks per node and only one core per task, the initialization of DLA-Future may trigger the following warning:
+
+```
+The pika runtime will be started with only one worker thread because the
+process mask has restricted the available resources to only one thread. If
+this is unintentional make sure the process mask contains the resources
+you need or use --pika:ignore-process-mask to use all resources. Use
+--pika:print-bind to print the thread bindings used by pika.
+```
+
+The warning is triggered because [pika](https://pikacpp.org), the runtime used by DLA-Future,
+is typically meant to run with more than one worker thread, so it usually indicates a configuration mistake.
+However, if you are not using DLA-Future, the warning is harmless and can be ignored.
+The warning cannot be silenced.

-#### DBCSR GPU scaling
+### DBCSR GPU scaling

 On the GH200 architecture, it has been observed that the GPU accelerated version of [DBCSR] does not
 perform optimally in some cases.
 For example, in the `QS/H2O-1024` benchmark above, CP2K does not scale well beyond 2 nodes.
@@ -420,21 +465,21 @@ GPU acceleration on/off with an environment variable:
 export DBCSR_RUN_ON_GPU=0
 ```

-While GPU acceleration is very good on a small number of nodes, the CPU implementation scales better.
-Therefore, for CP2K jobs running on a large number of nodes, it is worth investigating the use of the `DBCSR_RUN_ON_GPU`
+While GPU acceleration is very good on a few nodes, the CPU implementation scales better.
+Therefore, for CP2K jobs running on many nodes, it is worth investigating the use of the `DBCSR_RUN_ON_GPU`
 environment variable.
-Ssome niche application cases such as the `QS_low_scaling_postHF` benchmarks only run efficiently with the CPU version
+Some niche application cases such as the `QS_low_scaling_postHF` benchmarks only run efficiently with the CPU version
 of DBCSR.
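+
+As an illustrative sketch (reusing the GH200 batch script's launch line from above, with
+`<CP2K_INPUT>` and `<CP2K_OUTPUT>` as placeholder file names), such a many-node job could
+disable DBCSR GPU acceleration like this:
+
+```bash
+export DBCSR_RUN_ON_GPU=0
+srun --cpu-bind=socket ./mps-wrapper.sh cp2k.psmp -i <CP2K_INPUT> -o <CP2K_OUTPUT>
+```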

 Generally, if the function `dbcsr_multiply_generic` takes a significant portion of the timing report
 (at the end of the CP2K output file), it is worth investigating the effect of the `DBCSR_RUN_ON_GPU`
 environment variable.

-#### CUDA grid backend with high angular momenta basis sets
+### CUDA grid backend with high angular momenta basis sets

 The CP2K grid CUDA backend is currently buggy on Alps.
 Using basis sets with high angular momenta ($l \ge 3$) result in slow calculations, especially for
 force calculations with meta-GGA functionals.

-As a workaround, you can you can disable CUDA acceleration fo the grid backend:
+As a workaround, you can disable CUDA acceleration for the grid backend:

 ```bash
 &GLOBAL
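   &GRID
     ! Hypothetical completion: the diff is truncated at this point. In CP2K input,
     ! setting GLOBAL%GRID%BACKEND to CPU forces the CPU grid backend instead of CUDA.
     BACKEND CPU
   &END GRID
 &END GLOBAL
 ```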