@@ -63,22 +63,22 @@ MPS] daemon so that multiple MPI ranks can use the same GPU.
6363#!/bin/bash -l
6464
6565#SBATCH --job-name=cp2k-job
66- #SBATCH --time=00:30:00 # (1)
66+ #SBATCH --time=00:30:00 (1)
6767#SBATCH --nodes=4
6868#SBATCH --ntasks-per-core=1
69- #SBATCH --ntasks-per-node=32 # (2)
70- #SBATCH --cpus-per-task=8 # (3)
69+ #SBATCH --ntasks-per-node=32 (2)
70+ #SBATCH --cpus-per-task=8 (3)
7171#SBATCH --account=<ACCOUNT>
7272#SBATCH --hint=nomultithread
7373#SBATCH --hint=exclusive
7474#SBATCH --no-requeue
7575#SBATCH --uenv=<CP2K_UENV>
7676#SBATCH --view=cp2k
7777
78- export CUDA_CACHE_PATH="/dev/shm/$USER/cuda_cache" # (5)
79- export MPICH_GPU_SUPPORT_ENABLED=1 # (6)
78+ export CUDA_CACHE_PATH="/dev/shm/$USER/cuda_cache" # (5)!
79+ export MPICH_GPU_SUPPORT_ENABLED=1 # (6)!
8080export MPICH_MALLOC_FALLBACK=1
81- export OMP_NUM_THREADS=$(( SLURM_CPUS_PER_TASK - 1 )) # (4)
81+ export OMP_NUM_THREADS=$(( SLURM_CPUS_PER_TASK - 1 )) # (4)!
8282
8383ulimit -s unlimited
8484srun --cpu-bind=socket ./mps-wrapper.sh cp2k.psmp -i <CP2K_INPUT> -o <CP2K_OUTPUT>
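The script can then be submitted with `sbatch` as usual (a trivial example; the file name `run_cp2k.sh` matches the Eiger example below and is otherwise an assumption):

```bash
sbatch run_cp2k.sh
```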
@@ -94,7 +94,7 @@ srun --cpu-bind=socket ./mps-wrapper.sh cp2k.psmp -i <CP2K_INPUT> -o <CP2K_OUTPU
9494 for good performance. With [Intel MKL], this is not necessary and one can set `OMP_NUM_THREADS` to
9595 `SLURM_CPUS_PER_TASK`.
9696
97- 5. [DBCSR] relies on extensive JIT compilation and we store the cache in memory to avoid I/O overhead.
97+ 5. [DBCSR] relies on extensive JIT compilation, and we store the cache in memory to avoid I/O overhead.
9898 This is set by default on the HPC platform, but it's set here explicitly as it's essential to avoid performance degradation.
9999
1001006. CP2K's dependencies use GPU-aware MPI, which requires enabling support at runtime.
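For reference, the [Intel MKL] variant described in note 4 would look roughly like this (a minimal sketch; it assumes a CP2K build actually linked against MKL, which is not what the script above uses):

```bash
# Only for an MKL-linked build (assumption): the spare thread is not needed,
# so OpenMP can use all cores assigned to the task.
export OMP_NUM_THREADS=${SLURM_CPUS_PER_TASK}
```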
@@ -308,19 +308,19 @@ On Eiger, a similar sbatch script can be used:
308308```bash title="run_cp2k.sh"
309309#!/bin/bash -l
310310#SBATCH --job-name=cp2k-job
311- #SBATCH --time=00:30:00 # (1)
311+ #SBATCH --time=00:30:00 (1)
312312#SBATCH --nodes=1
313313#SBATCH --ntasks-per-core=1
314- #SBATCH --ntasks-per-node=32 # (2)
315- #SBATCH --cpus-per-task=4 # (3)
314+ #SBATCH --ntasks-per-node=32 (2)
315+ #SBATCH --cpus-per-task=4 (3)
316316#SBATCH --account=<ACCOUNT>
317317#SBATCH --hint=nomultithread
318318#SBATCH --hint=exclusive
319319#SBATCH --constraint=mc
320320#SBATCH --uenv=<CP2K_UENV>
321321#SBATCH --view=cp2k
322322
323- export OMP_NUM_THREADS=$(( SLURM_CPUS_PER_TASK - 1 )) # (4)
323+ export OMP_NUM_THREADS=$(( SLURM_CPUS_PER_TASK - 1 )) # (4)!
324324
325325ulimit -s unlimited
326326srun --cpu-bind=socket cp2k.psmp -i <CP2K_INPUT> -o <CP2K_OUTPUT>
@@ -336,8 +336,6 @@ srun --cpu-bind=socket cp2k.psmp -i <CP2K_INPUT> -o <CP2K_OUTPUT>
336336 for good performance. With [Intel MKL], this is not necessary and one can set `OMP_NUM_THREADS` to
337337 `SLURM_CPUS_PER_TASK`.
338338
339- 5. [DBCSR] relies on extensive JIT compilation and we store the cache in memory to avoid I/O overhead
340-
341339* Change `<ACCOUNT>` to your project account name
342340* Change `<CP2K_UENV>` to the name (or path) of the actual CP2K uenv you want to use
343341* Change `<PATH_TO_CP2K_DATA_DIR>` to the actual path to the CP2K data directory
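As a concrete illustration, a filled-in header might look like the sketch below; every value is hypothetical and must be replaced with your own project account, the uenv you actually use, and the real data path:

```bash
#SBATCH --account=g123                    # hypothetical project account
#SBATCH --uenv=cp2k/2025.1:v1             # hypothetical uenv name/tag
export CP2K_DATA_DIR=/path/to/cp2k/data   # assumed variable for <PATH_TO_CP2K_DATA_DIR>
```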
@@ -355,19 +353,26 @@ srun --cpu-bind=socket cp2k.psmp -i <CP2K_INPUT> -o <CP2K_OUTPUT>
355353
356354## Building CP2K from Source
357355
356+ !!! warning
357+ The following installation instructions are up-to-date with the latest version of CP2K provided by the uenv.
358+ That is, they work when manually compiling the CP2K source code corresponding to the CP2K version provided by the uenv.
359+ **They are not necessarily up-to-date with the latest version of CP2K available on the `master` branch.**
360+
361+ If you are trying to build CP2K from source, make sure you understand what is different in `master`
362+ compared to the latest version of CP2K provided by the uenv.
358363
359364The [CP2K] uenv provides all the dependencies required to build [CP2K] from source, with several optional features
360365enabled. You can follow these steps to build [CP2K] from source:
361366
362367```bash
363- uenv start --view=develop <CP2K_UENV> # (1)
368+ uenv start --view=develop <CP2K_UENV> # (1)!
364369
365- cd <PATH_TO_CP2K_SOURCE> # (2)
370+ cd <PATH_TO_CP2K_SOURCE> # (2)!
366371
367372mkdir build && cd build
368373CC=mpicc CXX=mpic++ FC=mpifort cmake \
369374 -GNinja \
370- -DCMAKE_CUDA_HOST_COMPILER=mpicc \ # (3)
375+ -DCMAKE_CUDA_HOST_COMPILER=mpicc \ # (3)!
371376 -DCP2K_USE_LIBXC=ON \
372377 -DCP2K_USE_LIBINT2=ON \
373378 -DCP2K_USE_SPGLIB=ON \
@@ -378,7 +383,7 @@ CC=mpicc CXX=mpic++ FC=mpifort cmake \
378383 -DCP2K_USE_PLUMED=ON \
379384 -DCP2K_USE_DFTD4=ON \
380385 -DCP2K_USE_DLAF=ON \
381- -DCP2K_USE_ACCEL=CUDA -DCP2K_WITH_GPU=H100 \ # (4)
386+ -DCP2K_USE_ACCEL=CUDA -DCP2K_WITH_GPU=H100 \ # (4)!
382387 ..
383388
384389ninja -j 32
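After `ninja` completes, the binaries typically end up under `bin/` in the build directory, so a quick sanity check from the same uenv session could be (a sketch, assuming the build succeeded):

```bash
# Still inside `uenv start --view=develop <CP2K_UENV>`, from the build directory:
./bin/cp2k.psmp --version   # prints the CP2K version banner if the build is usable
```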
@@ -406,10 +411,50 @@ ninja -j 32
406411
407412See [manual.cp2k.org/CMake] for more details.
408413
409- ### Known issues
414+ ## Known issues
415+
416+ ### DLA-Future
417+
418+ The `cp2k/2025.1` uenv provides CP2K with [DLA-Future] support enabled.
419+ The DLA-Future library is initialized even if you don't [explicitly ask to use it](https://manual.cp2k.org/trunk/technologies/eigensolvers/dlaf.html).
420+ This can lead to some surprising warnings and failures described below.
421+
422+ #### `CUSOLVER_STATUS_INTERNAL_ERROR` during initialization
423+
424+ If you heavily over-subscribe the GPU by running many MPI ranks per GPU, you may encounter the following error:
425+
426+ ```
427+ created exception: cuSOLVER function returned error code 7 (CUSOLVER_STATUS_INTERNAL_ERROR): pika(bad_function_call)
428+ terminate called after throwing an instance of 'pika::cuda::experimental::cusolver_exception'
429+ what(): cuSOLVER function returned error code 7 (CUSOLVER_STATUS_INTERNAL_ERROR): pika(bad_function_call)
430+ ```
431+
432+ The reason is that too many cuSOLVER handles are created.
433+ If you don't need DLA-Future, you can manually set the number of BLAS and LAPACK handles to 1 via the following environment variables:
434+
435+ ```bash
436+ DLAF_NUM_GPU_BLAS_HANDLES=1
437+ DLAF_NUM_GPU_LAPACK_HANDLES=1
438+ ```
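For these settings to take effect they have to reach the CP2K processes launched by `srun`, so in practice they would be exported in the batch script before the launch line (a minimal sketch based on the GH200 script above):

```bash
export DLAF_NUM_GPU_BLAS_HANDLES=1
export DLAF_NUM_GPU_LAPACK_HANDLES=1

srun --cpu-bind=socket ./mps-wrapper.sh cp2k.psmp -i <CP2K_INPUT> -o <CP2K_OUTPUT>
```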
439+
440+ #### Warning about pika only using one worker thread
441+
442+ When running CP2K with multiple tasks per node and only one core per task, the initialization of DLA-Future may trigger the following warning:
443+
444+ ```
445+ The pika runtime will be started with only one worker thread because the
446+ process mask has restricted the available resources to only one thread. If
447+ this is unintentional make sure the process mask contains the resources
448+ you need or use --pika:ignore-process-mask to use all resources. Use
449+ --pika:print-bind to print the thread bindings used by pika.
450+ ```
410451
452+ This warning is triggered because [pika](https://pikacpp.org), the runtime used by DLA-Future,
453+ should typically be used with more than one worker thread, so a single-thread configuration usually indicates a mistake.
454+ However, if you are not using DLA-Future, the warning is harmless and can be ignored.
455+ The warning cannot be silenced.
411456
412- #### DBCSR GPU scaling
457+ ### DBCSR GPU scaling
413458
414459On the GH200 architecture, it has been observed that the GPU-accelerated version of [DBCSR] does not perform optimally in some cases.
415460For example, in the `QS/H2O-1024` benchmark above, CP2K does not scale well beyond 2 nodes.
@@ -420,21 +465,21 @@ GPU acceleration on/off with an environment variable:
420465export DBCSR_RUN_ON_GPU=0
421466```
422467
423- While GPU acceleration is very good on a small number of nodes, the CPU implementation scales better.
424- Therefore, for CP2K jobs running on a large number of nodes, it is worth investigating the use of the `DBCSR_RUN_ON_GPU`
468+ While GPU acceleration is very good on a few nodes, the CPU implementation scales better.
469+ Therefore, for CP2K jobs running on many nodes, it is worth investigating the use of the `DBCSR_RUN_ON_GPU`
425470environment variable.
426471
427- Ssome niche application cases such as the `QS_low_scaling_postHF` benchmarks only run efficiently with the CPU version
472+ Some niche application cases such as the `QS_low_scaling_postHF` benchmarks only run efficiently with the CPU version
428473of DBCSR. Generally, if the function `dbcsr_multiply_generic` takes a significant portion of the timing report
429474(at the end of the CP2K output file), it is worth investigating the effect of the `DBCSR_RUN_ON_GPU` environment variable.
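A quick way to check whether this applies to your run is to look for `dbcsr_multiply_generic` in the timing report, for example (a simple illustration, reusing the output file placeholder from the scripts above):

```bash
grep "dbcsr_multiply_generic" <CP2K_OUTPUT>
```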
430475
431476
432- #### CUDA grid backend with high angular momenta basis sets
477+ ### CUDA grid backend with high angular momenta basis sets
433478
434479The CP2K grid CUDA backend is currently buggy on Alps. Using basis sets with high angular momenta ($l \ge 3$)
435480results in slow calculations, especially for force calculations with meta-GGA functionals.
436481
437- As a workaround, you can you can disable CUDA acceleration fo the grid backend:
482+ As a workaround, you can disable CUDA acceleration for the grid backend:
438483
439484```bash
440485&GLOBAL