@@ -10,11 +10,10 @@ PM6, RM1, MNDO, …), and classical force fields (AMBER, CHARMM, …). CP2K can
 metadynamics, Monte Carlo, Ehrenfest dynamics, vibrational analysis, core level spectroscopy, energy minimization, and
 transition state optimization using NEB or dimer method. See [CP2K Features] for a detailed overview.
 
-!!! note "User Environments"
+!!! note "uenvs"
 
-    [CP2K] is provided on [ALPS](#platforms-on-alps) via [User Environments](#ref-tool-uenv)
-    (UENVs). Please have a look at the [User Environments documentation](#ref-tool-uenv) for more information about
-    UENVs and how to use them.
+    [CP2K] is provided on [ALPS][platforms-on-alps] via [uenv][ref-tool-uenv].
+    Please have a look at the [uenv documentation][ref-tool-uenv] for more information about uenvs and how to use them.
 
 ## Dependencies
 
@@ -47,6 +46,8 @@ On our systems, CP2K is built with the following dependencies:
 
 ## Running CP2K
 
+### Running on the HPC platform
+
 To start a job, two bash scripts are potentially required: a [slurm] submission script, and a wrapper to start the [CUDA
 MPS] daemon so that multiple MPI ranks can use the same GPU.
 
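For reference, such a wrapper typically looks like the sketch below. This is a hypothetical minimal version, not the exact `mps-wrapper.sh` shipped on the system: local rank 0 on each node starts the MPS control daemon, then every rank runs the wrapped command.

```bash
#!/bin/bash
# Hypothetical sketch of mps-wrapper.sh; the script provided on the system
# may differ. All ranks share the MPS pipe/log directories, and only local
# rank 0 on each node starts the daemon.
export CUDA_MPS_PIPE_DIRECTORY=/tmp/nvidia-mps
export CUDA_MPS_LOG_DIRECTORY=/tmp/nvidia-log

# Start the daemon once per node, only where the NVIDIA tooling exists.
if [ "${SLURM_LOCALID:-1}" -eq 0 ] && command -v nvidia-cuda-mps-control >/dev/null 2>&1; then
    nvidia-cuda-mps-control -d   # -d starts the control daemon in background
fi

# Replace this shell with the wrapped command (e.g. cp2k.psmp and its args).
exec "$@"
```

Because `srun` launches the wrapper on every rank, the daemon is started exactly once per node before CP2K itself begins.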
@@ -71,9 +72,6 @@ export MPICH_MALLOC_FALLBACK=1
 export OMP_NUM_THREADS=$(( SLURM_CPUS_PER_TASK - 1 )) # (4)
 
 ulimit -s unlimited
-
-export CP2K_DATA_DIR=<PATH_TO_CP2K_DATA_DIR>
-
 srun --cpu-bind=socket ./mps-wrapper.sh cp2k.psmp -i <CP2K_INPUT> -o <CP2K_OUTPUT>
 ```
 
@@ -91,7 +89,7 @@ srun --cpu-bind=socket ./mps-wrapper.sh cp2k.psmp -i <CP2K_INPUT> -o <CP2K_OUTPU
 
 
 * Change `<ACCOUNT>` to your project account name
-* Change `<CP2K_UENV>` to the name (or path) of the actual CP2K UENV you want to use
+* Change `<CP2K_UENV>` to the name (or path) of the actual CP2K uenv you want to use
 * Change `<PATH_TO_CP2K_DATA_DIR>` to the actual path to the CP2K data directory
 * Change `<CP2K_INPUT>` and `<CP2K_OUTPUT>` to the actual input and output files
 
@@ -124,25 +122,11 @@ sbatch run_cp2k.sh
     per node. Experiments have shown that CP2K performs and scales better when the number of MPI ranks is a power
     of 2, even if some cores are left idling.
 
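The power-of-two rule can be applied mechanically: take the largest power of two that fits in the cores available per node. A minimal sketch (the core count is an assumption for illustration; adjust it for your node type):

```bash
# Pick the largest power-of-two MPI rank count that fits on one node.
# 288 cores is an assumed example value, not a statement about any system.
cores=288
ranks=1
while [ $(( ranks * 2 )) -le "$cores" ]; do
    ranks=$(( ranks * 2 ))
done
echo "$ranks"   # largest power of two <= cores
```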
-??? warning "CP2K grid CUDA backend with high angular momenta basis sets"
-
-    The CP2K grid CUDA backend is currently buggy on Alps. Using basis sets with high angular momenta ($l \ge 3$)
-    result in slow calculations, especially for force calculations with meta-GGA functionals.
-
-    As a workaround, you can you can disable CUDA acceleration fo the grid backend:
-
-    ```bash
-    &GLOBAL
-      &GRID
-        BACKEND CPU
-      &END GRID
-    &END GLOBAL
-    ```
-
 
 ??? info "Running regression tests"
 
-    If you want to run CP2K regression tests with the CP2K executable provided by the UENV, make sure to use the version
-    of the regression tests corresponding to the version of CP2K provided by the UENV. The regression test data is
+    If you want to run CP2K regression tests with the CP2K executable provided by the uenv, make sure to use the version
+    of the regression tests corresponding to the version of CP2K provided by the uenv. The regression test data is
     sometimes adjusted, and using the wrong version of the regression tests can lead to test failures.
 
 
@@ -302,32 +286,62 @@ sbatch run_cp2k.sh
 
     This RPA input scales well until 32 GH200 nodes.
 
-### Known issues
+### Running on Eiger
 
+On Eiger, a similar sbatch script can be used:
 
-#### DBCSR GPU scaling
+```bash title="run_cp2k.sh"
+#!/bin/bash -l
+#SBATCH --job-name=cp2k-job
+#SBATCH --time=00:30:00 # (1)
+#SBATCH --nodes=1
+#SBATCH --ntasks-per-core=1
+#SBATCH --ntasks-per-node=32 # (2)
+#SBATCH --cpus-per-task=4 # (3)
+#SBATCH --account=<ACCOUNT>
+#SBATCH --hint=nomultithread
+#SBATCH --hint=exclusive
+#SBATCH --constraint=mc
+#SBATCH --uenv=<CP2K_UENV>
+#SBATCH --view=cp2k
 
-On the GH200 architecture, it has been observed that the GPU accelerated version of [DBCSR] does not perform optimally in some cases.
-For example, in the `QS/H2O-1024` benchmark above, CP2K does not scale well beyond 2 nodes.
-The CPU implementation of DBCSR does not suffer from this. A workaround was implemented in DBCSR, in order to switch
-GPU acceleration on/off with an environment variable:
+export OMP_NUM_THREADS=$(( SLURM_CPUS_PER_TASK - 1 )) # (4)
 
-```bash
-export DBCSR_RUN_ON_GPU=0
+ulimit -s unlimited
+srun --cpu-bind=socket cp2k.psmp -i <CP2K_INPUT> -o <CP2K_OUTPUT>
 ```
 
-While GPU acceleration is very good on a small number of nodes, the CPU implementation scales better.
-Therefore, for CP2K jobs running on a large number of nodes, it is worth investigating the use of the `DBCSR_RUN_ON_GPU`
-environment variable.
+1. Time format: `HH:MM:SS`
 
-Ssome niche application cases such as the `QS_low_scaling_postHF` benchmarks only run efficiently with the CPU version
-of DBCSR. Generally, if the function `dbcsr_multiply_generic` takes a significant portion of the timing report
-(at the end of the CP2K output file), it is worth investigating the effect of the `DBCSR_RUN_ON_GPU` environment variable.
+2. Number of MPI ranks per node
+
+3. Number of CPUs per MPI rank
+
+4. [OpenBLAS] spawns an extra thread, therefore it is necessary to set `OMP_NUM_THREADS` to `SLURM_CPUS_PER_TASK - 1`
+   for good performance. With [Intel MKL], this is not necessary and one can set `OMP_NUM_THREADS` to
+   `SLURM_CPUS_PER_TASK`.
+
+* Change `<ACCOUNT>` to your project account name
+* Change `<CP2K_UENV>` to the name (or path) of the actual CP2K uenv you want to use
+* Change `<CP2K_INPUT>` and `<CP2K_OUTPUT>` to the actual input and output files
+
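The thread-count arithmetic from annotation (4) can be checked outside a job script. A minimal sketch, with the Slurm variable hard-coded since it is only set inside a real allocation:

```bash
# Sketch of annotation (4): with OpenBLAS, reserve one CPU of the per-task
# allocation for the extra thread it spawns. SLURM_CPUS_PER_TASK is set by
# Slurm in a real job; it is hard-coded here for illustration.
SLURM_CPUS_PER_TASK=4
export OMP_NUM_THREADS=$(( SLURM_CPUS_PER_TASK - 1 ))
echo "$OMP_NUM_THREADS"   # prints 3
```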
+!!! warning
+
+    The `--cpu-bind=socket` option is necessary to get good performance.
+
+??? info "Running regression tests"
+
+    If you want to run CP2K regression tests with the CP2K executable provided by the uenv, make sure to use the version
+    of the regression tests corresponding to the version of CP2K provided by the uenv. The regression test data is
+    sometimes adjusted, and using the wrong version of the regression tests can lead to test failures.
 
 ## Building CP2K from Source
 
 
-The [CP2K] UENV provides all the dependencies required to build [CP2K] from source, with several optional features
+The [CP2K] uenv provides all the dependencies required to build [CP2K] from source, with several optional features
 enabled. You can follow these steps to build [CP2K] from source:
 
 ```bash
@@ -355,7 +369,7 @@ CC=mpicc CXX=mpic++ FC=mpifort cmake \
 ninja -j 32
 ```
 
-1. Start the CP2K UENV and load the `develop` view (which provides all the necessary dependencies)
+1. Start the CP2K uenv and load the `develop` view (which provides all the necessary dependencies)
 
 2. Go to the CP2K source directory
 
361375
@@ -368,7 +382,7 @@ ninja -j 32
 ??? note "Eiger: Intel MKL (before `[email protected]`)"
 
     On `x86` we deployed with `intel-oneapi-mkl` before `[email protected]`.
-    If you are using a pre-`[email protected]` UENV, add `-DCP2K_SCALAPACK_VENDOR=MKL` to the CMake invocation to find MKL.
+    If you are using a pre-`[email protected]` uenv, add `-DCP2K_SCALAPACK_VENDOR=MKL` to the CMake invocation to find MKL.
 
 ??? note "CUDA architecture for `[email protected]` and earlier"
 
@@ -377,6 +391,44 @@ ninja -j 32
 
 See [manual.cp2k.org/CMake] for more details.
 
+### Known issues
+
+
+#### DBCSR GPU scaling
+
+On the GH200 architecture, it has been observed that the GPU-accelerated version of [DBCSR] does not perform optimally in some cases.
+For example, in the `QS/H2O-1024` benchmark above, CP2K does not scale well beyond 2 nodes.
+The CPU implementation of DBCSR does not suffer from this. A workaround was implemented in DBCSR to switch
+GPU acceleration on/off with an environment variable:
+
+```bash
+export DBCSR_RUN_ON_GPU=0
+```
+
+While GPU acceleration is very good on a small number of nodes, the CPU implementation scales better.
+Therefore, for CP2K jobs running on a large number of nodes, it is worth investigating the use of the `DBCSR_RUN_ON_GPU`
+environment variable.
+
+Some niche application cases such as the `QS_low_scaling_postHF` benchmarks only run efficiently with the CPU version
+of DBCSR. Generally, if the function `dbcsr_multiply_generic` takes a significant portion of the timing report
+(at the end of the CP2K output file), it is worth investigating the effect of the `DBCSR_RUN_ON_GPU` environment variable.
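A quick way to check whether a run is in this regime is to search the timing report for `dbcsr_multiply_generic`. A minimal sketch; the sample report below is fabricated purely to keep the snippet self-contained (a real CP2K timing report has more columns):

```bash
# Search the timing report printed at the end of the CP2K output file for
# dbcsr_multiply_generic. The sample file here is made up for illustration
# only; replace it with your actual output file.
cat > sample_timing.txt <<'EOF'
 SUBROUTINE                 CALLS  TOTAL TIME
 dbcsr_multiply_generic      4523      123.40
 qs_energies                    1       45.10
EOF
grep 'dbcsr_multiply_generic' sample_timing.txt
```

If the matching line accounts for a large share of the total time, rerunning with `DBCSR_RUN_ON_GPU=0` is worth a try.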
+
+
+#### CUDA grid backend with high angular momenta basis sets
+
+The CP2K grid CUDA backend is currently buggy on Alps. Using basis sets with high angular momenta ($l \ge 3$)
+results in slow calculations, especially for force calculations with meta-GGA functionals.
+
+As a workaround, you can disable CUDA acceleration for the grid backend:
+
+```bash
+&GLOBAL
+  &GRID
+    BACKEND CPU
+  &END GRID
+&END GLOBAL
+```
+
 [CP2K]: https://www.cp2k.org/
 [CP2K Features]: https://www.cp2k.org/features
 [COSMA]: https://github.com/eth-cscs/COSMA