diff --git a/docs/javascripts/mathjax.js b/docs/javascripts/mathjax.js
new file mode 100644
index 00000000..0be88e04
--- /dev/null
+++ b/docs/javascripts/mathjax.js
@@ -0,0 +1,19 @@
+window.MathJax = {
+  tex: {
+    inlineMath: [["\\(", "\\)"]],
+    displayMath: [["\\[", "\\]"]],
+    processEscapes: true,
+    processEnvironments: true
+  },
+  options: {
+    ignoreHtmlClass: ".*|",
+    processHtmlClass: "arithmatex"
+  }
+};
+
+document$.subscribe(() => {
+  MathJax.startup.output.clearCache()
+  MathJax.typesetClear()
+  MathJax.texReset()
+  MathJax.typesetPromise()
+})
diff --git a/docs/software/sciapps/cp2k.md b/docs/software/sciapps/cp2k.md
index 0724f1a2..c1c88444 100644
--- a/docs/software/sciapps/cp2k.md
+++ b/docs/software/sciapps/cp2k.md
@@ -1,4 +1,460 @@
 [](){#ref-uenv-cp2k}
 # CP2K
-!!! todo
-    complete docs
+
+[CP2K] is a quantum chemistry and solid state physics software package that can perform atomistic simulations of solid
+state, liquid, molecular, periodic, material, crystal, and biological systems.
+
+CP2K provides a general framework for different modeling methods such as DFT using the mixed Gaussian and plane waves
+approaches GPW and GAPW. Supported theory levels include DFTB, LDA, GGA, MP2, RPA, semi-empirical methods (AM1, PM3,
+PM6, RM1, MNDO, …), and classical force fields (AMBER, CHARMM, …). CP2K can perform simulations of molecular dynamics,
+metadynamics, Monte Carlo, Ehrenfest dynamics, vibrational analysis, core level spectroscopy, energy minimization, and
+transition state optimization using the NEB or dimer method. See [CP2K Features] for a detailed overview.
+
+!!! note "uenvs"
+
+    [CP2K] is provided on [ALPS][platforms-on-alps] via [uenv][ref-tool-uenv].
+    Please have a look at the [uenv documentation][ref-tool-uenv] for more information about uenvs and how to use them.
+
+## Dependencies
+
+On our systems, CP2K is built with the following dependencies:
+
+* [COSMA]
+* [Cray MPICH]
+* [DBCSR]
+* [DLA-Future]
+* [dftd4] (from `cp2k@2025.1` onwards)
+* [ELPA]
+* [FFTW]
+* [Libxc]
+* [libint]
+* [OpenBLAS]
+* [PLUMED] (from `cp2k@2024.1` onwards)
+* [ScaLAPACK]
+* [SIRIUS]
+* [Spglib]
+* [spla]
+
+!!! note "GPU-aware MPI"
+    [COSMA] and [DLA-Future] are built with GPU-aware MPI. On the HPC platform, `MPICH_GPU_SUPPORT_ENABLED=1` is set by
+    default, therefore there is no need to set it manually.
+
+!!! warning "BLAS/LAPACK on Eiger"
+
+    On Eiger, the default BLAS/LAPACK library is Intel oneAPI MKL (oneMKL) until `cp2k@2024.3`.
+    From `cp2k@2025.1` the default BLAS/LAPACK library is [OpenBLAS].
+
+## Running CP2K
+
+### Running on the HPC platform
+
+To start a job, two bash scripts are potentially required: a [slurm] submission script, and a wrapper to start the
+[CUDA MPS] daemon so that multiple MPI ranks can use the same GPU.
+
+```bash title="run_cp2k.sh"
+#!/bin/bash -l
+
+#SBATCH --job-name=cp2k-job
+#SBATCH --time=00:30:00 # (1)
+#SBATCH --nodes=4
+#SBATCH --ntasks-per-core=1
+#SBATCH --ntasks-per-node=32 # (2)
+#SBATCH --cpus-per-task=8 # (3)
+#SBATCH --account=<ACCOUNT>
+#SBATCH --hint=nomultithread
+#SBATCH --hint=exclusive
+#SBATCH --no-requeue
+#SBATCH --uenv=<CP2K_UENV>
+#SBATCH --view=cp2k
+
+export CUDA_CACHE_PATH="/dev/shm/$RANDOM" # (5)
+export MPICH_MALLOC_FALLBACK=1
+export OMP_NUM_THREADS=$((SLURM_CPUS_PER_TASK - 1)) # (4)
+
+ulimit -s unlimited
+srun --cpu-bind=socket ./mps-wrapper.sh cp2k.psmp -i <PATH_TO_INPUT_FILE> -o <PATH_TO_OUTPUT_FILE>
+```
+
+1. Time format: `HH:MM:SS`
+
+2. Number of MPI ranks per node
+
+3. Number of CPUs per MPI rank
+
+4. [OpenBLAS] spawns an extra thread, therefore it is necessary to set `OMP_NUM_THREADS` to `SLURM_CPUS_PER_TASK - 1`
+   for good performance. With [Intel MKL], this is not necessary and one can set `OMP_NUM_THREADS` to
+   `SLURM_CPUS_PER_TASK`.
+
+5. [DBCSR] relies on extensive JIT compilation; the cache is stored in memory to avoid I/O overhead.
+
+* Change `<ACCOUNT>` to your project account name
+* Change `<CP2K_UENV>` to the name (or path) of the actual CP2K uenv you want to use
+* Change `<PATH_TO_CP2K_DATA_DIR>` to the actual path to the CP2K data directory
+* Change `<PATH_TO_INPUT_FILE>` and `<PATH_TO_OUTPUT_FILE>` to the actual input and output files
+
+With the above scripts, you can launch a [CP2K] calculation on 4 nodes, with 32 MPI ranks per node and 8 OpenMP threads
+per rank, with:
+
+```bash
+sbatch run_cp2k.sh
+```
+
+!!! note
+
+    The `mps-wrapper.sh` script, required to properly over-subscribe the GPU, is provided at the following page:
+    [NVIDIA GH200 GPU nodes: multiple ranks per GPU][ref-slurm-gh200-multi-rank-per-gpu].
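+
+??? example "Illustrative `mps-wrapper.sh` sketch"
+
+    The script published on the page linked above is the authoritative one. As a rough sketch only, a CUDA MPS wrapper
+    typically starts the MPS control daemon once per node (from local rank 0), runs the wrapped command, and shuts the
+    daemon down again afterwards. The pipe and log directories below are assumptions and may differ from the official
+    script.
+
+    ```bash
+    #!/bin/bash
+    # Per-node directories for the MPS control pipe and log files (illustrative paths).
+    export CUDA_MPS_PIPE_DIRECTORY=/tmp/nvidia-mps
+    export CUDA_MPS_LOG_DIRECTORY=/tmp/nvidia-log
+
+    # Only the first rank on each node starts the MPS control daemon.
+    if [[ $SLURM_LOCALID -eq 0 ]]; then
+        nvidia-cuda-mps-control -d
+    fi
+
+    # Give the daemon a moment to come up, then run the wrapped command (e.g. cp2k.psmp).
+    sleep 5
+    "$@"
+    status=$?
+
+    # The first rank on each node shuts the daemon down again.
+    if [[ $SLURM_LOCALID -eq 0 ]]; then
+        echo quit | nvidia-cuda-mps-control
+    fi
+    exit $status
+    ```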
+
+!!! warning
+
+    The `--cpu-bind=socket` option is necessary to get good performance.
+
+!!! warning
+
+    Each GH200 node has 4 modules, each of them composed of an ARM Grace CPU with 72 cores and an H200 GPU directly
+    attached to it. Please see [Alps hardware][ref-alps-hardware] for more information.
+    It is important that the number of MPI ranks passed to [slurm] with `--ntasks-per-node` is a multiple of 4.
+
+    ??? note
+
+        In the example above, we use 32 MPI ranks with 8 OpenMP threads, for a total of 64 cores per GPU and 256 cores
+        per node. Experiments have shown that CP2K performs and scales better when the number of MPI ranks is a power
+        of 2, even if some cores are left idle.
+
+??? info "Running regression tests"
+
+    If you want to run CP2K regression tests with the CP2K executable provided by the uenv, make sure to use the
+    version of the regression tests corresponding to the version of CP2K provided by the uenv. The regression test
+    data is sometimes adjusted, and using the wrong version of the regression tests can lead to test failures.
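+
+    As a sketch only, the matching tests can be obtained by checking out the corresponding release tag and pointing
+    the test runner at the uenv-provided binary. The tag, paths, and options below are assumptions; check
+    `tests/do_regtest.py --help` for the options of your CP2K version.
+
+    ```bash
+    # Check out the regression tests matching the CP2K version provided by the uenv (tag shown is an example).
+    git clone --branch v2025.1 https://github.com/cp2k/cp2k.git
+    cd cp2k
+
+    # Make the uenv-provided binary visible to the test runner.
+    mkdir -p exe/local
+    ln -s "$(which cp2k.psmp)" exe/local/cp2k.psmp
+
+    # Run the regression test suite with a modest amount of parallelism.
+    ./tests/do_regtest.py --mpiranks 4 --ompthreads 2 local psmp
+    ```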
+
+??? example "Scaling of `QS/H2O-1024` benchmark"
+
+    The [`QS/H2O-1024` benchmark](https://github.com/cp2k/cp2k/blob/master/benchmarks/QS/H2O-1024.inp) is a DFT
+    molecular dynamics simulation of liquid water. It relies on [DBCSR] for block sparse matrix-matrix multiplication.
+
+    All calculations were run with 32 MPI ranks per node, and 8 OpenMP threads per rank (best configuration for this
+    benchmark).
+
+    !!! note
+        `H2O-1024.inp` is the largest example of DFT molecular dynamics simulation of liquid water that fits on a
+        single GH200 node.
+
+    | Number of nodes | Wall time (s) | Speedup | Efficiency |
+    |:---------------:|:-------------:|:-------:|:----------:|
+    | 1               | 793.1         | 1.00    | 1.00       |
+    | 2               | 535.2         | 1.48    | 0.74       |
+    | 4               | 543.9         | 1.45    | 0.36       |
+    | 8               | 487.3         | 1.64    | 0.20       |
+    | 16              | 616.7         | 1.28    | 0.08       |
+
+    Scaling is not ideal on more than two nodes.
+
+??? example "Scaling of `QS_mp2_rpa/128-H2O/H2O-128-RI-MP2-TZ` benchmark"
+
+    The `QS_mp2_rpa/128-H2O/H2O-128-RI-MP2-TZ` benchmark is a straightforward modification of the
+    [`QS_mp2_rpa/64-H2O/H2O-64-RI-MP2-TZ` benchmark](https://github.com/cp2k/cp2k/blob/master/benchmarks/QS_mp2_rpa/64-H2O/H2O-64-RI-MP2-TZ.inp).
+
+    It is an RI-MP2 calculation of a water cluster with 128 atoms.
+
+    ??? info "Input file"
+
+        ```Fortran
+        &GLOBAL
+          PRINT_LEVEL MEDIUM
+          PROJECT H2O-128-RI-MP2-TZ
+          RUN_TYPE ENERGY
+        &END GLOBAL
+
+        &FORCE_EVAL
+          METHOD Quickstep
+          &DFT
+            BASIS_SET_FILE_NAME ./BASIS_H2O
+            POTENTIAL_FILE_NAME ./POTENTIAL_H2O
+            WFN_RESTART_FILE_NAME ./H2O-128-PBE-TZ-RESTART.wfn
+            &MGRID
+              CUTOFF 800
+              REL_CUTOFF 50
+            &END MGRID
+            &QS
+              EPS_DEFAULT 1.0E-12
+            &END QS
+            &SCF
+              EPS_SCF 1.0E-6
+              MAX_SCF 30
+              SCF_GUESS RESTART
+              &OT
+                MINIMIZER CG
+                PRECONDITIONER FULL_ALL
+              &END OT
+              &OUTER_SCF
+                EPS_SCF 1.0E-6
+                MAX_SCF 20
+              &END OUTER_SCF
+              &PRINT
+                &RESTART OFF
+                &END RESTART
+              &END PRINT
+            &END SCF
+            &XC
+              &HF
+                FRACTION 1.0
+                &INTERACTION_POTENTIAL
+                  CUTOFF_RADIUS 6.0
+                  POTENTIAL_TYPE TRUNCATED
+                  T_C_G_DATA ./t_c_g.dat
+                &END INTERACTION_POTENTIAL
+                &MEMORY
+                  MAX_MEMORY 16384
+                &END MEMORY
+                &SCREENING
+                  EPS_SCHWARZ 1.0E-8
+                  SCREEN_ON_INITIAL_P TRUE
+                &END SCREENING
+              &END HF
+              &WF_CORRELATION
+                MEMORY 1200
+                NUMBER_PROC 1
+                &INTEGRALS
+                  &WFC_GPW
+                    CUTOFF 300
+                    EPS_FILTER 1.0E-12
+                    EPS_GRID 1.0E-8
+                    REL_CUTOFF 50
+                  &END WFC_GPW
+                &END INTEGRALS
+                &RI_MP2
+                &END RI_MP2
+              &END WF_CORRELATION
+              &XC_FUNCTIONAL NONE
+              &END XC_FUNCTIONAL
+            &END XC
+          &END DFT
+          &SUBSYS
+            &CELL
+              ABC 15.6404 15.6404 15.6404
+            &END CELL
+            &KIND H
+              BASIS_SET cc-TZ
+              BASIS_SET RI_AUX RI-cc-TZ
+              POTENTIAL GTH-HF-q1
+            &END KIND
+            &KIND O
+              BASIS_SET cc-TZ
+              BASIS_SET RI_AUX RI-cc-TZ
+              POTENTIAL GTH-HF-q6
+            &END KIND
+            &TOPOLOGY
+              COORD_FILE_FORMAT XYZ
+              COORD_FILE_NAME ./H2O-128.xyz
+            &END TOPOLOGY
+          &END SUBSYS
+        &END FORCE_EVAL
+        ```
+
+    All calculations for this scaling test used 32 MPI ranks per node and 8 OpenMP threads per rank.
+    The smallest number of nodes necessary to run this calculation is 8.
+
+    | Number of nodes | Wall time (s) | Speedup | Efficiency |
+    |:---------------:|:-------------:|:-------:|:----------:|
+    | 8               | 2037.0        | 1.00    | 1.00       |
+    | 16              | 1096.2        | 1.85    | 0.92       |
+    | 32              | 611.5         | 3.33    | 0.83       |
+    | 64              | 410.5         | 4.96    | 0.62       |
+    | 128             | 290.9         | 7.00    | 0.43       |
+
+    MP2 calculations scale well on GH200, up to a large number of nodes ($> 50\%$ efficiency with 64 nodes).
+
+??? example "Scaling of `QS_mp2_rpa/128-H2O/H2O-128-RI-dRPA-TZ` benchmark"
+
+    The [`QS_mp2_rpa/128-H2O/H2O-128-RI-dRPA-TZ` benchmark](https://github.com/cp2k/cp2k/blob/master/benchmarks/QS_mp2_rpa/128-H2O/H2O-128-RI-dRPA-TZ.inp)
+    is an RPA energy calculation, traditionally used to benchmark the performance of the [COSMA] library.
+    It is a very large calculation, which requires at least 8 GH200 nodes to run.
+    The calculations were run with 16 MPI ranks per node and 16 OpenMP threads per rank.
+    For RPA workloads, a higher number of OpenMP threads per rank was beneficial.
+
+    | Number of nodes | Wall time (s) | Speedup | Efficiency |
+    |:---------------:|:-------------:|:-------:|:----------:|
+    | 8               | 575.4         | 1.00    | 1.00       |
+    | 16              | 465.8         | 1.23    | 0.61       |
+    | 32              | 281.1         | 2.04    | 0.51       |
+    | 64              | 205.3         | 2.80    | 0.35       |
+    | 128             | 185.8         | 3.09    | 0.19       |
+
+    This RPA input scales well up to 32 GH200 nodes.
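+
+    As an illustration, these runs correspond to the following changes with respect to the submission script shown
+    earlier on this page (everything else stays the same):
+
+    ```bash
+    #SBATCH --ntasks-per-node=16   # 16 MPI ranks per node, still a multiple of 4
+    #SBATCH --cpus-per-task=16     # 16 OpenMP threads per rank
+
+    export OMP_NUM_THREADS=$((SLURM_CPUS_PER_TASK - 1))
+    ```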
+ +### Running on Eiger + +On Eiger, a similar sbatch script can be used: + +```bash title="run_cp2k.sh" +#!/bin/bash -l +#SBATCH --job-name=cp2k-job +#SBATCH --time=00:30:00 # (1) +#SBATCH --nodes=1 +#SBATCH --ntasks-per-core=1 +#SBATCH --ntasks-per-node=32 # (2) +#SBATCH --cpus-per-task=4 # (3) +#SBATCH --account= +#SBATCH --hint=nomultithread +#SBATCH --hint=exclusive +#SBATCH --constraint=mc +#SBATCH --uenv= +#SBATCH --view=cp2k + +export OMP_NUM_THREADS=$((SLURM_CPUS_PER_TASK - 1)) # (4) + +ulimit -s unlimited +srun --cpu-bind=socket cp2k.psmp -i -o +``` + +1. Time format: `HH:MM:SS` + +2. Number of MPI ranks per node + +3. Number of CPUs per MPI ranks + +4. [OpenBLAS] spawns an extra thread, therefore it is necessary to set `OMP_NUM_THREADS` to `SLURM_CPUS_PER_TASK - 1` + for good performance. With [Intel MKL], this is not necessary and one can set `OMP_NUM_THREADS` to + `SLURM_CPUS_PER_TASK`. + +5. [DBCSR] relies on extensive JIT compilation and we store the cache in memory to avoid I/O overhead + +* Change to your project account name +* Change `` to the name (or path) of the actual CP2K uenv you want to use +* Change `` to the actual path to the CP2K data directory +* Change `` and `` to the actual input and output files + +!!! warning + + The `--cpu-bind=socket` option is necessary to get good performance. + +??? info "Running regression tests" + + If you want to run CP2K regression tests with the CP2K executable provided by the uenv, make sure to use the version + of the regression tests corresponding to the version of CP2K provided by the uenv. The regression test data is + sometimes adjusted, and using the wrong version of the regression tests can lead to test failures. + +## Building CP2K from Source + + +The [CP2K] uenv provides all the dependencies required to build [CP2K] from source, with several optional features +enabled. You can follow these steps to build [CP2K] from source: + +```bash +uenv start --view=develop # (1) + +cd # (2) + +mkdir build && cd build +CC=mpicc CXX=mpic++ FC=mpifort cmake \ + -GNinja \ + -DCMAKE_CUDA_HOST_COMPILER=mpicc \ # (3) + -DCP2K_USE_LIBXC=ON \ + -DCP2K_USE_LIBINT2=ON \ + -DCP2K_USE_SPGLIB=ON \ + -DCP2K_USE_ELPA=ON \ + -DCP2K_USE_SPLA=ON \ + -DCP2K_USE_SIRIUS=ON \ + -DCP2K_USE_COSMA=ON \ + -DCP2K_USE_PLUMED=ON \ + -DCP2K_USE_DFTD4=ON \ + -DCP2K_USE_DLAF=ON \ + -DCP2K_USE_ACCEL=CUDA -DCP2K_WITH_GPU=H100 \ # (4) + .. + +ninja -j 32 +``` + +1. Start the CP2K uenv and load the `develop` view (which provides all the necessary dependencies) + +2. Go to the CP2K source directory + +3. The `H100` option enables the `sm_90` architecture for the CUDA backend + +!!! note "Eiger: `libxsmm`" + + On `x86` we deploy with `libxmm`. Add `-DCP2K_USE_LIBXSMM=ON` to the CMake invocation to use `libxsmm`. + +??? note "Eiger: Intel MKL (before `cp2k@2025.1`)" + + On `x86` we deployed with `intel-oneapi-mkl` before `cp2k@2025.1`. + If you are using a pre-`cp2k@2025.1` uenv, add `-DCP2K_SCALAPACK_VENDOR=MKL` to the CMake invocation to find MKL. + +??? note "CUDA architecture for `cp2k@2024.1` and earlier" + + `cp2k@2024.1` (and earlier) does not support compiling for `cuda_arch=90`. Use `-DCP2K_WITH_GPU=A100` instead, + which enables the `sm_80` architecture. + +See [manual.cp2k.org/CMake] for more details. + +### Known issues + + +#### DBCSR GPU scaling + +On the GH200 architecture, it has been observed that the GPU accelerated version of [DBCSR] does not perform optimally in some cases. 
+For example, in the `QS/H2O-1024` benchmark above, CP2K does not scale well beyond 2 nodes.
+The CPU implementation of DBCSR does not suffer from this. A workaround was implemented in DBCSR to allow switching
+GPU acceleration on/off with an environment variable:
+
+```bash
+export DBCSR_RUN_ON_GPU=0
+```
+
+While GPU acceleration is very good on a small number of nodes, the CPU implementation scales better.
+Therefore, for CP2K jobs running on a large number of nodes, it is worth investigating the use of the `DBCSR_RUN_ON_GPU`
+environment variable.
+
+Some niche application cases such as the `QS_low_scaling_postHF` benchmarks only run efficiently with the CPU version
+of DBCSR. Generally, if the function `dbcsr_multiply_generic` takes a significant portion of the timing report
+(at the end of the CP2K output file), it is worth investigating the effect of the `DBCSR_RUN_ON_GPU` environment
+variable.
+
+#### CUDA grid backend with high angular momenta basis sets
+
+The CP2K grid CUDA backend is currently buggy on Alps. Using basis sets with high angular momenta ($l \ge 3$)
+results in slow calculations, especially for force calculations with meta-GGA functionals.
+
+As a workaround, you can disable CUDA acceleration for the grid backend:
+
+```Fortran
+&GLOBAL
+  &GRID
+    BACKEND CPU
+  &END GRID
+&END GLOBAL
+```
+
+??? info "Fix available upon request"
+
+    A fix for this issue for the HIP backend is currently being tested by CSCS engineers. If you would like to test it,
+    please contact us and we will be able to provide the source code. The fix will eventually land on the upstream
+    [CP2K] repository.
+
+[CP2K]: https://www.cp2k.org/
+[CP2K Features]: https://www.cp2k.org/features
+[COSMA]: https://github.com/eth-cscs/COSMA
+[DLA-Future]: https://github.com/eth-cscs/DLA-Future
+[manual.cp2k.org/CMake]: https://manual.cp2k.org/trunk/getting-started/CMake.html
+[DBCSR]: https://cp2k.github.io/dbcsr/develop/
+[SIRIUS]: https://github.com/electronic-structure/SIRIUS
+[dftd4]: https://dftd4.readthedocs.io/en/latest/
+[libint]: https://github.com/evaleev/libint
+[PLUMED]: https://www.plumed.org
+[ELPA]: https://elpa.mpcdf.mpg.de
+[spla]: https://github.com/eth-cscs/spla
+[Spglib]: https://spglib.readthedocs.io/en/stable/
+[Libxc]: https://libxc.gitlab.io
+[FFTW]: https://www.fftw.org
+[ScaLAPACK]: https://www.netlib.org/scalapack/
+[OpenBLAS]: http://www.openmathlib.org/OpenBLAS/
+[Intel MKL]: https://www.intel.com/content/www/us/en/developer/tools/oneapi/onemkl.html
+[Cray MPICH]: https://docs.nersc.gov/development/programming-models/mpi/cray-mpich/
+[slurm]: https://slurm.schedmd.com/
+[CUDA MPS]: https://docs.nvidia.com/deploy/mps/index.html
diff --git a/docs/software/sciapps/namd.md b/docs/software/sciapps/namd.md
index a0e71a75..8de193ae 100644
--- a/docs/software/sciapps/namd.md
+++ b/docs/software/sciapps/namd.md
@@ -6,6 +6,11 @@
 !!! danger "Licensing Terms and Conditions"
     [NAMD] is distributed free of charge for research purposes only and not for commercial use: users must agree to the [NAMD license] in order to use it at [CSCS]. Users agree to acknowledge use of [NAMD] in any reports or publications of results obtained with the Software (see [NAMD Homepage] for details).
 
+!!! note "User Environments"
+
+    [NAMD] is provided on [ALPS](#ref-alps-platforms) via [User Environments](#ref-tool-uenv) (UENVs). Please have a look at
+    the [User Environments documentation](#ref-tool-uenv) for more information about UENVs and how to use them.
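+
+    As a quick sketch (the image name and version below are placeholders, not necessarily the ones deployed), a NAMD
+    uenv can be located and started like this:
+
+    ```bash
+    uenv image find namd                             # list the NAMD images available on the system
+    uenv start --view=namd-single-node namd/3.0:v1   # start the uenv with the single-node view loaded
+    ```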
+
 [NAMD] is provided in two flavours on [CSCS] systems:
 
 * Single-node build
 * Multi-node build
@@ -32,6 +37,7 @@ The following sbatch script shows how to run NAMD on a single node with 4 GPUs:
 #!/bin/bash
 #SBATCH --job-name="namd-example"
 #SBATCH --time=00:10:00
+#SBATCH --account=<ACCOUNT>
 #SBATCH --nodes=1 (1)
 #SBATCH --ntasks-per-node=1 (2)
 #SBATCH --cpus-per-task=288
@@ -49,6 +55,7 @@ srun namd3 +p 29 +pmeps 5 +setcpuaffinity +devices 0,1,2,3
 4. Load the NAMD UENV (UENV name or path to the UENV)
 5. Load the `namd-single-node` view
 
+* Change `<ACCOUNT>` to your project account
 * Change `<NAMD_UENV>` to the name (or path) of the actual NAMD UENV you want to use
 * Change `<NAMD_CONFIG_FILE>` to the name (or path) of the NAMD configuration file for your simulation
 * Make sure you set `+p`, `+pmeps`, and other NAMD options optimally for your calculation
diff --git a/mkdocs.yml b/mkdocs.yml
index 9f325060..3b648f0b 100644
--- a/mkdocs.yml
+++ b/mkdocs.yml
@@ -160,8 +160,6 @@ markdown_extensions:
   # for captioning images
   - pymdownx.blocks.caption
 
-# disable mathjax until the "GET /javascripts/mathjax.js HTTP/1.1" code 404 errors are fixed
-#extra_javascript:
-#  - javascripts/mathjax.js
-#  - https://unpkg.com/mathjax@3/es5/tex-mml-chtml.js
-
+extra_javascript:
+  - javascripts/mathjax.js
+  - https://unpkg.com/mathjax@3/es5/tex-mml-chtml.js