diff --git a/.github/actions/spelling/allow.txt b/.github/actions/spelling/allow.txt
index 4a5ac151..cbbae743 100644
--- a/.github/actions/spelling/allow.txt
+++ b/.github/actions/spelling/allow.txt
@@ -106,6 +106,7 @@ SSHService
 STMV
 Scopi
 Signalkuppe
+THP
 TOTP
 UANs
 UIs
@@ -174,6 +175,8 @@ gromos
 groundstate
 gsl
 hdf
+hugepages
+hugetlbfs
 hotmail
 huggingface
 hwloc
@@ -194,6 +197,7 @@ lapackpp
 lexer
 lexers
 libfabric
+libhugetlbfs
 libint
 libtree
 libxc
diff --git a/docs/clusters/daint.md b/docs/clusters/daint.md
index c2836194..c59f2e79 100644
--- a/docs/clusters/daint.md
+++ b/docs/clusters/daint.md
@@ -151,7 +151,7 @@ Daint can also be accessed using [FirecREST][ref-firecrest] at the `https://api.
 The [access-counter-based memory migration feature](https://developer.nvidia.com/blog/cuda-toolkit-12-4-enhances-support-for-nvidia-grace-hopper-and-confidential-computing/#access-counter-based_migration_for_nvidia_grace_hopper_memory) in the NVIDIA driver for Grace Hopper is disabled to address performance issues affecting NCCL-based workloads (e.g. LLM training)
 
 ??? note "NVIDIA boost slider"
-    Added an option to enable the NVIDIA boost slider (vboost) via Slurm using the `-C nvidia_vboost_enabled` flag.
+    Added [an option to enable the NVIDIA boost slider (vboost)][ref-slurm-features-vboost] via Slurm using the `-C nvidia_vboost_enabled` flag.
     This feature, disabled by default, may increase GPU frequency and performance while staying within the power budget
 
 ??? note "Enroot update"
diff --git a/docs/running/slurm.md b/docs/running/slurm.md
index 866cef4c..3b2ff864 100644
--- a/docs/running/slurm.md
+++ b/docs/running/slurm.md
@@ -237,6 +237,90 @@ The build generates the following executables:
 You can also check GPU affinity by inspecting the value of the `CUDA_VISIBLE_DEVICES` environment variable.
 
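The `CUDA_VISIBLE_DEVICES` check above can be scripted; a minimal sketch, with a simulated value since under Slurm each task inherits its own assignment:

```shell
# Count the GPUs visible to the current task.
# CUDA_VISIBLE_DEVICES is simulated here; in a real job step Slurm sets
# it per task, so drop the assignment below.
CUDA_VISIBLE_DEVICES="0,1"
ngpus=$(printf '%s\n' "$CUDA_VISIBLE_DEVICES" | awk -F',' '{print NF}')
echo "visible GPUs: $ngpus"
```

Running this under `srun` (without the simulated assignment) confirms that each task sees only its intended GPUs.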
+[](){#ref-slurm-features}
+## Slurm features
+
+Slurm allows specifying [constraints](https://slurm.schedmd.com/sbatch.html#OPT_constraint) for jobs, which can be used to change the features enabled on the nodes of a job.
+CSCS implements a few custom features, described below, that can be selected on certain clusters.
+To check which features are available on a cluster, for example on the `normal` partition, use `sinfo`:
+
+```console
+$ sinfo --partition normal --format %b
+ACTIVE_FEATURES
+gh,gpu,thp_never,thp_always,thp_madvise,nvidia_vboost_enabled,nvidia_vboost_disabled
+```
+
+One or more constraints can be selected using the `--constraint`/`-C` flag of `sbatch` or `srun`.
+Quote the constraint expression on the command line, since characters such as `&` are otherwise interpreted by the shell:
+
+```bash
+sbatch --constraint 'thp_never&nvidia_vboost_enabled' batch.sh
+```
+
+[](){#ref-slurm-features-thp}
+### Transparent hugepages
+
+!!! info "The THP Slurm feature is only available on [GH200 nodes][ref-alps-gh200-node]"
+
+[Transparent hugepages (THP)](https://www.kernel.org/doc/html/v6.17/admin-guide/mm/transhuge.html) are a Linux kernel feature that automatically coalesces pages into huge pages, without the user application explicitly asking for hugepages:
+
+> Performance critical computing applications dealing with large memory working sets are already running on top of libhugetlbfs and in turn hugetlbfs.
+> Transparent HugePage Support (THP) is an alternative mean of using huge pages for the backing of virtual memory with huge pages that supports the automatic promotion and demotion of page sizes and without the shortcomings of hugetlbfs.
+
+While this feature generally improves performance, we have observed degraded application performance with THP enabled, due to the page coalescing blocking progress on certain operations.
+An example of this is ICON, a latency-sensitive application where small delays can cause large performance drops.
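For batch jobs the constraint can equally be given as an `#SBATCH` directive; a sketch, where the application name is a placeholder:

```bash
#!/bin/bash
#SBATCH --constraint thp_never
#SBATCH --nodes 1
# Slurm, not the shell, parses #SBATCH directives, so no quoting is
# needed here even when combining features with '&'.
srun ./my_latency_sensitive_app
```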
+
+THP support is enabled by default, and the current setting can be checked with:
+
+```console
+$ cat /sys/kernel/mm/transparent_hugepage/enabled
+[always] madvise never
+```
+
+A detailed explanation of how the different options behave can be found in the [THP documentation](https://www.kernel.org/doc/html/v6.17/admin-guide/mm/transhuge.html#global-thp-controls).
+
+The available Slurm features to select the THP mode are listed below:
+
+| Kernel setting | Slurm constraint       |
+|----------------|------------------------|
+| `always`       | `thp_always` (default) |
+| `madvise`      | `thp_madvise`          |
+| `never`        | `thp_never`            |
+
+[](){#ref-slurm-features-vboost}
+### NVIDIA vboost
+
+!!! info "The NVIDIA vboost Slurm feature is only available on [GH200 nodes][ref-alps-gh200-node]"
+
+The [NVIDIA NeMo documentation](https://docs.nvidia.com/nemo-framework/user-guide/latest/performance/performance-guide.html#gpu-core-clock-optimization) describes the vboost feature as:
+
+> NVIDIA GPUs support a GPU core clock boost mode, which increases the core clock rate by reducing the off-chip memory clock rate.
+> This is particularly beneficial for LLMs, which are typically compute throughput-bound.
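The active mode is the bracketed entry in the sysfs file; a small sketch of extracting it from a job script, using a sample string in place of the real file:

```shell
# Extract the active THP mode (the bracketed entry).
# A sample string stands in for /sys/kernel/mm/transparent_hugepage/enabled.
setting="[always] madvise never"
active=$(printf '%s\n' "$setting" | grep -o '\[[a-z]*\]' | tr -d '[]')
echo "$active"
```

On a compute node, replace the sample string with the contents of the sysfs file to log which THP mode a job actually ran with.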
+
+The vboost slider is at `0` by default, and the current value can be checked with `nvidia-smi`:
+
+```console
+$ nvidia-smi boost-slider --list
++-------------------------------------------------+
+| GPU Boost Slider                                |
+| GPU   Slider      Max Value       Current Value |
+|=================================================|
+|   0   vboost      4               0             |
++-------------------------------------------------+
+|   1   vboost      4               0             |
++-------------------------------------------------+
+|   2   vboost      4               0             |
++-------------------------------------------------+
+|   3   vboost      4               0             |
++-------------------------------------------------+
+```
+
+The slider can be set to `1` using the `nvidia_vboost_enabled` feature:
+
+| vboost setting | Slurm constraint                   |
+|----------------|------------------------------------|
+| `0`            | `nvidia_vboost_disabled` (default) |
+| `1`            | `nvidia_vboost_enabled`            |
+
 [](){#ref-slurm-gh200}
 ## NVIDIA GH200 GPU Nodes
diff --git a/docs/software/ml/pytorch.md b/docs/software/ml/pytorch.md
index dc272d3e..2239496d 100644
--- a/docs/software/ml/pytorch.md
+++ b/docs/software/ml/pytorch.md
@@ -185,7 +185,7 @@ For further details on execution logic, job monitoring and data management, plea
     * Extensively evaluate all possible parallelization dimensions, including data-, tensor- and pipeline parallelism (including virtual pipeline parallelism) and more, when available.
       Identify storage-related bottlenecks by isolating data loading/generation operations into a separate benchmark.
-    * Disabling transparent huge pages and enabling the Nvidia [vboost](https://docs.nvidia.com/nemo-framework/user-guide/latest/performance/performance-guide.html#gpu-core-clock-optimization) feature has been observed to improve performance in large-scale LLM training in Megatron-LM.
This can be achieved by adding these constraints to the sbatch script:
+    * [Disabling transparent huge pages][ref-slurm-features-thp] and [enabling the Nvidia vboost feature][ref-slurm-features-vboost] have been observed to improve performance in large-scale LLM training in Megatron-LM. This can be achieved by adding these constraints to the sbatch script:
 
       ```bash
       #SBATCH -C thp_never&nvidia_vboost_enabled
       ```
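To confirm inside a job that the requested slider value took effect, the `nvidia-smi boost-slider --list` table shown in the Slurm features section can be parsed; a sketch over sample rows (on a compute node, pipe the real command output instead):

```shell
# Print the current vboost value per GPU from boost-slider table rows.
# Sample rows stand in for `nvidia-smi boost-slider --list` output;
# fields are: | GPU Slider MaxValue CurrentValue |.
sample='|   0   vboost      4               0             |
|   1   vboost      4               1             |'
printf '%s\n' "$sample" | awk '/vboost/ { print "GPU " $2 ": vboost=" $5 }'
```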