From 397a28030563e164d95675302dff13df53cacf46 Mon Sep 17 00:00:00 2001
From: Mikael Simberg
Date: Tue, 8 Jul 2025 16:45:07 +0200
Subject: [PATCH 1/4] Fix more typos

---
 .github/actions/spelling/allow.txt    | 3 +++
 docs/running/slurm.md                 | 2 +-
 docs/services/cicd.md                 | 6 +++---
 docs/software/container-engine/run.md | 2 +-
 4 files changed, 8 insertions(+), 5 deletions(-)

diff --git a/.github/actions/spelling/allow.txt b/.github/actions/spelling/allow.txt
index 844c4877..02641e95 100644
--- a/.github/actions/spelling/allow.txt
+++ b/.github/actions/spelling/allow.txt
@@ -124,6 +124,8 @@ prioritised
 proactively
 pytorch
 quickstart
+runtime
+runtimes
 santis
 sbatch
 screenshot
@@ -140,6 +142,7 @@ subtables
 supercomputing
 superlu
 sysadmin
+tarball
 tcl
 tcsh
 testuser
diff --git a/docs/running/slurm.md b/docs/running/slurm.md
index 0593b703..aabb099e 100644
--- a/docs/running/slurm.md
+++ b/docs/running/slurm.md
@@ -19,7 +19,7 @@ Refer to the [Quick Start User Guide](https://slurm.schedmd.com/quickstart.html)
 
 -   :fontawesome-solid-mountain-sun: __Node sharing__
 
-    Guides on how to effectively use all resouces on nodes by running more than one job per node.
+    Guides on how to effectively use all resources on nodes by running more than one job per node.
 
     [:octicons-arrow-right-24: Node sharing][ref-slurm-sharing]
diff --git a/docs/services/cicd.md b/docs/services/cicd.md
index 0aa173e3..4e94513d 100644
--- a/docs/services/cicd.md
+++ b/docs/services/cicd.md
@@ -783,7 +783,7 @@ This is the clone URL of the registered project, i.e. this is not the clone URL
 ### `ARCH`
 
 value: `x86_64` or `aarch64`
 
-This is the architecture of the runner. It is either an ARM64 machine, i.e. `aarch64`, or a traditinal `x86_64` machine.
+This is the architecture of the runner. It is either an ARM64 machine, i.e. `aarch64`, or a traditional `x86_64` machine.
 
 ## Runners reference
@@ -867,7 +867,7 @@ The value must be a valid JSON array, where each entry is a string.
 It is almost always correct to wrap the full value in single-quotes.
 
-It is also possible to define the argument's values as an entry in `variables`, and then reference in `DOCKER_BUILD_ARGS` only the variables that you want to expose to the build process, i.e. sth like this:
+It is also possible to define the argument's values as an entry in `variables`, and then reference in `DOCKER_BUILD_ARGS` only the variables that you want to expose to the build process, i.e. something like this:
 ```yaml
 my job:
   extends: .container-builder-cscs-gh200
@@ -987,7 +987,7 @@ This tag is mandatory.
 ##### `GIT_STRATEGY`
 
 Optional variable, default is `none`
 
-This is a [default Gitlab variable](https://docs.gitlab.com/ee/ci/runners/configure_runners.html#git-strategy), but mentioned here explicitly, because very often you do not need to clone the repository sourcecode when you run your containerized application.
+This is a [default Gitlab variable](https://docs.gitlab.com/ee/ci/runners/configure_runners.html#git-strategy), but mentioned here explicitly, because very often you do not need to clone the repository source code when you run your containerized application.
 
 The default is `none`, and you must explicitly set it to `fetch` or `clone` to fetch the source code by the runner.
diff --git a/docs/software/container-engine/run.md b/docs/software/container-engine/run.md
index 186f6108..5b85c4a0 100644
--- a/docs/software/container-engine/run.md
+++ b/docs/software/container-engine/run.md
@@ -205,7 +205,7 @@ Directories outside a container can be *mounted* inside a container so that the
 
 !!! note
     The source (before `:`) should be present on the cluster: the destination (after `:`) doesn't have to be inside the container.
 
-See [the EDF reference][ref-ce-edf-reference] for the full specifiction of the `mounts` EDF entry.
+See [the EDF reference][ref-ce-edf-reference] for the full specification of the `mounts` EDF entry.
 
 [](){#ref-ce-run-mounting-squashfs}

From e4c1cc739b7f2bee6db327eadd56fadc1705ba3a Mon Sep 17 00:00:00 2001
From: Mikael Simberg
Date: Tue, 8 Jul 2025 16:47:13 +0200
Subject: [PATCH 2/4] Remove non-alpha-in-dictionary spell checker warning

---
 .github/workflows/spelling.yaml | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/.github/workflows/spelling.yaml b/.github/workflows/spelling.yaml
index 50f623e4..1472efdc 100644
--- a/.github/workflows/spelling.yaml
+++ b/.github/workflows/spelling.yaml
@@ -36,7 +36,7 @@ jobs:
         only_check_changed_files: 1
         post_comment: 1
         use_magic_file: 1
-        warnings: bad-regex,binary-file,deprecated-feature,large-file,limited-references,no-newline-at-eof,noisy-file,non-alpha-in-dictionary,token-is-substring,unexpected-line-ending,whitespace-in-dictionary,minified-file,unsupported-configuration,no-files-to-check
+        warnings: bad-regex,binary-file,deprecated-feature,large-file,limited-references,no-newline-at-eof,noisy-file,token-is-substring,unexpected-line-ending,whitespace-in-dictionary,minified-file,unsupported-configuration,no-files-to-check
         use_sarif: ${{ (!github.event.pull_request || (github.event.pull_request.head.repo.full_name == github.repository)) && 1 }}
         extra_dictionary_limit: 20
         extra_dictionaries:

From 7fadc712cb25d1d1a67c5046481c7c437b828e60 Mon Sep 17 00:00:00 2001
From: Mikael Simberg
Date: Tue, 8 Jul 2025 16:54:51 +0200
Subject: [PATCH 3/4] More typos and whitelist

---
 .github/actions/spelling/allow.txt | 23 +++++++++++++++++++++--
 docs/running/slurm.md              |  6 +++---
 docs/services/cicd.md              |  6 +++---
 3 files changed, 27 insertions(+), 8 deletions(-)

diff --git a/.github/actions/spelling/allow.txt b/.github/actions/spelling/allow.txt
index 02641e95..a0d9e7b6 100644
--- a/.github/actions/spelling/allow.txt
+++ b/.github/actions/spelling/allow.txt
@@ -16,6 +16,7 @@ CXI
 Ceph
 Containerfile
 DNS
+Dockerfiles
 EDF
 EDFs
 EDFs
@@ -57,11 +58,9 @@ MFA
 MLP
 MNDO
 MPICH
-MPS
 MeteoSwiss
 NAMD
 NICs
-NVIDIA
 NVMe
 OTP
 OTPs
@@ -94,6 +93,8 @@ XDG
 aarch
 aarch64
 acl
+autodetection
+baremetal
 biomolecular
 bristen
 bytecode
@@ -104,26 +105,43 @@ concretizer
 containerised
 cpe
 cscs
+cuda
 customised
 diagonalisation
+dockerhub
+dotenv
 eiger
+epyc
 filesystems
+fontawesome
+gitlab
+gpu
 groundstate
 ijulia
 inodes
 iopsstor
+jfrog
 lexer
 libfabric
 miniconda
 mpi
+mps
 multitenancy
+netrc
 nsight
+numa
+nvidia
+octicons
+oom
 podman
+preinstalled
 prgenv
 prioritised
 proactively
+pyfirecrest
 pytorch
 quickstart
+rocm
 runtime
 runtimes
 santis
@@ -131,6 +149,7 @@ sbatch
 screenshot
 slurm
 smartphone
+sphericart
 squashfs
 srun
 ssh
diff --git a/docs/running/slurm.md b/docs/running/slurm.md
index aabb099e..860b3170 100644
--- a/docs/running/slurm.md
+++ b/docs/running/slurm.md
@@ -68,7 +68,7 @@ $ sbatch --account=g123 ./job.sh
 !!! note
     The flags `--account` and `-Cmc` that were required on the old [Eiger][ref-cluster-eiger] cluster are no longer required.
 
-## Prioritization and scheduling
+## Prioritisation and scheduling
 
 Job priorities are determined based on each project's resource usage relative to its quarterly allocation, as well as in comparison to other projects.
 An aging factor is also applied to each job in the queue to ensure fairness over time.
@@ -219,7 +219,7 @@ The build generates the following executables:
 
     1. Test GPU affinity: note how all 4 ranks see the same 4 GPUs.
 
-    2. Test GPU affinity: note how the `--gpus-per-task=1` parameter assings a unique GPU to each rank.
+    2. Test GPU affinity: note how the `--gpus-per-task=1` parameter assigns a unique GPU to each rank.
 
 !!! info "Quick affinity checks"
@@ -491,7 +491,7 @@ rank 7 @ nid002199: thread 0 -> cores [112:127]
 
 In the above examples all threads on each -- we are effectively allowing the OS to schedule the threads on the available set of cores as it sees fit.
 This often gives the best performance, however sometimes it is beneficial to bind threads to explicit cores.
-The OpenMP threading runtime provides additional options for controlling the pinning of threads to the cores assinged to each MPI rank.
+The OpenMP threading runtime provides additional options for controlling the pinning of threads to the cores assigned to each MPI rank.
 
 Use the `--omp` flag with `affinity.mpi` to get more detailed information about OpenMP thread affinity.
 For example, four MPI ranks on one node with four cores and four OpenMP threads:
diff --git a/docs/services/cicd.md b/docs/services/cicd.md
index 4e94513d..7a684395 100644
--- a/docs/services/cicd.md
+++ b/docs/services/cicd.md
@@ -718,7 +718,7 @@ Private projects will always get as notification a link to the CSCS pipeline ove
 To view the CSCS pipeline overview for a public project and restart / cancel jobs, follow these steps:
 
 * Copy the web link of the CSCS CI status of your project and remove the from the link the `type=gitlab`.
-* Alternativily, assemble the link yourself, it has the form `https://cicd-ext-mw.cscs.ch/ci/pipeline/results///` (the IDs can be found on the Gitlab page of your mirror project).
+* Alternatively, assemble the link yourself, it has the form `https://cicd-ext-mw.cscs.ch/ci/pipeline/results///` (the IDs can be found on the Gitlab page of your mirror project).
 * Click on `Login to restart jobs` at the bottom right and login with your CSCS credentials
 * Click `Cancel running` or `Restart jobs` or cancel individual jobs (button next to job's name)
 * Everybody that has at least *Manager* access can restart / cancel jobs (access level is managed on the CI setup page in the Admin section)
@@ -819,7 +819,7 @@ Accepted variables are documented at [Slurm's srun man page](https://slurm.sched
 !!! Warning "SLURM_TIMELIMIT"
     Special attention should go the variable `SLURM_TIMELIMIT`, which sets the maximum time of your Slurm job.
-    You will be billed the nodehours that your CI jobs are spending on the cluster, i.e. you want to set the `SLURM_TIMELIMIT` to the maximum time that you expect the job to run.
+    You will be billed the node hours that your CI jobs are spending on the cluster, i.e. you want to set the `SLURM_TIMELIMIT` to the maximum time that you expect the job to run.
     You should also pay attention to wrap the value in quotes, because the gitlab-runner interprets the time differently than Slurm, when it is not wrapped in quotes, i.e. This is correct:
     ```
     SLURM_TIMELIMIT: "00:30:00"
     ```
@@ -1323,7 +1323,7 @@ The easiest way to use the FirecREST scheduler of ReFrame is to use the configur
 In case you want to run ReFrame for a system that is not already available in this directory, please open a ticket to the Service Desk and we will add it or help you update one of the existing ones.
 
 Something you should be aware of when running with this scheduler is that ReFrame will not have direct access to the filesystem of the cluster so the stage directory will need to be kept in sync through FirecREST.
-It is recommended to try to clean the stage directory whenever possible with the [postrun_cmds](https://reframe-hpc.readthedocs.io/en/stable/regression_test_api.html#reframe.core.pipeline.RegressionTest.postrun_cmds) and [postbuild_cmds](https://reframe-hpc.readthedocs.io/en/stable/regression_test_api.html#reframe.core.pipeline.RegressionTest.postbuild_cmds) and to avoid [autodetection of the processor](https://reframe-hpc.readthedocs.io/en/stable/config_reference.html#config.systems.partitions.processor) in each run.
+It is recommended to try to clean the stage directory whenever possible with the [`postrun_cmds`](https://reframe-hpc.readthedocs.io/en/stable/regression_test_api.html#reframe.core.pipeline.RegressionTest.postrun_cmds) and [`postbuild_cmds`](https://reframe-hpc.readthedocs.io/en/stable/regression_test_api.html#reframe.core.pipeline.RegressionTest.postbuild_cmds) and to avoid [autodetection of the processor](https://reframe-hpc.readthedocs.io/en/stable/config_reference.html#config.systems.partitions.processor) in each run.
 Normally ReFrame stores these files in `~/.reframe/topology/{system}-{part}/processor.json`, but you get a "clean" runner every time.
 You could either add them in the configuration files or store the files in the first run and copy them to the right directory before ReFrame runs.

From 6cb6d85a589f7bc467e1f602b4aef4cec27b8492 Mon Sep 17 00:00:00 2001
From: Mikael Simberg
Date: Tue, 8 Jul 2025 16:58:00 +0200
Subject: [PATCH 4/4] More typos and whitelist

---
 .github/actions/spelling/allow.txt | 2 ++
 docs/running/slurm.md              | 2 +-
 2 files changed, 3 insertions(+), 1 deletion(-)

diff --git a/.github/actions/spelling/allow.txt b/.github/actions/spelling/allow.txt
index a0d9e7b6..6b920419 100644
--- a/.github/actions/spelling/allow.txt
+++ b/.github/actions/spelling/allow.txt
@@ -107,6 +107,7 @@ cpe
 cscs
 cuda
 customised
+dcomex
 diagonalisation
 dockerhub
 dotenv
@@ -136,6 +137,7 @@ oom
 podman
 preinstalled
 prgenv
+prioritisation
 prioritised
 proactively
 pyfirecrest
diff --git a/docs/running/slurm.md b/docs/running/slurm.md
index 860b3170..62c6da99 100644
--- a/docs/running/slurm.md
+++ b/docs/running/slurm.md
@@ -580,7 +580,7 @@ The approach is to:
 
 1. first allocate all the resources on each node to the job;
 2. then subdivide those resources at each invocation of srun.
 
-If Slurm believes that a request for resources (cores, gpus, memory) overlaps with what another step has already allocated, it will defer the execution until the resources are relinquished.
+If Slurm believes that a request for resources (cores, GPUs, memory) overlaps with what another step has already allocated, it will defer the execution until the resources are relinquished.
 This must be avoided.
 First ensure that *all* resources are allocated to the whole job with the following preamble: