28 changes: 26 additions & 2 deletions .github/actions/spelling/allow.txt
@@ -16,6 +16,7 @@ CXI
Ceph
Containerfile
DNS
Dockerfiles
EDF
EDFs
EDFs
@@ -57,11 +58,9 @@ MFA
MLP
MNDO
MPICH
MPS
MeteoSwiss
NAMD
NICs
NVIDIA
NVMe
OTP
OTPs
@@ -94,6 +93,8 @@ XDG
aarch
aarch64
acl
autodetection
baremetal
biomolecular
bristen
bytecode
@@ -104,31 +105,53 @@ concretizer
containerised
cpe
cscs
cuda
customised
dcomex
diagonalisation
dockerhub
dotenv
eiger
epyc
filesystems
fontawesome
gitlab
gpu
groundstate
ijulia
inodes
iopsstor
jfrog
lexer
libfabric
miniconda
mpi
mps
multitenancy
netrc
nsight
numa
nvidia
octicons
oom
podman
preinstalled
prgenv
prioritisation
prioritised
proactively
pyfirecrest
pytorch
quickstart
rocm
runtime
runtimes
santis
sbatch
screenshot
slurm
smartphone
sphericart
squashfs
srun
ssh
@@ -140,6 +163,7 @@ subtables
supercomputing
superlu
sysadmin
tarball
tcl
tcsh
testuser
2 changes: 1 addition & 1 deletion .github/workflows/spelling.yaml
@@ -36,7 +36,7 @@ jobs:
only_check_changed_files: 1
post_comment: 1
use_magic_file: 1
warnings: bad-regex,binary-file,deprecated-feature,large-file,limited-references,no-newline-at-eof,noisy-file,non-alpha-in-dictionary,token-is-substring,unexpected-line-ending,whitespace-in-dictionary,minified-file,unsupported-configuration,no-files-to-check
warnings: bad-regex,binary-file,deprecated-feature,large-file,limited-references,no-newline-at-eof,noisy-file,token-is-substring,unexpected-line-ending,whitespace-in-dictionary,minified-file,unsupported-configuration,no-files-to-check
use_sarif: ${{ (!github.event.pull_request || (github.event.pull_request.head.repo.full_name == github.repository)) && 1 }}
extra_dictionary_limit: 20
extra_dictionaries:
10 changes: 5 additions & 5 deletions docs/running/slurm.md
@@ -19,7 +19,7 @@ Refer to the [Quick Start User Guide](https://slurm.schedmd.com/quickstart.html)

- :fontawesome-solid-mountain-sun: __Node sharing__

Guides on how to effectively use all resouces on nodes by running more than one job per node.
Guides on how to effectively use all resources on nodes by running more than one job per node.

[:octicons-arrow-right-24: Node sharing][ref-slurm-sharing]

@@ -68,7 +68,7 @@ $ sbatch --account=g123 ./job.sh
!!! note
The flags `--account` and `-Cmc` that were required on the old [Eiger][ref-cluster-eiger] cluster are no longer required.

## Prioritization and scheduling
## Prioritisation and scheduling

Job priorities are determined based on each project's resource usage relative to its quarterly allocation, as well as in comparison to other projects.
An aging factor is also applied to each job in the queue to ensure fairness over time.
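For example, the factors feeding into a pending job's priority can be inspected with Slurm's `sprio` utility (the job ID below is a placeholder):

```console
$ sprio -j 123456        # show the age, fair-share and other priority components of job 123456
$ squeue --me --start    # show Slurm's current estimate of when your pending jobs will start
```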
@@ -219,7 +219,7 @@ The build generates the following executables:

1. Test GPU affinity: note how all 4 ranks see the same 4 GPUs.

2. Test GPU affinity: note how the `--gpus-per-task=1` parameter assings a unique GPU to each rank.
2. Test GPU affinity: note how the `--gpus-per-task=1` parameter assigns a unique GPU to each rank.
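For instance, a minimal check of which GPUs each rank can see, without the affinity test programs, is to print `CUDA_VISIBLE_DEVICES` per rank (a sketch; node and task counts are illustrative):

```console
$ srun -N1 -n4 bash -c 'echo "rank $SLURM_PROCID: CUDA_VISIBLE_DEVICES=$CUDA_VISIBLE_DEVICES"'
$ srun -N1 -n4 --gpus-per-task=1 bash -c 'echo "rank $SLURM_PROCID: CUDA_VISIBLE_DEVICES=$CUDA_VISIBLE_DEVICES"'
```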

!!! info "Quick affinity checks"

@@ -491,7 +491,7 @@ rank 7 @ nid002199: thread 0 -> cores [112:127]
In the above examples all threads on each -- we are effectively allowing the OS to schedule the threads on the available set of cores as it sees fit.
This often gives the best performance, however sometimes it is beneficial to bind threads to explicit cores.

The OpenMP threading runtime provides additional options for controlling the pinning of threads to the cores assinged to each MPI rank.
The OpenMP threading runtime provides additional options for controlling the pinning of threads to the cores assigned to each MPI rank.
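For instance, the standard OpenMP environment variables can be used to control this pinning (the values below are illustrative):

```console
$ export OMP_NUM_THREADS=4
$ export OMP_PLACES=cores      # one place per physical core
$ export OMP_PROC_BIND=close   # keep threads on places close to their parent thread
```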

Use the `--omp` flag with `affinity.mpi` to get more detailed information about OpenMP thread affinity.
For example, four MPI ranks on one node with four cores and four OpenMP threads:
@@ -580,7 +580,7 @@ The approach is to:
1. first allocate all the resources on each node to the job;
2. then subdivide those resources at each invocation of srun.

If Slurm believes that a request for resources (cores, gpus, memory) overlaps with what another step has already allocated, it will defer the execution until the resources are relinquished.
If Slurm believes that a request for resources (cores, GPUs, memory) overlaps with what another step has already allocated, it will defer the execution until the resources are relinquished.
This must be avoided.
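As a rough sketch of the pattern (the `--exclusive`, `--exact`, and per-step resource flags are standard Slurm options, but the values and the exact preamble recommended below are illustrative):

```bash
#!/bin/bash
#SBATCH --nodes=1
#SBATCH --exclusive     # give the whole node to the job (illustrative choice)

# Subdivide the node between two concurrent steps, giving each step an explicit,
# non-overlapping share of cores, GPUs and memory so that no step is deferred.
srun --exact -n 1 -c 64 --gpus=2 --mem=100G ./app_a &
srun --exact -n 1 -c 64 --gpus=2 --mem=100G ./app_b &
wait
```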

First ensure that *all* resources are allocated to the whole job with the following preamble:
12 changes: 6 additions & 6 deletions docs/services/cicd.md
@@ -718,7 +718,7 @@ Private projects will always get as notification a link to the CSCS pipeline ove
To view the CSCS pipeline overview for a public project and restart / cancel jobs, follow these steps:

* Copy the web link of the CSCS CI status of your project and remove the `type=gitlab` from the link.
* Alternativily, assemble the link yourself, it has the form `https://cicd-ext-mw.cscs.ch/ci/pipeline/results/<repository_id>/<project_id>/<pipeline_nb>` (the IDs can be found on the Gitlab page of your mirror project).
* Alternatively, assemble the link yourself, it has the form `https://cicd-ext-mw.cscs.ch/ci/pipeline/results/<repository_id>/<project_id>/<pipeline_nb>` (the IDs can be found on the Gitlab page of your mirror project).
* Click on `Login to restart jobs` at the bottom right and login with your CSCS credentials
* Click `Cancel running` or `Restart jobs` or cancel individual jobs (button next to job's name)
* Everybody that has at least *Manager* access can restart / cancel jobs (access level is managed on the CI setup page in the Admin section)
@@ -783,7 +783,7 @@ This is the clone URL of the registered project, i.e. this is not the clone URL
### `ARCH`
value: `x86_64` or `aarch64`

This is the architecture of the runner. It is either an ARM64 machine, i.e. `aarch64`, or a traditinal `x86_64` machine.
This is the architecture of the runner. It is either an ARM64 machine, i.e. `aarch64`, or a traditional `x86_64` machine.


## Runners reference
@@ -819,7 +819,7 @@ Accepted variables are documented at [Slurm's srun man page](https://slurm.sched

!!! Warning "SLURM_TIMELIMIT"
Special attention should go to the variable `SLURM_TIMELIMIT`, which sets the maximum time of your Slurm job.
You will be billed the nodehours that your CI jobs are spending on the cluster, i.e. you want to set the `SLURM_TIMELIMIT` to the maximum time that you expect the job to run.
You will be billed the node hours that your CI jobs are spending on the cluster, i.e. you want to set the `SLURM_TIMELIMIT` to the maximum time that you expect the job to run.
You should also pay attention to wrap the value in quotes: when it is not quoted, the gitlab-runner interprets the time differently than Slurm. This is correct:
```
SLURM_TIMELIMIT: "00:30:00"
@@ -867,7 +867,7 @@ The value must be a valid JSON array, where each entry is a string.

It is almost always correct to wrap the full value in single-quotes.
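A minimal sketch of the basic form (the build-arg names and values are placeholders):

```yaml
my job:
  extends: .container-builder-cscs-gh200
  variables:
    DOCKER_BUILD_ARGS: '["BASE_IMAGE=ubuntu:24.04", "MAKE_JOBS=8"]'
```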

It is also possible to define the argument's values as an entry in `variables`, and then reference in `DOCKER_BUILD_ARGS` only the variables that you want to expose to the build process, i.e. sth like this:
It is also possible to define the argument's values as an entry in `variables`, and then reference in `DOCKER_BUILD_ARGS` only the variables that you want to expose to the build process, i.e. something like this:
```yaml
my job:
extends: .container-builder-cscs-gh200
@@ -987,7 +987,7 @@ This tag is mandatory.
##### `GIT_STRATEGY`
Optional variable, default is `none`

This is a [default Gitlab variable](https://docs.gitlab.com/ee/ci/runners/configure_runners.html#git-strategy), but mentioned here explicitly, because very often you do not need to clone the repository sourcecode when you run your containerized application.
This is a [default Gitlab variable](https://docs.gitlab.com/ee/ci/runners/configure_runners.html#git-strategy), but mentioned here explicitly, because very often you do not need to clone the repository source code when you run your containerized application.

The default is `none`, and you must explicitly set it to `fetch` or `clone` if you want the runner to fetch the source code.
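For example, a containerized job that does need the repository contents could opt in explicitly (a sketch; the runner class name is a placeholder for whichever runner your job extends):

```yaml
my containerized job:
  extends: .container-runner-daint-gh200   # placeholder: use the runner class your job targets
  variables:
    GIT_STRATEGY: fetch
```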

@@ -1323,7 +1323,7 @@ The easiest way to use the FirecREST scheduler of ReFrame is to use the configur
In case you want to run ReFrame for a system that is not already available in this directory, please open a ticket to the Service Desk and we will add it or help you update one of the existing ones.

Something you should be aware of when running with this scheduler is that ReFrame will not have direct access to the filesystem of the cluster so the stage directory will need to be kept in sync through FirecREST.
It is recommended to try to clean the stage directory whenever possible with the [postrun_cmds](https://reframe-hpc.readthedocs.io/en/stable/regression_test_api.html#reframe.core.pipeline.RegressionTest.postrun_cmds) and [postbuild_cmds](https://reframe-hpc.readthedocs.io/en/stable/regression_test_api.html#reframe.core.pipeline.RegressionTest.postbuild_cmds) and to avoid [autodetection of the processor](https://reframe-hpc.readthedocs.io/en/stable/config_reference.html#config.systems.partitions.processor) in each run.
It is recommended to try to clean the stage directory whenever possible with the [`postrun_cmds`](https://reframe-hpc.readthedocs.io/en/stable/regression_test_api.html#reframe.core.pipeline.RegressionTest.postrun_cmds) and [`postbuild_cmds`](https://reframe-hpc.readthedocs.io/en/stable/regression_test_api.html#reframe.core.pipeline.RegressionTest.postbuild_cmds) and to avoid [autodetection of the processor](https://reframe-hpc.readthedocs.io/en/stable/config_reference.html#config.systems.partitions.processor) in each run.
Normally ReFrame stores these files in `~/.reframe/topology/{system}-{part}/processor.json`, but you get a "clean" runner every time.
You could either add them in the configuration files or store the files in the first run and copy them to the right directory before ReFrame runs.
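For example, one way to do the latter is to keep a previously generated file in the repository and restore it before invoking ReFrame (paths and system/partition names below are placeholders):

```bash
# Restore a saved topology file so ReFrame skips processor autodetection on the clean runner.
mkdir -p ~/.reframe/topology/mysystem-mypartition
cp ci/topology/mysystem-mypartition/processor.json ~/.reframe/topology/mysystem-mypartition/
reframe -C config/mysystem.py -c checks/ -r
```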

2 changes: 1 addition & 1 deletion docs/software/container-engine/run.md
@@ -205,7 +205,7 @@ Directories outside a container can be *mounted* inside a container so that the
!!! note
The source (before `:`) should be present on the cluster; the destination (after `:`) doesn't have to exist inside the container.

See [the EDF reference][ref-ce-edf-reference] for the full specifiction of the `mounts` EDF entry.
See [the EDF reference][ref-ce-edf-reference] for the full specification of the `mounts` EDF entry.
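For illustration, a minimal EDF with a couple of bind mounts could look like this (the image reference and paths are placeholders):

```toml
image = "nvcr.io#nvidia/pytorch:24.01-py3"
mounts = [
    "/capstor/scratch/cscs/username:/scratch",
    "/users/username/inputs:/inputs"
]
```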


[](){#ref-ce-run-mounting-squashfs}