
Commit e605fd1

Merge branch 'main' into patch-1
2 parents 1014229 + b9aa92c

File tree

12 files changed (+181, -26 lines)

.github/actions/spelling/allow.txt

Lines changed: 3 additions & 0 deletions
@@ -4,6 +4,7 @@ AMD
 Alpstein
 Balfrin
 Besard
+Besso
 Broyden
 CFLAGS
 CHARMM
@@ -121,6 +122,7 @@ artifactory
 autodetection
 aws
 baremetal
+besso
 biomolecular
 blaspp
 blt
@@ -326,6 +328,7 @@ uenv
 uenvs
 uids
 ultrasoft
+unsquashfs
 utkin
 vCluster
 vClusters

Lines changed: 14 additions & 0 deletions
@@ -0,0 +1,14 @@
+name: Delete PR preview
+on:
+  pull_request_target:
+    branches: ['main']
+    types: ['closed']
+
+jobs:
+  preview_delete:
+    name: Delete preview
+    runs-on: ubuntu-latest
+    steps:
+      - name: delete-preview
+        run: |
+          curl --fail -X DELETE -H "Authorization: Bearer ${{ secrets.UPLOAD_TOKEN }}" https://docs.tds.cscs.ch/upload?path=${{ github.event.pull_request.number }}

.github/workflows/welcome.yaml

Lines changed: 0 additions & 21 deletions
This file was deleted.

docs/clusters/besso.md

Lines changed: 80 additions & 0 deletions
@@ -0,0 +1,80 @@
+[](){#ref-cluster-besso}
+# Besso
+
+Besso is a small Alps cluster that provides development resources for porting software for selected customers.
+It is provided as is, without the same level of support as the main platform clusters.
+
+### Storage and file systems
+
+Besso uses the [HPCP filesystems and storage policies][ref-hpcp-storage].
+
+## Getting started
+
+### Logging into Besso
+
+To connect to Besso via SSH, first refer to the [ssh guide][ref-ssh].
+
+!!! example "`~/.ssh/config`"
+    Add the following to your [SSH configuration][ref-ssh-config] to enable you to connect directly to Besso using `ssh besso`.
+    ```
+    Host besso
+        HostName besso.vc.cscs.ch
+        ProxyJump ela
+        User cscsusername
+        IdentityFile ~/.ssh/cscs-key
+        IdentitiesOnly yes
+    ```
+
+### Software
+
+[](){#ref-cluster-besso-uenv}
+#### uenv
+
+Besso is a development and testing system, for which CSCS does not provide supported applications.
+
+Instead, the [prgenv-gnu][ref-uenv-prgenv-gnu] programming environment is provided for both the [a100][ref-alps-a100-node] and [mi200][ref-alps-mi200-node] node types.
+
+[](){#ref-cluster-besso-containers}
+#### Containers
+
+Besso supports container workloads using the [Container Engine][ref-container-engine].
+
+To build images, see the [guide to building container images on Alps][ref-build-containers].
+
+#### Cray Modules
+
+!!! warning
+    The Cray Programming Environment (CPE), loaded using `module load cray`, is no longer supported by CSCS.
+
+    CSCS will continue to support and update uenv and the Container Engine, and users are encouraged to update their workflows to use these methods at the first opportunity.
+
+    The CPE is still installed on Besso; however, it will receive no support or updates, and will be [replaced with a container][ref-cpe] in a future update.
+
+## Running jobs on Besso
+
+### Slurm
+
+Besso uses [Slurm][ref-slurm] as the workload manager, which is used to launch and monitor workloads on compute nodes.
+
+There are multiple [Slurm partitions][ref-slurm-partitions] on the system:
+
+* the `a100` partition contains [NVIDIA A100 GPU][ref-alps-a100-node] nodes
+* the `mi200` partition contains [AMD MI250x GPU][ref-alps-mi200-node] nodes
+* the `normal` partition contains all of the nodes in the system.
+
+| name | max nodes per job | time limit |
+| -- | -- | -- |
+| `a100` | 2 | 24 hours |
+| `mi200` | 2 | 24 hours |
+| `normal` | 4 | 24 hours |
+
+See the Slurm documentation for instructions on how to [run jobs][ref-slurm].
+
+### FirecREST
+
+!!! under-construction
+    Besso will have support for [FirecREST][ref-firecrest] access.
+
+## Maintenance and status
+
+There is no regular scheduled maintenance for this system.
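
To make the uenv and Slurm sections above concrete, here is a rough sketch (not taken from the page itself) of pulling `prgenv-gnu` and requesting nodes from the `a100` partition. The image version tag, the `default` view name, and the `--uenv`/`--view` Slurm options are assumptions to verify against the linked uenv and Slurm pages.

```console
# Hypothetical sketch: discover and pull a prgenv-gnu image provided on the system
# (the version tag below is an assumption — use whatever `uenv image find` reports).
$ uenv image find prgenv-gnu
$ uenv image pull prgenv-gnu/24.11:v1

# Request a single A100 node interactively, within the 2-node / 24-hour limits
# listed in the partition table above.
$ srun -p a100 -N 1 --uenv=prgenv-gnu/24.11:v1 --view=default --pty bash
```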

docs/alps/clusters.md renamed to docs/clusters/index.md

Lines changed: 10 additions & 0 deletions
@@ -43,4 +43,14 @@ The following clusters are part of the platforms that are fully operated by CSCS
     [:octicons-arrow-right-24: Santis][ref-cluster-santis]
 </div>
 
+## Other systems
+
+<div class="grid cards" markdown>
+- :fontawesome-solid-mountain: __Porting and Development__
+
+    Besso is a small system used by some partners for development and porting with AMD and NVIDIA GPUs.
+
+    [:octicons-arrow-right-24: Besso][ref-cluster-besso]
+</div>
+

docs/guides/storage.md

Lines changed: 2 additions & 0 deletions
@@ -206,6 +206,8 @@ The first step is to create the virtual environment using the usual workflow.
 # create and activate a new relocatable venv using uv
 # in this case we explicitly select python 3.12
 uv venv -p 3.12 --relocatable --link-mode=copy /dev/shm/sqfs-demo/.venv
+# You can also point to the uenv python with `uv venv -p $(which python) ...`
+# which, among other things, enables user portability of the venv
 cd /dev/shm/sqfs-demo
 source .venv/bin/activate
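
As a follow-up to the new comments in this hunk, a hedged sketch of the alternative they describe — building the venv against the python provided by an active uenv rather than a uv-managed interpreter. The uenv name, tag, and view below are placeholders, not part of the diff.

```console
# Inside a started uenv (name/tag are placeholders), `which python` resolves to the
# uenv-provided interpreter, so the relocatable venv is built against it.
$ uenv start prgenv-gnu/24.11:v1 --view=default
$ uv venv -p $(which python) --relocatable --link-mode=copy /dev/shm/sqfs-demo/.venv
$ source /dev/shm/sqfs-demo/.venv/bin/activate
```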

docs/index.md

Lines changed: 4 additions & 4 deletions
@@ -30,15 +30,15 @@ Find out more about Alps...
 
     Learn more about the Alps research infrastructure
 
-    [:octicons-arrow-right-24: Alps Overview](alps/index.md)
+    [:octicons-arrow-right-24: Alps Overview][ref-alps]
 
     Get detailed information about the main components of the infrastructure
 
-    [:octicons-arrow-right-24: Alps Clusters](alps/clusters.md)
+    [:octicons-arrow-right-24: Alps Clusters][ref-alps-clusters]
 
-    [:octicons-arrow-right-24: Alps Hardware](alps/hardware.md)
+    [:octicons-arrow-right-24: Alps Hardware][ref-alps-hardware]
 
-    [:octicons-arrow-right-24: Alps Storage](alps/storage.md)
+    [:octicons-arrow-right-24: Alps Storage][ref-alps-storage]
 
 - :fontawesome-solid-key: __Logging In__

docs/software/container-engine/known-issue.md

Lines changed: 29 additions & 0 deletions
@@ -79,3 +79,32 @@ The use of `--environment` as `#SBATCH` is known to cause **unexpected behaviors
 - **Nested use of `--environment`**: running `srun --environment` in `#SBATCH --environment` results in double-entering EDF containers, causing unexpected errors in the underlying container runtime.
 
 To avoid any unexpected confusion, users are advised **not** to use `--environment` as `#SBATCH`. If users encounter a problem while using this, it's recommended to move `--environment` from `#SBATCH` to each `srun` and see if the problem disappears.
+
+[](){#ref-ce-no-user-id}
+## Container start fails with `id: cannot find name for user ID`
+
+If your Slurm job using a container fails to start with an error message similar to:
+```console
+slurmstepd: error: pyxis: container start failed with error code: 1
+slurmstepd: error: pyxis: container exited too soon
+slurmstepd: error: pyxis: printing engine log file:
+slurmstepd: error: pyxis: id: cannot find name for user ID 42
+slurmstepd: error: pyxis: id: cannot find name for user ID 42
+slurmstepd: error: pyxis: id: cannot find name for user ID 42
+slurmstepd: error: pyxis: mkdir: cannot create directory ‘/iopsstor/scratch/cscs/42’: Permission denied
+slurmstepd: error: pyxis: couldn't start container
+slurmstepd: error: spank: required plugin spank_pyxis.so: task_init() failed with rc=-1
+slurmstepd: error: Failed to invoke spank plugin stack
+srun: error: nid001234: task 0: Exited with exit code 1
+srun: Terminating StepId=12345.0
+```
+this does not indicate an issue with your container; instead it means that one or more of the compute nodes have user databases that are not fully synchronized.
+If the problematic node is not automatically drained, please [let us know][ref-get-in-touch] so that we can ensure the node is in a good state.
+You can check the state of a node using `sinfo --nodes=<node>`, e.g.:
+```console
+$ sinfo --nodes=nid006886
+PARTITION AVAIL  TIMELIMIT   NODES  STATE   NODELIST
+debug     up     1:30:00         0  n/a
+normal*   up     12:00:00        1  drain$  nid006886
+xfer      up     1-00:00:00      0  n/a
+```
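
One possible stop-gap, not stated in the hunk above, while waiting for the affected node to be drained or repaired: resubmit and exclude that node explicitly. The node name is the one from the example output; the job script and EDF name are placeholders.

```console
# Resubmit while skipping the node that reported the user-ID error
# (job.sh and my-edf are placeholders).
$ sbatch --exclude=nid006886 job.sh
# The same flag works for interactive steps; `id` also confirms the user resolves.
$ srun --exclude=nid006886 --environment=my-edf id
```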

docs/software/container-engine/run.md

Lines changed: 4 additions & 0 deletions
@@ -24,6 +24,10 @@ There are three ways to do so:
 !!! note "Shared container at the node-level"
     For memory efficiency reasons, all Slurm tasks on an individual compute node share the same container, including its filesystem. As a consequence, any write operation to the container filesystem by one task will eventually become visible to all other tasks on the same node.
 
+!!! warning "Container start failure with `id: cannot find name for user ID`"
+    Containers may fail to start due to user database issues on compute nodes.
+    See [this section][ref-ce-no-user-id] for more details.
+
 ### Use from batch scripts
 
 Use `--environment` with the Slurm command (e.g., `srun` or `salloc`):
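
For context on the last line of the hunk, a minimal hedged sketch of what passing `--environment` to `srun` or `salloc` looks like in practice; the EDF name, image, and paths are placeholders, not something defined in this diff.

```console
# Assume an EDF at ~/.edf/ubuntu-demo.toml containing (roughly) an entry like
#   image = "ubuntu:24.04"
# The EDF name is then passed straight to the Slurm commands:
$ srun --environment=ubuntu-demo cat /etc/os-release
$ salloc --environment=ubuntu-demo
```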

docs/software/sciviz/paraview.md

Lines changed: 2 additions & 0 deletions
@@ -136,5 +136,7 @@ You will need to add the corresponding XML code to your local ParaView installation
           <Argument value="6000"/>
         </Arguments>
       </Command>
+    </CommandStartup>
+  </Server>
 </Servers>
 ```
