docs/clusters/daint.md (15 additions & 14 deletions)
```diff
@@ -7,12 +7,12 @@ Daint is the main [HPC Platform][ref-platform-hpcp] cluster that provides comput
 
 ### Compute nodes
 
-Daint consists of around 793 [Grace-Hopper nodes][ref-alps-gh200-node].
+Daint consists of 800-1000 [Grace-Hopper nodes][ref-alps-gh200-node].
 
-The number of nodes can change when nodes are added or removed from other clusters on Alps.
+The number of nodes can vary as nodes are added or removed from other clusters on Alps.
 
-There are four login nodes, labelled `daint-ln00[1-4]`.
-You will be assigned to one of the four login nodes when you ssh onto the system, from where you can edit files, compile applications and start simulation jobs.
+There are four login nodes, `daint-ln00[1-4]`.
+You will be assigned to one of the four login nodes when you ssh onto the system, from where you can edit files, compile applications, and launch batch jobs.
 
 | node type | number of nodes | total CPU sockets | total GPUs |
```
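The login workflow described above can be sketched as follows (a sketch only: the hostname follows the usual CSCS Alps naming and the script name is a placeholder, so adjust both to your setup):

```shell
# Connect to Daint; you land on one of the login nodes daint-ln00[1-4].
# Assumes your ssh keys are already configured for CSCS access, and that
# daint.alps.cscs.ch is the public login alias (an assumption here).
ssh daint.alps.cscs.ch

# From the login node: edit files, compile, then submit a batch job.
sbatch my_job.sh    # my_job.sh is a placeholder batch script
squeue --me         # check the status of your own jobs
```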
```diff
-Provide the latest versions of scientific applications, tuned for Daint, and the tools required to build your own version of the applications.
+Provide the latest versions of scientific applications, tuned for Daint, and the tools required to build your own versions of the applications.
 
 * [CP2K][ref-uenv-cp2k]
 * [GROMACS][ref-uenv-gromacs]
```
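As an illustration of how one of the uenv-provided applications above might be used (a sketch: the image name and version tag are placeholders, and the exact subcommands should be checked with `uenv --help` on the system):

```shell
# List uenv images available for the current cluster
uenv image find

# Start a shell with the CP2K environment mounted
# (the version tag 2024.1 is a placeholder, not a confirmed image name)
uenv start cp2k/2024.1
```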
```diff
@@ -97,30 +97,31 @@ To build images, see the [guide to building container images on Alps][ref-build-
 
 CSCS will continue to support and update uenv and container engine, and users are encouraged to update their workflows to use these methods at the first opportunity.
 
-The CPE is still installed on Daint, however it will recieve no support or updates, and will be replaced with a container in a future update.
+The CPE is still installed on Daint; however, it will receive no support or updates, and will be replaced with a container in a future update.
 
 ## Running jobs on Daint
 
 ### Slurm
 
-Santis uses [Slurm][ref-slurm] as the workload manager, which is used to launch and monitor distributed workloads, such as training runs.
+Daint uses [Slurm][ref-slurm] as the workload manager, which is used to launch and monitor compute-intensive workloads.
 
-There are two [Slurm partitions][ref-slurm-partitions] on the system:
+There are four [Slurm partitions][ref-slurm-partitions] on the system:
 
 * the `normal` partition is for all production workloads.
 * the `debug` partition can be used to access a small allocation for up to 30 minutes for debugging and testing purposes.
-* the `xfer` partition is for [internal data transfer][ref-data-xfer-internal] at CSCS.
+* the `xfer` partition is for [internal data transfer][ref-data-xfer-internal].
+* the `low` partition is a low-priority partition, which may be enabled for specific projects at specific times.
 
-!!! todo "Timmy: add definition of the low partition"
 
 | name | nodes | max nodes per job | time limit |
 | -- | -- | -- | -- |
-| `normal` | 994 | - | 24 hours |
-| `low` | 994 | 2 | 24 hours |
+| `normal` | unlim | - | 24 hours |
 | `debug` | 24 | 2 | 30 minutes |
 | `xfer` | 2 | 1 | 24 hours |
+| `low` | unlim | - | 24 hours |
 
-* nodes in the `normal` and `debug` partitions are not shared
+* nodes in the `normal` and `debug` (and `low`) partitions are not shared
 * nodes in the `xfer` partition can be shared
 
 See the Slurm documentation for instructions on how to run jobs on the [Grace-Hopper nodes][ref-slurm-gh200].
```
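A minimal batch script showing how a partition from the table above might be selected (the account name and application are placeholders, and the node count simply respects the `debug` limits in the table):

```shell
#!/bin/bash
#SBATCH --job-name=debug-test
#SBATCH --partition=debug     # debug: max 2 nodes, 30 minute limit
#SBATCH --nodes=2
#SBATCH --time=00:30:00
#SBATCH --account=<project>   # placeholder: replace with your project account

# Launch the application across the allocated nodes.
srun ./my_app                 # my_app is a placeholder executable
```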
```diff
@@ -136,7 +137,7 @@ Daint can also be accessed using [FirecREST][ref-firecrest] at the `https://api.
 
 ### Scheduled maintenance
 
 !!! todo "move this to HPCP top level docs"
 
-Wednesday morning 8-12 CET is reserved for periodic updates, with services potentially unavailable during this timeframe. If the queues must be drained (redeployment of node images, rebooting of compute nodes, etc) then a Slurm reservation will be in place that will prevent jobs from running into the maintenance window.
+Wednesday mornings 8:00-12:00 CET are reserved for periodic updates, with services potentially unavailable during this time frame. If the batch queues must be drained (for redeployment of node images, rebooting of compute nodes, etc.) then a Slurm reservation will be in place that will prevent jobs from running into the maintenance window.
 
 Exceptional and non-disruptive updates may happen outside this time frame and will be announced to the users mailing list, and on the [CSCS status page](https://status.cscs.ch).
```
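Standard Slurm commands can be used to check whether a maintenance reservation is in place and whether your jobs are blocked by it (a sketch; the exact reservation name and pending reason on Daint are assumptions):

```shell
# List active and upcoming reservations, including any maintenance window
scontrol show reservation

# A job that would overlap the window typically stays pending with a
# reason such as "ReqNodeNotAvail, Reserved for maintenance"; check with:
squeue --me --format="%.10i %.9P %.20j %.8T %R"
```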