articles/modeling-simulation-workbench/tutorial-install-slurm.md
# Tutorial: Install the Slurm workload manager in the Azure Modeling and Simulation Workbench
The [Slurm](https://slurm.schedmd.com/overview.html) Workload Manager is a scheduler used in microelectronics design and other high-performance computing scenarios to manage jobs across compute clusters. The Modeling and Simulation Workbench can be deployed with a range of high-performance virtual machines (VMs) ideal for large, compute-intensive workloads. Slurm clusters consist of a *controller node* that manages, stages, and schedules jobs bound for the *compute nodes*. Compute nodes are where the actual workloads are performed. A *node* is an individual element of the cluster, such as a VM.
The Slurm installation package is already available on all Modeling and Simulation Workbench Chamber VMs. This tutorial shows you how to create VMs for your Slurm cluster and install Slurm.
In this tutorial, you learn how to:
> * Create an inventory of VMs
> * Designate controller and compute nodes and install Slurm on each
If you don’t have an Azure subscription, [create a free account](https://azure.microsoft.com/free/?WT.mc_id=A261C142F).
## Prerequisites
## Sign in to the Azure portal and navigate to your workbench
If you aren't already signed into the Azure portal, go to [https://portal.azure.com](https://portal.azure.com). Navigate to your workbench, then the chamber where you'll create your Slurm cluster.
## Create a cluster for Slurm
Slurm requires one node to serve as the controller and a set of compute nodes where workloads execute. The controller is traditionally a modestly sized VM. The controller isn't used for computational workloads and is left deployed between jobs, while the compute nodes themselves are typically sized for a specific task and often deleted after the job. Learn about the different VMs available in Modeling and Simulation Workbench on the [VM Offerings page](./concept-vm-offerings.md).
### Create the Slurm controller node
1. From the chamber overview page, select **Chamber VM** from the **Settings** menu, then select either the **+ Create** button on the action menu along the top or the blue **Create chamber VM** button in the center of the page.
:::image type="content" source="media/tutorial-slurm/create-chamber-vm.png" alt-text="Screenshot of chamber VM overview page with Chamber VM in Settings and the create options on the page highlighted by red outlines.":::
1. On the **Create chamber VM** page:
* Enter a **Name** for the VM. We recommend choosing a name that indicates it is the controller node.
* Select a VM size. For the controller, you can select the smallest VM available. The *D4s_v4* is currently the smallest.
* Leave the **Chamber VM image type** and **Chamber VM count** as the default of *Semiconductor* and *1*.
* Select **Review + create**.
:::image type="content" source="media/tutorial-slurm/configure-create-chamber-vm.png" alt-text="Screenshot of create chamber VM page with the name and VM size textboxes and the create button highlighted in red outline.":::
1. After the validation check passes, select the **Create** button.
Once the VM deploys, it's available in the connector desktop dashboard.
### Create a Slurm compute cluster
A *cluster* is a collection of VMs, individually referred to as *nodes*, that perform the actual work. The compute nodes have their workloads dispatched and managed by the controller node. Similar to the steps you took when you created the controller, return to the **Chamber VM** page to create a cluster. The Modeling and Simulation Workbench allows you to create multiple, identical VMs in a single step.
1. On the **Create chamber VM** page:
* Enter a **Name** for the VM cluster. Use a name that identifies these VMs as compute nodes. For example, include the word "node" or the type of workload somewhere in the name.
* Select a VM appropriately sized for the workload. Refer to the [VM Offerings](concept-vm-offerings.md) page for guidance on VM offerings, capabilities, features, and sizes.
* Leave the **Chamber VM image type** as the default of *Semiconductor*.
* In the **Chamber VM count** box, enter the number of nodes required.
* Select **Review + create**.
1. After the validation check passes, select the **Create** button.
VMs are deployed in parallel and appear in the dashboard. It isn't typically necessary to access worker nodes individually; however, you can SSH to worker nodes in the same chamber if needed. In the next steps, you'll configure the compute nodes from the controller.
### Connect to the controller node desktop
Slurm installation is performed from the controller node.
1. Navigate to the connector. From the **Settings** menu of the chamber, select **Connector**. Select the sole connector that appears in the resource list.
:::image type="content" source="media/tutorial-slurm/connector-overview.png" alt-text="Screenshot of connector overview page with Connector in Settings and the target connector highlighted with a red rectangle.":::
1. From the connector page, select the **Desktop dashboard** URL.
1. The desktop dashboard opens. Select your controller VM.
## Create an inventory of VMs
Slurm installation requires that you have a technical inventory of the compute nodes, as well as their host names.
### Get a list of deployed VMs
Configuring Slurm requires an inventory of nodes. From the controller node:
1. Open a terminal in your desktop by selecting the terminal icon from the menu bar at the top.
:::image type="content" source="media/tutorial-slurm/open-terminal.png" alt-text="Screenshot of desktop with terminal button highlighted in red.":::
1. Execute the following commands to print a list of all VMs in the chamber. In this example, we have one controller and five compute nodes. The commands print the IP addresses in the first column and the hostnames in the second. From the naming, you can see the controller node and the compute nodes.
```bash
$ ip=$(hostname -i | cut -d'.' -f1-3)
10.163.4.9 wrkldvmslurm-nod034b970
```
1. Create a file with just the worker nodes, one host per line, and call it *slurm_worker.txt*. For the remaining steps of this tutorial, you'll use this list to configure the compute nodes from your controller. In some steps, the nodes need to be in a comma-delimited format. In those instances, we use a command-line shortcut to format the list without having to create a new file. To create *slurm_worker.txt*, remove the IP addresses in the first column and the controller node, which is listed first; one way to do this is sketched below.
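If you capture the listing from the previous step into a hypothetical file named *all_nodes.txt* (not part of the tutorial), building *slurm_worker.txt* might look like this minimal sketch:

```bash
# Keep only the hostname column and drop the first line (the controller)
awk '{print $2}' all_nodes.txt | tail -n +2 > slurm_worker.txt

# Verify: one compute-node hostname per line
cat slurm_worker.txt
```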
### Gather technical specifications about the compute nodes
Assuming that you created all the worker nodes in your cluster using the same VM size, choose any node to retrieve technical information about the platform. In this example, we use `head` to grab the first host name in the compute node list and `ssh` to execute the `lscpu` command on it:
```bash
$ ssh `head -1 ./slurm_worker.txt` lscpu
```

From the `lscpu` output, note these values:
* **Core(s) per socket**
* **Thread(s) per core**
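As a sketch, you can also filter the `lscpu` output down to just these topology fields, plus **Socket(s)**, which a Slurm node definition typically uses as well:

```bash
# Show only the CPU topology fields needed to define the node in Slurm
ssh `head -1 ./slurm_worker.txt` lscpu | grep -E 'Socket\(s\)|Core\(s\) per socket|Thread\(s\) per core'
```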
Slurm also requires an estimate of available memory on the compute nodes. To obtain the available memory of a worker node, execute the `free` command on any of the compute nodes from your controller and note the **available** memory reported in the output. Again, we target the first worker node in our list with the `head` command and submit the command via `ssh`.
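A minimal sketch of that query, reusing *slurm_worker.txt* from the earlier step:

```bash
# Query the first compute node in the worker list for its memory
ssh `head -1 ./slurm_worker.txt` free
```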
Note the available memory listed in the **available** column.
## Install Slurm on your cluster
### Prerequisite: Install MariaDB
Slurm requires MariaDB, the MySQL fork, to be installed from the Red Hat repository before Slurm itself can be installed. Azure maintains a private Red Hat repository mirror, and chamber VMs have access to this repository. Install and configure MariaDB with the following commands:
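As a minimal sketch, on a dnf-based Red Hat image the installation typically looks like the following; the package and service names here are assumptions, and the tutorial's exact commands may differ:

```bash
# Install the MariaDB server package from the chamber's Red Hat repository mirror
sudo dnf install -y mariadb-server

# Start the database service and enable it at boot
sudo systemctl enable --now mariadb

# Run the interactive hardening script covered in the next step
sudo mysql_secure_installation
```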
The *mysql_secure_installation* script asks for more configuration.
* The default database password isn't set. Press **Enter** when asked for the current password.
* Enter *Y* when asked to set the root password. Create a new, secure root password for MariaDB, take note of it for later, then reenter it to confirm. You need this password when you configure the Slurm controller in the following step.
* Enter *Y* for the remaining questions for:
* Reloading privilege tables
* Removing anonymous users
* Disabling remote root login
### Install Slurm on the controller
The Modeling and Simulation Workbench provides a setup script to speed installation. It requires the parameters you collected earlier in this tutorial. Replace the placeholders with the parameters you collected and execute these commands on the controller node. The \<clusternodes\> placeholder is a comma-separated list of hostnames with no spaces. The examples include a shortcut that reformats your compute node list in *slurm_worker.txt* into the proper comma-delimited format so that you don't have to create another file. The format of the *sdwChamberSlurm.sh* script invocation is as follows:
For this example, we use the list of nodes we created in the previous steps and substitute our values collected during discovery. The `paste` command is used to reformat the list of worker nodes into the comma-delimited format without needing to create a new file.
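As a sketch, the shortcut relies on `paste` to join the file into a single comma-delimited string; the variable name here is illustrative and not part of the tutorial's script:

```bash
# Join the one-hostname-per-line file into a comma-delimited list with no spaces
nodes=$(paste -s -d, ./slurm_worker.txt)
echo "$nodes"
```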
> If your installation shows an [ERROR] message in these steps, check that you haven't mistyped or misplaced any parameter. Review your information and repeat the step.
### Install Slurm on compute nodes
Slurm must now be installed on the compute nodes. To ease this task, use your home directory, which is mounted on all VMs, to distribute the files and scripts used.
From your user account, copy the *munge.key* file to your home directory.
```bash
cd
sudo cp /etc/munge/munge.key .
```
Create a script named *node-munge.sh* to set up each node's **munge** settings. This script should be in your home directory.
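A minimal sketch of what such a script typically contains follows; treat it as an assumption about *node-munge.sh* rather than the tutorial's exact contents:

```bash
#!/bin/sh
# node-munge.sh - sketch of a per-node munge setup (assumed content)

# Copy the shared key staged alongside this script in the shared home directory
cp "$(dirname "$0")/munge.key" /etc/munge/munge.key

# munge requires strict ownership and permissions on its key
chown munge:munge /etc/munge/munge.key
chmod 400 /etc/munge/munge.key

# Start the authentication daemon and enable it at boot
systemctl enable --now munge
```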
Using the same file of node hostnames that you used previously, execute the bash script you created on each node.
```bash
$ for host in `cat ./slurm_worker.txt`; do ssh $host sudo sh ~/node-munge.sh; done
Complete!
```
> [!IMPORTANT]
> After configuring the compute nodes, be sure to delete the *munge.key* file from your home directory.
## Validate installation
To validate that Slurm installed successfully, a Chamber Admin can execute the `sinfo` command on any Slurm node, either on the controller or on a compute node.
```bash
$ sinfo
PARTITION AVAIL TIMELIMIT NODES STATE NODELIST
chamberSlurmPartition1* up infinite 5 idle wrkldvmslurm-nod0aef63d,wrkldvmslurm-nod034b970...
```
You can validate execution on compute nodes by sending a simple command using the `srun` command.
```shell
$ srun --nodes=3 hostname && srun sleep 30
wrkldvmslurm-nod034b970
wrkldvmslurm-nod0aef63d
wrkldvmslurm-nod10870ad
```
If a job shows as *queued*, run `squeue` to list the job queue.
```shell
$ squeue
```
## Troubleshooting
If a node's state is reported as *down* or *drain*, the `scontrol` command can return it to service. Follow that with the `sinfo` command to verify the node's state.
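For example, a sketch of returning a node to service; the node name is taken from the earlier listing:

```bash
# Clear the down/drain state and return the node to service
sudo scontrol update nodename=wrkldvmslurm-nod034b970 state=resume

# Confirm the node now reports as idle
sinfo
```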