
Commit deeaf89

committed
Style corrections, Acrolinx fixes
1 parent 033f01a commit deeaf89

File tree

1 file changed

articles/modeling-simulation-workbench/tutorial-install-slurm.md

Lines changed: 43 additions & 45 deletions
@@ -12,7 +12,7 @@ ms.date: 10/02/2024
# Tutorial: Install the Slurm workload manager in the Azure Modeling and Simulation Workbench

-The [Slurm](https://slurm.schedmd.com/overview.html) Workload Manager is a scheduler used in microelectronics design and other high-performance computing scenarios to manage jobs across compute clusters. The Modeling and Simulation Workbench can be deployed with a range of high-performance virtual machines (VM) ideal for large, compute-intensive workloads. Slurm clusters consist of a *controller node*, where the administrator manages, stages, and schedules jobs bound for the *compute nodes*, where the actual workloads are performed. A *node* is simply a part of the cluster, in this case a VM.
+The [Slurm](https://slurm.schedmd.com/overview.html) Workload Manager is a scheduler used in microelectronics design and other high-performance computing scenarios to manage jobs across compute clusters. The Modeling and Simulation Workbench can be deployed with a range of high-performance virtual machines (VMs) ideal for large, compute-intensive workloads. Slurm clusters consist of a *controller node* that manages, stages, and schedules jobs bound for the *compute nodes*. Compute nodes are where the actual workloads are performed. A *node* is an individual element of the cluster, such as a VM.

The Slurm installation package is already available on all Modeling and Simulation Workbench Chamber VMs. This tutorial shows you how to create VMs for your Slurm cluster and install Slurm.

@@ -24,7 +24,7 @@ In this tutorial, you learn how to:
> * Create an inventory of VMs
> * Designate controller and compute nodes and install Slurm on each

-If you don’t have a Azure subscription, [create a free account](https://azure.microsoft.com/free/?WT.mc_id=A261C142F).
+If you don’t have an Azure subscription, [create a free account](https://azure.microsoft.com/free/?WT.mc_id=A261C142F).

## Prerequisites

@@ -34,60 +34,60 @@ If you don’t have a Azure subscription, [create a free account](https://azure.
## Sign in to the Azure portal and navigate to your workbench

-If you are not already signed into the Azure portal, go to [https://portal.azure.com](https://portal.azure.com). Navigate to your workbench, then the chamber where you will create your Slurm cluster.
+If you aren't already signed into the Azure portal, go to [https://portal.azure.com](https://portal.azure.com). Navigate to your workbench, then the chamber where you'll create your Slurm cluster.

## Create a cluster for Slurm

-Slurm requires one node to serve as the controller and a set of compute nodes where workloads will execute. The controller is traditionally a modestly sized VM as it isn't used for computational tasks and is left deployed between jobs, while the compute nodes are sized for the workload. Learn about the different VMs available in Modeling and Simulation Workbench on the [VM Offerings page](./concept-vm-offerings.md).
+Slurm requires one node to serve as the controller and a set of compute nodes where workloads execute. The controller is traditionally a modestly sized VM. The controller isn't used for computational workloads and is left deployed between jobs, while the compute nodes themselves are typically sized for a specific task and often deleted after the job. Learn about the different VMs available in Modeling and Simulation Workbench on the [VM Offerings page](./concept-vm-offerings.md).

### Create the Slurm controller node

1. From the chamber overview page, select **Chamber VM** from the **Settings** menu, then either the **+ Create** button on the action menu along the top or the blue **Create chamber VM** button in the center of the page.
:::image type="content" source="media/tutorial-slurm/create-chamber-vm.png" alt-text="Screenshot of chamber VM overview page with Chamber VM in Settings and the create options on the page highlighted by red outlines.":::
1. On the **Create chamber VM** page:
-* Enter a **Name** for the VM. Select something to indicate that this is the controller node.
-* Select a VM size. For the controller, you can select the smallest VM available.
-* Leave the **Chamber VM image type** and **Chamber VM count** as the default of "Semiconductor" and "1".
+* Enter a **Name** for the VM. We recommend choosing a name that indicates it is the controller node.
+* Select a VM size. For the controller, you can select the smallest VM available. The *D4s_v4* is currently the smallest.
+* Leave the **Chamber VM image type** and **Chamber VM count** as the default of *Semiconductor* and *1*.
* Select **Review + create**.
:::image type="content" source="media/tutorial-slurm/configure-create-chamber-vm.png" alt-text="Screenshot of create chamber VM page with the name and VM size textboxes and the create button highlighted in red outline.":::
1. After the validation check passes, select the **Create** button.

Once the VM deploys, it's available in the connector desktop dashboard.

-### Create the Slurm compute cluster
+### Create a Slurm compute cluster

-A *cluster* is a collection of VMs, individually referred to as *nodes* that perform the actual work. The compute nodes have their workloads dispatched and managed by the controller node. Similar to the steps taken in creating the controller, return to the **Chamber VM** page and create a cluster. The Modeling and Simulation Workbench allows you to create identical VMs.
+A *cluster* is a collection of VMs, individually referred to as *nodes*, that perform the actual work. The compute nodes have their workloads dispatched and managed by the controller node. Similar to the steps you took when you created the controller, return to the **Chamber VM** page to create a cluster. The Modeling and Simulation Workbench allows you to create multiple, identical VMs in a single step.

1. On the **Create chamber VM** page:
-* Enter a **Name** for the VM cluster. Use a name that identifies these VMs as compute nodes. For example, include the word "node" in the name.
-* Select a VM appropriate for the workload. Refer to the [VM Offerings](concept-vm-offerings.md) page for guidance on VM capabilities and sizes.
-* Leave the **Chamber VM image type** as the default of "Semiconductor".
+* Enter a **Name** for the VM cluster. Use a name that identifies these VMs as compute nodes. For example, include the word "node" or the type of workload somewhere in the name.
+* Select a VM appropriately sized for the workload. Refer to the [VM Offerings](concept-vm-offerings.md) page for guidance on VM offerings, capabilities, features, and sizes.
+* Leave the **Chamber VM image type** as the default of *Semiconductor*.
* In the **Chamber VM count** box, enter the number of nodes required.
* Select **Review + create**.
1. After the validation check passes, select the **Create** button.

-VMs are deployed in parallel and appear in the dashboard. It isn't typically necessary to access worker nodes, however you can ssh to worker nodes in the same chamber if needed.
+VMs are deployed in parallel and appear in the dashboard. It isn't typically necessary to individually access worker nodes; however, you can ssh to worker nodes in the same chamber if needed. In the next steps, you'll configure the compute nodes from the controller.

-### Connect to the controller node
+### Connect to the controller node desktop

Slurm installation is performed from the controller node.

-1. Navigate to the connector. From the **Settings** menu of the chamber, select **Connector**, then select the sole connector that appears in the resource list.
+1. Navigate to the connector. From the **Settings** menu of the chamber, select **Connector**. Select the sole connector that appears in the resource list.
:::image type="content" source="media/tutorial-slurm/connector-overview.png" alt-text="Screenshot of connector overview page with Connector in Settings and the target connector highlighted with a red rectangle.":::
1. From the connector page, select the **Desktop dashboard** URL.
1. The desktop dashboard opens. Select your controller VM.

## Create an inventory of VMs

-Slurm installation requires that you have a technical inventory of the compute nodes, as well as their host names.
+Slurm installation requires a technical inventory of the compute nodes, as well as their host names.

-### Get a list of available VMs
+### Get a list of deployed VMs

Configuring Slurm requires an inventory of nodes. From the controller node:

-1. Open a terminal in your desktop.
+1. Open a terminal in your desktop by selecting the terminal icon from the menu bar at the top.
:::image type="content" source="media/tutorial-slurm/open-terminal.png" alt-text="Screenshot of desktop with terminal button highlighted in red.":::
-1. Execute the following bash script to print a list of all VMs in the chamber. In this example, we have one controller and five nodes. The command prints the IP addresses in the first column and the hostnames in the second.
+1. Execute the following commands to print a list of all VMs in the chamber. In this example, we have one controller and five nodes. The commands print the IP addresses in the first column and the hostnames in the second. From the naming, you can see the controller node and the compute nodes.

```bash
$ ip=$(hostname -i | cut -d'.' -f1-3)
@@ -100,11 +100,11 @@ Configuring Slurm requires an inventory of nodes. From the controller node:
10.163.4.9 wrkldvmslurm-nod034b970
```

-1. Create a file with just the worker nodes, one host per line and call it *slurm_worker.txt*. For the remaining steps of this tutorial, you'll use this list to configure the compute nodes from your controller. In some steps, the nodes need to be in a comma-delimited format. In those instances, we use a command-line shortcut to format without having to create a new file. Create the file of your Slurm worker nodes as your hostnames will not match the name you entered in the portal.
+1. Create a file with just the worker nodes, one host per line, and call it *slurm_worker.txt*. For the remaining steps of this tutorial, you'll use this list to configure the compute nodes from your controller. In some steps, the nodes need to be in a comma-delimited format. In those instances, we use a command-line shortcut to format the list without having to create a new file. To create *slurm_worker.txt*, remove the IP addresses in the first column and the controller node, which is listed first. One possible way to script this step is sketched after this list.

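The following is an editor's sketch, not part of the original tutorial: one way to build *slurm_worker.txt*, assuming you saved the listing output shown above to a file named *vm_list.txt* (an illustrative name) and that the controller appears on the first line.

```bash
# Editor's sketch (not from the original tutorial). Assumes the listing output shown
# above was saved to vm_list.txt (illustrative name), with the IP address in column 1,
# the hostname in column 2, and the controller node on the first line.
awk '{print $2}' vm_list.txt | tail -n +2 > slurm_worker.txt

# Alternatively, filter the controller out by name instead of by position:
# awk '{print $2}' vm_list.txt | grep -v <controller-hostname> > slurm_worker.txt

# Sanity check: one compute-node hostname per line, no IP addresses.
cat slurm_worker.txt
```
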
### Gather technical specifications about the compute nodes

-Assuming that you created all the worker nodes in your cluster using the same VM, choose any node to retrieve technical information about the platform. In this example, we use *head* to grab the first host name in the compute node list:
+Assuming that you created all the worker nodes in your cluster using the same VM size, choose any node to retrieve technical information about the platform. In this example, we use `head` to grab the first host name in the compute node list and use `ssh` to run the `lscpu` command on it:

```bash
$ ssh `head -1 ./slurm_worker.txt` lscpu
@@ -146,7 +146,7 @@ You'll be asked by the ssh client to verify the ECDSA key fingerprint of the rem
* **Core(s) per socket**
* **Thread(s) per core**

-Slurm also requires an estimate of available memory on the compute nodes. To obtain the available memory of a worker node, execute the *free* command on any of the compute nodes from your controller and note the **available** memory reported in the output. Again, we use the first worker node in our list using the *head* command.
+Slurm also requires an estimate of available memory on the compute nodes. To obtain the available memory of a worker node, execute the `free` command on any of the compute nodes from your controller and note the **available** memory reported in the output. Again, we use `head` to select the first worker node in our list and submit the command via `ssh`.

```bash
$ ssh `head -1 ./slurm_worker.txt` free
@@ -155,13 +155,13 @@ Mem: 16139424 1433696 7885256 766356 6820472 13593564
Swap: 0 0 0
```

-Note the available memory listed in the "available" column.
+Note the available memory listed in the **available** column.

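If you prefer to capture these values rather than copying them from the screen, the following optional sketch (an editor's addition, not part of the original tutorial) collects them into shell variables. The variable names are illustrative; the installer in the next section expects these values as arguments.

```bash
# Editor's sketch (not from the original tutorial): collect the values Slurm needs
# from the first worker node into shell variables. Variable names are illustrative.
node=$(head -1 ./slurm_worker.txt)
specs=$(ssh "$node" lscpu)

cpus=$(echo "$specs"    | awk -F: '/^CPU\(s\):/            {gsub(/[[:space:]]/, "", $2); print $2}')
sockets=$(echo "$specs" | awk -F: '/^Socket\(s\):/          {gsub(/[[:space:]]/, "", $2); print $2}')
cores=$(echo "$specs"   | awk -F: '/^Core\(s\) per socket:/ {gsub(/[[:space:]]/, "", $2); print $2}')
threads=$(echo "$specs" | awk -F: '/^Thread\(s\) per core:/ {gsub(/[[:space:]]/, "", $2); print $2}')
mem=$(ssh "$node" free  | awk '/^Mem:/ {print $7}')   # seventh column is "available"

echo "CPUs=$cpus Sockets=$sockets Cores/socket=$cores Threads/core=$threads AvailableMem=$mem"
```
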

## Install Slurm on your cluster

-### Pre-requisite: Install MariaDB
+### Prerequisite: Install MariaDB

-Slurm requires the MySql fork of MariaDB to be installed from the Red Hat repository before it can be installed. Azure maintains a private Red Hat repository mirror and chamber VMs have access to this repository. Install and configure MariaDB with the following:
+Slurm requires the MySQL fork, MariaDB, to be installed from the Red Hat repository before Slurm can be installed. Azure maintains a private Red Hat repository mirror, and chamber VMs have access to this repository. Install and configure MariaDB with the following commands:

```bash
sudo yum install -y mariadb-server
@@ -170,11 +170,11 @@ sudo systemctl enable mariadb
mysql_secure_installation
```

-The *mysql_secure_installation* script requires more configuration.
+The *mysql_secure_installation* script asks for more configuration.

-* The default, new installation password isn't set. Hit **Enter** when asked for current password.
-* Enter 'Y' when asked to set root password. Create a new, secure root password for MariaDB, then reenter to confirm. You'll need this password later when you set up the Slurm controller in the following step.
-* Enter 'Y' for the remaining questions for:
+* The default database root password isn't set. Press **Enter** when asked for the current password.
+* Enter *Y* when asked to set a root password. Create a new, secure root password for MariaDB, take note of it for later, then reenter it to confirm. You need this password when you configure the Slurm controller in the following step.
+* Enter *Y* for the remaining questions for:
* Reloading privileged tables
* Removing anonymous users
* Disabling remote root login
@@ -183,13 +183,13 @@ The *mysql_secure_installation* script requires more configuration.

### Install Slurm on the controller

-The Modeling and Simulation Workbench provides a setup script to speed installation. It requires the parameters you collected earlier in this tutorial. Replace the placeholders in these example commands with the parameters you collected. Execute these commands on the controller node. The *\<clusternodes>* placeholder is a comma-separated, no space list of hostnames. The examples include a shortcut to do so, reformatting your compute node list into the proper comma-delimited format to prevent having to create another file. The format of the *sdwChamberSlurm* script is as follows:
+The Modeling and Simulation Workbench provides a setup script to speed installation. It requires the parameters you collected earlier in this tutorial. Replace the placeholders with the parameters you collected. Execute these commands on the controller node. The \<clusterNodes\> placeholder is a comma-separated, no-space list of hostnames. The examples include a shortcut that reformats your compute node list in *slurm_worker.txt* into the proper comma-delimited format. The arguments of the *sdwChamberSlurm.sh* script are as follows:

```bash
sudo /usr/sdw/slurm/sdwChamberSlurm.sh CONTROLLER <databaseSecret> <clusterNodes> <numberOfCpus> <numberOfSockets> <coresPerSocket> <threadsPerCore> <availableMemory>
```

-For this example, we use the list of nodes we created in the previous steps and substitute our values collected during discovery. The *paste* command is used to reformat the list of worker nodes into the comma-delimited format without needing to create a new file.
+For this example, we use the list of nodes we created in the previous steps and substitute our values collected during discovery. The `paste` command is used to reformat the list of worker nodes into the comma-delimited format without needing to create a new file.

```bash
$ sudo /usr/sdw/slurm/sdwChamberSlurm.sh CONTROLLER <databasepassword> `paste -d, -s ./slurm_worker.txt` 4 1 2 2 13593564
@@ -210,22 +210,23 @@ Complete!
```

> [!TIP]
-> If your installation shows an [ERROR] message, check that you haven't mistyped any parameter. Review your information and repeat the step.
+> If your installation shows any [ERROR] message in these steps, check that you haven't mistyped or misplaced any parameter. Review your information and repeat the step.

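As an optional check before moving on (an editor's addition, not part of the original tutorial), you can confirm that the controller-side services are active. This assumes the setup script registers the standard Slurm units, *slurmctld* and *slurmdbd*, with systemd; adjust the unit names if your installation differs.

```bash
# Optional editor's check, not from the original tutorial. Assumes the standard
# slurmctld (controller) and slurmdbd (accounting) systemd units were installed.
sudo systemctl status mariadb munge slurmctld slurmdbd --no-pager
```
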
### Install Slurm on compute nodes

-Slurm must now be installed on the compute nodes. To ease this task, you can use your home directory which is mounted on all VMs, to ease distribution of files and scripts used.
+Slurm must now be installed on the compute nodes. To ease this task, use your home directory, which is mounted on all VMs, to distribute the files and scripts used.

From your user account, copy the *munge.key* file to your home directory.

```bash
+cd
sudo cp /etc/munge/munge.key .
```

Create a script named *node-munge.sh* to set up each node's **munge** settings. This script should be in your home directory.

```bash
-$ cat >> node-munge.sh <<END
+$ cat > node-munge.sh <<END
#!/bin/bash

@@ -236,7 +237,7 @@ chown -R munge:munge /etc/munge/munge.key
END
```

-Using the same file of the node hostnames that you previously used, execute the bash script you just created on the node.
+Using the same file of node hostnames that you used previously, execute the bash script you created on each node.

```bash
$ for host in `cat ./slurm_worker.txt`; do ssh $host sudo sh ~/node-munge.sh; done
@@ -266,32 +267,29 @@ Installed:
Complete!
```

-After setting up the compute nodes, be sure to delete the *munge.key* file from your home directory.
-
-```bash
-rm munge.key
-```
+> [!IMPORTANT]
+> After configuring the compute nodes, be sure to delete the *munge.key* file from your home directory.

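Before validating Slurm itself, you can optionally confirm that munge authentication works between the controller and the compute nodes. This sketch is an editor's addition, not part of the original tutorial; it uses the standard `munge`/`unmunge` test and the *slurm_worker.txt* list created earlier.

```bash
# Optional editor's check, not from the original tutorial: a credential generated on
# the controller should decode successfully on every compute node.
for host in `cat ./slurm_worker.txt`; do
    echo "== $host =="
    munge -n | ssh $host unmunge | grep STATUS
done
```
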
## Validate installation

-To validate that Slurm installed, a Chamber Admin can execute the *sinfo* command on any Slurm node, either the controller or a compute node.
+To validate that Slurm installed successfully, a Chamber Admin can execute the `sinfo` command on any Slurm node, either on the controller or on a compute node.

```bash
$ sinfo
PARTITION AVAIL TIMELIMIT NODES STATE NODELIST
chamberSlurmPartition1* up infinite 5 idle wrkldvmslurm-nod0aef63d,wrkldvmslurm-nod034b970...
```

-You can validate execution on compute nodes by sending a simple command with the *srun* command.
+You can validate execution on compute nodes by sending a simple command using the `srun` command.

```shell
-$ srun --nodes=3 hostname && srun sleep 30
+$ srun --nodes=6 hostname && srun sleep 30
wrkldvmslurm-nod034b970
wrkldvmslurm-nod0aef63d
wrkldvmslurm-nod10870ad
```
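You can also exercise the scheduler with a small batch job. The following script is an editor's sketch, not part of the original tutorial; the file name *hello.sbatch* and the job parameters are illustrative. A submitted job appears in the queue output shown next.

```bash
# Editor's sketch, not from the original tutorial: submit a small batch job.
# The file name and options are illustrative.
cat > hello.sbatch <<'END'
#!/bin/bash
#SBATCH --job-name=hello
#SBATCH --nodes=2
#SBATCH --time=00:05:00
srun hostname
END

sbatch hello.sbatch
```
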

-If jobs show as queued, run *squeue* to list the job queue.
+If a job shows as *queued*, run `squeue` to list the job queue.

```shell
$ squeue
@@ -328,7 +326,7 @@ JobId=12 JobName=sleep

## Troubleshooting

-If a node's state is *down* or *drain*, the *scontrol* command can restart it. Follow that with the *sinfo* command to verify.
+If a node's state is reported as *down* or *drain*, the `scontrol` command can restart it. Follow that with the `sinfo` command to verify activation.

```bash
$ sudo -u slurm scontrol update nodename=nodename1,nodename2 state=resume
