
Commit 033f01a

committed
writeup fixes, additional resources, fix alt-texts
1 parent e49e597 commit 033f01a

File tree

1 file changed (+44, -39 lines)


articles/modeling-simulation-workbench/tutorial-install-slurm.md

Lines changed: 44 additions & 39 deletions
Original file line number | Diff line number | Diff line change
@@ -1,6 +1,6 @@
11
---
22
title: "Tutorial: Install the Slurm Workload Manager on Azure Modeling and Simulation Workbench"
3-
description: "Tutorial on installing the Slurm Workload Manager in the Azure Modeling and Simulation Workbench"
3+
description: "Learn how to install the Slurm Workload Manager in the Azure Modeling and Simulation Workbench"
44
author: yousefi-msft
55
ms.author: yousefi
66
ms.service: modeling-simulation-workbench
@@ -10,19 +10,19 @@ ms.date: 10/02/2024
1010
#CustomerIntent: As an administrator, I want to learn how to install, setup, and configure the Slurm workload manager in the Azure Modeling and Simulation Workbench.
1111
---
1212

13-
# Tutorial: Set up the Slurm scheduler in Modeling and Simulation Workbench
13+
# Tutorial: Install the Slurm workload manager in the Azure Modeling and Simulation Workbench
1414

15-
The [Slurm](https://slurm.schedmd.com/overview.html) Workload Manager is a scheduler used in microelectronics design and other high-performance computing scenarios to manage jobs across clusters. The Modeling and Simulation Workbench can be deployed with a range of high-performance virtual machines (VM). Slurm is to manage scheduling of complex workloads across a set of compute nodes.
15+
The [Slurm](https://slurm.schedmd.com/overview.html) Workload Manager is a scheduler used in microelectronics design and other high-performance computing scenarios to manage jobs across compute clusters. The Modeling and Simulation Workbench can be deployed with a range of high-performance virtual machines (VMs) ideal for large, compute-intensive workloads. A Slurm cluster consists of a *controller node*, where the administrator manages, stages, and schedules jobs, and *compute nodes*, where the actual workloads are performed. A *node* is simply a member of the cluster, in this case a VM.
1616

17-
The Slurm installation package is already available on all VMs. This tutorial shows you how to install Slurm across your workbench.
17+
The Slurm installation package is already available on all Modeling and Simulation Workbench chamber VMs. This tutorial shows you how to create the VMs for your Slurm cluster and install Slurm on them.
1818

1919
In this tutorial, you learn how to:
2020

2121
> [!div class="checklist"]
2222
>
2323
> * Create a cluster for Slurm
2424
> * Create an inventory of VMs
25-
> * Designate a controller and compute nodes and install Slurm on each
25+
> * Designate controller and compute nodes and install Slurm on each
2626
2727
If you don’t have an Azure subscription, [create a free account](https://azure.microsoft.com/free/?WT.mc_id=A261C142F).
2828

@@ -34,56 +34,60 @@ If you don’t have a Azure subscription, [create a free account](https://azure.
3434

3535
## Sign in to the Azure portal and navigate to your workbench
3636

37+
If you're not already signed in to the Azure portal, go to [https://portal.azure.com](https://portal.azure.com). Navigate to your workbench, then to the chamber where you'll create your Slurm cluster.
38+
3739
## Create a cluster for Slurm
3840

39-
Slurm requires one node to serve as the controller and a set of worker nodes. The controller is traditionally a modestly sized VM as it isn't used for workloads and is left deployed between jobs, while the worker nodes are sized for the workload.
41+
Slurm requires one node to serve as the controller and a set of compute nodes where workloads will execute. The controller is traditionally a modestly sized VM as it isn't used for computational tasks and is left deployed between jobs, while the compute nodes are sized for the workload. Learn about the different VMs available in Modeling and Simulation Workbench on the [VM Offerings page](./concept-vm-offerings.md).
4042

4143
### Create the Slurm controller node
4244

43-
1. From the chamber overview page, select **Chamber VM** from the **Settings** menu, then either the **+ Create** button at the top or the **Create chamber VM** button in center of the page.
44-
:::image type="content" source="media/tutorial-slurm/create-chamber-vm.png" alt-text="Screen shot of chamber VM overview page with Chamber VM in Settings and the create options on the page highlighted by red outlines.":::
45+
1. From the chamber overview page, select **Chamber VM** from the **Settings** menu, then either the **+ Create** button on the action menu along the top or the blue **Create chamber VM** button in the center of the page.
46+
:::image type="content" source="media/tutorial-slurm/create-chamber-vm.png" alt-text="Screenshot of chamber VM overview page with Chamber VM in Settings and the create options on the page highlighted by red outlines.":::
4547
1. On the **Create chamber VM** page:
46-
* Enter a **Name** for the VM
47-
* Select a VM. For the controller, you can select the smallest VM available, which is the D4s_v4.
48-
* Leave the **Chamber VM image type** and **Chamber VM count** as the default of "Semiconductor" and 1.
48+
* Enter a **Name** for the VM. Choose a name that indicates this is the controller node.
49+
* Select a VM size. For the controller, you can select the smallest VM available.
50+
* Leave the **Chamber VM image type** and **Chamber VM count** as the default of "Semiconductor" and "1".
4951
* Select **Review + create**.
50-
:::image type="content" source="media/tutorial-slurm/configure-create-chamber-vm.png" alt-text="Create chamber VM page with the name and VM size textboxes and the create button highlighted in red outline.":::
52+
:::image type="content" source="media/tutorial-slurm/configure-create-chamber-vm.png" alt-text="Screenshot of create chamber VM page with the name and VM size textboxes and the create button highlighted in red outline.":::
5153
1. After the validation check passes, select the **Create** button.
5254

53-
Once the VM deploys, it's available in the desktop dashboard.
55+
Once the VM deploys, it's available in the connector desktop dashboard.
5456

55-
### Create the Slurm worker cluster
57+
### Create the Slurm compute cluster
5658

57-
A *cluster* is a collection of VMs, referred to as *nodes* that perform the actual work, at the direction of the controller. Similar to the steps in creating the controller, return to the **Chamber VM** page and create a cluster. The Modeling and Simulation Workbench makes it easy to create identical copies of a VM.
59+
A *cluster* is a collection of VMs, individually referred to as *nodes*, that perform the actual work. The compute nodes have their workloads dispatched and managed by the controller node. Similar to the steps you took to create the controller, return to the **Chamber VM** page and create a cluster. The Modeling and Simulation Workbench lets you create multiple identical VMs in a single operation.
5860

5961
1. On the **Create chamber VM** page:
60-
* Enter a **Name** for the VM cluster. Use a name that identifies these VMs as nodes. For example, include the word "node" in the name.
62+
* Enter a **Name** for the VM cluster. Use a name that identifies these VMs as compute nodes. For example, include the word "node" in the name.
6163
* Select a VM appropriate for the workload. Refer to the [VM Offerings](concept-vm-offerings.md) page for guidance on VM capabilities and sizes.
6264
* Leave the **Chamber VM image type** as the default of "Semiconductor".
6365
* In the **Chamber VM count** box, enter the number of nodes required.
6466
* Select **Review + create**.
6567
1. After the validation check passes, select the **Create** button.
6668

67-
VMs are deployed in parallel and appear in the dashboard. It isn't customary or even necessary access worker nodes, however you can ssh to worker nodes if needed.
69+
VMs are deployed in parallel and appear in the dashboard. It isn't typically necessary to access compute nodes; however, you can SSH to any node in the same chamber if needed.
6870

69-
### Connect to controller node VM
71+
### Connect to the controller node
7072

71-
Slurm configuration is performed on the controller node.
73+
Slurm installation is performed from the controller node.
7274

73-
1. Navigate to the connector. From the **Settings** menu of the chamber, select **Connector**, then the connector that appears in the resource list.
74-
:::image type="content" source="media/tutorial-slurm/connector-overview.png" alt-text="Screenshot of connector overview page with Connector in Settings and the connector highlighted with a red rectangle.":::
75+
1. Navigate to the connector. From the **Settings** menu of the chamber, select **Connector**, then select the sole connector that appears in the resource list.
76+
:::image type="content" source="media/tutorial-slurm/connector-overview.png" alt-text="Screenshot of connector overview page with Connector in Settings and the target connector highlighted with a red rectangle.":::
7577
1. From the connector page, select the **Desktop dashboard** URL.
7678
1. The desktop dashboard opens. Select your controller VM.
7779

7880
## Create an inventory of VMs
7981

82+
Slurm installation requires a technical inventory of the compute nodes, as well as their hostnames.
83+
8084
### Get a list of available VMs
8185

8286
Configuring Slurm requires an inventory of nodes. From the controller node:
8387

8488
1. Open a terminal in your desktop.
8589
:::image type="content" source="media/tutorial-slurm/open-terminal.png" alt-text="Screenshot of desktop with terminal button highlighted in red.":::
86-
1. Execute the following bash script to print a list of all VMs in the chamber. In this example, we have one controller and five nodes. The command prints the IP address in the first column and the hostname in the second.
90+
1. Execute the following bash script to print a list of all VMs in the chamber. In this example, we have one controller and five nodes. The command prints the IP addresses in the first column and the hostnames in the second.
8791

8892
```bash
8993
$ ip=$(hostname -i | cut -d'.' -f1-3)
@@ -96,11 +100,11 @@ Configuring Slurm requires an inventory of nodes. From the controller node:
96100
10.163.4.9 wrkldvmslurm-nod034b970
97101
```
98102

99-
1. Create a file with just the worker nodes, one host per line and call it *slurm_worker.txt*. For the remaining steps of the tutorial, you'll use this list to configure the compute nodes from your controller. In some steps, the nodes need to be in comma-delimited format. For those instances, we use a Linux shortcut to format without having to create a new file. Create a file of your Slurm worker nodes as your hostnames are different than what's listed in this tutorial even if you created identically named VMs.
103+
1. Create a file with just the compute nodes, one hostname per line, and call it *slurm_worker.txt*. For the remaining steps of this tutorial, you'll use this list to configure the compute nodes from your controller. In some steps, the nodes need to be in a comma-delimited format; in those instances, we use a command-line shortcut to reformat the list without creating a new file. Build the file from your own inventory, as your hostnames won't match the names you entered in the portal. A minimal example follows.
100104
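As a sketch of this step only, assuming the example hostname shown above (your own hostnames and node count will differ), you could build *slurm_worker.txt* and preview the comma-delimited form like this. The *paste* one-liner is one way to produce the comma-delimited list; the tutorial's own shortcut may differ.

```bash
# One compute-node hostname per line; the hostname below is a placeholder from the example output.
cat > ./slurm_worker.txt << 'EOF'
wrkldvmslurm-nod034b970
EOF

# Print the same list in comma-delimited form without creating another file.
paste -sd, ./slurm_worker.txt
```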
101-
### Get technical specifications about compute nodes
105+
### Gather technical specifications about the compute nodes
102106
103-
Assuming that you created all the worker nodes in your cluster using the same VM, choose any of the nodes to retrieve technical information about the worker cluster. Select one of the nodes from the list you created earlier and execute the following commands:
107+
Assuming that you created all the compute nodes in your cluster using the same VM size, choose any node to retrieve technical information about the platform. In this example, we use *head* to grab the first hostname in the compute node list:
104108
105109
```bash
106110
$ ssh `head -1 ./slurm_worker.txt` lscpu
@@ -135,14 +139,14 @@ NUMA node0 CPU(s): 0-3
135139
Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ss ht syscall nx pdpe1gb rdtscp lm constant_tsc rep_good nopl xtopology cpuid pni pclmulqdq vmx ssse3 fma cx16 pcid sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm abm 3dnowprefetch invpcid_single tpr_shadow vnmi ept vpid ept_ad fsgsbase bmi1 hle avx2 smep bmi2 erms invpcid rtm avx512f avx512dq rdseed adx smap clflushopt clwb avx512cd avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves avx512_vnni arch_capabilities
136140
```
137141

138-
You'll be asked by the ssh client to verify the ECDSA key fingerprint of the remote machines. Executing the following command returns the technical inventory of the node. Take note of the following parameters:
142+
You'll be asked by the ssh client to verify the ECDSA key fingerprint of the remote machines. Take note of the following parameters (a short extraction sketch follows this list):
139143

140144
* **CPU(s)**
141145
* **Socket(s)**
142146
* **Core(s) per socket**
143147
* **Thread(s) per core**
144148
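If you want to pull just those four values, a small filter like the following works. This is a sketch only; it assumes the *slurm_worker.txt* file created earlier and standard *lscpu* output labels.

```bash
# Query the first compute node and keep only the CPU topology fields Slurm needs.
node=$(head -1 ./slurm_worker.txt)
ssh "$node" lscpu | grep -E '^(CPU\(s\)|Socket\(s\)|Core\(s\) per socket|Thread\(s\) per core):'
```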

145-
To discover the **available** memory of a worker node, execute *free* command for any similar compute worker from your controller and note the **available** output from the *free* command:
149+
Slurm also requires an estimate of the available memory on the compute nodes. To obtain it, execute the *free* command on any compute node from your controller and note the **available** memory reported in the output. Again, we use the *head* command to target the first node in our list.
146150

147151
```bash
148152
$ ssh `head -1 ./slurm_worker.txt` free
@@ -153,11 +157,11 @@ Swap: 0 0 0
153157

154158
Note the available memory listed in the "available" column.
155159
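To capture only that number, you can filter the *free* output. This is a sketch under the same assumptions as above; the **available** figure is the seventh column of the `Mem:` row and is reported in KiB by default.

```bash
# Print the "available" memory from the first compute node in the list.
ssh "$(head -1 ./slurm_worker.txt)" free | awk '/^Mem:/ {print $7}'
```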

156-
## Installation
160+
## Install Slurm on your cluster
157161

158-
### Install MariaDB
162+
### Prerequisite: Install MariaDB
159163

160-
Slurm requires the MySql fork of MariaDB to be installed from the Red Hat repository before it can be installed. Azure maintains a private Red Hat repository mirror and chamber VMs have access to this repository. Install and configure MariaDB. You're asked to create a database password. Create a secure
164+
Slurm requires MariaDB, a fork of MySQL, to be installed from the Red Hat repository before Slurm itself can be installed. Azure maintains a private Red Hat repository mirror, and chamber VMs have access to this repository. Install and configure MariaDB with the following commands:
161165

162166
```bash
163167
sudo yum install -y mariadb-server
@@ -166,22 +170,20 @@ sudo systemctl enable mariadb
166170
mysql_secure_installation
167171
```
168172

169-
The secure installation script requires further configuration.
173+
The *mysql_secure_installation* script prompts you for more configuration:
170174

171175
* On a new installation, no root password is set. Press **Enter** when asked for the current password.
172-
* Enter 'Y' or hit Enter when asked to set root password. Create a new, secure root password for MariaDB, then reenter to confirm. You'll need this password later to set up the Slurm controller in the following step.
176+
* Enter 'Y' when asked to set the root password. Create a new, secure root password for MariaDB, then reenter it to confirm. You'll need this password later when you set up the Slurm controller in the following step.
173177
* Enter 'Y' for the remaining prompts, which cover:
174178
* Reloading privileged tables
175179
* Removing anonymous users
176180
* Disabling remote root login
177181
* Removing test databases
178-
* Reloading privilege tables.
179-
180-
### Install Slurm on the cluster
182+
* Reloading privilege tables
181183
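As an optional sanity check (not part of the original steps), you can confirm that the MariaDB service is running before you install Slurm:

```bash
# Verify that MariaDB is active before proceeding.
sudo systemctl status mariadb --no-pager
```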

182184
### Install Slurm on the controller
183185

184-
The Modeling and Simulation Workbench provides a setup script to aid installation. It requires the parameters you collected earlier. Replace the placeholder in these example commands with the parameters you collected. Execute these commands on the controller node. For the *\<clusternodes>*, substitute the comma-separated list of hostnames. The examples include a shortcut to do so, reformatting your compute node list in the proper comma-delimited format.
186+
The Modeling and Simulation Workbench provides a setup script to speed installation. It requires the parameters you collected earlier in this tutorial; replace the placeholders in these example commands with those values, and execute the commands on the controller node. The *\<clusterNodes>* placeholder is a comma-separated list of hostnames with no spaces. The examples include a shortcut that reformats your compute node list into the proper comma-delimited format, so you don't have to create another file. The format of the *sdwChamberSlurm.sh* script is as follows:
185187

186188
```bash
187189
sudo /usr/sdw/slurm/sdwChamberSlurm.sh CONTROLLER <databaseSecret> <clusterNodes> <numberOfCpus> <numberOfSockets> <coresPerSocket> <threadsPerCore> <availableMemory>
@@ -208,7 +210,7 @@ Complete!
208210
```
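For illustration only, a filled-in invocation might look like the following. Every value here is a hypothetical placeholder: substitute your own database password and the CPU, socket, core, thread, and memory figures you collected. The node list is produced with the comma-delimited shortcut described above.

```bash
# Hypothetical values; replace <databaseSecret> and the hardware numbers with your own inventory data.
sudo /usr/sdw/slurm/sdwChamberSlurm.sh CONTROLLER '<databaseSecret>' \
  "$(paste -sd, ./slurm_worker.txt)" 4 1 2 2 15000000
```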
209211

210212
> [!TIP]
211-
> If your installation shows an [ERROR] message, you may have mistyped a parameter. The installation wasn't successful. Review your information and repeat the step.
213+
> If your installation shows an [ERROR] message, check that you haven't mistyped any parameter. Review your information and repeat the step.
212214
213215
### Install Slurm on compute nodes
214216

@@ -234,7 +236,7 @@ chown -R munge:munge /etc/munge/munge.key
234236
END
235237
```
236238

237-
Using the same file of the node hostnames that you used in the setup, execute the bash script you created. From the controller, execute the following command:
239+
Using the same file of node hostnames that you created earlier, execute the bash script you just created on each compute node. From the controller, run the following command:
238240

239241
```bash
240242
$ for host in `cat ./slurm_worker.txt`; do ssh $host sudo sh ~/node-munge.sh; done
@@ -272,7 +274,7 @@ rm munge.key
272274

273275
## Validate installation
274276

275-
To validate that Slurm installed, a Chamber Admin can execute the *sinfo* command on any Slurm node, either the controller or compute node. Run *sinfo* to verify the status.
277+
To validate that Slurm installed successfully, a Chamber Admin can execute the *sinfo* command on any Slurm node, either the controller or a compute node.
276278

277279
```bash
278280
$ sinfo
@@ -342,3 +344,6 @@ chamberSlurmPartition1* up infinite 3 idle nodename[1-3]
342344
* [Slurm Workload Manager Quick Start Administrator Guide](https://slurm.schedmd.com/quickstart_admin.html)
343345
* [Slurm Workload Manager configuration](https://slurm.schedmd.com/slurm.conf.html)
344346
* [Slurm Accounting Storage configuration](https://slurm.schedmd.com/slurmdbd.conf.html)
347+
* [VM Offerings on Modeling and Simulation Workbench](./concept-vm-offerings.md)
348+
* [Create chamber storage](./how-to-guide-manage-chamber-storage.md)
349+
* [Create shared storage](./how-to-guide-manage-shared-storage.md)

0 commit comments
