articles/modeling-simulation-workbench/tutorial-install-slurm.md

---
title: "Tutorial: Install the Slurm Workload Manager on Azure Modeling and Simulation Workbench"
description: "Learn how to install the Slurm Workload Manager in the Azure Modeling and Simulation Workbench"
author: yousefi-msft
ms.author: yousefi
ms.service: modeling-simulation-workbench
ms.date: 10/02/2024
#CustomerIntent: As an administrator, I want to learn how to install, set up, and configure the Slurm workload manager in the Azure Modeling and Simulation Workbench.
---
# Tutorial: Install the Slurm workload manager in the Azure Modeling and Simulation Workbench

The [Slurm](https://slurm.schedmd.com/overview.html) Workload Manager is a scheduler used in microelectronics design and other high-performance computing scenarios to manage jobs across compute clusters. The Modeling and Simulation Workbench can be deployed with a range of high-performance virtual machines (VMs) ideal for large, compute-intensive workloads. Slurm clusters consist of a *controller node*, where the administrator manages, stages, and schedules jobs bound for the *compute nodes*, where the actual workloads run. A *node* is simply a member of the cluster, in this case a VM.

The Slurm installation package is already available on all Modeling and Simulation Workbench chamber VMs. This tutorial shows you how to create VMs for your Slurm cluster and install Slurm.

In this tutorial, you learn how to:

> [!div class="checklist"]
>
> * Create a cluster for Slurm
> * Create an inventory of VMs
> * Designate controller and compute nodes and install Slurm on each

If you don't have an Azure subscription, [create a free account](https://azure.microsoft.com/free/?WT.mc_id=A261C142F).
## Sign in to the Azure portal and navigate to your workbench
If you aren't already signed in to the Azure portal, go to [https://portal.azure.com](https://portal.azure.com). Navigate to your workbench, then to the chamber where you'll create your Slurm cluster.
## Create a cluster for Slurm
Slurm requires one node to serve as the controller and a set of compute nodes where workloads will execute. The controller is traditionally a modestly sized VM as it isn't used for computational tasks and is left deployed between jobs, while the compute nodes are sized for the workload. Learn about the different VMs available in Modeling and Simulation Workbench on the [VM Offerings page](./concept-vm-offerings.md).
### Create the Slurm controller node
1. From the chamber overview page, select **Chamber VM** from the **Settings** menu, then select either the **+ Create** button on the action menu along the top or the blue **Create chamber VM** button in the center of the page.

    :::image type="content" source="media/tutorial-slurm/create-chamber-vm.png" alt-text="Screenshot of chamber VM overview page with Chamber VM in Settings and the create options on the page highlighted by red outlines.":::
1. On the **Create chamber VM** page:
    * Enter a **Name** for the VM. Choose a name that indicates that this VM is the controller node.
    * Select a VM size. For the controller, you can select the smallest VM available.
    * Leave the **Chamber VM image type** and **Chamber VM count** at their defaults of "Semiconductor" and "1".
    * Select **Review + create**.

    :::image type="content" source="media/tutorial-slurm/configure-create-chamber-vm.png" alt-text="Screenshot of create chamber VM page with the name and VM size textboxes and the create button highlighted in red outline.":::
1. After the validation check passes, select the **Create** button.

Once the VM deploys, it's available in the connector desktop dashboard.
### Create the Slurm compute cluster
A *cluster* is a collection of VMs, individually referred to as *nodes*, that perform the actual work. The compute nodes have their workloads dispatched and managed by the controller node. Similar to the steps you took when creating the controller, return to the **Chamber VM** page and create a cluster. The Modeling and Simulation Workbench allows you to create identical VMs.
1. On the **Create chamber VM** page:
    * Enter a **Name** for the VM cluster. Use a name that identifies these VMs as compute nodes. For example, include the word "node" in the name.
    * Select a VM appropriate for the workload. Refer to the [VM Offerings](concept-vm-offerings.md) page for guidance on VM capabilities and sizes.
    * Leave the **Chamber VM image type** as the default of "Semiconductor".
    * In the **Chamber VM count** box, enter the number of nodes required.
    * Select **Review + create**.
1. After the validation check passes, select the **Create** button.

VMs are deployed in parallel and appear in the dashboard. It isn't typically necessary to access the worker nodes, but you can ssh to them from within the same chamber if needed.
### Connect to the controller node
Slurm installation is performed from the controller node.
1. Navigate to the connector. From the **Settings** menu of the chamber, select **Connector**, then select the sole connector that appears in the resource list.

    :::image type="content" source="media/tutorial-slurm/connector-overview.png" alt-text="Screenshot of connector overview page with Connector in Settings and the target connector highlighted with a red rectangle.":::
1. From the connector page, select the **Desktop dashboard** URL.
1. The desktop dashboard opens. Select your controller VM.
## Create an inventory of VMs
Slurm installation requires that you have a technical inventory of the compute nodes, as well as their host names.
### Get a list of available VMs
Configuring Slurm requires an inventory of nodes. From the controller node:
1. Open a terminal in your desktop.

    :::image type="content" source="media/tutorial-slurm/open-terminal.png" alt-text="Screenshot of desktop with terminal button highlighted in red.":::
1. Execute the following bash script to print a list of all VMs in the chamber. In this example, we have one controller and five nodes. The command prints the IP addresses in the first column and the hostnames in the second.

    ```bash
    $ ip=$(hostname -i | cut -d'.' -f1-3)
    ...
    10.163.4.9 wrkldvmslurm-nod034b970
    ```
1. Create a file with just the worker nodes, one host per line, and call it *slurm_worker.txt*. For the remaining steps of this tutorial, you use this list to configure the compute nodes from your controller. In some steps, the node list needs to be in a comma-delimited format. In those instances, we use a command-line shortcut to reformat the list without creating a new file, as shown in the following example. Create the file from your own hostnames, because they won't match the names you entered in the portal.
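
    For example (the hostnames here are illustrative; use the ones printed by the inventory step), the file and one possible comma-delimited shortcut might look like this:

    ```bash
    # slurm_worker.txt: one compute-node hostname per line (illustrative names).
    cat > slurm_worker.txt <<'EOF'
    wrkldvmslurm-nod034b970
    wrkldvmslurm-nod1a2b3c4
    EOF

    # One way to reformat the list as a comma-delimited string without
    # creating another file; the tutorial's own shortcut may differ.
    paste -sd, slurm_worker.txt
    ```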
### Gather technical specifications about the compute nodes
Assuming that you created all the worker nodes in your cluster using the same VM size, choose any node to retrieve technical information about the platform. In this example, we use *head* to grab the first host name in the compute node list:
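
A minimal sketch of that query, assuming *lscpu* is the tool that reports the topology fields listed below:

```bash
# Query the first compute node in slurm_worker.txt for its CPU topology.
# lscpu reports CPU(s), Socket(s), Core(s) per socket, and Thread(s) per core.
ssh "$(head -1 ./slurm_worker.txt)" lscpu
```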
The SSH client asks you to verify the ECDSA key fingerprint of the remote machines. Take note of the following parameters:

* **CPU(s)**
* **Socket(s)**
* **Core(s) per socket**
* **Thread(s) per core**

Slurm also requires an estimate of the available memory on the compute nodes. To obtain the available memory of a worker node, execute the *free* command on any of the compute nodes from your controller and note the **available** memory reported in the output. Again, we target the first worker node in our list by using the *head* command.
```bash
$ ssh `head -1 ./slurm_worker.txt` free
...
Swap:             0           0           0
```
Note the available memory listed in the "available" column.
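
If you prefer to capture the figure directly, one option is the following one-liner, which prints just the available value (in MB) from the first compute node:

```bash
# Print only the "available" column of the Mem: row, in MB.
ssh "$(head -1 ./slurm_worker.txt)" free -m | awk '/^Mem:/ {print $7}'
```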
## Install Slurm on your cluster
### Prerequisite: Install MariaDB
Slurm requires MariaDB, the open-source fork of MySQL, to be installed from the Red Hat repository before Slurm itself can be installed. Azure maintains a private Red Hat repository mirror, and chamber VMs have access to it. Install and configure MariaDB with the following:
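
A minimal sketch of that setup, assuming the standard Red Hat package and service names for MariaDB (the exact package set on your chamber VM may differ):

```bash
# Install the MariaDB server and client libraries from the mirrored repository.
sudo yum install -y mariadb-server mariadb-devel

# Start the database and enable it across reboots.
sudo systemctl enable --now mariadb

# Harden the installation; this is where you set the root database password.
sudo mysql_secure_installation
```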
The *mysql_secure_installation* script requires more configuration:
* On a new installation, the root password isn't set. Press **Enter** when asked for the current password.
* Enter 'Y' when asked to set the root password. Create a new, secure root password for MariaDB, then reenter it to confirm. You'll need this password later when you set up the Slurm controller in the following step.
* Enter 'Y' for the remaining questions:
    * Reloading privileged tables
    * Removing anonymous users
    * Disabling remote root login
    * Removing test databases
    * Reloading privilege tables
### Install Slurm on the controller
The Modeling and Simulation Workbench provides a setup script to speed installation. It requires the parameters you collected earlier in this tutorial. Replace the placeholders in these example commands with the parameters you collected, and execute the commands on the controller node. The *\<clusternodes>* placeholder is a comma-separated list of hostnames with no spaces. The examples include a shortcut that reformats your compute node list into the proper comma-delimited format so that you don't have to create another file. The format of the *sdwChamberSlurm* script is as follows:
> If your installation shows an [ERROR] message, check that you haven't mistyped a parameter. Review your information and repeat the step.

Using the same file of node hostnames that you used previously, execute the bash script you just created on each node. From the controller, execute the following command:
```bash
$ for host in `cat ./slurm_nodes.txt`; do ssh $host sudo sh ~/node-munge.sh; done
```
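
As an optional check (standard MUNGE usage rather than a Workbench-specific step), you can confirm that every node accepts credentials generated on the controller:

```bash
# Encode a credential locally and decode it on each node; every node should
# report STATUS: Success if the munge key and clocks are in agreement.
for host in `cat ./slurm_nodes.txt`; do munge -n | ssh $host unmunge | grep STATUS; done
```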
## Validate installation
To validate that Slurm installed, a Chamber Admin can execute the *sinfo* command on any Slurm node, either the controller or a compute node.
```bash
$ sinfo
...
chamberSlurmPartition1*    up   infinite      3   idle nodename[1-3]
```
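
As a further smoke test (standard Slurm usage; adjust the node count to match your cluster), you can run a trivial job across the compute nodes:

```bash
# Run hostname on three nodes; each compute node's name should print once.
srun -N 3 hostname
```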