---
title: Slurm Cloud Bursting Using Azure CycleCloud
description: Learn how to configure Slurm Cloud bursting using Azure CycleCloud.
author: vinilv
ms.date: 09/12/2024
ms.author: padmalathas
---

# What is Cloud Bursting?

Cloud bursting is a configuration in cloud computing that allows an organization to handle peaks in IT demand by using a combination of private and public clouds. When the resources in a private cloud reach their maximum capacity, the overflow traffic is directed to a public cloud so that there is no interruption in services. This setup provides flexibility and cost savings, because you pay for the additional resources only when there is demand for them.

For example, an application can run on a private cloud and "burst" to a public cloud only when necessary to meet peak demands. This approach avoids the cost of maintaining extra capacity that is not always in use.

Cloud bursting can be used in various scenarios, such as sending on-premises workloads to the cloud for processing, known as hybrid HPC (High-Performance Computing). It lets users optimize resource utilization and cost efficiency while taking advantage of the scalability and flexibility of the cloud.

## Requirements to Set Up Slurm Cloud Bursting Using CycleCloud on Azure

## Azure subscription account

You must have an Azure subscription, or be assigned the Owner role on an existing subscription.

* To create an Azure subscription, go to the [Create a Subscription](/azure/cost-management-billing/manage/create-subscription#create-a-subscription) site.
* To access an existing subscription, go to the [Azure portal](https://portal.azure.com/).

## Network infrastructure

If you intend to create a Slurm cluster entirely within Azure, you must deploy both the head node(s) and the CycleCloud compute nodes within a single Azure Virtual Network (VNET).

![Slurm cluster](../images/slurm-cloud-burst/diagram.png)

However, if your goal is to establish a hybrid HPC cluster with the head node(s) located on your on-premises corporate network and the compute nodes in Azure, you need to set up a [Site-to-Site](/azure/vpn-gateway/tutorial-site-to-site-portal) VPN or an [ExpressRoute](/azure/expressroute/) connection between your on-premises network and the Azure VNET. The head node(s) must be able to connect to Azure services over the Internet. You may need to coordinate with your network administrator to configure this connectivity.
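
If you don't already have a VNET for the cluster, the following Azure CLI sketch shows roughly how to create one; the resource group, names, location, and address ranges are placeholders that you should adapt to your environment.

```sh
# Placeholder resource group, names, location, and address ranges.
az group create --name hpc-rg --location eastus

# A single VNET that holds both the scheduler (head) node and the CycleCloud compute nodes.
az network vnet create \
  --resource-group hpc-rg \
  --name hpc-vnet \
  --address-prefixes 10.222.0.0/16 \
  --subnet-name compute \
  --subnet-prefixes 10.222.1.0/24
```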

## Network Ports and Security

The following NSG rules must be configured for successful communication between the Slurm Master node, the CycleCloud server, and the compute nodes.

| **Service** | **Port** | **Protocol** | **Direction** | **Purpose** | **Requirement** |
|------------------------------------|-----------------|--------------|------------------|------------------------------------------------------------------------|---------------------------------------------------------------------------------|
| **SSH (Secure Shell)** | 22 | TCP | Inbound/Outbound | Secure command-line access to the Slurm Master node | Open on both on-premises firewall and Azure NSGs |
| **Slurm Control (slurmctld, slurmd)** | 6817, 6818 | TCP | Inbound/Outbound | Communication between Slurm Master and compute nodes | Open on both on-premises firewall and Azure NSGs |
| **Munge Authentication Service** | 4065 | TCP | Inbound/Outbound | Authentication between Slurm Master and compute nodes | Open on both on-premises network and Azure NSGs |
| **CycleCloud Service** | 443 | TCP | Outbound | Communication between Slurm Master node and Azure CycleCloud | Allow outbound connections to Azure CycleCloud services from the Slurm Master node |

For more information, see the [Slurm Network Configuration Guide](https://slurm.schedmd.com/network.html).
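
As an example, if the compute-node subnet is protected by an NSG, the Slurm control ports from the table can be opened with the Azure CLI roughly as follows; the resource group, NSG name, priority, and source range are placeholders.

```sh
# Placeholder resource group, NSG name, priority, and source range.
az network nsg rule create \
  --resource-group hpc-rg \
  --nsg-name hpc-compute-nsg \
  --name allow-slurm-control \
  --priority 1010 \
  --direction Inbound \
  --access Allow \
  --protocol Tcp \
  --source-address-prefixes 10.222.1.0/24 \
  --destination-port-ranges 6817 6818
```

Repeat the rule for the SSH and Munge ports as needed, and mirror the rules on your on-premises firewall for a hybrid setup.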

## CycleCloud version and Project version

We are using the following versions:

- **[CycleCloud](/azure/cyclecloud/release-notes/8-6-0?view=cyclecloud-8):** 8.6.0-3223
- **[cyclecloud-slurm Project](https://github.com/Azure/cyclecloud-slurm/releases/3.0.6):** 3.0.6

## Slurm and OS version

* Slurm version: 23.02.7-1
* OS version on Scheduler and execute nodes: AlmaLinux release 8.7 (almalinux:almalinux-hpc:8_7-hpc-gen2:latest)

## NFS File server

A shared file system is required between the external Slurm Scheduler node and the CycleCloud cluster. You can use Azure NetApp Files, Azure Files, NFS, or other methods to mount the same file system on both sides. In this example, we use the Scheduler VM as an NFS server.
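
The scripts used in the next section configure NFS exports on the Scheduler VM. If you bring your own NFS server instead, the end state should look roughly like the sketch below; the export options and network range are assumptions, and `10.222.1.26` is the NFS server IP used later in this example.

```sh
# On the NFS server: example /etc/exports entries for the two shares used by the cluster.
# /sched   10.222.0.0/16(rw,sync,no_root_squash)
# /shared  10.222.0.0/16(rw,sync,no_root_squash)

# On a client: mount the shares from the NFS server.
sudo mount -t nfs 10.222.1.26:/sched /sched
sudo mount -t nfs 10.222.1.26:/shared /shared
```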

## Steps

Once the prerequisites are in place, follow these steps to integrate the external Slurm Scheduler node with the CycleCloud cluster:

### 1. On CycleCloud VM:

* Ensure the CycleCloud 8.6 VM is running and accessible via the `cyclecloud` CLI.
* Clone this repository and import a cluster using the provided CycleCloud template (`slurm-headless.txt`).
* In this example, we import a cluster named `hpc1` using the `slurm-headless.txt` template.

```sh
git clone https://github.com/user/slurm-cloud-bursting-using-cyclecloud.git
cyclecloud import_cluster hpc1 -c Slurm-HL -f slurm-cloud-bursting-using-cyclecloud/templates/slurm-headless.txt
```

Output:

```
[user@cc86 ~]$ cyclecloud import_cluster hpc1 -c Slurm-HL -f slurm-cloud-bursting-using-cyclecloud/cyclecloud-template/slurm-headless.txt
Importing cluster Slurm-HL and creating cluster hpc1....
----------
hpc1 : off
----------
Resource group:
Cluster nodes:
Total nodes: 0
```
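
Optionally, you can confirm the import from the CycleCloud VM before moving on; the exact fields shown can vary by CLI version.

```sh
# Show the imported (still empty) cluster.
cyclecloud show_cluster hpc1
```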

### 2. Preparing Scheduler VM:

* Deploy a VM using the specified AlmaLinux image (if you have an existing Slurm Scheduler, you can skip this step); a deployment sketch follows this list.
* Run the Slurm scheduler installation script (`slurm-scheduler-builder.sh`) and provide the cluster name (`hpc1`) when prompted.
* This script installs and configures the Slurm Scheduler.
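
If you need to create the Scheduler VM, a minimal sketch using the AlmaLinux HPC image mentioned above is shown here; the resource group, VM size, and network names are placeholders, and the VM must be attached to the VNET described earlier.

```sh
# Placeholder resource group, VM size, and network names.
az vm create \
  --resource-group hpc-rg \
  --name masternode2 \
  --image almalinux:almalinux-hpc:8_7-hpc-gen2:latest \
  --size Standard_D4s_v3 \
  --vnet-name hpc-vnet \
  --subnet compute \
  --admin-username azureuser \
  --generate-ssh-keys
```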

```sh
git clone https://github.com/user/slurm-cloud-bursting-using-cyclecloud.git
cd slurm-cloud-bursting-using-cyclecloud/scripts
sh slurm-scheduler-builder.sh
```

Output:

```
------------------------------------------------------------------------------------------------------------------------------
Building Slurm scheduler for cloud bursting with Azure CycleCloud
------------------------------------------------------------------------------------------------------------------------------

Enter Cluster Name: hpc1
------------------------------------------------------------------------------------------------------------------------------

Summary of entered details:
Cluster Name: hpc1
Scheduler Hostname: masternode2
NFSServer IP Address: 10.222.1.26
```

### 3. CycleCloud UI:

* Access the CycleCloud UI, edit the `hpc1` cluster settings, and configure the VM SKUs and networking settings.
* Enter the NFS server IP address for the `/sched` and `/shared` mounts in the Network Attached Storage section.
* Save the settings and start the `hpc1` cluster.

![CycleCloud UI](../images/slurm-cloud-burst/cyclecloudui_config.png)

### 4. On Slurm Scheduler Node:

* Integrate the external Slurm Scheduler with CycleCloud using the `cyclecloud-integrator.sh` script.
* Provide the CycleCloud details (username, password, and URL) when prompted. (Enter the details manually rather than copying and pasting them; pasted values can contain stray whitespace that breaks the connection setup.)

```sh
cd slurm-cloud-bursting-using-cyclecloud/scripts
sh cyclecloud-integrator.sh
```

Output:

```
[root@masternode2 scripts]# sh cyclecloud-integrator.sh
Please enter the CycleCloud details to integrate with the Slurm scheduler

Enter Cluster Name: hpc1
Enter CycleCloud Username: user
Enter CycleCloud Password:
Enter CycleCloud URL (e.g., https://10.222.1.19): https://10.222.1.19
------------------------------------------------------------------------------------------------------------------------------

Summary of entered details:
Cluster Name: hpc1
CycleCloud Username: user
CycleCloud URL: https://10.222.1.19
------------------------------------------------------------------------------------------------------------------------------
```
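
After the integration completes, you can sanity-check the scheduler configuration; depending on your Slurm settings, powered-down cloud nodes may appear in `sinfo` with a `~` suffix, and the resume/suspend hooks should point at the scripts installed by the integration step (exact paths may differ).

```sh
# Partitions and cloud nodes known to the scheduler.
sinfo

# Resume/suspend programs configured for cloud bursting.
scontrol show config | grep -i -E "resumeprogram|suspendprogram"
```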

### 5. User and Group Setup:

* Ensure consistent user and group IDs across all nodes.
* It is best to use a centralized user management system such as LDAP to keep UIDs and GIDs consistent across all nodes.
* In this example, we use the `users.sh` script to create a test user `user` and group for job submission. (The user `user` already exists in CycleCloud.)

```sh
cd slurm-cloud-bursting-using-cyclecloud/scripts
sh users.sh
```
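
If you prefer to create the test account by hand instead of running `users.sh`, the essential requirement is a fixed UID and GID that match the account on the CycleCloud compute nodes; the numeric IDs below are placeholders.

```sh
# Placeholder UID/GID - must match the IDs used on the CycleCloud compute nodes.
sudo groupadd -g 20001 user
sudo useradd -m -u 20001 -g 20001 user
```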

### 6. Testing & Job Submission:

* Log in as a test user (`user` in this example) on the Scheduler node.
* Submit a test job to verify the setup.

```sh
su - user
srun hostname &
```

Output:

```
[root@masternode2 scripts]# su - user
Last login: Tue May 14 04:54:51 UTC 2024 on pts/0
[user@masternode2 ~]$ srun hostname &
[1] 43448
[user@masternode2 ~]$ squeue
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
1 hpc hostname vinil CF 0:04 1 hpc1-hpc-1
[user@masternode2 ~]$ hpc1-hpc-1
```

A new node is created in the `hpc1` cluster.

![CycleCloud UI New node](../images/slurm-cloud-burst/cyclecloud_ui_newnode.png)
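
As a further check, you can submit a small batch job; the partition name `hpc` matches the `squeue` output above, so adjust it if your cluster uses a different partition.

```sh
# Create a minimal batch script and submit it with sbatch.
cat > test_job.sh << 'EOF'
#!/bin/bash
#SBATCH --job-name=burst-test
#SBATCH --partition=hpc
#SBATCH --nodes=1
#SBATCH --output=burst-test.%j.out
hostname
EOF
sbatch test_job.sh
```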

## Next Steps

* [GitHub repo - slurm-cloud-bursting-using-cyclecloud](https://github.com/vinil-v/slurm-cloud-bursting-using-cyclecloud)
* [Azure CycleCloud Documentation](https://learn.microsoft.com)
* [Slurm documentation](https://slurm.schedmd.com)