---
title: Slurm Cloud Bursting Using Azure CycleCloud
description: Learn how to configure Slurm cloud bursting using Azure CycleCloud.
author: vinilv
ms.date: 09/12/2024
ms.author: padmalathas
---

# What is Cloud Bursting?

Cloud bursting is a configuration in cloud computing that allows an organization to handle peaks in IT demand by using a combination of private and public clouds. When the resources in a private cloud reach their maximum capacity, the overflow traffic is directed to a public cloud so that services continue without interruption. This setup provides flexibility and cost savings: you pay for the additional resources only when there is demand for them.

For example, an application can run on a private cloud and "burst" to a public cloud only when necessary to meet peak demands. This approach avoids the cost of maintaining extra capacity that is not always in use.

Cloud bursting applies to various scenarios, such as sending on-premises workloads to the cloud for processing, known as hybrid HPC (High-Performance Computing). It lets users optimize resource utilization and cost efficiency while gaining the scalability and flexibility of the cloud.

## Requirements to Set Up Slurm Cloud Bursting Using CycleCloud on Azure

## Azure subscription account
You must have an Azure subscription or be assigned the Owner role on an existing subscription.

* To create an Azure subscription, go to the [Create a Subscription](/azure/cost-management-billing/manage/create-subscription#create-a-subscription) site.
* To access an existing subscription, go to the [Azure portal](https://portal.azure.com/).

## Network infrastructure
If you intend to create a Slurm cluster entirely within Azure, deploy both the head node(s) and the CycleCloud compute nodes within a single Azure Virtual Network (VNet).

However, if your goal is a hybrid HPC cluster with the head node(s) on your on-premises corporate network and the compute nodes in Azure, set up a [Site-to-Site](/azure/vpn-gateway/tutorial-site-to-site-portal) VPN or an [ExpressRoute](/azure/expressroute/) connection between your on-premises network and the Azure VNet. The head node(s) must be able to reach Azure services over the Internet; you may need to coordinate with your network administrator to configure this connectivity.

## Network Ports and Security
Configure the following NSG rules to allow communication between the Slurm master node, the CycleCloud server, and the compute nodes.

| **Service** | **Port** | **Protocol** | **Direction** | **Purpose** | **Requirement** |
|---|---|---|---|---|---|
| **SSH (Secure Shell)** | 22 | TCP | Inbound/Outbound | Secure command-line access to the Slurm master node | Open on both the on-premises firewall and Azure NSGs |
| **Slurm Control (slurmctld, slurmd)** | 6817, 6818 | TCP | Inbound/Outbound | Communication between the Slurm master and compute nodes | Open on both the on-premises firewall and Azure NSGs |
| **Munge Authentication Service** | 4065 | TCP | Inbound/Outbound | Authentication between the Slurm master and compute nodes | Open on both the on-premises network and Azure NSGs |
| **CycleCloud Service** | 443 | TCP | Outbound | Communication between the Slurm master node and Azure CycleCloud | Allow outbound connections from the Slurm master node to Azure CycleCloud services |

For more information, see the [Slurm Network Configuration Guide](https://slurm.schedmd.com/network.html).
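
Before building the cluster, it can help to confirm that these ports are actually reachable. Below is a minimal sketch using bash's built-in `/dev/tcp` device; the addresses and hostnames in the comments are the placeholder values from this walkthrough, not fixed requirements.

```sh
#!/bin/bash
# Probe a TCP port using bash's built-in /dev/tcp device.
# Returns 0 if the port accepts a connection within 3 seconds.
check_port() {
  local host=$1 port=$2
  timeout 3 bash -c "exec 3<>/dev/tcp/${host}/${port}" 2>/dev/null
}

# Illustrative checks (placeholder addresses from this walkthrough):
# check_port 10.222.1.19 443   # CycleCloud server, from the master node
# check_port masternode2 6817  # slurmctld, from a compute node
```

Each check succeeds or fails silently via its exit code, so it can be wired into a pre-flight script before starting the cluster.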

## CycleCloud version and Project version

This guide uses the following versions:
- **[CycleCloud](/azure/cyclecloud/release-notes/8-6-0?view=cyclecloud-8):** 8.6.0-3223
- **[cyclecloud-slurm Project](https://github.com/Azure/cyclecloud-slurm/releases/tag/3.0.6):** 3.0.6

## Slurm and OS version

* Slurm version: 23.02.7-1
* OS version on scheduler and execute nodes: AlmaLinux release 8.7 (almalinux:almalinux-hpc:8_7-hpc-gen2:latest)

## NFS file server
A shared file system is required between the external Slurm scheduler node and the CycleCloud cluster. You can use Azure NetApp Files, Azure Files, NFS, or other methods to mount the same file system on both sides. In this example, the scheduler VM also acts as the NFS server.
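
The compute nodes and the scheduler must see the same data under the same paths. As an illustration (the server IP `10.222.1.26` and the export names are the example values from this walkthrough), the `/etc/fstab` entries on a node mounting the scheduler's exports might look like:

```
# /etc/fstab (illustrative; 10.222.1.26 is this walkthrough's example NFS server)
10.222.1.26:/sched   /sched   nfs  rw,hard,intr  0 0
10.222.1.26:/shared  /shared  nfs  rw,hard,intr  0 0
```

On the CycleCloud side, the same exports are entered in the cluster's Network Attached Storage settings rather than in `/etc/fstab` directly.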

## Steps

Once the prerequisites are ready, follow these steps to integrate the external Slurm scheduler node with the CycleCloud cluster:

### 1. On CycleCloud VM:

* Ensure the CycleCloud 8.6 VM is running and accessible via the `cyclecloud` CLI.
* Clone the repository and import a cluster using the provided CycleCloud template (`slurm-headless.txt`).
* This example imports a cluster named `hpc1` using the `slurm-headless.txt` template.
  ```sh
  git clone https://github.com/vinil-v/slurm-cloud-bursting-using-cyclecloud.git
  cyclecloud import_cluster hpc1 -c Slurm-HL -f slurm-cloud-bursting-using-cyclecloud/templates/slurm-headless.txt
  ```
  Output:
  ```
  [user@cc86 ~]$ cyclecloud import_cluster hpc1 -c Slurm-HL -f slurm-cloud-bursting-using-cyclecloud/cyclecloud-template/slurm-headless.txt
  Importing cluster Slurm-HL and creating cluster hpc1....
  ----------
  hpc1 : off
  ----------
  Resource group:
  Cluster nodes:
  Total nodes: 0
  ```

### 2. Preparing Scheduler VM:

* Deploy a VM using the specified AlmaLinux image (if you already have a Slurm scheduler, you can skip this step).
* Run the Slurm scheduler installation script (`slurm-scheduler-builder.sh`) and provide the cluster name (`hpc1`) when prompted.
* This script installs and configures the Slurm scheduler.
  ```sh
  git clone https://github.com/vinil-v/slurm-cloud-bursting-using-cyclecloud.git
  cd slurm-cloud-bursting-using-cyclecloud/scripts
  sh slurm-scheduler-builder.sh
  ```
  Output:
  ```
  ------------------------------------------------------------------------------------------------------------------------------
  Building Slurm scheduler for cloud bursting with Azure CycleCloud
  ------------------------------------------------------------------------------------------------------------------------------

  Enter Cluster Name: hpc1
  ------------------------------------------------------------------------------------------------------------------------------

  Summary of entered details:
  Cluster Name: hpc1
  Scheduler Hostname: masternode2
  NFSServer IP Address: 10.222.1.26
  ```

### 3. CycleCloud UI:

* Access the CycleCloud UI, edit the `hpc1` cluster settings, and configure the VM SKUs and networking settings.
* In the Network Attached Storage section, enter the NFS server IP address for the `/sched` and `/shared` mounts.
* Save and start the `hpc1` cluster.

### 4. On Slurm Scheduler Node:

* Integrate the external Slurm scheduler with CycleCloud using the `cyclecloud-integrator.sh` script.
* Provide the CycleCloud details (username, password, and URL) when prompted. (Enter the details manually rather than copying and pasting; pasted values can contain stray whitespace that breaks the connection.)
  ```sh
  cd slurm-cloud-bursting-using-cyclecloud/scripts
  sh cyclecloud-integrator.sh
  ```
  Output:
  ```
  [root@masternode2 scripts]# sh cyclecloud-integrator.sh
  Please enter the CycleCloud details to integrate with the Slurm scheduler

  Enter Cluster Name: hpc1
  Enter CycleCloud Username: user
  Enter CycleCloud Password:
  Enter CycleCloud URL (e.g., https://10.222.1.19): https://10.222.1.19
  ------------------------------------------------------------------------------------------------------------------------------

  Summary of entered details:
  Cluster Name: hpc1
  CycleCloud Username: user
  CycleCloud URL: https://10.222.1.19
  ------------------------------------------------------------------------------------------------------------------------------
  ```

### 5. User and Group Setup:

* Ensure consistent user and group IDs across all nodes.
* Prefer a centralized user management system such as LDAP so that UIDs and GIDs stay consistent across all nodes.
* In this example, the `users.sh` script creates a test user `user` and group for job submission. (User `user` exists in CycleCloud.)
  ```sh
  cd slurm-cloud-bursting-using-cyclecloud/scripts
  sh users.sh
  ```
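
Consistency can be spot-checked with a small helper that compares a user's UID and GID against expected values; run the same check on the scheduler and on a compute node. This is a sketch; the user name and ID values in the example are placeholders.

```sh
#!/bin/bash
# Return 0 if the user exists with the expected UID and GID.
check_ids() {
  local user=$1 want_uid=$2 want_gid=$3
  local uid gid
  uid=$(id -u "$user" 2>/dev/null) || return 1
  gid=$(id -g "$user" 2>/dev/null) || return 1
  [ "$uid" -eq "$want_uid" ] && [ "$gid" -eq "$want_gid" ]
}

# Example (placeholder IDs): check_ids user 1001 1001 && echo "IDs consistent"
```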

### 6. Testing & Job Submission:

* Log in as a test user (`user` in this example) on the scheduler node.
* Submit a test job to verify the setup.
  ```sh
  su - user
  srun hostname &
  ```
  Output:
  ```
  [root@masternode2 scripts]# su - user
  Last login: Tue May 14 04:54:51 UTC 2024 on pts/0
  [user@masternode2 ~]$ srun hostname &
  [1] 43448
  [user@masternode2 ~]$ squeue
  JOBID PARTITION     NAME     USER ST  TIME  NODES NODELIST(REASON)
      1       hpc hostname    vinil CF  0:04      1 hpc1-hpc-1
  [user@masternode2 ~]$ hpc1-hpc-1
  ```
A new compute node is created in the `hpc1` cluster.
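
Interactive `srun` is one way to trigger a burst; batch jobs follow the same path. Below is a minimal `sbatch` script sketch; the partition name `hpc` is taken from the example output above and may need to be adjusted to match your cluster template.

```sh
#!/bin/bash
#SBATCH --job-name=burst-test
#SBATCH --partition=hpc            # partition from the example above (assumption)
#SBATCH --nodes=1
#SBATCH --output=burst-test.%j.out

# Print where the job ran; on a burst this is a CycleCloud-provisioned
# node such as hpc1-hpc-1 rather than the scheduler itself.
hostname
```

Submitting this with `sbatch burst-test.sh` should likewise cause CycleCloud to start a compute node when none is idle.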

### Next Steps

* [GitHub repo - slurm-cloud-bursting-using-cyclecloud](https://github.com/vinil-v/slurm-cloud-bursting-using-cyclecloud)
* [Azure CycleCloud Documentation](https://learn.microsoft.com/azure/cyclecloud/)
* [Slurm documentation](https://slurm.schedmd.com)