Commit 7e75562

Jill Grant authored
Merge pull request #285931 from tomvcassidy/hpcScaffoldAndMigrationGuide
Hpc toc scaffold and migration guide
2 parents ce0b798 + a1df0ba commit 7e75562

23 files changed

+1797
-0
lines changed
Lines changed: 61 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,61 @@
1+
- name: High-Performance Computing on-premises to cloud lift and shift
2+
href: index.yml
3+
- name: Get started
4+
expanded: true
5+
items:
6+
- name: Overview
7+
href: lift-and-shift-overview.md
8+
- name: Migration guide
9+
expanded: true
10+
items:
11+
- name: Proof-of-concept migration guide
12+
href: lift-and-shift-proof-of-concept.md
13+
- name: Production-level environment migration guide
14+
expanded: true
15+
items:
16+
- name: Overview
17+
href: lift-and-shift-production-level-overview.md
18+
- name: Deployment step 1 - Basic infrastructure
19+
items:
20+
- name: Overview
21+
href: lift-and-shift-step-1-overview.md
22+
- name: Resource group
23+
href: lift-and-shift-step-1-resource-group.md
24+
- name: Network access
25+
href: lift-and-shift-step-1-networking.md
26+
- name: Storage
27+
href: lift-and-shift-step-1-storage.md
28+
- name: Deployment step 2 - Base services
29+
items:
30+
- name: Overview
31+
href: lift-and-shift-step-2-overview.md
32+
- name: Job scheduler
33+
href: lift-and-shift-step-2-job-scheduler.md
34+
- name: Resource orchestrator
35+
href: lift-and-shift-step-2-resource-orchestrator.md
36+
- name: Identity management
37+
href: lift-and-shift-step-2-identity.md
38+
- name: Accounting
39+
href: lift-and-shift-step-2-accounting.md
40+
- name: Monitoring
41+
href: lift-and-shift-step-2-monitor.md
42+
- name: Deployment step 3 - Storage
43+
items:
44+
- name: Overview
45+
href: lift-and-shift-step-3-overview.md
46+
- name: Storage
47+
href: lift-and-shift-step-3-storage.md
48+
- name: Data migration
49+
href: lift-and-shift-step-3-data-migration.md
50+
- name: Deployment step 4 - Compute nodes
51+
items:
52+
- name: Overview
53+
href: lift-and-shift-step-4-overview.md
54+
- name: VM images
55+
href: lift-and-shift-step-4-vm-images.md
56+
- name: Deployment step 5 - End user entry point
57+
items:
58+
- name: Overview
59+
href: lift-and-shift-step-5-overview.md
60+
- name: End-user entry point
61+
href: lift-and-shift-step-5-end-user-entry-point.md
Lines changed: 88 additions & 0 deletions
@@ -0,0 +1,88 @@
1+
---
2+
title: "End-to-end high-performance computing (HPC) lift and shift architecture overview"
3+
description: Learn about how to conduct a lift and shift migration of HPC infrastructure and workloads from an on-premises environment to the cloud.
4+
author: tomvcassidy
5+
ms.author: tomcassidy
6+
ms.date: 08/30/2024
7+
ms.topic: how-to
8+
ms.service: azure-virtual-machines
9+
ms.subservice: hpc
10+
---
11+
12+
# End-to-end HPC lift and shift architecture overview
13+
14+
"Lift and shift" in the context of High-Performance Computing (HPC) mostly refers to the process of migrating an on-premises environment and workload to the cloud. Ideally, modifications are kept to a minimum (for example, applications, job schedulers, and their configurations should remain mostly the same). Adjustments on storage and hardware are natural to happen because resources are different from on-premises to cloud platforms. With the lift and shift approach, organizations can start benefiting from the cloud more quickly.
15+
16+
The following figure represents a typical on-premises HPC cluster in a production environment, which the hardware manufacturer often delivers. Such an on-premises environment comprises a set of compute nodes, which may or may not work with virtual machine images and containers. These nodes execute workloads managed by a job scheduler, typically Slurm, PBS, or LSF. The workloads come from multiple users that have identity management associated with them. Usually there are home directories, scratch disks, and long-term storage. Some form of monitoring to check the performance of jobs and the health of compute nodes is also available. Users can access the environment via command line, browsers, or some kind of remote visualization technology. The entire environment is hosted in a private network, so users need some mechanism to access the computing facility, either via VPN or via a portal.
17+
18+
:::image type="content" source="media/on-premises-old-icons.png" alt-text="Diagram depicting existing on-premises environment architecture.":::
19+
20+
As we see throughout this document, a cloud environment that follows the Infrastructure-as-a-Service model isn't conceptually very different. Some technologies need updates, and some extra steps are necessary during the migration from on-premises to the cloud.
21+
22+
This document therefore:
23+
24+
- Goes through the options for the migration process;
25+
- Provides pointers to products and best practices for each component;
26+
- Provides recommendations to avoid pitfalls in the process.
27+
28+
Before jumping into the architecture description, it's relevant to understand
29+
the different personas in this context, their needs, and expectations.
30+
31+
## Personas and user experience
32+
33+
There are different people who need to access the HPC environment. Their activities and how they interact with the environment vary quite a bit.
34+
35+
### End-user (engineer / scientist / researcher)
36+
37+
This persona represents the subject matter expert (for example, a biologist, physicist, or engineer) who wants to run experiments (that is, submit jobs) and analyze results. End-users interact with system administrators to fine-tune the computing environment whenever needed. They may have some experience using CLI-based tools, but some of them may rely only on web portals or graphical user interfaces via VDI to submit their jobs and interact with the generated results.
38+
39+
**New responsibilities in cloud HPC environment:**
40+
41+
- End-users shouldn't have any new responsibilities, because the HPC administrator and the cloud administrator handle the cloud-specific work. Depending on the on-premises environment, end-users may gain access to a larger capacity and variety of computing resources, which can make them more productive.
42+
43+
### HPC administrator
44+
45+
This persona represents the one who has HPC expertise and is responsible for deploying the initial computing infrastructure and adapting it according to business and end-user needs. This persona is also responsible for verifying the health of the system and performing troubleshooting. HPC administrators are comfortable accessing the architecture and its components via CLI, SDKs, and web portals. They're also the first point of contact when end-users face any challenge with the computing environment.
46+
47+
**New responsibilities in cloud HPC environment:**
48+
49+
- Managing cloud resources and services (for example, virtual machines, storage, networking) via cloud management platforms.
50+
- Implementing and managing clusters and resources via new resource orchestration tools (for example, CycleCloud).
51+
- Optimizing application deployment by understanding infrastructure details (that is, VM types, storage, and network options).
52+
- Optimizing resource utilization and costs by using cloud-specific features such as autoscaling and spot instances.
53+
54+
### Cloud administrator
55+
56+
This persona works with the HPC administrator to help deploy and maintain the computing infrastructure. This persona isn't (necessarily) an HPC expert, but a Cloud expert with deep knowledge of the overall company IT infrastructure, including network configurations/policies, user access rights, and user devices. Depending on the case, the HPC administrator and Cloud administrator may be the same person.
57+
58+
**New responsibilities in cloud HPC environment:**
59+
60+
- Collaborating with HPC administrators to ensure seamless integration of HPC workloads with cloud infrastructure.
61+
- Monitoring and managing cloud infrastructure performance, security, and compliance.
62+
- Helping with the configuration of cloud-based networking and storage solutions to support HPC workloads.
63+
64+
### Business manager / owner
65+
66+
This persona represents the one who is responsible for the business, which includes taking care of budget and projects to meet organizational goals. For this persona, the accounting component of the architecture is relevant to understand costs for each project. This persona works with HPC admins and end-users to understand platform needs, including storage, network, and computing resources. They also plan for future workloads.
67+
68+
**New responsibilities in cloud HPC environment:**
69+
70+
- Analyzing detailed cost reports and usage metrics provided by cloud service providers to manage budgets and forecast expenses.
71+
- Making strategic decisions based on cloud resource usage and cost optimization opportunities.
72+
- Planning and approving cloud infrastructure investments to support future HPC workloads and business objectives.
73+
74+
## Lift and shift architecture overview
75+
76+
:::image type="content" source="media/visio-lift-shift-arch-background.png" alt-text="Diagram depicting target HPC Cloud architecture.":::
77+
78+
A production HPC environment in the cloud comprises several components. Some core components are needed to stand up an environment, such as a job scheduler, a resource provider, an entry point for the user to access the environment, and compute and storage devices, among others. As the environment gets into production, monitoring, observability, health checks, security, identity management, accounting, and different storage options, among other components, start to play a critical role.
79+
80+
There are also extensions that could be in place, such as sign-in nodes, data movers, use of containers, license managers, among others that are dependent on the installation.
81+
82+
This production-level environment may have various components to be set up. Therefore, environment deployers and managers become key to automate its initial deployment and upgrade it along the way, respectively. More advanced installations can also have environment templates (or specifications) with software versions and configurations that are optimized and properly tested. Once the environment is in production with all the required components in place, adjustments may be required over time to meet user demands, including changes in VM types or storage options/capabilities.
83+
84+
## Instantiating the lift and shift HPC cloud architecture
85+
86+
Here we provide more details for each architecture component, including pointers to official Azure products, tech blogs with some best practices, git repositories, and links to non-product solutions.
87+
88+
**Quick start.** For a quick start solution to create an HPC environment in the cloud with basic building blocks, we recommend using [Azure CycleCloud Slurm workspace](https://techcommunity.microsoft.com/t5/azure-high-performance-computing/introducing-azure-cyclecloud-slurm-workspace-preview/ba-p/4158433).
Lines changed: 39 additions & 0 deletions
@@ -0,0 +1,39 @@
1+
---
2+
title: "Production-level environment migration guide overview"
3+
description: Learn about what a production-level environment migration entails.
4+
author: tomvcassidy
5+
ms.author: tomcassidy
6+
ms.date: 08/30/2024
7+
ms.topic: how-to
8+
ms.service: azure-virtual-machines
9+
ms.subservice: hpc
10+
---
11+
12+
# Production-level environment migration guide overview
13+
14+
When you move an HPC infrastructure from the on-premises environment to the cloud, there are various aspects to take into account. This document provides guidance on how to create such an HPC environment in the cloud. We recommend
15+
a two-phase approach: first a proof-of-concept, and then a production-level environment. Once the production environment is up and running, only certain components should be modified over time, including changes to VM types and storage capabilities to best meet the varying requirements of users, projects, and business.
16+
17+
In this article and the following articles, we guide you through a production-level environment migration.
18+
19+
## Prerequisites
20+
21+
You need an Azure subscription to provision cloud resources.
22+
23+
## Migrating from on-premises to the cloud: production level
24+
25+
After the proof-of-concept phase, planning is required to get ready for creating a production-level HPC environment. This new environment can represent part of the on-premises infrastructure (for example, an HPC cluster from a group of clusters or queue/partition from an existing cluster), or the entire computing capability.
26+
27+
Due to component dependencies, the deployment of this HPC cloud environment is based on a sequence of deployments, which consists of:
28+
29+
1. Basic infrastructure, which includes creation of a resource group, network access and
30+
network security rules;
31+
1. Base services, which include identity management, job scheduler, and resource
32+
provisioner, along with their respective configurations;
33+
1. Storage;
34+
1. Compute nodes' specifications;
35+
1. End user entry point.
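
The ordering above can be made explicit inside an environment deployer. The following is a minimal Python sketch, assuming each step depends only on the step listed before it. The step identifiers and the `DEPENDS_ON` mapping are hypothetical names chosen for illustration and aren't part of any Azure tool.

```python
from graphlib import TopologicalSorter

# Hypothetical step identifiers. Each key maps to the set of steps it depends on.
# Assumption: a simple linear chain that mirrors the numbered list above.
DEPENDS_ON = {
    "basic_infrastructure": set(),
    "base_services": {"basic_infrastructure"},
    "storage": {"base_services"},
    "compute_nodes": {"storage"},
    "end_user_entry_point": {"compute_nodes"},
}


def deployment_order(depends_on):
    """Return the steps in an order that satisfies every dependency."""
    return list(TopologicalSorter(depends_on).static_order())


if __name__ == "__main__":
    for number, step in enumerate(deployment_order(DEPENDS_ON), start=1):
        print(f"{number}. {step}")
```

If a real environment has finer-grained dependencies (for example, monitoring that depends on identity management), they can be added to the mapping, and the sorter raises `graphlib.CycleError` when the dependencies contradict each other.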
36+
37+
In the following articles, we cover each deployment step and the components involved. In the descriptions of the components, we highlight their relevant dependencies in more detail. The component deployment steps can be executed in several ways. We provide a few tips to help you get started with deploying the components via the Azure portal. At a production level, however, we recommend creating an environment deployer that uses infrastructure-as-code (for example, via Bicep, Terraform, or Azure CLI). That way, the environment can be created in an automated and repeatable fashion.
38+
39+
For each step, certain topics need to be assessed before starting the migration process.
Lines changed: 27 additions & 0 deletions
@@ -0,0 +1,27 @@
1+
---
2+
title: "Proof-of-concept migration overview"
3+
description: Learn about what a proof-of-concept migration entails and follow the guide through one.
4+
author: tomvcassidy
5+
ms.author: tomcassidy
6+
ms.date: 08/30/2024
7+
ms.topic: how-to
8+
ms.service: azure-virtual-machines
9+
ms.subservice: hpc
10+
---
11+
12+
# Proof-of-concept migration overview
13+
14+
When you move an HPC infrastructure from the on-premises environment to the cloud, there are various aspects to take into account. This document provides guidance on how to create such an HPC environment in the cloud. We recommend
15+
a two-phase approach: first a proof-of-concept, and then a production-level environment. Once the production environment is up and running, only certain components should be modified over time, including changes to VM types and storage capabilities to best meet the varying requirements of users, projects, and business.
16+
17+
In this article, we guide you through a proof-of-concept migration.
18+
19+
## Prerequisites
20+
21+
You need an Azure subscription to provision cloud resources.
22+
23+
## Migrating from on-premises to the cloud: proof-of-concept (PoC)
24+
25+
We recommend starting with a proof-of-concept (PoC) by provisioning a simple cluster in Azure, using Azure CycleCloud as a resource orchestrator, with one well-known scheduler, such as Slurm, PBS, or LSF. This approach allows one to start understanding Azure technology, assess the functionality of user applications, and investigate performance/cost trade-offs in comparison to the on-premises environment.
26+
27+
If you're flexible about the job scheduler, or already use Slurm, we recommend using Azure CycleCloud Slurm workspace, an offering that helps create a CycleCloud-based cluster with the Slurm scheduler and a basic setup for networking and storage. Some details on this process are available in the Resource Orchestrator section of this document.
Lines changed: 67 additions & 0 deletions
@@ -0,0 +1,67 @@
1+
---
2+
title: "Deployment step 1: basic infrastructure - network access component"
3+
description: Learn about the configuration of network access during migration deployment step one.
4+
author: tomvcassidy
5+
ms.author: tomcassidy
6+
ms.date: 08/30/2024
7+
ms.topic: how-to
8+
ms.service: azure-virtual-machines
9+
ms.subservice: hpc
10+
---
11+
12+
# Deployment step 1: basic infrastructure - network access component
13+
14+
The network access component is the mechanism that allows users to access the cloud environment in a secure way. It's common practice in production environments to have resources with private IP addresses, along with rules that define how those resources can be accessed.
15+
16+
This component should:
17+
18+
- Allow users to access the private network hosting the high-performance computing (HPC) environment;
19+
- Refine network security rules such as source and target ports and IP addresses that can access resources.
20+
21+
## Define network needs
22+
23+
* **Estimate cluster size for proper network setup:**
24+
- Different subnets have different ranges of IP addresses.
25+
26+
* **Security rules:**
27+
- Understand how users access the HPC environment and which security rules need to be in place (for example, ports and IPs open/closed).
28+
29+
## Tools and Services
30+
31+
* **Private network access:**
32+
- In Azure, the two major components that help users access a private network are Azure Bastion and Azure VPN Gateway.
33+
34+
* **Network rules:**
35+
- Another key component for network setup is Azure network security groups, which are used to filter network traffic between Azure resources in an Azure virtual network.
36+
37+
* **DNS:**
38+
- Azure DNS Private Resolver lets you query Azure DNS private zones from an on-premises environment, and vice versa, without deploying VM-based DNS servers.
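
To reason about the network rules mentioned above before creating them, it can help to model how priority-ordered rules are evaluated. The following Python sketch is a deliberately simplified model and not the Azure API: rules are checked from the lowest priority number upward, the first match decides, and traffic that matches no rule is denied. The `Rule` class, the `is_allowed` function, and the example addresses are hypothetical, and real network security groups also carry default rules, protocols, and port ranges that this sketch omits.

```python
import ipaddress
from dataclasses import dataclass


@dataclass(frozen=True)
class Rule:
    priority: int  # lower numbers are evaluated first
    source: str    # source address range in CIDR notation
    port: int      # destination port
    allow: bool    # True to allow the traffic, False to deny it


def is_allowed(rules, source_ip, port):
    """Return the decision of the first matching rule; deny when nothing matches."""
    address = ipaddress.ip_address(source_ip)
    for rule in sorted(rules, key=lambda r: r.priority):
        if rule.port == port and address in ipaddress.ip_network(rule.source):
            return rule.allow
    return False  # simplification: implicit deny


# Hypothetical rule set: SSH is allowed only from an administration subnet.
RULES = [
    Rule(priority=200, source="0.0.0.0/0", port=22, allow=False),
    Rule(priority=100, source="10.0.0.0/24", port=22, allow=True),
]

if __name__ == "__main__":
    print(is_allowed(RULES, "10.0.0.7", 22))
    print(is_allowed(RULES, "203.0.113.9", 22))
```

Listing the intended rules in this form makes it easier to spot a broad deny rule that would shadow a narrower allow rule, or the reverse, before anything is deployed.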
39+
40+
## Best practices for network in HPC lift and shift architecture
41+
42+
* **Have a good understanding of cluster sizes and services to be used:**
43+
- Different cluster sizes require different IP ranges, and proper planning helps avoid major changes in parts of the infrastructure. Also, some services may need exclusive subnets, and having clarity on those subnets is essential.
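
As a rough planning aid, the relationship between cluster size and IP range can be estimated with Python's standard `ipaddress` module. This is a minimal sketch under stated assumptions: one IP address per node, the five addresses Azure reserves in each subnet, a /29 minimum subnet size, and a headroom percentage for growth. The function names and the 20 percent default are hypothetical choices for illustration.

```python
import ipaddress

AZURE_RESERVED_IPS = 5     # addresses Azure reserves in every subnet
MIN_HOST_BITS = 3          # a /29 is the smallest subnet considered here


def smallest_subnet_prefix(node_count, headroom_percent=20):
    """Return the longest IPv4 prefix length that fits the nodes plus headroom."""
    extra = (node_count * headroom_percent + 99) // 100  # integer ceiling
    needed = node_count + extra + AZURE_RESERVED_IPS
    host_bits = max((needed - 1).bit_length(), MIN_HOST_BITS)
    return 32 - host_bits


def first_subnet(vnet_cidr, node_count, headroom_percent=20):
    """Carve the first subnet of the required size out of the virtual network range."""
    prefix = smallest_subnet_prefix(node_count, headroom_percent)
    return next(ipaddress.ip_network(vnet_cidr).subnets(new_prefix=prefix))


if __name__ == "__main__":
    # Hypothetical example: a 100-node partition inside a 10.0.0.0/16 virtual network.
    subnet = first_subnet("10.0.0.0/16", 100)
    usable = subnet.num_addresses - AZURE_RESERVED_IPS
    print(f"{subnet} provides {usable} usable addresses")
```

Services that need exclusive subnets can be sized the same way and carved from the remaining address space, which helps confirm up front that the virtual network range is large enough.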
44+
45+
## Example steps for setup and deployment
46+
47+
Networking is a vast topic in itself. In a production-level environment, it's good practice not to use public IP addresses, so a good starting point is to provision a VM without one and connect to it through Azure Bastion.
48+
49+
For instance:
50+
51+
1. **Provision a VM via portal with no public IP address:**
52+
- Follow the standard steps to provision a VM (that is, set up the resource group, network, VM image, disk, and so on).
53+
- During VM creation, create a virtual network if one isn't already available.
54+
- Make sure the VM doesn't have a public IP address
55+
56+
2. **Use Bastion:**
57+
- Once the VM is provisioned, go to the VM in the Azure portal.
58+
- Select the option "Bastion" from "Connect" section.
59+
- Select option "Deploy Bastion"
60+
- Once Bastion is provisioned, the VM can be accessed through it.
61+
62+
## Resources
63+
64+
- VPN Gateway documentation: [product website](/azure/vpn-gateway/)
65+
- Azure Bastion documentation: [product website](/azure/bastion/)
66+
- Network Security groups: [product website](/azure/virtual-network/network-security-groups-overview)
67+
- Azure DNS Private Resolver: [product website](/azure/dns/dns-private-resolver-overview)
