MicrosoftDocs
diff --git a/articles/high-performance-computing/TOC.yml b/articles/high-performance-computing/TOC.yml
Lines changed: 61 additions & 0 deletions
diff --git a/articles/high-performance-computing/lift-and-shift-overview.md b/articles/high-performance-computing/lift-and-shift-overview.md
Lines changed: 88 additions & 0 deletions
diff --git a/articles/high-performance-computing/lift-and-shift-production-level-overview.md b/articles/high-performance-computing/lift-and-shift-production-level-overview.md
Lines changed: 39 additions & 0 deletions
diff --git a/articles/high-performance-computing/lift-and-shift-proof-of-concept.md b/articles/high-performance-computing/lift-and-shift-proof-of-concept.md
Lines changed: 27 additions & 0 deletions
diff --git a/articles/high-performance-computing/lift-and-shift-step-1-networking.md b/articles/high-performance-computing/lift-and-shift-step-1-networking.md
Lines changed: 67 additions & 0 deletions
@@ -0,0 +1,61 @@
+- name: High-Performance Computing on-premises to cloud lift and shift
+  href: index.yml
+- name: Get started
+  expanded: true
+  items:
+  - name: Overview
+    href: lift-and-shift-overview.md
+- name: Migration guide
+  expanded: true
+  items:
+  - name: Proof-of-concept migration guide
+    href: lift-and-shift-proof-of-concept.md
+  - name: Production-level environment migration guide
+    expanded: true
+    items:
+    - name: Overview
+      href: lift-and-shift-production-level-overview.md
+    - name: Deployment step 1 - Basic infrastructure
+      items:
+      - name: Overview
+        href: lift-and-shift-step-1-overview.md
+      - name: Resource group
+        href: lift-and-shift-step-1-resource-group.md
+      - name: Network access
+        href: lift-and-shift-step-1-networking.md
+      - name: Storage
+        href: lift-and-shift-step-1-storage.md
+    - name: Deployment step 2 - Base services
+      items:
+      - name: Overview
+        href: lift-and-shift-step-2-overview.md
+      - name: Job scheduler
+        href: lift-and-shift-step-2-job-scheduler.md
+      - name: Resource orchestrator
+        href: lift-and-shift-step-2-resource-orchestrator.md
+      - name: Identity management
+        href: lift-and-shift-step-2-identity.md
+      - name: Accounting
+        href: lift-and-shift-step-2-accounting.md
+      - name: Monitoring
+        href: lift-and-shift-step-2-monitor.md
+    - name: Deployment step 3 - Storage
+      items:
+      - name: Overview
+        href: lift-and-shift-step-3-overview.md
+      - name: Storage
+        href: lift-and-shift-step-3-storage.md
+      - name: Data migration
+        href: lift-and-shift-step-3-data-migration.md
+    - name: Deployment step 4 - Compute nodes
+      items:
+      - name: Overview
+        href: lift-and-shift-step-4-overview.md
+      - name: VM images
+        href: lift-and-shift-step-4-vm-images.md
+    - name: Deployment step 5 - End-user entry point
+      items:
+      - name: Overview
+        href: lift-and-shift-step-5-overview.md
+      - name: End-user entry point
+        href: lift-and-shift-step-5-end-user-entry-point.md
@@ -0,0 +1,88 @@
+---
+title: "End-to-end high-performance computing (HPC) lift and shift architecture overview"
+description: Learn how to conduct a lift and shift migration of HPC infrastructure and workloads from an on-premises environment to the cloud.
+author: tomvcassidy
+ms.author: tomcassidy
+ms.date: 08/30/2024
+ms.topic: how-to
+ms.service: azure-virtual-machines
+ms.subservice: hpc
+---
+
+# End-to-end HPC lift and shift architecture overview
+
+"Lift and shift" in the context of High-Performance Computing (HPC) refers to migrating an on-premises environment and its workloads to the cloud with minimal modification. Ideally, applications, job schedulers, and their configurations remain mostly the same. Adjustments to storage and hardware are to be expected, because the resources available on cloud platforms differ from those on-premises. With the lift and shift approach, organizations can start benefiting from the cloud more quickly.
+
+The following figure represents a typical on-premises HPC cluster in a production environment, often delivered by the hardware manufacturer. Such an on-premises environment comprises a set of compute nodes, which may or may not work with virtual machine images and containers. These nodes execute workloads managed by a job scheduler, typically Slurm, PBS, or LSF. The workloads come from multiple users whose identities are handled by identity management. Usually there are home directories, scratch disks, and long-term storage. Some form of monitoring to check the performance of jobs and the health of compute nodes is also available. Users can access the environment via the command line, browsers, or some kind of remote visualization technology. The entire environment is hosted in a private network, so users need some mechanism to reach the computing facility, either a VPN or a portal.
+
+:::image type="content" source="media/on-premises-old-icons.png" alt-text="Diagram depicting existing on-premises environment architecture.":::
+
+As we see throughout this document, a cloud environment that follows the Infrastructure-as-a-Service model isn't conceptually so different. Some technologies need updating, and the migration from on-premises to the cloud requires some additional steps.
+
+This document therefore:
+
+- Goes through the options for the migration process.
+- Provides pointers to products and best practices for each component.
+- Provides recommendations to avoid pitfalls in the process.
+
+Before jumping into the architecture description, it helps to understand the different personas in this context, along with their needs and expectations.
+
+## Personas and user experience
+
+There are different people who need to access the HPC environment. Their activities and how they interact with the environment vary quite a bit.
+
+### End-user (engineer / scientist / researcher)
+
+This persona represents the subject matter expert (for example, a biologist, physicist, or engineer) who wants to run experiments (that is, submit jobs) and analyze results. End-users interact with system administrators to fine-tune the computing environment whenever needed. They may have some experience using CLI-based tools, but some of them may rely only on web portals or graphical user interfaces via VDI to submit their jobs and interact with the generated results.
+
+**New responsibilities in cloud HPC environment:**
+
+- End-users shouldn't have any new responsibilities, thanks to the work of both the HPC administrator and the cloud administrator. Depending on the on-premises environment, end-users may gain access to a larger capacity and variety of computing resources, which can make them more productive.
+
+### HPC administrator
+
+This persona represents the one who has HPC expertise and is responsible for deploying the initial computing infrastructure and adapting it according to business and end-user needs. This persona is also responsible for verifying the health of the system and performing troubleshooting. HPC administrators are comfortable accessing the architecture and its components via CLI, SDKs, and web portals. They're also the first point of contact when end-users face any challenge with the computing environment.
+
+**New responsibilities in cloud HPC environment:**
+
+- Managing cloud resources and services (for example, virtual machines, storage, networking) via cloud management platforms.
+- Implementing and managing clusters and resources via new resource orchestration tools (for example, CycleCloud).
+- Optimizing application deployment by understanding infrastructure details (that is, VM types, storage, and network options).
+- Optimizing resource utilization and costs by using cloud-specific features such as autoscaling and spot instances.
+
+### Cloud administrator
+
+This persona works with the HPC administrator to help deploy and maintain the computing infrastructure. This persona isn't necessarily an HPC expert, but a cloud expert with deep knowledge of the overall company IT infrastructure, including network configurations and policies, user access rights, and user devices. Depending on the case, the HPC administrator and the cloud administrator may be the same person.
+
+**New responsibilities in cloud HPC environment:**
+
+- Collaborating with HPC administrators to ensure seamless integration of HPC workloads with cloud infrastructure.
+- Monitoring and managing cloud infrastructure performance, security, and compliance.
+- Helping with the configuration of cloud-based networking and storage solutions to support HPC workloads.
+
+### Business manager / owner
+
+This persona represents the one who is responsible for the business, which includes managing budgets and projects to meet organizational goals. For this persona, the accounting component of the architecture is relevant to understand the costs of each project. This persona works with HPC administrators and end-users to understand platform needs, including storage, network, and computing resources. They also plan for future workloads.
+
+**New responsibilities in cloud HPC environment:**
+
+- Analyzing detailed cost reports and usage metrics provided by cloud service providers to manage budgets and forecast expenses.
+- Making strategic decisions based on cloud resource usage and cost optimization opportunities.
+- Planning and approving cloud infrastructure investments to support future HPC workloads and business objectives.
+
+## Lift and shift architecture overview
+
+:::image type="content" source="media/visio-lift-shift-arch-background.png" alt-text="Diagram depicting target HPC Cloud architecture.":::
+
+A production HPC environment in the cloud comprises several components. Some core components are needed to stand up an environment, such as a job scheduler, a resource provider, an entry point for users to access the environment, and compute and storage devices. As the environment gets into production, other components start to play a critical role, including monitoring, observability, health checks, security, identity management, accounting, and different storage options.
+
+Depending on the installation, there can also be extensions in place, such as sign-in nodes, data movers, containers, and license managers.
+
+This production-level environment may have various components to be set up. Therefore, environment deployers and managers become key to automating its initial deployment and upgrading it along the way, respectively. More advanced installations can also have environment templates (or specifications) with software versions and configurations that are optimized and properly tested. Once the environment is in production with all the required components in place, adjustments may be required over time to meet user demands, including changes in VM types or storage options and capabilities.
+
+## Instantiating the lift and shift HPC cloud architecture
+
+Here we provide more details for each architecture component, including pointers to official Azure products, tech blogs with some best practices, git repositories, and links to non-product solutions.
+
+**Quick start.** For a quick start solution to create an HPC environment in the cloud with basic building blocks, we recommend using [Azure CycleCloud Slurm workspace](https://techcommunity.microsoft.com/t5/azure-high-performance-computing/introducing-azure-cyclecloud-slurm-workspace-preview/ba-p/4158433).
@@ -0,0 +1,39 @@
+---
+title: "Production-level environment migration guide overview"
+description: Learn about what a production-level environment migration entails.
+author: tomvcassidy
+ms.author: tomcassidy
+ms.date: 08/30/2024
+ms.topic: how-to
+ms.service: azure-virtual-machines
+ms.subservice: hpc
+---
+
+# Production-level environment migration guide overview
+
+When you move an HPC infrastructure from an on-premises environment to the cloud, there are various aspects to take into account. This document provides guidance on how to create such an HPC environment in the cloud. We recommend a two-phase approach: first a proof-of-concept, and then a production-level environment. Once the production environment is up and running, only certain components should be modified over time, including changes to VM types and storage capabilities to best meet the varying requirements of users, projects, and business.
+
+In this article and the following articles, we guide you through a production-level environment migration.
+
+## Prerequisites
+
+You need an Azure subscription to provision cloud resources.
+
+## Migrating from on-premises to the cloud: production level
+
+After the proof-of-concept phase, planning is required to get ready for creating a production-level HPC environment. This new environment can represent part of the on-premises infrastructure (for example, one HPC cluster from a group of clusters, or a queue/partition from an existing cluster), or the entire computing capability.
+
+Due to component dependencies, the deployment of this HPC cloud environment is based on a sequence of deployments, which consists of:
+
+1. Basic infrastructure, which includes creation of a resource group, network access, and network security rules.
+1. Base services, which include identity management, the job scheduler, and the resource provisioner, along with their respective configurations.
+1. Storage.
+1. Compute nodes' specifications.
+1. End-user entry point.
+
+In the following articles, we cover each deployment step and the components involved. In the descriptions of the components, we highlight their relevant dependencies in more detail. The component deployment steps can be executed in several ways. We provide a few tips to help get started with deploying the components via the Azure portal. At a production level, however, we recommend creating an environment deployer that uses infrastructure-as-code (for example, via Bicep, Terraform, or Azure CLI), so that the environment can be created in an automated and repeatable fashion.
+
+For each step, certain topics need to be assessed before starting the migration process.
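
The dependency-driven deployment sequence described in this article can be sketched in a few lines. This is a minimal illustration, not part of any Azure tooling: the step names mirror the numbered list above, while the `DEPENDS_ON` map and the `deployment_order` helper are hypothetical, and the assumption that each step depends only on the previous one is ours. An environment deployer would replace the printed names with calls to its infrastructure-as-code templates.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Hypothetical dependency map: each step lists the steps it depends on.
# The linear chain is an assumption made for illustration only.
DEPENDS_ON = {
    "basic_infrastructure": set(),
    "base_services": {"basic_infrastructure"},
    "storage": {"base_services"},
    "compute_nodes": {"storage"},
    "end_user_entry_point": {"compute_nodes"},
}


def deployment_order(depends_on):
    """Return the steps in an order that respects every dependency.

    Raises graphlib.CycleError if the dependencies are circular.
    """
    return list(TopologicalSorter(depends_on).static_order())


if __name__ == "__main__":
    for number, step in enumerate(deployment_order(DEPENDS_ON), start=1):
        print(f"{number}. {step}")
```

Expressing the order as data rather than as a hard-coded script makes it easier to add a component later (for example, a monitoring service) without reworking the whole deployer.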
@@ -0,0 +1,27 @@
+---
+title: "Proof-of-concept migration overview"
+description: Learn about what a proof-of-concept migration entails and follow the guide through one.
+author: tomvcassidy
+ms.author: tomcassidy
+ms.date: 08/30/2024
+ms.topic: how-to
+ms.service: azure-virtual-machines
+ms.subservice: hpc
+---
+
+# Proof-of-concept migration overview
+
+When you move an HPC infrastructure from an on-premises environment to the cloud, there are various aspects to take into account. This document provides guidance on how to create such an HPC environment in the cloud. We recommend a two-phase approach: first a proof-of-concept, and then a production-level environment. Once the production environment is up and running, only certain components should be modified over time, including changes to VM types and storage capabilities to best meet the varying requirements of users, projects, and business.
+
+In this article, we guide you through a proof-of-concept migration.
+
+## Prerequisites
+
+You need an Azure subscription to provision cloud resources.
+
+## Migrating from on-premises to the cloud: proof-of-concept (PoC)
+
+We recommend starting with a proof-of-concept (PoC) by provisioning a simple cluster in Azure, using Azure CycleCloud as a resource orchestrator, with one well-known scheduler, such as Slurm, PBS, or LSF. This approach lets you start understanding Azure technology, assess the functionality of user applications, and investigate performance/cost trade-offs in comparison to the on-premises environment.
+
+If you're flexible about the job scheduler, or already use Slurm, we recommend using Azure CycleCloud Slurm workspace, an offering that helps create a CycleCloud-based cluster with the Slurm scheduler and a basic setup for networking and the available storage options. Some details on this process are available in the Resource Orchestrator section of this guide.
@@ -0,0 +1,67 @@
+---
+title: "Deployment step 1: basic infrastructure - network access component"
+description: Learn about the configuration of network access during migration deployment step one.
+author: tomvcassidy
+ms.author: tomcassidy
+ms.date: 08/30/2024
+ms.topic: how-to
+ms.service: azure-virtual-machines
+ms.subservice: hpc
+---
+
+# Deployment step 1: basic infrastructure - network access component
+
+The network access component is the mechanism that lets users reach the cloud environment securely. It's common practice in production environments to give resources private IP addresses and to define rules for how those resources can be accessed.
+
+This component should:
+
+- Allow users to access the private network hosting the high-performance computing (HPC) environment.
+- Refine network security rules, such as the source and destination ports and IP addresses that can access resources.
+
+## Define network needs
+
+* **Estimate cluster size for proper network setup:**
+   - Each subnet has its own IP address range, which limits how many nodes it can hold, so size each subnet for the largest cluster you expect to run in it.
+
+* **Security rules:**
+   - Understand how users access the HPC environment and which security rules need to be in place (for example, which ports and IP addresses are open or closed).
+
+## Tools and services
+
+* **Private network access:**
+   - In Azure, the two major components that help users access a private network are Azure Bastion and Azure VPN Gateway.
+
+* **Network rules:**
+   - Another key component for network setup is Azure network security groups (NSGs), which filter network traffic between Azure resources in an Azure virtual network.
+
+* **DNS:**
+   - Azure DNS Private Resolver lets you query Azure DNS private zones from an on-premises environment, and vice versa, without deploying VM-based DNS servers.
+
+## Best practices for network in HPC lift and shift architecture
+
+* **Have a good understanding of cluster sizes and the services to be used:**
+   - Different cluster sizes require different IP ranges, and proper planning helps avoid major changes in parts of the infrastructure. Also, some services may need exclusive subnets, and having clarity on those subnets is essential.
+
+## Example steps for setup and deployment
+
+Networking is a vast topic in itself. In a production-level environment, it's good practice not to use public IP addresses. One way to start is to test this setup by provisioning a VM without a public IP address and connecting to it through Azure Bastion.
+
+For instance:
+
+1. **Provision a VM via the Azure portal with no public IP address:**
+   - Follow the standard steps to provision a VM (that is, set up the resource group, network, VM image, disk, and so on).
+   - During VM creation, create a virtual network if one isn't already available.
+   - Make sure the VM doesn't have a public IP address.
+
+2. **Use Azure Bastion:**
+   - Once the VM is provisioned, go to the VM in the Azure portal.
+   - Select the "Bastion" option from the "Connect" section.
+   - Select the "Deploy Bastion" option.
+   - Once Bastion is provisioned, the VM can be accessed through it.
+
+## Resources
+
+- VPN Gateway documentation: [product website](/azure/vpn-gateway/)
+- Azure Bastion documentation: [product website](/azure/bastion/)
+- Network security groups: [product website](/azure/virtual-network/network-security-groups-overview)
+- Azure DNS Private Resolver: [product website](/azure/dns/dns-private-resolver-overview)
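
The advice in "Define network needs" to estimate cluster size before laying out subnets can be sketched as a small calculation. This is a minimal illustration, not an official sizing tool: the helper names `smallest_prefix_for` and `plan_subnet` and the `10.0.0.0/16` address space are hypothetical. It relies on Azure reserving five IP addresses in every subnet, and it assumes `/29` is the smallest IPv4 subnet Azure accepts; verify both against the current virtual network documentation before relying on the result.

```python
import ipaddress

# Azure reserves five addresses in every subnet (the first four and the last).
AZURE_RESERVED_PER_SUBNET = 5
# Assumption: /29 is the smallest IPv4 subnet Azure accepts.
SMALLEST_SUBNET_PREFIX = 29


def smallest_prefix_for(node_count):
    """Return the longest IPv4 prefix whose subnet can hold node_count nodes."""
    if node_count < 1:
        raise ValueError("node_count must be at least 1")
    for prefix in range(SMALLEST_SUBNET_PREFIX, 1, -1):
        usable = 2 ** (32 - prefix) - AZURE_RESERVED_PER_SUBNET
        if usable >= node_count:
            return prefix
    raise ValueError("node_count is too large for a single subnet")


def plan_subnet(vnet_cidr, node_count):
    """Return the first subnet of the virtual network that fits node_count nodes."""
    vnet = ipaddress.ip_network(vnet_cidr)
    prefix = smallest_prefix_for(node_count)
    if prefix < vnet.prefixlen:
        raise ValueError("the virtual network is too small for this cluster")
    return next(vnet.subnets(new_prefix=prefix))


if __name__ == "__main__":
    # Hypothetical example: a 500-node compute partition in a /16 network.
    print(plan_subnet("10.0.0.0/16", 500))
```

A calculation like this covers only the compute subnet. Services that need exclusive subnets, as noted in the best practices above, have to be planned separately within the same address space.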