Updating FAQ and concepts for azure nexus operator

sbtwist · sbtwist · commit 5d2a1f1dab3a · 2024-06-07T14:53:12.000-05:00
diff --git a/articles/operator-nexus/TOC.yml b/articles/operator-nexus/TOC.yml
@@ -13,6 +13,8 @@
       href: concepts-compute.md
     - name: Storage overview
       href: concepts-storage.md
+    - name: Cluster Deployment
+      href: concepts-cluster-deployment-overview.md
     - name: Network Fabric
       expanded: false
       items:
diff --git a/articles/operator-nexus/azure-operator-nexus-faq.md b/articles/operator-nexus/azure-operator-nexus-faq.md
@@ -10,7 +10,7 @@ ms.custom: template-reference
 ---
 
 # Azure Operator Nexus frequently asked questions (FAQ)
-The following sections covers some of the frequently asked questions for Azure Operator Nexus:
+The following sections cover some of the frequently asked questions for Azure Operator Nexus:
 
 ## Platform - General
 
@@ -24,13 +24,13 @@ You can interact with Operator Nexus like any other Azure services using AZ CLI,
 Yes, there are some resources that customer needs to create in the respective region under their Azure subscriptions. Some of these include creation of a pair of Network Fabric Controller and Cluster Manager resource, Log Analytics Workspace, a storage account. For more details, please refer to [Azure Operator Nexus documentation](howto-azure-operator-nexus-prerequisites.md).
 
 ### Does Azure Operator Nexus rely on connectivity with Azure? What happens when there's a disconnection?
-Yes, you need an ExpressRoute connection for its connectivity back to Azure and for Orchestration, Management and Operation purposes. During disconnection, the workloads will continue to run as is but you may lose the capability to orchestrate any new resources.
+Yes, Operator Nexus needs an ExpressRoute connection for connectivity back to Azure and for orchestration, management, and operation purposes. During a disconnection, the workloads continue to run as is, but you might lose the capability to orchestrate any new resources.
 
 ### Do I have to use the BOM (Bill of Material) specified by Microsoft?
-Yes, to ensure carrier-grade performance and high degrees of automation, you'll need to use equipment specified as per one of our BOMs.
+Yes, to ensure carrier-grade performance and high degrees of automation, you need to use equipment specified as per one of our BOMs.
 
 ### How should I plan for a resilient Operator Nexus instance? How does Operator Nexus handle disaster recovery?
-Customers should design their services with Intra-rack redundancy, Inter-rack redundancy and globally load balancing across multiple instances. Also, for high availability, plan to spread your instances across multiple Azure regions.
+Customers should design their services with intra-rack redundancy, inter-rack redundancy, and global load balancing across multiple instances. Also, for high availability, plan to spread your instances across multiple Azure regions.
 
 ### How do updates work to on-premises and to Azure components?
 Upgrades to Operator Nexus are made in two phases - Management bundle upgrades and Runtime bundle upgrades. Management bundle upgrades deals with the upgrades of Controllers in Azure, Cluster Managers in customer subscription and on-premises instances. In on-premises instances, it includes the Kubernetes controllers responsible for maintaining the state of infra resources. 
@@ -57,10 +57,22 @@ Yes, all you need is ExpressRoute connectivity to an Azure region. ExpressRoute
 Currently, we don't support resource moves. If you need to move resources, you can consider deleting the existing controllers and using the ARM template to create another one in another location.
 
 ### How many instances can be associated to a cluster manager/fabric controller pair? 
-The number of Azure Operator Nexus instances, a single pair of Network Fabric Controller and Cluster Manager can manage depends on multiple factors. It can be influenced by factors like size of Operator Nexus instances, ExpressRoute circuit bandwidth, number and frequency of optional metrics collection, number of workloads running in Instance, destination for workload telemetry data collection and other factors. 
+The number of Azure Operator Nexus instances that a single pair of Network Fabric Controller and Cluster Manager can manage depends on multiple factors. These factors include the size of the Operator Nexus instances, ExpressRoute circuit bandwidth, the number and frequency of optional metrics collection, the number of workloads running in the instance, the destination for workload telemetry data collection, and other factors.
           
 For more information, see [limits & quotas](reference-limits-and-quotas.md).
 
+### Is it viable to redeploy a cluster that is currently running? If so, what safeguards are in place to prevent accidental deployments?
+A running cluster can't be redeployed; instead, the user must delete it and then redeploy it. If the cluster is already running, the deployment action prevents any new actions on the cluster, thereby mitigating the risk of accidental deployments.
+
+### Can Azure Operator Nexus create clusters from specific racks or subsets of instances within a rack?
+Azure Operator Nexus doesn't offer the capability to select specific racks or subsets of instances within a rack as part of cluster deployment.
+
+### In the event of a network disruption between the cluster manager and on-premises servers during deployment, can the process be resumed once connectivity is restored?
+It depends on the cluster's status. If the cluster has failed, it must be deleted and redeployed. However, if the cluster is still deploying, it undergoes a reconcile process that allows the deployment to continue.
+
+### Does cluster deployment wait for all the compute nodes to be provisioned?
+The deployment process continuously retries the provisioning of compute nodes until all nodes are successfully provisioned. When the number of provisioned compute nodes reaches the defined threshold, the cluster status changes to Running. However, the remaining nodes continue undergoing the provisioning process until they too are successfully provisioned.
+
 ## Compute
 
 ### Does Azure Operator Nexus support creation of Virtual Machines (VMs)?
diff --git a/articles/operator-nexus/concepts-cluster-deployment-overview.md b/articles/operator-nexus/concepts-cluster-deployment-overview.md
@@ -0,0 +1,47 @@
+---
+title: Azure Operator Nexus cluster deployment overview
+description: Get an overview of cluster deployment for Azure Operator Nexus.
+author: sbatchu
+ms.author: sbatchu
+ms.service: azure-operator-nexus
+ms.topic: conceptual
+ms.date: 06/07/2024
+ms.custom: template-concept
+---
+
+# Azure Operator Nexus cluster
+Azure Operator Nexus is built on basic constructs like compute servers, storage appliances, and network fabric devices. An Azure Operator Nexus cluster represents an on-premises deployment of the platform. The lifecycle of platform-specific resources depends on the cluster state.
+
+## Cluster Deployment Overview
+
+During deployment, a cluster goes through several lifecycle phases, each with a specific role in reaching the target state.
+
+### Hardware Validation Phase
+
+Hardware validation starts during the cluster deployment process. It assesses the state of the hardware components of the machines provided through the cluster's rack definition. Based on the results of these checks and on any machines that the user chose to skip, the process determines whether enough nodes passed validation and are available to meet the thresholds required for deployment to continue.
+
+Hardware validation results for a given server are written to the Log Analytics Workspace (LAW) that is provided as part of cluster creation. The results include the following categories:
+- system_info
+- drive_info
+- network_info
+- health_info
+- boot_info
+
+For instructions on how to check hardware validation results, see [Troubleshoot Hardware validation](troubleshoot-hardware-validation-failure.md).
+
+### Bootstrap Phase
+
+After hardware validation succeeds, a bootstrap image is generated on the cluster manager for the cluster deploy action. The ISO URL of this image is used to bootstrap the ephemeral node, which deploys the target cluster components: the Kubernetes control plane (KCP), the Nexus Management plane (NMP), and the storage appliance. These stages run as part of the ephemeral bootstrap workflow, and their states are reflected in the cluster status.
+
+The ephemeral bootstrap node sequentially provisions each KCP node, and if a KCP node fails to provision, the cluster deployment action fails, marking the cluster status as failed. The Bootstrap operator manages the provisioning process for bare-metal nodes using the PXE boot approach.
+
+After successful provisioning of KCP nodes, the deployment action proceeds to provision NMP nodes in parallel. If an NMP node fails to provision, the cluster deployment action fails, resulting in the cluster status being marked as failed.
+
+Upon successful provisioning of NMP nodes, a storage appliance is created before the deployment action proceeds with provisioning the compute nodes. Compute nodes are provisioned in parallel, and once the defined compute node threshold is met, the cluster status transitions from Deploying to Running. However, the remaining nodes continue undergoing the provisioning process until they too are successfully provisioned.
+
+
+## Cluster operations
+
+- **List cluster**: List cluster information in the provided resource group or subscription.
+- **Show cluster**: Get properties of the provided cluster.
+- **Update cluster**: Update properties or tags of the provided cluster.
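
Reviewer note on the Bootstrap Phase text added in `concepts-cluster-deployment-overview.md`: the ordering and failure rules described there can be summarized in a small model. The Python below is only an illustrative sketch, not platform code. The names `deploy_cluster`, `DeploymentResult`, and `provision` are hypothetical, the threshold is assumed to be a fraction of compute nodes (the article doesn't say how it's expressed), and the parallel NMP and compute provisioning is modeled as a simple loop.

```python
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass
class DeploymentResult:
    status: str        # "Deploying", "Running", or "Failed"
    failed_phase: str  # "" unless status is "Failed"
    compute_ready: int


def deploy_cluster(
    kcp_nodes: Sequence[str],
    nmp_nodes: Sequence[str],
    compute_nodes: Sequence[str],
    provision: Callable[[str], bool],  # hypothetical: True if the node provisions
    compute_threshold: float,          # assumption: fraction of compute nodes
) -> DeploymentResult:
    # KCP nodes are provisioned one at a time; any failure fails the cluster.
    for node in kcp_nodes:
        if not provision(node):
            return DeploymentResult("Failed", "KCP", 0)

    # NMP nodes are provisioned in parallel on the platform; this model only
    # captures the rule that any NMP failure fails the cluster.
    nmp_results = [provision(node) for node in nmp_nodes]
    if not all(nmp_results):
        return DeploymentResult("Failed", "NMP", 0)

    # Storage appliance creation happens here and is omitted from the model.

    # Compute nodes: the cluster moves to Running once the threshold is met.
    # Below the threshold it stays Deploying while provisioning is retried.
    ready = sum(1 for node in compute_nodes if provision(node))
    required = compute_threshold * len(compute_nodes)
    status = "Running" if ready >= required else "Deploying"
    return DeploymentResult(status, "", ready)


# Example: one of four compute nodes fails, with an assumed 75% threshold.
broken = {"compute-3"}
result = deploy_cluster(
    ["kcp-1", "kcp-2", "kcp-3"],
    ["nmp-1", "nmp-2"],
    ["compute-1", "compute-2", "compute-3", "compute-4"],
    lambda node: node not in broken,
    0.75,
)
print(result.status, result.compute_ready)
```

The model makes the asymmetry in the text explicit: a single KCP or NMP failure is fatal, while compute node failures only delay or prevent the transition to Running.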