Commit 3bc4f59

author
manasareddybethi
committed
Update with clear info
1 parent c6d0921 commit 3bc4f59

1 file changed

+45
-32
lines changed

articles/operator-nexus/howto-configure-cluster.md

Lines changed: 45 additions & 32 deletions
@@ -40,7 +40,20 @@ Each Operator Nexus on-premises instance has a one-to-one association
4040
with a Network Fabric.
4141

4242
> [!IMPORTANT]
43-
> After creating the cluster, avoid applying patches to it until the `az networkcloud cluster show` CLI command displays redfish bmcConnectionStrings for the corresponding cluster. Patching the cluster before these strings are generated can result in version overrides and the loss of bmcConnectionStrings from the cluster's Custom Resource.
43+
> There's a known issue where updating a cluster immediately after creating it can cause cluster deployment failures. The problem happens when the resource is updated before the bmcConnectionString fields are populated in the `cluster.spec.computeRackDefinitions.bareMetalMachineConfigurationData` section. The bmcConnectionStrings are normally set within a few minutes of creating the Cluster.
44+
>
45+
> To avoid this issue, ensure that the bmcConnectionStrings contain nonempty values before updating the Cluster resource via the Azure portal or the `az networkcloud cluster update` command.
46+
>
47+
> To confirm the status, open the JSON properties for the Cluster (Operator Nexus) resource in the Azure portal, or run an `az networkcloud cluster show` CLI command as shown in the following example. If the bmcConnectionString values show nonempty `redfish+https..` values, then it's safe to update the cluster. This issue will be fixed in a future release.
48+
>
49+
> Sample bmcConnectionString output for `az networkcloud cluster show -n cluster01 -g cluster01resourceGroup --query 'computeRackDefinitions[].bareMetalMachineConfigurationData[].bmcConnectionString' -o json` is as follows:
50+
>
51+
> ```
52+
> ["redfish+https://10.9.3.20/redfish/v1/Systems/System.Embedded.1",
53+
> "redfish+https://10.9.3.19/redfish/v1/Systems/System.Embedded.1",
54+
> "redfish+https://10.9.3.18/redfish/v1/Systems/System.Embedded.1",
55+
> "redfish+https://10.9.3.17/redfish/v1/Systems/System.Embedded.1"]
56+
> ```
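
As a rough sketch of the check described in the note above, the following shell helper decides whether every bmcConnectionString looks populated. The function name `all_bmc_strings_populated` is an illustrative assumption and not part of the Azure CLI. It reads one value per line (for example, from `-o tsv` output) and assumes that a populated value starts with `redfish+https://`, as in the sample output.

```shell
# Hypothetical helper (not an Azure CLI feature): succeeds only if every
# input line is a nonempty redfish+https:// connection string.
# Optional first argument: the number of machines you expect to see.
all_bmc_strings_populated() {
  expected="${1:-}"
  count=0
  # The "|| [ -n "$line" ]" part also handles a final line without a newline.
  while IFS= read -r line || [ -n "$line" ]; do
    case "$line" in
      redfish+https://?*) count=$((count + 1)) ;;
      *) return 1 ;;  # empty or unexpected value: not safe to update yet
    esac
  done
  if [ -n "$expected" ]; then
    [ "$count" -eq "$expected" ]
  else
    [ "$count" -gt 0 ]
  fi
}

# Example usage (placeholders as used elsewhere in this article):
# az networkcloud cluster show -n "<clusterName>" -g "<resourceGroup>" \
#   --query 'computeRackDefinitions[].bareMetalMachineConfigurationData[].bmcConnectionString' \
#   -o tsv | all_bmc_strings_populated && echo "Safe to update the cluster"
```

A JMESPath projection can omit machines whose field is unset, so passing the expected machine count as the first argument is likely safer than relying on the default check.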
4457
4558
### Create the Cluster using Azure CLI:
4659
@@ -114,11 +127,11 @@ az networkcloud cluster create --name "$CLUSTER_NAME" --location "$LOCATION" \
114127

115128
## Cluster Identity
116129

117-
Starting with the 2024-07-01 API version, a customer can assign managed identity to a Cluster. Both System-assigned and User-Assigned managed identities are supported.
130+
From the 2024-07-01 API version, a customer can assign a managed identity to a Cluster. Both system-assigned and user-assigned managed identities are supported.
118131

119132
Once added, the Identity can only be removed via the API call at this time.
120133

121-
See [Azure Operator Nexus Cluster Support for Managed Identities and User Provided Resources](./howto-cluster-managed-identity-user-provided-resources.md) for more information on managed identities for Operator Nexus Clusters.
134+
For more information on managed identities for Operator Nexus Clusters, see [Azure Operator Nexus Cluster Support for Managed Identities and User Provided Resources](./howto-cluster-managed-identity-user-provided-resources.md).
122135

123136
### Create the Cluster using Azure Resource Manager template editor
124137

@@ -131,20 +144,20 @@ You can find examples for an 8-Rack 2M16C SKU cluster using these two files:
131144
[cluster.parameters.jsonc](./cluster-parameters-jsonc-example.md)
132145

133146
> [!NOTE]
134-
> To get the correct formatting, copy the raw code file. The values within the cluster.parameters.jsonc file are customer specific and may not be a complete list. Update the value fields for your specific environment.
147+
> To get the correct formatting, copy the raw code file. The values within the cluster.parameters.jsonc file are customer specific and might not be a complete list. Update the value fields for your specific environment.
135148
136149
1. Navigate to [Azure portal](https://portal.azure.com/) in a web browser and sign in.
137150
1. Search for 'Deploy a custom template' in the Azure portal search bar, and then select it from the available services.
138-
1. Click on Build your own template in the editor.
139-
1. Click on Load file. Locate your cluster.jsonc template file and upload it.
140-
1. Click Save.
141-
1. Click Edit parameters.
142-
1. Click Load file. Locate your cluster.parameters.jsonc parameters file and upload it.
143-
1. Click Save.
151+
1. Select Build your own template in the editor.
152+
1. Select Load file. Locate your cluster.jsonc template file and upload it.
153+
1. Select Save.
154+
1. Select Edit parameters.
155+
1. Select Load file. Locate your cluster.parameters.jsonc parameters file and upload it.
156+
1. Select Save.
144157
1. Select the correct Subscription.
145158
1. Search for the Resource group to see if it already exists. If not, create a new Resource group.
146159
1. Make sure all Instance Details are correct.
147-
1. Click Review + create.
160+
1. Select Review + create.
148161

149162
### Cluster validation
150163

@@ -172,11 +185,11 @@ Cluster create Logs can be viewed in the following locations:
172185

173186
### Set deployment thresholds
174187

175-
There are two types of deployment thresholds that can be set on a cluster prior to cluster deployment. They are `compute-deployment-threshold` and `update-strategy`.
188+
There are two types of deployment thresholds that can be set on a cluster before cluster deployment: `compute-deployment-threshold` and `update-strategy`.
176189

177190
**--compute-deployment-threshold - The validation threshold indicating the allowable failures of compute nodes during environment hardware validation.**
178191

179-
If `compute-deployment-threshold` is not set, the defaults are as follows:
192+
If `compute-deployment-threshold` isn't set, the defaults are as follows:
180193

181194
```
182195
"strategyType": "Rack",
@@ -185,9 +198,9 @@ If `compute-deployment-threshold` is not set, the defaults are as follows:
185198
"waitTimeMinutes": 1
186199
```
187200

188-
If the customer requests a `compute-deployment-threshold` that it is different from the default of 80%, you can run the following cluster update command.
201+
If the customer requests a `compute-deployment-threshold` that is different from the default of 80%, you can run the following cluster update command.
189202

190-
The example below is for a customer requesting type "PercentSuccess" with a success rate of 97%.
203+
This example is for a customer requesting type "PercentSuccess" with a success rate of 97%.
191204

192205
```azurecli
193206
az networkcloud cluster update --name "<clusterName>" \
@@ -209,11 +222,11 @@ az networkcloud cluster show --resource-group "<resourceGroup>" --name "<cluster
209222
"value": 97
210223
```
211224

212-
In this example, if less than 97% of the compute nodes being deployed pass hardware validation, the cluster deployment will fail. **NOTE: All kubernetes control plane (KCP) and nexus management plane (NMP) must pass hardware validation.** If 97% or more of the compute nodes being deployed pass hardware validation, the cluster deployment will continue to the bootstrap provisioning phase. During compute bootstrap provisioning, the `update-strategy` (below) is used for compute nodes.
225+
In this example, if fewer than 97% of the compute nodes being deployed pass hardware validation, the cluster deployment fails. **NOTE: All Kubernetes control plane (KCP) and Nexus management plane (NMP) nodes must pass hardware validation.** If 97% or more of the compute nodes being deployed pass hardware validation, the cluster deployment continues to the bootstrap provisioning phase. During compute bootstrap provisioning, the `update-strategy` is used for compute nodes.
213226

214227
**--update-strategy - The strategy for updating the cluster indicating the allowable compute node failures during bootstrap provisioning.**
215228

216-
If the customer requests an `update-strategy` threshold that it is different from the default of 80%, you can run the following cluster update command.
229+
If the customer requests an `update-strategy` threshold that is different from the default of 80%, you can run the following cluster update command.
217230

218231
```azurecli
219232
az networkcloud cluster update --name "<clusterName>" \
@@ -225,9 +238,9 @@ threshold-value="<thresholdValue>" wait-time-minutes=<waitTimeBetweenRacks> /
225238

226239
The strategy-type can be "Rack" (Rack by Rack) OR "PauseAfterRack" (Wait for customer response to continue).
227240

228-
The threshold-type can be "PercentSuccess" OR "CountSuccess".
241+
The threshold-type can be "PercentSuccess" OR "CountSuccess".
229242

230-
If updateStrategy is not set, the defaults are as follows:
243+
If updateStrategy isn't set, the defaults are as follows:
231244

232245
```
233246
"strategyType": "Rack",
@@ -236,7 +249,7 @@ If updateStrategy is not set, the defaults are as follows:
236249
"waitTimeMinutes": 1
237250
```
238251

239-
The example below is for a customer using Rack by Rack strategy with a Percent Success of 60% and a 1 minute pause.
252+
This example is for a customer using Rack by Rack strategy with a Percent Success of 60% and a 1-minute pause.
240253

241254
```azurecli
242255
az networkcloud cluster update --name "<clusterName>" \
@@ -259,9 +272,9 @@ az networkcloud cluster show --resource-group "<resourceGroup>" /
259272
"waitTimeMinutes": 1
260273
```
261274

262-
In this example, if less than 60% of the compute nodes being provisioned in a rack fail to provision (on a rack by rack basis), the cluster deployment will fail. If 60% or more of the compute nodes are successfully provisioned, cluster deployment moves on to the next rack of compute nodes.
275+
In this example, if fewer than 60% of the compute nodes being provisioned in a rack are successfully provisioned (on a rack-by-rack basis), the cluster deployment fails. If 60% or more of the compute nodes are successfully provisioned, cluster deployment moves on to the next rack of compute nodes.
263276

264-
The example below is for a customer using Rack by Rack strategy with a threshold type CountSuccess of 10 nodes per rack and a 1 minute pause.
277+
This example is for a customer using Rack by Rack strategy with a threshold type CountSuccess of 10 nodes per rack and a 1-minute pause.
265278

266279
```azurecli
267280
az networkcloud cluster update --name "<clusterName>" \
@@ -284,10 +297,10 @@ az networkcloud cluster show --resource-group "<resourceGroup>" /
284297
"waitTimeMinutes": 1
285298
```
286299

287-
In this example, if less than 10 compute nodes being provisioned in a rack fail to provision (on a rack by rack basis), the cluster deployment will fail. If 10 or more of the compute nodes are successfully provisioned, cluster deployment moves on to the next rack of compute nodes.
300+
In this example, if fewer than 10 of the compute nodes being provisioned in a rack are successfully provisioned (on a rack-by-rack basis), the cluster deployment fails. If 10 or more of the compute nodes are successfully provisioned, cluster deployment moves on to the next rack of compute nodes.
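
The pass/fail rule in these examples can be sketched as a small shell function. This is only an illustration of the arithmetic described in the text, not the service's actual implementation. The function name `threshold_met` and its argument order are assumptions made for this sketch.

```shell
# Illustrative only: mirrors the threshold rule described in this article.
# Usage: threshold_met <PercentSuccess|CountSuccess> <thresholdValue> <succeeded> <total>
threshold_met() {
  threshold_type="$1"
  value="$2"
  succeeded="$3"
  total="$4"
  case "$threshold_type" in
    PercentSuccess)
      # succeeded/total >= value/100, kept in integer arithmetic
      [ $((succeeded * 100)) -ge $((value * total)) ]
      ;;
    CountSuccess)
      # At least <value> nodes must succeed
      [ "$succeeded" -ge "$value" ]
      ;;
    *)
      echo "unknown threshold type: $threshold_type" >&2
      return 2
      ;;
  esac
}
```

For example, `threshold_met PercentSuccess 60 9 16` fails because 9 of 16 nodes is about 56%, while `threshold_met CountSuccess 10 10 16` succeeds because exactly 10 nodes meet the count.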
288301

289302
> [!NOTE]
290-
> Deployment thresholds cannot be changed after cluster deployment has started.
303+
> Deployment thresholds can't be changed after cluster deployment starts.
291304
292305
## Deploy Cluster
293306

@@ -315,7 +328,7 @@ az networkcloud cluster deploy \
315328

316329
> [!TIP]
317330
> To check the status of the `az networkcloud cluster deploy` command, it can be executed using the `--debug` flag.
318-
> This will allow you to obtain the `Azure-AsyncOperation` or `Location` header used to query the `operationStatuses` resource.
331+
> The debug output lets you obtain the `Azure-AsyncOperation` or `Location` header used to query the `operationStatuses` resource.
319332
> See the section [Cluster Deploy Failed](#cluster-deploy-failed) for more detailed steps.
320333
> Optionally, the command can run asynchronously using the `--no-wait` flag.
321334
@@ -328,12 +341,12 @@ and any user skipped machines, a determination is done on whether sufficient nod
328341
passed and/or are available to meet the thresholds necessary for deployment to continue.
329342

330343
> [!IMPORTANT]
331-
> The hardware validation process will write the results to the specified `analyticsWorkspaceId` at Cluster Creation.
344+
> The hardware validation process writes the results to the `analyticsWorkspaceId` specified at Cluster creation.
332345
> Additionally, the provided Service Principal in the Cluster object is used for authentication against the Log Analytics Workspace Data Collection API.
333-
> This capability is only visible during a new deployment (Green Field); existing cluster will not have the logs available retroactively.
346+
> This capability is only visible during a new deployment (Green Field); existing clusters don't have the logs available retroactively.
334347
335348
> [!NOTE]
336-
> The RAID controller is reset during Cluster deployment wiping all data from the server's virtual disks. Any Baseboard Management Controller (BMC) virtual disk alerts can typically be ignored unless there are additional physical disk and/or RAID controllers alerts.
349+
> The RAID controller is reset during Cluster deployment, wiping all data from the server's virtual disks. Any Baseboard Management Controller (BMC) virtual disk alerts can typically be ignored unless there are additional physical disk or RAID controller alerts.
337350
338351
By default, the hardware validation process writes the results to the configured Cluster `analyticsWorkspaceId`.
339352
However, due to the nature of Log Analytics Workspace data collection and schema evaluation, there can be ingestion delay that can take several minutes or more.
@@ -391,8 +404,8 @@ metal machines that failed the hardware validation (for example, `COMP0_SVR0_SER
391404
}
392405
```
393406

394-
See the article [Tracking Asynchronous Operations Using Azure CLI](./howto-track-async-operations-cli.md) for another example.
395-
See the article [Troubleshoot BMM provisioning](./troubleshoot-bare-metal-machine-provisioning.md) for more information that may be helpful when specific machines fail validation or deployment.
407+
For another example, see the article [Tracking Asynchronous Operations Using Azure CLI](./howto-track-async-operations-cli.md).
408+
For more information that might be helpful when specific machines fail validation or deployment, see [Troubleshoot Bare Metal Machine (BMM) provisioning](./troubleshoot-bare-metal-machine-provisioning.md).
396409

397410
## Cluster deployment validation
398411

@@ -434,7 +447,7 @@ Cluster create Logs can be viewed in the following locations:
434447
Deleting a cluster deletes the resources in Azure and the cluster that resides in the on-premises environment.
435448

436449
> [!NOTE]
437-
> If there are any tenant resources that exist in the cluster, it will not be deleted until those resources are deleted.
450+
> If any tenant resources exist in the cluster, the cluster isn't deleted until those resources are deleted.
438451
439452
:::image type="content" source="./media/nexus-delete-failure.png" lightbox="./media/nexus-delete-failure.png" alt-text="Screenshot of the portal showing the failure to delete because of tenant resources.":::
440453

@@ -443,4 +456,4 @@ az networkcloud cluster delete --name "$CLUSTER_NAME" --resource-group "$CLUSTER
443456
```
444457

445458
> [!NOTE]
446-
> It is recommended to wait for 20 minutes after deleting cluster before trying to create a new cluster with the same name.
459+
> We recommend waiting 20 minutes after deleting a cluster before trying to create a new cluster with the same name.
