`articles/operator-nexus/howto-configure-cluster.md`
You should create the Network Fabric before this on-premises deployment.
Each Operator Nexus on-premises instance has a one-to-one association
with a Network Fabric.

> [!IMPORTANT]
> There's a known issue where updating a cluster immediately after creating it can cause cluster deployment failures. The problem happens when the resource is updated before the bmcConnectionString fields are populated in the `cluster.spec.computeRackDefinitions.bareMetalMachineConfigurationData` section. The bmcConnectionStrings are normally set within a few minutes of creating the Cluster.
>
> To avoid this issue, ensure that the bmcConnectionStrings contain nonempty values before updating the Cluster resource via the Azure portal or the `az networkcloud cluster update` command.
>
> To confirm the status, open the JSON properties for the Cluster (Operator Nexus) resource in the Azure portal, or run the `az networkcloud cluster show` CLI command as shown in the following example. If the bmcConnectionString values show nonempty `redfish+https..` values, it's safe to update the cluster. This issue will be fixed in a future release.
>
> Sample bmcConnectionString output for `az networkcloud cluster show -n cluster01 -g cluster01resourceGroup --query 'computeRackDefinitions[].bareMetalMachineConfigurationData[].bmcConnectionString' -o json` is as follows:
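The sample output itself isn't included in this excerpt. As a hypothetical illustration only (the addresses and Redfish paths below are placeholders, not values from the source), the query returns an array of strings shaped like this:

```json
[
  "redfish+https://10.0.0.5/redfish/v1/Systems/System.Embedded.1",
  "redfish+https://10.0.0.6/redfish/v1/Systems/System.Embedded.1"
]
```

Empty strings or missing entries in this array mean the Cluster isn't yet safe to update.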
From the 2024-07-01 API version, a customer can assign a managed identity to a Cluster. Both system-assigned and user-assigned managed identities are supported.

Once added, the Identity can only be removed via the API call at this time.

For more information on managed identities for Operator Nexus Clusters, see [Azure Operator Nexus Cluster Support for Managed Identities and User Provided Resources](./howto-cluster-managed-identity-user-provided-resources.md).
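Managed identity is expressed through the standard Azure Resource Manager `identity` property on the resource. As a minimal sketch (property names follow the common ARM convention; the exact payload for your API version may differ), a Cluster carrying both identity types can look like:

```json
"identity": {
  "type": "SystemAssigned, UserAssigned",
  "userAssignedIdentities": {
    "/subscriptions/<subscriptionId>/resourceGroups/<resourceGroup>/providers/Microsoft.ManagedIdentity/userAssignedIdentities/<identityName>": {}
  }
}
```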
### Create the Cluster using Azure Resource Manager template editor

You can find examples for an 8-Rack 2M16C SKU cluster using these two files:

> To get the correct formatting, copy the raw code file. The values within the cluster.parameters.jsonc file are customer specific and might not be a complete list. Update the value fields for your specific environment.

1. Navigate to [Azure portal](https://portal.azure.com/) in a web browser and sign in.
1. Search for 'Deploy a custom template' in the Azure portal search bar, and then select it from the available services.
1. Select Build your own template in the editor.
1. Select Load file. Locate your cluster.jsonc template file and upload it.
1. Select Save.
1. Select Edit parameters.
1. Select Load file. Locate your cluster.parameters.jsonc parameters file and upload it.
1. Select Save.
1. Select the correct Subscription.
1. Search for the Resource group to see if it already exists. If not, create a new Resource group.
1. Make sure all Instance Details are correct.
1. Select Review + create.
### Cluster validation
Cluster create Logs can be viewed in the following locations:

### Set deployment thresholds

There are two types of deployment thresholds that can be set on a cluster before cluster deployment: `compute-deployment-threshold` and `update-strategy`.
**--compute-deployment-threshold - The validation threshold indicating the allowable failures of compute nodes during environment hardware validation.**

If `compute-deployment-threshold` isn't set, the defaults are as follows:

```
"strategyType": "Rack",
"thresholdType": "PercentSuccess",
"thresholdValue": 80,
"waitTimeMinutes": 1
```
If the customer requests a `compute-deployment-threshold` that is different from the default of 80%, you can run the following cluster update command.

This example is for a customer requesting type "PercentSuccess" with a success rate of 97%.

```azurecli
az networkcloud cluster update --name "<clusterName>" \
```

You can confirm the setting with `az networkcloud cluster show --resource-group "<resourceGroup>" --name "<clusterName>"`; the relevant portion of the output ends with:

```
"value": 97
```
In this example, if less than 97% of the compute nodes being deployed pass hardware validation, the cluster deployment fails. **NOTE: All Kubernetes control plane (KCP) and Nexus management plane (NMP) nodes must pass hardware validation.** If 97% or more of the compute nodes being deployed pass hardware validation, the cluster deployment continues to the bootstrap provisioning phase. During compute bootstrap provisioning, the `update-strategy` is used for compute nodes.
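The percentage gate described above can be sketched as a small hypothetical helper (illustrative names only, not product code); integer arithmetic sidesteps floating-point edge cases at exact thresholds:

```python
def hardware_validation_passes(passed: int, total: int, threshold_percent: int = 80) -> bool:
    """Return True when the share of compute nodes that passed hardware
    validation meets the configured PercentSuccess threshold."""
    if total == 0:
        return False
    # Equivalent to (passed / total) * 100 >= threshold_percent, without floats.
    return passed * 100 >= threshold_percent * total

# With a 97% threshold, 96 of 100 nodes passing is not enough:
print(hardware_validation_passes(96, 100, 97))  # False
print(hardware_validation_passes(97, 100, 97))  # True
```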
**--update-strategy - The strategy for updating the cluster indicating the allowable compute node failures during bootstrap provisioning.**

If the customer requests an `update-strategy` threshold that is different from the default of 80%, you can run the following cluster update command.

```azurecli
az networkcloud cluster update --name "<clusterName>" \
```
The strategy-type can be "Rack" (Rack by Rack) OR "PauseAfterRack" (wait for customer response to continue).

The threshold-type can be "PercentSuccess" OR "CountSuccess".

If updateStrategy isn't set, the defaults are as follows:

```
"strategyType": "Rack",
"thresholdType": "PercentSuccess",
"thresholdValue": 80,
"waitTimeMinutes": 1
```
This example is for a customer using Rack by Rack strategy with a Percent Success of 60% and a 1-minute pause.

```azurecli
az networkcloud cluster update --name "<clusterName>" \
```

You can confirm the settings with `az networkcloud cluster show --resource-group "<resourceGroup>" --name "<clusterName>"`; the relevant portion of the output ends with:

```
"waitTimeMinutes": 1
```
In this example, if fewer than 60% of the compute nodes in a rack are successfully provisioned (on a rack-by-rack basis), the cluster deployment fails. If 60% or more of the compute nodes are successfully provisioned, cluster deployment moves on to the next rack of compute nodes.
-
The example below is for a customer using Rack by Rack strategy with a threshold type CountSuccess of 10 nodes per rack and a 1minute pause.
277
+
This example is for a customer using Rack by Rack strategy with a threshold type CountSuccess of 10 nodes per rack and a 1-minute pause.
265
278
266
279
```azurecli
267
280
az networkcloud cluster update --name "<clusterName>" /
@@ -284,10 +297,10 @@ az networkcloud cluster show --resource-group "<resourceGroup>" /
284
297
"waitTimeMinutes": 1
285
298
```
In this example, if fewer than 10 compute nodes in a rack are successfully provisioned (on a rack-by-rack basis), the cluster deployment fails. If 10 or more of the compute nodes are successfully provisioned, cluster deployment moves on to the next rack of compute nodes.
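The rack-by-rack gate with either threshold type can be sketched as a hypothetical helper (illustrative names only, not product code):

```python
def rack_passes(provisioned: int, rack_size: int,
                threshold_type: str, threshold_value: int) -> bool:
    """Return True when a rack meets the update-strategy threshold.

    threshold_type is "PercentSuccess" (threshold_value is a percentage)
    or "CountSuccess" (threshold_value is a node count per rack).
    """
    if threshold_type == "PercentSuccess":
        # Integer arithmetic avoids floating-point edge cases.
        return provisioned * 100 >= threshold_value * rack_size
    if threshold_type == "CountSuccess":
        return provisioned >= threshold_value
    raise ValueError(f"unknown threshold type: {threshold_type}")

# 60% of a 16-node rack: 10 provisioned nodes is enough (62.5%).
print(rack_passes(10, 16, "PercentSuccess", 60))  # True
# CountSuccess of 10: 9 provisioned nodes fails the rack.
print(rack_passes(9, 16, "CountSuccess", 10))     # False
```

When a rack passes, deployment moves on to the next rack; when it fails, the cluster deployment fails, matching the behavior described above.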
> [!NOTE]
> Deployment thresholds can't be changed after cluster deployment starts.
## Deploy Cluster
```azurecli
az networkcloud cluster deploy \
```

> [!TIP]
> To check the status of the `az networkcloud cluster deploy` command, run it with the `--debug` flag.
> This allows you to obtain the `Azure-AsyncOperation` or `Location` header used to query the `operationStatuses` resource.
> See the section [Cluster Deploy Failed](#cluster-deploy-failed) for more detailed steps.
> Optionally, the command can run asynchronously using the `--no-wait` flag.
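As a hypothetical illustration of the tip above, the polling URL can be pulled out of the response headers that `--debug` surfaces. The header names (`Azure-AsyncOperation`, `Location`) are standard ARM conventions; the helper and the example values are illustrative, not part of the product:

```python
def async_operation_url(headers: dict):
    """Return the URL to poll for an ARM long-running operation,
    preferring Azure-AsyncOperation over Location."""
    # HTTP header names are case-insensitive; normalize keys first.
    normalized = {k.lower(): v for k, v in headers.items()}
    return normalized.get("azure-asyncoperation") or normalized.get("location")

# Placeholder headers as they might appear in --debug output:
headers = {
    "Azure-AsyncOperation": "https://management.azure.com/subscriptions/<subscriptionId>/providers/Microsoft.NetworkCloud/locations/<location>/operationStatuses/<operationId>?api-version=2024-07-01",
}
print(async_operation_url(headers))
```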
and any user-skipped machines, a determination is made on whether sufficient nodes
passed and/or are available to meet the thresholds necessary for deployment to continue.
> [!IMPORTANT]
> The hardware validation process writes the results to the specified `analyticsWorkspaceId` at Cluster Creation.
> Additionally, the provided Service Principal in the Cluster object is used for authentication against the Log Analytics Workspace Data Collection API.
> This capability is only visible during a new deployment (Green Field); existing clusters don't have the logs available retroactively.
> [!NOTE]
> The RAID controller is reset during Cluster deployment, wiping all data from the server's virtual disks. Any Baseboard Management Controller (BMC) virtual disk alerts can typically be ignored unless there are additional physical disk or RAID controller alerts.
By default, the hardware validation process writes the results to the configured Cluster `analyticsWorkspaceId`.
However, due to the nature of Log Analytics Workspace data collection and schema evaluation, there can be an ingestion delay of several minutes or more.
metal machines that failed the hardware validation (for example, `COMP0_SVR0_SER...`).

For another example, see the article [Tracking Asynchronous Operations Using Azure CLI](./howto-track-async-operations-cli.md).

For more information that might be helpful when specific machines fail validation or deployment, see [Troubleshoot Bare Metal Machine (BMM) provisioning](./troubleshoot-bare-metal-machine-provisioning.md).

## Cluster deployment validation
Cluster create Logs can be viewed in the following locations:

Deleting a cluster deletes the resources in Azure and the cluster that resides in the on-premises environment.

> [!NOTE]
> If any tenant resources exist in the cluster, it isn't deleted until those resources are deleted.

:::image type="content" source="./media/nexus-delete-failure.png" lightbox="./media/nexus-delete-failure.png" alt-text="Screenshot of the portal showing the failure to delete because of tenant resources.":::