Commit e368f0e

author
Manika Dhiman
committed
Merge branch 'main' of https://github.com/MicrosoftDocs/azure-stack-docs-pr into md-known-issue
2 parents 1e7ce96 + 9896d89 commit e368f0e

3 files changed

+248
-22
lines changed
Lines changed: 225 additions & 0 deletions
@@ -0,0 +1,225 @@
1+
---
2+
title: Troubleshoot deployment validation issues in Azure Stack HCI, version 23H2 via Azure portal
3+
description: Learn how to troubleshoot the deployment validation failures for Azure Stack HCI, version 23H2 when deployed via the Azure portal.
4+
ms.topic: how-to
5+
ms.author: alkohli
6+
author: alkohli
7+
ms.date: 08/21/2024
8+
---
9+
10+
11+
# Troubleshoot Azure portal deployment validation issues for Azure Stack HCI, version 23H2
12+
13+
> Applies to: Azure Stack HCI, version 23H2 running 2405 or later
14+
15+
This article provides guidance on how to troubleshoot deployment validation issues experienced during the deployment of your Azure Stack HCI cluster via the Azure portal.
16+
17+
## Error - deployment validation failure
18+
19+
When deploying Azure Stack HCI, version 23H2 via the Azure portal, you might encounter a deployment validation failure.
20+
The "Azure Stack HCI Network - Check network requirements" validation task fails with the following error:
21+
22+
```
23+
Could not complete the operation. 400: Resource creation validation failed. Details:
24+
[{"Code":"AnswerFileValidationFailed","Message":"Errors in Value Validation:\r\nPhysicalNodesValidator
25+
found error at deploymentdata.physicalnodes[0].ipv4address: The specified for
26+
\u0027deploymentdata.physicalnodes[0].ipv4address\u0027 is not a valid IPv4 address.
27+
Example: 192.168.0.1 or 192.168.0.1","Target":null,"Details":null}].
28+
```
29+
30+
If you go to the **Networking** tab in the Azure portal deployment, you might see the following error within the **Network Intent** configuration:
31+
32+
```
33+
The selected physical network adapter is not binded to the management virtual switch.
34+
```
35+
36+
## Cause
37+
38+
This issue occurs on deployments triggered after August 6. It happens when deployment validation ran on the cluster, failed, and was then retried.
39+
40+
The issue occurs for the following reasons:
41+
42+
- Validation on the device creates a VM switch for network-related tests and deletes it when the tests finish.
43+
- The `DeviceManagementExtension` doesn't detect the deletion of the VM switch, so the cloud-side data still lists it.
44+
45+
## Recommended resolution
46+
47+
The multi-step resolution process includes the following steps:
48+
49+
- [Remove the lock from the seed node](#remove-the-lock-from-the-seed-node)
50+
- [Remove the validation error](#remove-the-validation-error)
51+
- [Clean up the Edge Device Azure Resource with incorrect VM switch information](#clean-up-the-edge-device-azure-resource-with-incorrect-vm-switch-information)
52+
- [Refresh the cloud data](#refresh-the-cloud-edgedevices-data)
53+
- [Restart the deployment via Azure portal](#restart-the-deployment-via-azure-portal)
54+
- [Recreate the lock on the seed node resource](#recreate-the-lock-on-the-seed-node-resource)
55+
56+
> [!NOTE]
57+
> All the steps in this article need to be performed on the seed node.
58+
59+
### Remove the lock from the seed node
60+
61+
Follow these steps to remove the lock from the seed node:
62+
63+
1. In the Azure portal, go to the seed node resource via the resource group or within **Machines - Azure Arc**.
64+
1. In the left pane, go to **Settings > Locks**. You should see a lock named **DoNotDelete**. This is the automatic resource lock that is created when the node is onboarded.
65+
1. Select **Delete** against the lock.
66+
67+
If you attempt the steps in the next section without removing the lock, the **Delete** command fails with the following error:
68+
69+
```
70+
Some resources failed to be deleted (run with `--verbose` for more information):
71+
/subscriptions/<subid>/resourceGroups/<rgname>/providers/Microsoft.HybridCompute/machines/<Machine Name>/providers/Microsoft.AzureStackHCI/edgeDevices/default
72+
```
73+
74+
Here's the example output when run with the `--verbose` switch:
75+
76+
```Output
77+
(ScopeLocked) The scope '/subscriptions/<Subscription ID>/resourceGroups/<Resource Group Name>/providers/Microsoft.HybridCompute/machines/<Machine Name>/providers/Microsoft.AzureStackHCI/edgeDevices/default' cannot perform delete operation because following scope(s) are locked: '/subscriptions/<subid>/resourceGroups/<rgname>/providers/Microsoft.HybridCompute/machines/<Machine Name>'. Please remove the lock and try again.
78+
Code: ScopeLocked
79+
Message: The scope '/subscriptions/<subid>/resourceGroups/<rgname>/providers/Microsoft.HybridCompute/machines/<Machine Name>/providers/Microsoft.AzureStackHCI/edgeDevices/default' cannot perform delete operation because following scope(s) are locked: '/subscriptions/<subid>/resourceGroups/<rgname>/providers/Microsoft.HybridCompute/machines/<Machine Name>'. Please remove the lock and try again.
80+
```
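The portal steps above can also be sketched with Azure CLI. This is a minimal sketch, not part of the original procedure: it assumes the lock is named **DoNotDelete** and sits on the Arc machine resource (`Microsoft.HybridCompute/machines`), as described above. The function name `remove_seed_node_lock` and the `AZ` variable are hypothetical helpers introduced for illustration.

```shell
# Assumption: AZ points at the Azure CLI. Set AZ=echo to print the command
# instead of running it (dry run).
AZ="${AZ:-az}"

# Hypothetical helper: delete the DoNotDelete lock from the seed node's
# Arc machine resource.
remove_seed_node_lock() {
  local resource_group="$1" machine_name="$2"
  "$AZ" lock delete \
    --name "DoNotDelete" \
    --resource-group "$resource_group" \
    --resource-name "$machine_name" \
    --resource-type "Microsoft.HybridCompute/machines"
}
```

For example, `AZ=echo remove_seed_node_lock "<Resource Group Name>" "<Machine Name>"` prints the command so that you can review it before running it for real.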
81+
82+
### Remove the validation error
83+
84+
With the lock removed, follow these steps to remove the validation error.
85+
86+
1. Connect to the seed node. Run the following PowerShell command:
87+
88+
```PowerShell
89+
Get-VMSwitch
90+
```
91+
92+
1. Check the output of the `Get-VMSwitch` command for any unexpected VM switches, for example, the switch that gets created during the Network Validation step and has a name similar to: `"ConvergedSwitch(compute_management)"`. The exact name of the switch depends on the chosen network intent configuration.
93+
94+
1. If a VM switch that you didn't intentionally create exists, remove the switch. Run the following PowerShell command:
95+
96+
```PowerShell
97+
Remove-VMSwitch -Name "<VM Switch Name>" -Force
98+
```
99+
100+
Make sure to use the VM switch name from the `Get-VMSwitch` command. If you didn't intentionally create a VM switch and `Get-VMSwitch` returns no results, the Network Validation step already cleaned up the VM switch on the device. In that case, the failure occurs because the `DeviceManagementExtension` didn't detect the cleanup.
101+
102+
Continue with the cleanup steps.
103+
104+
### Clean up the Edge Device Azure Resource with incorrect VM switch information
105+
106+
After the VM switch on the device is removed, clean up the Edge Device ARM resource containing the incorrect VM switch information via the Azure CLI.
107+
108+
1. On a client that can access Azure, verify that Azure CLI is installed, or install it: [Install Azure CLI on Windows](/cli/azure/install-azure-cli-windows?tabs=azure-cli)
109+
- To verify the installation, run: `az`
110+
- If installed, this outputs a `"Welcome to Azure CLI!"` message with available commands.
111+
112+
1. Sign in to Azure with Azure CLI. Run the following command:
113+
114+
```AzureCLI
115+
az login --tenant <tenant ID> --use-device-code
116+
```
117+
118+
For more information, see [Sign in interactively with Azure CLI](/cli/azure/authenticate-azure-cli-interactively).
119+
120+
1. To set a specific subscription, run the following command:
121+
122+
```AzureCLI
123+
az account set --subscription "<Subscription ID>"
124+
```
125+
126+
Replace the value in the above example command with the appropriate value for `<Subscription ID>`.
127+
128+
1. Output the data stored within the `edgeDevices` resource that has the incorrectly stored VM Switch information. Run the following command:
129+
130+
```AzureCLI
131+
az resource show --ids "/subscriptions/<Subscription ID>/resourceGroups/<Resource Group Name>/providers/Microsoft.HybridCompute/machines/<Machine Name>/providers/Microsoft.AzureStackHCI/edgeDevices/default"
132+
```
133+
134+
Replace the values in the above example command with the appropriate values for `<Subscription ID>`, `<Resource Group Name>`, and `<Machine Name>`.
135+
136+
Here's an example command with a machine name filled in:
137+
138+
```output
139+
az resource show --ids "/subscriptions/<Subscription ID>/resourceGroups/<Resource Group Name>/providers/Microsoft.HybridCompute/machines/ASRR1N26R15U33/providers/Microsoft.AzureStackHCI/edgeDevices/default"
140+
```
141+
142+
The output of this command shows extensive detail about the \<Machine Name\> used in the command. Near the bottom of the output, the `"switchDetails"` section most likely shows the following entry. This is the validation VM switch that was created and then cleaned up on the device. The `DeviceManagementExtension` didn't detect the cleanup, so the cloud-side data was never updated:
143+
`"switchName": "ConvergedSwitch(managementcompute)",`
144+
`"switchType": "External"`
145+
146+
1. After confirming that the `show` command returns the `edgeDevices` data, and ideally the stale `"switchDetails"` entry, delete the resource from ARM so that it can be refreshed from the seed node.
147+
148+
> [!NOTE]
149+
> Deleting the `edgeDevices` data is a safe action to perform, but it should only be performed when explicitly stated. Do not perform this action unless advised to do so.
150+
151+
1. Delete the `edgeDevices` resource, which has the incorrectly stored VM switch information. Run the following command:
152+
153+
```AzureCLI
154+
az resource delete --ids "/subscriptions/<Subscription ID>/resourceGroups/<Resource Group Name>/providers/Microsoft.HybridCompute/machines/<Machine Name>/providers/Microsoft.AzureStackHCI/edgeDevices/default"
155+
```
156+
157+
Replace the values (remember to remove the \<\> characters as well) with the appropriate values for:
158+
`<Subscription ID>`
159+
`<Resource Group Name>`
160+
`<Machine Name>`
161+
162+
This is the same resource `--ids` value used with `show`, so you can reuse that string. For example, recall the previous command in the console with the up arrow key and replace `show` with `delete`.
163+
164+
Here's an example command:
165+
166+
```Output
167+
az resource delete --ids "/subscriptions/<Subscription ID>/resourceGroups/<Resource Group Name>/providers/Microsoft.HybridCompute/machines/<Machine Name>/providers/Microsoft.AzureStackHCI/edgeDevices/default"
168+
```
169+
When the command succeeds, it produces no output and returns to the command prompt. If it returns an error instead, more troubleshooting is required.
170+
171+
1. Verify the deletion of the resource by running the `show` command again. Here's an example output:
172+
173+
```Output
174+
(ResourceNotFound) The resource 'Microsoft.HybridCompute/machines/<Machine Name>/providers/Microsoft.AzureStackHCI/edgeDevices/default' could not be found.
175+
Code: ResourceNotFound
176+
Message: The resource 'Microsoft.HybridCompute/machines/<Machine Name>/providers/Microsoft.AzureStackHCI/edgeDevices/default' could not be found.
177+
```
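The show, delete, and verify sequence above reuses one long resource ID, so it can help to build that ID once. The following is a minimal sketch under that assumption. The function names `edge_device_id` and `reset_edge_device` are hypothetical helpers, not part of Azure CLI. The `az resource show` and `az resource delete` calls are the same commands used in the steps above.

```shell
# Hypothetical helper: build the full ARM ID of the edgeDevices resource.
edge_device_id() {
  local subscription_id="$1" resource_group="$2" machine_name="$3"
  printf '/subscriptions/%s/resourceGroups/%s/providers/Microsoft.HybridCompute/machines/%s/providers/Microsoft.AzureStackHCI/edgeDevices/default' \
    "$subscription_id" "$resource_group" "$machine_name"
}

# Hypothetical helper: show the resource, delete it, then confirm it is gone.
reset_edge_device() {
  local id
  id="$(edge_device_id "$1" "$2" "$3")"
  az resource show --ids "$id" || return 1
  az resource delete --ids "$id" --verbose || return 1
  # After a successful delete, `show` is expected to fail with ResourceNotFound.
  if az resource show --ids "$id" >/dev/null 2>&1; then
    echo "edgeDevices resource still exists" >&2
    return 1
  fi
}
```

A call such as `reset_edge_device "<Subscription ID>" "<Resource Group Name>" "<Machine Name>"` runs the three commands in order and stops at the first failure. Remove the lock first, or the delete fails with `ScopeLocked`.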
178+
179+
### Refresh the cloud `edgeDevices` data
180+
181+
With the ARM resource and all the unintentional VM switches removed, refresh the cloud-side `edgeDevices` data again.
182+
183+
Follow these steps to refresh the cloud data:
184+
185+
1. Restart the `DeviceManagementService` on the seed node. Run the following PowerShell command:
186+
187+
```PowerShell
188+
Restart-Service DeviceManagementService
189+
```
190+
191+
1. Wait a few minutes and then verify that the cloud `edgeDevices` data is updated and reflects the current state. Run the `show` command again and review the output. Make sure that the output no longer contains any unexpected VM switches, namely:
192+
193+
`"switchName": "ConvergedSwitch(managementcompute)",`
194+
`"switchType": "External"`
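The check in the last step can be sketched as a small filter over the `show` output. This is an illustrative sketch only. It assumes the stale entry has a `switchName` that starts with `ConvergedSwitch(`, as in the example above; the exact name depends on your network intent configuration. The function name `has_validation_switch` is hypothetical.

```shell
# Hypothetical helper: reads `az resource show` JSON on stdin and succeeds
# if a switchName entry starting with "ConvergedSwitch(" is present.
has_validation_switch() {
  grep -q '"switchName": *"ConvergedSwitch('
}
```

For example, `az resource show --ids "<resource ID>" | has_validation_switch && echo "Stale VM switch still listed"` prints a message only while the cloud data still lists such a switch. If you intentionally created a switch with a matching name, adjust the pattern accordingly.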
195+
196+
### Restart the deployment via Azure portal
197+
198+
With device and cloud data now back in sync, you can go to the Azure portal and provide the deployment inputs again. The previous steps cleared any cached information left over from earlier attempts.
199+
200+
Follow these steps in the Azure portal:
201+
202+
1. On the **Basics** tab, reenter your inputs starting from the top of the page, selecting each value from the dropdowns again.
203+
204+
1. Uncheck the nodes at the bottom of the page, and then select them again.
205+
206+
1. Revalidate the reselected nodes.
207+
208+
1. Confirm the information on the subsequent pages. You should see the following changes:
209+
- On the **Networking** page, you should no longer see the `The selected physical network adapter is not binded to the management virtual switch` error that you might have seen previously.
210+
- On the **Validation** page at the end, if you're past the original issue, the `deploymentdata.physicalnodes[0].ipv4address is not a valid IPv4 address` error won't be displayed.
211+
212+
1. If no other validation issues occur, start the deployment.
213+
214+
### Recreate the lock on the seed node resource
215+
216+
After the mitigation is complete, we strongly recommend that you recreate the lock on the resource.
217+
218+
Follow these steps to recreate the lock:
219+
220+
1. In the Azure portal, go to the object via the resource group or within **Machines - Azure Arc**.
221+
1. Go to **Settings > Locks**.
222+
1. Select **+ Add** at the top of the page.
223+
1. For **Lock name**, enter **DoNotDelete**.
224+
1. For **Lock type**, select **Delete** from the dropdown.
225+
1. Select **OK** to save the lock.
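The lock can also be recreated with Azure CLI instead of the portal. This is a minimal sketch, assuming the same lock name and scope as the original lock. The **Delete** lock type in the portal corresponds to `CanNotDelete` in the CLI. The function name `recreate_seed_node_lock` and the `AZ` variable are hypothetical helpers introduced for illustration.

```shell
# Assumption: AZ points at the Azure CLI. Set AZ=echo to print the command
# instead of running it (dry run).
AZ="${AZ:-az}"

# Hypothetical helper: recreate the DoNotDelete lock on the seed node's
# Arc machine resource.
recreate_seed_node_lock() {
  local resource_group="$1" machine_name="$2"
  "$AZ" lock create \
    --name "DoNotDelete" \
    --lock-type "CanNotDelete" \
    --resource-group "$resource_group" \
    --resource-name "$machine_name" \
    --resource-type "Microsoft.HybridCompute/machines"
}
```

As a usage example, `recreate_seed_node_lock "<Resource Group Name>" "<Machine Name>"` creates the lock, and the result should then be visible under **Settings > Locks** in the portal.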

azure-stack/hci/toc.yml

Lines changed: 2 additions & 0 deletions
@@ -481,6 +481,8 @@ items:
481481
items:
482482
- name: Collect logs
483483
href: manage/collect-logs.md
484+
- name: Troubleshoot deployment validation issues
485+
href: manage/troubleshoot-deployment.md
484486
- name: Get support for deployment issues
485487
href: manage/get-support-for-deployment-issues.md
486488
- name: Get support for Azure Stack HCI

azure-stack/hci/upgrade/install-enable-network-atc.md

Lines changed: 21 additions & 22 deletions
@@ -41,7 +41,7 @@ Before you install and enable Network ATC on your existing Azure Stack HCI, make
4141
## Steps to install and enable Network ATC
4242

4343
> [!IMPORTANT]
44-
> If you don't have running workloads on your nodes, just add your intent command as if this was a new cluster. You don't need to continue with the next set of instructions.
44+
> If you don't have running workloads on your nodes, execute [Step 4: Remove the existing configuration on the paused node without running VMs](#step-4-remove-the-existing-configuration-on-the-paused-node-without-running-vms) to remove any previous configurations that could conflict with Network ATC, then add your intent(s) following the standard procedures found in [Deploy host networking with Network ATC](../deploy/network-atc.md).
4545
4646
### Step 1: Install Network ATC
4747

@@ -51,21 +51,21 @@ In this step, you install Network ATC on every node in the cluster using the fol
5151
Install-WindowsFeature -Name NetworkATC
5252
```
5353

54-
### Step 2: Pause one node in the cluster
54+
### Step 2: Stop the Network ATC service
5555

56-
When you pause one node in the cluster, all workloads are moved to other nodes, making your machine available for changes. The paused node is then migrated to Network ATC. To pause your cluster node, use the following command:
56+
To prevent Network ATC from applying the intent while VMs are running, stop or disable the Network ATC service on all nodes that aren't paused. Use these commands:
5757

5858
```powershell
59-
Suspend-ClusterNode
59+
Set-Service -Name NetworkATC -StartupType Disabled
60+
Stop-Service -Name NetworkATC
6061
```
6162

62-
### Step 3: Stop the Network ATC service
63+
### Step 3: Pause one node in the cluster
6364

64-
To prevent Network ATC from applying the intent while VMs are running, stop or disable the Network ATC service on all nodes that aren't paused. Use these commands:
65+
When you pause one node in the cluster, all workloads are moved to other nodes, making your machine available for changes. The paused node is then migrated to Network ATC. To pause your cluster node, use the following command:
6566

6667
```powershell
67-
Set-Service -Name NetworkATC -StartupType Disabled
68-
Stop-Service -Name NetworkATC
68+
Suspend-ClusterNode
6969
```
7070

7171
### Step 4: Remove the existing configuration on the paused node without running VMs
@@ -98,7 +98,7 @@ If your nodes were configured via Virtual Machine Manager (VMM), those configura
9898

9999
### Step 5: Start the Network ATC service
100100

101-
As a precaution, to control the speed of the rollout, we paused the node and then stopped or disabled the Network ATC service in the previous steps. Since Network ATC intents are implemented cluster-wide, perform this step only once.
101+
As a precaution, to control the speed of the rollout, we paused the node and then stopped and disabled the Network ATC service in the previous steps. Since Network ATC intents are implemented cluster-wide, perform this step only once.
102102

103103
To start the Network ATC service, on the paused node only, run the following command:
104104

@@ -109,13 +109,9 @@ Set-service -Name NetworkATC -StartupType Automatic
109109

110110
### Step 6: Add the Network ATC intent
111111

112-
There are various intents that you can add. Identify the intent or intents you'd like using the examples in the next section.
112+
There are various intents that you can add. Identify the intent or intents you'd like by using the examples in the next section.
113113

114-
To add the Network ATC intent, run the following command:
115-
116-
```powershell
117-
Set-Service -Name NetworkATC -StartupType Automatic
118-
```
114+
To add the Network ATC intent, run the `Add-NetIntent` command with the appropriate options for the intent you want to deploy.
119115

120116
### Example intents
121117

@@ -153,7 +149,7 @@ In this example, there's a single intent managed across cluster nodes.
153149
Here's an example to implement this host network pattern:
154150
155151
```powershell
156-
Add-Netintent -Name MgmtComputeStorage -Management -Compute -Storage -AdapterName pNIC1, pNIC2
152+
Add-NetIntent -Name MgmtComputeStorage -Management -Compute -Storage -AdapterName pNIC1, pNIC2
157153
```
158154
159155
#### Group compute and storage traffic on one intent with a separate management intent
@@ -227,27 +223,30 @@ ProvisioningStatus : Completed
227223

228224
Ensure that each intent added has an entry for the host you're working on. Also, make sure the **ConfigurationStatus** shows **Success**.
229225

230-
If the **ConfigurationStatus** shows **Failed**, check to see if the error message indicates the reason for the failure. For some examples of failure resolutions, see [Common Error Messages](../deploy/network-atc.md#common-error-messages).
226+
If the **ConfigurationStatus** shows **Failed**, check to see if the error message indicates the reason for the failure. You can also review the Microsoft-Windows-Networking-NetworkATC/Admin event logs for more details on the reason for the failure. For some examples of failure resolutions, see [Common Error Messages](../deploy/network-atc.md#common-error-messages).
231227

232228
### Step 8: Rename the VMSwitch on other nodes
233229

234230
In this step, you move from the node deployed with Network ATC to the next node and migrate the VMs from this second node. You must verify that the second node has the same `VMSwitch` name as the node deployed with Network ATC.
235231

236-
This is a nondisruptive change and can be done on all the nodes simultaneously. Run the following command:
232+
> [!IMPORTANT]
233+
> After the virtual switch is renamed, you must disconnect and reconnect each virtual machine so that it can appropriately cache the new name of the virtual switch. This is a disruptive action that requires planning to complete. If you do not perform this action, live migrations will fail with an error indicating the virtual switch doesn't exist on the destination.
234+
235+
Renaming the virtual switch is a non-disruptive change and can be done on all the nodes simultaneously. Run the following command:
237236

238237
```powershell
239238
#Run on the node where you configured Network ATC
240-
Get-VMSwitch | ft name
239+
Get-VMSwitch | ft Name
241240
242241
#Run on the next node to rename the virtual switch
243242
Rename-VMSwitch -Name 'ExistingName' -NewName 'NewATCName'
244243
```
245244

246-
After your switch is renamed, disconnect and reconnect your vNICs for the `VMSwitch` name change to go through. Once the change goes through, on each node, run the following commands:
245+
After your switch is renamed, disconnect and reconnect your vNICs for the `VMSwitch` name change to go through. The command below can be used to perform this action for all VMs:
247246

248247
```powershell
249248
$VMSW = Get-VMSwitch
250-
$VMs = get-vm
249+
$VMs = Get-VM
251250
$VMs | %{Get-VMNetworkAdapter -VMName $_.name | Disconnect-VMNetworkAdapter ; Get-VMNetworkAdapter -VMName $_.name | Connect-VMNetworkAdapter -SwitchName $VMSW.name}
252251
```
253252

@@ -265,7 +264,7 @@ Resume-ClusterNode
265264
```
266265

267266
> [!NOTE]
268-
> To apply the Network ATC settings across the cluster, repeat steps 1 through 5, step 7, and step 9 for each node of the cluster.
267+
> To apply the Network ATC settings across the cluster, repeat steps 1 through 5 (skip deleting the virtual switch as it was renamed), step 7, and step 9 for each node of the cluster.
269268
270269
## Next step
271270
