Commit 0511779

Merge pull request #227778 from themar-msft/themar-chaos-troubleshoot-vmss
Adds vmss chaos agent install troubleshooting
2 parents 9e15a52 + ba3a0b8 commit 0511779

1 file changed: +57 -25 lines changed
articles/chaos-studio/troubleshooting.md

Lines changed: 57 additions & 25 deletions
@@ -16,68 +16,100 @@ As you use Chaos Studio, you may occasionally encounter some problems. This arti
1616
## General troubleshooting tips
1717

1818
The following sources are useful when troubleshooting issues with Chaos Studio:
19-
1. **The Activity Log**: The [Azure Activity Log](../azure-monitor/essentials/activity-log.md) has a record of all create, update, and delete operations in a subscription, including Chaos Studio operations like enabling a target and/or capabilities, installing the agent, and creating or running an experiment. Failures in the Activity Log indicate that a user action essential to using Chaos Studio may have failed to complete. Most service-direct faults also inject faults by executing an Azure Resource Manager operation, so the Activity Log will also have the record of faults being injected during an experiment for some service-direct faults.
20-
2. **Experiment Details**: Experiment execution details show the status and errors of an individual experiment run. Opening a specific fault in experiment details will show the resources that failed and the error messages for a failure. [Learn more about how to access experiment details](chaos-studio-run-experiment.md#view-experiment-history-and-details).
19+
1. **The Activity Log**: The [Azure Activity Log](../azure-monitor/essentials/activity-log.md) has a record of all create, update, and delete operations in a subscription. These records include Chaos Studio operations like enabling a target and/or capabilities, installing the agent, and creating or running an experiment. Failures in the Activity Log indicate that a user action essential to using Chaos Studio may have failed to complete. Most service-direct faults also inject faults by executing an Azure Resource Manager operation, so the Activity Log also has the record of faults that were injected during an experiment for some service-direct faults.
20+
2. **Experiment Details**: Experiment execution details show the status and errors of an individual experiment run. Opening a specific fault in experiment details shows the resources that failed and the error messages for a failure. [Learn more about how to access experiment details](chaos-studio-run-experiment.md#view-experiment-history-and-details).
2121
3. **Agent logs**: If using an agent-based fault, you may need to RDP or SSH into the virtual machine to understand why the agent failed to run a fault. The instructions for accessing agent logs depend on the operating system:
22-
* **Chaos Windows agent**: Agent logs are located in the Windows Event Log in the Application category with the source AzureChaosAgent. The agent adds fault activity and regular health check (ability to authenticate to and communicate with the Chaos Studio agent service) events to this log.
22+
* **Chaos Windows agent**: Agent logs are in the Windows Event Log in the Application category with the source AzureChaosAgent. The agent adds fault activity and regular health check (ability to authenticate to and communicate with the Chaos Studio agent service) events to this log.
2323
* **Chaos Linux agent**: The Linux agent uses systemd to manage the agent process as a Linux service. To view the systemd journal for the agent (the events logged by the agent service), run the command `journalctl -u azure-chaos-agent`.
24-
4. **VM extension status**: If using an agent-based fault, you may also need to verify that the VM extension is installed and healthy. In the Azure portal, navigate to your virtual machine and go to **Extensions** or **Extensions + applications**. Click on the ChaosAgent extension and look for the following fields:
25-
* **Status** should show "Provisioning succeeded." Any other status indicates that the agent failed to install. Verify that all [system requirements](chaos-studio-limitations.md#limitations) are met and try re-installing the agent.
26-
* **Handler status** should show "Ready." Any other status indicates that the agent installed but cannot connect to the Chaos Studio service. Verify that all [network requirements](chaos-studio-limitations.md#limitations) are met and that the user-assigned managed identity has been added to the virtual machine and try rebooting.
24+
4. **VM extension status**: If using an agent-based fault, verify that the VM extension is installed and healthy. In the Azure portal, navigate to your virtual machine and go to **Extensions** or **Extensions + applications**. Click on the ChaosAgent extension and look for the following fields:
25+
* **Status** should show "Provisioning succeeded." Any other status indicates that the agent failed to install. Verify that you meet all [system requirements](chaos-studio-limitations.md#limitations) and try reinstalling the agent.
26+
* **Handler status** should show "Ready." Any other status indicates that the agent installed but can't connect to the Chaos Studio service. Verify that you meet all [network requirements](chaos-studio-limitations.md#limitations) and that the user-assigned managed identity has been added to the virtual machine, then try rebooting.
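
The agent log and extension checks in items 3 and 4 can also be run from a command line. The following is a minimal sketch, not an official procedure: the function names are hypothetical helpers, the one-hour default window is an arbitrary choice, and the resource group and VM name are placeholders you supply. The extension listing reports only the provisioning state (the **Status** field); it doesn't show the handler status.

```shell
# Hypothetical helper: show recent Chaos agent entries from the systemd journal.
# Run this on the Linux VM itself.
# $1 (optional): how far back to look, in journalctl --since syntax.
chaos_agent_logs() {
  journalctl -u azure-chaos-agent --since "${1:-1 hour ago}" --no-pager
}

# Hypothetical helper: list each VM extension with its provisioning state.
# Run this wherever the Azure CLI is signed in.
# $1: resource group name, $2: virtual machine name.
vm_extension_states() {
  az vm extension list \
    --resource-group "$1" \
    --vm-name "$2" \
    --query "[].{name:name, state:provisioningState}" \
    --output table
}

# Usage:
#   chaos_agent_logs "2 hours ago"
#   vm_extension_states myResourceGroup myVM
```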
2727

2828
## Issues onboarding a resource
2929

30-
### Resources do not show up in the targets list in the Azure portal
31-
If you do not see the resources you would like to enable in the Chaos Studio targets list, it may be due to any of the following issues:
32-
* The resources are not in [a supported region for Chaos Studio](https://azure.microsoft.com/global-infrastructure/services/?products=chaos-studio).
33-
* The resources are not of [a supported resource type in Chaos Studio](chaos-studio-fault-providers.md).
34-
* The resources are in a subscription or resource group that are filtered out in the filters for the target list. Change the subscription and resource group filters to see your resources.
30+
### Resources don't show up in the targets list in the Azure portal
31+
If you don't see the resources you would like to enable in the Chaos Studio targets list, it may be due to any of the following issues:
32+
* The resources aren't in [a supported region for Chaos Studio](https://azure.microsoft.com/global-infrastructure/services/?products=chaos-studio).
33+
* The resources aren't of [a supported resource type in Chaos Studio](chaos-studio-fault-providers.md).
34+
* The resources are in a subscription or resource group that is filtered out in the filters for the target list. Change the subscription and resource group filters to see your resources.
3535

3636
### Target and/or capability enablement fails or doesn't show correctly in the target list
3737
If you see an error when enabling targets and/or capabilities, try the following steps:
38-
1. Verify that you have appropriate permissions to the resources you are onboarding. Enabling a target and/or capabilities requires Microsoft.Chaos/\* permission at the scope of the resource. Built-in roles such as Contributor have wildcard Read and Write permission, which includes permission to all Microsoft.Chaos operations.
38+
1. Verify that you have appropriate permissions to the resources you're onboarding. Enabling a target and/or capabilities requires Microsoft.Chaos/\* permission at the scope of the resource. Built-in roles such as Contributor have wildcard Read and Write permission, which includes permission to all Microsoft.Chaos operations.
3939
2. Wait a few minutes for the target and capability list to update. The Azure portal uses Azure Resource Graph to gather information on target and capability onboarding and it can take up to five minutes for the update to propagate.
4040
3. If the resource still shows "Not enabled", try the following steps:
4141
1. Attempt to enable the resource again.
4242
2. If resource enablement still fails, visit the Activity Log and find the failed target create operation to see detailed error information.
4343
4. If the resource shows "Enabled" but onboarding capabilities failed, try the following steps:
44-
1. Click the **Manage actions** button on the resource in the targets list. Check any capabilities that were not checked, and click **Save**.
44+
1. Click the **Manage actions** button on the resource in the targets list. Check any capabilities that weren't checked, and click **Save**.
4545
2. If capability enablement still fails, visit the Activity Log and find the failed target create operation to see detailed error information.
4646

4747
## Prerequisite issues
4848

4949
Some issues are caused by missing prerequisites.
5050

5151
### Agent-based faults fail on a virtual machine
52-
Agent-based faults may fail for a variety of reasons related to missing prerequisites:
52+
Agent-based faults may fail for various reasons related to missing prerequisites:
5353
* On Linux VMs, the [CPU Pressure](chaos-studio-fault-library.md#cpu-pressure), [Physical Memory Pressure](chaos-studio-fault-library.md#physical-memory-pressure), [Disk I/O pressure](chaos-studio-fault-library.md#disk-io-pressure-linux), and [Arbitrary Stress-ng Stress](chaos-studio-fault-library.md#arbitrary-stress-ng-stress) faults all require the [stress-ng utility](https://wiki.ubuntu.com/Kernel/Reference/stress-ng) to be installed on your virtual machine. For more information on how to install stress-ng, see the fault prerequisite sections.
5454
* On either Linux or Windows VMs, the user-assigned managed identity provided during agent-based target enablement must also be added to the virtual machine.
55-
* On either Linux or Windows VMs, the system-assigned managed identity for the experiment must be granted Reader role on the VM (seemingly elevated roles like Virtual Machine Contributor do not include the \*/Read operation that is necessary for the Chaos Studio agent service to read the microsoft-agent target proxy resource on the virtual machine).
55+
* On either Linux or Windows VMs, the system-assigned managed identity for the experiment must be granted Reader role on the VM (seemingly elevated roles like Virtual Machine Contributor don't include the \*/Read operation that is necessary for the Chaos Studio agent service to read the microsoft-agent target proxy resource on the virtual machine).
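
As an illustration of the Reader role prerequisite, the assignment can be made with the Azure CLI. This is a sketch under assumptions: `grant_experiment_reader` is a hypothetical helper name, and you need to look up the experiment identity's object ID and the VM's resource ID yourself.

```shell
# Hypothetical helper: grant the Reader role on a VM to the experiment's
# system-assigned managed identity.
# $1: object (principal) ID of the experiment's managed identity.
# $2: full Azure resource ID of the target virtual machine.
grant_experiment_reader() {
  az role assignment create \
    --assignee-object-id "$1" \
    --assignee-principal-type ServicePrincipal \
    --role Reader \
    --scope "$2"
}

# Usage (both values are placeholders):
#   grant_experiment_reader <experiment-principal-id> <vm-resource-id>
```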
56+
57+
### Chaos agent won't install on Virtual Machine Scale Sets
58+
59+
Installing the Chaos agent on Virtual Machine Scale Sets may fail without showing an error if the Virtual Machine Scale Sets upgrade policy is set to **Manual**. To check the Virtual Machine Scale Sets upgrade policy:
60+
61+
1. Log in to the Azure portal.
62+
1. Select **Virtual Machine Scale Set**.
63+
1. From the left pane menu, choose **Upgrade policy**.
64+
1. Check the **Upgrade mode** to see if it's set to **Manual - Existing instances must be manually upgraded**.
65+
66+
If the Upgrade policy is set to **Manual**, you must upgrade your Virtual Machine Scale Sets instances so that Chaos agent installation completes.
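
If you prefer the command line to the portal, the upgrade mode can also be read with `az vmss show`. A minimal sketch, assuming the Azure CLI is signed in; `vmss_upgrade_mode` is a hypothetical helper name and the resource group and scale set names are placeholders.

```shell
# Hypothetical helper: print the upgrade policy mode of a scale set
# (for example Manual, Automatic, or Rolling).
# $1: resource group name, $2: scale set name.
vmss_upgrade_mode() {
  az vmss show \
    --resource-group "$1" \
    --name "$2" \
    --query upgradePolicy.mode \
    --output tsv
}

# Usage:
#   vmss_upgrade_mode myResourceGroup myScaleSet
```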
67+
68+
#### Upgrade instances from Azure portal
69+
70+
You can upgrade your Virtual Machine Scale Sets instances from the Azure portal:
71+
72+
1. Log in to the Azure portal.
73+
1. Select **Virtual Machine Scale Set**.
74+
1. From the left pane menu, choose **Instances**.
75+
1. Select all instances and click **Upgrade**.
76+
77+
#### Upgrade instances with the Azure CLI
78+
79+
You can upgrade your Virtual Machine Scale Sets instances with the Azure CLI:
80+
81+
- From the Azure CLI, use `az vmss update-instances` to manually upgrade your instances:
82+
83+
```azurecli
84+
az vmss update-instances --resource-group myResourceGroup --name myScaleSet --instance-ids {instanceIds}
85+
```
86+
87+
For more information, see [How to bring VMs up-to-date with the latest scale set model](/azure/virtual-machine-scale-sets/virtual-machine-scale-sets-upgrade-scale-set#how-to-bring-vms-up-to-date-with-the-latest-scale-set-model).
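
To cover every instance without listing IDs one by one, `--instance-ids` also accepts `*`. The sketch below first lists which instances are still behind the scale set model, then upgrades all of them. The helper names are hypothetical and the resource names are placeholders.

```shell
# Hypothetical helper: show each instance ID and whether it already runs the
# latest scale set model.
# $1: resource group name, $2: scale set name.
vmss_instances_model_state() {
  az vmss list-instances \
    --resource-group "$1" \
    --name "$2" \
    --query "[].{instance:instanceId, latestModel:latestModelApplied}" \
    --output table
}

# Hypothetical helper: manually upgrade every instance in the scale set.
# The quoted '*' is passed to the Azure CLI literally and means "all instances".
vmss_upgrade_all_instances() {
  az vmss update-instances \
    --resource-group "$1" \
    --name "$2" \
    --instance-ids '*'
}

# Usage:
#   vmss_instances_model_state myResourceGroup myScaleSet
#   vmss_upgrade_all_instances myResourceGroup myScaleSet
```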
5688
5789
### AKS Chaos Mesh faults fail
58-
AKS Chaos Mesh faults may fail for a variety of reasons related to missing prerequisites:
90+
AKS Chaos Mesh faults may fail for various reasons related to missing prerequisites:
5991
* Chaos Mesh must first be installed on the AKS cluster before using the AKS Chaos Mesh faults. Instructions can be found in the [Chaos Mesh faults on AKS tutorial](chaos-studio-tutorial-aks-portal.md#set-up-chaos-mesh-on-your-aks-cluster).
6092
* Chaos Mesh must be version 2.0.4 or greater. You can get the Chaos Mesh version by connecting to your AKS cluster and running `helm version chaos-mesh`.
61-
* Chaos Mesh must be installed with the namespace `chaos-testing`. Other namespace names for Chaos Mesh are not supported.
93+
* Chaos Mesh must be installed with the namespace `chaos-testing`. Other namespace names for Chaos Mesh aren't supported.
6294
* The Azure Kubernetes Service Cluster Admin role must be assigned to the system-assigned managed identity for the chaos experiment.
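
One way to check the Chaos Mesh prerequisites from a shell connected to the cluster is sketched below. It assumes Chaos Mesh was installed with Helm under a release whose name contains `chaos-mesh`; the helper names are hypothetical. In the `helm list` output, the APP VERSION column typically reflects the installed Chaos Mesh version.

```shell
# Hypothetical helper: list Helm releases matching "chaos-mesh" in the
# chaos-testing namespace, including chart and app version columns.
chaos_mesh_release() {
  helm list --namespace chaos-testing --filter chaos-mesh
}

# Hypothetical helper: confirm the Chaos Mesh pods exist and are running
# in the chaos-testing namespace.
chaos_mesh_pods() {
  kubectl get pods --namespace chaos-testing
}

# Usage:
#   chaos_mesh_release
#   chaos_mesh_pods
```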
6395
6496
## Issues creating or designing an experiment
6597
66-
### When adding a fault, my resource does not show in the Target Resources list
67-
When adding a fault, if you do not see the resource you want to target with a fault in the Target Resources list, it may be due to any of the following issues:
98+
### My resource doesn't show in the Target Resources list when I add a fault
99+
When you add a fault, if you don't see the resource you want to target with a fault in the Target Resources list, it may be due to any of the following issues:
68100
* The **Subscription** filter is set to exclude the subscription in which your target is deployed. Click on the subscription filter and modify the selected subscriptions.
69101
* The resource hasn't been onboarded yet. Visit the **Targets** view and enable the target. After this completes, you need to close the Add Fault pane and reopen it to see an updated target list.
70102
* The resource hasn't been enabled for the target type of that fault yet. Consult the [fault library](chaos-studio-fault-library.md) to see which target type is used for the fault, then visit the **Targets** view and enable that target type (either agent-based for microsoft-agent faults or service-direct for all other target types). After this completes, you need to close the Add Fault pane and reopen it to see an updated target list.
71-
* The resource doesn't have the capability for that fault enabled yet. Consult the [fault library](chaos-studio-fault-library.md) to see the capability name for the fault, then visit the **Targets** view and click **Manage actions** on the target resource. Check the box for the capability that corresponds to the fault you are trying to run and click **Save**. After this completes, you need to close the Add Fault pane and reopen it to see an updated target list.
103+
* The resource doesn't have the capability for that fault enabled yet. Consult the [fault library](chaos-studio-fault-library.md) to see the capability name for the fault, then visit the **Targets** view and click **Manage actions** on the target resource. Check the box for the capability that corresponds to the fault you're trying to run and click **Save**. After this completes, you need to close the Add Fault pane and reopen it to see an updated target list.
72104
* The resource has just recently been onboarded and hasn't appeared in Azure Resource Graph yet. The Target Resources list is queried from Azure Resource Graph, and after enabling a new target it can take up to five minutes for the update to propagate to Azure Resource Graph. Wait a few minutes, then reopen the Add Fault pane.
73105
74-
### When creating an experiment, I get the error `The microsoft:agent provider requires a managed identity`
106+
### I get the error `The microsoft:agent provider requires a managed identity` when creating an experiment
75107
76-
This error happens when the agent has not been deployed to your virtual machine. For installation instructions, see [Create and run an experiment that uses agent-based faults](chaos-studio-tutorial-agent-based-portal.md).
108+
This error happens when the agent hasn't been deployed to your virtual machine. For installation instructions, see [Create and run an experiment that uses agent-based faults](chaos-studio-tutorial-agent-based-portal.md).
77109
78110
### When creating an experiment, I get the error `The content media type '<null>' is not supported. Only 'application/json' is supported.`
79111
80-
You may encounter this error if you are creating your experiment using an ARM template or the Chaos Studio REST API. The error indicates that there is malformed JSON in your experiment definition. Check to see if you have any syntax errors, such as mismatched braces or brackets ({} and \[\]), using a JSON linter like Visual Studio Code.
112+
You may encounter this error if you're creating your experiment using an ARM template or the Chaos Studio REST API. The error indicates that there's malformed JSON in your experiment definition. Check to see if you have any syntax errors, such as mismatched braces or brackets ({} and \[\]), using a JSON linter like Visual Studio Code.
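
If you want a quick syntax check before submitting the definition, any JSON parser will do. The sketch below uses Python's built-in `json.tool` module as one option; it assumes `python3` is available, and `validate_experiment_json` and the file name are hypothetical. It only checks JSON syntax, not whether the experiment schema is valid.

```shell
# Hypothetical helper: exit with status 0 if the file parses as JSON.
# On malformed input, the parser's error message (with line and column)
# is written to stderr and the function returns a nonzero status.
# $1: path to the experiment definition file.
validate_experiment_json() {
  python3 -m json.tool "$1" > /dev/null
}

# Usage:
#   validate_experiment_json experiment.json && echo "JSON syntax OK"
```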
81113
82114
## Issues running an experiment
83115
@@ -89,6 +121,6 @@ From the **Experiments** list in the Azure portal, click on the experiment name
89121
90122
### My agent-based fault failed with error: Verify that the target is correctly onboarded and proper read permissions are provided to the experiment msi.
91123
92-
This may happen if you onboarded the agent using the Azure portal, which has a known issue: Enabling an agent-based target does not assign the user-assigned managed identity to the virtual machine or virtual machine scale set.
124+
This may happen if you onboarded the agent using the Azure portal, which has a known issue: Enabling an agent-based target doesn't assign the user-assigned managed identity to the virtual machine or Virtual Machine Scale Set.
93125
94-
To resolve this, navigate to the virtual machine or virtual machine scale set in the Azure portal, go to **Identity**, open the **User assigned** tab, and **Add** your user-assigned identity to the virtual machine. Once complete, you may need to reboot the virtual machine for the agent to connect.
126+
To resolve this, navigate to the virtual machine or Virtual Machine Scale Set in the Azure portal, go to **Identity**, open the **User assigned** tab, and **Add** your user-assigned identity to the virtual machine. Once complete, you may need to reboot the virtual machine for the agent to connect.
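
The same identity assignment can be done with the Azure CLI instead of the portal. A minimal sketch: the helper names are hypothetical, and the identity argument is the resource ID (or name, if it's in the same resource group) of the user-assigned managed identity you used when enabling the agent-based target. For a scale set with a **Manual** upgrade policy, existing instances may also need to be upgraded before they pick up the change.

```shell
# Hypothetical helper: add a user-assigned managed identity to a VM.
# $1: resource group name, $2: VM name, $3: identity name or resource ID.
assign_chaos_identity_vm() {
  az vm identity assign \
    --resource-group "$1" \
    --name "$2" \
    --identities "$3"
}

# Hypothetical helper: add a user-assigned managed identity to a scale set.
# $1: resource group name, $2: scale set name, $3: identity name or resource ID.
assign_chaos_identity_vmss() {
  az vmss identity assign \
    --resource-group "$1" \
    --name "$2" \
    --identities "$3"
}

# Usage:
#   assign_chaos_identity_vm myResourceGroup myVM <identity-resource-id>
#   assign_chaos_identity_vmss myResourceGroup myScaleSet <identity-resource-id>
```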