Commit e8fe0f1

Merge pull request #192686 from shayoniseth/shseth/amatsg
AMA TSG
2 parents a4e401d + 4c57df7 commit e8fe0f1

8 files changed: +474 -1 lines changed

articles/azure-monitor/agents/azure-monitor-agent-manage.md

Lines changed: 1 addition & 1 deletion
@@ -32,7 +32,7 @@ The following prerequisites must be met prior to installing the Azure Monitor ag
 
 | Built-in Role | Scope(s) | Reason |
 |:---|:---|:---|
-| <ul><li>[Virtual Machine Contributor](../../role-based-access-control/built-in-roles.md#virtual-machine-contributor)</li><li>[Azure Connected Machine Resource Administrator](../../role-based-access-control/built-in-roles.md#azure-connected-machine-resource-administrator)</li></ul> | <ul><li>Virtual machines, virtual machine scale sets</li><li>Arc-enabled servers</li></ul> | To deploy the agent |
+| <ul><li>[Virtual Machine Contributor](../../role-based-access-control/built-in-roles.md#virtual-machine-contributor)</li><li>[Azure Connected Machine Resource Administrator](../../role-based-access-control/built-in-roles.md#azure-connected-machine-resource-administrator)</li></ul> | <ul><li>Virtual machines, scale sets</li><li>Arc-enabled servers</li></ul> | To deploy the agent |
 | Any role that includes the action *Microsoft.Resources/deployments/** | <ul><li>Subscription and/or</li><li>Resource group and/or </li></ul> | To deploy ARM templates |
 - For installing the agent on physical servers and virtual machines hosted *outside* of Azure (i.e. on-premises), you must [install the Azure Arc Connected Machine agent](../../azure-arc/servers/agent-overview.md) first (at no added cost)
 - [Managed system identity](../../active-directory/managed-identities-azure-resources/qs-configure-portal-windows-vm.md) must be enabled on Azure virtual machines. This is not required for Azure Arc-enabled servers. The system identity will be enabled automatically if the agent is installed via [creating and assigning a data collection rule using the Azure portal](data-collection-rule-azure-monitor-agent.md#create-rule-and-association-in-azure-portal).
Lines changed: 81 additions & 0 deletions
@@ -0,0 +1,81 @@
---
title: Rsyslog data not uploaded due to Full Disk space issue on AMA Linux Agent
description: Guidance for troubleshooting rsyslog issues on Linux virtual machines, scale sets with Azure Monitor agent and Data Collection Rules.
ms.topic: conceptual
author: shseth
ms.author: shseth
ms.date: 5/3/2022
ms.custom: references_region

---

# Rsyslog data not uploaded due to Full Disk space issue on AMA Linux Agent

## Symptom
**Syslog data is not uploading**: When inspecting the error logs at `/var/opt/microsoft/azuremonitoragent/log/mdsd.err`, you'll see entries about *Error while inserting item to Local persistent store…No space left on device* similar to the following snippet:

```
2021-11-23T18:15:10.9712760Z: Error while inserting item to Local persistent store syslog.error: IO error: No space left on device: While appending to file: /var/opt/microsoft/azuremonitoragent/events/syslog.error/000555.log: No space left on device
```

## Cause
Linux AMA buffers events to `/var/opt/microsoft/azuremonitoragent/events` prior to ingestion. On a default Linux AMA install, this directory will take ~650 MB of disk space at idle. The size on disk will increase when under sustained logging load. It will get cleaned up about every 60 seconds and will reduce back to ~650 MB when the load returns to idle.

### Confirming the issue of Full Disk
The `df` command shows almost no space available on `/dev/sda1`, as shown below:

```
$ df -h
Filesystem      Size  Used Avail Use% Mounted on
udev             63G     0   63G   0% /dev
tmpfs            13G  720K   13G   1% /run
/dev/sda1        29G   29G  481M  99% /
tmpfs            63G     0   63G   0% /dev/shm
tmpfs           5.0M     0  5.0M   0% /run/lock
tmpfs            63G     0   63G   0% /sys/fs/cgroup
/dev/sda15      105M  4.4M  100M   5% /boot/efi
/dev/sdb1       251G   61M  239G   1% /mnt
tmpfs            13G     0   13G   0% /run/user/1000
```
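
If you'd rather check this from a script, the following is a minimal Python sketch that uses the standard library's `shutil.disk_usage`. The helper names `usage_percent` and `is_nearly_full` and the 90% threshold are illustrative assumptions, not AMA settings.

```python
import shutil


def usage_percent(path: str) -> float:
    """Return the percentage of the filesystem containing `path` that is in use."""
    usage = shutil.disk_usage(path)
    # Can differ slightly from the Use% column of `df`, which accounts for
    # blocks reserved for the root user.
    return 100.0 * usage.used / usage.total


def is_nearly_full(path: str, threshold: float = 90.0) -> bool:
    """True when the filesystem containing `path` is at or above `threshold` percent used."""
    return usage_percent(path) >= threshold


if __name__ == "__main__":
    # "/" holds /var on the example machine above; adjust if /var is a separate mount.
    print(f"/ is {usage_percent('/'):.1f}% used; nearly full: {is_nearly_full('/')}")
```

On the example machine above, the root filesystem at 99% would be flagged.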

The `du` command can be used to inspect the disk to determine which files are causing the disk to be full. For example:

```
/var/log$ du -h syslog*
6.7G syslog
18G syslog.1
```

In some cases, `du` may not report any significantly large files or directories. A [file marked as (deleted) may be taking up the space](https://unix.stackexchange.com/questions/182077/best-way-to-free-disk-space-from-deleted-files-that-are-held-open). This issue can happen when a process has attempted to delete a file while another process still has the file open. The `lsof` command can be used to check for such files. In the example below, `/var/log/syslog` is marked as deleted but is taking up 3.6 GB of disk space. It hasn't been deleted because a process with PID 1484 still has the file open.

```
$ sudo lsof +L1
COMMAND   PID   USER  FD  TYPE DEVICE   SIZE/OFF NLINK  NODE NAME
none      849   root txt   REG    0,1       8632     0 16764 / (deleted)
rsyslogd 1484 syslog 14w   REG    8,1 3601566564     0 35280 /var/log/syslog (deleted)
```
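
To rank such files by size, the `lsof +L1` output can be parsed with a short script. This is a sketch only: the function name `deleted_open_files` is hypothetical, and the parser assumes every row has all ten columns populated as in the sample above, which real `lsof` output doesn't always guarantee.

```python
def deleted_open_files(lsof_output: str) -> list:
    """Parse `lsof +L1` text into (command, pid, size_bytes, name) tuples, largest first."""
    rows = []
    for line in lsof_output.strip().splitlines()[1:]:  # skip the header row
        fields = line.split(None, 9)  # NAME (10th column) may contain spaces
        if len(fields) < 10 or not fields[6].isdigit():
            continue  # skip rows without a numeric SIZE/OFF value
        rows.append((fields[0], int(fields[1]), int(fields[6]), fields[9]))
    return sorted(rows, key=lambda row: row[2], reverse=True)


# Sample copied from the lsof output shown above.
SAMPLE = """\
COMMAND   PID   USER  FD  TYPE DEVICE   SIZE/OFF NLINK  NODE NAME
none      849   root txt   REG    0,1       8632     0 16764 / (deleted)
rsyslogd 1484 syslog 14w   REG    8,1 3601566564     0 35280 /var/log/syslog (deleted)
"""

if __name__ == "__main__":
    for command, pid, size, name in deleted_open_files(SAMPLE):
        print(f"{size / 1e9:6.2f} GB  {command} (PID {pid})  {name}")
```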

### Issue: rsyslog default configuration logs all facilities to /var/log/syslog
On some popular distros (for example Ubuntu 18.04 LTS), rsyslog ships with a default configuration file (`/etc/rsyslog.d/50-default.conf`) which will log events from nearly all facilities to disk at `/var/log/syslog`.

AMA doesn't rely on syslog events being logged to `/var/log/syslog`. Instead, it configures rsyslog to forward events over a socket directly to the azuremonitoragent service process (mdsd).

#### Fix: Remove high-volume facilities from /etc/rsyslog.d/50-default.conf
If you're sending a high log volume through rsyslog, consider modifying the default rsyslog config to avoid logging these events to `/var/log/syslog`. The events for these facilities would still be forwarded to AMA because of the config in `/etc/rsyslog.d/10-azuremonitoragent.conf`.

1. For example, to stop local4 events from being logged at `/var/log/syslog`, change this line in `/etc/rsyslog.d/50-default.conf` from this:
    ```
    *.*;auth,authpriv.none -/var/log/syslog
    ```

    To this (add local4.none;):

    ```
    *.*;local4.none;auth,authpriv.none -/var/log/syslog
    ```
2. Restart rsyslog: `sudo systemctl restart rsyslog`
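
The selector change in step 1 can be expressed as a small string transformation. The sketch below is illustrative only: `exclude_facility` is a hypothetical helper that handles just the `*.*;` form of the line shown above and uses a plain substring check, so review any real config change by hand.

```python
def exclude_facility(selector_line: str, facility: str) -> str:
    """Insert `<facility>.none;` right after a leading `*.*;` selector, if not already present."""
    prefix = "*.*;"
    exclusion = f"{facility}.none"
    if not selector_line.startswith(prefix) or exclusion in selector_line:
        return selector_line  # leave unrelated or already-updated lines untouched
    return f"{prefix}{exclusion};{selector_line[len(prefix):]}"


if __name__ == "__main__":
    original = "*.*;auth,authpriv.none -/var/log/syslog"
    print(exclude_facility(original, "local4"))
```

Applying the helper twice leaves the line unchanged, so it's safe to rerun.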

### Issue: AMA Event Buffer is Filling Disk
If you observe the `/var/opt/microsoft/azuremonitoragent/events` directory growing unbounded (10 GB or higher) and not reducing in size, [file a ticket](#file-a-ticket) with **Summary** as 'AMA Event Buffer is filling disk' and **Problem type** as 'I need help configuring data collection from a VM'.

[!INCLUDE [azure-monitor-agent-file-a-ticket](../../../includes/azure-monitor-agent/azure-monitor-agent-file-a-ticket.md)]
Lines changed: 111 additions & 0 deletions
@@ -0,0 +1,111 @@
---
title: Troubleshoot the Azure Monitor agent on Linux virtual machines and scale sets
description: Guidance for troubleshooting issues on Linux virtual machines, scale sets with Azure Monitor agent and Data Collection Rules.
ms.topic: conceptual
author: shseth
ms.author: shseth
ms.date: 5/3/2022
ms.custom: references_region

---

# Troubleshooting guidance for the Azure Monitor agent on Linux virtual machines and scale sets

[!INCLUDE [azure-monitor-agent-architecture](../../../includes/azure-monitor-agent/azure-monitor-agent-architecture-include.md)]

## Basic troubleshooting steps
Follow the steps below to troubleshoot the latest version of the Azure Monitor agent running on your Linux virtual machine:

1. **Carefully review the [prerequisites here](./azure-monitor-agent-manage.md#prerequisites).**

2. **Verify that the extension was successfully installed and provisioned, which installs the agent binaries on your machine**:
    1. Open Azure portal > select your virtual machine > Open **Settings** : **Extensions + applications** blade from left menu > 'AzureMonitorLinuxAgent' should show up with Status: 'Provisioning succeeded'
    2. If you don't see the extension listed, check whether the machine can reach Azure and find the extension to install by using the command below:
        ```azurecli
        az vm extension image list-versions --location <machine-region> --name AzureMonitorLinuxAgent --publisher Microsoft.Azure.Monitor
        ```
    3. Wait for 10-15 minutes, as the extension may be in a transitioning status. If it still doesn't show up as above, [uninstall and install the extension](./azure-monitor-agent-manage.md) again.
    4. Check if you see any errors in extension logs located at `/var/log/azure/Microsoft.Azure.Monitor.AzureMonitorLinuxAgent/` on your machine
    5. If none of the above helps, [file a ticket](#file-a-ticket) with **Summary** as 'AMA extension fails to install or provision' and **Problem type** as 'I need help with Azure Monitor Linux Agent'.

3. **Verify that the agent is running**:
    1. Check if the agent is emitting heartbeat logs to Log Analytics workspace using the query below. Skip if 'Custom Metrics' is the only destination in the DCR:
        ```Kusto
        Heartbeat | where Category == "Azure Monitor Agent" and Computer == "<computer-name>" | take 10
        ```
    2. Check if the agent service is running
        ```
        systemctl status azuremonitoragent
        ```
    3. Check if you see any errors in core agent logs located at `/var/opt/microsoft/azuremonitoragent/log/mdsd.*` on your machine
    4. If none of the above helps, [file a ticket](#file-a-ticket) with **Summary** as 'AMA extension provisioned but not running' and **Problem type** as 'I need help with Azure Monitor Linux Agent'.

4. **Verify that the DCR exists and is associated with the virtual machine:**
    1. If using Log Analytics workspace as destination, verify that DCR exists in the same physical region as the Log Analytics workspace.
    2. Open Azure portal > select your data collection rule > Open **Configuration** : **Resources** blade from left menu > You should see the virtual machine listed here.
    3. If not listed, click 'Add' and select your virtual machine from the resource picker. Repeat across all DCRs.
    4. If none of the above helps, [file a ticket](#file-a-ticket) with **Summary** as 'DCR not found or associated' and **Problem type** as 'I need help configuring data collection from a VM'.

5. **Verify that agent was able to download the associated DCR(s) from AMCS service:**
    1. Check if you see the latest DCR downloaded at this location `/etc/opt/microsoft/azuremonitoragent/config-cache/configchunks/`
    2. If not, [file a ticket](#file-a-ticket) with **Summary** as 'AMA unable to download DCR config' and **Problem type** as 'I need help with Azure Monitor Linux Agent'.


## Issues collecting Performance counters

## Issues collecting Syslog
Here's how AMA collects syslog events:

- AMA installs an output configuration for the system syslog daemon during the installation process. The configuration file specifies the way events flow between the syslog daemon and AMA.
- For `rsyslog` (most Linux distributions), the configuration file is `/etc/rsyslog.d/10-azuremonitoragent.conf`. For `syslog-ng`, the configuration file is `/etc/syslog-ng/conf.d/azuremonitoragent.conf`.
- AMA listens to a UNIX domain socket to receive events from `rsyslog` / `syslog-ng`. The socket path for this communication is `/run/azuremonitoragent/default_syslog.socket`
- The syslog daemon will use queues when AMA ingestion is delayed, or when AMA isn't reachable.
- AMA ingests syslog events via the aforementioned socket and filters them based on facility / severity combination from DCR configuration in `/etc/opt/microsoft/azuremonitoragent/config-cache/configchunks/`. Any `facility` / `severity` not present in the DCR will be dropped.
- AMA attempts to parse events in accordance with **RFC3164** and **RFC5424**. Additionally, it knows how to parse the message formats listed [here](./azure-monitor-agent-overview.md#data-sources-and-destinations).
- AMA identifies the destination endpoint for Syslog events from the DCR configuration and attempts to upload the events.
  > [!NOTE]
  > AMA uses local persistence by default: all events received from `rsyslog` / `syslog-ng` are queued in `/var/opt/microsoft/azuremonitoragent/events` before being uploaded.

- The quality of service (QoS) file `/var/opt/microsoft/azuremonitoragent/log/mdsd.qos` provides CSV-format 15-minute aggregations of the processed events, including the number of syslog events processed in the given timeframe. **This file is useful in tracking Syslog event ingestion drops**.

For example, the below fragment shows that in the 15 minutes preceding 2022-02-28T19:55:23.5432920Z, the agent received 77 syslog events with facility daemon and level info and sent 77 of said events to the upload task. Additionally, the agent upload task received 77 and successfully uploaded all 77 of these daemon.info messages.

```
#Time: 2022-02-28T19:55:23.5432920Z
#Fields: Operation,Object,TotalCount,SuccessCount,Retries,AverageDuration,AverageSize,AverageDelay,TotalSize,TotalRowsRead,TotalRowsSent
...
MaRunTaskLocal,daemon.debug,15,15,0,60000,0,0,0,0,0
MaRunTaskLocal,daemon.info,15,15,0,60000,46.2,0,693,77,77
MaRunTaskLocal,daemon.notice,15,15,0,60000,0,0,0,0,0
MaRunTaskLocal,daemon.warning,15,15,0,60000,0,0,0,0,0
MaRunTaskLocal,daemon.error,15,15,0,60000,0,0,0,0,0
MaRunTaskLocal,daemon.critical,15,15,0,60000,0,0,0,0,0
MaRunTaskLocal,daemon.alert,15,15,0,60000,0,0,0,0,0
MaRunTaskLocal,daemon.emergency,15,15,0,60000,0,0,0,0,0
...
MaODSRequest,https://e73fd5e3-ea2b-4637-8da0-5c8144b670c8_LogManagement,15,15,0,455067,476.467,0,7147,77,77
```
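
Reading this file by eye gets tedious on a busy machine. The following Python sketch flags rows where fewer rows were sent than read. It assumes the column order given by the `#Fields` line above and treats `TotalRowsSent < TotalRowsRead` as a sign of a drop, which is an interpretation of the example rather than a documented rule. The function name `find_drops` and the `daemon.error` row in the sample are made up for illustration.

```python
import csv
import io

# Column order taken from the "#Fields:" line of the fragment above.
FIELDS = ["Operation", "Object", "TotalCount", "SuccessCount", "Retries",
          "AverageDuration", "AverageSize", "AverageDelay", "TotalSize",
          "TotalRowsRead", "TotalRowsSent"]


def find_drops(qos_text: str) -> list:
    """Return (operation, object, rows_read, rows_sent) for rows that sent fewer rows than they read."""
    data_lines = [
        line for line in qos_text.splitlines()
        if line.strip() and not line.startswith("#") and line.strip() != "..."
    ]
    drops = []
    reader = csv.DictReader(io.StringIO("\n".join(data_lines)), fieldnames=FIELDS)
    for row in reader:
        if row["TotalRowsRead"] is None or row["TotalRowsSent"] is None:
            continue  # skip short or malformed rows
        read, sent = int(row["TotalRowsRead"]), int(row["TotalRowsSent"])
        if sent < read:
            drops.append((row["Operation"], row["Object"], read, sent))
    return drops


# The daemon.error row is hypothetical, added to show what a drop would look like.
SAMPLE = """\
#Time: 2022-02-28T19:55:23.5432920Z
#Fields: Operation,Object,TotalCount,SuccessCount,Retries,AverageDuration,AverageSize,AverageDelay,TotalSize,TotalRowsRead,TotalRowsSent
...
MaRunTaskLocal,daemon.info,15,15,0,60000,46.2,0,693,77,77
MaRunTaskLocal,daemon.error,15,15,0,60000,0,0,0,10,4
"""

if __name__ == "__main__":
    for operation, obj, read, sent in find_drops(SAMPLE):
        print(f"{operation} {obj}: read {read}, sent {sent}")
```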

**Troubleshooting steps**
1. Review the [generic Linux AMA troubleshooting steps](#basic-troubleshooting-steps) first. If agent is emitting heartbeats, proceed to step 2.
2. The parsed configuration is stored at `/etc/opt/microsoft/azuremonitoragent/config-cache/configchunks/`. Check that Syslog collection is defined and the log destinations are the same as constructed in DCR UI / DCR JSON.
    1. If yes, proceed to step 3. If not, the issue is in the configuration workflow.
    2. Investigate `mdsd.err`, `mdsd.warn`, `mdsd.info` files under `/var/opt/microsoft/azuremonitoragent/log` for possible configuration errors.
    3. If none of the above helps, [file a ticket](#file-a-ticket) with **Summary** as 'Syslog DCR not available' and **Problem type** as 'I need help configuring data collection from a VM'.
3. Validate the layout of the Syslog collection workflow to ensure all necessary pieces are in place and accessible:
    1. For `rsyslog` users, ensure the `/etc/rsyslog.d/10-azuremonitoragent.conf` file is present, isn't empty, and is accessible by the `rsyslog` daemon (syslog user).
    2. For `syslog-ng` users, ensure the `/etc/syslog-ng/conf.d/azuremonitoragent.conf` file is present, isn't empty, and is accessible by the `syslog-ng` daemon (syslog user).
    3. Ensure the file `/run/azuremonitoragent/default_syslog.socket` exists and is accessible by `rsyslog` or `syslog-ng` respectively.
    4. Check for a corresponding drop in count of processed syslog events in `/var/opt/microsoft/azuremonitoragent/log/mdsd.qos`. If such a drop isn't indicated in the file, [file a ticket](#file-a-ticket) with **Summary** as 'Syslog data dropped in pipeline' and **Problem type** as 'I need help with Azure Monitor Linux Agent'.
    5. Check that the syslog daemon queue isn't overflowing, causing the upload to fail, by referring to the guidance here: [Rsyslog data not uploaded due to Full Disk space issue on AMA Linux Agent](./azure-monitor-agent-troubleshoot-linux-vm-rsyslog.md)
4. To debug syslog events ingestion further, you can append trace flag **-T 0x2002** at the end of **MDSD_OPTIONS** in the file `/etc/default/azuremonitoragent`, and restart the agent:
    ```
    export MDSD_OPTIONS="-A -c /etc/opt/microsoft/azuremonitoragent/mdsd.xml -d -r $MDSD_ROLE_PREFIX -S $MDSD_SPOOL_DIRECTORY/eh -L $MDSD_SPOOL_DIRECTORY/events -e $MDSD_LOG_DIR/mdsd.err -w $MDSD_LOG_DIR/mdsd.warn -o $MDSD_LOG_DIR/mdsd.info -T 0x2002"
    ```
5. After the issue is reproduced with the trace flag on, you'll find more debug information in `/var/opt/microsoft/azuremonitoragent/log/mdsd.info`. Inspect the file for the possible cause of syslog collection issue, such as parsing / processing / configuration / upload errors.
    > [!WARNING]
    > Be sure to remove the trace flag setting **-T 0x2002** after the debugging session, since it generates many trace statements that could fill up the disk more quickly or make visually parsing the log file difficult.
6. If none of the above helps, [file a ticket](#file-a-ticket) with **Summary** as 'AMA fails to collect syslog events' and **Problem type** as 'I need help with Azure Monitor Linux Agent'.
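
The layout checks in step 3 lend themselves to a small script. The sketch below is a rough aid, not part of AMA: `check_syslog_layout` is a hypothetical helper, and it tests access as the user running the script rather than as the syslog user, so run it under the appropriate account or treat the result as a first pass.

```python
import os
import stat


def check_syslog_layout(conf_path: str, socket_path: str) -> list:
    """Return a list of problems found; an empty list means the basic layout looks fine."""
    problems = []
    if not os.path.isfile(conf_path):
        problems.append(f"missing config file: {conf_path}")
    elif os.path.getsize(conf_path) == 0:
        problems.append(f"empty config file: {conf_path}")
    try:
        # The path must exist and be a UNIX domain socket.
        if not stat.S_ISSOCK(os.stat(socket_path).st_mode):
            problems.append(f"not a socket: {socket_path}")
    except OSError:
        problems.append(f"missing or inaccessible socket: {socket_path}")
    return problems


if __name__ == "__main__":
    # Paths for rsyslog, as listed in step 3; swap in the syslog-ng config path if needed.
    issues = check_syslog_layout(
        "/etc/rsyslog.d/10-azuremonitoragent.conf",
        "/run/azuremonitoragent/default_syslog.socket",
    )
    print("\n".join(issues) or "layout looks fine")
```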


[!INCLUDE [azure-monitor-agent-file-a-ticket](../../../includes/azure-monitor-agent/azure-monitor-agent-file-a-ticket.md)]
