Commit 1cab635

Merge pull request #267764 from bwren/dcr-monitor

DCR diagnostics

2 parents 61c76fd + b38dadd

6 files changed: +126 −0 lines changed
articles/azure-monitor/essentials/data-collection-monitor.md

Lines changed: 113 additions & 0 deletions
@@ -0,0 +1,113 @@
---
title: Monitor and troubleshoot DCR data collection in Azure Monitor
description: Configure log collection for monitoring and troubleshooting of DCR-based data collection in Azure Monitor.
ms.topic: conceptual
author: bwren
ms.author: bwren
ms.date: 03/01/2024
---

# Monitor and troubleshoot DCR data collection in Azure Monitor

This article describes the metrics and logs you can use to monitor performance and troubleshoot issues related to data collection in Azure Monitor. This telemetry is currently available for data collection scenarios defined by a [data collection rule (DCR)](./data-collection-rule-overview.md), such as Azure Monitor agent and the Logs ingestion API.
> [!IMPORTANT]
> This article only refers to data collection scenarios that use DCRs, including the following:
>
> - Logs collected using [Azure Monitor Agent (AMA)](../agents/agents-overview.md)
> - Logs ingested using the [Logs Ingestion API](../logs/logs-ingestion-api-overview.md)
> - Logs collected by other methods that use a [workspace transformation DCR](./data-collection-transformations.md#workspace-transformation-dcr)
>
> See the documentation for other scenarios for any monitoring and troubleshooting information that may be available.

DCR diagnostic features include metrics and error logs emitted during log processing. [DCR metrics](#dcr-metrics) provide information about the volume of data being ingested, the number and nature of any processing errors, and statistics related to data transformation. [DCR error logs](#dcr-error-logs) are generated any time data processing isn't successful and the data doesn't reach its destination.
## DCR error logs

Error logs are generated when data reaches the Azure Monitor ingestion pipeline but fails to reach its destination. Examples of error conditions include:

- Log delivery errors
- [Transformation](./data-collection-transformations.md) errors where the structure of the logs makes the transformation KQL invalid
- Logs Ingestion API calls:
  - with any HTTP response other than 200/202
  - with a payload containing malformed data
  - with a payload over any [ingestion limits](/azure/azure-monitor/service-limits#logs-ingestion-api)
  - throttled due to exceeding API call limits
To avoid excessive logging of persistent errors related to the same data flow, some errors are logged only a limited number of times each hour, followed by a summary error message. The error is then muted until the end of the hour. The number of times a given error is logged may vary depending on the region where the DCR is deployed.

Some log ingestion errors aren't logged because they can't be associated with a DCR. The following errors may not be logged:

- Failures caused by a malformed call URI (HTTP response code 404)
- Certain internal server errors (HTTP response code 500)
### Enable DCR error logs

DCR error logs are implemented as [resource logs](./resource-logs.md) in Azure Monitor. Enable log collection by creating a [diagnostic setting](./diagnostic-settings.md) for the DCR. Each DCR requires its own diagnostic setting. See [Create diagnostic settings in Azure Monitor](./create-diagnostic-settings.md) for the detailed process. Select the category **Log Errors** and **Send to Log Analytics workspace**. You can select the same workspace that's used by the DCR, or you can consolidate all of your error logs in a single workspace.

### Retrieve DCR error logs

Error logs are written to the [DCRLogErrors](/azure/azure-monitor/reference/tables/dcrlogerrors) table in the Log Analytics workspace you specified in the diagnostic setting. The following sample queries retrieve these logs in [Log Analytics](../logs/log-analytics-overview.md).
**Retrieve all error logs for a particular DCR**

```kusto
DCRLogErrors
| where _ResourceId == "/subscriptions/00000000-0000-0000-0000-000000000000/resourceGroups/my-resource-group/providers/microsoft.insights/datacollectionrules/my-dcr"
```

**Retrieve all error logs for a particular input stream in a particular DCR**

```kusto
DCRLogErrors
| where _ResourceId == "/subscriptions/00000000-0000-0000-0000-000000000000/resourceGroups/my-resource-group/providers/microsoft.insights/datacollectionrules/my-dcr"
| where InputStream == "Custom-MyTable_CL"
```
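Beyond retrieving raw rows, you can aggregate the error log to spot patterns across DCRs. The following query is a sketch that uses only the `TimeGenerated`, `InputStream`, and `_ResourceId` columns referenced in this article; adjust it for any additional columns in your table.

```kusto
// Count error-log entries per hour, per DCR and input stream,
// to highlight which data flows are failing and when.
DCRLogErrors
| summarize ErrorCount = count() by bin(TimeGenerated, 1h), _ResourceId, InputStream
| order by TimeGenerated desc
```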
## DCR metrics

DCR metrics are collected automatically for all DCRs, and you can analyze them using [metrics explorer](./analyze-metrics.md) like the platform metrics for other Azure resources. *Input stream* is included as a dimension, so if you have a DCR with multiple input streams, you can analyze each one by [filtering or splitting](./analyze-metrics.md#use-dimension-filters-and-splitting). Some metrics include other dimensions, as shown in the following table.

| Metric | Dimensions | Description |
|---|---|---|
| Logs Ingestion Bytes per Min | Input stream | Total number of bytes received per minute. |
| Logs Ingestion Requests per Min | Input stream<br>HTTP response code | Number of calls received per minute. |
| Logs Rows Dropped per Min | Input stream | Number of log rows dropped during processing per minute. This includes both rows dropped due to filtering criteria in the KQL transformation and rows dropped due to errors. |
| Logs Rows Received per Min | Input stream | Number of log rows received for processing per minute. |
| Logs Transformation Duration per Min | Input stream | Average KQL transformation runtime per minute, which represents the efficiency of the transformation code. Data flows with longer transformation run times can experience delays in data processing and greater data latency. |
| Logs Transformation Errors per Min | Input stream<br>Error type | Number of processing errors encountered per minute. |
## Troubleshooting common issues

If you're missing expected data in your Log Analytics workspace, follow these basic steps to troubleshoot the issue. These steps assume that you enabled DCR error logs as described earlier in this article.

- Check metrics such as `Logs Ingestion Bytes per Min` and `Logs Rows Received per Min` to ensure that the data is reaching Azure Monitor. If it isn't, check your data source to ensure that it's sending data as expected.
- Check `Logs Rows Dropped per Min` to see whether any rows are being dropped. This may not indicate an error, since the rows could be dropped by a transformation. If the number of rows dropped is the same as `Logs Rows Received per Min`, though, then no data is being ingested into the workspace.
- Check `Logs Transformation Errors per Min` to determine whether there are any errors from transformations applied to the incoming data. Errors could be due to changes in the data structure or in the transformation itself.
- Check `DCRLogErrors` for any ingestion errors that may have been logged. These entries can provide additional detail for identifying the root cause of the issue.
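To put the last step into practice, a query along these lines surfaces DCRs that began logging errors recently, which often correlates with a configuration or data-structure change. This is a sketch that assumes only the standard `TimeGenerated` and `_ResourceId` columns shown in this article.

```kusto
// List DCRs whose first logged error in the last day is recent,
// suggesting a newly introduced problem rather than an ongoing one.
DCRLogErrors
| where TimeGenerated > ago(1d)
| summarize FirstError = min(TimeGenerated), ErrorCount = count() by _ResourceId
| where FirstError > ago(1h)
```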
## Monitoring your log ingestion

The following signals can be useful for monitoring the health of your log collection with DCRs. Create alert rules to identify these conditions.

| Signal | Possible causes and actions |
|---|---|
| New entries in `DCRLogErrors` or a sudden change in `Logs Transformation Errors per Min`. | - Problems with Logs Ingestion API setup, such as authentication, access to the DCR or DCE, or call payload issues.<br>- Changes in data structure causing KQL transformation failures.<br>- Changes in data destination configuration causing data delivery failures. |
| Sudden change in `Logs Ingestion Bytes per Min`. | - Changes in the configuration of log ingestion on the client, including AMA settings.<br>- Changes in the structure of logs sent. |
| Sudden change in the ratio between `Logs Ingestion Bytes per Min` and `Logs Rows Received per Min`. | - Changes in the structure of logs sent. Examine the changes to make sure the data is properly processed by the KQL transformation. |
| Sudden change in `Logs Transformation Duration per Min`. | - Changes in the structure of logs affecting the efficiency of the log filtering criteria set in the KQL transformation. Examine the changes to make sure the data is properly processed by the KQL transformation. |
| `Logs Ingestion Requests per Min` or `Logs Ingestion Bytes per Min` approaching Logs Ingestion API service limits. | - Examine and optimize your DCR configuration to avoid throttling. |
## Alerts

Rather than reactively troubleshooting issues, create alert rules to be proactively notified when a potential error condition occurs. The following table provides examples of alert rules you can create to monitor your log ingestion.

| Condition | Alert details |
|:---|:---|
| Sudden change in rows dropped | Metric alert rule using a dynamic threshold for `Logs Rows Dropped per Min`. |
| Number of API calls approaching service limits | Metric alert rule using a static threshold for `Logs Ingestion Requests per Min`. Set the threshold near 12,000, which is the service limit for maximum requests per minute per DCR. |
| Error logs | Log query alert using `DCRLogErrors`. Use a **Table rows** measure and a **Threshold value** of **1** to be alerted whenever any errors are logged. |
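For the error logs condition above, the alert query can be as simple as selecting recent rows; the alert rule's **Table rows** measure and threshold value do the rest. A minimal sketch:

```kusto
// Candidate query for a log query alert rule: returns any error rows
// from the evaluation window. Pair with a "Table rows" measure and a
// threshold value of 1 in the alert rule.
DCRLogErrors
| where TimeGenerated > ago(5m)
```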
## Next steps

- [Read more about data collection rules.](./data-collection-rule-overview.md)
- [Read more about ingestion-time transformations.](./data-collection-transformations.md)

articles/azure-monitor/essentials/data-collection-transformations.md

Lines changed: 3 additions & 0 deletions

@@ -70,6 +70,9 @@ There are multiple methods to create transformations depending on the data colle
 | Transformation in workspace DCR | [Add workspace transformation to Azure Monitor Logs by using the Azure portal](../logs/tutorial-workspace-transformations-portal.md)<br>[Add workspace transformation to Azure Monitor Logs by using Resource Manager templates](../logs/tutorial-workspace-transformations-api.md)
 | Agent Transformations in a DCR | [Add transformation to Azure Monitor Log](../agents/azure-monitor-agent-transformation.md)

+## Monitor transformations
+See [Monitor and troubleshoot DCR data collection in Azure Monitor](data-collection-monitor.md) for details on the logs and metrics that monitor the health and performance of transformations, including identifying any errors that occur in the KQL and metrics that track transformation running duration.

 ## Cost for transformations
 While transformations themselves don't incur direct costs, the following scenarios can result in additional charges:

articles/azure-monitor/logs/data-collection-troubleshoot.md

Lines changed: 2 additions & 0 deletions

@@ -8,6 +8,8 @@ ms.date: 07/25/2023
 # Troubleshoot why data is no longer being collected in Azure Monitor
 This article provides guidance to detect when data collection in Azure Monitor stops and steps you can take to determine and correct the causes.

+> [!IMPORTANT]
+> If you're troubleshooting data collection for a scenario that uses a data collection rule (DCR) such as Azure Monitor agent or Logs ingestion API, see [Monitor and troubleshoot DCR data collection in Azure Monitor](../essentials/data-collection-monitor.md) for additional troubleshooting information.

 ## Data collection status
 When data collection in a Log Analytics workspace stops, an event with a type of **Operation** is created in the workspace. Run the following query to check whether you're reaching the daily limit and missing data:

articles/azure-monitor/logs/data-ingestion-time.md

Lines changed: 2 additions & 0 deletions

@@ -81,6 +81,8 @@ After log records are ingested into the Azure Monitor pipeline (as identified in
 Some solutions implement heavier algorithms to aggregate data and derive insights as data is streaming in. For example, Application Insights calculates application map data; Azure Network Performance Monitoring aggregates incoming data over 3-minute intervals, which effectively adds 3-minute latency.

+If the data collection includes an [ingestion-time transformation](../essentials/data-collection-transformations.md), it adds some latency to the pipeline. Use the [Logs Transformation Duration per Min](../essentials/data-collection-monitor.md) metric to monitor the efficiency of the transformation query.

 Another process that adds latency is the process that handles custom logs. In some cases, this process might add a few minutes of latency to logs that are collected from files by the agent.

 ### New custom data types provisioning

articles/azure-monitor/logs/monitor-workspace.md

Lines changed: 3 additions & 0 deletions

@@ -48,6 +48,9 @@ The following table describes the categories from the `_LogOperation` function.
 Ingestion operations are issues that occurred during data ingestion and include notification about reaching the Log Analytics workspace limits. Error conditions in this category might suggest data loss, so they're important to monitor. For service limits for Log Analytics workspaces, see [Azure Monitor service limits](../service-limits.md#log-analytics-workspaces).

+> [!IMPORTANT]
+> If you're troubleshooting data collection for a scenario that uses a data collection rule (DCR) such as Azure Monitor agent or Logs ingestion API, see [Monitor and troubleshoot DCR data collection in Azure Monitor](../essentials/data-collection-monitor.md) for additional troubleshooting information.

 #### Operation: Data collection stopped

 "Data collection stopped due to daily limit of free data reached. Ingestion status = OverQuota"

articles/azure-monitor/toc.yml

Lines changed: 3 additions & 0 deletions

@@ -614,6 +614,9 @@ items:
 - name: Structure
   displayName: Data collection rules
   href: essentials/data-collection-rule-structure.md
+- name: Monitor
+  displayName: Data collection rules
+  href: essentials/data-collection-monitor.md
 - name: Stream resource log data
   href: essentials/resource-logs.md
 - name: Stream activity log data
