You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
Copy file name to clipboardExpand all lines: articles/azure-monitor/essentials/metrics-aggregation-explained.md
+24-21Lines changed: 24 additions & 21 deletions
Display the source diff
Display the rich diff
Original file line number
Diff line number
Diff line change
@@ -13,7 +13,7 @@ ms.reviewer: vitalyg
13
13
14
14
This article explains the aggregation of metrics in the Azure Monitor time-series database that back Azure Monitor [platform metrics](../data-platform.md) and [custom metrics](../essentials/metrics-custom-overview.md). This article also applies to standard [Application Insights metrics](../app/app-insights-overview.md).
15
15
16
-
This is a complex topic and not necessary to understand all the information in this article to use Azure Monitor metrics effectively.
16
+
The contents of this article are complex in nature and not necessary to understand to use Azure Monitor metrics effectively.
17
17
18
18
## Overview and terms
19
19
@@ -36,7 +36,7 @@ Metrics are a series of values stored with a time-stamp. In Azure, most metrics
36
36
37
37
## Aggregation types
38
38
39
-
There are five basic aggregation types available in the metrics explorer. Metrics explorer hides the aggregations that are irrelevant and cannot be used for a given metric.
39
+
There are five basic aggregation types available in the metrics explorer. Metrics explorer hides the aggregations that are irrelevant and can't be used for a given metric.
40
40
41
41
-**Sum** – the sum of all values captured over the aggregation interval. Sometimes referred to as the Total aggregation.
42
42
-**Count** – the number of measurements captured over the aggregation interval. Count doesn't look at the value of the measurement, only the number of records.
@@ -71,35 +71,38 @@ For time granularity of 1 minute (the smallest possible on the chart), you get 2
71
71
72
72
:::image type="content" source="media/metrics-aggregation-explained/24-hour-1-min-gran.png" alt-text="Screenshot showing data on a line graph set to 24-hour time range and 1-minute time granularity" border="true" lightbox="media/metrics-aggregation-explained/24-hour-1-min-gran.png":::
73
73
74
-
The charts look different for these summations as shown in the previous screenshots. Notice how this VM has a lot of output in a small time period relative to the rest of the time window.
74
+
The charts look different for these summations as shown in the previous screenshots. Notice how this VM has numerous outputs in a small time period relative to the rest of the time window.
75
75
76
76
The time granularity allows you to adjust the "signal-to-noise" ratio on a chart. Higher aggregations remove noise and smooth out spikes. Notice the variations at the bottom 1-minute chart and how they smooth out as you go to higher granularity values.
77
77
78
-
This smoothing behavior is important when you send this data to other systems--for example, alerts. Typically, you usually don't want to be alerted by very short spikes in CPU time over 90%. But if the CPU stays at 90% for 5 minutes, that's likely important. If you set up an alert rule on CPU (or any metric), making the time granularity larger can reduce the number of false alerts you receive.
78
+
This smoothing behavior is important when you send this data to other systems--for example, alerts. Typically, you usually don't want to be alerted by short spikes in CPU time over 90%. But if the CPU stays at 90% for 5 minutes, that's likely important. If you set up an alert rule on CPU (or any metric), making the time granularity larger can reduce the number of false alerts you receive.
79
79
80
-
It is important to establish what's "normal" for your workload to know what time interval is best. This is one of the benefits of [dynamic alerts](../alerts/alerts-dynamic-thresholds.md), which is a different topic not covered here.
80
+
It's important to establish what's "normal" for your workload to know what time interval is best. This is one of the benefits of [dynamic alerts](../alerts/alerts-dynamic-thresholds.md), which is a different topic not covered here.
81
81
82
82
## How the system collects metrics
83
83
84
84
Data collection varies by metric.
85
85
86
+
> [!NOTE]
87
+
> The examples below are simplified for illustration, and the actual metric data included in each aggregation is affected by the data available when the evaluation occurs.
88
+
86
89
### Measurement collection frequency
87
90
88
91
There are two types of collection periods.
89
92
90
-
-**Regular** - The metric is gathered at a consistent time interval that does not vary.
93
+
-**Regular** - The metric is gathered at a consistent time interval that doesn't vary.
91
94
92
-
-**Activity-based** - The metric is gathered based on when a transaction of a certain type occurs. Each transaction has a metric entry and a time stamp. They are not gathered at regular intervals so there are a varying number of records over a given time period.
95
+
-**Activity-based** - The metric is gathered based on when a transaction of a certain type occurs. Each transaction has a metric entry and a time stamp. They aren't gathered at regular intervals so there are a varying number of records over a given time period.
93
96
94
97
### Granularity
95
98
96
99
The minimum time granularity is 1 minute, but the underlying system may capture data faster depending on the metric. For example, CPU percentage for an Azure VM is captured at a time interval of 15 seconds. Because HTTP failures are tracked as transactions, they can easily exceed many more than one a minute. Other metrics such as SQL Storage are captured at a time interval of every 20 minutes. This choice is up to the individual resource provider and type. Most try to provide the smallest time interval possible.
97
100
98
101
### Dimensions, splitting, and filtering
99
102
100
-
Metrics are captured for each individual resource. However, the level at which the metrics are collected, stored, and able to be charted may vary. This level is represented by additional metrics available in **metrics dimensions**. Each individual resource provider gets to define how detailed the data they collect is. Azure Monitor only defines how such detail should be presented and stored.
103
+
Metrics are captured for each individual resource. However, the level at which the metrics are collected, stored, and able to be charted may vary. This level is represented by other metrics available in **metrics dimensions**. Each individual resource provider gets to define how detailed the data they collect is. Azure Monitor only defines how such detail should be presented and stored.
101
104
102
-
When you chart a metric in metric explorer, you have the option to "split" the chart by a dimension. Splitting a chart means that you are looking into the underlying data for more detail and seeing that data charted or filtered in metric explorer.
105
+
When you chart a metric in metric explorer, you have the option to "split" the chart by a dimension. Splitting a chart means that you're looking into the underlying data for more detail and seeing that data charted or filtered in metric explorer.
103
106
104
107
For example, [Microsoft.ApiManagement/service](./metrics-supported.md#microsoftapimanagementservice) has *Location* as a dimension for many metrics.
105
108
@@ -111,13 +114,13 @@ For example, [Microsoft.ApiManagement/service](./metrics-supported.md#microsofta
111
114
112
115
Check the Azure Monitor [metrics supported](./metrics-supported.md) article for details on each metric and the dimensions available. In addition, the documentation for each resource provider and type may provide additional information on the dimensions and what they measure.
113
116
114
-
You can use splitting and filtering together to dig into a problem. Below is an example of a graphic showing the *Avg Disk Write Bytes* for a group of VMs in a resource group. We have a rollup of all the VMs with this metric, but we may want to dig into see which are actually responsible for the peaks around 6AM. Are they the same machine? How many machines are involved?
117
+
You can use splitting and filtering together to dig into a problem. Below is an example of a graphic showing the *Avg Disk Write Bytes* for a group of VMs in a resource group. We have a rollup of all the VMs with this metric, but we may want to dig into see which are responsible for the peaks around 6AM. Are they the same machine? How many machines are involved?
115
118
116
119
:::image type="content" source="media/metrics-aggregation-explained/total-disk write-bytes-all-VMs.png" alt-text="Screenshot showing total Disk Write Bytes for all virtual machines in Contoso Hotels resource group" border="true" lightbox="media/metrics-aggregation-explained/total-disk write-bytes-all-VMs.png":::
117
120
118
121
*Click on the images in this section to see larger versions.*
119
122
120
-
When we apply splitting, we can see the underlying data, but it's a bit of a mess. Turns out there are 20 VMs being aggregated into the chart above. In this case, we've used our mouse to hover over the large peak at 6AM that tells us that CH-DCVM11 is the cause. but it's hard to see the rest of the data associated with that VM because of other VMs cluttering the chart.
123
+
When we apply splitting, we can see the underlying data, but it's a bit of a mess. Turns out there are 20 VMs being aggregated into the chart above. In this case, we've used our mouse to hover over the large peak at 6AM that tells us that CH-DCVM11 is the cause. But it's hard to see the rest of the data associated with that VM because of other VMs cluttering the chart.
121
124
122
125
:::image type="content" source="media/metrics-aggregation-explained/split-total-disk write-bytes-all-VMs.png" alt-text="Screenshot showing Disk Write Bytes for all virtual machines in Contoso Hotels resource group split by virtual machine name" border="true" lightbox="media/metrics-aggregation-explained/split-total-disk write-bytes-all-VMs.png":::
123
126
@@ -129,25 +132,25 @@ For more information on how to show split dimension data on a metric explorer ch
129
132
130
133
### NULL and zero values
131
134
132
-
When the system expects metric data from a resource but doesn't receive it, it records a NULL value. NULL is different than a zero value, which becomes important in the calculation of aggregations and charting. NULL values are not counted as valid measurements.
135
+
When the system expects metric data from a resource but doesn't receive it, it records a NULL value. NULL is different than a zero value, which becomes important in the calculation of aggregations and charting. NULL values aren't counted as valid measurements.
133
136
134
-
NULLs show up differently on different charts. Scatter plots skip showing a dot on the chart. Bar charts skip showing the bar. On line charts, NULL can show up as [dotted or dashed lines](../essentials/metrics-troubleshoot.md#chart-shows-dashed-line) like those shown in the screenshot in the previous section. When calculating averages that include NULLs, there are fewer data points to take the average from. This behavior can sometimes result in an unexpected drop in values on a chart, though usually less so than if the value was converted to a zero and used as a valid datapoint.
137
+
NULLs show up differently on different charts. Scatter plots skip showing a dot on the chart. Bar charts skip showing the bar. On line charts, NULL can show up as [dotted or dashed lines](../essentials/metrics-troubleshoot.md#chart-shows-dashed-line) like those shown in the screenshot in the previous section. When calculating averages that include NULLs, there are fewer data points to take the average from. This behavior can sometimes result in an unexpected drop in values on a chart, though less so than if the value was converted to a zero and used as a valid datapoint.
135
138
136
139
[Custom metrics](../essentials/metrics-custom-overview.md) always use NULLs when no data is received. With [platform metrics](../data-platform.md), each resource provider decides whether to use zeros or NULLs based on what makes the most sense for a given metric.
137
140
138
141
Azure Monitor alerts use the values the resource provider writes to the metric database, so it's important to know how the resource provider handles NULLs by viewing the data first.
139
142
140
143
## How aggregation works
141
144
142
-
The metrics charts in the previous system show different types of aggregated data. The system pre-aggregates the data so that the requested charts can show quicker without a lot of repeated computation.
145
+
The metrics charts in the previous system show different types of aggregated data. The system pre-aggregates the data so that the requested charts can show quicker without many repeated computations.
143
146
144
147
In this example:
145
148
146
-
- We are collecting a **fictitious** transactional metric called **HTTP failures**
149
+
- We're collecting a **fictitious** transactional metric called **HTTP failures**
147
150
-*Server* is a dimension for the **HTTP failures** metric.
148
151
- We have 3 servers - Server A, B, and C.
149
152
150
-
To simplify the explanation, we'll start with the SUM aggregation type only.
153
+
To simplify the explanation, we start with the SUM aggregation type only.
151
154
152
155
### Sub minute to 1-minute aggregation
153
156
@@ -209,7 +212,7 @@ Each of those subminute streams would then be aggregated into 1-minute time-seri
209
212
210
213
In addition, the following collapsed aggregations would also be stored:
211
214
212
-
- Server A, Adapter 1 (because there is nothing to collapse, it would be stored again)
215
+
- Server A, Adapter 1 (because there's nothing to collapse, it would be stored again)
213
216
- Server B, Adapter 1+2
214
217
- Server C, Adapter 1+2+3
215
218
- Servers ALL, Adapters ALL
@@ -218,11 +221,11 @@ This shows that metrics with large numbers of dimensions have a larger number of
218
221
219
222
### Aggregation with no dimensions
220
223
221
-
Because this metric has a dimension *Server*, you can get to the underlying data for server A, B, and C above via splitting and filtering, as explained earlier in this article. If the metric didn't have *Server* as a dimension, you as a customer could only access the aggregated 1-minute sums shown in black on the diagram. That is, the values of 3, 6, 6, 9, etc. The system also would not do the underlying work to aggregate split values it would never use them in metric explorer or send them out via the metrics REST API.
224
+
Because this metric has a dimension *Server*, you can get to the underlying data for server A, B, and C above via splitting and filtering, as explained earlier in this article. If the metric didn't have *Server* as a dimension, you as a customer could only access the aggregated 1-minute sums shown in black on the diagram. That is, the values of 3, 6, 6, 9, etc. The system also wouldn't do the underlying work to aggregate split values it would never use them in metric explorer or send them out via the metrics REST API.
222
225
223
226
## Viewing time granularities above 1 minute
224
227
225
-
If you ask for metrics at a larger granularity, the system uses the 1-minute aggregated sums to calculate the sums for the larger time granularities. Below, dotted lines show the summation method for the 2-minute and 5-minute time granularities. Again, we are showing just the SUM aggregation type for simplicity.
228
+
If you ask for metrics at a larger granularity, the system uses the 1-minute aggregated sums to calculate the sums for the larger time granularities. Below, dotted lines show the summation method for the 2-minute and 5-minute time granularities. Again, we're showing just the SUM aggregation type for simplicity.
226
229
227
230
:::image type="content" source="media/metrics-aggregation-explained/1-minute-to-2-min-5-min.png" alt-text="Screenshot showing multiple 1-minute aggregated entries across dimension of server aggregated into 2-min and 5-min time periods." border="false":::
228
231
@@ -251,14 +254,14 @@ Below is the larger diagram for the above 1-minute aggregation process, with som
251
254
252
255
## More complex example
253
256
254
-
Following is a larger example using values for a fictitious metric called HTTP Response time in milliseconds. Here we introduce additional levels of complexity.
257
+
Following is a larger example using values for a fictitious metric called HTTP Response time in milliseconds. Here we introduce other levels of complexity.
255
258
256
259
1. We show aggregation for Sum, Count, Min, and Max and the calculation for Average.
257
260
2. We show NULL values and how they affect calculations.
258
261
259
262
Consider the following example. The boxes and arrows show examples of how the values are aggregated and calculated.
260
263
261
-
The same 1-minute preaggregation process as described in the previous section occurs for Sums, Count, Minimum, and Maximum. However, Average is NOT pre-aggregated. It is recalculated using aggregated data to avoid calculation errors.
264
+
The same 1-minute preaggregation process as described in the previous section occurs for Sums, Count, Minimum, and Maximum. However, Average is NOT pre-aggregated. It's recalculated using aggregated data to avoid calculation errors.
262
265
263
266
:::image type="content" source="media/metrics-aggregation-explained/full-aggregation-example-all-types.png" alt-text="Screenshot showing complex example of aggregation and calculation of sum, count, min, max and average from 1 minute to 10 minutes." border="false" lightbox="media/metrics-aggregation-explained/full-aggregation-example-all-types.png":::
0 commit comments