Commit 917e076

Merge pull request #210776 from EdB-MSFT/autoscale-flapping
New Article: Flapping in autoscale
2 parents 476414f + 7da8e00 commit 917e076

10 files changed

+181
-35
lines changed

articles/azure-monitor/autoscale/autoscale-best-practices.md

Lines changed: 3 additions & 34 deletions
Original file line numberDiff line numberDiff line change
@@ -1,8 +1,8 @@
11
---
22
title: Best practices for autoscale
3-
description: Autoscale patterns in Azure for Web Apps, Virtual Machine Scale sets, and Cloud Services
3+
description: Autoscale patterns in Azure for Web Apps, virtual machine scale sets, and Cloud Services
44
ms.topic: conceptual
5-
ms.date: 04/22/2022
5+
ms.date: 09/13/2022
66
ms.subservice: autoscale
77
ms.reviewer: riroloff
88
---
@@ -42,39 +42,7 @@ In this example, you can have a situation in which the memory usage is over 90%
4242
### Choose the appropriate statistic for your diagnostics metric
4343
For diagnostics metrics, you can choose among *Average*, *Minimum*, *Maximum* and *Total* as a metric to scale by. The most common statistic is *Average*.
4444

45-
### Choose the thresholds carefully for all metric types
46-
We recommend carefully choosing different thresholds for scale-out and scale-in based on practical situations.
4745

48-
We *do not recommend* autoscale settings like the examples below with the same or similar threshold values for out and in conditions:
49-
50-
* Increase instances by 1 count when Thread Count >= 600
51-
* Decrease instances by 1 count when Thread Count <= 600
52-
53-
Let's look at an example of what can lead to a behavior that may seem confusing. Consider the following sequence.
54-
55-
1. Assume there are two instances to begin with and then the average number of threads per instance grows to 625.
56-
2. Autoscale scales out adding a third instance.
57-
3. Next, assume that the average thread count across instance falls to 575.
58-
4. Before scaling down, autoscale tries to estimate what the final state will be if it scaled in. For example, 575 x 3 (current instance count) = 1,725 / 2 (final number of instances when scaled down) = 862.5 threads. This means autoscale would have to immediately scale out again even after it scaled in, if the average thread count remains the same or even falls only a small amount. However, if it scaled up again, the whole process would repeat, leading to an infinite loop.
59-
5. To avoid this situation (termed "flapping"), autoscale does not scale down at all. Instead, it skips and reevaluates the condition again the next time the service's job executes. The flapping state can confuse many people because autoscale wouldn't appear to work when the average thread count was 575.
60-
61-
Estimation during a scale-in is intended to avoid "flapping" situations, where scale-in and scale-out actions continually go back and forth. Keep this behavior in mind when you choose the same thresholds for scale-out and in.
62-
63-
We recommend choosing an adequate margin between the scale-out and in thresholds. As an example, consider the following better rule combination.
64-
65-
* Increase instances by 1 count when CPU% >= 80
66-
* Decrease instances by 1 count when CPU% <= 60
67-
68-
In this case
69-
70-
1. Assume there are 2 instances to start with.
71-
2. If the average CPU% across instances goes to 80, autoscale scales out adding a third instance.
72-
3. Now assume that over time the CPU% falls to 60.
73-
4. Autoscale's scale-in rule estimates the final state if it were to scale-in. For example, 60 x 3 (current instance count) = 180 / 2 (final number of instances when scaled down) = 90. So autoscale does not scale-in because it would have to scale-out again immediately. Instead, it skips scaling down.
74-
5. The next time autoscale checks, the CPU continues to fall to 50. It estimates again - 50 x 3 instance = 150 / 2 instances = 75, which is below the scale-out threshold of 80, so it scales in successfully to 2 instances.
75-
76-
> [!NOTE]
77-
> If the autoscale engine detects flapping could occur as a result of scaling to the target number of instances, it will also try to scale to a different number of instances between the current count and the target count. If flapping does not occur within this range, autoscale will continue the scale operation with the new target.
7846

7947
### Considerations for scaling threshold values for special metrics
8048
For special metrics such as Storage or Service Bus Queue length metric, the threshold is the average number of messages available per current number of instances. Carefully choose the threshold value for this metric.
@@ -160,5 +128,6 @@ We recommend you do NOT explicit set your agent to only use TLS 1.2 unless absol
160128

161129

162130
## Next Steps
131+
- [Autoscale flapping](/azure/azure-monitor/autoscale/autoscale-flapping)
163132
- [Create an Activity Log Alert to monitor all autoscale engine operations on your subscription.](https://github.com/Azure/azure-quickstart-templates/tree/master/demos/monitor-autoscale-alert)
164133
- [Create an Activity Log Alert to monitor all failed autoscale scale in/scale out operations on your subscription](https://github.com/Azure/azure-quickstart-templates/tree/master/demos/monitor-autoscale-failed-alert)
Lines changed: 175 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,175 @@
1+
---
2+
title: Autoscale flapping
3+
description: "Flapping in Autoscale"
4+
author: EdB-MSFT
5+
ms.author: edbaynash
6+
ms.service: azure-monitor
7+
ms.subservice: autoscale
8+
ms.topic: conceptual
9+
ms.date: 09/13/2022
10+
ms.reviewer: akkumari
11+
12+
#Customer intent: As a cloud administrator, I want to understand flapping so that I can configure autoscale correctly.
13+
---
14+
15+
# Flapping in Autoscale
16+
17+
This article describes flapping in autoscale and how to avoid it.
18+
19+
Flapping refers to a loop condition that causes a series of opposing scale events. Flapping happens when a scale event triggers the opposite scale event.
20+
21+
Autoscale evaluates a pending scale-in action to see if it would cause flapping. In cases where flapping could occur, autoscale may skip the scale action and reevaluate at the next run, or autoscale may scale by less than the specified number of resource instances. The autoscale evaluation process occurs each time the autoscale engine runs, which is every 30 to 60 seconds, depending on the resource type.
22+
23+
To ensure adequate resources, checking for potential flapping doesn't occur for scale-out events. Autoscale will only defer a scale-in event to avoid flapping.
24+
25+
For example, let's assume the following rules:
26+
27+
* Scale out increasing by 1 instance when average CPU usage is above 50%.
28+
* Scale in decreasing the instance count by 1 instance when average CPU usage is lower than 30%.
29+
30+
In the table below at T0, when usage is at 56%, a scale-out action is triggered and results in 56% CPU usage across 2 instances. That gives an average of 28% for the scale set. As 28% is less than the scale-in threshold, autoscale should scale back in. Scaling in would return the scale set to 56% CPU usage, which triggers a scale-out action.
31+
32+
|Time| Instance count| CPU% |CPU% per instance| Scale event| Resulting instance count
33+
|---|---|---|---|---|---|
34+
T0|1|56%|56%|Scale out|2|
35+
T1|2|56%|28%|Scale in|1|
36+
T2|1|56%|56%|Scale out|2|
37+
T3|2|56%|28%|Scale in|1|
38+
39+
If left uncontrolled, there would be an ongoing series of scale events. However, in this situation, the autoscale engine will defer the scale-in event at *T1* and reevaluate during the next autoscale run. The scale-in happens only once the average CPU usage is below 30% and the estimated usage after scaling in would no longer exceed the 50% scale-out threshold.
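
The loop in the table can be reproduced with a minimal sketch of a scaler that applies the two rules with no flapping check. This is an illustration only, not the autoscale engine's implementation: the function name `naive_autoscale` and the assumption that total load stays constant and divides evenly across instances are hypothetical.

```python
def naive_autoscale(total_cpu, instances, steps, out_above=50.0, in_below=30.0):
    """Apply the scale-out and scale-in rules with no flapping check.

    Assumption (not from the article): total load is constant and is
    shared evenly, so per-instance CPU = total_cpu / instances.
    """
    history = []
    for t in range(steps):
        per_instance = total_cpu / instances
        if per_instance > out_above:
            event, result = "Scale out", instances + 1
        elif per_instance < in_below and instances > 1:
            event, result = "Scale in", instances - 1
        else:
            event, result = "No scale event", instances
        history.append((f"T{t}", instances, per_instance, event, result))
        instances = result
    return history


# Reproduce T0..T3 from the table: 56% load, starting with one instance.
for row in naive_autoscale(total_cpu=56.0, instances=1, steps=4):
    print(row)
```

Each printed row corresponds to one row of the table: the events alternate between scale-out and scale-in for as long as the load stays at 56%.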
40+
41+
Flapping is often caused by:
42+
43+
* Small or no margins between thresholds
44+
* Scaling by more than one instance
45+
* Scaling in and out using different metrics
46+
47+
## Small or no margins between thresholds
48+
49+
To avoid flapping, keep adequate margins between scaling thresholds.
50+
51+
For example, the following rules, which leave no margin between thresholds, cause flapping.
52+
53+
* Scale out when thread count >=600
54+
* Scale in when thread count < 600
55+
56+
:::image type="content" source="./media/autoscale-flapping/autoscale-flapping-example-2.png" alt-text="A screenshot showing autoscale rules with scale out when thread count is greater than or equal to 600 and scale in when thread count less than 600.":::
57+
58+
The table below shows a potential outcome of these autoscale rules:
59+
60+
|Time| Instance count| Thread count|Thread count per instance| Scale event| Resulting instance count
61+
|---|---|---|---|---|---|
62+
T0|2|1250|625|Scale out|3|
63+
T1|3|1250|417|Scale in|2|
64+
65+
* At time T0, there are two instances handling 1250 threads, or 625 threads per instance. Autoscale scales out to three instances.
66+
* Following the scale-out, at T1, we have the same 1250 threads, but with three instances, only 417 threads per instance. A scale-in event is triggered.
67+
* Before scaling in, autoscale evaluates what would happen if the scale-in event occurs. In this example, 1250 / 2 = 625, that is, 625 threads per instance. Autoscale would have to immediately scale out again after it scaled in. If it scaled out again, the process would repeat, leading to a flapping loop.
68+
* To avoid this situation, autoscale doesn't scale in. Autoscale skips the current scale event and reevaluates the rule in the next execution cycle.
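
The estimate described in the bullets above can be sketched as a small check. This is a hedged illustration under stated assumptions, not the engine's code: the helper name `scale_in_would_flap` is hypothetical, and the sketch assumes the total thread count stays the same after the scale-in.

```python
def scale_in_would_flap(total_threads, instances, step, scale_out_at):
    """Return True if the projected per-instance thread count after a
    scale-in would satisfy the scale-out rule (>= scale_out_at).

    Assumption (not from the article): the total thread count is
    unchanged by the scale-in and is shared evenly across instances.
    """
    remaining = instances - step
    if remaining < 1:
        raise ValueError("scale-in would leave no instances")
    projected = total_threads / remaining
    return projected >= scale_out_at


# T1 in the table: 1250 threads on 3 instances, scale-out rule at >= 600.
print(scale_in_would_flap(1250, 3, 1, 600))  # 1250 / 2 = 625, so True

# With a total of 1180 threads, the projection is 590, below 600.
print(scale_in_would_flap(1180, 3, 1, 600))  # False
```

When the check returns `True`, the behavior described above is to skip the scale-in and reevaluate in the next cycle.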
69+
70+
In this case, it looks like autoscale isn't working since no scale event takes place. Check the *Run history* tab on the autoscale setting page to see if there's any flapping.
71+
72+
:::image type="content" source="./media/autoscale-flapping/autoscale-flapping-run-history.png" alt-text="A screenshot showing the autoscale run history tab with records showing flapping." lightbox="./media/autoscale-flapping/autoscale-flapping-run-history.png":::
73+
74+
Setting an adequate margin between thresholds avoids the above scenario. For example,
75+
76+
* Scale out when thread count >=600
77+
* Scale in when thread count < 400
78+
79+
:::image type="content" source="./media/autoscale-flapping/autoscale-flapping-example-3.png" alt-text="A screenshot showing autoscale rules with scale out when thread count greater than or equal to 600 and scale in when thread count less than 400." lightbox="./media/autoscale-flapping/autoscale-flapping-example-3.png":::
80+
81+
If the scale-in thread count is 400, the total thread count would have to drop to below 1200 before a scale event would take place. See the table below.
82+
83+
|Time| Instance count| Thread count|Thread count per instance| Scale event| Resulting instance count
84+
|---|---|---|---|---|---|
85+
T0|2|1250|625|Scale out|3|
86+
T1|3|1250|417|no scale event|3|
87+
T2|3|1180|394|scale in|2|
88+
T3|2|1180|590|no scale event|2|
89+
90+
## Scaling by more than one instance
91+
92+
To avoid flapping when scaling in or out by more than one instance, autoscale may scale by less than the number of instances specified in the rule.
93+
94+
For example, the following rules can cause flapping:
95+
96+
* Scale out by 20 when the request count >=200 per instance.
97+
* OR when CPU > 70% per instance.
98+
* Scale in by 10 when the request count <=50 per instance.
99+
100+
:::image type="content" source="./media/autoscale-flapping/autoscale-flapping-example-1.png" alt-text="A screenshot showing an autoscale default scale condition with rules configured for the example." :::
101+
102+
The table below shows a potential outcome of these autoscale rules:
103+
104+
|Time|Number of instances|CPU |Request count| Scale event| Resulting instances|Comments|
105+
|---|---|---|---|---|---|---|
106+
|T0|30|65%|3000, or 100 per instance.|No scale event|30| |
107+
|T1|30|65%|1500, or 50 per instance.| Scale in by 3 instances |27|Scaling in by 10 would cause an estimated CPU rise above 70%, leading to a scale-out event.|
108+
109+
At time T0, the app is running with 30 instances, a total request count of 3000, and a CPU usage of 65% per instance.
110+
111+
At T1, when the request count drops to 1500 requests, or 50 requests per instance, autoscale will try to scale in by 10 instances to 20. However, autoscale estimates that the CPU load for 20 instances will be above 70%, causing a scale-out event.
112+
113+
To avoid flapping, the autoscale engine estimates the CPU usage for instance counts above 20 until it finds an instance count where all metrics are within the defined thresholds:
114+
115+
* Keep the CPU below 70%.
116+
* Keep the number of requests per instance above 50.
117+
* Reduce the number of instances below 30.
118+
119+
In this situation, autoscale may scale in by 3, from 30 to 27 instances, to satisfy the rules, even though the rule specifies a decrease of 10. A log message is written to the activity log with a description that includes *Scale down will occur with updated instance count to avoid flapping*.
120+
121+
If autoscale can't find a suitable number of instances, it will skip the scale-in event and reevaluate during the next cycle.
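
The search for a smaller scale-in step can be sketched as follows. This is a rough illustration, not the engine's algorithm: the function name `find_safe_scale_in` is hypothetical, and the proportional CPU estimate (CPU per instance rises in inverse proportion to the instance count) is an assumption, because the article doesn't say how the engine estimates CPU.

```python
def find_safe_scale_in(current, step, total_requests, cpu_now,
                       cpu_out_above=70.0, req_out_at=200.0):
    """Return the lowest instance count in [current - step, current) that
    is not projected to trigger either scale-out rule, or None to skip.

    Assumption (not from the article): per-instance CPU scales in inverse
    proportion to the instance count, and requests are shared evenly.
    """
    target = max(current - step, 1)
    for candidate in range(target, current):
        est_cpu = cpu_now * current / candidate   # assumed proportional model
        est_req = total_requests / candidate      # requests per instance
        if est_cpu <= cpu_out_above and est_req < req_out_at:
            return candidate
    return None


# T1 in the table: 30 instances, 65% CPU, 1500 requests, rule says "in by 10".
print(find_safe_scale_in(current=30, step=10, total_requests=1500, cpu_now=65.0))
```

With this simple proportional model the search stops at 28 instances rather than the 27 shown in the table. The table reflects the engine's own estimate, which this sketch doesn't attempt to reproduce. If no candidate passes, the function returns `None`, which corresponds to skipping the scale-in event.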
122+
123+
> [!NOTE]
124+
> If the autoscale engine detects that flapping could occur as a result of scaling to the target number of instances, it will also try to scale to a lower number of instances between the current count and the target count. If flapping does not occur within this range, autoscale will continue the scale operation with the new target.
125+
126+
## Log files
127+
128+
Find flapping in the activity log with the following query:
129+
130+
````KQL
131+
// Activity log, CategoryValue: Autoscale
132+
// Lists latest Autoscale operations from the activity log, with OperationNameValue == "Microsoft.Insights/AutoscaleSettings/Flapping/Action"
133+
AzureActivity
134+
| where CategoryValue == "Autoscale" and OperationNameValue == "Microsoft.Insights/AutoscaleSettings/Flapping/Action"
135+
| sort by TimeGenerated desc
136+
````
137+
138+
Below is an example of an activity log record for flapping:
139+
140+
:::image type="content" source="./media/autoscale-flapping/autoscale-flapping-log.png" alt-text="A screenshot showing a log record from a flapping event." lightbox="./media/autoscale-flapping/autoscale-flapping-log.png":::
141+
142+
````JSON
143+
{
144+
"eventCategory": "Autoscale",
145+
"eventName": "FlappingOccurred",
146+
"operationId": "ffd31c67-1438-47a5-bee4-1e3a102cf1c2",
147+
"eventProperties":
148+
"{"Description":"Scale down will occur with updated instance count to avoid flapping.
149+
Resource: '/subscriptions/d1234567-9876-a1b2-a2b1-123a567b9f8767/ resourcegroups/ed-rg-001/providers/Microsoft.Web/serverFarms/ ScaleableAppServicePlan'.
150+
Current instance count: '6',
151+
Intended new instance count: '1'.
152+
Actual new instance count: '4'",
153+
"ResourceName":"/subscriptions/d1234567-9876-a1b2-a2b1-123a567b9f8767/resourcegroups/ed-rg-001/providers/Microsoft.Web/serverFarms/ScaleableAppServicePlan",
154+
"OldInstancesCount":6,
155+
"NewInstancesCount":4,
156+
"ActiveAutoscaleProfile":{"Name":"Auto created scale condition",
157+
"Capacity":{"Minimum":"1","Maximum":"30","Default":"1"},
158+
"Rules":[{"MetricTrigger":{"Name":"Requests","Namespace":"microsoft.web/sites","Resource":"/subscriptions/d1234567-9876-a1b2-a2b1-123a567b9f8767/resourceGroups/ed-rg-001/providers/Microsoft.Web/sites/ScaleableWebApp1","ResourceLocation":"West Central US","TimeGrain":"PT1M","Statistic":"Average","TimeWindow":"PT1M","TimeAggregation":"Maximum","Operator":"GreaterThanOrEqual","Threshold":3.0,"Source":"/subscriptions/d1234567-9876-a1b2-a2b1-123a567b9f8767/resourceGroups/ed-rg-001/providers/Microsoft.Web/sites/ScaleableWebApp1","MetricType":"MDM","Dimensions":[],"DividePerInstance":true},"ScaleAction":{"Direction":"Increase","Type":"ChangeCount","Value":"10","Cooldown":"PT1M"}},{"MetricTrigger":{"Name":"Requests","Namespace":"microsoft.web/sites","Resource":"/subscriptions/d1234567-9876-a1b2-a2b1-123a567b9f8767/resourceGroups/ed-rg-001/providers/Microsoft.Web/sites/ScaleableWebApp1","ResourceLocation":"West Central US","TimeGrain":"PT1M","Statistic":"Max","TimeWindow":"PT1M","TimeAggregation":"Maximum","Operator":"LessThan","Threshold":3.0,"Source":"/subscriptions/d1234567-9876-a1b2-a2b1-123a567b9f8767/resourceGroups/ed-rg-001/providers/Microsoft.Web/sites/ScaleableWebApp1","MetricType":"MDM","Dimensions":[],"DividePerInstance":true},"ScaleAction":{"Direction":"Decrease","Type":"ChangeCount","Value":"5","Cooldown":"PT1M"}}]}}",
159+
"eventDataId": "b23ae911-55d0-4881-8684-fc74227b2ddb",
160+
"eventSubmissionTimestamp": "2022-09-13T07:20:41.1589076Z",
161+
"resource": "scaleableappserviceplan",
162+
"resourceGroup": "ED-RG-001",
163+
"resourceProviderValue": "MICROSOFT.WEB",
164+
"subscriptionId": "D1234567-9876-A1B2-A2B1-123A567B9F876",
165+
"activityStatusValue": "Succeeded"
166+
}
167+
````
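
As a hedged illustration of reading such a record, the sketch below pulls the three instance counts out of the `Description` text. It assumes the description keeps the wording shown in the sample above. The function name `parse_flapping_description` and the regular expression are hypothetical helpers, not part of any Azure SDK.

```python
import re

# Description text copied from the sample record above (resource path omitted).
DESCRIPTION = (
    "Scale down will occur with updated instance count to avoid flapping. "
    "Current instance count: '6', "
    "Intended new instance count: '1'. "
    "Actual new instance count: '4'"
)

# Assumes the three counts appear in this order and are quoted as in the sample.
PATTERN = re.compile(
    r"Current instance count: '(?P<current>\d+)'.*?"
    r"Intended new instance count: '(?P<intended>\d+)'.*?"
    r"Actual new instance count: '(?P<actual>\d+)'",
    re.DOTALL,
)


def parse_flapping_description(description):
    """Return the current, intended, and actual counts, or None if absent."""
    match = PATTERN.search(description)
    if match is None:
        return None
    return {name: int(value) for name, value in match.groupdict().items()}


print(parse_flapping_description(DESCRIPTION))
# {'current': 6, 'intended': 1, 'actual': 4}
```

A difference between the intended and actual counts shows that autoscale reduced the scale-in step to avoid flapping.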
168+
169+
## Next steps
170+
171+
To learn more about autoscale, see the following resources:
172+
173+
* [Overview of common autoscale patterns](/azure/azure-monitor/autoscale/autoscale-common-scale-patterns)
174+
* [Automatically scale a virtual machine scale set](/azure/virtual-machine-scale-sets/tutorial-autoscale-powershell)
175+
* [Use autoscale actions to send email and webhook alert notifications](/azure/azure-monitor/autoscale/autoscale-webhook-email)

articles/azure-monitor/autoscale/autoscale-get-started.md

Lines changed: 1 addition & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -120,7 +120,7 @@ You can make changes in JSON directly, if necessary. These changes will be refle
120120

121121
### Cool-down period effects
122122

123-
Autoscale uses a cool-down period to prevent "flapping," which is the rapid, repetitive up-and-down scaling of instances. For more information, see [Autoscale evaluation steps](autoscale-understanding-settings.md#autoscale-evaluation). For other valuable information on flapping and understanding how to monitor the autoscale engine, see [Autoscale best practices](autoscale-best-practices.md#choose-the-thresholds-carefully-for-all-metric-types) and [Troubleshooting autoscale](autoscale-troubleshoot.md), respectively.
123+
Autoscale uses a cool-down period to prevent "flapping," which is the rapid, repetitive up-and-down scaling of instances. For more information, see [Autoscale evaluation steps](autoscale-understanding-settings.md#autoscale-evaluation). For other valuable information on flapping and understanding how to monitor the autoscale engine, see [Flapping in Autoscale](autoscale-flapping.md) and [Troubleshooting autoscale](autoscale-troubleshoot.md), respectively.
124124

125125
## Route traffic to healthy instances (App Service)
126126

6 binary files changed (not shown): 39.2 KB, 33.9 KB, 11 KB, 94 KB, 48.2 KB, 99.8 KB

articles/azure-monitor/toc.yml

Lines changed: 2 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -612,6 +612,8 @@ items:
612612
href: autoscale/autoscale-predictive.md
613613
- name: Best practices
614614
href: autoscale/autoscale-best-practices.md
615+
- name: Flapping in autoscale
616+
href: autoscale/autoscale-flapping.md
615617
- name: Common metrics
616618
href: autoscale/autoscale-common-metrics.md
617619
- name: Common patterns
