You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
Copy file name to clipboardExpand all lines: azure-stack/operator/azure-stack-capacity-planning-compute.md
+33-32Lines changed: 33 additions & 32 deletions
Display the source diff
Display the rich diff
Original file line number
Diff line number
Diff line change
@@ -18,15 +18,15 @@ ms.lastreviewed: 03/02/2021
18
18
The [virtual machine (VM) sizes](../user/azure-stack-vm-sizes.md) supported on Azure Stack Hub are a subset of those supported on Azure. Azure imposes resource limits along many vectors to avoid overconsumption of resources (server local and service-level). Without imposing some limits on tenant consumption, the tenant experiences suffer when other tenants overconsume resources. For networking egress from the VM, there are bandwidth caps in place on Azure Stack Hub that match Azure limitations. For storage resources on Azure Stack Hub, storage IOPS limits avoid basic over consumption of resources by tenants for storage access.
19
19
20
20
> [!IMPORTANT]
21
-
> The [Azure Stack Hub Capacity Planner](https://aka.ms/azstackcapacityplanner) does not consider or guarantee IOPS performance. The administrator portal shows a warning alert when the total system memory consumption has reached 85%. This alert can be remediated by [adding additional capacity](azure-stack-add-scale-node.md), or by removing virtual machines that are no longer required.
21
+
> The [Azure Stack Hub Capacity Planner](https://aka.ms/azstackcapacityplanner) does not consider or guarantee IOPS performance. The administrator portal shows a warning alert when the total system memory consumption reaches 85%. This alert can be remediated by [adding more capacity](azure-stack-add-scale-node.md), or by removing virtual machines that are no longer required.
22
22
23
23
## VM placement
24
24
25
25
The Azure Stack Hub placement engine places tenant VMs across the available hosts.
26
26
27
27
Azure Stack Hub uses two considerations when placing VMs. One, is there enough memory on the host for that VM type? And two, are the VMs a part of an [availability set](/azure/virtual-machines/windows/manage-availability) or are they [virtual machine scale sets](/azure/virtual-machine-scale-sets/overview)?
28
28
29
-
To achieve high availability of a multi-VM production workload in Azure Stack Hub, virtual machines (VMs) are placed in an availability set that spreads them across multiple fault domains. A fault domain in an availability set is defined as a single node in the scale unit. Azure Stack Hub supports having an availability set with a maximum of three fault domains to be consistent with Azure. VMs placed in an availability set will be physically isolated from each other by spreading them as evenly as possible over multiple fault domains (Azure Stack Hub nodes). If there's a hardware failure, VMs from the failed fault domain are restarted in other fault domains. If possible, they're kept in separate fault domains from the other VMs in the same availability set. When the host comes back online, VMs are rebalanced to maintain high availability.
29
+
To achieve high availability of a multi-VM production workload in Azure Stack Hub, virtual machines (VMs) are placed in an availability set that spreads them across multiple fault domains. A fault domain in an availability set is defined as a single node in the scale unit. Azure Stack Hub supports having an availability set with a maximum of three fault domains to be consistent with Azure. VMs placed in an availability set are physically isolated from each other by spreading them as evenly as possible over multiple fault domains (Azure Stack Hub nodes). If there's a hardware failure, VMs from the failed fault domain are restarted in other fault domains. If possible, they're kept in separate fault domains from the other VMs in the same availability set. When the host comes back online, VMs are rebalanced to maintain high availability.
30
30
31
31
Virtual machine scale sets use availability sets on the back end and make sure each virtual machine scale set instance is placed in a different fault domain. This means they use separate Azure Stack Hub infrastructure nodes. For example, in a four-node Azure Stack Hub system, there might be a situation where a virtual machine scale set of three instances fails at creation due to the lack of the four-node capacity to place three virtual machine scale set instances on three separate Azure Stack Hub nodes. In addition, Azure Stack Hub nodes can be filled up at varying levels before trying placement.
32
32
@@ -36,7 +36,7 @@ Since placement algorithms don't look at the existing virtual to physical core o
36
36
37
37
## Consideration for total number of VMs
38
38
39
-
There is a limit on the total number of VMs that can be created. The maximum number of VMs on Azure Stack Hub is 700 and 60 per scale unit node. For example, an eight-server Azure Stack Hub VM limit would be 480 (8 * 60). For a 12 to 16 server Azure Stack Hub solution, the limit would be 700. This limit has been created keeping all the compute capacity considerations in mind, such as the resiliency reserve and the CPU virtual-to-physical ratio that an operator would like to maintain on the stamp.
39
+
There is a limit on the total number of VMs that can be created. The maximum number of VMs on Azure Stack Hub is 700 and 60 per scale unit node. For example, an eight-server Azure Stack Hub VM limit would be 480 (8 * 60). For a 12 to 16 server Azure Stack Hub solution, the limit would be 700. This limit waws created with all the compute capacity considerations in mind, such as the resiliency reserve and the CPU virtual-to-physical ratio that an operator would like to maintain on the stamp.
40
40
41
41
If the VM scale limit is reached, the following error codes are returned as a result: `VMsPerScaleUnitLimitExceeded`, `VMsPerScaleUnitNodeLimitExceeded`.
42
42
@@ -45,15 +45,15 @@ If the VM scale limit is reached, the following error codes are returned as a re
45
45
46
46
## Consideration for batch deployment of VMs
47
47
48
-
In releases prior to and including 2002, 2-5 VMs per batch with 5 mins gap in between batches provided reliable VM deployments to reach a scale of 700 VMs. With the 2005 version of Azure Stack Hub onwards, we are able to reliably provision VMs at batch sizes of 40 with 5 mins gap in between batch deployments. Start, Stop-deallocate, and update operations should be done at a batch size of 30, leaving 5 mins in between each batch.
48
+
In releases before and including 2002, 2-5 VMs per batch with 5 mins gap in between batches provided reliable VM deployments to reach a scale of 700 VMs. With the 2005 version of Azure Stack Hub onwards, we are able to reliably provision VMs at batch sizes of 40 with 5 mins gap in between batch deployments. Start, Stop-deallocate, and update operations should be done at a batch size of 30, leaving 5 mins in between each batch.
49
49
50
50
## Consideration for GPU VMs
51
51
52
52
Azure Stack hub reserves memory for the infrastructure and tenant VMs to failover. Unlike other VMs, GPU VMs run in a non-HA (high availability) mode and therefore do not failover. As a result, reserved memory for a GPU VM-only stamp is what is required by the infrastructure to failover, as opposed to accounting for HA tenant VM memory too.
53
53
54
54
## Azure Stack Hub memory
55
55
56
-
Azure Stack Hub is designed to keep VMs running that have been successfully provisioned. For example, if a host is offline because of a hardware failure, Azure Stack Hub will attempt to restart that VM on another host. A second example during patching and updating of the Azure Stack Hub software. If there's a need to reboot a physical host, an attempt is made to move the VMs executing on that host to another available host in the solution.
56
+
Azure Stack Hub is designed to keep VMs running that were successfully provisioned. For example, if a host is offline because of a hardware failure, Azure Stack Hub attempts to restart that VM on another host. A second example during patching and updating of the Azure Stack Hub software. If there's a need to reboot a physical host, an attempt is made to move the VMs executing on that host to another available host in the solution.
57
57
58
58
This VM management or movement can only be achieved if there's reserved memory capacity to allow for the restart or migration to occur. A portion of the total host memory is reserved and unavailable for tenant VM placement.
59
59
@@ -66,7 +66,7 @@ Used memory is made up of several components. The following components consume t
66
66
-**Host OS usage or reserve:** The memory used by the operating system (OS) on the host, virtual memory page tables, processes that are running on the host OS, and the Spaces Direct memory cache. Since this value is dependent on the memory used by the different Hyper-V processes running on the host, it can fluctuate.
67
67
-**Infrastructure services:** The infrastructure VMs that make up Azure Stack Hub. As discussed previously, these VMs are part of the 700 VM maximum. The memory utilization of the infrastructure services component may change as we work on making our infrastructure services more scalable and resilient. For more information see the [Azure Stack Hub capacity planner](https://aka.ms/azstackcapacityplanner)
68
68
-**Resiliency reserve:** Azure Stack Hub reserves a portion of the memory to allow for tenant availability during a single host failure as well as during patch and update to allow for successful live migration of VMs.
69
-
-**Tenant VMs:** The tenant VMs created by Azure Stack Hub users. In addition to running VMs, memory is consumed by any VMs that have landed on the fabric. This means that VMs in "Creating" or "Failed" state, or VMs shut down from within the guest, will consume memory. However, VMs that have been deallocated using the stop deallocated option from portal/powershell/cli won't consume memory from Azure Stack Hub.
69
+
-**Tenant VMs:** The tenant VMs created by Azure Stack Hub users. In addition to running VMs, memory is consumed by any VMs that have landed on the fabric. This means that VMs in "Creating" or "Failed" state, or VMs shut down from within the guest, consume memory. However, VMs that were deallocated using the stop deallocated option from portal/powershell/cli won't consume memory from Azure Stack Hub.
70
70
-**Value-add resource providers (RPs):** VMs deployed for the value-add RPs like SQL, MySQL, App Service, and so on.
71
71
72
72
The best way to understand memory consumption on the portal is to use the [Azure Stack Hub Capacity Planner](https://aka.ms/azstackcapacityplanner) to see the impact of various workloads. The following calculation is the same one used by the planner.
@@ -75,31 +75,32 @@ This calculation results in the total available memory that can be used for tena
75
75
76
76
Available memory for VM placement = total host memory - resiliency reserve - memory used by running tenant VMs - Azure Stack Hub Infrastructure Overhead <sup>1</sup>
77
77
78
-
* Total host memory = Sum of memory from all nodes
79
-
* Resiliency reserve = H + R * ((N-1) * H) + V * (N-2)
80
-
* Memory used by tenant VMs = Actual memory consumed by tenant workload, does not depend on HA configuration
> - R = The operating system reserve for OS overhead, which is .15 in this formula<sup>2</sup>
87
-
> - V = Largest HA VM in the scale unit
83
+
Where:
84
+
85
+
- H = Size of single server memory
86
+
- N = Size of Scale Unit (number of servers)
87
+
- R = The operating system reserve for OS overhead, which is .15 in this formula<sup>2</sup>
88
+
- V = Largest HA VM in the scale unit
88
89
89
90
<sup>1</sup> Azure Stack Hub Infrastructure overhead = 268 GB + (4 GB x # of nodes). Approximately 31 VMs are used to host Azure Stack Hub's infrastructure and, in total, consume about 268 GB + (4 GB x # of nodes) of memory and 146 virtual cores. The rationale for this number of VMs is to satisfy the needed service separation to meet security, scalability, servicing, and patching requirements. This internal service structure allows for the future introduction of new infrastructure services as they're developed.
90
91
91
-
<sup>2</sup> Operating system reserve for overhead = 15% (.15) of node memory. The operating system reserve value is an estimate and will vary based on the physical memory capacity of the server and general operating system overhead.
92
+
<sup>2</sup> Operating system reserve for overhead = 15% (.15) of node memory. The operating system reserve value is an estimate and varies based on the physical memory capacity of the server and general operating system overhead.
92
93
93
-
The value V, largest HA VM in the scale unit, is dynamically based on the largest tenant VM memory size. For example, the largest HA VM value is a minimum of 12 GB (accounting for the infrastructure VM) or 112 GB or any other supported VM memory size in the Azure Stack Hub solution. Changing the largest HA VM on the Azure Stack Hub fabric will result in an increase in the resiliency reserve and also to the increase in the memory of the VM itself. Remember that GPU VMs run in non-HA mode.
94
+
The value V, largest HA VM in the scale unit, is dynamically based on the largest tenant VM memory size. For example, the largest HA VM value is a minimum of 12 GB (accounting for the infrastructure VM) or 112 GB or any other supported VM memory size in the Azure Stack Hub solution. Changing the largest HA VM on the Azure Stack Hub fabric results in an increase in the resiliency reserve and also to the increase in the memory of the VM itself. Remember that GPU VMs run in non-HA mode.
94
95
95
96
### Sample calculation
96
97
97
-
We have a small four-node Azure Stack Hub deployment with 768 GB RAM on each node. We plan to place a virtual machine for SQL server with 128GB of RAM (Standard_E16_v3). What will be the available memory for VM placement?
98
+
We have a small four-node Azure Stack Hub deployment with 768 GB RAM on each node. We plan to place a virtual machine for SQL server with 128GB of RAM (Standard_E16_v3). What is the available memory for VM placement?
98
99
99
-
* Total host memory = Sum of memory from all nodes = 4 * 768 GB = 3072 GB
Available memory for VM placement = total host memory - resiliency reserve - memory used by running tenant VMs - Azure Stack Hub Infrastructure Overhead
105
106
@@ -126,43 +127,43 @@ There are three ways to deallocate memory for VM placement using the formula **R
126
127
127
128
### Reduce the size of the largest VM
128
129
129
-
Reducing the size of the largest VM to the next smallest VM in stamp (24 GB) will reduce the size of the resiliency reserve.
130
+
Reducing the size of the largest VM to the next smallest VM in stamp (24 GB) reduces the size of the resiliency reserve.
130
131
131
132

132
-
133
+
133
134
Resiliency reserve = 384 + 172.8 + 48 = 604.8 GB
134
-
135
+
135
136
| Total memory | Infra GB | Tenant GB | Resiliency reserve | Total memory reserved | Total GB available for placement |
**Q**: What state do tenant VMs have to be in to consume memory?
194
195
195
-
**A**: In addition to running VMs, memory is consumed by any VMs that have landed on the fabric. This means that VMs that are in a "Creating" or "Failed" state will consume memory. VMs shut down from within the guest as opposed to stop deallocated from portal/powershell/cli will also consume memory.
196
+
**A**: In addition to running VMs, memory is consumed by any VMs that have landed on the fabric. This means that VMs that are in a "Creating" or "Failed" state consume memory. VMs shut down from within the guest as opposed to stop deallocated from portal/powershell/cli also consume memory.
196
197
197
198
**Q**: I have a four-host Azure Stack Hub. My tenant has 3 VMs that consume 56 GB of RAM (D5_v2) each. One of the VMs is resized to 112 GB RAM (D14_v2), and available memory reporting on dashboard resulted in a spike of 168 GB usage on the capacity blade. Subsequent resizing of the other two D5_v2 VMs to D14_v2 resulted in only 56 GB of RAM increase each. Why is this so?
0 commit comments