
Commit febe9e4

Merge pull request #129560 from Blackmist/aks-autoscaler
Adding information on front-end scaling
2 parents 82121a8 + 28b9a6e

2 files changed: +97 -39 lines changed

articles/machine-learning/how-to-deploy-azure-kubernetes-service.md

Lines changed: 78 additions & 23 deletions
@@ -55,14 +55,49 @@ When deploying to Azure Kubernetes Service, you deploy to an AKS cluster that is
- If you want to deploy models to GPU or FPGA nodes (or any specific SKU), you must create a cluster with that specific SKU. Creating a secondary node pool in an existing cluster and deploying models in that node pool is not supported.
## Understand the deployment processes
The word "deployment" is used in both Kubernetes and Azure Machine Learning, and it has different meanings in the two contexts. In Kubernetes, a `Deployment` is a concrete entity, specified with a declarative YAML file. A Kubernetes `Deployment` has a defined lifecycle and concrete relationships to other Kubernetes entities such as `Pods` and `ReplicaSets`. You can learn about Kubernetes from docs and videos at [What is Kubernetes?](https://aka.ms/k8slearning).

In Azure Machine Learning, "deployment" is used in the more general sense of making your project resources available and cleaning them up. The steps that Azure Machine Learning considers part of deployment are:

1. Zipping the files in your project folder, ignoring those specified in `.amlignore` or `.gitignore`
1. Scaling up your compute cluster (relates to Kubernetes)
1. Building or downloading the Dockerfile to the compute node (relates to Kubernetes)
1. The system calculates a hash of:
    - The base image
    - Custom Docker steps (see [Deploy a model using a custom Docker base image](https://docs.microsoft.com/azure/machine-learning/how-to-deploy-custom-docker-image))
    - The conda definition YAML (see [Create & use software environments in Azure Machine Learning](https://docs.microsoft.com/azure/machine-learning/how-to-use-environments))
1. The system uses this hash as the key in a lookup of the workspace Azure Container Registry (ACR); see the sketch after this list
1. If the hash is not found there, the system looks for a match in the global ACR
1. If no match is found, the system builds a new image (which is cached and pushed to the workspace ACR)
1. Downloading your zipped project file to temporary storage on the compute node
1. Unzipping the project file
1. The compute node executing `python <entry script> <arguments>`
1. Saving logs, model files, and other files written to `./outputs` to the storage account associated with the workspace
1. Scaling down compute, including removing temporary storage (relates to Kubernetes)
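
To make the image-caching steps concrete, here is a minimal sketch of the hash-and-lookup idea. It is illustrative only, not the service's actual implementation: the helper name `environment_hash` and its inputs are hypothetical.

```python
import hashlib

def environment_hash(base_image: str, docker_steps: str, conda_yaml: str) -> str:
    # Combine the environment inputs into a single, stable cache key.
    payload = "\n".join([base_image, docker_steps, conda_yaml]).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()

# The service checks the workspace ACR for an image matching this key, then the
# global ACR, and only builds (and caches) a new image if neither has a match.
key = environment_hash(
    base_image="mcr.microsoft.com/azureml/base:latest",  # hypothetical image
    docker_steps="",
    conda_yaml="name: project_environment",
)
print(key)
```
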
### Azure ML router

The front-end component (azureml-fe) that routes incoming inference requests to deployed services automatically scales as needed. Scaling of azureml-fe is based on the AKS cluster purpose and size (number of nodes). The cluster purpose and nodes are configured when you [create or attach an AKS cluster](how-to-create-attach-kubernetes.md). There is one azureml-fe service per cluster, which may be running on multiple pods.

> [!IMPORTANT]
> When using a cluster configured as __dev-test__, the self-scaler is **disabled**.
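
The cluster purpose is chosen when the cluster is provisioned. As a minimal sketch, assuming an existing `Workspace` object `ws` (the cluster name is illustrative):

```python
from azureml.core.compute import AksCompute, ComputeTarget

# Provision an AKS cluster for dev-test; the azureml-fe self-scaler is disabled
# for this purpose. Use ClusterPurpose.FAST_PROD (the default) for production.
prov_config = AksCompute.provisioning_configuration(
    cluster_purpose=AksCompute.ClusterPurpose.DEV_TEST)

aks_target = ComputeTarget.create(workspace=ws,
                                  name="myaks",
                                  provisioning_configuration=prov_config)
aks_target.wait_for_completion(show_output=True)
```
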
Azureml-fe scales both up (vertically) to use more cores, and out (horizontally) to use more pods. Scale-up decisions are based on the time it takes to route incoming inference requests: if that time exceeds the threshold, a scale-up occurs. If the time to route incoming requests continues to exceed the threshold, a scale-out occurs.

When scaling down and in, CPU usage is the metric used. If the CPU usage threshold is met, the front end is first scaled down. If CPU usage then drops to the scale-in threshold, a scale-in operation happens. Scaling up and out occurs only if enough cluster resources are available.
## Deploy to AKS
To deploy a model to Azure Kubernetes Service, create a __deployment configuration__ that describes the compute resources needed, such as the number of cores and the amount of memory. You also need an __inference configuration__, which describes the environment needed to host the model and web service. For more information on creating the inference configuration, see [How and where to deploy models](how-to-deploy-and-where.md).
> [!NOTE]
> The number of models to be deployed is limited to 1,000 models per deployment (per container).
<a id="using-the-cli"></a>

# [Python](#tab/python)
```python
from azureml.core.webservice import AksWebservice, Webservice
# ...
```
@@ -86,7 +121,7 @@ For more information on the classes, methods, and parameters used in this exampl
* [Model.deploy](https://docs.microsoft.com/python/api/azureml-core/azureml.core.model.model?view=azure-ml-py#&preserve-view=truedeploy-workspace--name--models--inference-config-none--deployment-config-none--deployment-target-none--overwrite-false-)
* [Webservice.wait_for_deployment](https://docs.microsoft.com/python/api/azureml-core/azureml.core.webservice%28class%29?view=azure-ml-py#&preserve-view=truewait-for-deployment-show-output-false-)

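For context, here is a minimal end-to-end sketch, assuming an existing workspace `ws`, a registered model `model`, an `InferenceConfig` object `inference_config`, and an attached AKS target `aks_target` (all names illustrative):

```python
from azureml.core.model import Model
from azureml.core.webservice import AksWebservice

# Illustrative resource requests: 1 CPU core and 1 GB of memory per replica.
aks_config = AksWebservice.deploy_configuration(cpu_cores=1, memory_gb=1)

service = Model.deploy(workspace=ws,
                       name="myservice",
                       models=[model],
                       inference_config=inference_config,
                       deployment_config=aks_config,
                       deployment_target=aks_target)
service.wait_for_deployment(show_output=True)
print(service.state)
```
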
# [Azure CLI](#tab/azure-cli)
To deploy using the CLI, use the following command. Replace `myaks` with the name of the AKS compute target. Replace `mymodel:1` with the name and version of the registered model. Replace `myservice` with the name to give this service:

@@ -98,36 +133,57 @@ az ml model deploy -ct myaks -m mymodel:1 -n myservice -ic inferenceconfig.json
For more information, see the [az ml model deploy](https://docs.microsoft.com/cli/azure/ext/azure-cli-ml/ml/model?view=azure-cli-latest#ext-azure-cli-ml-az-ml-model-deploy) reference.
# [Visual Studio Code](#tab/visual-studio-code)
For information on using VS Code, see [deploy to AKS via the VS Code extension](tutorial-train-deploy-image-classification-model-vscode.md#deploy-the-model).
> [!IMPORTANT]
> Deploying through VS Code requires the AKS cluster to be created or attached to your workspace in advance.
---

### Autoscaling
The component that handles autoscaling for Azure ML model deployments is azureml-fe, which is a smart request router. Since all inference requests go through it, it has the necessary data to automatically scale the deployed model(s).
> [!IMPORTANT]
> * **Do not enable Kubernetes Horizontal Pod Autoscaler (HPA) for model deployments**. Doing so would cause the two auto-scaling components to compete with each other. Azureml-fe is designed to auto-scale models deployed by Azure ML, whereas HPA would have to guess or approximate model utilization from a generic metric like CPU usage or a custom metric configuration.
>
> * **Azureml-fe does not scale the number of nodes in an AKS cluster**, because this could lead to unexpected cost increases. Instead, **it scales the number of replicas for the model** within the physical cluster boundaries. If you need to scale the number of nodes within the cluster, you can manually scale the cluster or [configure the AKS cluster autoscaler](/azure/aks/cluster-autoscaler).
Autoscaling can be controlled by setting `autoscale_target_utilization`, `autoscale_min_replicas`, and `autoscale_max_replicas` for the AKS web service. The following example demonstrates how to enable autoscaling:
```python
aks_config = AksWebservice.deploy_configuration(autoscale_enabled=True,
                                                autoscale_target_utilization=30,
                                                autoscale_min_replicas=1,
                                                autoscale_max_replicas=4)
```
Decisions to scale up or down are based on the utilization of the current container replicas: the number of replicas that are busy (processing a request) divided by the total number of current replicas is the current utilization. If this number exceeds `autoscale_target_utilization`, more replicas are created. If it is lower, replicas are removed. By default, the target utilization is 70%.
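
As a concrete sketch of that comparison (illustrative only, not the actual azureml-fe implementation; the numbers are hypothetical):

```python
# Hypothetical snapshot: 3 of 4 replicas are busy processing requests.
busy_replicas = 3
total_replicas = 4
utilization = busy_replicas / total_replicas  # 0.75

target_utilization = 0.70  # the default target
if utilization > target_utilization:
    print("Above target: scale out by adding replicas")
elif utilization < target_utilization:
    print("Below target: scale in by removing replicas")
```
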
Decisions to add replicas are eager and fast (around 1 second). Decisions to remove replicas are conservative (around 1 minute).

You can calculate the required replicas by using the following code:
```python
from math import ceil
# target requests per second
targetRps = 20
# time to process the request (in seconds)
reqTime = 10
# maximum concurrent requests per container
maxReqPerContainer = 1
# target utilization: 70% in this example
targetUtilization = .7

concurrentRequests = targetRps * reqTime / targetUtilization

# number of container replicas
replicas = ceil(concurrentRequests / maxReqPerContainer)
```
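
With the example values above, `concurrentRequests` is 20 × 10 / 0.7 ≈ 285.7, so `replicas` comes out to 286.
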
For more information on setting `autoscale_target_utilization`, `autoscale_max_replicas`, and `autoscale_min_replicas`, see the [AksWebservice](https://docs.microsoft.com/python/api/azureml-core/azureml.core.webservice.akswebservice?view=azure-ml-py) module reference.
## Deploy models to AKS using controlled rollout (preview)

@@ -219,7 +275,6 @@ endpoint.delete_version(version_name="versionb")
## Web service authentication
When deploying to Azure Kubernetes Service, __key-based__ authentication is enabled by default. You can also enable __token-based__ authentication. Token-based authentication requires clients to use an Azure Active Directory account to request an authentication token, which is used to make requests to the deployed service.
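
As a minimal sketch of enabling token-based authentication in the deployment configuration (values are illustrative; key-based auth is disabled here, since the two options can't both be enabled for an AKS deployment):

```python
from azureml.core.webservice import AksWebservice

# Enable token-based auth and disable key-based auth for the AKS deployment.
aks_config = AksWebservice.deploy_configuration(token_auth_enabled=True,
                                                auth_enabled=False)
```
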
