When deploying to Azure Kubernetes Service, you deploy to an AKS cluster that is connected to your workspace.
- If you want to deploy models to GPU nodes or FPGA nodes (or any specific SKU), then you must create a cluster with the specific SKU. There is no support for creating a secondary node pool in an existing cluster and deploying models in the secondary node pool.
## Understand the deployment processes
The word "deployment" is used in both Kubernetes and Azure Machine Learning. "Deployment" has different meanings in these two contexts. In Kubernetes, a `Deployment` is a concrete entity, specified with a declarative YAML file. A Kubernetes `Deployment` has a defined lifecycle and concrete relationships to other Kubernetes entities such as `Pods` and `ReplicaSets`. You can learn about Kubernetes from docs and videos at [What is Kubernetes?](https://aka.ms/k8slearning).
In Azure Machine Learning, "deployment" is used in the more general sense of making available and cleaning up your project resources. The steps that Azure Machine Learning considers part of deployment are:
1. Zipping the files in your project folder, ignoring those specified in .amlignore or .gitignore
1. Scaling up your compute cluster (Relates to Kubernetes)
1. Building or downloading the dockerfile to the compute node (Relates to Kubernetes)
    1. The system calculates a hash of (see the conceptual sketch after this list):
        - The base image
        - Custom docker steps (see [Deploy a model using a custom Docker base image](https://docs.microsoft.com/azure/machine-learning/how-to-deploy-custom-docker-image))
        - The conda definition YAML (see [Create & use software environments in Azure Machine Learning](https://docs.microsoft.com/azure/machine-learning/how-to-use-environments))
    1. The system uses this hash as the key in a lookup of the workspace Azure Container Registry (ACR)
    1. If it is not found, it looks for a match in the global ACR
    1. If it is not found, the system builds a new image (which will be cached and pushed to the workspace ACR)
1. Downloading your zipped project file to temporary storage on the compute node
1. Unzipping the project file
1. The compute node executing `python <entry script> <arguments>`
1. Saving logs, model files, and other files written to `./outputs` to the storage account associated with the workspace
1. Scaling down compute, including removing temporary storage (Relates to Kubernetes)
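The exact inputs and algorithm of the environment hash aren't spelled out here, but conceptually it behaves like the following sketch, in which the function name, the SHA-256 choice, and the input serialization are all illustrative assumptions rather than the service's actual implementation:

```python
import hashlib

def environment_cache_key(base_image: str, docker_steps: str, conda_yaml: str) -> str:
    # Illustrative only: fold the three environment inputs into a single digest
    # that can act as a lookup key for a previously built image.
    payload = "\n---\n".join([base_image, docker_steps, conda_yaml]).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()

# Identical environment definitions yield identical keys, so an image built earlier
# can be found in the workspace ACR (or the global ACR) instead of being rebuilt.
print(environment_cache_key("mcr.microsoft.com/azureml/base:latest", "", "name: project_environment"))
```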
### Azure ML router
The front-end component (azureml-fe) that routes incoming inference requests to deployed services automatically scales as needed. Scaling of azureml-fe is based on the AKS cluster purpose and size (number of nodes). The cluster purpose and nodes are configured when you [create or attach an AKS cluster](how-to-create-attach-kubernetes.md). There is one azureml-fe service per cluster, which may be running on multiple pods.
> [!IMPORTANT]
> When using a cluster configured as __dev-test__, the self-scaler is **disabled**.
Azureml-fe scales both up (vertically) to use more cores, and out (horizontally) to use more pods. When making the decision to scale up, the time that it takes to route incoming inference requests is used. If this time exceeds the threshold, a scale-up occurs. If the time to route incoming requests continues to exceed the threshold, a scale-out occurs.
When scaling down and in, CPU usage is used. If the CPU usage threshold is met, the front end will first be scaled down. If the CPU usage drops to the scale-in threshold, a scale-in operation happens. Scaling up and out will only occur if there are enough cluster resources available.
## Deploy to AKS
To deploy a model to Azure Kubernetes Service, create a __deployment configuration__ that describes the compute resources needed, such as the number of cores and the amount of memory. You also need an __inference configuration__, which describes the environment needed to host the model and web service. For more information on creating the inference configuration, see [How and where to deploy models](how-to-deploy-and-where.md).
> [!NOTE]
> The number of models to be deployed is limited to 1,000 models per deployment (per container).
<a id="using-the-cli"></a>
# [Python](#tab/python)
```python
from azureml.core.webservice import AksWebservice, Webservice
from azureml.core.compute import AksCompute
from azureml.core.model import Model

# Assumes ws (Workspace), model (registered Model), and inference_config
# were created earlier, as described in "How and where to deploy models".
aks_target = AksCompute(ws, "myaks")

# If deploying to a cluster configured for dev-test, make sure it was created with
# enough cores and memory to handle this deployment configuration.
deployment_config = AksWebservice.deploy_configuration(cpu_cores=1, memory_gb=1)

service = Model.deploy(ws, "myservice", [model], inference_config, deployment_config, aks_target)
service.wait_for_deployment(show_output=True)
print(service.state)
print(service.get_logs())
```

For more information on the classes, methods, and parameters used in this example, see the [AksWebservice](https://docs.microsoft.com/python/api/azureml-core/azureml.core.webservice.akswebservice?view=azure-ml-py) reference.

# [Azure CLI](#tab/azure-cli)
To deploy using the CLI, use the following command. Replace `myaks` with the name of the AKS compute target. Replace `mymodel:1` with the name and version of the registered model. Replace `myservice` with the name to give this service:
```azurecli
az ml model deploy -ct myaks -m mymodel:1 -n myservice -ic inferenceconfig.json
```
For more information, see the [az ml model deploy](https://docs.microsoft.com/cli/azure/ext/azure-cli-ml/ml/model?view=azure-cli-latest#ext-azure-cli-ml-az-ml-model-deploy) reference.
# [Visual Studio Code](#tab/visual-studio-code)
For information on using VS Code, see [deploy to AKS via the VS Code extension](tutorial-train-deploy-image-classification-model-vscode.md#deploy-the-model).
> [!IMPORTANT]
> Deploying through VS Code requires the AKS cluster to be created or attached to your workspace in advance.
---
### Autoscaling
The component that handles autoscaling for Azure ML model deployments is azureml-fe, which is a smart request router. Since all inference requests go through it, it has the necessary data to automatically scale the deployed model(s).
> [!IMPORTANT]
> * **Do not enable Kubernetes Horizontal Pod Autoscaler (HPA) for model deployments**. Doing so would cause the two auto-scaling components to compete with each other. Azureml-fe is designed to auto-scale models deployed by Azure ML, whereas HPA would have to guess or approximate model utilization from a generic metric like CPU usage or a custom metric configuration.
>
> * **Azureml-fe does not scale the number of nodes in an AKS cluster**, because this could lead to unexpected cost increases. Instead, **it scales the number of replicas for the model** within the physical cluster boundaries. If you need to scale the number of nodes within the cluster, you can manually scale the cluster or [configure the AKS cluster autoscaler](/azure/aks/cluster-autoscaler).
Autoscaling can be controlled by setting `autoscale_target_utilization`, `autoscale_min_replicas`, and `autoscale_max_replicas` for the AKS web service. The following example demonstrates how to enable autoscaling:
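This is a minimal sketch; the `autoscale_enabled` flag and the specific values shown here are illustrative, not prescribed:

```python
from azureml.core.webservice import AksWebservice

aks_config = AksWebservice.deploy_configuration(
    autoscale_enabled=True,            # turn the Azure ML autoscaler on
    autoscale_target_utilization=30,   # add replicas when >30% of replicas are busy
    autoscale_min_replicas=1,
    autoscale_max_replicas=4,
)
```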
Decisions to scale up or down are based on the utilization of the current container replicas: the number of replicas that are busy (processing a request) divided by the total number of current replicas. If this number exceeds `autoscale_target_utilization`, more replicas are created. If it's lower, replicas are reduced. By default, the target utilization is 70%.
Decisions to add replicas are eager and fast (around 1 second). Decisions to remove replicas are conservative (around 1 minute).
You can calculate the required replicas by using the following code:
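The following is a sketch of that calculation; the variable names and example values are illustrative:

```python
from math import ceil

target_rps = 20              # target requests per second for the service
req_time = 10                # seconds to process one request
max_req_per_container = 1    # concurrent requests one replica can handle
target_utilization = 0.7     # default autoscale_target_utilization of 70%

# Requests in flight at the target load, padded by the utilization target
concurrent_requests = target_rps * req_time / target_utilization

# Number of container replicas needed to sustain the target load
replicas = ceil(concurrent_requests / max_req_per_container)
print(replicas)  # 286 with these example values
```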
For more information on setting `autoscale_target_utilization`, `autoscale_max_replicas`, and `autoscale_min_replicas`, see the [AksWebservice](https://docs.microsoft.com/python/api/azureml-core/azureml.core.webservice.akswebservice?view=azure-ml-py) module reference.
## Deploy models to AKS using controlled rollout (preview)
## Web service authentication

When deploying to Azure Kubernetes Service, __key-based__ authentication is enabled by default. You can also enable __token-based__ authentication. Token-based authentication requires clients to use an Azure Active Directory account to request an authentication token, which is used to make requests to the deployed service.
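As a sketch, token-based authentication is turned on through the deployment configuration using the `token_auth_enabled` and `auth_enabled` parameters of `AksWebservice.deploy_configuration`; the resource values here are illustrative:

```python
from azureml.core.webservice import AksWebservice

# Sketch: require Azure AD tokens instead of keys for the deployed service.
deployment_config = AksWebservice.deploy_configuration(
    cpu_cores=1, memory_gb=1,
    token_auth_enabled=True,   # clients must present an Azure AD token
    auth_enabled=False,        # key-based auth is disabled when tokens are used
)
```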